## Logistic Regression

When we looked at multiple regression, we dealt with the situation in which we had a slew of independent variables that we wanted to use to predict a continuous dependent variable (DV). Sometimes, though, we want to predict a DV that has only two states, such as dead or alive, or pregnant versus not pregnant. If we coded the DV as 0 for one state and 1 for the other and threw this into a multiple regression, we'd end up in some cases with predicted values that were greater than 1 or less than 0. Statisticians (and real people, too) have difficulty interpreting answers that say that some people are better than alive or worse than dead. So they had to come up with a way to constrain the answer to be between 0 and 1, and the method was logistic regression.
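To see the "better than alive or worse than dead" problem in action, here is a minimal sketch that fits an ordinary least-squares line to a 0/1 outcome. The data are a made-up toy example (the names `xs` and `ys` are ours, not from any real study), and the fit is done by hand so nothing beyond plain Python is needed.

```python
# Hypothetical toy data: xs is some risk score, ys is 0 (alive) or 1 (dead).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = least_squares_line(xs, ys)

# Predicted "probabilities" from the straight line, one per observation.
predictions = [intercept + slope * x for x in xs]
print([round(p, 2) for p in predictions])
```

With these toy numbers, the prediction for the lowest score falls below 0 and the one for the highest score rises above 1, which is exactly the embarrassment described above.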

Let's start off by setting the scene. Several years ago, when my daughter was still at home, she taught me more about 1960s rock music than I (Geoffrey Norman) ever knew in the '60s. I grew to recognize the Steve Miller Band, Pink Floyd, Led Zeppelin, Jefferson Airplane, and the Rolling Stones. (I was okay on the Beatles and on the Mamas and the Papas, honest!) The intervening years have been unkind to rock stars. Buddy Holly, the Big Bopper, and Ritchie Valens were the first to go, in a plane crash way back then. Hendrix, Jim Morrison, Janis Joplin, and Mama Cass all went young, mostly to drugs. Elvis succumbed to a profligate excess of everything. Nasty diseases got Zappa and George Harrison, and Lennon met an untimely death outside the Dakota Hotel in Manhattan. Indeed, at one point, I commented to my daughter that all her heroes were 5 years older than me or had pegged off two decades before she was born.

Why is the risk of death so high for rock stars? Lifestyle, of course, comes to mind. Hard drugs take their toll, as do too many late nights with too much booze. Travel in small planes from one gig to the next is another risk. Celebrity has its risks. (In passing, the riskiest peacetime occupation we know is not that of lumberjack or miner—it's that of president of the United States. Four of 43 US presidents have been assassinated, and four have died in office—not to mention those who were brain-dead before entering office—for a fatality rate of 18.6%.)

Suppose we wanted to put it together and assemble a Rock Risk Ratio, or R3—a prediction of the chances of a rock star's dying prematurely, based on lifestyle factors. We assemble a long list of rock stars, present and past, dead and alive, and determine what lifestyle factors might have contributed to the demise of some of them. Some possibilities are shown in Table 8-1.

Now, if R3 were a continuous variable, we would have a situation in which we're trying to predict it from a bunch of continuous and discrete variables. As we mentioned, that's a job for multiple regression. We would just make up a regression equation to predict R3 from a combination of DRUG, BOOZE, GRAM, and CONC, like this:

$$
\text{R3} = b_0 + b_1\,\text{DRUG} + b_2\,\text{BOOZE} + b_3\,\text{GRAM} + b_4\,\text{CONC}
$$

The trouble is that death is not continuous (although aging makes it feel that way); it's just 1 or 0, dead or alive. But the equation would dump out some number that could potentially go from −∞ to +∞. We could reinterpret it as some index on some arbitrary scale of being dead or alive, but that's not so good because the starting data are still 0s and 1s, and that's what we're trying to fit. It would be nice if we could interpret these fitted numbers as probabilities, but probabilities don't go in a straight line forever; they are bounded by 0 and 1, whereas our fitted values don't have these bounds.

Table 8-1. Lifestyle Factors Contributing to the Demise of Rock Stars

| Variable | Description | Type | Values |
|---|---|---|---|
| DRUG | Hard drug use | Nominal | Yes / No |
| BOOZE | Excess alcohol use | Nominal | Yes / No |
| GRAM | Number of Grammys | Interval | 0–20 |
| CONC | Average no. of live concerts / year | Interval | 0–50 |

Maybe, with a bit of subterfuge, we could get around the problem by transforming things so that the expression for R3 ranges smoothly only between 0 and 1. One such transformation is the logistic transformation:

$$
p = \frac{1}{1 + \exp(-\text{R3})}
$$

Looks complicated, huh? (For our Canadian colleagues, "Looks complicated, eh?") Keep in mind that what we have here is the probability of dying at a given value of R3. If you remember some high school math, when R3 = 0, p is 1 / (1 + exp(0)) = 1 / (1 + 1) = 0.5. When R3 goes to infinity (∞), it becomes 1 / (1 + exp(−∞)) = 1, and when R3 goes to −∞, it becomes 1 / (1 + exp(∞)) = 0. So it describes a smooth curve that approaches 0 for large negative values of R3, goes to 1 when R3 is large and positive, and is 0.5 when R3 is 0. A graph of the resulting logistic equation is shown in Figure 8-1.

So when the risk factors together give you a large negative value of R3, your chance of dying is near zero; when they give you a large positive value, up your life insurance because the probability gets near 1.

Figure 8-1. The logistic function.
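If you'd rather check the curve's behavior with numbers than with high school math, a minimal sketch follows. It assumes the logistic form p = 1 / (1 + exp(−R3)); the function name `logistic` and the sample scores are ours, chosen only for illustration.

```python
import math


def logistic(r3):
    """Map a real-valued score onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-r3))


# A few illustrative scores, from strongly negative to strongly positive.
for r3 in (-10, -2, 0, 2, 10):
    print(r3, round(logistic(r3), 4))
```

The score of 0 gives exactly 0.5, and the values at −10 and +10 are already within a hair of 0 and 1. (This bare-bones version would overflow for hugely negative scores, such as −1000; that's a limit of the sketch, not of the idea.)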

So far, so good. But all the risk factors are now buried in the middle of a messy equation. However, with some chicanery, which we will spare you, we can make it all look a bit better:

$$
\log\left[\frac{p}{1-p}\right] = b_0 + b_1\,\text{DRUG} + b_2\,\text{BOOZE} + b_3\,\text{GRAM} + b_4\,\text{CONC}
$$

Lo and behold, the linear regression is back (except that the expression on the left looks a bit strange). But there is a subtle difference. We're computing some continuous expression from the combination of risk factors, but what we're predicting is still 1 or 0. So the error in the prediction amounts to the difference between the predicted value of log [p / (1 − p)] and 1 if the singer in question has croaked or between log [p / (1 − p)] and 0 if he or she is still alive.
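For readers who would rather not be spared, the chicanery is only a few lines of algebra. This is a sketch, assuming the logistic form p = 1 / (1 + exp(−R3)) and that R3 is the linear combination of the four risk factors:

$$
\begin{aligned}
p &= \frac{1}{1 + \exp(-\text{R3})} \\
1 - p &= \frac{\exp(-\text{R3})}{1 + \exp(-\text{R3})} \\
\frac{p}{1 - p} &= \frac{1}{\exp(-\text{R3})} = \exp(\text{R3}) \\
\log\left[\frac{p}{1 - p}\right] &= \text{R3} = b_0 + b_1\,\text{DRUG} + b_2\,\text{BOOZE} + b_3\,\text{GRAM} + b_4\,\text{CONC}
\end{aligned}
$$

Taking the log of the odds undoes the exponential, which is why the right-hand side comes out as a plain linear equation.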

The issue now is how we actually compute the coefficients. In ordinary regression, we calculate things in such a way as to minimize the summed squared error (called the method of least squares). For arcane reasons, that's a no-no here. Instead, we must use an alternative and computationally intensive method called maximum likelihood estimation (MLE). Why you use this is of interest only to real statisticians, so present company (you and we) is exempt from having to undergo explanation. Suffice it to say that at the end, we have a set of parameters, the bs, with associated significance tests, just like we did with multiple regression.
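For the curious, here is a rough sketch of what "maximizing the likelihood" looks like with a single predictor, DRUG. The counts are invented purely for illustration (they are not real data), and the search uses plain gradient ascent, which we've swapped in because it fits in a few lines; real statistical packages typically use faster Newton-type iterations. The names `rows`, `fit`, and `log_likelihood` are ours.

```python
import math

# Hypothetical counts, invented for illustration only (not real data).
# Each row is (DRUG, dead), expanded from a 2 x 2 table.
rows = ([(1, 1)] * 30 + [(1, 0)] * 10 +   # drug users: 30 dead, 10 alive
        [(0, 1)] * 10 + [(0, 0)] * 50)    # abstainers: 10 dead, 50 alive


def logistic(z):
    """Convert a linear score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1):
    """Log of the probability of seeing these 0s and 1s, given b0 and b1."""
    total = 0.0
    for drug, dead in rows:
        p = logistic(b0 + b1 * drug)
        total += math.log(p) if dead else math.log(1.0 - p)
    return total


def fit(steps=2000, rate=1.0):
    """Climb the average log-likelihood by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for drug, dead in rows:
            resid = dead - logistic(b0 + b1 * drug)  # observed minus predicted
            g0 += resid
            g1 += resid * drug
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


b0, b1 = fit()
print(round(b0, 3), round(b1, 3), round(log_likelihood(b0, b1), 2))
```

With one yes/no predictor, the maximum likelihood answer can also be worked out by hand: b0 is the log of the odds among abstainers, log(10/50), and b1 is the log of the odds ratio, log(15). The iterative fit should land on those values, which makes this toy case a handy sanity check.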

If you have the misfortune of actually doing the analysis, you might, to your amazement, find that the printout is not too weird. Just as in multiple regression, we end up with a list of coefficients and their standard errors, a statistical test for each, and a p value. However, usually, the last column is another left hook, labeled EXP(b).

A cautionary note: the next bit involves some high school algebra around exponents and logs. If you don't remember this arcane stuff, either go read about it or skip the next paragraph. Likely, you'll be no worse off at the end.

So here we go. What's happening is this: we began with something that had all the b coefficients in a linear equation inside an exponential; now we're working backwards to that original equation. Suppose the only variable that was predictive was drug use (DRUG), which has only two values, 1 (present) or 0 (absent). Now the probability of death, given drug use (call it $p_1$), means that DRUG = 1, and the equation looks like the following:

$$
\log\left[\frac{p_1}{1-p_1}\right] = b_0 + b_1
$$

And if there is no drug use, then DRUG = 0, and the formula for that probability (call it $p_0$) is:

$$
\log\left[\frac{p_0}{1-p_0}\right] = b_0
$$

If we subtract the second from the first, then we get the difference between the two logs on the left, and $b_1$, all by itself, on the right. But the difference between logs is the log of the ratio. So we end up with the following:

$$
\log\left[\frac{p_1/(1-p_1)}{p_0/(1-p_0)}\right] = b_1
$$

And as a final step, we get rid of the logs and end up with:

$$
\frac{p_1/(1-p_1)}{p_0/(1-p_0)} = e^{b_1}
$$

Now the ratio p/(1 − p) is an odds (see p. 105), so that mysterious last column, EXP(b), is just the odds ratio. So, for example, if the coefficient for DRUG was 2.5, then $e^{2.5} = 12.18$, and the odds of a rock star's dying if he did drugs are 12.18 times those of drug-free rock stars (both of them).
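The arithmetic is easy to verify. The sketch below computes EXP(b) for the coefficient of 2.5 and then confirms that the ratio of the two odds comes out the same. The intercept `b0` is a hypothetical value we picked for illustration; any other value would give the same odds ratio.

```python
import math

b_drug = 2.5                      # the example DRUG coefficient from the text
odds_ratio = math.exp(b_drug)     # what the printout labels EXP(b)
print(round(odds_ratio, 2))       # 12.18


def logistic(z):
    """Convert a log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability into an odds."""
    return p / (1.0 - p)


b0 = -1.0                         # hypothetical intercept, for illustration only
p_drug = logistic(b0 + b_drug)    # probability of death with DRUG = 1
p_clean = logistic(b0)            # probability of death with DRUG = 0
ratio = odds(p_drug) / odds(p_clean)
print(round(ratio, 2))
```

Note that the odds ratio is not a ratio of probabilities: the two probabilities here differ by much less than a factor of 12.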

Another thing that is commonly done with logistic regression is to see how well it did in predicting eventual status (which is easy if you wait long enough). All you do is call all predicted probabilities of 0.49999 or less a prediction of ALIVE, and of 0.5 or greater a prediction of DEAD, then compare the prediction to the truth in a 2 × 2 table. There are also statistical tests of goodness of fit related to the likelihood function, tests on which you can put a p value if the mood strikes you.
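The bookkeeping for that 2 × 2 table can be sketched in a few lines. The predicted probabilities and outcomes below are hypothetical numbers made up for the example, and the function name `classification_table` is ours.

```python
def classification_table(probabilities, outcomes, cutoff=0.5):
    """Cross-tabulate predicted status against true status.

    Keys are (predicted, actual) pairs; values are counts.
    """
    table = {
        ("DEAD", "DEAD"): 0,
        ("DEAD", "ALIVE"): 0,
        ("ALIVE", "DEAD"): 0,
        ("ALIVE", "ALIVE"): 0,
    }
    for p, dead in zip(probabilities, outcomes):
        predicted = "DEAD" if p >= cutoff else "ALIVE"
        actual = "DEAD" if dead else "ALIVE"
        table[(predicted, actual)] += 1
    return table


# Hypothetical fitted probabilities and true outcomes (1 = dead, 0 = alive).
probs = [0.91, 0.75, 0.50, 0.49999, 0.30, 0.62, 0.08, 0.45]
truth = [1, 1, 0, 1, 0, 1, 0, 0]

table = classification_table(probs, truth)
correct = table[("DEAD", "DEAD")] + table[("ALIVE", "ALIVE")]
accuracy = correct / len(truth)

for key, count in table.items():
    print(key, count)
print("proportion correct:", accuracy)
```

In this made-up example, six of the eight stars are classified correctly. The cutoff of 0.5 follows the rule in the text; it is passed as a parameter only so that other cutoffs can be tried.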

In summary, logistic regression analysis is a powerful extension of multiple regression for use when the dependent variable is dichotomous (0 or 1). It works by computing a logistic function from the predictor variables and then comparing the computed probabilities to the 1s and 0s.