
The following example is based on:

Pedhazur, E. J. (1997). *Multiple regression in behavioral research: Explanation and prediction* (3rd ed.). Wadsworth.

From the chapter: “Categorical Dependent Variable: Logistic Regression” (pp. 714-725).

The toy example here involves the gender difference in admission to an engineering program. We create the data in R, entering it as grouped data:

```r
yes <- c(7, 3)
no <- c(3, 7)
log.ex <- rbind(yes, no)
dimnames(log.ex) <- list("Admit" = c("Yes", "No"),
                         "Gender" = c("Male", "Female"))
# Display the table
log.ex
```

The table (the columns are the levels of **Gender**):

| Admit | Male | Female |
|---|---|---|
| Yes | 7 | 3 |
| No | 3 | 7 |

*where*

| Y | X = 1 | X = 0 |
|---|---|---|
| 1 | a | b |
| 0 | c | d |

Pedhazur shows how to calculate the odds ratio using two methods. First, using the table above:

$OR = \dfrac{\dfrac{a}{c}}{\dfrac{b}{d}} = \dfrac{ad}{bc} = \dfrac{(7)(7)}{(3)(3)} = \dfrac{49}{9} = 5.44$
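
The cross-product ratio can be computed directly from the `log.ex` matrix created above:

```r
# (a * d) / (b * c) = (7 * 7) / (3 * 3)
(log.ex["Yes", "Male"] * log.ex["No", "Female"]) /
  (log.ex["Yes", "Female"] * log.ex["No", "Male"])
[1] 5.444444
```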

Second, using the probabilities, where the odds for each group are

$odds = \dfrac{P}{1 - P}$ (or $\dfrac{P}{Q}$)

then:

$odds(M) = \dfrac{.7}{.3} = 2.333$; $odds(F) = \dfrac{.3}{.7} = 0.429$; and

$OR = \dfrac{odds(M)}{odds(F)} = \dfrac{2.333}{.429} = 5.44$
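
The same arithmetic in R (the `odds.*` names are just for illustration):

```r
odds.m <- .7 / .3  # odds(M) = 2.333
odds.f <- .3 / .7  # odds(F) = 0.429
odds.m / odds.f    # OR = 5.444
```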

Logistic regression provides the parameters for the model. When collecting data, it may be natural to enter the data by individual observations (i.e., ungrouped). Because the data here come from a textbook, they were entered as grouped data, which results in a matrix:

```r
class(log.ex)
[1] "matrix"
log.ex
     Gender
Admit Male Female
  Yes    7      3
  No     3      7
```

Based on a tip (http://stats.stackexchange.com/questions/27400/logistic-regression-grouped-and-ungrouped-variables-using-r), the following code works on grouped data (note the order of the Admit coding (1,0), which matches the order in the *log.ex* table):

```r
fit.1 <- glm(as.table(log.ex) ~ c(1,0), family = binomial)
fit.1

Call:  glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)

Coefficients:
(Intercept)      c(1, 0)
    -0.8473       1.6946

Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
Null Deviance:      3.291
Residual Deviance: 8.882e-16    AIC: 9.285
```

A summary of the model:

```r
summary(fit.1)

Call:
glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)

Deviance Residuals:
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8473     0.6901  -1.228   0.2195
c(1, 0)       1.6946     0.9759   1.736   0.0825 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.2913e+00  on 1  degrees of freedom
Residual deviance: 8.8818e-16  on 0  degrees of freedom
AIC: 9.2846

Number of Fisher Scoring iterations: 3
```

Exponentiate the coefficients:

```r
exp(coef(fit.1))
(Intercept)     c(1, 0)
  0.4285714   5.4444444
```

Now we have the parameters for the odds ratio (5.44) we calculated above: the exponentiated coefficient for `c(1, 0)` is the odds ratio, and the exponentiated intercept (0.429) is the odds for females. The interpretation of this model is: “the odds of being admitted to the program, rather than being denied, are about 5.44 times greater (more favorable) for males than they are for females” (Pedhazur, p. 718).
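
An aside, not from Pedhazur: the same estimates can be obtained from the more conventional grouped-data form for `glm`, in which the response is a two-column matrix of successes and failures and gender is the predictor (the names `gender` and `fit.alt` are introduced here for illustration):

```r
# Grouped-data form: column 1 = admitted ("Yes"), column 2 = denied ("No")
gender <- c(1, 0)  # Male = 1, Female = 0, matching the columns of log.ex
fit.alt <- glm(cbind(log.ex["Yes", ], log.ex["No", ]) ~ gender,
               family = binomial)
coef(fit.alt)  # same coefficients as fit.1: -0.8473 and 1.6946
```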

On p. 729, Pedhazur describes a set of data where “the independent variable consists of more than two categories.” He presents a table on page 730 that looks like this:

(The three treatment columns are levels of the **Training** variable.)

| Dependent Variable | Treatment 1 | Treatment 2 | Control | Totals |
|---|---|---|---|---|
| Success | 30 | 40 | 10 | 80 |
| Failure | 20 | 10 | 40 | 70 |
| Totals | 50 | 50 | 50 | 150 |

Let's call the dependent variable the **outcome** variable and the **training** variable the **treatment** variable. The cell counts are frequencies, and so we create the data in **R** like so:

```r
treatment <- c(rep(1, 50), rep(2, 50), rep(3, 50))
outcome <- c(rep(1, 30), rep(0, 20),
             rep(1, 40), rep(0, 10),
             rep(1, 10), rep(0, 40))
test.set <- as.data.frame(cbind(outcome, treatment))
```

Next we make sure our **treatment** variable is a factor and is properly leveled, such that the **Control** level is the reference level:

```r
test.set$treatment <- factor(test.set$treatment,
                             levels = c(1, 2, 3),
                             labels = c("T1", "T2", "Control"))
test.set$treatment <- relevel(test.set$treatment, ref = "Control")
```

Here's a glimpse of some of the rows in the data frame (e.g. `test.set[51:55, ]`):

| | outcome | treatment |
|---|---|---|
| 1 | 1 | T1 |
| 2 | 1 | T1 |
| … | … | … |
| 51 | 1 | T2 |
| 52 | 1 | T2 |
| … | … | … |
| 101 | 1 | Control |
| 102 | 1 | Control |
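
As a quick check, a cross-tabulation confirms the cell counts match Pedhazur's table:

```r
# Rows are outcome (0 = Failure, 1 = Success); columns are treatment levels
table(test.set$outcome, test.set$treatment)
```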

We regress **outcome** $Y$ on **treatment** $X$ and take a look at the model and its parameters:

```r
fit.1 <- glm(outcome ~ treatment, data = test.set, family = "binomial")
summary(fit.1)
```

The model (below) corresponds to Table 17.1 (p. 731) in Pedhazur, but neither `summary` nor `anova` provides all the needed information. Pedhazur also reports the Wald chi-square test and the degrees of freedom for the coefficients in the model. To do a Wald test, we need the `aod` package, and then we run the Wald test on each coefficient and on all the levels of the treatment variable (see http://stats.idre.ucla.edu/r/dae/logit-regression/ for more details):

library("aod") wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 1) # test constant wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2) # test treatmentT1 wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 3) # test treatmentT2 wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2:3) # test treatment entirely

Pedhazur also reports the exponentiated coefficients for the treatment levels. I'll add the exponentiated confidence intervals too, at the 95% level:

```r
round(exp(cbind(OR = coef(fit.1), confint(fit.1))), 3)
```

We can now reconstruct Pedhazur's table. **R** rounds differently, but the results are the same:

| Variable | Estimate B | S.E. | Wald | df | Sig | Exp(B) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|---|---|
| Intercept | -1.3863 | 0.3536 | 15.4 | 1 | .0000 | | | |
| treatmentT1 | 1.7918 | 0.4564 | 15.4 | 1 | .0000 | 6.00 | 2.523 | 15.260 |
| treatmentT2 | 2.7726 | 0.5000 | 30.7 | 1 | .0000 | 16.00 | 6.268 | 44.988 |
| treatment | | | 31.9 | 2 | .0000 | | | |
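
Incidentally, the single-df Wald statistics in the table are just the squared z values from `summary(fit.1)`, so they can also be computed directly:

```r
# Wald chi-square = (estimate / standard error)^2 for each coefficient
(coef(fit.1) / sqrt(diag(vcov(fit.1))))^2
```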

The `anova` function, as well as the `summary` function, provides the log likelihoods in the form of the residual deviances, and the chi-square improvement in the form of the deviance for the **treatment** variable.

```r
anova(fit.1)
```

| | Df | Deviance | Resid. Df | Resid. Dev |
|---|---|---|---|---|
| NULL | | | 149 | 207.28 |
| treatment | 2 | 39.895 | 147 | 167.38 |
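
Alternatively, passing `test = "Chisq"` to `anova` appends the p-value for the chi-square improvement:

```r
# Likelihood ratio (deviance) test for the treatment variable
anova(fit.1, test = "Chisq")
```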

Field, Miles, and Field (2012, p. 332) provide another way of reporting this. Adding the **treatment** variable reduces the deviance, and that reduction is the chi-square statistic; if it is significant, we can reject the null hypothesis that the model is no better than chance at predicting the outcome. We can compute this with the following code:

```r
# The reduction in deviance: the chi-square statistic
fit.chi <- fit.1$null.deviance - fit.1$deviance
# The degrees of freedom for the chi-square statistic
chi.df <- fit.1$df.null - fit.1$df.residual
# The probability associated with the chi-square statistic
chisq.prob <- 1 - pchisq(fit.chi, chi.df)
# Display the results
fit.chi; chi.df; chisq.prob
```

Note: Europe is the baseline/reference group and Low is the baseline/reference mean score. The objective is to make sense of the odds for any region compared to the odds for Europe, since that is how R's output should be interpreted. Given the following table:

(The region columns are levels of **first_auth_geog**.)

| mean.score | Europe | Africa | Asia | Latin Amer | North Amer | Oceania | U.K. |
|---|---|---|---|---|---|---|---|
| Low | 220 | 4 | 43 | 23 | 217 | 65 | 65 |
| Middle | 260 | 11 | 64 | 30 | 221 | 68 | 43 |
| High | 124 | 5 | 44 | 23 | 79 | 30 | 20 |

The odds are calculated by: Low / (Middle + High). For Europe, that means:

$Odds = \dfrac{220}{260+124} = 0.573$

And for Asia, that means:

$Odds = \dfrac{43}{64+44} = 0.398$

Thus, the odds ratio is:

$OR = \dfrac{0.573}{0.398} = 1.439$
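
The same calculation in R, using the counts from the table (the `odds.*` names are just for illustration):

```r
odds.europe <- 220 / (260 + 124)  # 0.573
odds.asia <- 43 / (64 + 44)       # 0.398
odds.europe / odds.asia           # OR = 1.439
```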

Interpretation: **first_auth_geog** in Europe are 1.439x more likely to score low than middle or high compared to **first_auth_geog** in Asia.

It works out a little differently when North America is considered the treatment group:

$Odds = \dfrac{217}{221+79} = 0.723$

Thus, this odds ratio is:

$OR = \dfrac{0.573}{0.723} = 0.792$
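
Checked in R, with the same naming convention as above:

```r
odds.europe <- 220 / (260 + 124)   # 0.573, as above
odds.northam <- 217 / (221 + 79)   # 0.723
odds.europe / odds.northam         # OR = 0.792
```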

Interpretation: **first_auth_geog** in Europe are 0.792x more likely to score low than middle or high compared to **first_auth_geog** in North America. This, of course, means **first_auth_geog** in Europe are less likely because $OR < 1.00$.

The information is derived from the `ftable` command and also compared to the output of the ordered logistic regression:

```r
library(MASS)  # for polr()
ftable(xtabs(~ mean.rs + first_auth_geog, data = dec_sent))
m <- polr(mean.rs ~ first_auth_geog, data = dec_sent, Hess = TRUE)
ci <- confint(m)  # profiled confidence intervals for the coefficients
round(exp(cbind(OR = coef(m), ci)), 3)
```

Field, A., Miles, J., & Field, Z. (2012). *Discovering statistics using R*. Los Angeles: Sage.
