In social science research, control variables are often included out of concern that omitting them would bias the coefficients of interest [1, 2]. However, short of knowing the true data-generating process, which is an unlikely situation, the inclusion of even relevant controls may in fact aggravate the problem. This is shown for the case of linear (OLS) and logit (GLM) models, where the true model includes three covariates. The first misspecified model omits the second and third covariates, and the second misspecified model omits only the third covariate. According to the logic of including controls, the bias on the expected value of the coefficient for the first covariate should always be larger in the first misspecified model, unless the covariates are uncorrelated. That exception does not carry over to many GLM link functions, where coefficients may be biased even if included and excluded covariates are uncorrelated [3, 4].

At the red contour line, no difference in bias exists between the first and second misspecified models. In regions where dashed contour lines indicate positive values, the inclusion of controls would indeed reduce bias; the lighter the region, the larger the reduction. In regions where solid contour lines indicate negative values, however, the inclusion of controls would induce bias; the darker the region, the larger the induced bias. Hover the mouse over a contour line to see the tooltip. For exact identification of coordinates, drag the cross-hairs locator to the desired position. The notation follows [1].
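As a rough numerical illustration of the OLS comparison, the sketch below computes the large-sample bias on the first coefficient in both misspecified models. It assumes standardized covariates, so that a correlation matrix fully describes them, and it measures the benefit of the control as the difference in absolute bias. Both choices, along with the function names `ols_bias_on_first` and `bias_reduction` and the example numbers, are assumptions made for illustration and are not the notation of [1].

```python
import numpy as np


def ols_bias_on_first(beta, corr, included):
    """Large-sample bias of the OLS coefficient on covariate 0.

    beta     : true coefficients of the three covariates
    corr     : 3x3 correlation matrix of the (standardized) covariates
    included : indices of the covariates kept in the fitted model
    """
    beta = np.asarray(beta, dtype=float)
    corr = np.asarray(corr, dtype=float)
    idx = list(included)
    # Population coefficients of the short regression:
    # solve R_SS b = R_S. beta, where S is the set of included covariates.
    short = np.linalg.solve(corr[np.ix_(idx, idx)], corr[idx, :] @ beta)
    return short[idx.index(0)] - beta[0]


def bias_reduction(beta, corr):
    """Absolute bias omitting covariates 2 and 3, minus absolute bias
    omitting only covariate 3. Positive: the control helps.
    Negative: the control makes the bias worse."""
    bias_m1 = ols_bias_on_first(beta, corr, [0])
    bias_m2 = ols_bias_on_first(beta, corr, [0, 1])
    return abs(bias_m1) - abs(bias_m2)


# Illustrative (hypothetical) correlations: the first covariate correlates
# 0.5 with each of the others, and the other two are uncorrelated.
corr = np.array([[1.0, 0.5, 0.5],
                 [0.5, 1.0, 0.0],
                 [0.5, 0.0, 1.0]])

# Omitted effects of opposite sign: the control induces bias.
print(bias_reduction([1.0, 1.0, -1.0], corr))

# Omitted effects of the same sign: the control reduces bias.
print(bias_reduction([1.0, 1.0, 1.0], corr))
```

With these illustrative values, omitting both controls leaves the first coefficient unbiased in the first case because the two omitted effects cancel, whereas adding the second covariate produces a bias of about -0.67. When all three coefficients equal 1, adding the control reduces the absolute bias from 1 to about 0.67.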
- Contributed by: Alrik Thiem
- After work by: Kevin A. Clarke, University of Rochester (USA)
For the case of OLS, let  be the true model,  be the first misspecified model, and  be the second misspecified model.  :  , and  :  . If for  we have  , then the bias  of the expected values of  for  and  for  are given by (1)  , (2)  . According to the logic of including controls in order to reduce bias, the following weak inequality should always hold.

For the case of GLM, as before let  be the true model, let  be the first misspecified model, and let  be the second misspecified model.  :  , and  :  . The normalized values of  and  are given by [1] as  and  . According to the logic of including controls in order to reduce bias, the following weak inequality should always hold.  .

[1] K. A. Clarke, "Return of the Phantom Menace: Omitted Variable Bias in Political Research," Conflict Management and Peace Science, 26(1), 2009 pp. 46–66.
[2] K. A. Clarke, "The Phantom Menace: Omitted Variable Bias in Econometric Research," Conflict Management and Peace Science, 22(4), 2005 pp. 341–352.
[3] M. H. Gail, S. Wieand, and S. Piantadosi, "Biased Estimates of Treatment Effect in Randomized Experiments with Nonlinear Regressions and Omitted Covariates," Biometrika, 71(3), 1984 pp. 431–444.
[4] J. S. Cramer, Logit Models from Economics and Other Fields, Cambridge: Cambridge University Press, 2003.