In the naïve model shown in [[Statistical modeling of IV|Chapter 1.2.2]] and [[(Double) Machine learning using IV|Chapter 1.2.3]], 89% of the variation in cigarette sales is explained by the cigarette prices **combined with the fixed effects**. The within R-squared, on the other hand, represents the proportion of variance explained by the model *after* accounting for the fixed effects. So, 6% of the variation in sales that remains once the fixed effects are absorbed is explained by the cigarette prices. Note that this is not an additive comparison: it is **not** that 6 of the 89 percentage points are explained by the cigarette prices and the rest by the fixed effects. Instead, it is a multiplicative comparison, a ratio of residual variations. The (C++) code used by the library for the calculation is as follows:

```cpp
res[i] = 1 - cpp_ssq(resid(x)) / ssr_fe_only
```

This code translates to the following formula:

$$wR^2 = 1 - \frac{SSR_{model}}{SSR_{fe\_only}}$$

In other words, 6% is the amount of variation explained by the model including the cigarette prices, relative to the variation left *unexplained* by the (state and year) fixed effects alone. This puts our predictor in a tough spot, but it is not surprising given how much of the prices can be explained by the location and time pair alone (see [[Predicting retail sales using time alone]] (Coming soon!) for an example).

Now that we have clarified the R-squared calculations in the naïve model, let's turn to the IV model. The IV model reports the same adjusted and within R-squared pair. What is going on here? In an IV model, these numbers require caution. The adjusted R-squared is roughly in the range of the naïve model's, though somewhat lower. In an ideal world, this could be explained by the removal of endogenous variation: the IV strips out the endogenous variation in the predictor/explanatory variable that may have artificially inflated the R-squared in the naïve model in the first place.
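To make the ratio concrete, here is a minimal sketch of the within R-squared computation on synthetic data, absorbing a single fixed effect by demeaning within groups. All numbers and variable names here are hypothetical, and the library itself uses a more efficient C++ routine; this only mirrors the formula above.

```python
import numpy as np

# Hypothetical tiny panel: two groups (e.g. two states), one predictor.
g = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.0, 5.2, 7.1, 8.8])

def demean(v, g):
    """Subtract the group mean within each group (absorbs a one-way fixed effect)."""
    out = v.astype(float).copy()
    for k in np.unique(g):
        out[g == k] -= v[g == k].mean()
    return out

yd, xd = demean(y, g), demean(x, g)

# Fixed-effects-only model: its residuals are just the demeaned outcome.
ssr_fe_only = (yd ** 2).sum()

# Full model: OLS slope of demeaned y on demeaned x.
beta = (xd @ yd) / (xd @ xd)
ssr_model = ((yd - beta * xd) ** 2).sum()

within_r2 = 1 - ssr_model / ssr_fe_only
```

Because the full model can only reduce the fixed-effects-only residual sum of squares, `within_r2` lands between 0 and 1 here, measuring how much of the leftover (within) variation the predictor recovers.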
This is a plausible overall explanation, but there is potentially more to it, and the negative *within* R-squared underscores the caution needed here. Remember that the IV model estimates its coefficients using the predictions from the first-stage model, not the actual observations. This is why $R^2$ in an IV model should not be read into too much (if not ignored entirely!). Here is why:

$$\text{Stage 1: } X = Z\alpha + \text{error}$$

$$\text{Stage 2: } y = \hat{X}\beta + \text{error}$$

Since $X \neq \hat{X}$, the error that is minimized in the second stage (computed against $\hat{X}$) is not the same as the error used to compute the reported residual sum of squares (computed against the actual $X$). The estimated $\beta$ therefore no longer minimizes that residual sum of squares, which in turn no longer has to be smaller than the total sum of squares:

$$R^2 = 1 - \frac{RSS}{TSS}$$

This means that $R^2$ can be negative, as it is in the example, and the meaning of R-squared is inherently questionable in IV models.

> [!info]- Last updated: January 27, 2025
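The failure mode can be reproduced with a few lines of arithmetic. The sketch below uses deliberately contrived synthetic numbers and a hypothetical (and deliberately poor) instrument: it runs a just-identified IV estimate by hand on demeaned data and then computes $R^2$ against the actual $X$. Because the IV coefficient does not minimize that residual sum of squares, $RSS$ exceeds $TSS$ and $R^2$ goes negative.

```python
import numpy as np

# Contrived toy data: the instrument z is chosen so the IV slope lands far
# from the OLS slope, which is all it takes to push R^2 below zero.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 10.0])
z = np.array([1.25, -1.75, 0.25, 0.25])  # hypothetical instrument

# Demean everything (equivalent to including an intercept).
xd, yd, zd = x - x.mean(), y - y.mean(), z - z.mean()

beta_ols = (xd @ yd) / (xd @ xd)  # minimizes the RSS below by construction
beta_iv = (zd @ yd) / (zd @ xd)   # just-identified IV slope: Cov(z,y)/Cov(z,x)

tss = yd @ yd
# Both residual sums of squares are computed against the ACTUAL x, as the
# reported R^2 is, not against the first-stage fitted values.
rss_ols = ((yd - beta_ols * xd) ** 2).sum()
rss_iv = ((yd - beta_iv * xd) ** 2).sum()

r2_ols = 1 - rss_ols / tss  # in [0, 1]: OLS cannot do worse than the mean
r2_iv = 1 - rss_iv / tss    # negative here: rss_iv > tss
```

The same data yields a positive $R^2$ under OLS and a negative one under IV, purely because the second-stage coefficient is no longer the one that minimizes the residuals used in the $R^2$ formula.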