Statistical modeling of IV - Causal Book: Design Patterns in Causal Inference

#### How to implement the design pattern using statistics:

The most common way to implement an instrumental variable estimation is arguably through regression modeling.

###### Using R

Whenever possible, we will use the _[[fixest]]_ library for the regression implementations in R.

Prior to the instrumental variable model, we fit a naïve model to estimate the effect of the per-pack cigarette price on sales. Given the panel structure of the data, we also include state and year fixed effects. This allows us to control for time-invariant differences across states and for changes in the trends over time. In other words, any inherent differences between the states in the data, as well as overall changes over time that are unrelated to the treatment of interest (taxes), are controlled for in the model.

Let's load the data and fit the model; the code and results are shown below:[^1]

```r
# Load the panel data
df <- readr::read_csv('tobacco-tax-as-iv-extended_v100.csv')

# Naive model with state and year fixed effects
fixest::etable(fixest::feols(pack_sales_per_capita ~ price_per_pack | state + year, data = df))

                   fixest::feols(pack_sale..
Dependent Var.:        pack_sales_per_capita

price_per_pack     -6.625** [-10.69; -2.565] (2.022)
Fixed-Effects:     -------------------------
as.factor(state)                         Yes
as.factor(year)                          Yes
________________   _________________________
VCOV: Clustered                   by: state)
Observations                           2,550
R2                                   0.89377
Within R2                            0.06331
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The naïve results show that a $1 increase in the price of a pack of cigarettes is associated with a reduction of 6.63 packs in annual per capita cigarette sales, on average. This is equivalent to about 7% of per capita pack sales, whether measured against the mean or the median. We cannot trust these results because of the potential endogeneity discussed in [[Design Pattern I - Instrumental Variable (IV)]], such as the presence of confounders or even reverse causation (e.g., increasing sales may lead to price reductions due to economies of scale).
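To make the endogeneity concern concrete, here is a minimal simulated sketch in base R. It is not based on the book's data: the variable `demand_shock`, the coefficients, and the true effect of -7 are all assumptions chosen for illustration. It shows how omitting a confounder that raises both price and sales can pull a naïve estimate toward zero.

```r
# Simulated illustration only: none of these numbers come from the book's data.
set.seed(42)
n <- 5000

# Hypothetical unobserved demand shock that raises both prices and sales
demand_shock <- rnorm(n)

# Price responds to the demand shock
price <- 5 + 0.8 * demand_shock + rnorm(n)

# The true causal effect of price on sales is set to -7 by construction
true_effect <- -7
sales <- 100 + true_effect * price + 10 * demand_shock + rnorm(n, sd = 5)

# Naive regression that omits the confounder
naive_fit <- lm(sales ~ price)
naive_effect <- unname(coef(naive_fit)["price"])

# Infeasible benchmark: the same regression with the confounder observed
oracle_fit <- lm(sales ~ price + demand_shock)
oracle_effect <- unname(coef(oracle_fit)["price"])

print(c(true = true_effect, naive = naive_effect, oracle = oracle_effect))
```

In this constructed example, the naïve coefficient lands well short of the true effect, while the benchmark regression that observes the confounder recovers it. With real data the confounder is typically unobserved, so the benchmark regression is not available.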
> [!NOTE]
> In the `feols` call above, we use all of the default parameters. Instead of relying on the defaults, you may want to specify some of the parameters explicitly. Two of these are important to note:
> - `vcov = 'cluster'` explicitly clusters the standard errors by state.
> - `coefstat = 'se'` (an `etable` argument) explicitly reports the clustered standard error. Use `coefstat = 'confint'` to report the confidence interval instead.
>
> In addition, the standard errors (and confidence intervals) may not match the results from other statistical software packages. This is due to the small sample correction in this model. The default is `ssc = fixest::ssc(adj = TRUE, cluster.adj = TRUE)`. You can add the parameter `ssc = fixest::ssc(adj = FALSE, cluster.adj = FALSE)` to disable:
> - the adjustment for degrees of freedom, which is applied by default when fixed effects are present (`adj = FALSE`).
> - the cluster adjustment that is applied by default to the sandwich estimator of the variance-covariance matrix (`cluster.adj = FALSE`).
>
> For details on standard error differences, see [[Oh my! Different standard errors everywhere]].

Next, we turn to our instrumental variable to isolate the effect from endogeneity. The following call regresses cigarette sales on the price per pack of cigarettes, using the state tax per pack as the instrumental variable (as discussed in [[Data and conceptual model]]). We continue to control for state and year fixed effects to eliminate any bias due to inherent differences across states or overall time trends.
```r
# IV model: price_per_pack is instrumented by state_tax_per_pack
fixest::etable(fixest::feols(pack_sales_per_capita ~ 1 | state + year | price_per_pack ~ state_tax_per_pack, data = df))
```

The revised results are as follows:

```r
Dependent Var.:         pack_sales_per_capita
Instrumental Var.:         state_tax_per_pack

price_per_pack     -7.815*** [-11.87; -3.760] (2.019)
Fixed-Effects:     --------------------------
as.factor(state)                          Yes
as.factor(year)                           Yes
________________   __________________________
VCOV: Clustered                    by: state)
Observations                            2,550
R2                                    0.89353
Within R2                             0.06127
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# IN MORE DETAIL:
TSLS estimation - Dep. Var.: pack_sales_per_capita
                  Endo.    : price_per_pack
                  Instr.   : state_tax_per_pack
Second stage: Dep. Var.: pack_sales_per_capita
Observations: 2,550
Fixed-effects: as.factor(state): 51, as.factor(year): 50
Standard-errors: Clustered (as.factor(state))
                   Estimate Std. Error  t value   Pr(>|t|)
fit_price_per_pack -7.81482    2.01862 -3.87136 0.00031467 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 13.6   Adj. R2: 0.889188
           Within R2: 0.061269
F-test (1st stage), price_per_pack: stat = 15,983.2, p < 2.2e-16 , on 1 and 2,499 DoF.
                        Wu-Hausman: stat =     34.6, p = 4.628e-9, on 1 and 2,448 DoF.
```

A $1 increase in the price of a pack of cigarettes reduces annual per capita cigarette sales by 7.82 packs on average, an estimate about 18% larger in magnitude than the naïve one. This corresponds to a decrease in sales of about 8%, a larger effect than in the naïve model. So what is the *truer* effect size here? Based on our experience and the literature, an inflated effect size after instrumenting the model is not uncommon. Read [[Oh my! Larger effect size in the IV model]] for details.

What is the within R-squared? Why is it so small? Could it be negative? Yes, it could! Again, this is not unexpected, and the reason is surprisingly simple. See [[Oh my! Within R-squared in the IV model]] for details.
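The mechanics behind an IV estimate, and the reason an R-squared can turn negative, can be sketched by hand on simulated data. This is an illustration under stated assumptions, not the book's model: the data are simulated, there are no fixed effects, the names `tax`, `price`, `sales`, and `demand_shock` are hypothetical, and the R-squared computed here is the plain one rather than the within R-squared reported by fixest.

```r
# Simulated illustration only: variable names and coefficients are hypothetical.
set.seed(123)
n <- 5000

tax <- runif(n, 0, 3)          # instrument: shifts price, no direct effect on sales
demand_shock <- rnorm(n)       # unobserved confounder
price <- 4 + 1.0 * tax + 0.8 * demand_shock + rnorm(n, sd = 0.5)
sales <- 100 - 7 * price + 10 * demand_shock + rnorm(n, sd = 5)

# Stage 1: project the endogenous regressor on the instrument
stage1 <- lm(price ~ tax)
price_hat <- fitted(stage1)

# Stage 2: regress the outcome on the fitted values from stage 1
stage2 <- lm(sales ~ price_hat)
iv_intercept <- unname(coef(stage2)["(Intercept)"])
iv_effect <- unname(coef(stage2)["price_hat"])

# Equivalent ratio form with a single instrument: cov(z, y) / cov(z, x)
wald_effect <- cov(tax, sales) / cov(tax, price)

# IV residuals are built from the observed price, not the fitted price
iv_resid <- sales - (iv_intercept + iv_effect * price)
iv_r2 <- 1 - sum(iv_resid^2) / sum((sales - mean(sales))^2)

# For comparison, the R-squared of the naive OLS fit
ols_r2 <- summary(lm(sales ~ price))$r.squared

print(c(iv_effect = iv_effect, wald_effect = wald_effect,
        iv_r2 = iv_r2, ols_r2 = ols_r2))
```

The second-stage standard errors from `lm` are not valid because they ignore that `price_hat` is itself estimated, so in practice the one-step `feols` call is the one to use. In this simulated setup the IV R-squared comes out negative: the IV residuals are computed with the observed price, and nothing forces their sum of squares below the total sum of squares.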
*Side note: [[Clustered standard errors|Why cluster the standard errors?]]*

Now, click [[Replication of IV in Python|here]] to implement the same model in Python!

[^1]: All data files used in this book can be found in the book's [GitHub repository](https://github.com/causalbook/).

> [!info]- Last updated: January 27, 2025