(Double) Machine learning using IV - Causal Book: Design Patterns in Causal Inference

#### How to implement the design pattern using machine learning: One implementation of the recent (and powerful) double machine learning (Double ML) uses instrumental variables. ###### Using Python We will be using the wonderful *[[DoubleML]]* library in Python. This is an object-oriented implementation of double machine learning with a wide range of model specifications. As we did in [[Statistical modeling of IV|Chapter 1.2.2]], we first fit a naïve model to seek the effect of the health bill on restaurant prices (CPI for eating out). Given the panel structure of the data, we continue to include state and month fixed effects (to control for time-invariant differences across states and changes in the trends over time). In addition, we fit an IV model using the same statistical method but using Python. The code and results for these replication models are in [[Replication of IV in Python|Chapter 1.2.2.1]] and confirm the results from the *fixest* library in the R implementation. Now to the Double ML: The implementation in this chapter and the reported results are for partially linear IV regression.[^1] Partially linear IV regression models take the following form: $Y - T\theta_0 = g_0(X) + \zeta, \quad \mathbb{E}(\zeta|Z, X) = 0,$$Z = m_0(X) + V, \quad \mathbb{E}(V|X) = 0,$ where Y is the outcome variable, T is the treatment variable and Z is one or more instrumental variables. $X = (X₁, ..., Xₚ)$ denotes the covariates, and ζ and V denote the errors at each stage. The main difference between the 2SLS model in [[Statistical modeling of IV|Chapter 1.2.2]] and the Double ML model here is that, as shown in the two steps above, the nonparametric functions $g₀(X)$ and $m₀(X)$ allow for flexible relationships between the covariates X and the outcome Y and the instrumental variable Z. In contrast, 2SLS assumes a fully linear relationship. We will use the `DoubleMLPLIV()` function to implement the partial linear IV regression. This requires some thought and planning because we have not only an instrumental variable, but also state and year fixed effects. The two-stage solution implemented in DoubleML complicates the solution if we want to replicate the model built in [[Statistical modeling of IV]]. The complication is due to the fact that we do not even estimate state and year effects, but they are controlled as fixed effects in our original model. They can be treated as covariates, but with some delicate attention, since we also have an instrumental variable in the model. Let's start with a model where we do not cluster the standard variables. As discussed in [[Clustered standard errors]], the way clustering of standard errors is applied differs between languages and packages, so we will keep it as simple as possible and assume [[Independent and identically distributed random variables|i.i.d.]] for now. We will come back to clustering later. The following `R/fixest` model estimates our IV model, but without clustering: ```r fixest::etable(fixest::feols(pack_sales_per_capita ~ 1 | state + year | price_per_pack ~ state_tax_per_pack, data = df, vcov = "iid")) fixest::feols(pack_.. Dependent Var.: pack_sales_per_capita price_per_pack -7.815*** (0.5544) Fixed-Effects: --------------------- state Yes year Yes _______________ _____________________ S.E. type IID Observations 2,550 R2 0.89353 Within R2 0.06127 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` In DoubleML, we will be using the `DoubleMLPLIV` function which takes an instrumental variable. We'll keep assuming [[Independent and identically distributed random variables|i.i.d.]] for now. `Python/DoubleML` with the instrumental variable added: ```python # Starting with the code above until here # Demean the variables by state df['price_per_pack_demeaned'] = df['price_per_pack'] - df.groupby('state')['price_per_pack'].transform('mean') df['pack_sales_per_capita_demeaned'] = df['pack_sales_per_capita'] - df.groupby('state')['pack_sales_per_capita'].transform('mean') df['state_tax_per_pack_demeaned'] = df['state_tax_per_pack'] - df.groupby('state')['state_tax_per_pack'].transform('mean') # Create DoubleMLClusterData object with demeaned variables and year_factor as a control year_dummies = [col for col in df.columns if col.startswith('year_')] dml_data = DoubleMLData(df, y_col = 'pack_sales_per_capita_demeaned', d_cols = 'price_per_pack_demeaned', z_cols = 'state_tax_per_pack_demeaned', # Added the instrumental variable x_cols = year_dummies) # Set up the DoubleMLPLIV model with the specified regressors np.random.seed(314159) model = DoubleMLPLIV(dml_data, ml_l = LinearRegression(), ml_m = LinearRegression(), ml_r = LinearRegression()) # Fit the model and print the results model.fit() print(model.summary) DoubleMLPLIV Model Summary ========================== Outcome Variable: pack_sales_per_capita_demeaned Treatment Variable: price_per_pack_demeaned Instrumental Variable: state_tax_per_pack_demeaned Coefficient: -7.778535 Standard Error: 0.527599 t-statistic: -14.743263 p-value: 3.399562e-49 Confidence Interval: [-8.812611, -6.744459] ``` So far, so good. Our results are identical to the first decimal (the effect size is `-7.8` in both models with very close standard errors). This is because we used OLS to define all models (outcome, predictor, and treatment models) in the DoubleML model. We can assume that our DoubleML model also follows the [[Independent and identically distributed random variables|i.i.d.]] assumption. The slight variation in the standard error can be attributable to the incorrect [[Degrees of freedom|degrees of freedom]] used in the standard error calculation of the model with the demeaned variables, because the calculation does not take into account the degrees of freedom used in the demeaning process. Next, we add clustering. Remember that we expect clustering to change only the standard errors in our original estimation (while this won't be the case in DoubleML because of the reason discussed below). The results of the model with both the instrumental variable and the clustering of the standard errors in `R/fixest` are already reported in [[Statistical modeling of IV]]. As a reminder, the results are copied below: ```r price_per_pack -7.815*** [-11.87; -3.760] (2.019) ``` The effect size is identical to the model with the [[Independent and identically distributed random variables|i.i.d.]] assumption above, but the standard error is adjusted. Now to DoubleML. We will use `DoubleMLClusterData` instead of `DoubleMLData` and add the `cluster_cols='state'` to the `dml_data` object: ```python # Starting with the code above until here # Create DoubleMLClusterData object with demeaned variables and year_factor as a control year_dummies = [col for col in df.columns if col.startswith('year_')] dml_data = DoubleMLClusterData(df, y_col = 'pack_sales_per_capita_demeaned', d_cols = 'price_per_pack_demeaned', z_cols = 'state_tax_per_pack_demeaned', # Instrumental variable also demeaned x_cols = year_dummies, cluster_cols = 'state') # Clustering by state # Set up the DoubleMLPLIV model with the specified regressors np.random.seed(314159) model = DoubleMLPLIV(dml_data, ml_l = LinearRegression(), ml_m = LinearRegression(), ml_r = LinearRegression()) # Fit the model and print the results model.fit() print(model.summary) DoubleMLPLIV Model Summary ========================== Outcome Variable: pack_sales_per_capita_demeaned Treatment Variable: price_per_pack_demeaned Instrumental Variable: state_tax_per_pack_demeaned Coefficient: -8.225588 Standard Error: 2.002086 t-statistic: -4.108508 p-value: 0.00004 Confidence Interval: [-12.149605, -4.301571] ``` Now the standard error is almost identical to the results of the model in [[Statistical modeling of IV|Chapter 1.2.2]] and its replication in [[Replication of IV in Python|Chapter 1.2.2.1]]. The power of the IV design pattern using Double ML is its flexibility, which we have completely ignored so far. We will discuss that next. Let's first focus on what clustering really means here and why the effect sizes would change in addition to the standard error in the DoubleML model. DoubleML uses cross fitting. So, clustering in DoubleML has as much to do with the n-fold cross validation as an adjustment to the variance covariance matrix and resulting standard error. With the cluster robust option, sample splitting for cross-fitting takes the clusters into account. Instead of splitting observations into folds, clusters are split into folds. This way, observations from the same cluster are not used in the sample for estimating the covariate function and at the same time in the sample for solving the outcome function. Note that this is a slightly different objective than our initial goal of clustering because DoubleML has two challenges when it comes to clustering as explained in its documentation [here](https://docs.doubleml.org/stable/examples/py_double_ml_multiway_cluster.html): 1. a necessary adjustment of the formulae used for estimation of the variance covariance matrix, standard errors, p-values etc., and, 2. an adjusted resampling scheme for the cross-fitting algorithm. So far, we have only focused on #1 because we have not used a method with cross-fitting. The sample splitting procedure we use here ensures that the hold out samples do not contain observations from the same clusters as those used for training. This changes the estimated effect size after clustering. But it is also reflected as an adjustment to the standard errors, so that they are the same as the result of `R/fixest`. > [!note] > If you want to explore the impact of cross validation on the results of a DoubleML model, the easiest way to do so is to downgrade the package to version `0.5.0` (first uninstall the current version on your computer and run `pip install doubleml==0.5.0` or `conda install doubleml==0.5.0`). Then you can either change `n_folds = X` (5 by default) or set `apply_cross_fitting=False` to disable cross fitting. You'll notice that the results will only slightly change for this model. `n_folds` is still a parameter in the most recent version but its value is restricted to a minimum of `2`. `apply_cross_fitting=False` is depreciated and replaced with an encouragement to input external predictions to the estimation, which makes it more difficult to experiment with it. We have yet to utilize the full power of Double ML. In our original model with no controls, the use of nonparametric alternatives to OLS may not be as useful because we do not have as much risk of model misspecification. In fact, changing the regressors to Random Forest barely changes the results: ```python # Starting with the code above until here from sklearn.ensemble import RandomForestRegressor # Set up the DoubleMLPLIV model with the specified regressors np.random.seed(314159) model = DoubleMLPLIV(dml_data, ml_l = RandomForestRegressor(), ml_m = RandomForestRegressor(), ml_r = RandomForestRegressor()) # Fit the model and print the results model.fit() print(model.summary) DoubleMLPLIV Model Summary ========================== Outcome Variable: pack_sales_per_capita_demeaned Treatment Variable: price_per_pack_demeaned Instrumental Variable: state_tax_per_pack_demeaned Coefficient: -8.24801 Standard Error: 1.999228 t-statistic: -4.125597 p-value: 0.000037 Confidence Interval: [-12.166425, -4.329594] ``` In data with more variables and forms, however, avoiding a model specification error would be a challenge. Now, let's add `income`, `age`, and `education` to our model and use a linear estimation. Let's start with `R/fixest` first and estimate the model with the additional variables: ```r fixest::etable(fixest::feols(pack_sales_per_capita ~ income + age + education | state + year | price_per_pack ~ state_tax_per_pack, data = df, vcov = "cluster")) fixest::feols(pack_.. Dependent Var.: pack_sales_per_capita price_per_pack -5.848* (2.346) income, age, educ. *Included Fixed-Effects: --------------------- state Yes year Yes _______________ _____________________ S.E.: Clustered by: state Observations 2,550 R2 0.90004 Within R2 0.11860 ``` After controlling for `income`, `age` and `education`, a $1 increase in the price of a pack of cigarettes is estimated to reduce annual per capita cigarette consumption by $5.8. This is a stark difference from our original estimate of $7.8. How much can we trust this model? This is a difficult question. The effect of income can certainly be nonlinear. At very low levels, the cost of smoking may be a deterrent. As income increases, smoking may become a more viable activity. But as income continues to rise, it may be replaced by other, more expensive substitutes. Similarly, age and education may have non-linear effects on smoking. For example, education may reduce sales at an increasing rate. That is, smoking may be more adversely affected as the level of education increases. Deciding how to model these relationships is difficult, but failing to model them correctly leads to misspecification bias, as may be the case in the results above. Is there a solution? Yes, DoubleML can reduce the model misspecification bias. Let's estimate the same model with the control variables using DoubleML and a `Random Forest` estimator: ```python import pandas as pd import numpy as np from doubleml import DoubleMLPLIV, DoubleMLClusterData from sklearn.ensemble import RandomForestRegressor # Load the data df = pd.read_csv('d:/causalbook/data/tobacco-tax-as-iv-extended_v100.csv') # Create categorical dummies for year df = pd.get_dummies(df, columns=['year']) # Demean the variables by state df['price_per_pack_demeaned'] = df['price_per_pack'] - df.groupby('state')['price_per_pack'].transform('mean') df['pack_sales_per_capita_demeaned'] = df['pack_sales_per_capita'] - df.groupby('state')['pack_sales_per_capita'].transform('mean') df['state_tax_per_pack_demeaned'] = df['state_tax_per_pack'] - df.groupby('state')['state_tax_per_pack'].transform('mean') df['income_demeaned'] = df['income'] - df.groupby('state')['income'].transform('mean') df['age_demeaned'] = df['age'] - df.groupby('state')['age'].transform('mean') df['education_demeaned'] = df['education'] - df.groupby('state')['education'].transform('mean') # Create DoubleMLClusterData object with demeaned variables and year_factor as a control year_dummies = [col for col in df.columns if col.startswith('year_')] dml_data = DoubleMLClusterData(df, y_col = 'pack_sales_per_capita_demeaned', d_cols = 'price_per_pack_demeaned', z_cols = 'state_tax_per_pack_demeaned', # Instrumental variable also demeaned x_cols = year_dummies + ['income_demeaned', 'age_demeaned', 'education_demeaned'], cluster_cols = 'state') # Clustering by state # Set up the DoubleMLPLIV model with the specified regressors np.random.seed(314159) model = DoubleMLPLIV(dml_data, ml_l = RandomForestRegressor(), ml_m = RandomForestRegressor(), ml_r = RandomForestRegressor()) # Fit the model and print the results model.fit() print(model.summary) DoubleMLPLIV Model Summary ========================== Outcome Variable: pack_sales_per_capita_demeaned Treatment Variable: price_per_pack_demeaned Instrumental Variable: state_tax_per_pack_demeaned Coefficient: -7.731163 Standard Error: 2.286306 t-statistic: -3.381509 p-value: 0.000721 Confidence Interval: [-12.21224, -3.250086] ``` The DoubleML model continues to consistently estimate the reduction in annual per capita cigarette consumption in response to $1 as $7.7 by mitigating the model's misspecification bias through the two-step estimation process that builds random forests at each step. Note that using DoubleML with linear regressors does not produce the same level of consistency. You can try this by replacing the learners with `LinearRegression()` above. Blandhol et al. (2022) explore how different problems and their associated models and datasets suffer from model misspecification bias by showing the improvement of a DoubleML approach.[^2] Their results, summarized in Figure 4 and Table SA.4 of [the paper](https://www.nber.org/papers/w29709), show that the bias varies greatly across models and datasets, and that the bias increases as the models become more complex with a larger number of variables. They show that the only specifications that have a local average treatment effect (LATE) interpretation are those that control for covariates nonparametrically, implying that such specifications are both sufficient and necessary for a LATE interpretation without additional parametric assumptions. In other words, a proper causal interpretation of the results may depend on the absence of model misspecification, which underlies the importance of DoubleML as a robust treatment effect estimator. [^1]: All the data files used in this book can be found in the [GitHub repository](https://github.com/causalbook/) of the book. [^2]: Blandhol, C., Bonney, J., Mogstad, M., & Torgovitsky, A. (2022). _When is TSLS actually late?_ (No. w29709). Cambridge, MA: National Bureau of Economic Research. > [!info]- Last updated: January 29, 2025