#### How to implement the design pattern using machine learning

Double machine learning (Double ML) extends to regression discontinuity through the `RDFlex` class in the *[[DoubleML]]* library. `RDFlex` pairs the local polynomial estimator from `rdrobust` with a machine-learning step that flexibly adjusts for covariates.[^1] Unlike the [[(Double) Machine learning using IV|IV design]], DoubleML does not change *what* RD identifies, which still rests on continuity at the cutoff, as we discussed in [[Statistical modeling of RD|Chapter 2.2.2]]. What it can change is *how precisely* we estimate the [[Local Average Treatment Effect|LATE]] at the cutoff. The mechanism is straightforward: estimate a flexible adjustment function $\eta(X)$ from the covariates, subtract it from the outcome, and run `rdrobust` on the residualized outcome.[^2] The smaller the residual variance, the tighter the standard error.

###### Using Python

We apply DoubleML to the OTA loyalty data from [[Statistical modeling of RD|Chapter 2.2.2]] with `LGBMRegressor` as the outcome learner (the default learner used by the [DoubleML documentation for Regression Discontinuity](https://docs.doubleml.org/stable/examples/py_double_ml_rdflex.html)).
We also fit `rdrobust` with linear covariate adjustment (via the `covs=` argument) as a midpoint between the no-adjustment baseline and the flexible-adjustment estimator:

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from rdrobust import rdrobust
from doubleml import DoubleMLRDDData
from doubleml.rdd import RDFlex

df = pd.read_csv('synthetic-loyalty-data-for-rd_v100.csv')
df.columns = df.columns.str.lower()
df['treat'] = df['status_platinum'].astype(int)
covariates = ['age_years', 'total_bookings', 'avg_booking_value',
              'pct_international_trips']

# rdrobust with linear covariate adjustment
rd_covs = rdrobust(y = df['future_spending'], x = df['past_spending'],
                   c = 10000, covs = df[covariates])

# RDFlex with LightGBM
dml_data = DoubleMLRDDData(df,
                           y_col = 'future_spending',
                           d_cols = 'treat',
                           score_col = 'past_spending',
                           x_cols = covariates)
np.random.seed(314159)
rd_lgbm = RDFlex(dml_data,
                 ml_g = LGBMRegressor(n_estimators = 500,
                                      learning_rate = 0.01,
                                      verbose = -1),
                 fuzzy = False,
                 cutoff = 10000)
rd_lgbm.fit(n_iterations = 2)
```

Comparing the no-covariate baseline from [[Statistical modeling of RD|Chapter 2.2.2]] with the linear-adjusted and the LGBM-adjusted estimates:

| Method                       | Point estimate | Std. Error | 95% CI (Robust)      |
| ---------------------------- | -------------: | ---------: | -------------------- |
| rdrobust (no covariates)     |       1567.178 |     38.508 | [1496.639, 1667.855] |
| rdrobust (linear covariates) |       1567.224 |     37.940 | [1497.391, 1666.519] |
| RDFlex (LGBM)                |       1561.339 |     39.286 | [1486.736, 1663.013] |

The point estimate is stable across all three methods (\$1,561–\$1,567), which is what we expect from the RD design: continuity at the cutoff does the identifying work, and adding predetermined covariates (linearly or flexibly) does not change *what* we estimate. Note that the standard error moves by less than 2%, and the LGBM adjustment is actually *slightly worse* than no adjustment at all.
This is surprising: we hoped to see a tighter standard error from DoubleML. After all, the example in the [DoubleML documentation for Regression Discontinuity](https://docs.doubleml.org/stable/examples/py_double_ml_rdflex.html) reports a six-fold reduction in SE using the same setup. The gap traces to the structure of our data-generating process. To see the details of why this is happening, visit [[Oh my! DoubleML is worse for the RD design]].

In summary, DoubleML in RD does not change *what* we are estimating, but we expected it to change *how precisely* we estimate it. In our data, we did not observe an improvement in precision. When the covariates carry substantial nonlinear information independent of the running variable, the precision gain can be large; when they do not, the linear `rdrobust(covs=...)` adjustment is already at or near the efficiency frontier. Our OTA loyalty data falls in the second regime, and the chapter's headline estimate remains the [[Statistical modeling of RD|Chapter 2.2.2]] number: \$1,567 with a 95% robust confidence interval of [\$1,497, \$1,668]. We discuss where DoubleML makes a difference in [[Oh my! DoubleML is worse for the RD design]].

> [!NOTE]
> `RDFlex` takes only one learner (`ml_g`) in a sharp design, unlike `DoubleMLPLIV` in the [[(Double) Machine learning using IV|IV chapter]], which requires three (`ml_l`, `ml_m`, `ml_r`). This is because the covariate adjustment in sharp RD has a single role: estimate the outcome adjustment functions on the two sides of the cutoff. There is no treatment model to estimate separately, because treatment is mechanically determined by the running variable at the cutoff, and no instrument model to estimate, because the running variable itself is the source of local exogeneity.
>
> For fuzzy RD, where the cutoff determines only eligibility rather than treatment, set `fuzzy = True` and pass an additional learner via `ml_m` for the treatment probability model.
> The identification logic then mirrors the local Wald estimator discussed in [[Statistical modeling of RD]].

[^1]: Noack, C., Olma, T., & Rothe, C. (2024). Flexible covariate adjustments in regression discontinuity designs. arXiv:2107.07942. The `doubleml.rdd` module requires the `rdrobust` Python package; install with `pip install doubleml[rdd] lightgbm`.

[^2]: DoubleML also uses `rdrobust` as a helper library in the background: after the ML adjustment step, `RDFlex` produces the final treatment-effect estimate by running `rdrobust` on the residualized outcome at the cutoff.

> [!info]- Last updated: May 13, 2026