The [[(Double) Machine learning using RD|DoubleML result]] on our OTA loyalty data is interesting. The same DoubleML setup that the [DoubleML documentation](https://docs.doubleml.org/stable/examples/py_double_ml_rdflex.html) advertises as a tool for tightening confidence intervals, and demonstrates with a six-fold reduction in standard error in its canonical example, produces *no improvement* on ours, with the LightGBM-based variant landing slightly *worse* than no covariate adjustment at all. The standard intuition says a more flexible adjustment should produce tighter (or at worst equally wide) confidence intervals. So what is going on? Nothing is going wrong. The DoubleML docs' synthetic data is engineered to reward flexible covariate adjustment; ours is not. The contrast clarifies what `RDFlex` actually requires of the data.

**Replicating the docs' result.** The DoubleML documentation's example uses `make_simple_rdd_data` to generate $n = 1{,}000$ observations with three covariates and true $\tau = 2.0$. Running the same comparison the docs present:

```python
import numpy as np
import pandas as pd

from doubleml import DoubleMLRDDData
from doubleml.rdd import RDFlex
from doubleml.rdd.datasets import make_simple_rdd_data
from lightgbm import LGBMRegressor
from rdrobust import rdrobust

np.random.seed(1234)

# n = 1,000 observations, three covariates, sharp design, true tau = 2.0
data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=False, tau=2.0)
cov_names = ['x0', 'x1', 'x2']
df_docs = pd.DataFrame(
    np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])),
    columns=['y', 'd', 'score'] + cov_names,
)

# rdrobust: no covariate adjustment vs. linear covariate adjustment
rd_noadj = rdrobust(y=df_docs['y'], x=df_docs['score'], fuzzy=df_docs['d'], c=0.0)
rd_lin = rdrobust(y=df_docs['y'], x=df_docs['score'], fuzzy=df_docs['d'],
                  covs=df_docs[cov_names], c=0.0)

# RDFlex: flexible covariate adjustment with LightGBM
dml_data_docs = DoubleMLRDDData(df_docs, y_col='y', d_cols='d',
                                x_cols=cov_names, score_col='score')
ml_g_lgbm = LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)
rd_lgbm = RDFlex(dml_data_docs, ml_g_lgbm, cutoff=0, fuzzy=False, n_folds=5, n_rep=1)
rd_lgbm.fit(n_iterations=2)
```

| Method                       | Point estimate | Std. error | 95% CI (conventional) |
| ---------------------------- | -------------: | ---------: | --------------------- |
| rdrobust (no covariates)     |          2.407 |      0.634 | [1.164, 3.650]        |
| rdrobust (linear covariates) |          2.207 |      0.433 | [1.359, 3.056]        |
| RDFlex (LGBM)                |          2.014 |      0.103 | [1.812, 2.217]        |

A six-fold reduction in standard error from no adjustment to flexible adjustment, with the point estimate also moving toward the true value. The docs' commentary is direct: *"[T]he flexible adjustment decreases the standard error in the estimation and therefore provides tighter confidence intervals."*

**Compare with our OTA loyalty data** from [[(Double) Machine learning using RD|Chapter 2.2.3]]:

| Method                       | Point estimate | Std. error | 95% CI (robust)      |
| ---------------------------- | -------------: | ---------: | -------------------- |
| rdrobust (no covariates)     |       1567.178 |     38.508 | [1496.639, 1667.855] |
| rdrobust (linear covariates) |       1567.224 |     37.940 | [1497.391, 1666.519] |
| RDFlex (LGBM)                |       1561.339 |     39.286 | [1486.736, 1663.013] |

The standard error moves by no more than about 2% across the three methods, and the LGBM-adjusted estimator is the worst of the three. Why?
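Before the diagnosis, for reproducibility: the OTA numbers above come from the same pipeline as the docs replication. A minimal sketch of that run, in which the file name, cutoff value, covariate list, and the sharp-design assumption are all placeholders; only the `future_spending` and `past_spending` columns are taken from the [[Synthetic data for the RDD|OTA dataset]]:

```python
import pandas as pd

from doubleml import DoubleMLRDDData
from doubleml.rdd import RDFlex
from lightgbm import LGBMRegressor
from rdrobust import rdrobust

df_ota = pd.read_parquet('ota_loyalty.parquet')  # placeholder file name
cutoff = 5_000.0                                 # placeholder loyalty threshold
ota_covs = ['cov_a', 'cov_b', 'cov_c']           # placeholder covariate names
df_ota['d'] = (df_ota['past_spending'] >= cutoff).astype(float)  # sharp design assumed

# rdrobust: no adjustment vs. linear covariate adjustment
rd_ota_noadj = rdrobust(y=df_ota['future_spending'], x=df_ota['past_spending'], c=cutoff)
rd_ota_lin = rdrobust(y=df_ota['future_spending'], x=df_ota['past_spending'],
                      covs=df_ota[ota_covs], c=cutoff)

# RDFlex with the same LightGBM learner settings as in the replication above
dml_data_ota = DoubleMLRDDData(df_ota, y_col='future_spending', d_cols='d',
                               x_cols=ota_covs, score_col='past_spending')
ml_g = LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)
rd_ota_lgbm = RDFlex(dml_data_ota, ml_g, cutoff=cutoff, fuzzy=False, n_folds=5, n_rep=1)
rd_ota_lgbm.fit(n_iterations=2)
```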
**The docs' dataset is engineered.** The `make_simple_rdd_data` source reveals the structure:

$$Y = \tau\, D + 0.1 \cdot \text{score}^2 - 0.5 \cdot D \cdot \text{score}^2 + \sum_{i=1}^{3} \text{Polynomial}_4(X_i) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 0.2^2)$$

The three covariates enter as **degree-4 polynomials**, while the running variable enters only as a small quadratic. By construction, the covariates carry the bulk of the variance in $Y$, and the relationship is highly nonlinear. This is exactly the structure that flexible ML can exploit but linear adjustment cannot. The [[Synthetic data for the RDD|OTA dataset]] has no analogous structure: the covariates correlate with `future_spending` in approximately linear ways through their joint distribution with `past_spending`.

To quantify the contrast, we fit $Y$ on the covariates alone (excluding the running variable), once linearly and once with LightGBM, on each dataset:

| Diagnostic                                                        | Docs dataset | OTA dataset |
| ----------------------------------------------------------------- | -----------: | ----------: |
| Var($Y$) explained by linear adjustment on score                  |         0.9% |       39.5% |
| Var($Y$) explained by linear adjustment on covariates             |        45.9% |       34.7% |
| Var($Y$) explained by LGBM adjustment on covariates               |        97.1% |       38.7% |
| **Nonlinearity gain** ($R^2_{\text{LGBM}} - R^2_{\text{linear}}$) |     **0.51** |    **0.04** |

In the docs dataset, LightGBM moves the covariates' $R^2$ from 0.46 (linear) to 0.97 (flexible), recovering an *additional* 51 percentage points of outcome variance that linear adjustment misses. That extra explained variance comes straight out of the outcome residual, and the smaller residual is what cuts the `rdrobust` SE six-fold. In the OTA dataset, the corresponding gain is 4 percentage points; the linear adjustment already captures essentially all the predictive content the covariates have, and the flexible learner has almost nothing additional to discover. A third factor reinforces this: the [[MSE-optimal bandwidth]] uses roughly 6,800 effective observations per side in our data versus roughly 290 per side in the DoubleML docs example, so even no-covariate inference is already efficient in absolute terms.

**When DoubleML helps in RD.** Three conditions need to hold:

**1. Substantial nonlinear or interactive covariate-outcome structure.** Linear adjustment captures linear effects fully; a flexible learner only adds value to the extent the relationship is nonlinear. The diagnostic table above is the way to check (a sketch of the computation closes this note): if the linear and flexible $R^2$ are close, `RDFlex` does not have much to add.

**2. Covariate information independent of the running variable.** If the covariates correlate with the outcome only through their correlation with the score, the score itself already does the work and the covariates are redundant. Linear adjustment can detect this redundancy; flexible adjustment cannot conjure information that is not there.

**3. A small effective sample at the cutoff.** With $n_{\text{eff}}$ in the thousands, no-covariate inference is already efficient and there is little marginal precision left to absorb. `RDFlex` pays off most when the effective sample on each side is small.

None of these conditions holds strongly in our OTA dataset; all three hold strongly in the DoubleML docs' example dataset. `RDFlex` is worth the extra machinery when they do. As always, the choice of model specification is conditional on the data and context.
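The nonlinearity-gain diagnostic from the table above is cheap to run before reaching for `RDFlex`. A minimal sketch, assuming only a DataFrame with an outcome column and a covariate list (the function name and its defaults are ours, not part of any library); out-of-fold predictions keep the flexible learner's $R^2$ honest:

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict


def nonlinearity_gain(df, y_col, cov_names, cv=5):
    """R^2 of Y on the covariates alone (no running variable): linear vs. LightGBM."""
    X, y = df[cov_names], df[y_col]
    r2 = {}
    for name, model in [
        ('linear', LinearRegression()),
        ('lgbm', LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)),
    ]:
        # Out-of-fold predictions avoid rewarding in-sample overfit.
        y_hat = cross_val_predict(model, X, y, cv=cv)
        r2[name] = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2, r2['lgbm'] - r2['linear']
```

A gain near zero says linear adjustment already captures what the covariates know, and `RDFlex` has little residual variance left to remove.

> [!info]- Last updated: May 13, 2026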