Statistical modeling of RD - Causal Book: Design Patterns in Causal Inference

#### How to implement the design pattern using statistics: The standard approach to estimating the treatment effect in a regression discontinuity design is local polynomial regression at the cutoff. Unlike the parametric models used in the [[Statistical modeling of IV|IV design pattern]], the RD design pattern benefits from a nonparametric approach that avoids assumptions about the functional form of the relationship between the running variable and the outcome far from the cutoff. We will use the _rdrobust_ library, the workhorse package for nonparametric RD estimation with data-driven bandwidth selection.[^1] ###### Using R Prior to the RD model, we fit a naïve model to seek the raw difference in future spending between customers who received Platinum Status and those who did not. Let's load and model the data generated in [[Data and conceptual model (RD)]]:[^2] ```r df <- readr::read_csv('synthetic-loyalty-data-for-rd_v100.csv') names(df) <- tolower(names(df)) # standardize column names fixest::etable(fixest::feols(future_spending ~ status_platinum, data = df)) fixest::feols(fu.. Dependent Var.: future_spending Constant 8,414.4*** (8.213) status_platinumTRUE 2,101.0*** (10.12) ___________________ __________________ S.E. type IID Observations 50,000 R2 0.46285 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` The naïve results show that Platinum customers spend approximately \$2,101 more than non-Platinum customers in the following year on average. We cannot trust this estimate because of the [[Selection bias|selection bias]] discussed in [[Design Pattern II - Regression Discontinuity (RD)]]: customers with higher past spending are systematically different from those with lower past spending (e.g., higher income, stronger brand loyalty), and these same characteristics independently drive future spending. The naïve estimate bundles the true effect of Platinum Status with the inherent spending differences between the two populations. Note that the intercept (\$8,414) represents the average future spending of non-Platinum customers. Next, we turn to the RD design pattern to *locally* isolate the causal effect at the cutoff, as discussed in [[Local identification by a cutoff]]. Local polynomial regression estimates the relationship between past spending and future spending on each side of the cutoff separately, and the treatment effect is the discontinuous jump in the fitted values at the cutoff. The key decision is the **bandwidth**: how far from the cutoff should we look? Too wide, and we include customers who are fundamentally different; too narrow, and we lose statistical power. The `rdrobust` library solves this by selecting an [[MSE-optimal bandwidth]] automatically: ```r library(rdrobust) rd <- rdrobust(y = df$future_spending, x = df$past_spending, c = 10000) summary(rd) ``` The results are as follows: ```r Sharp RD estimates using local polynomial regression. Number of Obs. 50000 BW type mserd Kernel Triangular VCE method NN Number of Obs. 17081 32919 Eff. Number of Obs. 6697 6838 Order est. (p) 1 1 Order bias (q) 2 2 BW est. (h) 1849.278 1849.278 BW bias (b) 3385.648 3385.648 rho (h/b) 0.546 0.546 Unique Obs. 16816 32487 ===================================================================== Point Robust Inference Estimate z P>|z| [ 95% C.I. ] --------------------------------------------------------------------- RD Effect 1567.178 36.225 0.000 [1496.639 , 1667.855] ===================================================================== ``` The estimated [[Local Average Treatment Effect|treatment effect at the cutoff]] is approximately \$1,567: Platinum Status increases future spending by this amount for customers near the \$10,000 threshold. It is close to the ground truth $\tau = \$1,500$ defined in the [[Synthetic data for the RDD|data generation process]]. Compared to the naïve estimate of \$2,101, the RD estimate is substantially lower because it removes the bias from comparing inherently different populations. > [!NOTE] > The output reports the point estimate (the conventional local polynomial estimate) and robust inference (the z-statistic, p-value, and confidence interval that account for the additional uncertainty introduced by bias correction). The robust confidence interval is the recommended one for inference.[^3] > > The [[MSE-optimal bandwidth]] selected here is approximately \$1,849 on each side of the cutoff. This means the estimation uses customers who spent between roughly \$8,151 and \$11,849 in the past year. The effective number of observations within this window is about 13,535 (out of 50,000). You can override the bandwidth with `rdrobust(..., h = 1500)` or change the bandwidth selection method with `rdrobust(..., bwselect = 'cerrd')` for a coverage error rate (CER) optimal bandwidth, which is typically narrower. For details on how the bandwidth affects the estimates, see [[Oh my! Bandwidth sensitivity in the RD model]]. The following plot provides a visual summary of the RD design. The binned scatter plot shows the average future spending within evenly spaced bins of past spending, along with fitted local polynomials on each side of the cutoff: ```r p <- rdplot(y = df$future_spending, x = df$past_spending, c = 10000, hide = TRUE) bins <- p$vars_bins bins$side <- ifelse(bins$rdplot_mean_x < 10000, "below", "above") poly <- p$vars_poly poly$side <- ifelse(poly$rdplot_x < 10000, "below", "above") ggplot2::ggplot() + geom_point(data = bins, aes(x = rdplot_mean_x, y = rdplot_mean_y, color = side), size = 2, alpha = 0.7) + geom_line(data = poly, aes(x = rdplot_x, y = rdplot_y, color = side), linewidth = 1) + geom_vline(xintercept = 10000, linetype = "dashed", color = "grey50") + scale_color_manual(values = c("below" = "#E74C3C", "above" = "#2980B9")) + labs(x = "Past Spending ($)", y = "Future Spending ($)") + guides(color = "none") + theme_minimal(base_size = 14) + theme(panel.grid.minor = element_blank(), legend.position = "top", text = element_text(size = 20)) ``` ![[statistical-modeling-of-RD-2026-04-12-20-24-47.png]] The visible jump in the fitted values at \$10,000 is the treatment effect. The dots on each side follow a smooth upward trend, confirming that the relationship between past and future spending is continuous except at the cutoff, where the Platinum Status is assigned. **Diagnostic tests** To validate the results, we will run two diagnostics tests. First, the McCrary density test checks whether there is an unusual concentration of observations right at the cutoff, which would suggest that customers manipulated their spending to qualify for Platinum Status (discussed in [[Local identification by a cutoff]]): ```r library(rddensity) dens <- rddensity(X = df$past_spending, c = 10000) summary(dens) Manipulation testing using local polynomial density estimation. Number of obs = 50000 Model = unrestricted Kernel = triangular BW method = estimated VCE method = jackknife c = 10000 Left of c Right of c Number of obs 17081 32919 Eff. Number of obs 5462 6344 Order est. (p) 2 2 Order bias (q) 3 3 BW est. (h) 1492.155 1726.459 Method T P > |T| Robust -0.9576 0.3382 ``` The p-value of the manipulation test (0.34) is well above conventional significance levels, so we fail to reject the null hypothesis stating that the density of the running variable is continuous at the cutoff. The results pass the first test. Second, we check whether observable covariates exhibit a discontinuity at the cutoff. If the RD design is valid, pre-treatment characteristics should be balanced across the cutoff, meaning there should be no jump in age, bookings, or travel patterns at \$10,000: ```r covariates <- c('age_years', 'total_bookings', 'avg_booking_value', 'pct_international_trips') balance <- do.call(rbind, lapply(covariates, function(var) { rd_cov <- rdrobust(y = df[[var]], x = df$past_spending, c = 10000) data.frame(Covariate = var, Estimate = round(rd_cov$coef[1], 3), P_value = round(rd_cov$pv[1], 3)) })) balance Covariate Estimate P_value 1 age_years 0.387 0.326 2 total_bookings 0.028 0.582 3 avg_booking_value -42.269 0.223 4 pct_international_trips -0.007 0.255 ``` None of the covariates show a statistically significant discontinuity at the cutoff. This is consistent with the validity of the RD design: customers just below and just above \$10,000 are comparable in their observable characteristics, and the only systematic difference is the Platinum Status assignment. > [!NOTE] > We estimated the Sharp RD above. For the Fuzzy RD case, where the cutoff determines only *eligibility* rather than guaranteed treatment (discussed in [[Local identification by a cutoff]]), `rdrobust` can be extended with the `fuzzy` argument: `rdrobust(..., fuzzy = df$actual_treatment)`. The fuzzy RD estimate is numerically equivalent to a local Wald estimator, which is the ratio of the reduced-form jump to the first-stage jump at the cutoff, the same identification strategy as in the [[Design Pattern I - Instrumental Variable (IV)|IV design pattern]]. Now, click [[Replication of RD in Python|here]] to implement the same model in Python! [^1]: Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295-2326. And Calonico, S., Cattaneo, M. D., Farrell, M. H., & Titiunik, R. (2017). rdrobust: Software for regression-discontinuity designs. The Stata Journal, 17(2), 372-404. [^2]: All data files used in this book can be found in the book's [GitHub repository](https://github.com/causalbook). [^3]: For a concise guide to the bias-correction and robust inference procedure, see Cattaneo, M. D., Idrobo, N., & Titiunik, R. (2020). A practical introduction to regression discontinuity designs: Foundations. Cambridge University Press. > [!info]- Last updated: April 12, 2026