The default in [[rdrobust]] is `p = 1`: a *local linear* fit on each side of the cutoff. Why local linear and not local quadratic, or a global polynomial of order four or six fit to all 50,000 customers? Surely a higher-order polynomial would explain more of the variation in `future_spending` and give a more precise estimate of the jump.
We'll try the local version first, varying `p` from 1 to 4 in `rdrobust` (the bandwidth re-optimizes for each $p$):
```r
library(rdrobust)

local_rows <- do.call(rbind, lapply(1:4, function(p) {
  rd <- rdrobust(y = df$future_spending, x = df$past_spending, c = 10000, p = p)
  data.frame(p = p, h = rd$bws[1, 1],
             coef_conv = rd$coef[1], coef_rob = rd$coef[3],
             se_rob = rd$se[3], ci_lo = rd$ci[3, 1], ci_hi = rd$ci[3, 2])
}))
```
```r
p h coef_conv coef_rob se_rob ci_lo ci_hi
1 1849.3 1567.18 1582.25 43.68 1496.64 1667.85
2 3397.7 1582.18 1595.07 45.62 1505.66 1684.49
3 2817.6 1546.98 1535.08 64.25 1409.15 1661.01
4 3716.0 1543.92 1537.48 68.67 1402.88 1672.08
```
Results are stable across the four orders. The robust point estimates sit within a \$60 band, and every robust CI covers the ground truth $\tau = \$1,500$. The standard error grows with $p$ because the higher-order fit has to estimate more parameters, but the data-driven bandwidth widens to compensate (from \$1,849 at $p = 1$ to \$3,716 at $p = 4$), so the loss of precision stays modest. So far, we're good.
We'll now fit the global version using the specification `lm(future_spending ~ poly(past_spending, k) * status_platinum)` on the full 50,000-row dataset for $k \in \{1, 2, 4, 6\}$, and read the jump at the cutoff via `predict()`:
```r
global_jump <- function(k) {
  m <- lm(future_spending ~ poly(past_spending, k) * status_platinum,
          data = df)
  newd <- data.frame(past_spending = c(10000, 10000),
                     status_platinum = c(TRUE, FALSE))
  fit <- predict(m, newd, se.fit = TRUE)
  jump <- fit$fit[1] - fit$fit[2]
  jump_se <- sqrt(fit$se.fit[1]^2 + fit$se.fit[2]^2)
  data.frame(k = k, jump = jump, se = jump_se,
             ci_lo = jump - 1.96 * jump_se, ci_hi = jump + 1.96 * jump_se)
}

global_rows <- do.call(rbind, lapply(c(1, 2, 4, 6), global_jump))
```
```r
k jump se ci_lo ci_hi
1 1467.67 16.83 1434.68 1500.65
2 1508.74 23.55 1462.59 1554.90
4 1515.00 36.57 1443.32 1586.68
6 1556.83 49.33 1460.14 1653.52
```
The point estimates are reasonable: \$1,468 to \$1,557. But look at the standard errors: at $k = 1$ the global polynomial reports SE = \$16.83, less than half of `rdrobust`'s [[MSE-optimal bandwidth|MSE-optimal]] SE of \$43.68; at $k = 2$ it is \$23.55. The 95% CI at $k = 1$ is [\$1,435, \$1,501] – *narrower* than the local-polynomial CI by a factor of 2.6 – yet the point estimate \$1,467.67 misses the truth by nearly two standard errors, and the upper limit of the CI covers \$1,500 by only sixty-five cents.
This combination – a much narrower interval that only barely covers the truth – is worth understanding carefully. Let's see what's going on in our data:
A global polynomial uses every observation in the dataset to fit the function $f(X)$, including the customer who spent \$3,000 (deep in the control region) and the customer who spent \$45,000 (deep in the treated region). Their data points pull on the fitted curve at their respective $X$ values, but they also shift the polynomial coefficients and thereby influence the fitted value at $X = \$10,000$. The standard error reflects that wide-sample contribution, even though the customer at \$3,000 has nothing useful to say about what happens to a customer who spends \$10,001. The global model assumes its functional form holds throughout the entire range, and it gets its precision by leveraging that assumption. The local polynomial estimator does not make that assumption, and it pays for the missing information in wider standard errors.
Gelman and Imbens (2019) offer a formal critique.[^1] Their three points: (1) global high-order polynomials place strong, often hidden, weights on observations far from the cutoff (weights that are not justified by any RD assumption); (2) inference is sensitive to the polynomial order, and the data cannot tell us which order is correct; (3) the resulting confidence intervals can have very poor coverage. The consensus is to use `rdrobust`'s default: **local linear with a data-driven bandwidth**. Local quadratic is a reasonable robustness check, especially with abundant data. Anything beyond `p = 2` is rarely worth the complexity, and global polynomials of any order trade transparent assumptions for hidden ones.
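Point (1) can be made concrete: an OLS fitted value at the cutoff is a linear combination $w^\top y$ of *every* outcome in the sample, and the weight vector has a closed form, $w^\top = x_0^\top (X^\top X)^{-1} X^\top$. Here is a minimal sketch on hypothetical uniform spending data (the names, sample size, and one-sided order-6 fit are illustrative, not the chapter's dataset or specification):

```r
# Sketch on simulated x only: how much total weight does the order-6 global
# fit at the cutoff place on observations far from the cutoff?
set.seed(1)
x  <- runif(5000, 0, 45000)            # hypothetical past spending
p6 <- poly(x, 6)                       # orthogonal polynomial basis
X  <- cbind(1, p6)                     # design matrix (one side, no treatment term)
x0 <- c(1, predict(p6, 10000))         # basis evaluated at the cutoff

# Implicit weight of each observation on the fitted value at x = 10000:
w <- x0 %*% solve(crossprod(X), t(X))

near <- abs(x - 10000) < 2000
c(near = sum(abs(w[near])), far = sum(abs(w[!near])))
```

Observations well outside any reasonable bandwidth carry substantial total weight in the cutoff prediction; nothing in the RD design justifies those weights, which is exactly Gelman and Imbens's first point.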
> [!NOTE]
> The data generation process for the [[Synthetic data for the RDD|OTA loyalty dataset]] is in fact globally quadratic in `past_spending`, which is why the global polynomial estimates do not diverge wildly across $k \in \{1, 2, 4, 6\}$ (i.e., the polynomial is well-specified by construction). In other, real data, where the true functional form is unknown, the instability across $k$ that Gelman and Imbens warn about shows up more visibly than it does here. The point of this note is therefore the *standard error*, not the *point estimate*.
One takeaway is to exercise caution with global polynomials. They may look more precise than they are. A confidence interval narrower than the local-polynomial interval is not a sign of a better estimator; it is a sign of a stronger assumption. RD designs are built around the idea that we should not have to make that assumption.
[^1]: Gelman, A., & Imbens, G. (2019). Why high-order polynomials should not be used in regression discontinuity designs. *Journal of Business & Economic Statistics*, 37(3), 447-456.
> [!info]- Last updated: May 13, 2026