### Data

The data used in this chapter is synthetic, designed to simulate a real-world customer loyalty program for a major Online Travel Agency (OTA). The causal link we are interested in is the effect of receiving **Platinum Status** on a customer's **Future Spending**. Unlike the cigarette data example, which relies on public records, business data containing individual transaction histories and loyalty tiers is highly proprietary. To overcome this limitation while maintaining the necessary statistical properties, we generated a dataset using a **structural simulation followed by a generative machine learning model** (i.e., a multivariate Gaussian Copula synthesizer trained on residual errors).[^1]

The data generation process involved two main steps:[^2]

1. **Creating the seed data:** We created a "ground truth" seed dataset ($N=1,000$) where the relationships were explicitly defined (a minimal sketch follows below).
    * **Running variable ($X$):** `Past_Spending` was generated using a Gamma distribution to mimic the right-skewed nature of travel expenditures.
    * **Cutoff ($c$):** A strict threshold was set at **\$10,000**. Any customer spending above this amount in the previous year was assigned `Status_Platinum`.
    * **Outcome ($Y$):** `Future_Spending` is defined as a function of past spending, a latent "affluence" factor, random noise, and a fixed treatment effect ($\tau$) of **\$2,000** for those crossing the cutoff (i.e., roughly a 20% increase in spending).
    * **Covariates:** We generated the following confounders: `age`, `number of bookings`, `average booking value`, `percentage of international trips`, and `primary region`. These variables were correlated with spending but **continuous** at the cutoff, satisfying the RDD validity assumption.
2. **Scaling the seed data:** To scale the dataset to $N=50,000$ while preserving the existing joint distributions, we used the **[[Synthetic Data Vault]]** algorithm. We trained a Gaussian Copula synthesizer on the seed data. To ensure the discontinuity was not smoothed out by the generative model, we used a residual-based reconstruction method: we modeled the error terms independently and then deterministically re-calculated the treatment status post-generation.[^3]

See [Synthetic data for regression discontinuity] for the detailed code (in Python).

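The detailed generation code linked above is written in Python; to make the seed-data step concrete, here is a minimal R sketch of the same idea. It is only illustrative: the functional form, the Gamma parameters, the affluence loadings, and the noise level are assumptions chosen for exposition (not the chapter's actual settings), and the covariates are omitted for brevity.

```r
# Minimal sketch of the seed-data step (illustrative parameters, not the chapter's actual ones)
set.seed(42)

n      <- 1000     # seed sample size
cutoff <- 10000    # Platinum qualification threshold
tau    <- 2000     # true treatment effect, known by construction

affluence     <- rnorm(n)                                   # latent "affluence" factor
past_spending <- pmax(rgamma(n, shape = 4, scale = 2500) +  # right-skewed travel spend
                        1500 * affluence, 500)

status_platinum <- past_spending >= cutoff                  # sharp assignment rule

future_spending <- 5000 +
  0.30 * past_spending +            # persistence of spending behavior
  800  * affluence +                # confounding through affluence
  tau  * status_platinum +          # the causal jump at the cutoff
  rnorm(n, sd = 1000)               # idiosyncratic noise

seed_df <- data.frame(past_spending, status_platinum, future_spending)
```

Step 2 would then fit a Gaussian Copula synthesizer to a residualized version of `seed_df`, sample 50,000 rows, and re-apply the `past_spending >= cutoff` rule to the generated data.
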
Now, let's load the data and see the summary statistics:

```r
df <- readr::read_csv('synthetic-loyalty-data-for-rd_v100.csv')
skimr::skim(df)

── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             50000     
Number of columns          9         
_______________________              
Column type frequency:               
  factor                   1         
  logical                  1         
  numeric                  7         
________________________             
Group variables            None      

── Variable type: factor ──────────────────────────────────────────────────────────
  skim_variable  n_missing complete_rate ordered n_unique top_counts                        
1 primary_region         0             1 FALSE          3 Nor: 27055, Eur: 14531, Asi: 8414

── Variable type: logical ─────────────────────────────────────────────────────────
  skim_variable   n_missing complete_rate  mean count                 
1 status_platinum         0             1 0.658 TRU: 32919, FAL: 17081

── Variable type: numeric ─────────────────────────────────────────────────────────
  skim_variable           n_missing complete_rate     mean       sd     p0      p25      p50       p75     p100 hist 
1 customer_id                     0             1 8393515. 4839816.   244  4176748 8433730. 12570896. 16777151 ▇▇▇▇▇
2 past_spending                   0             1   12799.    5403. 1457.    8762.   12147.    16160.   38962. ▅▇▃▁▁
3 age_years                       0             1     47.0     10.2    25      40       47        54       75  ▃▇▇▅▁
4 total_bookings                  0             1     3.30     1.58     2       2        3         4       11  ▇▂▁▁▁
5 avg_booking_value               0             1    6413.    3051.  172.    4108.    6038.     8332.   20203. ▅▇▃▁▁
6 pct_international_trips         0             1    0.746    0.201  0.063    0.618    0.792     0.914        1 ▁▁▃▅▇
7 future_spending                 0             1    9798.    1465. 5135.    8763.    9876.    10862.   14866. ▁▅▇▃▁
```

The variables are defined as follows:

* **`Customer_ID`**: Unique identifier for the traveler.
* **`Past_Spending`**: The running variable. Total USD spent on bookings in the prior 12-month qualification period.
* **`Future_Spending`**: The outcome variable. Total USD spent in the 12 months following the status qualification.
* **`Status_Platinum`**: The treatment variable. Binary indicator (stored as a logical) showing whether the customer qualified for the top tier.
* **`Age_Years`**, **`Total_Bookings`**, **`Avg_Booking_Value`**: Covariates representing customer demographics and booking volume.
* **`Pct_International_Trips`**: Continuous variable (0.0 to 1.0) representing the percentage of the customer's trips that included an international destination.
* **`Primary_Region`**: Categorical variable indicating the customer's home market (North America, Europe, Asia-Pacific).[^4]

### Conceptual model

The causal effect we would like to estimate is the "lift" in customer value generated specifically by the benefits of Platinum Status. We hypothesize that perks such as free lounge access, priority boarding, complimentary room upgrades, discounts, and 2x point accumulation incentivize customers to book exclusively with the OTA, thereby increasing its share-of-wallet by an average of \$2,000.

This relationship is endogenous. Customers who spent heavily in the past (`Past_Spending`) are statistically likely to spend heavily in the future (`Future_Spending`), regardless of their loyalty status. If we simply compared the average spending of Platinum members against non-Platinum members, our estimate would be heavily biased by **selection**: Platinum members are inherently "high spend" customers to begin with.

To isolate the causal effect of the status itself, we use the **Regression Discontinuity Design (RDD)**. The intuition is that while a customer spending \$500 is very different from one spending \$50,000, a customer spending **\$9,999** is essentially identical to a customer spending **\$10,001**. The only meaningful difference between these two customers is that one received the Platinum status and the other did not.

The conceptual model for our Sharp RDD is:

$$
Y_i = \alpha + f(X_i) + \tau D_i + \epsilon_i
$$

Where:

* $Y_i$ is `Future_Spending`.
* $X_i$ is `Past_Spending` (the running variable).
* $D_i$ is the treatment indicator (`Status_Platinum`), which is 1 if $X_i \ge \$10,000$ and 0 otherwise.
* $f(X_i)$ is a flexible function modeling the baseline relationship between past and future spending.
* $\tau$ is the local average treatment effect we hope to estimate.

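Before fitting anything, it can help to eyeball the jump implied by this model. Below is a minimal sketch (assuming `dplyr` and `ggplot2` are available) that bins `past_spending` around the \$10,000 cutoff and plots the mean of `future_spending` in each bin; the \$500 bin width and ±\$5,000 window are illustrative choices, not the chapter's.

```r
# A quick visual check of the discontinuity at the cutoff.
# Bin width ($500) and plotting window (+/- $5,000) are illustrative assumptions.
library(dplyr)
library(ggplot2)

cutoff <- 10000

binned <- df %>%
  filter(abs(past_spending - cutoff) <= 5000) %>%                       # local window
  mutate(bin = floor((past_spending - cutoff) / 500) * 500 + 250) %>%   # $500 bins, midpoints
  group_by(bin) %>%
  summarise(mean_future = mean(future_spending), .groups = "drop")

ggplot(binned, aes(x = bin + cutoff, y = mean_future)) +
  geom_point() +
  geom_vline(xintercept = cutoff, linetype = "dashed") +
  labs(x = "Past spending (USD)", y = "Mean future spending (USD)",
       title = "Binned means of future spending around the $10,000 cutoff")
```

If the design is working as intended, the binned means should trace two smooth curves with a visible vertical gap of roughly \$2,000 at the cutoff.
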
**Disclaimer:** A valid RDD requires two key assumptions to hold:

1. **Continuity of potential outcomes:** In the absence of the treatment, the average outcome would change smoothly across the cutoff. There should be no "jumps" in future spending at \$10,000 other than the one caused by the status. We validated this during data generation by ensuring all covariates (e.g., `Age`, `Pct_International_Trips`) are balanced and smooth at the cutoff. We will check the balance again when we model the data.
2. **No manipulation:** Customers cannot precisely manipulate their spending to fall just above the line. In a real-world OTA context this is plausible, as customers often cannot perfectly predict their final annual spend down to the dollar, though some "bunching" is possible. In our synthetic data, we assume no sorting occurred.

The next step is to put the data into a mathematical model to estimate the discontinuity $\tau$.

[^1]: Because we generated this data, we know the "ground truth." In a real-world analysis, the true $\tau$ is unknown. Here, we know exactly that $\tau=\$2,000$. This allows us to benchmark our statistical models: if our RDD regression estimates a lift of \$2,000, we know our model is unbiased (which won't quite happen!).

[^2]: Do we really need to follow these two steps here? The answer is clearly no. Because our seed data is itself simulated, the value of using the generative model is limited ("estimation error" is introduced as an added perturbation, and interdependencies are inferred rather than hard-coded). This approach is ideally suited to creating a semi-synthetic dataset by starting with a real seed dataset and expanding it. We use it here only to demonstrate the value of generative models in synthetic data creation. In addition, the generative model helps with covariate balance because the Gaussian Copula smooths the distributions.

[^3]: Standard generative models tend to learn smooth distributions and would likely "bridge the gap" at the cutoff, erasing the causal jump that makes the RD design feasible. By modeling the residuals (spending minus the expected status effect) instead of the raw values, we preserved the error structure while allowing us to manually re-inject the sharp discontinuity later.

[^4]: Useful for testing whether the treatment effect varies by region (i.e., heterogeneity in the causal effect).

> [!info]- Last updated: December 1, 2025