[![[images/colab-badge.svg]]](https://colab.research.google.com/gist/gtozer/e355275e67ca403761e63fdc71dcd321/causalbook-com-synthetic-data-generation-for-regression-discontinuity-design.ipynb)

## OTA Loyalty Program: Platinum Status

### Small parametric seed data

With the following code, we first generate a small **seed dataset** (N=1,000) from a parametric specification. This data is then used to train a generative model (Synthetic Data Vault's Gaussian Copula) to produce a larger synthetic dataset (N=50,000). We explicitly engineer the seed data generation process to include a known **causal discontinuity** (the treatment effect, $\tau$) at a set **spending threshold** ($c$). In other words, we define the causal effect as a **Sharp RDD** on customer loyalty status:

* **Running Variable ($X$):** `past_spending` (total spend in the last 12 months)
* **Cutoff ($c$):** $\$10,000.00$
* **Treatment Variable ($T$):** `status_platinum` (1 if $X \ge c$, 0 otherwise)
* **Outcome Variable ($Y$):** `future_spending` (total spend in the next 12 months)
* **Ground Truth Causal Effect ($\tau$):** $\$1,500.00$ (the jump in expected future spending at $X = c$)

We define the functional form as follows:

$ Y = \alpha + \beta_1 X + \beta_2 X^2 + \tau \cdot T + \epsilon $

where

* $T = \mathbb{I}[\text{past\_spending} \ge 10000]$
* $\tau = 1500$
* $\epsilon \sim \mathcal{N}(0, 1000^2)$

### Large synthetic data using the generative model

In this part, we follow three steps to generate synthetic data that preserves the sharp RDD we defined parametrically above. We train a Gaussian Copula on a **residualized** version of the seed data, then use the trained model to produce the larger dataset (50,000 observations).

1. **Residualization:** The known deterministic RDD functional form ($f(X) = \alpha + \beta_1 X + \beta_2 X^2 + \tau \cdot T$) is used to calculate the predicted outcome $\hat{Y}$ for each record. The outcome variable $Y$ is then replaced by the **residual** $\epsilon = Y - \hat{Y}$.
    * The residual $\epsilon$ is a smooth, continuous variable without the engineered discontinuity, as the jump $\tau$ has been explicitly modeled and removed.
    * See the following line in the code: `seed_data_sdv['future_spending_residual'] = seed_data_sdv['future_spending'] - deterministic_part`.
2. **Training:** We train the Synthetic Data Vault (SDV) Gaussian Copula on the residualized dataset. The model learns the joint distribution and correlations between all covariates and the *smooth* residual ($\epsilon$).
3. **Synthesis and reconstruction:**
    * We use the trained model to generate a large synthetic dataset, including synthetic covariates ($X'$, $T'$, etc.) and a synthetic residual ($\epsilon'$).
    * Crucially, the synthetic outcome $Y'$ is **reconstructed** by re-applying the *original ground truth RDD equation* to the synthetic covariates and residual: $ Y' = \alpha + \beta_1 X' + \beta_2 X'^2 + \tau \cdot T' + \epsilon' $
    * This reconstruction ensures that the fundamental causal structure ($\tau$ = \$1,500) is embedded in the large synthetic dataset, while the synthetic covariates and error term preserve the statistical properties of the seed data.

Without these additional steps, the generative model is likely to smooth out the discontinuity around the cutoff!
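To make precise what is being preserved: because $f(X) = \alpha + \beta_1 X + \beta_2 X^2$ is continuous at the cutoff, both the seed outcome and the reconstructed synthetic outcome jump in expectation by exactly $\tau$ at $X = c$:

$ \lim_{x \downarrow c} E[Y \mid X = x] \;-\; \lim_{x \uparrow c} E[Y \mid X = x] = \big(f(c) + \tau\big) - f(c) = \tau = 1500 $

This is the quantity a sharp RDD estimator applied to the synthetic data should recover.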
```python
!pip install sdv
!pip install statsmodels

import numpy as np
import pandas as pd
from scipy.stats import gamma, norm, poisson, beta
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
```

## **Small parametric seed data**

### Define the functional form and create the data

```python
# --- RDD & Simulation Parameters ---
N_SEED = 1000
CUTOFF = 10000.0
TAU = 1500.0  # Causal Effect
NOISE_STD = 1000.0

# --- Ground Truth RDD Formula Parameters ---
ALPHA = 8000.0
BETA_1 = 0.02
BETA_2 = 0.000001

# --- Data Generation ---
np.random.seed(314159)

# 1. Running Variable (X): past_spending
past_spending = gamma.rvs(a=5, scale=2500, size=N_SEED)
past_spending = np.maximum(1000, past_spending)

# 2. Treatment Variable (T): status_platinum
status_platinum = (past_spending >= CUTOFF).astype(int)

# --- Latent Affluence Factor (Used for balanced covariates) ---
aff_factor = past_spending * 0.1 + norm.rvs(loc=0, scale=200, size=N_SEED)
aff_factor = np.maximum(100, aff_factor)

# 3. Covariate Z1: age_years
age_base = norm.rvs(loc=45, scale=10, size=N_SEED)
age_years = age_base + (past_spending / 5000)
age_years = np.clip(age_years, 25, 75).astype(int)

# 4. Covariate Z2: total_bookings
aff_contribution = aff_factor / 15000
poisson_lambda = np.maximum(1, 4 + aff_contribution)
total_bookings = poisson.rvs(mu=poisson_lambda)
total_bookings = np.maximum(1, total_bookings)

# 5. Covariate Z3: avg_booking_value
avg_booking_value = aff_factor * 5 + norm.rvs(loc=0, scale=400, size=N_SEED)
avg_booking_value = np.maximum(100, avg_booking_value).round(2)

# 6. Covariate Z4: pct_international_trips
MIN_BETA_PARAM = 0.01
alpha_beta = 1 + avg_booking_value / 1000
beta_beta = 5 - (avg_booking_value / 2500)
alpha_beta = np.maximum(alpha_beta, MIN_BETA_PARAM)
beta_beta = np.maximum(beta_beta, MIN_BETA_PARAM)
pct_international_trips = np.clip(beta.rvs(a=alpha_beta, b=beta_beta, size=N_SEED), 0, 1).round(3)

# 7. Covariate Z5: primary_region
regions = ['North America', 'Europe', 'Asia-Pacific']
region_weights = [0.55, 0.30, 0.15]
primary_region = np.random.choice(regions, size=N_SEED, p=region_weights)
# 8. Outcome Variable (Y): future_spending - RDD Structure
f_x = ALPHA + (BETA_1 * past_spending) + (BETA_2 * past_spending**2)
treatment_effect = TAU * status_platinum
error_term = norm.rvs(loc=0, scale=NOISE_STD, size=N_SEED)

future_spending = f_x + treatment_effect + error_term
future_spending = np.maximum(100, future_spending).round(2)

# --- Create the DataFrame ---
seed_data = pd.DataFrame({
    'customer_id': np.arange(1, N_SEED + 1),
    'past_spending': past_spending.round(2),
    'age_years': age_years,
    'total_bookings': total_bookings,
    'avg_booking_value': avg_booking_value,
    'pct_international_trips': pct_international_trips,
    'primary_region': primary_region,
    'status_platinum': status_platinum,
    'future_spending': future_spending
})

print("\n--- Seed Data Created ---")
```

```
--- Seed Data Created ---
```

### Check the seed data using a local bandwidth and report the descriptives

```python
# Define the local window (Customers near the cutoff)
LOCAL_BANDWIDTH = 1000.0  # Bandwidth to define locality around the cutoff

lower_bound_below = CUTOFF - LOCAL_BANDWIDTH
upper_bound_below = CUTOFF
lower_bound_above = CUTOFF
upper_bound_above = CUTOFF + LOCAL_BANDWIDTH

data_below_local = seed_data[
    (seed_data['past_spending'] >= lower_bound_below) &
    (seed_data['past_spending'] < upper_bound_below)
]
data_above_local = seed_data[
    (seed_data['past_spending'] >= lower_bound_above) &
    (seed_data['past_spending'] <= upper_bound_above)
]

# Calculate the local means
mean_below_local = data_below_local['future_spending'].mean()
mean_above_local = data_above_local['future_spending'].mean()
local_crude_jump = mean_above_local - mean_below_local

# Calculate the global means
mean_below_global = seed_data[seed_data['past_spending'] < CUTOFF]['future_spending'].mean()
mean_above_global = seed_data[seed_data['past_spending'] >= CUTOFF]['future_spending'].mean()
global_observed_jump = mean_above_global - mean_below_global

print(f"Seed Data Generated: N={N_SEED}")
print(f"Cutoff (c): ${CUTOFF:,.2f}")
print(f"Target Causal Effect (τ): ${TAU:,.2f}")
print("--- Sanity Check ---")
print(f"Mean Future Spend (Below Cutoff): ${mean_below_global:,.2f}")
print(f"Mean Future Spend (Above Cutoff): ${mean_above_global:,.2f}")
print(f"Difference in the population (all customers): ${global_observed_jump:,.2f}")
print(f"Local difference (customers near the cutoff +/- ${LOCAL_BANDWIDTH:,.0f}): ${local_crude_jump:,.2f}")

# Display the first few rows
print("\n--- Seed Data Head ---")
print(seed_data.head())
```

```
Seed Data Generated: N=1000
Cutoff (c): $10,000.00
Target Causal Effect (τ): $1,500.00
--- Sanity Check ---
Mean Future Spend (Below Cutoff): $8,204.26
Mean Future Spend (Above Cutoff): $10,030.16
Difference in the population (all customers): $1,825.90
Local difference (customers near the cutoff +/- $1,000): $1,413.38

--- Seed Data Head ---
   customer_id  past_spending  age_years  total_bookings  avg_booking_value  \
0            1       12839.86         51               1            5969.18
1            2       20267.75         37               7           11871.20
2            3       21290.67         48               4           10764.34
3            4       24167.55         48               6           12501.67
4            5        7498.23         60               2            4107.04

   pct_international_trips primary_region  status_platinum  future_spending
0                    0.622  North America                1          8157.59
1                    0.994  North America                1         10699.14
2                    0.834  North America                1          9961.24
3                    1.000   Asia-Pacific                1         11173.66
4                    0.479  North America                0          8415.50
```
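The crude local difference above still mixes the jump with the slope of $f(X)$ inside the window, so it need not equal $\tau$ exactly. As an optional check (a sketch, not part of the original notebook output), a local linear regression with separate slopes on each side of the cutoff gives a cleaner estimate on the seed data; the bandwidth reuse and linear-in-$X$ specification are illustrative choices.

```python
# Optional check (sketch): local linear RDD estimate on the seed data.
# Reuses CUTOFF, LOCAL_BANDWIDTH, and the statsmodels import (smf) from above.
rdd_seed = seed_data[
    (seed_data['past_spending'] >= CUTOFF - LOCAL_BANDWIDTH) &
    (seed_data['past_spending'] <= CUTOFF + LOCAL_BANDWIDTH)
].copy()

# Center the running variable at the cutoff and allow different slopes on each side
rdd_seed['x_centered'] = rdd_seed['past_spending'] - CUTOFF
rdd_fit_seed = smf.ols(
    'future_spending ~ status_platinum + x_centered + status_platinum:x_centered',
    data=rdd_seed
).fit()

print(f"Local linear RDD estimate of τ (seed data): ${rdd_fit_seed.params['status_platinum']:,.2f}")
```

The exact estimate will vary with the simulated noise, but it should land closer to the designed $\tau$ than the crude difference in means.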
## **Large synthetic data using the generative model**

### Copy the parametric seed data, redefine the parameters, and calculate the residual

```python
# --- RDD Parameters (Redefined for Consistent Reconstruction) ---
TAU = 1500.0
ALPHA = 8000.0
BETA_1 = 0.02
BETA_2 = 0.000001
CUTOFF = 10000.0

# 1. Clean and Prepare Data
seed_data_sdv = seed_data.copy()
cols_to_drop = ['past_spending_c', 'treated_c', 'is_na']
seed_data_sdv = seed_data_sdv.drop(columns=cols_to_drop, errors='ignore')
seed_data_sdv['status_platinum'] = seed_data_sdv['status_platinum'].astype(bool)

# 2. Calculate the deterministic part of the RDD model
X = seed_data_sdv['past_spending'].values
T = seed_data_sdv['status_platinum'].astype(int).values

f_x = ALPHA + (BETA_1 * X) + (BETA_2 * X**2)
treatment_effect = TAU * T
deterministic_part = f_x + treatment_effect

# 3. Create the new outcome variable: The Residual (ε)
seed_data_sdv['future_spending_residual'] = seed_data_sdv['future_spending'] - deterministic_part

# 4. Drop the original future_spending column (Y)
seed_data_sdv = seed_data_sdv.drop(columns=['future_spending'])
```

### Extract the metadata from the seed data with the residual and use it for training

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# 1. Define Metadata for the Residuals Dataset
metadata_res = SingleTableMetadata()

# Detect all columns first. This handles the residual column as a float.
metadata_res.detect_from_dataframe(data=seed_data_sdv)

# Only update the columns that need explicit definition (Customer ID and Platinum Status)
metadata_res.set_primary_key(column_name='customer_id')
metadata_res.update_column(column_name='status_platinum', sdtype='boolean')

# 2. Train the synthesizer on the smooth, continuous residual data
synthesizer_res = GaussianCopulaSynthesizer(metadata=metadata_res)

print("\n--- Training Gaussian Copula on Smooth Residuals (ε) ---")
synthesizer_res.fit(seed_data_sdv)
print("Training Complete.")
```

```
--- Training Gaussian Copula on Smooth Residuals (ε) ---
Training Complete.
```

### Generate the large synthetic data using the trained Gaussian Copula

```python
# 1. Generate the synthetic data (includes X', T' and ε')
N_SYNTHETIC = 50000
synthetic_data_res = synthesizer_res.sample(num_rows=N_SYNTHETIC)

# 2. Rebuild the Outcome for consistency: Y' = f(X') + τ*T' + ε'
X_prime = synthetic_data_res['past_spending'].values
E_prime = synthetic_data_res['future_spending_residual'].values

# Recalculate T_prime deterministically from X_prime
T_prime = (X_prime >= CUTOFF).astype(int)

# Calculate the deterministic part (f(X') + τ*T')
f_x_prime = ALPHA + (BETA_1 * X_prime) + (BETA_2 * X_prime**2)
# This applies the jump sharply at the cutoff so the causal effect remains
treatment_effect_prime = TAU * T_prime

# Rebuild Y'
Y_prime = f_x_prime + treatment_effect_prime + E_prime

# 3. Finalize the Synthetic Data Frame
synthetic_data = synthetic_data_res.drop(columns=['future_spending_residual']).copy()
synthetic_data['future_spending'] = np.maximum(100, Y_prime).round(2)

# Use the boolean version of status_platinum column for consistency with the metadata
synthetic_data['status_platinum'] = T_prime.astype(bool)

print("\n--- Reconstruction Complete. Synthetic Data Ready. ---")
```

```
--- Reconstruction Complete. Synthetic Data Ready. ---
```
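Before the descriptive check below, a quick visual sanity check can confirm that the jump survives the generative step. This is a sketch using the already-imported matplotlib; the plotting window and bin count are illustrative choices, not part of the pipeline.

```python
# Visual check (sketch): binned means of future spending around the cutoff.
window = synthetic_data[
    (synthetic_data['past_spending'] >= CUTOFF - 5000) &
    (synthetic_data['past_spending'] <= CUTOFF + 5000)
].copy()
window['bin'] = pd.cut(window['past_spending'], bins=40)
binned = window.groupby('bin', observed=True).agg(
    x_mid=('past_spending', 'mean'),
    y_mean=('future_spending', 'mean'),
)

plt.scatter(binned['x_mid'], binned['y_mean'], s=15)
plt.axvline(CUTOFF, color='red', linestyle='--', label='Cutoff (c = 10,000)')
plt.xlabel('past_spending (last 12 months)')
plt.ylabel('Mean future_spending (next 12 months)')
plt.title('Binned means around the cutoff (synthetic data)')
plt.legend()
plt.show()
```

If the reconstruction worked, the binned means should trace a smooth curve on each side of the red line with a visible vertical gap of roughly $\tau$ at the cutoff.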
### Check the large synthetic data by creating a local bandwidth and reporting the descriptives

```python
# Define the local window (Customers near the cutoff)
LOCAL_BANDWIDTH = 1000.0  # Bandwidth to define locality around the cutoff

lower_bound_below = CUTOFF - LOCAL_BANDWIDTH
upper_bound_below = CUTOFF
lower_bound_above = CUTOFF
upper_bound_above = CUTOFF + LOCAL_BANDWIDTH

data_below_local = synthetic_data[
    (synthetic_data['past_spending'] >= lower_bound_below) &
    (synthetic_data['past_spending'] < upper_bound_below)
]
data_above_local = synthetic_data[
    (synthetic_data['past_spending'] >= lower_bound_above) &
    (synthetic_data['past_spending'] <= upper_bound_above)
]

# Calculate the local means
mean_below_local = data_below_local['future_spending'].mean()
mean_above_local = data_above_local['future_spending'].mean()
local_crude_jump = mean_above_local - mean_below_local

# Calculate the global means
mean_below_global = synthetic_data[synthetic_data['past_spending'] < CUTOFF]['future_spending'].mean()
mean_above_global = synthetic_data[synthetic_data['past_spending'] >= CUTOFF]['future_spending'].mean()
global_observed_jump = mean_above_global - mean_below_global

print(f"\n--- Reconstruction Complete. Synthetic Data Ready. ---")
print(f"Synthetic Data Generated: N={N_SYNTHETIC}")
print(f"Cutoff (c): ${CUTOFF:,.2f}")
print(f"Target Causal Effect (τ): ${TAU:,.2f}")
print("--- Synthetic Sanity Check ---")
print(f"Difference in the population (all customers): ${global_observed_jump:,.2f}")
print(f"Local difference (customers near the cutoff +/- ${LOCAL_BANDWIDTH:,.0f}): ${local_crude_jump:,.2f}")

# Display the first few rows
print("\n--- Synthetic Data Head ---")
print(synthetic_data.head())
```

```
--- Reconstruction Complete. Synthetic Data Ready. ---
Synthetic Data Generated: N=50000
Cutoff (c): $10,000.00
Target Causal Effect (τ): $1,500.00
--- Synthetic Sanity Check ---
Difference in the population (all customers): $1,849.37
Local difference (customers near the cutoff +/- $1,000): $1,530.26

--- Synthetic Data Head ---
   customer_id  past_spending  age_years  total_bookings  avg_booking_value  \
0      1940815       16780.42         58               5            8753.69
1     12657432        7422.26         52               5            3930.28
2     10174833       12193.61         50               4            7178.08
3     16497775       26016.98         46               3           14365.88
4      6164702       19644.89         56               6            9337.91

   pct_international_trips primary_region  status_platinum  future_spending
0                    0.889         Europe             True          9225.23
1                    0.682  North America            False          6712.83
2                    0.979   Asia-Pacific             True         10800.52
3                    1.000  North America             True          9950.01
4                    0.897  North America             True         11450.07
```

## **Save the data to your computer**

```python
# from google.colab import files

# # 1. Save the data to a file in the Colab virtual machine (temporary)
# filename = 'synthetic-loyalty-data-for-rd_v100.csv'
# synthetic_data.to_csv(filename, index=False)

# # 2. Trigger the download of the data to your local computer
# files.download(filename)
```
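Finally, as an optional verification (a sketch, not part of the original notebook), the same local linear specification used on the seed data can be fit to the large synthetic dataset to confirm that the embedded effect is recoverable. The bandwidth and linear functional form are illustrative choices; the estimate will vary from run to run.

```python
# Final check (sketch): local linear RDD estimate on the synthetic data.
# Reuses CUTOFF, LOCAL_BANDWIDTH, TAU, and the statsmodels import (smf) from above.
rdd_syn = synthetic_data[
    (synthetic_data['past_spending'] >= CUTOFF - LOCAL_BANDWIDTH) &
    (synthetic_data['past_spending'] <= CUTOFF + LOCAL_BANDWIDTH)
].copy()

rdd_syn['x_centered'] = rdd_syn['past_spending'] - CUTOFF
rdd_syn['treated'] = rdd_syn['status_platinum'].astype(int)  # boolean -> 0/1 for the formula

rdd_fit_syn = smf.ols(
    'future_spending ~ treated + x_centered + treated:x_centered',
    data=rdd_syn
).fit()

print(f"Estimated τ on synthetic data: ${rdd_fit_syn.params['treated']:,.2f} "
      f"(ground truth: ${TAU:,.2f})")
```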