Data and conceptual model - Causal Book: Design Patterns in Causal Inference

##### Data

The data concern the price elasticity of demand for cigarettes. The causal link of interest is the effect of cigarette prices (per pack) on per capita consumption.[^1] Cigarette consumption data come from the CDC and cover the 50 states and Washington, D.C. from 1970 to 2019. The data on income, age, and education come from several sources, including the U.S. Census Bureau, the IPUMS National Historical Geographic Information System (NHGIS), and the National Center for Education Statistics (NCES).

Let's load the data and look at the summary statistics:

```r
# Load the processed state-year panel and print summary statistics
df <- readr::read_csv('processed-tobacco-tax-as-iv_v101.csv')
skimr::skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df
Number of rows             2550
Number of columns          8
_______________________
Column type frequency:
  character                1
  numeric                  7
________________________
Group variables            None

── Variable type: character ────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 state                 0             1   2   2     0       51          0

── Variable type: numeric ──────────────────────────────────────────────
  skim_variable         n_missing complete_rate     mean       sd       p0      p25      p50      p75     p100 hist
1 year                          0             1    1994.     14.4     1970     1982    1994.     2007     2019 ▇▇▇▇▇
2 state_tax_per_pack            0             1    0.586    0.769     0.02     0.13     0.24    0.675     4.94 ▇▁▁▁▁
3 price_per_pack                0             1     2.77     2.31    0.287    0.807     1.82     4.48     10.5 ▇▂▂▁▁
4 pack_sales_per_capita         0             1     94.0     41.5      9.1     60.8     95.6     122.     296. ▅▇▃▁▁
5 income                        0             1   24822.   15558.     2736    11404    22376   36467.    83664 ▇▆▅▁▁
6 age                           0             1     79.6     3.69     69.4     77.8     80.6     82.2     89.2 ▂▃▇▇▁
7 education                     0             1     42.4     14.8       14       29     44.1     55.6     77.7 ▆▆▇▇▁
```

The variables `state_tax_per_pack`, `price_per_pack`, and `income` are in USD, while `pack_sales_per_capita` is the number of packs sold per person. `age` and `education` are percentages. `age` is the percentage of the population that is at least 18 years old.
`education` is the percentage of the population with a college degree or higher.

##### Conceptual model

The causal effect we would like to estimate is the effect of cigarette prices (per pack) on per capita consumption. Price is clearly endogenous in this relationship, with several potential confounders. For example, the introduction of a new brand may lower prices while also increasing consumption through marketing efforts. To estimate the effect of price on consumption, we need to isolate the variation in price that is unrelated to such confounders. See [[Confounding by intention]] for the underlying problem and the solution we apply here. We will use the state tax per pack as the instrumental variable, assuming that taxes can affect demand only through price changes.

The next step is to translate the conceptual model into a mathematical model of the data:

- [[Statistical modeling of IV]]
- [[(Double) Machine learning using IV]]
- [[IV the Bayesian way]]

**Disclaimer:** A valid instrumental variable must satisfy three key conditions: relevance (strong correlation with the endogenous variable), exclusion restriction (affecting the outcome only through the endogenous variable), and independence (uncorrelated with the error term). While established use in the literature is a good starting point, theoretical justification and context-specific empirical testing are still important. Note that an instrument that works well in one setting may not be appropriate in another because of differences in underlying mechanisms or data structures. Empirical validation includes F-tests for instrument strength in the first stage (relevance), overidentification tests when multiple instruments are available, and sensitivity analyses to assess robustness. Our example is intended to illustrate the use of IV for causal effect estimation. Ideally, the strength of the instrument and potential violations of the exclusion restriction should be examined in more detail (e.g., by using multiple instruments or performing placebo tests).
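The first-stage F-test for relevance mentioned in the disclaimer can be sketched as follows. This is a minimal illustration, not part of the original analysis: the helper `first_stage_f()` is hypothetical, and the block runs on simulated data that only borrows the column names of `df`, with an assumed data-generating process. It compares the first-stage regression with and without the instrument; a common rule of thumb treats an F-statistic below about 10 as a sign of a weak instrument.

```r
# Hypothetical helper: partial F-statistic for the instrument in the
# first-stage regression of price on the tax and the controls.
first_stage_f <- function(data) {
  restricted <- lm(price_per_pack ~ income + age + education, data = data)
  unrestricted <- lm(
    price_per_pack ~ state_tax_per_pack + income + age + education,
    data = data
  )
  # For nested lm fits, anova() reports the F-test of the added term in row 2
  anova(restricted, unrestricted)$F[2]
}

# Simulated stand-in for `df` (NOT the real data; column names only).
set.seed(42)
n <- 500
sim <- data.frame(
  state_tax_per_pack = runif(n, 0.02, 5),
  income             = rnorm(n, 25000, 5000),
  age                = rnorm(n, 80, 3),
  education          = rnorm(n, 42, 10)
)
# Assumed data-generating process: price rises with the tax, plus noise.
sim$price_per_pack <- 1 + 1.2 * sim$state_tax_per_pack + rnorm(n, sd = 0.5)

f_stat <- first_stage_f(sim)
print(f_stat)
```

On the real data the call would be `first_stage_f(df)`. Because `df` is a state-year panel, a more careful check would likely add state and year effects and use cluster-robust standard errors, which this sketch leaves out.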
Our instrument passes the sniff test. For the sake of brevity, we leave further theoretical justification and empirical testing to the reader.

[^1]: Note that the treatment in this example is continuous, which is a potential problem for identifying causal effects. A continuous treatment implies an infinite number of potential outcomes, of which only a few are observed for each state and across states. That is, cigarette prices can in theory take infinitely many values, each with its own potential value of cigarette sales. Thus, the positivity assumption is likely to be violated, since a state cannot realistically have a non-zero probability of being assigned to every cigarette price level. Similarly, the exchangeability assumption requires attention because unconfoundedness must hold across many treatment values.

> [!info]- Last updated: January 27, 2025