So far, we have modeled the same conceptual model on the same data using an instrumental variable in [[Statistical modeling of IV|two stage regression model]] and [[(Double) Machine learning using IV|using double machine learning]]. We obtained qualitatively the same results but the Double ML approach provided us with a solution to the model misspecification problem. Now, what if we model the same problem using Bayesian estimation? As long as the underlying conceptual model and the data are the same, we would expect consistent results (unless we are dealing with a case of [[Lindley's paradox]]).
Speaking of "consistent results", we tend to make the mistake of declaring frequentist and Bayesian results "the same" (especially when Bayesian inference is used as supporting evidence). The two are not directly comparable: the output of a frequentist analysis is a different kind of object than the output of a Bayesian analysis. Let's unpack this before we dive into the example data and code.
The IV models we have used so far in statistics and machine learning produced point estimates with confidence intervals. In the Bayesian solution to the same problem, the output will be a posterior distribution. We could report the mean of the posterior distribution, but that is not the goal here. We will report the mean only to make sense of the results in comparison to the frequentist models we used earlier. We'll then explore the posterior distribution in more detail, since that's the purpose of using a Bayesian model for this problem. If Double ML helps us solve the model misspecification problem, Bayesian inference helps us understand our results better by going beyond point estimates and confidence intervals, and as we will see, this can be useful in exploring the heterogeneity in the causal effect.
We will temporarily change our data but keep the same conceptual model here.[^1] Let's first motivate the problem and see the solution using a frequentist model:
**Price elasticity of demand for cigarettes**
The causal link we are interested in here is the effect of cigarette prices (per pack) on per capita consumption.[^2] This is clearly an endogenous relationship with several potential confounders. For example, the introduction of a new brand may lower prices while increasing consumption through marketing efforts. Since we are interested in the effect of price on consumption, we need to isolate this effect. See [[Confounding by intention]] for the underlying problem and the solution we apply here. We will use the state tax per pack as the instrumental variable by assuming that taxes can affect demand only through price changes. The dataset is from the CDC for 50 states and Washington, D.C. from 1970 to 2019. Let's load the data:[^3]
```r
df <- readr::read_csv('processed-tobacco-tax-as-iv_v101.csv')
```
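The summary shown below can be reproduced with the `skimr` package; a minimal sketch, assuming you want `state` and `year` treated as factors, as the output suggests:
```r
# Quick overview of the data; state and year are converted to factors only for
# the summary, so df itself is left unchanged
skimr::skim(dplyr::mutate(df, state = factor(state), year = factor(year)))
```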
Here are the summary statistics (all numeric variables are in USD):
```r
── Data Summary ────────────────────────
                           Values
Name                       df
Number of rows             2550
Number of columns          5
_______________________
Column type frequency:
  factor                   2
  numeric                  3
________________________
Group variables            None

── Variable type: factor ──────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 state                 0             1 FALSE         51
2 year                  0             1 FALSE         50

── Variable type: numeric ─────────────────────────────────────
  skim_variable         n_missing complete_rate  mean    sd
1 state_tax_per_pack            0             1 0.586 0.769
2 price_per_pack                0             1 2.77  2.31
3 pack_sales_per_capita         0             1 94.0  41.5
```
***The results of the frequentist analysis***
Since we have changed our data here, let's see the results of a frequentist model:
```r
fixest::etable(
  fixest::feols(
    pack_sales_per_capita ~ 1 | as.factor(state) + as.factor(year) | price_per_pack ~ state_tax_per_pack,
    data = df
  ),
  coefstat = c('confint')
)

                          fixest::feols(pack_sales..
Dependent Var.:                pack_sales_per_capita
price_per_pack            -7.815*** [-11.87; -3.760]
                                             (2.019)
Fixed-Effects:            --------------------------
as.factor(state)                                 Yes
as.factor(year)                                  Yes
________________          __________________________
VCOV: Clustered                            by: state
Observations                                   2,550
R2                                           0.89353
Within R2                                    0.06127
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
On average, a $1 increase in the price of a pack of cigarettes results in a $7.82 reduction in annual per capita cigarette consumption.
**Bayesian IV model**
Let's now estimate the same instrumental variable model the Bayesian way. We will use the [[brms]] package in R for our Bayesian models. Recently, Python packages such as `Bambi` (BAyesian Model-Building Interface in Python) have begun to catch up with `brms`, so you may want to check out `Bambi` if you have a strong preference for Python, although `brms` is much more mature at this point. In this book, we will continue to replicate the statistical and machine learning models in Python, but we will not replicate the Bayesian models in Python due to the high computational cost.
The IV model is defined as a multivariate model in `brms`: we first write separate formulas for the endogenous variable (price per pack) and for the outcome (per capita consumption). We also set `rescor` to `TRUE` to model the correlation between the residuals of the two response variables (price and sales). This residual correlation captures unexplained variation shared by the two equations, which is exactly the kind of unobserved confounding that makes price endogenous in the first place. The rest is setting the seed for replicability, setting the number of iterations, and speeding things up by increasing the number of CPU cores to be used.
```r
# First stage: the instrument (state tax) predicts the endogenous regressor (price)
model_1 <- brms::bf(price_per_pack ~ state_tax_per_pack + as.factor(year) + (1 | state))
# Second stage: the outcome (sales) is regressed on the endogenous price
model_2 <- brms::bf(pack_sales_per_capita ~ price_per_pack + as.factor(year) + (1 | state))
# Joint multivariate fit with correlated residuals (rescor) across the two equations
model_multivar <- brms::brm(model_1 + model_2 + brms::set_rescor(rescor = TRUE), data = df, seed = 2024, iter = 4000, cores = 4)
summary(model_multivar)
```
We continue to control for state and year effects. Ideally, we should help our Bayesian estimator by providing it with some informative priors, but our goal in this book is comparison rather than precision. To avoid any bias from prior selection, we use the [[Weakly informative prior distributions|default (weakly informative) priors]] and compute the posterior samples with a Hamiltonian Monte Carlo method ([[No U-Turn Sampler - NUTS]]). In the `brms` defaults, this means Student's t priors with 3 degrees of freedom for the intercepts, half Student's t priors for the group-level and residual standard deviations, and flat (improper uniform) priors for the regression coefficients, including the coefficient of the pack price and the year fixed effects.
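You can list exactly which priors these defaults translate to, either before fitting with `brms::get_prior()` or from the fitted object with `brms::prior_summary()`:
```r
# Priors brms would assign by default to each parameter class of this model
brms::get_prior(model_1 + model_2 + brms::set_rescor(rescor = TRUE), data = df)

# Priors actually used in the fitted model
brms::prior_summary(model_multivar)
```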
> [!NOTE]
> ***Some details about the estimator***
>
> Before we get to the results, let's unpack the number of iterations defined by the `iter` parameter above. The default is `2000` and we set it to `4000`, but what does that really mean?
>
> Well, we are trying to sample from the posterior distribution, and we need a large enough sample to make reliable inferences about it. Depending on the model, the required size of the posterior sample may change. Monte Carlo sampling works roughly as follows:
>
> 1. Start a [[Markov chain|chain]] with some initial parameter values (randomly sampled from a uniform distribution in [[Stan]]).
> 2. Let the chain run for a while to reach the region of interest (the warm-up period, set to `iter/2` by default in `brms`).
> 3. Sample the posterior distribution.
>
> Each chain contributes `iter - warmup` draws to the sample, so the total sample size is `chains x (iter - warmup)`. Increasing the number of iterations in each chain is useful up to the point where the estimates converge (beyond that, more samples do not really help). Not shown in the code is the number of chains (`4` by default in `brms`). Adding more chains increases the size of the useful sample and the credibility of the estimate, but this comes at the expense of computational power and speed.
>
> Chains are [[Pleasingly parallel|pleasingly parallel]]: ideally, we set the number of chains equal to the number of available cores, so that each chain runs independently on its own core.[^4] The snippet after this note makes the draw arithmetic concrete.
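With the call above we have the default 4 chains, `iter = 4000`, and a warm-up of `iter/2 = 2000`, so the posterior sample holds 4 x (4000 - 2000) = 8000 draws. You can verify this on the fitted object; a sketch using the `posterior` package that `brms` builds on:
```r
# Post-warmup sample size: chains x (iter - warmup) = 4 x (4000 - 2000) = 8000
posterior::ndraws(brms::as_draws_df(model_multivar))
```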
***Bayesian results***
Now for the results. After computing the posterior samples, we report below the model summary, where each parameter is summarized using the mean and standard deviation of the posterior distribution and two-sided 95% credible intervals (l-95% CI and u-95% CI) based on quantiles:
```r
 Family: MV(gaussian, gaussian)
  Links: mu = identity; sigma = identity
         mu = identity; sigma = identity
Formula: price_per_pack ~ state_tax_per_pack + as.factor(year) + (1 | state)
         pack_sales_per_capita ~ price_per_pack + as.factor(year) + (1 | state)
   Data: df (Number of observations: 2550)
  Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 8000

Multilevel Hyperparameters:
~state (Number of levels: 51)
                                  Estimate Est.Error l-95% CI u-95% CI Rhat
sd(priceperpack_Intercept)            0.12      0.01     0.10     0.16 1.00
sd(packsalespercapita_Intercept)     21.50      2.21    17.79    26.53 1.01

Regression Coefficients:
                                  Estimate Est.Error l-95% CI u-95% CI Rhat
priceperpack_Intercept                0.26      0.03     0.19     0.32 1.02
packsalespercapita_Intercept        124.12      3.61   116.28   130.65 1.01
priceperpack_state_tax_per_pack       1.21      0.01     1.19     1.23 1.00
packsalespercapita_price_per_pack    -7.90      0.54    -8.95    -6.84 1.01
Year dummies                      Included

Further Distributional Parameters:
                         Estimate Est.Error l-95% CI u-95% CI Rhat
sigma_priceperpack           0.20      0.00     0.19     0.21 1.00
sigma_packsalespercapita    13.84      0.20    13.46    14.23 1.00

Residual Correlations:
                                        Estimate Est.Error l-95% CI u-95% CI Rhat
rescor(priceperpack,packsalespercapita)     0.13      0.02     0.09     0.17 1.00

Draws were sampled using sampling(NUTS). For each parameter, Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```
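If you only need the focal coefficient rather than the full summary, `brms::fixef()` returns the posterior summaries of the population-level effects; the row name matches the label in the output above:
```r
# Posterior mean, error, and 95% credible interval for the price coefficient only
brms::fixef(model_multivar)["packsalespercapita_price_per_pack", ]
```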
Note that the point estimate (mean of the posterior distribution) for our focal effect is very close to the frequentist estimate above ($7.90 reduction in sales versus $7.82 above). This looks "consistent" (see the discussion at the beginning of this chapter). Let's pause for a moment to understand the quality metric of the estimate:
> [!NOTE]
> `Rhat` is the Gelman-Rubin convergence diagnostic, the potential scale reduction factor computed on split chains. It compares the variance between chains to the variance within each chain. An `Rhat` of exactly 1 indicates perfect convergence; values up to 1.05 (or, by the more recent and stricter recommendation, 1.01) are generally considered acceptable. `Rhat` depends on the number of iterations, so increasing the number of iterations *may* help convergence and bring the value closer to 1. The snippet below shows how to check this programmatically.
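Instead of scanning the summary table, you can pull the `Rhat` values for all parameters at once and flag anything above your chosen threshold (here the stricter 1.01):
```r
# Rhat for every parameter; an empty result means no parameter exceeds the threshold
rhats <- brms::rhat(model_multivar)
rhats[rhats > 1.01]
```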
So where do we go from here? Clearly we want to see the posterior distribution. Otherwise, what is the point of using a Bayesian estimator for this problem? (bait question)
```r
library(dplyr)    # for filter()
library(ggplot2)

# Keep only the post-warmup draws of the price coefficient
ggplot(filter(ggmcmc::ggs(model_multivar),
              Parameter == "b_packsalespercapita_price_per_pack",
              Iteration > 2000), aes(x = value)) +
  geom_density(fill = "skyblue", alpha = .5) +
  scale_x_continuous(limits = c(-9.9, -5.9)) +
  scale_y_continuous(limits = c(0, .75)) +
  # Dashed line at the posterior mean reported in the summary above
  geom_vline(xintercept = c(-7.9), col = "#1181D0", linetype = 2) +
  # Mode and 95% highest density interval drawn along the x-axis
  tidybayes::stat_pointinterval(aes(y = 0), point_interval = ggdist::mode_hdi, .width = .95) +
  xlab("Change in sales (USD)") + ylab("Density") +
  theme_classic() + theme(legend.position = "right", text = element_text(size = 25)) +
  labs(title = "Posterior density plot of the effect of price increase on sales")
```
![[posterior-for-price-on-sales.png]]
The mean reported in the summary output above is shown as a dashed blue line on the plot. The darker line on the x-axis is the highest density interval at the 95% probability level, with the mode marked in the middle. In this case, the distribution of the sample is well behaved and the mode is only slightly lower than the mean.
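The mode and the 95% highest density interval shown in the plot can also be computed directly from the posterior draws, using the same `tidybayes`/`ggdist` helpers as in the plotting code:
```r
# Posterior mode and 95% HDI of the effect of a $1 price increase on sales
model_multivar %>%
  tidybayes::spread_draws(b_packsalespercapita_price_per_pack) %>%
  ggdist::mode_hdi(b_packsalespercapita_price_per_pack, .width = .95)
```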
> [!NOTE]
> The two measures of the central tendency of the distribution (the mode and the mean) agree here, but this is not always the case. Consider a Bernoulli distribution with p = 0.7, where 70% of the draws would be 1 and 30% would be 0. The mode, 1, is more informative than the mean here: the mean is 0.7, which is not even a value observed in the hypothetical sample. The short simulation below makes this concrete.
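A few lines of base R illustrate the point (purely illustrative):
```r
draws <- rbinom(1e4, size = 1, prob = 0.7)  # Bernoulli(0.7) draws
mean(draws)    # close to 0.7, a value that never occurs among the draws
table(draws)   # the mode, 1, is the most frequent (and observable) value
```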
One way to use posterior density plots is to visualize the heterogeneity of the effect. Note that our model includes state intercepts (fixed effects in the frequentist model, varying intercepts in the Bayesian model) to account for time-invariant confounders, so the average effect on smoking per dollar increase in price is assumed to be the same across states. The only difference across states is the size of their deviation from the population mean. In other words, the price elasticity estimated by the model is the same everywhere, but each state has a different starting level of demand.
We will argue that the effect of an initial $1 increase in the price of a pack of cigarettes on the demand for cigarettes can vary quite substantially depending on the overall level of education in a state.[^5] We can expect more educated people, on average, to allocate their resources better. That is, if the price of cigarettes rises by a dollar, we can expect the educated group to reduce their consumption (or even quit smoking) so as not to cut back on other, potentially more important expenditures, such as leisure activities or nutritious food. Following this seemingly plausible hypothesis, we will examine how the posterior samples for the effect of the $1 price increase on consumption differ across the following states:
- Massachusetts - MA (highest educational attainment)[^6]
- Georgia - GA (medium educational attainment)
- Mississippi and West Virginia - MS and WV (lowest educational attainment)
```r
selected_states <- c('MA', 'GA', 'MS', 'WV')

model_multivar %>%
  # Draws of the common price slope and of each state's intercept offset
  tidybayes::spread_draws(b_packsalespercapita_price_per_pack,
                          r_state__packsalespercapita[state,]) %>%
  # State-specific effect of a $1 price increase: slope x $1 + state offset
  mutate(state_specific = (1 * b_packsalespercapita_price_per_pack) + r_state__packsalespercapita) %>%
  filter(state %in% selected_states) %>%
  ungroup() %>%
  mutate(state = factor(state)) %>%
  ggplot(aes(y = state, x = state_specific, fill = after_stat(x < 0))) + guides(fill = "none") +
  tidybayes::stat_halfeye(point_interval = tidybayes::mode_hdi, .width = .95,
                          interval_size_range = c(0.66, 0.66)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = c("gray80", "skyblue")) +
  scale_y_discrete(breaks = selected_states) +
  labs(y = 'Density per state',
       x = 'Change in sales (USD)') +
  theme_classic() + theme(legend.position = "right", text = element_text(size = 25)) +
  labs(title = "Posterior density plot of the effect of price increase on sales")
```
![[posterior-for-price-on-sales-per-state.png]]
According to the state-specific effects of price, the demand for cigarettes is inelastic to the initial $1 increase in price in the two states with the lowest levels of education (Mississippi - MS and West Virginia - WV). In other words, the $1 increase does not necessarily lead to a decrease in smoking. We can clearly see that the highest density interval covers zero at the 95% probability level in these two states.
In Massachusetts - MA, we observe a stronger negative effect than the average due to the state's negative offset from the population mean. People in MA reduce their consumption more drastically (and possibly reallocate their resources) in response to the $1 price increase. In Georgia - GA, which is at the median level of educational attainment, the posterior density plot is closer to the average elasticity.
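These visual statements can be backed with a quick numerical check: the posterior probability that the state-specific effect of the $1 increase is negative, computed from the same draws used in the plot (a sketch):
```r
# Posterior probability that a $1 price increase reduces sales, by selected state
model_multivar %>%
  tidybayes::spread_draws(b_packsalespercapita_price_per_pack,
                          r_state__packsalespercapita[state,]) %>%
  mutate(state_specific = b_packsalespercapita_price_per_pack + r_state__packsalespercapita) %>%
  filter(state %in% selected_states) %>%
  group_by(state) %>%
  summarise(prob_negative = mean(state_specific < 0))
```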
Of course, the posterior density plots look quite similar here (with some variation due to the sampling process) because we are combining the same common price effect with state-specific intercepts that we assumed to follow the same distribution with identical priors. For other data and different assumptions, the distributions may differ more. Nevertheless, one of our favorite use cases for Bayesian posterior density plots is to show heterogeneity in an effect visually, using full posterior distributions instead of what would otherwise be a set of point estimates.
Finally, if you are interested in better understanding the Bayesian model dynamics, see [[Oh my! Priors in the Bayesian IV model]] and the upcoming section [[Oh my! Multimodal posterior in the Bayesian IV model]].
[^1]: This section uses different data. Earlier, we examined the effect of a health care bill aimed at protecting health care workers on the price of eating out. We had to be creative and fictionalize the data to fit the example. We ended up with a small data set and a weak instrument that served our purpose up to this point. Now, with Bayesian posterior sampling, we got a warning signal. Our estimator computed a multimodal posterior distribution: multiple plausible parameter values may be estimated due to the weak instrument or limited data. This will be discussed in more detail in the upcoming section [[Oh my! Multimodal posterior in the Bayesian IV model]].
[^2]: Note that the treatment in this example is continuous, and this is a potential problem from the perspective of identifying causal effects. Continuous treatments imply an infinite number of possible outcomes, and only a limited number of them are observed for each state and across states. That is, there are theoretically an infinite number of values that cigarette prices can take, with a correspondingly infinite number of possible values of cigarette sales. Thus, the positivity assumption is likely to be violated, since each state cannot really have a non-zero probability of being assigned to each and every cigarette price level. Similarly, the exchangeability assumption must be addressed because the data have many treatment values for which unconfoundedness must be achieved.
[^3]: All data files used in this book can be found in the book's [GitHub repository](https://github.com/causalbook/).
[^4]: In R, use `parallel::detectCores(logical=FALSE)` to check the number of cores available. In Python, use `psutil.cpu_count(logical=False)` (`True` to get the number of virtual hyperthreaded cores instead of the physical cores).
[^5]: The following code and visualization apply to the effect of the initial $1 increase in cigarette prices on smoking: we visualize the sum of the slope effect for $1 (fixed for all states) and the varying intercept offset for each state (excluding the population-mean intercept from the calculation for clarity). Note that the first term varies with the size of the price change, while the second term does not.
[^6]: Based on the percentage of the population with a bachelor's degree or higher in 2021.