For the definition of data centricity we use here, see [datacentricity.org](https://datacentricity.org/). If staying true to the data means making the correct assumptions about the data, then no design demonstrates the importance of data centricity better than IV. As we covered when discussing the data in [[Statistical modeling of IV]], the IV pattern makes two fundamental assumptions: relevance and exclusion. Consider the model

$$Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \dots + \beta_{k+r} W_{ri} + u_i$$

where

- $Y_i$ is the dependent variable
- $X_{1i}, \dots, X_{ki}$ are $k$ endogenous regressors (e.g., treatment)
- $W_{1i}, \dots, W_{ri}$ are $r$ exogenous regressors (e.g., control variables)
- $Z_{1i}, \dots, Z_{mi}$ are $m$ instrumental variables

1. **Relevance** requires that the instruments be correlated with the endogenous regressors (treatment):

$$\text{Cov}(Z_j, X_\ell) \neq 0 \quad \text{for } j = 1,\dots,m \text{ and } \ell = 1,\dots,k$$

This is the only assumption that can be tested empirically, by testing the strength of the relationship between the instrumental variable and the treatment variable. If the IV model is estimated with two-stage least squares, a first-stage F-test is usually also reported to assess the validity of the relevance assumption (see the first sketch at the end of this note).

2. **Exclusion** requires that the instruments affect the outcome only through their effect on the treatment variable:

$$Y_i \perp Z_j \mid X_1,\dots,X_k,W_1,\dots,W_r \quad \text{for } j = 1,\dots,m$$

Unfortunately, this assumption cannot be tested empirically. Therefore, our discussion focused on the theoretical justification of the instrument in [[Data and conceptual model|Chapter 1.2.1]] for the example used in [[Statistical modeling of IV|Chapters 1.2.2]] and [[(Double) Machine learning using IV|1.2.3]].

In addition to the two main assumptions, there are other subtle details and requirements:

3. **Instrument independence/unconfoundedness/exogeneity**: Instruments must be uncorrelated with the error term:

$$\text{Cov}(Z_j, u_i) = 0 \quad \text{for } j = 1,\dots,m$$

While this assumption is similar to the exclusion restriction, there is a subtle difference: the independence assumption also covers omitted variables (that is, any variables other than the $X$s and $W$s above). An omitted variable that causes both $Z$ and $Y$ would violate this assumption!

The assumptions do not end here. Monotonicity (the instrument always affects the treatment in the same direction) and homogeneity (the effect of the treatment on the outcome is homogeneous across instrument levels and units of analysis) should also be considered, depending on the problem. Note that our intention here is to highlight the importance of the assumptions rather than to be comprehensive.

More importantly, what does it mean to violate these assumptions? Each violation has different implications and potential remedies. For example, violating the relevance assumption (a weak instrument) would lead to biased estimates, potentially worse than those of the naïve model (see the second sketch below). For the sake of brevity, we will not go into the details of assumption violations here.

> [!info]- Last updated: October 2, 2024
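To make the relevance check concrete, here is a minimal sketch of a first-stage regression on simulated data. The variable names, the data-generating process, and the F > 10 rule of thumb in the comments are illustrative assumptions, not taken from the chapters above.

```python
# Minimal sketch: testing the relevance assumption via the first stage.
# The data-generating process and all names here are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2_000
z = rng.normal(size=n)                                 # instrument Z
w = rng.normal(size=n)                                 # exogenous control W
u = rng.normal(size=n)                                 # error term
x = 0.7 * z + 0.4 * w + 0.5 * u + rng.normal(size=n)   # endogenous treatment X
y = 1.5 * x + 1.0 * w + u                              # outcome Y (X endogenous via u)

# First stage: regress the endogenous treatment on the instrument
# and the exogenous control.
exog = sm.add_constant(pd.DataFrame({"z": z, "w": w}))
first_stage = sm.OLS(x, exog).fit()

# Relevance: test H0 that the instrument's coefficient is zero.
# A common rule of thumb flags a first-stage F-statistic below ~10
# as a weak-instrument warning.
print(first_stage.f_test("z = 0"))
```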
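And a second sketch of what a relevance violation does in practice: with a very weak instrument, a manual two-stage least squares estimate can land further from the true effect than the naïve OLS estimate. Again, the data-generating process is an assumption made purely for illustration.

```python
# Minimal sketch: a weak instrument can make 2SLS worse than naive OLS.
# The data-generating process is an illustrative assumption.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
beta_true = 1.5                               # true effect of X on Y

u = rng.normal(size=n)                        # confounding error term
z_weak = rng.normal(size=n)                   # instrument
x = 0.02 * z_weak + u + rng.normal(size=n)    # relevance nearly violated
y = beta_true * x + u                         # X endogenous through u

# Naive OLS: biased because X and u are correlated.
naive = sm.OLS(y, sm.add_constant(x)).fit()

# Manual 2SLS: first stage, then regress Y on the fitted values
# (point estimates only; the second-stage standard errors are not valid).
x_hat = sm.OLS(x, sm.add_constant(z_weak)).fit().fittedvalues
iv = sm.OLS(y, sm.add_constant(x_hat)).fit()

print(f"true beta:    {beta_true}")
print(f"naive OLS:    {naive.params[1]:.2f}")  # biased, but stable
print(f"weak-IV 2SLS: {iv.params[1]:.2f}")     # noisy, can be far off
```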