Identifying suitable instrumental variables (IVs) is a significant challenge, largely due to the expansive search space. Consider a scenario where we have a conceptual model complete in every aspect except for the IV we need for identification. **How can we systematically discover theoretically sound and empirically feasible IVs and locate the actual data consistent with our level of analysis?**

Domain expertise often guides this search. For example, state cigarette taxes have served as a valid IV for price in models of cigarette sales, as their influence on consumption is widely accepted to operate solely through price. However, in many contexts, an obvious IV may not present itself, necessitating a broader and more systematic search.

One of the areas where large language models (LLMs) notably excel is **search and discovery**. This chapter explores leveraging LLMs for IV discovery. Given the multitude of available LLMs and prompting strategies, we will begin by following the guidance outlined in Han (2024).[^1] In this experimental chapter, we will use `Gemini 2.5 Pro Preview Version 06-05-2025` to identify an IV for the problem previously addressed in this chapter (see [[Data and conceptual model]] for details).

The biggest challenge, however, often lies in finding data on candidate IVs that match the unit of analysis, particularly data with the necessary variation across entities and/or time. Han (2024) does not cover this aspect, so we're building on their paper to offer a prompt template designed to search for actual, useful data on these candidates, not just ideas for the IVs themselves. The results look promising and can greatly assist data scientists, especially when their analysis is **problem-driven** rather than **data-driven**.

Let's get started. In Han (2024), the recommended prompt for IV search is as follows:

```
you are [agent] who needs to make a [treatment] decision in [scenario].
what are factors that can determine your decision but do not directly affect your [outcome], except through [treatment] (that is, factors that affect your [outcome] only through [treatment])? list [K_0] factors that are quantifiable. explain the answers.
```

where \[treatment] is the treatment $D$, \[agent] is the economic agent whose decision is $D$, \[scenario] is the specific setting of interest, \[outcome] is the outcome $Y$, and \[K_0] is the desired number of variables $K_0$. Let's fill in the blanks for our specific problem:

> [!Prompt]
> You are an average American who needs to make a decision about whether to consume cigarettes based on price. What are factors that can determine the price but do not directly affect your decision to consume cigarettes, except through price (that is, factors that affect your cigarette consumption only through price)? List ten factors that are quantifiable. Explain the answers.

Here is Gemini's answer:

> [!Response]
> As an average American weighing the decision to smoke based on price, several factors can drive the cost of a pack of cigarettes up or down. These elements influence my choice to buy cigarettes only by affecting how much I have to pay for them. Here are ten such quantifiable factors:
>
> **1. State Excise Tax:**
> - **Quantifiable Measure:** Dollars per pack of 20 cigarettes.
> - **Explanation:** Each state sets its own excise tax on cigarettes, which is a significant portion of the final price. For example, as of early 2024, New York had the highest state excise tax at $5.35 per pack, while Missouri had the lowest at $0.17. This tax directly increases the price I pay at the counter. A higher tax in my state makes smoking more expensive, which might lead me to quit or cut back.
>
> **2. Federal Excise Tax:**
> - **Quantifiable Measure:** Dollars per pack of 20 cigarettes.
> - **Explanation:** The federal government imposes a nationwide excise tax on all cigarettes sold in the United States. Currently, this tax is $1.01 per pack. This tax is a fixed amount added to the price of every pack, regardless of the state in which it is sold. Any change in this federal tax rate directly impacts the retail price.
>
> **3. MSA Payments (Tobacco Master Settlement Agreement):**
> - **Quantifiable Measure:** Annual payment amount made by tobacco companies to the settling states (in billions of dollars).
> - **Explanation:** In 1998, major tobacco companies and 46 states signed the Master Settlement Agreement to settle smoking-related healthcare lawsuits. As part of this agreement, these companies make annual payments to the states. Tobacco companies pass on the cost of these payments to consumers in the form of higher cigarette prices. The specific amount of these payments can fluctuate, affecting the wholesale and, consequently, the retail price.
>
> **4. Crude Oil Prices:**
> - **Quantifiable Measure:** Price per barrel of crude oil (e.g., West Texas Intermediate - WTI).
> - **Explanation:** The transportation of tobacco from farms to processing plants and the distribution of finished cigarettes to retailers rely heavily on fuel. When the price of crude oil increases, so do transportation costs for the tobacco industry. These increased logistical costs are then factored into the final price of a pack of cigarettes.
>
> **5. Average Commercial Rent in a Locality:**
> - **Quantifiable Measure:** Average price per square foot for commercial real estate.
> - **Explanation:** The rent for the retail space where I buy my cigarettes (e.g., a convenience store or gas station) is a significant overhead cost for the business owner. In areas with high commercial rents, retailers may need to charge more for their products, including cigarettes, to maintain their profit margins.
>
> **6. Labor Costs for Agricultural Workers:**
> - **Quantifiable Measure:** Average hourly wage for agricultural workers in tobacco-growing states.
> - **Explanation:** The cost of labor to plant, cultivate, and harvest tobacco is a direct input cost for cigarette manufacturers. If the wages for these farmworkers increase, the cost of raw tobacco rises, leading to a higher price for the final product.
>
> **7. Cost of Packaging Materials (Paper and Plastic):**
> - **Quantifiable Measure:** Market price of pulp for paper and plastic film.
> - **Explanation:** Cigarette packaging, including the paper for the cigarettes themselves, the foil, the plastic wrap, and the cardboard box, is a component of the overall production cost. Fluctuations in the market prices of these raw materials will affect the manufacturing cost and, in turn, the retail price.
>
> **8. Wholesale Price of Tobacco Leaf:**
> - **Quantifiable Measure:** Price per pound of flue-cured or burley tobacco at auction.
> - **Explanation:** The primary raw material for cigarettes is the tobacco leaf itself. The price of this leaf can vary due to factors like crop yields, weather conditions in growing regions, and global demand. A poor harvest, for instance, can lead to a higher price for raw tobacco, which manufacturers will pass on to consumers.
>
> **9. Inflation Rate (Consumer Price Index - CPI):**
> - **Quantifiable Measure:** The annual percentage change in the CPI.
> - **Explanation:** General inflation in the economy affects the price of all goods and services, including cigarettes. As the cost of production, transportation, and doing business rises with inflation, the price of cigarettes will also increase to keep pace.
>
> **10. Health Insurance Surcharges for Retailers:**
> - **Quantifiable Measure:** Average percentage increase in health insurance premiums for small businesses.
> - **Explanation:** The cost of providing health insurance to employees is a significant expense for the retailers who sell cigarettes. As these insurance costs rise, business owners may need to increase the price of their products to cover this overhead. This is an indirect but quantifiable factor that can contribute to the final price I pay for a pack of cigarettes.

The candidate IVs look plausible and the first one is indeed the IV we used. Han (2024) also offers ideas for refining IV searches by accounting for confounders and covariates; we refer you to the paper for those details. We will address two major issues next.

First, while the LLM suggests federal cigarette tax as an IV, our identification strategy uses state and year fixed effects. Since the federal tax varies over time but not across states, it will be absorbed by the year fixed effects. This problem arises because the initial prompt does not specify the unit of analysis. Second, the prompt does not address the major challenge of finding actual data for the candidate IVs. These two problems are related: even if data exists, it is not useful for our problem if it is not available at our level of analysis. We'll focus on solving both of these shortcomings in a single follow-up prompt.

So, what are we looking for? We need data from a credible source that aligns with our focal level of analysis, ideally including a direct link to the data file and information on any limitations. This means we are seeking information both about the data source (with its URL) and the data itself. After some experimentation, we have developed a prompt template to help locate data for candidate IVs at the desired level of analysis. This template is designed to follow the initial prompt used to generate the candidate IV list. This is what it looks like:

```
For each factor, search for [entity-time] level data for the [entities] from [interval].
For each factor where data is available, provide:
- Data source name and brief description
- Direct download URL for the data file
- File format (preferably CSV or other analysis-ready formats)
- Any relevant notes about data quality or limitations
```

where \[entity-time] is the data aggregation level (e.g., store-month or state-year), \[entities] is the set of entities for which the data is needed, and \[interval] is the duration of the data needed for the analysis. Again, let's fill in the blanks for our problem:

> [!Prompt]
> For each factor, search for state-year level data for the 50 states and Washington, D.C. from 1970 to 2019. For each factor where data is available, provide:
> - Data source name and brief description
> - Direct download URL for the data file
> - File format (preferably CSV or other analysis-ready formats)
> - Any relevant notes about data quality or limitations

Gemini's answer:

> [!Response]
> **1. State Excise Tax**
> - **Data Source Name and Brief Description:** Campaign for Tobacco-Free Kids. This organization compiles historical data on state cigarette excise tax rates as part of its advocacy and research efforts to reduce tobacco use. The data is widely used by public health researchers.
> - **Direct Download URL for the Data File:** A comprehensive table of state cigarette tax rates from 1970 to the present is available in PDF format. [https://www.tobaccofreekids.org/assets/factsheets/0099.pdf](https://www.tobaccofreekids.org/assets/factsheets/0099.pdf)
> - **File Format:** PDF. The data is in a table that can be manually transcribed or programmatically extracted.
> - **Relevant Notes about Data Quality or Limitations:** This is a high-quality, well-maintained dataset that is considered an authoritative source for historical state cigarette tax rates. The data is presented clearly by state and year, making it straightforward to create a state-year panel dataset.
>
> **2. Federal Excise Tax**
> - **Data Source Name and Brief Description:** U.S. Alcohol and Tobacco Tax and Trade Bureau (TTB). The TTB is the federal agency responsible for setting and collecting excise taxes on tobacco products. Their website provides a historical record of federal tax rates.
> - **Direct Download URL for the Data File:** The historical tax rates are listed in a table on the TTB's website. [https://www.ttb.gov/tax-audit/historical-tax-rates](https://www.ttb.gov/tax-audit/historical-tax-rates)
> - **File Format:** HTML. The data is presented in a simple table that can be easily copied.
> - **Relevant Notes about Data Quality or Limitations:** The data from the TTB is the official and most accurate source for federal excise tax rates. It is important to note that this is a national-level variable and will be constant across all states for any given year. The rates are provided per thousand cigarettes, which will need to be converted to a per-pack rate (divide by 50 for a standard 20-cigarette pack).
>
> **3. MSA Payments (Tobacco Master Settlement Agreement)**
> - **Data Source Name and Brief Description:** National Association of Attorneys General (NAAG). The NAAG Center for Tobacco and Public Health coordinates the implementation of the Master Settlement Agreement (MSA) on behalf of the settling states. They provide information on the annual payments made by tobacco companies.
> - **Direct Download URL for the Data File:** NAAG provides summaries of MSA payments. While a single file for all years is not available, individual year summaries and total payment data can be found. [https://www.naag.org/our-work/naag-center-for-tobacco-and-public-health/the-master-settlement-agreement/msa-payment-information/](https://www.naag.org/our-work/naag-center-for-tobacco-and-public-health/the-master-settlement-agreement/msa-payment-information/)
> - **File Format:** PDF and HTML.
> - **Relevant Notes about Data Quality or Limitations:** The MSA was signed in 1998, so data is only available from that year forward. The exact payment amounts to each state can vary annually due to adjustments based on factors like inflation and tobacco shipment volumes. The NAAG website is the authoritative source for this information.
>
> **4. Crude Oil Prices**
> - **Data Source Name and Brief Description:** U.S. Energy Information Administration (EIA). The EIA is the primary source for energy information in the United States. They provide historical data on various energy prices, including crude oil.
> - **Direct Download URL for the Data File:** The EIA provides annual average spot prices for West Texas Intermediate (WTI) crude oil, a key benchmark. [https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RWTC&f=A](https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RWTC&f=A)
> - **File Format:** HTML table, with an option to download as a CSV file.
> - **Relevant Notes about Data Quality or Limitations:** This is a national-level variable and will not vary by state. The EIA data is highly reliable and widely used. The provided link directs to annual data, which is appropriate for a state-year level analysis.
>
> **5. Average Commercial Rent in a Locality**
> - **Data Source Name and Brief Description:** A comprehensive, publicly available dataset for average commercial rent by state from 1970 to 2019 does not exist. This data is typically proprietary and collected by commercial real estate firms (e.g., CBRE, CoStar).
> - **Direct Download URL for the Data File:** Not available.
> - **File Format:** Not applicable.
> - **Relevant Notes about Data Quality or Limitations:** Researchers often have to purchase this data or rely on proxies. The U.S. Bureau of Labor Statistics does produce a Producer Price Index for "Lessors of nonresidential buildings," but this is a national index and does not provide state-level price variations.
>
> **6. Labor Costs for Agricultural Workers**
> - **Data Source Name and Brief Description:** U.S. Department of Agriculture, National Agricultural Statistics Service (NASS). NASS conducts surveys on various aspects of agriculture, including the Farm Labor Survey, which provides data on wage rates for agricultural workers.
> - **Direct Download URL for the Data File:** The data can be accessed through the NASS Quick Stats database. Users can specify the commodity (tobacco), data item (wage rate), geographic level (state), and time period. [https://quickstats.nass.usda.gov/](https://quickstats.nass.usda.gov/)
> - **File Format:** CSV.
> - **Relevant Notes about Data Quality or Limitations:** Data will primarily be available for tobacco-growing states. Coverage for all 50 states and for the entire 1970-2019 period may be incomplete. Users need to carefully select the appropriate parameters in the Quick Stats database to obtain the desired information.
>
> **7. Cost of Packaging Materials (Paper and Plastic)**
> - **Data Source Name and Brief Description:** U.S. Bureau of Labor Statistics (BLS). The BLS produces the Producer Price Index (PPI), which measures the average change in selling prices received by domestic producers.
> - **Direct Download URL for the Data File:** The data can be accessed through the BLS data portal. Relevant series include those for pulp, paper products, and plastics. [https://data.bls.gov/timeseries/WPU09](https://data.bls.gov/timeseries/WPU09) (for Pulp, paper, and allied products)
> - **File Format:** Data can be downloaded as a text file or in other formats.
> - **Relevant Notes about Data Quality or Limitations:** This is a national-level dataset and will not vary by state. The PPI provides a reliable measure of price changes for these commodities over time. The provided link is for the main category; more specific product categories are also available.
>
> **8. Wholesale Price of Tobacco Leaf**
> - **Data Source Name and Brief Description:** U.S. Department of Agriculture, Economic Research Service (ERS). The ERS provides historical data and analysis on tobacco production and prices, often published in the "Tobacco Situation and Outlook Yearbook."
> - **Direct Download URL for the Data File:** The USDA ERS has an archive of these publications. A compiled dataset is not readily available, but the data can be extracted from the yearbooks. [https://usda.library.cornell.edu/concern/publications/k35694332](https://usda.library.cornell.edu/concern/publications/k35694332)
> - **File Format:** PDF.
> - **Relevant Notes about Data Quality or Limitations:** This data is available only for states that grow tobacco. The yearbooks contain detailed tables with prices by tobacco type and by state or region. The data is of high quality but requires manual extraction from the PDF documents.
>
> **9. Inflation Rate (Consumer Price Index - CPI)**
> - **Data Source Name and Brief Description:** U.S. Bureau of Labor Statistics (BLS). The BLS calculates the CPI, the most widely used measure of inflation.
> - **Direct Download URL for the Data File:** The BLS provides a database for accessing CPI data. [https://www.bls.gov/cpi/data.htm](https://www.bls.gov/cpi/data.htm)
> - **File Format:** Data can be downloaded in various text-based formats.
> - **Relevant Notes about Data Quality or Limitations:** The BLS does not produce a CPI for every state. Data is available for the U.S. as a whole, for four census regions, and for several large metropolitan statistical areas (MSAs). For a state-year analysis, researchers often use the regional CPI or the national CPI as a proxy for state-level inflation.
>
> **10. Health Insurance Surcharges for Retailers**
> - **Data Source Name and Brief Description:** A consistent, state-level dataset on small business health insurance premiums for the entire 1970-2019 period is not publicly available. The Kaiser Family Foundation (KFF) Employer Health Benefits Survey is a leading source, but its data begins in 1999.
> - **Direct Download URL for the Data File:** KFF provides reports and data from their annual surveys. [https://www.kff.org/health-costs/report/employer-health-benefits-survey/](https://www.google.com/search?q=https://www.kff.org/health-costs/report/employer-health-benefits-survey/)
> - **File Format:** PDF and some data visualizations with downloadable data.
> - **Relevant Notes about Data Quality or Limitations:** The KFF data is a reliable source for the period it covers (1999-present). Data is often presented by firm size and region rather than at the state level. For years prior to 1999, consistent state-level data is exceptionally scarce. The U.S. Census Bureau's American Community Survey has state-level health insurance coverage data, but this is more recent and does not detail premium costs for businesses.

Gemini now includes this note regarding federal tax as a candidate IV: *"It is important to note that this is a national-level variable and will be constant across all states for any given year."* We could use a follow-up prompt to filter out variables from the answer that lack the expected variation in the actual data, or we could combine these prompts. However, our experience with LLMs on these problems favors a step-by-step approach over a single prompt. Also, seeing a full list of related variables gives us ideas about our solution to the problem, even if we don't end up using a listed variable as an IV in our models.

Overall, the results are well-structured, and the sources appear accurate, with direct download links provided (the last source was an exception, provided as a Google search query). As a sensitivity test, we also replicated both search templates (for candidate IV search and data search) for three other problems in different domains and found plausibly consistent and useful results.

[^1]: Han, S. (2024). Mining causality: AI-assisted search for instrumental variables. arXiv preprint arXiv:2409.14202.

> [!info]- Last updated: June 11, 2025
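
The fixed-effects concern raised in this chapter (a national series such as the federal tax is absorbed by year fixed effects) can be checked numerically before investing in data collection. The following is a minimal sketch in Python, assuming a balanced state-year panel; the function names are hypothetical and the numbers are made up for illustration, not actual tax rates. It removes state and year means from each candidate and reports how much variation is left.

```python
from statistics import mean


def two_way_demean(panel):
    """Remove state and year means from a balanced panel.

    `panel` maps (state, year) -> value. For a balanced panel this equals
    the residual from regressing the variable on state and year dummies.
    """
    states = sorted({s for s, _ in panel})
    years = sorted({t for _, t in panel})
    grand = mean(panel.values())
    state_mean = {s: mean(panel[(s, t)] for t in years) for s in states}
    year_mean = {t: mean(panel[(s, t)] for s in states) for t in years}
    return {
        (s, t): panel[(s, t)] - state_mean[s] - year_mean[t] + grand
        for s in states
        for t in years
    }


def residual_variation(panel):
    """Sum of squared demeaned values; (near) zero means fully absorbed."""
    return sum(v * v for v in two_way_demean(panel).values())


# Illustrative, made-up values (NOT real tax rates).
states = ["A", "B", "C"]
years = [2000, 2001, 2002]
national_rate = {2000: 0.34, 2001: 0.34, 2002: 0.39}

# A national series: identical across states within each year.
federal_tax = {(s, t): national_rate[t] for s in states for t in years}

# A state series: changes at different times in different states.
state_tax = {
    ("A", 2000): 0.10, ("A", 2001): 0.10, ("A", 2002): 0.50,
    ("B", 2000): 0.20, ("B", 2001): 0.60, ("B", 2002): 0.60,
    ("C", 2000): 0.30, ("C", 2001): 0.30, ("C", 2002): 0.30,
}

print("federal:", residual_variation(federal_tax))  # zero up to float rounding
print("state:  ", residual_variation(state_tax))    # clearly positive
```

A candidate whose residual variation is essentially zero cannot serve as an instrument in a model with state and year fixed effects, whatever its theoretical appeal. An unbalanced panel would need a regression-based version of the same check.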