```r
library(tidyverse)
library(knitr)
library(kableExtra)
library(fixest)
library(ggfixest)
library(bacondecomp)
library(Manu)
```
This lesson draws heavily from Scott Cunningham's *Causal Inference: The Mixtape* (Chapter 9) and Nick Huntington-Klein's *The Effect: An Introduction to Research Design and Causality* (Chapter 18).
In 2022, there were 48,202 gun-related deaths in the United States, with Black individuals more than 2.5 times as likely to die from gun violence, according to CDC data. Seventeen-year-old Trayvon Martin was one such victim, a young Black man who was killed by an attacker in a gated community in Florida. In 2005, Florida passed what has become known as a 'Stand Your Ground' law, removing the duty to retreat from an attacker before using deadly force, and removing civil liability for any death or injury inflicted. Between 2000 and 2010, 21 states passed similar 'Stand Your Ground' laws. These laws are based on the Castle Doctrine, which specifically removes the duty to retreat from a home intruder when acting in purported self-defense. They effectively expand the "castle doctrine" to public spaces, explains the Brady Campaign, allowing for a person to legally shoot and kill another person for "legally unjustifiable reasons" under a "shoot first, ask questions later" model. Critics argue that these laws perpetuate racial disparities in gun violence that disproportionately harm Black Americans.
Cheng and Hoekstra (2013) evaluated the effects of these ‘Stand Your Ground’ laws on gun-related deaths using a difference-in-differences (DID) design. This particular DID design relies on what is known as a staggered treatment, or differential timing of the treatment across states. This assignment will walk you through some of the key steps of this empirical study, tasking you with writing R code to replicate the analysis and wrestling with the DID design and use of fixed effects. Part I focuses more on descriptive statistics and understanding the natural experiment. Part II focuses on the DID design and estimation.
Variable name | Description |
---|---|
state | state name |
sid | state ID |
year | year |
evertreated | = 1 if state ever expanded Stand Your Ground laws over this period, = 0 otherwise |
D | = 1 if expanded Stand Your Ground law was in effect, = 0 otherwise |
treatment_date | year a Stand Your Ground law was passed for each state, if any |
homicide_c | homicide count |
population | state population |
popwt | mean of state population (2000-2010) |
blackm_15_24 | % of black males aged 15-24 |
whitem_15_24 | % of white males aged 15-24 |
blackm_25_44 | % of black males aged 25-44 |
whitem_25_44 | % of white males aged 25-44 |
l_exp_subsidy | natural log of government spending per capita on subsidy |
l_exp_pubwelfare | natural log of government spending per capita on public welfare |
l_police | natural log of full-time equivalent police per 100,000 state population |
unemployrt | unemployment rate |
poverty | poverty rate |
l_income | natural log of state median income |
l_prisoner | natural log of incarceration rate per 100,000 state population |
l_lagprisoner | natural log of lagged incarceration rate per 100,000 state population |
Use the `weighted.mean()` function to report the mean homicide rate and ln(homicide rate) weighted by mean state population (`popwt`). Only report the weighted mean for these two variables; do not show long and clunky output with summary statistics for every variable in the data frame.

What was the population-weighted mean homicide rate for adopting states (`evertreated == 1`) and non-adopting states (`evertreated == 0`) in 2000? [Hint: try grouping by `evertreated` and then using `summarise` to obtain the population-weighted mean.]

Plot the two distributions; you may need to adjust the `binwidth` argument of the `geom_histogram()` layer. Based on visual inspection, which distribution has a shorter right-hand tail?

Use the `fill` or `color` aesthetics to indicate states that were ever treated. Some other things to consider: axis labels, legend or no legend and legend positioning/title/labels, sorting values, and color and aesthetics. Qualitatively describe the distribution you observe.

[Hint: reshape the data with `pivot_wider()`; then you can mutate a new column for \(\Delta\)hom_rate equal to the difference between the two columns containing end-year values. Here is a good tutorial on `pivot_wider()`.]

[Hint: try `geom_bar()` along with `coord_flip()`, as in the first example here.]

Let's start by focusing on the case of Florida, and compare the change in the log of the homicide rate over time in Florida to all of the never-treated states.
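As a sketch of the population-weighted summaries described above, here is a minimal example on a toy panel. The variable names follow the codebook, but the `toy` data frame and its values are made up for illustration; the real code would use the assignment data frame.

```r
library(dplyr)

# Toy stand-in for the assignment panel; values are illustrative only
toy <- tibble(
  state       = c("FL", "FL", "OH", "OH"),
  year        = c(2000, 2001, 2000, 2001),
  evertreated = c(1, 1, 0, 0),
  homicide_c  = c(50, 60, 20, 25),      # homicide counts
  population  = c(1e6, 1e6, 5e5, 5e5),
  popwt       = c(1e6, 1e6, 5e5, 5e5)   # mean state population weight
)

w_means <- toy %>%
  mutate(hom_rate   = homicide_c / population * 100000,  # per 100,000
         l_hom_rate = log(hom_rate)) %>%
  filter(year == 2000) %>%
  group_by(evertreated) %>%
  summarise(w_hom_rate   = weighted.mean(hom_rate, w = popwt),
            w_l_hom_rate = weighted.mean(l_hom_rate, w = popwt))
w_means
```

The same `group_by()` + `summarise(weighted.mean(...))` pattern extends directly to the other weighted summaries in this assignment.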
Use the `head()` function to print the first few observations of the new collapsed data frame. [Hint: use `group_by()` along with `summarise()` and `weighted.mean()` as a helper function.]

[Hint: plotting with `geom_line()` is something you did in Assignment 5. Additionally, you can use `geom_vline()` to add a vertical line indicating the year before the treatment first took effect in Florida. You will also need to convert `evertreated` to a factor with appropriate labels for each 'treatment' group, Florida and never-treated states.]

The plot you just created is the basis for an event study approach (period by period) to a familiar difference-in-differences setup with a treatment and control group that are fixed over time. Recall the Card and Krueger (1994) setup with two groups (a treatment, New Jersey, and a control, Pennsylvania) and two time periods (before and after the minimum wage increase in New Jersey). We can think of the Card and Krueger setup as a simple 2x2 DID, with 2 groups and 2 periods. Data permitting, an event study approach extends the analysis to look period-by-period both further ahead of and behind the treatment event, as you have shown here.
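The collapse-then-plot steps described above can be sketched as follows on a toy panel (variable names follow the codebook; the values and the single control state are made up for illustration):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the assignment panel; values are illustrative only
toy <- tibble(
  state       = rep(c("Florida", "Ohio"), each = 4),
  evertreated = rep(c(1, 0), each = 4),
  year        = rep(2003:2006, 2),
  l_hom_rate  = c(1.60, 1.62, 1.75, 1.80, 1.50, 1.51, 1.52, 1.53),
  popwt       = rep(c(17e6, 11e6), each = 4)
)

# Collapse to population-weighted means by group and year
collapsed <- toy %>%
  group_by(evertreated, year) %>%
  summarise(l_hom_rate = weighted.mean(l_hom_rate, w = popwt),
            .groups = "drop") %>%
  mutate(group = factor(evertreated, levels = c(1, 0),
                        labels = c("Florida", "Never treated")))
head(collapsed)

# Line plot with a vertical line at the year before Florida's law took effect
p <- ggplot(collapsed, aes(year, l_hom_rate, color = group)) +
  geom_line() +
  geom_vline(xintercept = 2004, linetype = "dashed") +
  labs(x = "Year", y = "ln(homicide rate)", color = NULL)
p
```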
Remember the plot above only looks at the treatment for Florida and excludes all states that were treated in different years (i.e. restricts the comparison to a single treatment, not the staggered treatment across states). We can also calculate the average log homicide rate for Florida and never treated states over the entire pre-treatment period. Same for the treatment period. That gets us a familiar 2x2 DID which compares the pre-post change for Florida to the same change for all never treated states:
\[\begin{equation} \hat{\delta}^{DID(FL)} = (\bar{y}^{post, FL} - \bar{y}^{pre, FL}) - (\bar{y}^{post, never} - \bar{y}^{pre, never}) \end{equation}\]
where FL refers to Florida and never refers to all never-treated states.
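As a quick numeric illustration of this formula, here is the 2x2 computed by hand. The four means are made-up placeholders, not estimates from the assignment data:

```r
# Illustrative group-by-period means of ln(homicide rate)
ybar_post_FL    <- 1.80   # Florida, post-treatment
ybar_pre_FL     <- 1.60   # Florida, pre-treatment
ybar_post_never <- 1.50   # never-treated states, post
ybar_pre_never  <- 1.45   # never-treated states, pre

# DID: Florida's pre-post change minus the never-treated pre-post change
did_FL <- (ybar_post_FL - ybar_pre_FL) - (ybar_post_never - ybar_pre_never)
did_FL  # 0.20 - 0.05 = 0.15
```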
Recall the parallel trends assumption required for the DID estimator to yield an unbiased estimate of the causal effect of the treatment: in the absence of the treatment, average outcomes for the treatment and comparison groups would have changed in parallel over time. Recall why this assumption is necessary: we are using the difference in outcomes over time for the control group (\(\bar{y}^{post, never} - \bar{y}^{pre, never}\)) as an estimate of the counterfactual change in outcomes over time for treatment group, Florida. Treatment and control groups can have different characteristics that lead to different outcome levels, but parallel trends requires the change in outcomes to evolve in a parallel fashion.
Parallel trends during the treatment period can never be observed. It’s a statement about the counterfactual that would have occurred in the absence of the treatment: the change in the log homicide rate for Florida had it not passed a ‘Stand Your Ground’ law. We never know this with certainty, but the event study picture is an ideal place to begin to assess whether or not parallel trends holds over the pre-treatment period (provided that you observe outcomes in multiple pre-treatment periods). If trends for the two groups are parallel prior to the treatment, that is a good sign that the assumption is plausible. Researchers still need to think hard about whether there are any unobserved factors that may have changed over time along with the treatment, such as other confounding policy changes.
[Hint: use `group_by()` along with `summarise()` and `weighted.mean()` as a helper function. For the plot you can use `geom_vline()` to drop the event line at 2005 to indicate the year before the law took effect.]

In Part II of this assignment we will review two-way fixed effects (TWFE) estimation, what it means in a staggered rollout design, and examine fundamental problems with this approach.
In Part I, we saw how the staggered rollout of 'Stand Your Ground' laws provides a series of natural experiments that lend themselves to the DID design. For example, how did the log of the homicide rate evolve in Florida before and after the law was passed compared to never-treated states over that time period? We could also think of another DID design for Michigan and the other 10 states that passed similar laws in 2006.
How should we go about estimating an average treatment effect for all 21 treated states? For years, econometricians relied on models that include both entity (e.g. state) fixed effects and time (e.g. year) fixed effects. These models are often referred to as two-way fixed effects (TWFE) models. Cheng and Hoekstra estimate TWFE models similar to the following:
\[ Y_{it} = {\delta}D_{it} + \sigma_{i} + \tau_{t} + \epsilon_{it} \qquad \qquad \qquad \qquad \qquad \qquad\hbox{(1)} \]
where subscripts i and t indicate state i and year t, respectively; \(D_{it}\) is the treatment variable, equal to the proportion of year t in which state i has a Stand Your Ground law in effect; \({\delta}\) is the OLS DID estimator; \(\sigma_{i}\) controls for state fixed effects; \(\tau_{t}\) controls for year fixed effects; and \(\epsilon_{it}\) is the idiosyncratic error term. Also note that for states where the law took effect in the middle of the year, Cheng and Hoekstra (2013) set the treatment variable equal to the proportion of the year during which the law was in effect, but below we will use a binary variable for simplicity. We will also estimate unweighted models for simplicity, but results are similar when weighting observations by mean state population.
Note that unlike in the simple 2x2 DID design we learned in Quant II, the treatment variable is not denoted by the form \(post*treated\), but rather by \(D_{it}\). This notation is a result of having more than two periods, and a treatment period that differs for each state according to the year they adopted the law.
Before we estimate the above model, let's first recall the guidance on clustered standard errors. With panel data where treatment status varies across entities and time, it is generally advised to estimate standard errors that are clustered at the entity level (here, state), provided there are a sufficiently large number of clusters (ideally 50 or more). Recall that clustering at the state level means adjusting standard error calculations to take into account the fact that observations within the same cluster (state) may have correlated errors. Clustered SEs account for correlation in the error term between observations in the same cluster, but assume errors are uncorrelated across clusters. For this course, all you need to know is that it's best practice to estimate clustered SEs when estimating a TWFE model, provided that you observe enough clusters in the sample.
Estimate equation (1) using `feols()` in the `fixest` package, with standard errors clustered by state. Report the coefficient of interest and the standard error.

Cheng and Hoekstra (2013) note that their estimated effect amounts to around 600 additional homicides per year in the 21 adopting states. That is a lot of additional lives lost.
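A minimal sketch of the `feols()` call for equation (1), estimated here on a toy staggered panel so the block is self-contained. The real call would use the assignment data frame with its log homicide rate, `D`, `state`, and `year` variables; the toy data and the 0.09 effect are made up.

```r
library(fixest)

# Toy staggered panel: states a and b adopt in 2005 and 2007; c and d never do
toy <- expand.grid(state = letters[1:4], year = 2000:2010)
adopt <- c(a = 2005, b = 2007, c = NA, d = NA)[as.character(toy$state)]
toy$D <- as.integer(!is.na(adopt) & toy$year >= adopt)
# Outcome built exactly from state effects, a time trend, and a constant effect
toy$y <- 0.09 * toy$D + as.numeric(factor(toy$state)) + 0.01 * (toy$year - 2000)

# TWFE: state and year fixed effects after the |, SEs clustered by state
twfe <- feols(y ~ D | state + year, data = toy, cluster = ~state)
coef(twfe)["D"]  # recovers the constant effect of 0.09 in this toy
```

With a homogeneous, constant treatment effect like this, TWFE behaves well; the later sections show where it breaks down.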
The strongest DID designs show parallel pre-trends and argue they are consistent with parallel trends in outcomes for the treatment group in the absence of the treatment and the comparison group. But they don’t stop there. They also consider various robustness checks, including falsification or placebo tests. To address concerns that adopting Stand Your Ground laws may have been spuriously correlated with unobserved factors driving general crime trends, Cheng and Hoekstra (2013) also show that adopting the laws is not associated with meaningful and significant changes in other types of crime. This approach is one of several that the authors take to rule out the possibility of omitted variables driving their results and make the case that their estimates are internally valid.
The economist Scott Cunningham contextualized their findings:
> Thinking back to the killing of Trayvon Martin by George Zimmerman, one is left to wonder whether Trayvon might still be alive had Florida not passed Stand Your Ground. This kind of counterfactual reasoning can drive you crazy, because it is unanswerable—we simply don't know, cannot know, and never will know the answer to counterfactual questions. The fundamental problem of causal inference states that we need to know what would have happened that fateful night without Stand Your Ground and compare that with what happened with Stand Your Ground to know what can and cannot be placed at the feet of that law. What we do know is that under certain assumptions related to the DD design, homicides were on net around 8%–10% higher than they would've been when compared against explicit counterfactuals. And while that doesn't answer every question, it suggests that a nontrivial number of deaths can be blamed on laws similar to Stand Your Ground.
We have since learned that using TWFE to implement a DID design with a staggered treatment is not quite so simple. To illustrate these issues, let’s begin by estimating the event study TWFE model, which separates out the estimated average treatment effect into year-by-year effects, rather than estimating a single average treatment effect.
Next let’s estimate the following event study specification:
\[ln(homrate)_{st} = evertreated_s \left [ \sum_{j=-6}^{-2} \beta_j^{pre} 1\{t - t_s^* = j\} + \sum_{j=0}^{5} \beta_j^{post} 1\{t - t_s^* = j\}\right ] + \mu_s + \tau_t + \varepsilon_{st}\] Here, \(evertreated_s\) is a binary variable indicating states that adopted a Stand Your Ground law at any point during this time frame, and \(t_s^*\) is the year of adoption in state \(s\). The \(1\{t - t_s^* = j\}\) terms are dummy variables corresponding to an event year, i.e., the year relative to adoption at time \(t_s^*\). The coefficients of interest are now \(\beta^{pre}_j\) and \(\beta^{post}_j\), which measure the relationship between our outcome of interest and treatment status in each of the 5 years leading up to adoption and 6 years after. Note that the dummy variable for the year before adoption is omitted (\(j = -1\)), so that the estimates of \(\beta^{pre}_j\) and \(\beta^{post}_j\) capture effects relative to just before adoption.
Recall that in Part 1 of the Data Assignment it wasn’t so obvious what the appropriate comparison group for treated states should be. In this case, there is no time relative to adoption for states that never adopted a Stand Your Ground law at all! Recall that we previously encoded relative time for these states to be \(j = -1\). That’s a bit odd, but effectively allows us to make comparisons with never-treated states. We’ll come back to this point.
Why are we interested in the \(\beta^{pre}_j\) parameters? These coefficients are falsification tests that show us how the log of the homicide rate was trending prior to adoption. Their pattern and statistical significance are a test of the key identifying assumption of parallel trends; statistically significant estimates during the pre-treatment period would be inconsistent with the parallel trends assumption. The \(\beta^{post}_j\) parameters represent the causal effect of adopting a Stand Your Ground law for each relative period \(j\).
`fixest` includes an interaction helper function `i()`, which we'll use to generate the event study interaction terms. As described above, omit the event year immediately before adoption, which allows us to estimate results relative to that year. Fit the model and produce a standard event study plot, which shows \(\hat \beta^{pre}_j\) and \(\hat \beta^{post}_j\) in relation to relative time \(j\).

Let's walk through what this plot is showing us. First, recall the estimated average treatment effect, \(\hat{\delta}\), from equation 1. Instead of estimating a single average treatment effect over the entire treatment window relative to the entire pre-treatment window, the event study specification estimates a separate treatment effect for every period relative to a reference period one year before the law takes effect. Florida was treated in 2005, so the reference period for Florida is 2004, relative time period \(j = -1\). For Florida, 2005 is the initial treatment year, \(j = 0\); one year after the treatment, 2006, is denoted by \(j = 1\), and so forth. So the treatment window for Florida, \(j = \{0, 1, 2, 3, 4, 5\}\), corresponds to 2005 through 2010. And Florida's pre-treatment window extends from 2000 through 2004, corresponding to \(j = \{-5, -4, -3, -2, -1\}\). The state of Georgia didn't adopt its Stand Your Ground law until 2006, so 2005 is the reference period for Georgia, or \(j = -1\), and so forth. Now imagine doing that for every treated state and stacking every state's natural experiment around the relative event reference period \(j = -1\). This is how a TWFE event study with staggered timing relies on time relative to the treatment.
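A sketch of the event-study specification using `i()`, again on a self-contained toy panel. The `event_time` and `evertreated` names are illustrative; never-treated units are coded to the omitted reference period \(j = -1\), as discussed above, and the effect size and noise are made up.

```r
library(fixest)

# Toy staggered panel: states a and b adopt in years 4 and 5; c and d never do
toy <- expand.grid(state = letters[1:4], year = 1:8)
adopt <- c(a = 4, b = 5, c = NA, d = NA)[as.character(toy$state)]
toy$evertreated <- as.integer(!is.na(adopt))
# Relative time; never-treated states get the omitted reference value (-1)
toy$event_time <- ifelse(toy$evertreated == 1, toy$year - adopt, -1)
toy$D <- as.integer(toy$evertreated == 1 & toy$year >= adopt)
set.seed(1)
toy$y <- 0.1 * toy$D + as.numeric(factor(toy$state)) + 0.02 * toy$year +
  rnorm(nrow(toy), sd = 0.01)

# i(event_time, evertreated, ref = -1): event-year dummies interacted with
# evertreated, omitting the year before adoption
es <- feols(y ~ i(event_time, evertreated, ref = -1) | state + year,
            data = toy, cluster = ~state)
iplot(es)  # standard event study plot of the pre/post coefficients
```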
The above event study plot helps us to assess the plausibility of parallel trends. We can never confirm with absolute certainty that the parallel trends assumption holds, as it is inherently unobservable: the assumption says the counterfactual path of treated states during the treatment period is captured by comparison states, and we never observe these counterfactual trends! What the above event study plot lets us do is test whether prior trends are parallel, that is, whether treated and untreated states were trending similarly before they were treated. Indeed, we observe that all of the estimated coefficients prior to the treatment, prior to \(j = -1\), are not statistically significant: every one of the plotted 95% confidence intervals (using SEs clustered at the state level) contains zero. Recall how we interpret each \(\beta^{pre}_j\) estimate: as the difference in the log of the homicide rate between each pre-treatment period and the reference period for treated states, compared to the same difference for the comparison group of states. It turns out that there are additional limitations to relying on the familiar hypothesis test for statistically significant differences. They are beyond the scope of this course, but if you're interested you can read Section 4.4 of Roth et al. (2023).
The fact that the plot appears consistent with parallel pre-trends is a good sign for the parallel trends assumption; the remaining question is whether anything else coinciding with the treatment period may have affected adopting states differently from non-adopting states.
The event study plot also allows us to examine the dynamics of the treatment effect, which fluctuates between 0.07 and 0.13 throughout the treatment window, corresponding to an increase in the homicide rate of approximately 7 to 13 percent in each treatment period.
It turns out that TWFE estimation generally works quite well when there is a single treatment year shared by all states. But TWFE estimation breaks down when the treatment is staggered over time in different states. In recent years, econometricians have come to understand problems that arise from staggered treatments and can bias estimated treatment effects, so much so that the sign of the estimated effect can flip in some cases!
It turns out that in a staggered treatment setting, TWFE estimation can’t handle heterogeneous treatment effects within groups over time. It also relies on a comparison group that probably doesn’t make much sense. Let’s try to understand what the problems are.
Let’s start by thinking more deeply about fixed effects and what they do. A regression model with fixed effects is often called the “within-estimator”. We learned that entity (or unit) fixed effects control for time invariant factors that differ between entities. Let’s focus on the example of state FEs in a state-time panel data set, such as with Cheng and Hoekstra (2013). A state FE term removes (and thus effectively controls for) the mean difference between that state and a reference state, so that the only variation being used to estimate the effect of the treatment comes from within each state. In other words, state FEs identify the effect of a particular state going from untreated to treated over time, rather than a comparison between treated and untreated states within the same time period.
The math is a bit tricky, but with a TWFE model like in equation 1 and staggered treatment timing, states that have previously been treated get used as a part of the comparison group along with states that have yet to be treated. In other words, the comparison group is composed of already treated, yet to be treated, and never treated states.
So why is that a problem? If the actual treatment effect is dynamic (i.e. it changes over time rather than a one-time shift), or if the treatment effect differs between groups, then we no longer have parallel trends with the TWFE setup. Let's think about the example of a linear treatment effect, increasing over time, that is the same for all treated states that are treated in different periods. In this case the "treated comparison group" would be trending upwards over time in a way that the "just-now treated group" isn't. This is a violation of parallel trends! It biases the TWFE DID estimator, \({\hat{\delta}}\) from equation 1; in some cases the bias is large enough that you can get a negative DID estimate when the true effect is actually positive. Sign-flipping bias is very bad for informing public policy!
Ok so what can we do about this? A number of econometricians have derived estimators that address this problem. For now let’s make sure we understand that the estimated treatment effect from a TWFE model with staggered treatment timing can be decomposed into a series of 2x2 DIDs. Note that there are 5 different treatment groups in the sense that there were 5 different treatment years, 2005 through 2009. How many distinct 2x2 pre-post comparisons are there?
It turns out that the average treatment effect from the TWFE model of equation 1 can be written as a weighted average of all of the 25 2x2 comparisons listed above, as Goodman-Bacon (2021) shows.
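The count of 25 can be verified with a quick bit of arithmetic (a sketch of the counting logic only):

```r
# 5 treatment cohorts (2005-2009): each compares to the never-treated group
# (5 comparisons), and every ordered pair of cohorts yields a 2x2 in which
# one cohort serves as the other's control (5 * 4 = 20 comparisons)
n_cohorts <- 5
n_comparisons <- n_cohorts + n_cohorts * (n_cohorts - 1)
n_comparisons  # 5 + 20 = 25
```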
Implement the Goodman-Bacon decomposition using the `bacondecomp` package. The `bacon()` function is quite straightforward; here is a vignette with sample syntax. Follow the steps in this vignette to generate and report the decomposition. Assign the results of the decomposition to an object that you will refer back to. Explain what the output shows. [Hint: you can add a horizontal line to the plot with `geom_hline()`.]

Now that we understand what TWFE models do tell us, how should we approach estimation in a staggered rollout design?
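A minimal sketch of the decomposition on a toy staggered panel (the variable names and values are illustrative; the assignment would pass its own data frame, outcome, and treatment variable to `bacon()`):

```r
library(bacondecomp)

# Balanced toy panel: two adoption cohorts (years 4 and 6) plus never-treated
toy <- expand.grid(id = 1:6, year = 1:8)
adopt <- c(4, 4, 6, 6, NA, NA)[toy$id]
toy$D <- as.integer(!is.na(adopt) & toy$year >= adopt)
# Outcome built exactly from unit effects, a time trend, and a constant effect
toy$y <- 0.1 * toy$D + 0.5 * toy$id + 0.02 * toy$year

# One row per 2x2 comparison, with its DID estimate and weight
df_bacon <- bacon(y ~ D, data = toy, id_var = "id", time_var = "year")
df_bacon

# The weighted average of the 2x2 estimates reproduces the TWFE coefficient
sum(df_bacon$estimate * df_bacon$weight)
```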
It turns out that with a staggered rollout, TWFE models can yield biased treatment effect estimates under certain conditions. The Goodman-Bacon decomposition demonstrates how the TWFE estimate is the weighted average of the effect from the different types of comparisons. Some of these comparisons may be ‘good’ for the purposes of causal inference: (1) treated units to never treated units (good!); and (2) earlier treated units to not-yet treated units (good provided there are no anticipation effects). When the time profile of treatment effects is heterogeneous, the third comparison, later treated units with earlier treated units, is ‘bad’!
The problem is that the earlier treated group is not a good control group for the later treated group: earlier treated units are on a different trajectory because they have been exposed to the treatment for longer. The takeaway is that we don't want to obtain estimates that are driven in part by this third type of comparison. Relatedly, some of the weights obtained from the Goodman-Bacon decomposition can in fact be negative! The consequences of this can be drastic, in some cases yielding negative treatment effects when the true effect is actually positive for all units.
There are a number of papers that have explored this problem and advanced solutions… along with their own R packages! We'll stop here, but you can implement alternative estimators from packages such as `did` (Callaway and Sant'Anna 2021), `did_multiplegt` (de Chaisemartin and D'Haultfœuille 2020, 2021), and `fixest::sunab` (Sun and Abraham 2020), among others. Here is a good resource on DID solutions using R, as well as Python and Stata.
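As a taste of one of these alternatives, here is a sketch of the Sun and Abraham interaction-weighted estimator via `fixest::sunab(cohort, period)`, on a self-contained toy panel. Never-treated units get a cohort value outside the sample window so they serve as controls; all names and values are illustrative.

```r
library(fixest)

# Toy staggered panel: cohorts adopting in years 4 and 6, plus never-treated
# units coded with an out-of-sample cohort value (1000)
toy <- expand.grid(state = letters[1:6], year = 1:8)
toy$cohort <- c(a = 4, b = 4, c = 6, d = 6,
                e = 1000, f = 1000)[as.character(toy$state)]
toy$D <- as.integer(toy$year >= toy$cohort)
toy$y <- 0.1 * toy$D + as.numeric(factor(toy$state)) + 0.02 * toy$year

# sunab() builds cohort-by-relative-period interactions and aggregates them
# with weights that avoid the 'bad' earlier-vs-later comparisons
res_sa <- feols(y ~ sunab(cohort, year) | state + year,
                data = toy, cluster = ~state)
summary(res_sa, agg = "att")  # aggregate average treatment effect on treated
```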