Load packages:
library(tidyverse)
library(gapminder)
library(panelr)
library(fixest)
library(lmtest)
library(multiwayvcov)
library(broom)
library(sandwich)
Resources used to create this lesson:
Panel data–also referred to as longitudinal data–refers to information obtained from observing the same set of entities at multiple points in time.
Some examples:
When observations span multiple time periods, the critical distinction between panel data and pooled cross-sectional data is that in the former the same entities are observed repeatedly over time, while in the latter different entities are observed in each time period.
Comparisons between entities in the same period are often susceptible to omitted variable bias. Say we’re interested in evaluating the impact of national mask mandates on pandemic outcomes. Countries with national mask mandates may also be implementing additional public health measures, face different economic conditions, and may be responding to being harder hit by the pandemic.
Thus comparing COVID cases per capita between countries with/without national mask mandates in a given month amounts to comparing countries that are very different in terms of other factors driving pandemic outcomes; in other words, comparisons between “treated” and “untreated” countries within a cross-section (e.g. a given month) may not be very informative, as any association between the mask-mandate “treatment” and outcomes may in large part be explained by these other factors.
In this example, exploiting cross-sectional variation (between countries within one period) to identify the impacts of a mask-mandate is very prone to bias (omitted variable bias and reverse causality).
With panel data, one can exploit variation in the mask-mandate treatment over time within countries to identify the impacts of a mask-mandate. More specifically, in this example fixed effects can be used to control for two broad categories of potentially confounding factors: (1) fixed (time-invariant) country-specific differences; and (2) time-varying differences shared by all countries.
Note how salient the major limitation of fixed effects models is in this example: any factors that within-countries over time cannot be accounted for with fixed effects. Said another way, you can’t have country-month fixed effects, because that would attempt to control for the policy variable under consideration; it would introduce perfect multicollinearity between the treatment dummy variable and the corresponding country-year fixed effect, leading one of them to be dropped from the regression).
What’s an example of a threat to internal validity that fixed effects can’t account for this research question? If a particular country’s decision to implement a national mask mandate is made in tandem with other effective public health measures like free testing programs, then it is impossible to distinguish between the relative contributions of these two policies using fixed effects alone. Maybe you can start to explicitly control for these other policies and other confounding factors with the available data, but trying to explicitly control for all observable sources of OVB seems like a losing battle in this case.
Self-assessment question:
Given the hypothetical panel data structure implied above, can you think of additional research design features and/or data that might help researchers to distinguish the impacts of a mask-mandate from any national testing policies implemented at the same time?
Just because the underlying data is panel in nature doesn’t mean the input data is structured in a useful way.
Fixed effects regression analysis generally require long-form panel data: each observation represents a unique entity-time period pairing (e.g. country-month).
Using the familiar gapminder data, we can confirm that this data frame is long-form panel data with one observation for each surveyed country (or reporting unit like provinces!) every 5 years.
## # A tibble: 20 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## 11 Afghanistan Asia 2002 42.1 25268405 727.
## 12 Afghanistan Asia 2007 43.8 31889923 975.
## 13 Albania Europe 1952 55.2 1282697 1601.
## 14 Albania Europe 1957 59.3 1476505 1942.
## 15 Albania Europe 1962 64.8 1728137 2313.
## 16 Albania Europe 1967 66.2 1984060 2760.
## 17 Albania Europe 1972 67.7 2263554 3313.
## 18 Albania Europe 1977 68.9 2509048 3533.
## 19 Albania Europe 1982 70.4 2780097 3631.
## 20 Albania Europe 1987 72 3075321 3739.
Alternatively, panel data can be structured in wide form, where the unit of observation in the data frame is the entity (here: country), and the observed values for each entity across different time periods appear as separate columns of data (e.g. lifeExp1952, lifeExp1957,…, lifeExp2007).
One advantage to wide-form panel data is that it is generally easier to compute some country-level statistics; e.g. the change in life expectancy from 1952 to 2007 for each country can be easily computed as the difference between the lifeExp1952 and lifeExp2007 columns.
So if wanted to get the gapminder data frame in wide form to obtain entity-level statics, how can we go about that?
The tidyr
package include a number of useful functions
for restructuring data frames, including pivot_longer
and
pivot_wider
that we used in Week 6 (Women in STEM).
The panelr
package also has its own functions for
reshaping panel data and estimating panel data models, which we’ll use
here. First call the panel_data()
function to tell R the
nature of the panel data frame: use the id
argument to
specify the entity variable, and the wave
argument to
specify the time variable.
From there one can use the long_panel()
and
widen_panel()
functions to move between long- and wide-form
data as needed, as in the following code example.
#use the panel_data function to specify panel nature
gap <- panel_data(gapminder, id = country, wave = year)
#reshape gap from long to wide form data
gap_wide <- widen_panel(gap)
#confirm reshape and dimensions of wide data
dim(gap_wide)
## [1] 142 38
## # A tibble: 6 × 38
## country continent lifeExp_1952 pop_1952 gdpPercap_1952 lifeExp_1957 pop_1957
## <fct> <fct> <dbl> <int> <dbl> <dbl> <int>
## 1 Afghanis… Asia 28.8 8425333 779. 30.3 9240934
## 2 Albania Europe 55.2 1282697 1601. 59.3 1476505
## 3 Algeria Africa 43.1 9279525 2449. 45.7 10270856
## 4 Angola Africa 30.0 4232095 3521. 32.0 4561361
## 5 Argentina Americas 62.5 17876956 5911. 64.4 19610538
## 6 Australia Oceania 69.1 8691212 10040. 70.3 9712569
## # ℹ 31 more variables: gdpPercap_1957 <dbl>, lifeExp_1962 <dbl>,
## # pop_1962 <int>, gdpPercap_1962 <dbl>, lifeExp_1967 <dbl>, pop_1967 <int>,
## # gdpPercap_1967 <dbl>, lifeExp_1972 <dbl>, pop_1972 <int>,
## # gdpPercap_1972 <dbl>, lifeExp_1977 <dbl>, pop_1977 <int>,
## # gdpPercap_1977 <dbl>, lifeExp_1982 <dbl>, pop_1982 <int>,
## # gdpPercap_1982 <dbl>, lifeExp_1987 <dbl>, pop_1987 <int>,
## # gdpPercap_1987 <dbl>, lifeExp_1992 <dbl>, pop_1992 <int>, …
#reshape back to wide from long (creates a row for every year, we'll drop NA rows later)
gap_long_temp <- long_panel(gap_wide,
prefix = "_",
begin = 1952,
end = 2007,
id = "country") %>%
rename(year = wave)
head(gap_long_temp)
## # A tibble: 6 × 6
## country year continent lifeExp pop gdpPercap
## <fct> <dbl> <fct> <dbl> <int> <dbl>
## 1 Afghanistan 1952 Asia 28.8 8425333 779.
## 2 Afghanistan 1953 Asia NA NA NA
## 3 Afghanistan 1954 Asia NA NA NA
## 4 Afghanistan 1955 Asia NA NA NA
## 5 Afghanistan 1956 Asia NA NA NA
## 6 Afghanistan 1957 Asia 30.3 9240934 821.
#drop rows with NA values
gap_long <- gap_long_temp %>% na.omit(lifeExp)
#confirm reshape and dimensions of long data
dim(gap_long)
## [1] 1704 6
## # A tibble: 6 × 6
## country year continent lifeExp pop gdpPercap
## <fct> <dbl> <fct> <dbl> <int> <dbl>
## 1 Afghanistan 1952 Asia 28.8 8425333 779.
## 2 Afghanistan 1957 Asia 30.3 9240934 821.
## 3 Afghanistan 1962 Asia 32.0 10267083 853.
## 4 Afghanistan 1967 Asia 34.0 11537966 836.
## 5 Afghanistan 1972 Asia 36.1 13079460 740.
## 6 Afghanistan 1977 Asia 38.4 14880372 786.
The following is a visual representation of reshaping from long to wide form data and vice versa from Manny Gimond’s EDA Course.
Fixed effects models allow us to control for certain kinds of unobserved factors (unobserved = not observed in the data): (1) factors that vary between entities but are constant over time; and (2) factors that vary over time that are shared by all entities. We can account for these type of factors in a regression model without observing them in the data–otherwise omitted variables–by incorporating entity fixed effects and time fixed effects, respectively.
Also note that in a conventional panel data setting you can choose to include both entity and time fixed effects, one, or neither. The decision to incorporate fixed effects should be done carefully as a reasoned strategy for controlling for unobserved factors that could result in biased coefficient estimates of interest.
In addition to the fixed effects video lecture from Quant II, you can consult Chapter 10 of Introduction to Econometrics with R for a review of fixed effects regression theory along with R code. The following is an excerpt from that chapter that summarizes two equivalent approaches for using R to estimate fixed effects models: the Least Squares Dummy Variable model, and traditional fixed effects models.
Consider the panel regression model
\[Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + u_{it}\] where the \(Z_i\) are unobserved time-invariant heterogeneities across the entities \(i=1,\dots,n\). We aim to estimate \(\beta_1\), the effect on \(Y_i\) of a change in \(X_i\) holding constant \(Z_i\). Letting \(\alpha_i = \beta_0 + \beta_2 Z_i\) we obtain the model \[\begin{align} Y_{it} = \alpha_i + \beta_1 X_{it} + u_{it} \tag{10.1}. \end{align}\] Having individual specific intercepts \(\alpha_i\), \(i=1,\dots,n\), where each of these can be understood as the fixed effect of entity \(i\), this model is called the fixed effects model. The variation in the \(\alpha_i\), \(i=1,\dots,n\) comes from the \(Z_i\). (10.1) can be rewritten as a regression model containing \(n-1\) dummy regressors and a constant: \[\begin{align} Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it} \tag{10.2}. \end{align}\] Model (10.2) has \(n\) different intercepts — one for every entity. (10.1) and (10.2) are equivalent representations of the fixed effects model.
The fixed effects model can be generalized to contain more than just one determinant of \(Y\) that is correlated with \(X\) and changes over time.
If specified properly, estimating a fixed effects model (as in 10.1) and a Least Squares Dummy Variable (LSDV) model (as in 10.2) will yield identical estimates for \(\beta_1\). The LSDV model will also report estimates for every “fixed effect” coefficient, whereas the fixed effects model will sweep away these fixed effects rather than estimating them; this happens by residualizing (or demeaning) the underlying data, as outlined in the fixed effects video lecture from Quant II.
Also note that you need to think carefully about the underlying error structure when estimating standard errors for inference on \(\hat\beta_1\). Examples of R functions for clustered standard errors for the LSDV and fixed effects models can be found below in Sections 5.1 and 5.2.
The LSDV model can be estimated in R using the lm()
function. The code examples below use the gapminder panel data frame to
estimate the impact of GDP per capita on life expectancy with fixed
effects for country and year (wave).
#specify LSDV model
lsdv1 <- lm(lifeExp ~ gdpPercap + factor(country) + factor(year), data = gap_long)
#display R-squared and Adj. R-squared for full model
summary(lsdv1)$r.squared
## [1] 0.9355361
summary(lsdv1)$adj.r.squared
## [1] 0.9291729
## Estimate Std. Error t value Pr(>|t|)
## -7.801664e-05 1.843088e-05 -4.232930e+00 2.442012e-05
#robust SEs (here we're just reporting on the coefficient for gdpPercap)
coeftest(lsdv1, vcov = vcovHC(lsdv1, type = "HC1"))[2,]
## Estimate Std. Error t value Pr(>|t|)
## -7.801664e-05 1.618158e-05 -4.821325e+00 1.566027e-06
The results are identical to Stata when using the areg command (or regress, properly specified):
#clustered SEs by country (equivalent to areg in Stata)
#stats for coefficient of interest is the 2nd element in this object.
lsdv1_vcov <- cluster.vcov(lsdv1,
gap_long$country,
df_correction = T) #a small sample size adjustment
#just report stats for coefficient of interest (the second row from coeftest)
coeftest(lsdv1, lsdv1_vcov)[2,]
## Estimate Std. Error t value Pr(>|t|)
## -7.801664e-05 3.688988e-05 -2.114852e+00 3.460180e-02
Here are the identical clustered SEs from Stata using the areg command:
Also note that the estimated slope effect of GDP per capita on life expectancy is negative! (Try running without year fixed effects and see how the estimated slope effect changes.) In other words, once we control for time-invariant country-specific characteristics and year-specific factors shared by all countries, we estimate a negative association between GDP per capita and life expectancy – this seems contrary to our expectations!
feols()
feols()
in the fixest
package is another
way to estimate fixed effects models, and is particularly good at
handling lots of fixed effects terms really fast.
#specify model with FEs for country and year, and robust SEs
feols_robust <- feols(lifeExp ~ gdpPercap | factor(country) + factor(year),
data = gap_long,
vcov = "hetero")
#obtain results
summary(feols_robust)
## OLS estimation, Dep. Var.: lifeExp
## Observations: 1,704
## Fixed-effects: factor(country): 142, factor(year): 12
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## gdpPercap -7.8e-05 1.6e-05 -4.82132 1.566e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.27866 Adj. R2: 0.929173
## Within R2: 0.011428
#you can also use tidy in the broom package to obtain results
tidy(feols_robust) %>% filter(term == "gdpPercap")
## # A tibble: 1 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 gdpPercap -0.0000780 0.0000162 -4.82 0.00000157
#specify model with FEs for country and year, and cluster robust SEs by country
feols_ccluster <- feols(lifeExp ~ gdpPercap | factor(country) + factor(year),
data = gap_long,)
summary(feols_ccluster, cluster = ~ factor(country))
## OLS estimation, Dep. Var.: lifeExp
## Observations: 1,704
## Fixed-effects: factor(country): 142, factor(year): 12
## Standard-errors: Clustered (factor(country))
## Estimate Std. Error t value Pr(>|t|)
## gdpPercap -7.8e-05 3.5e-05 -2.20895 0.028792 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.27866 Adj. R2: 0.929173
## Within R2: 0.011428
Note that here the clustered SEs differ slightly from the
lm()
approach and Stata with areg, due to a
different sample size adjustment with feols()
.
Here is a helpful resource for obtaining robust and clustered SEs in R, with comparisons to Stata.
Here is a very short overview of the motivation for clustered standard errors in panel data settings along with some practical guidance, and here is a short video lecture on the topic from Quant II.
Some additional practical guidance to keep in mind: clustered standard errors generally lead to over-rejection of the null hypothesis when the number of clusters is small (<50). See Cameron and Miller (2015) for more details.
In the fixed effects models estimated above, life expectancy is regressed on GDP capita with fixed effects for country and year. How can we think about visualizing the results of this fixed effects model? Let’s start by writing out the population regression function (PRF),
\[lifeExp_{ct} = \beta_0 + \beta_1 gdpPercap_{ct} + \phi_c + \theta_t + u_{ct} \tag{1}\] where c indexes countries and t indexes years. For notational purposes we generally prefer the traditional FE model notation, since we often aren’t interested in the values of the fixed effect terms themselves (we just want to use them to control for certain types of omitted variables). But it can still be instructive to return to the LSDV framework to understand what we’re estimating.
The LSDV model reminds us that we are estimating a single slope effect for GDP per capita, but we are allowing for different intercepts for every country and year. So we could plot separate regression lines to illustrate how the linear relationship between GDP per capita and life expectancy differs for every single country in 2007, for example; all the regression lines would be parallel, only the intercepts would differ.
In many cases, including this one, that approach would result in way too many regression lines on the same plot – it would just look like a big mess to have separate regression lines for every country in 2007. Moreover, we’re probably not even interested in visualizing different intercepts between countries. Rather, the point is to control for these differences so they aren’t a source of omitted variable bias when estimating \(\beta_1\), the slope effect of GDP per capita on life expectancy.
Hence the more common FE notation seen in equation (1), rather than
the LSDV model. Estimation of equation 1 is implemented by using
residualized (or demeaned) versions
of GDP per capita and life expectancy. That’s what FE commands like
feols()
in R and areg in Stata do. First, they regress the
outcome (lifeExp) on dummy variables for every country and year, and
then obtain the residual for every observation. This residual represents
the variation in the outcome that is not explained by
time-invariant country-specific characteristics, or year-specific
factors shared by all countries. The regressor of interest (gdpPercap)
is residualized in a similar fashion. Then the FE coefficient estimate
for \(\beta_1\) can be obtained by
regressing residualized life expectancy on residualized GDP per
capita.
With this process in mind, a natural visualization of the FE coefficient estimate of interest is to plot the regression line over the scatterplot of residualized GDP per capita vs residualized life expectancy.
#obtain residualized lifeExp and gdpPercap (with country and year FEs)
lifeExp_r <- resid(lm(gap_long$lifeExp ~ factor(gap_long$country) + factor(gap_long$year)))
gdpPercap_r <- resid(lm(gap_long$gdpPercap ~ factor(gap_long$country) + factor(gap_long$year)))
#add residualized variables to gap_long as new columns (similar to cbind)
gap_long_fe <- data.frame(gap_long, lifeExp_r, gdpPercap_r)
#plot slope effect from FE model over residualized data
ggplot(gap_long_fe, aes(x = gdpPercap_r, y = lifeExp_r)) +
geom_point(alpha = 0.2) + #adjusts the transparency of the geom_point layer
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
ggtitle(label = "Plot of residualized GDP per capita vs residualized life expectancy",
subtitle = "*Residuals take into account country and year fixed effects")
When the data includes observations for entities over time and not just a single cross-section of data, we need to think carefully about what sort of descriptive statistics would be informative. More specifically, we usually don’t want to compute summary statistics by pooling all observations over time (e.g., a difference in groups means averaged across all times periods). Instead it may be informative to calculate summary statistics separately within groups, or within one or more time periods (e.g. a difference in group means within in each time period).
Time series plots for different entities (or groups of entities) may be an informative visualization. For example, one could compare time-series plots of key variables averaged over treated entities vs untreated entities, or between some meaningful groups. The following example shows times series plots of life expectancy for countries grouped by continent.
#transform data to get continent-year means (bc we observe too many countries to plot separately)
ggplot(data = gap_long, aes(x = year, y = lifeExp, color = continent)) +
stat_summary(fun = mean, geom = "line") +
ggtitle("Life expectancy over time by continent")
Also note that the above plot calculates the unweighted mean of
lifeExp across all countries for a given continent-year grouping. If we
wanted to report population-weighted means for each continent-year, we
would need to transform the data frame before passing it to
ggplot()
, not from within the ggplot()
function call. This is because the ggplot statistical transformation
stat_summary()
doesn’t accept weights as an argument to
allow for weighted mean calculations.