1 Introduction

Load packages:

library(tidyverse)
library(gapminder)
library(panelr)
library(fixest)
library(lmtest) 
library(multiwayvcov)
library(broom)
library(sandwich)

Resources used to create this lesson:

2 What is panel data?

Panel data–also referred to as longitudinal data–refers to information obtained from observing the same set of entities at multiple points in time.

Some examples:

  • A sample of youth observed every year to assess their educational achievement and subsequent employment outcomes
    • in this example the entities are individuals and the time periods are years
  • A sample of businesses surveyed every month since the beginning of the pandemic
    • in this example the entities are businesses and the time periods are months
  • Administrative units (e.g. provinces) report quarterly information on hate crimes
    • in this example the entities are provinces and the time periods are quarters

When observations span multiple time periods, the critical distinction between panel data and pooled cross-sectional data is that in the former the same entities are observed repeatedly over time, while in the latter different entities are observed in each time period.

3 Key advantage of panel data: exploiting within-entity variation

Comparisons between entities in the same period are often susceptible to omitted variable bias. Say we’re interested in evaluating the impact of national mask mandates on pandemic outcomes. Countries with national mask mandates may also be implementing additional public health measures, face different economic conditions, and may be responding to being harder hit by the pandemic.

Thus comparing COVID cases per capita between countries with/without national mask mandates in a given month amounts to comparing countries that are very different in terms of other factors driving pandemic outcomes; in other words, comparisons between “treated” and “untreated” countries within a cross-section (e.g. a given month) may not be very informative, as any association between the mask-mandate “treatment” and outcomes may in large part be explained by these other factors.

In this example, exploiting cross-sectional variation (between countries within one period) to identify the impacts of a mask-mandate is very prone to bias (omitted variable bias and reverse causality).

With panel data, one can exploit variation in the mask-mandate treatment over time within countries to identify the impacts of a mask-mandate. More specifically, in this example fixed effects can be used to control for two broad categories of potentially confounding factors: (1) fixed (time-invariant) country-specific differences; and (2) time-varying differences shared by all countries.

Note how salient the major limitation of fixed effects models is in this example: any factors that within-countries over time cannot be accounted for with fixed effects. Said another way, you can’t have country-month fixed effects, because that would attempt to control for the policy variable under consideration; it would introduce perfect multicollinearity between the treatment dummy variable and the corresponding country-year fixed effect, leading one of them to be dropped from the regression).

What’s an example of a threat to internal validity that fixed effects can’t account for this research question? If a particular country’s decision to implement a national mask mandate is made in tandem with other effective public health measures like free testing programs, then it is impossible to distinguish between the relative contributions of these two policies using fixed effects alone. Maybe you can start to explicitly control for these other policies and other confounding factors with the available data, but trying to explicitly control for all observable sources of OVB seems like a losing battle in this case.

Self-assessment question:

Given the hypothetical panel data structure implied above, can you think of additional research design features and/or data that might help researchers to distinguish the impacts of a mask-mandate from any national testing policies implemented at the same time?

4 Panel data structure

4.1 Long-form panel data

Just because the underlying data is panel in nature doesn’t mean the input data is structured in a useful way.

Fixed effects regression analysis generally require long-form panel data: each observation represents a unique entity-time period pairing (e.g. country-month).

Using the familiar gapminder data, we can confirm that this data frame is long-form panel data with one observation for each surveyed country (or reporting unit like provinces!) every 5 years.

head(gapminder, n = 20)
## # A tibble: 20 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.
## 13 Albania     Europe     1952    55.2  1282697     1601.
## 14 Albania     Europe     1957    59.3  1476505     1942.
## 15 Albania     Europe     1962    64.8  1728137     2313.
## 16 Albania     Europe     1967    66.2  1984060     2760.
## 17 Albania     Europe     1972    67.7  2263554     3313.
## 18 Albania     Europe     1977    68.9  2509048     3533.
## 19 Albania     Europe     1982    70.4  2780097     3631.
## 20 Albania     Europe     1987    72    3075321     3739.

4.2 Wide-form panel data

Alternatively, panel data can be structured in wide form, where the unit of observation in the data frame is the entity (here: country), and the observed values for each entity across different time periods appear as separate columns of data (e.g. lifeExp1952, lifeExp1957,…, lifeExp2007).

One advantage to wide-form panel data is that it is generally easier to compute some country-level statistics; e.g. the change in life expectancy from 1952 to 2007 for each country can be easily computed as the difference between the lifeExp1952 and lifeExp2007 columns.

4.3 Reshaping data

So if wanted to get the gapminder data frame in wide form to obtain entity-level statics, how can we go about that?

The tidyr package include a number of useful functions for restructuring data frames, including pivot_longer and pivot_wider that we used in Week 6 (Women in STEM).

The panelr package also has its own functions for reshaping panel data and estimating panel data models, which we’ll use here. First call the panel_data() function to tell R the nature of the panel data frame: use the id argument to specify the entity variable, and the wave argument to specify the time variable.

From there one can use the long_panel() and widen_panel() functions to move between long- and wide-form data as needed, as in the following code example.

#use the panel_data function to specify panel nature
gap <- panel_data(gapminder, id = country, wave = year)

#reshape gap from long to wide form data
gap_wide <- widen_panel(gap)

#confirm reshape and dimensions of wide data
dim(gap_wide)
## [1] 142  38
head(gap_wide)
## # A tibble: 6 × 38
##   country   continent lifeExp_1952 pop_1952 gdpPercap_1952 lifeExp_1957 pop_1957
##   <fct>     <fct>            <dbl>    <int>          <dbl>        <dbl>    <int>
## 1 Afghanis… Asia              28.8  8425333           779.         30.3  9240934
## 2 Albania   Europe            55.2  1282697          1601.         59.3  1476505
## 3 Algeria   Africa            43.1  9279525          2449.         45.7 10270856
## 4 Angola    Africa            30.0  4232095          3521.         32.0  4561361
## 5 Argentina Americas          62.5 17876956          5911.         64.4 19610538
## 6 Australia Oceania           69.1  8691212         10040.         70.3  9712569
## # ℹ 31 more variables: gdpPercap_1957 <dbl>, lifeExp_1962 <dbl>,
## #   pop_1962 <int>, gdpPercap_1962 <dbl>, lifeExp_1967 <dbl>, pop_1967 <int>,
## #   gdpPercap_1967 <dbl>, lifeExp_1972 <dbl>, pop_1972 <int>,
## #   gdpPercap_1972 <dbl>, lifeExp_1977 <dbl>, pop_1977 <int>,
## #   gdpPercap_1977 <dbl>, lifeExp_1982 <dbl>, pop_1982 <int>,
## #   gdpPercap_1982 <dbl>, lifeExp_1987 <dbl>, pop_1987 <int>,
## #   gdpPercap_1987 <dbl>, lifeExp_1992 <dbl>, pop_1992 <int>, …
#reshape back to wide from long (creates a row for every year, we'll drop NA rows later)
gap_long_temp <- long_panel(gap_wide, 
                            prefix = "_", 
                            begin = 1952, 
                            end = 2007,
                            id = "country") %>% 
  rename(year = wave)
head(gap_long_temp)
## # A tibble: 6 × 6
##   country      year continent lifeExp     pop gdpPercap
##   <fct>       <dbl> <fct>       <dbl>   <int>     <dbl>
## 1 Afghanistan  1952 Asia         28.8 8425333      779.
## 2 Afghanistan  1953 Asia         NA        NA       NA 
## 3 Afghanistan  1954 Asia         NA        NA       NA 
## 4 Afghanistan  1955 Asia         NA        NA       NA 
## 5 Afghanistan  1956 Asia         NA        NA       NA 
## 6 Afghanistan  1957 Asia         30.3 9240934      821.
#drop rows with NA values
gap_long <- gap_long_temp %>% na.omit(lifeExp)

#confirm reshape and dimensions of long data
dim(gap_long)
## [1] 1704    6
head(gap_long)
## # A tibble: 6 × 6
##   country      year continent lifeExp      pop gdpPercap
##   <fct>       <dbl> <fct>       <dbl>    <int>     <dbl>
## 1 Afghanistan  1952 Asia         28.8  8425333      779.
## 2 Afghanistan  1957 Asia         30.3  9240934      821.
## 3 Afghanistan  1962 Asia         32.0 10267083      853.
## 4 Afghanistan  1967 Asia         34.0 11537966      836.
## 5 Afghanistan  1972 Asia         36.1 13079460      740.
## 6 Afghanistan  1977 Asia         38.4 14880372      786.

The following is a visual representation of reshaping from long to wide form data and vice versa from Manny Gimond’s EDA Course.


5 Fixed effects estimation

Fixed effects models allow us to control for certain kinds of unobserved factors (unobserved = not observed in the data): (1) factors that vary between entities but are constant over time; and (2) factors that vary over time that are shared by all entities. We can account for these type of factors in a regression model without observing them in the data–otherwise omitted variables–by incorporating entity fixed effects and time fixed effects, respectively.

Also note that in a conventional panel data setting you can choose to include both entity and time fixed effects, one, or neither. The decision to incorporate fixed effects should be done carefully as a reasoned strategy for controlling for unobserved factors that could result in biased coefficient estimates of interest.

In addition to the fixed effects video lecture from Quant II, you can consult Chapter 10 of Introduction to Econometrics with R for a review of fixed effects regression theory along with R code. The following is an excerpt from that chapter that summarizes two equivalent approaches for using R to estimate fixed effects models: the Least Squares Dummy Variable model, and traditional fixed effects models.


Consider the panel regression model

\[Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + u_{it}\] where the \(Z_i\) are unobserved time-invariant heterogeneities across the entities \(i=1,\dots,n\). We aim to estimate \(\beta_1\), the effect on \(Y_i\) of a change in \(X_i\) holding constant \(Z_i\). Letting \(\alpha_i = \beta_0 + \beta_2 Z_i\) we obtain the model \[\begin{align} Y_{it} = \alpha_i + \beta_1 X_{it} + u_{it} \tag{10.1}. \end{align}\] Having individual specific intercepts \(\alpha_i\), \(i=1,\dots,n\), where each of these can be understood as the fixed effect of entity \(i\), this model is called the fixed effects model. The variation in the \(\alpha_i\), \(i=1,\dots,n\) comes from the \(Z_i\). (10.1) can be rewritten as a regression model containing \(n-1\) dummy regressors and a constant: \[\begin{align} Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it} \tag{10.2}. \end{align}\] Model (10.2) has \(n\) different intercepts — one for every entity. (10.1) and (10.2) are equivalent representations of the fixed effects model.

The fixed effects model can be generalized to contain more than just one determinant of \(Y\) that is correlated with \(X\) and changes over time.


If specified properly, estimating a fixed effects model (as in 10.1) and a Least Squares Dummy Variable (LSDV) model (as in 10.2) will yield identical estimates for \(\beta_1\). The LSDV model will also report estimates for every “fixed effect” coefficient, whereas the fixed effects model will sweep away these fixed effects rather than estimating them; this happens by residualizing (or demeaning) the underlying data, as outlined in the fixed effects video lecture from Quant II.

Also note that you need to think carefully about the underlying error structure when estimating standard errors for inference on \(\hat\beta_1\). Examples of R functions for clustered standard errors for the LSDV and fixed effects models can be found below in Sections 5.1 and 5.2.

5.1 Least Squares Dummy Variable model estimation

The LSDV model can be estimated in R using the lm() function. The code examples below use the gapminder panel data frame to estimate the impact of GDP per capita on life expectancy with fixed effects for country and year (wave).

#specify LSDV model
lsdv1 <- lm(lifeExp ~ gdpPercap + factor(country) + factor(year), data = gap_long)
#display R-squared and Adj. R-squared for full model
summary(lsdv1)$r.squared
## [1] 0.9355361
summary(lsdv1)$adj.r.squared
## [1] 0.9291729
#"regular" SEs (homoskedasticity-only)
coeftest(lsdv1)[2,]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -7.801664e-05  1.843088e-05 -4.232930e+00  2.442012e-05
#robust SEs (here we're just reporting on the coefficient for gdpPercap)
coeftest(lsdv1, vcov = vcovHC(lsdv1, type = "HC1"))[2,]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -7.801664e-05  1.618158e-05 -4.821325e+00  1.566027e-06

The results are identical to Stata when using the areg command (or regress, properly specified):


#clustered SEs by country (equivalent to areg in Stata)

  #stats for coefficient of interest is the 2nd element in this object. 
  lsdv1_vcov <- cluster.vcov(lsdv1, 
                              gap_long$country,
                              df_correction = T) #a small sample size adjustment
    
  #just report stats for coefficient of interest (the second row from coeftest)
    coeftest(lsdv1, lsdv1_vcov)[2,] 
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -7.801664e-05  3.688988e-05 -2.114852e+00  3.460180e-02

Here are the identical clustered SEs from Stata using the areg command:


Also note that the estimated slope effect of GDP per capita on life expectancy is negative! (Try running without year fixed effects and see how the estimated slope effect changes.) In other words, once we control for time-invariant country-specific characteristics and year-specific factors shared by all countries, we estimate a negative association between GDP per capita and life expectancy – this seems contrary to our expectations!

5.2 Fixed effects estimation with feols()

feols() in the fixest package is another way to estimate fixed effects models, and is particularly good at handling lots of fixed effects terms really fast.

#specify model with FEs for country and year, and robust SEs
feols_robust <- feols(lifeExp ~ gdpPercap | factor(country) + factor(year), 
                data = gap_long,
                vcov = "hetero")
#obtain results
summary(feols_robust) 
## OLS estimation, Dep. Var.: lifeExp
## Observations: 1,704
## Fixed-effects: factor(country): 142,  factor(year): 12
## Standard-errors: Heteroskedasticity-robust 
##           Estimate Std. Error  t value  Pr(>|t|)    
## gdpPercap -7.8e-05    1.6e-05 -4.82132 1.566e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.27866     Adj. R2: 0.929173
##                 Within R2: 0.011428
#you can also use tidy in the broom package to obtain results
tidy(feols_robust) %>% filter(term == "gdpPercap")
## # A tibble: 1 × 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 gdpPercap -0.0000780 0.0000162     -4.82 0.00000157
#specify model with FEs for country and year, and cluster robust SEs by country
feols_ccluster <- feols(lifeExp ~ gdpPercap | factor(country) + factor(year), 
                data = gap_long,)
summary(feols_ccluster, cluster = ~ factor(country)) 
## OLS estimation, Dep. Var.: lifeExp
## Observations: 1,704
## Fixed-effects: factor(country): 142,  factor(year): 12
## Standard-errors: Clustered (factor(country)) 
##           Estimate Std. Error  t value Pr(>|t|)    
## gdpPercap -7.8e-05    3.5e-05 -2.20895 0.028792 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.27866     Adj. R2: 0.929173
##                 Within R2: 0.011428

Note that here the clustered SEs differ slightly from the lm() approach and Stata with areg, due to a different sample size adjustment with feols().

Here is a helpful resource for obtaining robust and clustered SEs in R, with comparisons to Stata.

Here is a very short overview of the motivation for clustered standard errors in panel data settings along with some practical guidance, and here is a short video lecture on the topic from Quant II.

Some additional practical guidance to keep in mind: clustered standard errors generally lead to over-rejection of the null hypothesis when the number of clusters is small (<50). See Cameron and Miller (2015) for more details.

5.3 Visualizing fixed effects results

In the fixed effects models estimated above, life expectancy is regressed on GDP capita with fixed effects for country and year. How can we think about visualizing the results of this fixed effects model? Let’s start by writing out the population regression function (PRF),

\[lifeExp_{ct} = \beta_0 + \beta_1 gdpPercap_{ct} + \phi_c + \theta_t + u_{ct} \tag{1}\] where c indexes countries and t indexes years. For notational purposes we generally prefer the traditional FE model notation, since we often aren’t interested in the values of the fixed effect terms themselves (we just want to use them to control for certain types of omitted variables). But it can still be instructive to return to the LSDV framework to understand what we’re estimating.

The LSDV model reminds us that we are estimating a single slope effect for GDP per capita, but we are allowing for different intercepts for every country and year. So we could plot separate regression lines to illustrate how the linear relationship between GDP per capita and life expectancy differs for every single country in 2007, for example; all the regression lines would be parallel, only the intercepts would differ.

In many cases, including this one, that approach would result in way too many regression lines on the same plot – it would just look like a big mess to have separate regression lines for every country in 2007. Moreover, we’re probably not even interested in visualizing different intercepts between countries. Rather, the point is to control for these differences so they aren’t a source of omitted variable bias when estimating \(\beta_1\), the slope effect of GDP per capita on life expectancy.

Hence the more common FE notation seen in equation (1), rather than the LSDV model. Estimation of equation 1 is implemented by using residualized (or demeaned) versions of GDP per capita and life expectancy. That’s what FE commands like feols() in R and areg in Stata do. First, they regress the outcome (lifeExp) on dummy variables for every country and year, and then obtain the residual for every observation. This residual represents the variation in the outcome that is not explained by time-invariant country-specific characteristics, or year-specific factors shared by all countries. The regressor of interest (gdpPercap) is residualized in a similar fashion. Then the FE coefficient estimate for \(\beta_1\) can be obtained by regressing residualized life expectancy on residualized GDP per capita.

With this process in mind, a natural visualization of the FE coefficient estimate of interest is to plot the regression line over the scatterplot of residualized GDP per capita vs residualized life expectancy.

#obtain residualized lifeExp and gdpPercap (with country and year FEs)
lifeExp_r <- resid(lm(gap_long$lifeExp ~ factor(gap_long$country) + factor(gap_long$year)))
gdpPercap_r <- resid(lm(gap_long$gdpPercap ~ factor(gap_long$country) + factor(gap_long$year)))

#add residualized variables to gap_long as new columns (similar to cbind)
gap_long_fe <- data.frame(gap_long, lifeExp_r, gdpPercap_r)

#plot slope effect from FE model over residualized data
ggplot(gap_long_fe, aes(x = gdpPercap_r, y = lifeExp_r)) +
  geom_point(alpha = 0.2) + #adjusts the transparency of the geom_point layer
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  ggtitle(label = "Plot of residualized GDP per capita vs residualized life expectancy",
          subtitle = "*Residuals take into account country and year fixed effects")

6 Exploratory data analysis with panel data

When the data includes observations for entities over time and not just a single cross-section of data, we need to think carefully about what sort of descriptive statistics would be informative. More specifically, we usually don’t want to compute summary statistics by pooling all observations over time (e.g., a difference in groups means averaged across all times periods). Instead it may be informative to calculate summary statistics separately within groups, or within one or more time periods (e.g. a difference in group means within in each time period).

Time series plots for different entities (or groups of entities) may be an informative visualization. For example, one could compare time-series plots of key variables averaged over treated entities vs untreated entities, or between some meaningful groups. The following example shows times series plots of life expectancy for countries grouped by continent.

#transform data to get continent-year means (bc we observe too many countries to plot separately)
ggplot(data = gap_long, aes(x = year, y = lifeExp, color = continent)) +
  stat_summary(fun = mean, geom = "line") + 
  ggtitle("Life expectancy over time by continent")

Also note that the above plot calculates the unweighted mean of lifeExp across all countries for a given continent-year grouping. If we wanted to report population-weighted means for each continent-year, we would need to transform the data frame before passing it to ggplot(), not from within the ggplot() function call. This is because the ggplot statistical transformation stat_summary() doesn’t accept weights as an argument to allow for weighted mean calculations.