1 Introduction

Load packages:

library(tidyverse)
library(gapminder)

Resources used to create this lesson:

2 Importing data

The utils package that is automatically loaded with R includes a set of functions for importing data (e.g. read.csv()). Comparable functions included in the readr package (e.g. read_csv()) are generally faster and import data as tibbles that work well with tidyverse packages. We’ll focus on the readr functions for importing data, which are included as part of the Tidyverse suite of packages.

2.1 Comma delimited files (.csv)

Many datasets are stored as text files with rows of data and values for each column separated by commas (or some other delimiter).

df1 <- read_csv("yourfilename.csv")

The default is for strings to be stored as characters rather than factors (unlike read.csv() in the utils package, which has different default behavior). One common option you can set is to specify which values are read in as NA. In this example we’ll treat blank character values as NAs. Another option allows you to specify column names.

varnames <- c("c1", "c2", "c3", "c4)
df1 <- read_csv("yourfilename.csv", na = "", col_names = varnames)

read_tsv() is a similar function for reading in tab-delimited text files of data. read_csv() and read_tsv() are known as wrapper functions. Wrapper functions call another function that does the real work, but “wrap” that function in way that allows for simpler syntax and is easier to work with. Wrappers often specify some of the common options for you, but there are often many more options you can explore in the documentation.

In this case it’s read_delim() that does the real work when you call read_csv(), read_csv() simply tells read_delim() to specify the options to treat commas as the file delimiter.

fread() in the data.tables package is another powerful tool for reading in data without having to specify many (or any!) options.

2.2 Excel files

An Excel file is effectively a set of data tables, with each worksheet as its own data table. The readxl package includes the excel_sheets() function for recognizing sheet names, and the read_excel() function for importing specific sheets in a workbook of data. You can use excel_sheets() to obtain worksheet names and then tell R to read those worksheets by specifying them as arguments when calling read_excel(). You can also simply refer to a worksheet by its index number in the workbook, as in the below code chunk. Just make sure you remember to install and load the readxl package first.

library(readxl)
excel_sheets("yourworkbook.xls")
df2 <- read_excel("yourworkbook.xlsx", sheet = 2)

2.3 R data files

R also has its own data file formats: an .rdata file can include one or more data objects, an .rds file includes just one data object. You can easily read in files in these formats using the load() and readRDS() functions in base R.

df3 <- load("yourrdatafile1.rdata") #load an .rdata file
df3 <- readRDS("yourrdatafile2.rds") #load an .rds file

2.4 Fixed width files

A fixed width file is a text file where each column has a maximum number of characters (i.e. a “fixed width”). There are no delimiters used within rows to separate values for each column. This means you have tell R, or any statistical software package, which characters should be assigned to which variables. read_fwf() is a readr package that makes it relatively easy to read in fixed width files.

3 Saving and exporting data

You can save one or more objects in your environment as an .rdata file that can be loaded at another time.

# Saving one object in an rdata file
save(dataobject1, file = "data.RData")

# Save multiple objects
save(dataobject1, dataobject2, file = "data.RData")

# To load the data again
load("data.RData")

There may be instances where you want to export data for use in other applications, such as mapping software or web-based applications. Also included in the readr package, the write_csv() function is an easy wrapper function for exporting .csv files.

write_csv(yourdataframe, "yourfilename.csv")

4 Appending data sets

Appending is simply stacking rows of data on top of each other. For example, you may have two identically formatted data frames, each with data for a different year, that you want to combine into a single data frame spanning both years.

The most important requirement for effective appending is to make sure all of your datasets have comparable information stored under the same column names, and with values coded in a consistent manner. For example, if you appended individual-level data from one year with height measured as inches to another year of data where height is measured as centimeters, you’d end up with a useless set of height measurements. You’d first want to convert height in one dataset to the same unit of measurement as the other, then you’d want to make sure height is stored under the same column name in both datasets.

Here is the generic syntax for appending two data frames (df1 and df2) using rbind().

df_combined <- rbind(df1, df2)

rbind() only works if both data frames have the exact same columns. If you have an additional column of data in one data frame that you want to keep when you append them together (assigning NA values for this column to observations from the other set), try rbind.fill() in the plyr package.

Combining columns of data is known as a join, which we will explore next week.

5 Inspecting columns of data

We’ve already seen that str() is a very handy tool for inspecting the data object types. Before we start cleaning and summarizing data, let’s review some basic functions for inspecting data frames and columns of data. We’ll focus on the familiar gapminder dataset and the continent column as an example.

str(gapminder) shows us the structure of any object, in case the gapminder data frame, including the dimensions (columns x rows), column names, data type/class, and snapshot of values for each column.

str(gapminder)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

head() and tail(): Show the first or last parts of an object (an optional argument allows you to specify how many observations to show). This is useful for observing high/low values of a variable after calling arrange().

gapminder %>% filter(year == 2007) %>% arrange(desc(lifeExp)) %>% head(n = 5)

## # A tibble: 5 × 6
##   country          continent  year lifeExp       pop gdpPercap
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Japan            Asia       2007    82.6 127467972    31656.
## 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
## 3 Iceland          Europe     2007    81.8    301931    36181.
## 4 Switzerland      Europe     2007    81.7   7554661    37506.
## 5 Australia        Oceania    2007    81.2  20434176    34435.

There are a number of functions that allow us to inspect specific columns of data (i.e. variables). Remember that if you pass a data frame to a tidyverse function through a pipe, you don’t need to refer to the data frame as an argument in the next function, and you can just refer to columns by name when specifying function arguments (when applicable). But if you call a base R function that isn’t part of the tidyverse package, you’ll need to use the $ syntax to access specific columns within a data frame.

gapminder %>% head(n = 3)

## # A tibble: 3 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.

head(gapminder, n = 3)

## # A tibble: 3 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.

dim(gapminder)

## [1] 1704    6

typeof(gapminder$continent)

## [1] "integer"

attributes(gapminder$continent)

## $levels
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania" 
## 
## $class
## [1] "factor"

class(gapminder$continent)

## [1] "factor"

levels(gapminder$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

nlevels(gapminder$continent)

## [1] 5

6 Basic data cleaning tools

The dplyr functions introduced last week are useful tools for understanding and managing your dataset: filter(), arrange(), select(), mutate() and rename(). They help us understand what columns of data look like… and by extension suggest certain data cleaning tasks to make the information we want to work with clear, consistent, and ready for statistical analysis.

Here are some basic tools for recoding information stored in columns of data.

6.1 Recoding values

recode() is another dplyr function for replacing numeric, character or factor values. Say we want to recode gapminder observations listed with the country name “West Bank and Gaza” to instead refer to the country name “State of Palestine.” We can use the recode() function as part of the argument in the mutate() function to create a new variable called country_clean.

g2 <- gapminder %>% 
  mutate(country_clean = recode(country, "West Bank and Gaza" = "State of Palestine")) 
# let's check if it worked by printing observations where country == "West Bank and Gaza"
g2 %>% filter(country == "West Bank and Gaza")

## # A tibble: 12 × 7
##    country            continent  year lifeExp     pop gdpPercap country_clean   
##    <fct>              <fct>     <int>   <dbl>   <int>     <dbl> <fct>           
##  1 West Bank and Gaza Asia       1952    43.2 1030585     1516. State of Palest…
##  2 West Bank and Gaza Asia       1957    45.7 1070439     1827. State of Palest…
##  3 West Bank and Gaza Asia       1962    48.1 1133134     2199. State of Palest…
##  4 West Bank and Gaza Asia       1967    51.6 1142636     2650. State of Palest…
##  5 West Bank and Gaza Asia       1972    56.5 1089572     3133. State of Palest…
##  6 West Bank and Gaza Asia       1977    60.8 1261091     3683. State of Palest…
##  7 West Bank and Gaza Asia       1982    64.4 1425876     4336. State of Palest…
##  8 West Bank and Gaza Asia       1987    67.0 1691210     5107. State of Palest…
##  9 West Bank and Gaza Asia       1992    69.7 2104779     6018. State of Palest…
## 10 West Bank and Gaza Asia       1997    71.1 2826046     7111. State of Palest…
## 11 West Bank and Gaza Asia       2002    72.4 3389578     4515. State of Palest…
## 12 West Bank and Gaza Asia       2007    73.4 4018332     3025. State of Palest…

Note that country is a factor, so categorical values (levels) are stored as integers but displayed as characters. We can refer to the levels themselves when using recode().

recode() can also be used to recode values in numeric, character, and logical columns of data.

replace() is a base R function that is can also be used fo recoding values.

6.2 Setting factor levels

Suppose that we have a factor variable var1 with 4 categories, but we want to exclude observations in the fourth category as NA because we’re unclear what this category means. One way to do that is to use factor() to explicitly set the levels to use, in turn treating all excluded factor levels as NA.

# set a vector including the desired factor levels to use for var1
levels_to_keep <- c("cat1", "cat2", "cat3")

# assign this vector to the levels attribute of factor var1
df %>% mutate(var1 = factor(var1, levels = levels_to_keep))

6.3 Re-ordering factor levels

The forcats package included in the tidyverse includes a number of tools for working with factors. forcats includes a set of functions for re-ordering factor levels, which can be useful for summarizing factor variables.

# the default level is alphabetical
gapminder$continent %>% levels()

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

# reorder levels in reverse alphabetical order
gapminder$continent %>% 
  fct_rev() %>% 
  levels()

## [1] "Oceania"  "Europe"   "Asia"     "Americas" "Africa"

# reorder levels by frequency 
fct_count(gapminder$continent)

## # A tibble: 5 × 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24

gapminder$continent %>% 
  fct_infreq() %>% 
  levels()

## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"

# reorder levels by putting Europe first
gapminder$continent %>% 
  fct_relevel("Europe") %>% 
  levels()

## [1] "Europe"   "Africa"   "Americas" "Asia"     "Oceania"

# reorder levels based on minimum lifeExp among all observations in each continent
gapminder$continent %>% 
  fct_reorder(gapminder$lifeExp, min) %>% 
  levels()

## [1] "Africa"   "Asia"     "Americas" "Europe"   "Oceania"

You can take a look at the forcats documentation for a cheat sheet that describes other functions for working with factor variables.

6.4 if_else

if_else is a useful dplyr function for manipulating data in (non-factor) columns of data based on conditional statements. To get a sense of how if_else works, let’s create a binary (logical) variable called cont_asia equal to TRUE if continent == "Asia" and FALSE for other continents.

g3 <- gapminder %>% 
  mutate(cont_asia = if_else(continent == "Asia", TRUE, FALSE) ) 
str(g3)

## tibble [1,704 × 7] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
##  $ cont_asia: logi [1:1704] TRUE TRUE TRUE TRUE TRUE TRUE ...

We’ll check to make sure this mutate worked correctly in the next section.

Also note that base R includes its own ifelse function that works a bit differently than the dplyr if_else function. The dplyr version is more rigid when it comes to working with different data types.

mutate_if() is another tool for creating variables in ways that depend on certain specified conditions.

7 Summary statistics

Summary statistics, or descriptive statistics, are terms we’ll use to describe information about the distribution of a random variable. Before we start exploring relationships between variables, we generally want to understand the distribution of individual variables. Some examples of summary statistics:

mean, median, min, max, and standard deviation are statistics that characterize the distribution of a continuous variable
frequency tables characterize the distribution of categorical variables

7.1 summary()

summary() is a base R function that summarizes objects in ways that depend on the class of the object. It can be a handy way to quickly generate summary statistics for columns of data, though the display format varies depending on the class of the argument. Here are some examples.

summary(g3)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap        cont_asia      
##  Min.   :6.001e+04   Min.   :   241.2   Mode :logical  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1   FALSE:1308     
##  Median :7.024e+06   Median :  3531.8   TRUE :396      
##  Mean   :2.960e+07   Mean   :  7215.3                  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5                  
##  Max.   :1.319e+09   Max.   :113523.1                  
##

# note how summary treats factors, characters, and logical data types differently
summary(g3$continent)

##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

summary(as.character(g3$cont_asia))

##    Length     Class      Mode 
##      1704 character character

summary(g3$cont_asia)

##    Mode   FALSE    TRUE 
## logical    1308     396

7.2 summarize()

A more flexible function for computing summary statistics is the summarize() function in the dplyr package. summarize() stores results as data frames and can be used to perform more complicated calculations in combination with group_by(). This can be helpful for characterizing conditional distributions, such as computing statistics for one variable conditional on the value of another variable. (Crosstabs, discussed in the next section, are another tool for showing conditional or joint distributions.)

gapminder %>%
  filter(year == 2007) %>%  
  summarize(n_countries = n(), avg_lifeExp_2007 = mean(lifeExp))

## # A tibble: 1 × 2
##   n_countries avg_lifeExp_2007
##         <int>            <dbl>
## 1         142             67.0

Here is a list of “helper” functions we can use within summarize():

Function	Description
`n`	count
`n_distinct`	count unique values
`mean`	mean
`median`	median
`max`	largest value
`min`	smallest value
`sd`	standard deviation
`sum`	sum of values
`first`	first value
`last`	last value
`nth`	nth value
`any`	condition true for at least one value

7.3 Frequency tables and crosstabs

The base R table() function is a quick tool for generating frequency tables (also known as one-way tables) and crosstabs (two-way tables). Note that passing the data column continent as an argument to the table() function yields an equivalent frequency table as passing it to the summary function.

table(g3$continent)

## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

Let’s look at a crosstab between continent and cont_asia to make sure we correctly created the logical variable indicating Asian observations in Section 6.4: if_else. We can also look at column or row proportions (conditional distributions) instead of cell counts.

# joint distribution
table(g3$continent, g3$cont_asia)

##           
##            FALSE TRUE
##   Africa     624    0
##   Americas   300    0
##   Asia         0  396
##   Europe     360    0
##   Oceania     24    0

# distribution of continent conditional on being in Asia (admittedly not very interesting)
prop.table(table(g3$continent, g3$cont_asia) , 2)  %>% round(2)

##           
##            FALSE TRUE
##   Africa    0.48 0.00
##   Americas  0.23 0.00
##   Asia      0.00 1.00
##   Europe    0.28 0.00
##   Oceania   0.02 0.00

Note that there may be other packages with functions specifically designed for the types of data you’re working with. For example, fct_count() in the forcats package shows a frequency table for factor variables, yielding equivalent results to the base R table() function but with enhanced presentation.

fct_count(g3$continent)

## # A tibble: 5 × 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24

8 Grouping

The group_by() function allows us to add extra structure to your data by telling R to recognize groups. This allows for calculating statistics at the group-level using summarize() and mutate(), both of which respect groups.

summarize() will generally take a dataset and return a single observation (per group) with specified summary statistics, such as counting observations or computing means (in each group, if groups are specified).

Combined with mutate(), summarize() and other dplyr functions, group_by() enables exploratory data analysis tools that we will use throughout this course. Let’s start by using these functions with the gapminder data frame to count observations for certain groups of country-year observations and make some calculations at the continent level. Take a look at the following examples.

# use summarize to count observations in each continent
gapminder %>%
  filter(year %in% c(1952, 2007)) %>% # %in% identifies if an element belongs to a vector 
  group_by(continent) %>%
  summarize(n = n())

## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa      104
## 2 Americas     50
## 3 Asia         66
## 4 Europe       60
## 5 Oceania       4

# count does both grouping and counting in a single function!
gapminder %>%
  filter(year %in% c(1952, 2007)) %>% 
  count(continent)

## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa      104
## 2 Americas     50
## 3 Asia         66
## 4 Europe       60
## 5 Oceania       4

# use summarize to count observations for each country-year and compute mean lifeExp
gapminder %>%
  filter(year %in% c(1952, 2007)) %>% 
  group_by(continent, year) %>%
  summarize(n = n(), avg_lifeExp = mean(lifeExp))

## # A tibble: 10 × 4
##    continent  year     n avg_lifeExp
##    <fct>     <int> <int>       <dbl>
##  1 Africa     1952    52        39.1
##  2 Africa     2007    52        54.8
##  3 Americas   1952    25        53.3
##  4 Americas   2007    25        73.6
##  5 Asia       1952    33        46.3
##  6 Asia       2007    33        70.7
##  7 Europe     1952    30        64.4
##  8 Europe     2007    30        77.6
##  9 Oceania    1952     2        69.3
## 10 Oceania    2007     2        80.7

# let's use mutate to compute the change between 1952 and 2007 in country life expectancy
gapminder %>%
  filter(year %in% c(1952, 2007)) %>% 
  group_by(country) %>% 
  select(country, year, lifeExp) %>% 
  mutate(lifeExp_gain = lifeExp - first(lifeExp))

## # A tibble: 284 × 4
##    country      year lifeExp lifeExp_gain
##    <fct>       <int>   <dbl>        <dbl>
##  1 Afghanistan  1952    28.8          0  
##  2 Afghanistan  2007    43.8         15.0
##  3 Albania      1952    55.2          0  
##  4 Albania      2007    76.4         21.2
##  5 Algeria      1952    43.1          0  
##  6 Algeria      2007    72.3         29.2
##  7 Angola       1952    30.0          0  
##  8 Angola       2007    42.7         12.7
##  9 Argentina    1952    62.5          0  
## 10 Argentina    2007    75.3         12.8
## # … with 274 more rows

# here's a more advanced example: obtaining extreme lifeExp values for each year
gapminder %>%
  select(year, country, lifeExp) %>%
  group_by(year) %>%
  filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2) %>% # min_rank is a dplyr ranking function
  arrange(year, lifeExp) %>%
  print(n = Inf) # Infinity can be used to override limits on the number of rows/columns printed

## # A tibble: 24 × 3
## # Groups:   year [12]
##     year country      lifeExp
##    <int> <fct>          <dbl>
##  1  1952 Afghanistan     28.8
##  2  1952 Norway          72.7
##  3  1957 Afghanistan     30.3
##  4  1957 Iceland         73.5
##  5  1962 Afghanistan     32.0
##  6  1962 Iceland         73.7
##  7  1967 Afghanistan     34.0
##  8  1967 Sweden          74.2
##  9  1972 Sierra Leone    35.4
## 10  1972 Sweden          74.7
## 11  1977 Cambodia        31.2
## 12  1977 Iceland         76.1
## 13  1982 Sierra Leone    38.4
## 14  1982 Japan           77.1
## 15  1987 Angola          39.9
## 16  1987 Japan           78.7
## 17  1992 Rwanda          23.6
## 18  1992 Japan           79.4
## 19  1997 Rwanda          36.1
## 20  1997 Japan           80.7
## 21  2002 Zambia          39.2
## 22  2002 Japan           82  
## 23  2007 Swaziland       39.6
## 24  2007 Japan           82.6

9 Intro to ggplot for visualization

ggplot2 is a package in the tidyverse that provides a system for creating data visualizations. It’s a bit more difficult to learn than other R functionality because ggplot syntax relies on its own grammatical rules.

With ggplot2, you begin a plot with the function ggplot(). This creates a coordinate system that you can add layers to. We start with three components:

data: the first argument of ggplot() is the data to use in the graph, e.g. ggplot(data = gapminder). This creates an empty graph that you can add one or more layers to.
aesthetics map variables to visual properties like color, shape, size, etc. **Aesthetic mappings are set as arguments to the aes() function, which must be specified along with a ggplot() call. ggplot2 knows to look for mapped variables inside the data argument.
geometries are geometric elements added to plots in layers. Examples includes points, lines, and bars (e.g. geom_point(), geom_line(), and geom_bar() or geom_histogram()).

Other components that you can specify include scale, statistical transformations, or facets that break information into subplots. Here is a handy reference sheet from ggplot2.

We’ll cover ggplot data visualization in more detail in a few weeks, but for now let’s review just the very basics needed to generate histograms and barplots as a tool for summarizing data.

9.1 barplots

Let’s learn the basic syntax by plotting the number of countries observed in each continent in the gapminder dataset for 2007. Barplots use geom_bar() to plot categorical data, whereas histograms use geom_histogram() to visualize the distribution of a continuous variable.

gap_2007 <- gapminder %>% filter(year == 2007)

# it can be helpful to include argument names while learning the syntax
ggplot(data = gap_2007) + aes(x = continent) + geom_bar()

# but R is smart enough to recognize many arguments without specifying by name
ggplot(gap_2007) + aes(continent) + geom_bar()

Here is the same information with additional options to display continent proportions instead of counts.

# but R is smart enough to recognize many arguments without specifying by name
ggplot(gap_2007) + aes(x = continent, y = ..prop.., group = 1) + geom_bar()

geom_bar() works by simply counting observations in each category of the argument x (in this case x = continent). Alternatively, we can collapse the country-level data to continent level observations using group_by() and summarize(), and the plot the data directly using geom_col(). Don’t worry too much about this, just note that the geometries we use in our plots depend on the structure of the data.

gap_continents <- gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarize(country_ct = n())
ggplot(data = gap_continents) + aes(x = continent, y = country_ct) + geom_col()

9.2 histograms

Here is the code to illustrate the distribution of life expectancy across countries in 2007 using a very basic histogram with no additional options specified.

ggplot(gap_2007) + aes(x = lifeExp) + geom_histogram()

Lecture 3.1: Data Analysis Tools

SIPA U6614 | Instructor: Harold Stolper

null