Load packages:
library(tidyverse)
library(gapminder)
Resources used to create this lesson:
The utils package that is automatically loaded with R includes a set of functions for importing data (e.g. read.csv()). Comparable functions included in the readr package (e.g. read_csv()) are generally faster and import data as tibbles that work well with tidyverse packages. We’ll focus on the readr functions for importing data, which are included as part of the Tidyverse suite of packages.
Many datasets are stored as text files with rows of data and values for each column separated by commas (or some other delimiter).
df1 <- read_csv("yourfilename.csv")
The default is for strings to be stored as characters rather than factors (unlike read.csv() in the utils package, which converted strings to factors by default before R 4.0.0). One common option you can set is to specify which values are read in as NA. In this example we’ll treat blank character values as NAs. Another option allows you to specify column names.
varnames <- c("c1", "c2", "c3", "c4")
df1 <- read_csv("yourfilename.csv", na = "", col_names = varnames)
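As a minimal, self-contained sketch of these two options, the chunk below writes a small headerless file to a temporary location and reads it back. The file contents and the object name df_example are made up for illustration.

```r
library(readr)

# Hypothetical two-row file with no header row and some blank fields
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,a,,x", "2,,3.5,y"), tmp)

varnames <- c("c1", "c2", "c3", "c4")

# na = "" turns blank fields into NA; col_names supplies the header,
# so the first line of the file is read as data rather than as names
df_example <- read_csv(tmp, na = "", col_names = varnames)
df_example
```

Because the column names are supplied explicitly, both lines of the file end up as rows of data.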
read_tsv() is a similar function for reading in tab-delimited text files of data. read_csv() and read_tsv() are known as wrapper functions. Wrapper functions call another function that does the real work, but “wrap” that function in a way that allows for simpler syntax and is easier to work with. Wrappers often specify some of the common options for you, but there are often many more options you can explore in the documentation.
In this case, read_csv() behaves like a call to the more general read_delim() with the options already set to treat commas as the file delimiter.
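To see the relationship in practice, here is a small sketch that reads the same temporary file both ways. The file contents and object names are hypothetical; the only claim being illustrated is that the two calls parse this simple file into the same columns.

```r
library(readr)

# Hypothetical comma-delimited file with a header row
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,city", "1,Lima", "2,Oslo"), tmp)

via_csv   <- read_csv(tmp)                 # comma delimiter is implied
via_delim <- read_delim(tmp, delim = ",")  # comma delimiter stated explicitly

# Compare the parsed columns one by one
identical(via_csv$id, via_delim$id)
identical(via_csv$city, via_delim$city)
```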
fread() in the data.table package is another powerful tool for reading in data without having to specify many (or any!) options.
An Excel file is effectively a set of data tables, with each worksheet as its own data table. The readxl package includes the excel_sheets() function for recognizing sheet names, and the read_excel() function for importing specific sheets in a workbook of data. You can use excel_sheets() to obtain worksheet names and then tell R to read those worksheets by specifying them as arguments when calling read_excel(). You can also simply refer to a worksheet by its index number in the workbook, as in the code chunk below. Just make sure you remember to install and load the readxl package first.
library(readxl)
excel_sheets("yourworkbook.xlsx")
df2 <- read_excel("yourworkbook.xlsx", sheet = 2)
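If you don’t have a workbook handy, the following sketch uses one of the example files that ship with readxl (located with readxl_example()) to show that a sheet can be selected either by name or by index. The object names are chosen for illustration only.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# List the worksheet names, then read the second sheet two ways
sheet_names <- excel_sheets(path)
df_by_name  <- read_excel(path, sheet = sheet_names[2])
df_by_index <- read_excel(path, sheet = 2)

# Both approaches should return a table of the same size
dim(df_by_name)
dim(df_by_index)
```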
R also has its own data file formats: an .rdata file can include one or more data objects, while an .rds file includes just one data object. You can easily read in files in these formats using the load() and readRDS() functions in base R.
load("yourrdatafile1.rdata") # load an .rdata file; objects are restored under their saved names
df3 <- readRDS("yourrdatafile2.rds") # load an .rds file and assign it to df3
A fixed width file is a text file where each column has a maximum number of characters (i.e. a “fixed width”). There are no delimiters used within rows to separate values for each column. This means you have to tell R, or any statistical software package, which characters should be assigned to which variables. read_fwf() is a readr function that makes it relatively easy to read in fixed width files.
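Here is a minimal sketch of read_fwf() in action. The three-column layout (a 3-character id, a 6-character name, and a 2-character age) and the file contents are invented for illustration; fwf_widths() is the readr helper that describes column widths.

```r
library(readr)

# Hypothetical layout: id = characters 1-3, name = 4-9, age = 10-11
tmp <- tempfile(fileext = ".txt")
writeLines(c("001Alice 34", "002Bob   27"), tmp)

df_fwf <- read_fwf(
  tmp,
  col_positions = fwf_widths(c(3, 6, 2), col_names = c("id", "name", "age")),
  col_types = "cci"  # read id and name as character, age as integer
)
df_fwf
```

Setting col_types explicitly keeps the leading zeros in id, and readr trims the padding around each name by default.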
You can save one or more objects in your environment as an .rdata file that can be loaded at another time.
# Saving one object in an rdata file
save(dataobject1, file = "data.RData")
# Save multiple objects
save(dataobject1, dataobject2, file = "data.RData")
# To load the data again
load("data.RData")
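The difference between the two formats is easiest to see in a round trip. In this base R sketch (all object and file names are hypothetical), load() recreates the saved objects under their original names and returns only those names, while readRDS() returns the object itself so you can assign it to any name you like.

```r
# Two small objects to save
scores <- data.frame(id = 1:3, score = c(90, 85, 77))
labels <- c("a", "b", "c")

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

save(scores, labels, file = rdata_path)  # several objects in one file
saveRDS(scores, file = rds_path)         # exactly one object

rm(scores, labels)

# load() restores the objects and returns a character vector of their names
restored_names <- load(rdata_path)
restored_names

# readRDS() returns the object, which we assign to a name of our choosing
scores_copy <- readRDS(rds_path)
identical(scores_copy, scores)
```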
There may be instances where you want to export data for use in other applications, such as mapping software or web-based applications. Also included in the readr package, the write_csv() function is an easy wrapper function for exporting .csv files.
write_csv(yourdataframe, "yourfilename.csv")
Appending is simply stacking rows of data on top of each other. For example, you may have two identically formatted data frames, each with data for a different year, that you want to combine into a single data frame spanning both years.
The most important requirement for effective appending is to make sure all of your datasets have comparable information stored under the same column names, and with values coded in a consistent manner. For example, if you appended individual-level data from one year with height measured as inches to another year of data where height is measured as centimeters, you’d end up with a useless set of height measurements. You’d first want to convert height in one dataset to the same unit of measurement as the other, then you’d want to make sure height is stored under the same column name in both datasets.
Here is the generic syntax for appending two data frames (df1 and df2) using rbind().
df_combined <- rbind(df1, df2)
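To make the height example concrete, here is a small base R sketch with made-up survey data: one year records height in inches, the other in centimeters, so we convert and rename before appending. All object and column names are hypothetical.

```r
# Hypothetical data: 2020 heights in inches, 2021 heights in centimeters
survey_2020 <- data.frame(id = 1:2, height_in = c(65, 70))
survey_2021 <- data.frame(id = 3:4, height_cm = c(172, 160))

# Convert inches to centimeters and drop the old column so both
# data frames store height under the same name and unit
survey_2020$height_cm <- survey_2020$height_in * 2.54
survey_2020$height_in <- NULL

survey_all <- rbind(survey_2020, survey_2021)
survey_all
```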
rbind() only works if both data frames have the exact same columns. If you have an additional column of data in one data frame that you want to keep when you append them together (assigning NA values for this column to observations from the other set), try rbind.fill() in the plyr package.
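As an alternative that stays within the tidyverse, dplyr’s bind_rows() also fills columns missing from one data frame with NA. This is a swapped-in technique rather than the rbind.fill() approach named above, and the data frames below are invented for illustration.

```r
library(dplyr)

# Hypothetical data: only the second data frame has a weight column
heights_2020 <- data.frame(id = 1:2, height_cm = c(170, 165))
heights_2021 <- data.frame(id = 3:4, height_cm = c(180, 158), weight_kg = c(75, 52))

# rbind() fails because the columns differ
rbind_failed <- inherits(try(rbind(heights_2020, heights_2021), silent = TRUE), "try-error")
rbind_failed

# bind_rows() matches columns by name and fills the gaps with NA
combined_all <- bind_rows(heights_2020, heights_2021)
combined_all
```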
Combining columns of data is known as a join, which we will explore next week.
We’ve already seen that str() is a very handy tool for inspecting the data object types. Before we start cleaning and summarizing data, let’s review some basic functions for inspecting data frames and columns of data. We’ll focus on the familiar gapminder dataset and the continent column as an example.
str() shows us the structure of any object, in this case the gapminder data frame, including the dimensions (rows x columns), column names, data type/class, and a snapshot of values for each column.
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head() and tail(): Show the first or last parts of an object (an optional argument allows you to specify how many observations to show). This is useful for observing high/low values of a variable after calling arrange().
gapminder %>% filter(year == 2007) %>% arrange(desc(lifeExp)) %>% head(n = 5)
## # A tibble: 5 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Iceland Europe 2007 81.8 301931 36181.
## 4 Switzerland Europe 2007 81.7 7554661 37506.
## 5 Australia Oceania 2007 81.2 20434176 34435.
There are a number of functions that allow us to inspect specific columns of data (i.e. variables). Remember that if you pass a data frame to a tidyverse function through a pipe, you don’t need to refer to the data frame as an argument in the next function, and you can just refer to columns by name when specifying function arguments (when applicable). But if you call a base R function that isn’t part of the tidyverse, you’ll need to use the $ syntax to access specific columns within a data frame.
gapminder %>% head(n = 3)
## # A tibble: 3 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
head(gapminder, n = 3)
## # A tibble: 3 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
dim(gapminder)
## [1] 1704 6
typeof(gapminder$continent)
## [1] "integer"
attributes(gapminder$continent)
## $levels
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
##
## $class
## [1] "factor"
class(gapminder$continent)
## [1] "factor"
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5
The dplyr functions introduced last week are useful tools for understanding and managing your dataset: filter(), arrange(), select(), mutate() and rename(). They help us understand what columns of data look like… and by extension suggest certain data cleaning tasks to make the information we want to work with clear, consistent, and ready for statistical analysis.
Here are some basic tools for recoding information stored in columns of data.
recode() is another dplyr function for replacing numeric, character or factor values. Say we want to recode gapminder observations listed with the country name “West Bank and Gaza” to instead refer to the country name “State of Palestine.” We can use the recode() function as part of the argument in the mutate() function to create a new variable called country_clean.
g2 <- gapminder %>%
  mutate(country_clean = recode(country, "West Bank and Gaza" = "State of Palestine"))
# let's check if it worked by printing observations where country == "West Bank and Gaza"
g2 %>% filter(country == "West Bank and Gaza")
## # A tibble: 12 × 7
## country continent year lifeExp pop gdpPercap country_clean
## <fct> <fct> <int> <dbl> <int> <dbl> <fct>
## 1 West Bank and Gaza Asia 1952 43.2 1030585 1516. State of Palest…
## 2 West Bank and Gaza Asia 1957 45.7 1070439 1827. State of Palest…
## 3 West Bank and Gaza Asia 1962 48.1 1133134 2199. State of Palest…
## 4 West Bank and Gaza Asia 1967 51.6 1142636 2650. State of Palest…
## 5 West Bank and Gaza Asia 1972 56.5 1089572 3133. State of Palest…
## 6 West Bank and Gaza Asia 1977 60.8 1261091 3683. State of Palest…
## 7 West Bank and Gaza Asia 1982 64.4 1425876 4336. State of Palest…
## 8 West Bank and Gaza Asia 1987 67.0 1691210 5107. State of Palest…
## 9 West Bank and Gaza Asia 1992 69.7 2104779 6018. State of Palest…
## 10 West Bank and Gaza Asia 1997 71.1 2826046 7111. State of Palest…
## 11 West Bank and Gaza Asia 2002 72.4 3389578 4515. State of Palest…
## 12 West Bank and Gaza Asia 2007 73.4 4018332 3025. State of Palest…
Note that country is a factor, so categorical values (levels) are stored as integers but displayed as characters. We can refer to the levels themselves when using recode().
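A tiny base R sketch can make the “stored as integers, displayed as characters” point more tangible. The sizes vector is made up for illustration.

```r
# A small hypothetical factor
sizes <- factor(c("small", "large", "small", "medium"))

typeof(sizes)      # the underlying storage type
levels(sizes)      # levels are sorted alphabetically by default
as.integer(sizes)  # each value is stored as the position of its level
```

Each integer code is simply an index into the vector of levels, which is why we can recode a factor by referring to its level labels.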
recode() can also be used to recode values in numeric, character, and logical columns of data.
replace() is a base R function that can also be used for recoding values.
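For comparison, here is a minimal sketch of the same kind of recoding with replace(), applied to a made-up character vector rather than the gapminder factor. Its arguments are the vector, a logical condition (or index) identifying the elements to change, and the replacement value.

```r
# Hypothetical character vector of country names
country <- c("Jordan", "West Bank and Gaza", "Lebanon", "West Bank and Gaza")

# Replace every element that matches the condition
country_clean <- replace(country, country == "West Bank and Gaza", "State of Palestine")
country_clean
```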
Suppose that we have a factor variable var1 with 4 categories, but we want to treat observations in the fourth category as NA because we’re unclear what this category means. One way to do that is to use factor() to explicitly set the levels to use, in turn treating all excluded factor levels as NA.
# set a vector including the desired factor levels to use for var1
levels_to_keep <- c("cat1", "cat2", "cat3")
# rebuild var1 with only these levels; values in any other level become NA
df %>% mutate(var1 = factor(var1, levels = levels_to_keep))
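Since df and var1 above are placeholders, here is a runnable base R version of the same idea with a made-up factor, so you can confirm that observations in the dropped level really become NA.

```r
# Hypothetical factor with four categories
var1 <- factor(c("cat1", "cat2", "cat4", "cat3", "cat4"))

levels_to_keep <- c("cat1", "cat2", "cat3")

# Values whose level is not listed in levels_to_keep are converted to NA
var1_clean <- factor(var1, levels = levels_to_keep)

var1_clean
sum(is.na(var1_clean))
```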
The forcats package included in the tidyverse includes a number of tools for working with factors. forcats includes a set of functions for re-ordering factor levels, which can be useful for summarizing factor variables.
# the default level order is alphabetical
gapminder$continent %>% levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# reorder levels in reverse alphabetical order
gapminder$continent %>%
  fct_rev() %>%
  levels()
## [1] "Oceania" "Europe" "Asia" "Americas" "Africa"
# reorder levels by frequency
fct_count(gapminder$continent)
## # A tibble: 5 × 2
## f n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
gapminder$continent %>%
  fct_infreq() %>%
  levels()
## [1] "Africa" "Asia" "Europe" "Americas" "Oceania"
# reorder levels by putting Europe first
gapminder$continent %>%
  fct_relevel("Europe") %>%
  levels()
## [1] "Europe" "Africa" "Americas" "Asia" "Oceania"
# reorder levels based on minimum lifeExp among all observations in each continent
gapminder$continent %>%
  fct_reorder(gapminder$lifeExp, min) %>%
  levels()
## [1] "Africa" "Asia" "Americas" "Europe" "Oceania"
You can take a look at the forcats documentation for a cheat sheet that describes other functions for working with factor variables.
if_else() is a useful dplyr function for manipulating data in (non-factor) columns of data based on conditional statements. To get a sense of how if_else() works, let’s create a binary (logical) variable called cont_asia equal to TRUE if continent == "Asia" and FALSE for other continents.
g3 <- gapminder %>%
  mutate(cont_asia = if_else(continent == "Asia", TRUE, FALSE))
str(g3)
## tibble [1,704 × 7] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
## $ cont_asia: logi [1:1704] TRUE TRUE TRUE TRUE TRUE TRUE ...
We’ll check to make sure this mutate worked correctly in the next section.
Also note that base R includes its own ifelse() function that works a bit differently than the dplyr if_else() function. The dplyr version is stricter about data types: the values returned for TRUE and FALSE must have compatible types, whereas ifelse() silently coerces them to a common type.
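The following sketch illustrates the difference with a made-up numeric vector. We deliberately mix a character value and a number as the two outcomes: base R’s ifelse() coerces everything to character, while dplyr’s if_else() stops with an error, which we catch here so the chunk runs to completion.

```r
library(dplyr)

x <- c(1, 5, 10)

# Base R: the number 0 is silently converted to the string "0"
base_result <- ifelse(x > 4, "high", 0)
base_result

# dplyr: mixing a character and a numeric outcome is an error
strict_result <- tryCatch(
  if_else(x > 4, "high", 0),
  error = function(e) "error"
)
strict_result
```

The stricter behavior can feel inconvenient, but it tends to surface type mix-ups early instead of letting them pass into later calculations.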
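mutate_if() is a related tool that applies a transformation to every column that satisfies a specified condition (for example, all numeric columns). In current versions of dplyr it is superseded by combining mutate() with across().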
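Here is a minimal sketch of a conditional, column-wise transformation using mutate() with across() and where(), which is the approach current dplyr documentation recommends in place of mutate_if(). The small tibble is invented for illustration.

```r
library(dplyr)

# Hypothetical data with one character and two numeric columns
df_small <- tibble(
  name = c("a", "b"),
  x = c(1.234, 5.678),
  y = c(10.11, 20.22)
)

# Round every numeric column to one decimal place; other columns are untouched
rounded <- df_small %>%
  mutate(across(where(is.numeric), ~ round(.x, 1)))
rounded
```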
Summary statistics, or descriptive statistics, are terms we’ll use to describe information about the distribution of a random variable. Before we start exploring relationships between variables, we generally want to understand the distribution of individual variables. Some examples of summary statistics:
summary() is a base R function that summarizes objects in ways that depend on the class of the object. It can be a handy way to quickly generate summary statistics for columns of data, though the display format varies depending on the class of the argument. Here are some examples.
summary(g3)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap cont_asia
## Min. :6.001e+04 Min. : 241.2 Mode :logical
## 1st Qu.:2.794e+06 1st Qu.: 1202.1 FALSE:1308
## Median :7.024e+06 Median : 3531.8 TRUE :396
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
# note how summary treats factors, characters, and logical data types differently
summary(g3$continent)
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
summary(as.character(g3$cont_asia))
## Length Class Mode
## 1704 character character
summary(g3$cont_asia)
## Mode FALSE TRUE
## logical 1308 396
A more flexible function for computing summary statistics is the summarize() function in the dplyr package. summarize() stores results as data frames and can be used to perform more complicated calculations in combination with group_by(). This can be helpful for characterizing conditional distributions, such as computing statistics for one variable conditional on the value of another variable. (Crosstabs, discussed in the next section, are another tool for showing conditional or joint distributions.)
gapminder %>%
  filter(year == 2007) %>%
  summarize(n_countries = n(), avg_lifeExp_2007 = mean(lifeExp))
## # A tibble: 1 × 2
## n_countries avg_lifeExp_2007
## <int> <dbl>
## 1 142 67.0
Here is a list of “helper” functions we can use within summarize():
Function | Description |
---|---|
n | count |
n_distinct | count unique values |
mean | mean |
median | median |
max | largest value |
min | smallest value |
sd | standard deviation |
sum | sum of values |
first | first value |
last | last value |
nth | nth value |
any | condition true for at least one value |
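To see several of these helpers side by side, here is a short sketch on a made-up tibble of test scores. The data and the output column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: four students in two groups
scores <- tibble(
  student = c("a", "b", "c", "d"),
  group = c("x", "x", "y", "y"),
  score = c(80, 90, 70, 90)
)

summary_row <- scores %>%
  summarize(
    n = n(),                       # number of rows
    n_groups = n_distinct(group),  # number of unique groups
    avg = mean(score),
    med = median(score),
    highest = max(score),
    first_student = first(student),
    second_score = nth(score, 2),  # the second value of score
    any_below_75 = any(score < 75)
  )
summary_row
```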
The base R table() function is a quick tool for generating frequency tables (also known as one-way tables) and crosstabs (two-way tables). Note that passing the data column continent as an argument to the table() function yields the same frequency table as passing it to the summary() function.
table(g3$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
Let’s look at a crosstab between continent and cont_asia to make sure we correctly created the logical variable indicating Asian observations in Section 6.4: if_else. We can also look at column or row proportions (conditional distributions) instead of cell counts.
# joint distribution
table(g3$continent, g3$cont_asia)
##
## FALSE TRUE
## Africa 624 0
## Americas 300 0
## Asia 0 396
## Europe 360 0
## Oceania 24 0
# distribution of continent conditional on being in Asia (admittedly not very interesting)
prop.table(table(g3$continent, g3$cont_asia) , 2) %>% round(2)
##
## FALSE TRUE
## Africa 0.48 0.00
## Americas 0.23 0.00
## Asia 0.00 1.00
## Europe 0.28 0.00
## Oceania 0.02 0.00
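The second argument of prop.table() controls which margin the proportions are conditioned on: 1 makes each row sum to one, and 2 makes each column sum to one. Here is a small base R sketch with made-up vectors to show the difference.

```r
# Hypothetical data: five observations in two groups
group <- c("a", "a", "a", "b", "b")
flag <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

counts <- table(group, flag)

# margin = 1: distribution of flag within each group (rows sum to 1)
row_props <- prop.table(counts, margin = 1)

# margin = 2: distribution of group within each flag value (columns sum to 1)
col_props <- prop.table(counts, margin = 2)

round(row_props, 2)
round(col_props, 2)
```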
Note that there may be other packages with functions specifically designed for the types of data you’re working with. For example, fct_count() in the forcats package shows a frequency table for factor variables, yielding equivalent results to the base R table() function but with enhanced presentation.
fct_count(g3$continent)
## # A tibble: 5 × 2
## f n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
The group_by() function allows us to add extra structure to our data by telling R to recognize groups. This allows for calculating statistics at the group level using summarize() and mutate(), both of which respect groups.
summarize() will generally take a dataset and return a single observation (per group) with specified summary statistics, such as counting observations or computing means (in each group, if groups are specified).
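The contrast between the two is worth seeing on a tiny example: summarize() collapses each group to one row, while mutate() keeps every row and adds a group-level calculation alongside it. The sales data below are made up for illustration.

```r
library(dplyr)

# Hypothetical data: three sales in two regions
sales <- tibble(
  region = c("east", "east", "west"),
  amount = c(10, 30, 50)
)

# summarize(): one row per region
by_region <- sales %>%
  group_by(region) %>%
  summarize(total = sum(amount))
by_region

# mutate(): all three rows kept, with each sale's share of its region total
with_share <- sales %>%
  group_by(region) %>%
  mutate(share = amount / sum(amount)) %>%
  ungroup()
with_share
```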
Combined with mutate(), summarize() and other dplyr functions, group_by() enables exploratory data analysis tools that we will use throughout this course. Let’s start by using these functions with the gapminder data frame to count observations for certain groups of country-year observations and make some calculations at the continent level. Take a look at the following examples.
# use summarize to count observations in each continent
gapminder %>%
  filter(year %in% c(1952, 2007)) %>% # %in% identifies if an element belongs to a vector
  group_by(continent) %>%
  summarize(n = n())
## # A tibble: 5 × 2
## continent n
## <fct> <int>
## 1 Africa 104
## 2 Americas 50
## 3 Asia 66
## 4 Europe 60
## 5 Oceania 4
# count does both grouping and counting in a single function!
gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  count(continent)
## # A tibble: 5 × 2
## continent n
## <fct> <int>
## 1 Africa 104
## 2 Americas 50
## 3 Asia 66
## 4 Europe 60
## 5 Oceania 4
# use summarize to count observations for each continent-year and compute mean lifeExp
gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarize(n = n(), avg_lifeExp = mean(lifeExp))
## # A tibble: 10 × 4
## continent year n avg_lifeExp
## <fct> <int> <int> <dbl>
## 1 Africa 1952 52 39.1
## 2 Africa 2007 52 54.8
## 3 Americas 1952 25 53.3
## 4 Americas 2007 25 73.6
## 5 Asia 1952 33 46.3
## 6 Asia 2007 33 70.7
## 7 Europe 1952 30 64.4
## 8 Europe 2007 30 77.6
## 9 Oceania 1952 2 69.3
## 10 Oceania 2007 2 80.7
# let's use mutate to compute the change between 1952 and 2007 in country life expectancy
gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(country) %>%
  select(country, year, lifeExp) %>%
  mutate(lifeExp_gain = lifeExp - first(lifeExp))
## # A tibble: 284 × 4
## country year lifeExp lifeExp_gain
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 0
## 2 Afghanistan 2007 43.8 15.0
## 3 Albania 1952 55.2 0
## 4 Albania 2007 76.4 21.2
## 5 Algeria 1952 43.1 0
## 6 Algeria 2007 72.3 29.2
## 7 Angola 1952 30.0 0
## 8 Angola 2007 42.7 12.7
## 9 Argentina 1952 62.5 0
## 10 Argentina 2007 75.3 12.8
## # … with 274 more rows
# here's a more advanced example: obtaining extreme lifeExp values for each year
gapminder %>%
  select(year, country, lifeExp) %>%
  group_by(year) %>%
  filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2) %>% # min_rank is a dplyr ranking function
  arrange(year, lifeExp) %>%
  print(n = Inf) # Infinity can be used to override limits on the number of rows/columns printed
## # A tibble: 24 × 3
## # Groups: year [12]
## year country lifeExp
## <int> <fct> <dbl>
## 1 1952 Afghanistan 28.8
## 2 1952 Norway 72.7
## 3 1957 Afghanistan 30.3
## 4 1957 Iceland 73.5
## 5 1962 Afghanistan 32.0
## 6 1962 Iceland 73.7
## 7 1967 Afghanistan 34.0
## 8 1967 Sweden 74.2
## 9 1972 Sierra Leone 35.4
## 10 1972 Sweden 74.7
## 11 1977 Cambodia 31.2
## 12 1977 Iceland 76.1
## 13 1982 Sierra Leone 38.4
## 14 1982 Japan 77.1
## 15 1987 Angola 39.9
## 16 1987 Japan 78.7
## 17 1992 Rwanda 23.6
## 18 1992 Japan 79.4
## 19 1997 Rwanda 36.1
## 20 1997 Japan 80.7
## 21 2002 Zambia 39.2
## 22 2002 Japan 82
## 23 2007 Swaziland 39.6
## 24 2007 Japan 82.6
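If the min_rank() filter in the last example looks opaque, this small sketch on a made-up vector shows what the function returns. Rank 1 goes to the smallest value, ties share the lowest rank available to them, and wrapping the input in desc() ranks from largest to smallest instead.

```r
library(dplyr)

values <- c(10, 20, 20, 5)

# Rank from smallest to largest; the two 20s tie for rank 3
ascending_rank <- min_rank(values)
ascending_rank

# Rank from largest to smallest; the two 20s tie for rank 1
descending_rank <- min_rank(desc(values))
descending_rank
```

So within each year, a rank below 2 in either direction picks out the lowest and highest life expectancy.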
ggplot2 is a package in the tidyverse that provides a system for creating data visualizations. It’s a bit more difficult to learn than other R functionality because ggplot syntax relies on its own grammatical rules.
With ggplot2, you begin a plot with the function ggplot(). This creates a coordinate system that you can add layers to. We start with three components:
Data: the first argument of ggplot() is the data to use in the graph, e.g. ggplot(data = gapminder). This creates an empty graph that you can add one or more layers to.
Aesthetic mappings: variables are mapped to visual properties of the graph with the aes() function, which must be specified along with a ggplot() call. ggplot2 knows to look for mapped variables inside the data argument.
Geometries: geom functions determine the type of plot (e.g. geom_point(), geom_line(), and geom_bar() or geom_histogram()).
Other components that you can specify include scale, statistical transformations, or facets that break information into subplots. Here is a handy reference sheet from ggplot2.
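One way to get a feel for the three components is to build a plot one piece at a time. This sketch uses the mpg dataset that ships with ggplot2 (an assumption made so the example runs without any other package); the object names are arbitrary.

```r
library(ggplot2)

# 1. Data only: an empty plot
p_empty <- ggplot(data = mpg)

# 2. Add aesthetic mappings: the axes appear, but nothing is drawn yet
p_mapped <- p_empty + aes(x = displ, y = hwy)

# 3. Add a geometry layer: one point per row of the data
p_points <- p_mapped + geom_point()

p_points
```

Printing p_empty or p_mapped at each step shows how little is drawn until a geometry is added.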
We’ll cover ggplot data visualization in more detail in a few weeks, but for now let’s review just the very basics needed to generate histograms and barplots as a tool for summarizing data.
Let’s learn the basic syntax by plotting the number of countries observed in each continent in the gapminder dataset for 2007. Barplots use geom_bar() to plot categorical data, whereas histograms use geom_histogram() to visualize the distribution of a continuous variable.
gap_2007 <- gapminder %>% filter(year == 2007)
# it can be helpful to include argument names while learning the syntax
ggplot(data = gap_2007) + aes(x = continent) + geom_bar()
# but R is smart enough to recognize many arguments without specifying by name
ggplot(gap_2007) + aes(continent) + geom_bar()
Here is the same information with additional options to display continent proportions instead of counts.
# map y to the computed proportion; group = 1 computes proportions across all continents
ggplot(gap_2007) + aes(x = continent, y = after_stat(prop), group = 1) + geom_bar()
geom_bar() works by simply counting observations in each category of the argument x (in this case x = continent). Alternatively, we can collapse the country-level data to continent-level observations using group_by() and summarize(), and then plot the data directly using geom_col(). Don’t worry too much about this, just note that the geometries we use in our plots depend on the structure of the data.
gap_continents <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(country_ct = n())
ggplot(data = gap_continents) + aes(x = continent, y = country_ct) + geom_col()
Here is the code to illustrate the distribution of life expectancy across countries in 2007 using a very basic histogram with no additional options specified.
ggplot(gap_2007) + aes(x = lifeExp) + geom_histogram()