Tutorial 2: Working with data in R

Aims

By the end of this second tutorial, you will be able to

Import data files into an R data frame
Make exploratory graphics
Perform elementary manipulations on a data frame.

Downloading and importing data

In R it is straightforward to import data that are stored locally on your machine or directly from the web. We’ll import some data from Our World In Data. You can access the data from the data file directly here. The variables are explained in the code book.

Within RStudio, there is a neat interface for reading in data. In the Environment tab in the top right corner of the RStudio window, select Import Dataset and From Text (readr)…. Note that other options support importing data in Excel and other common formats. Some of the functionality here is from the tidyverse suite of packages, a widely used framework for Data Science in R.

In the dialog box that opens, navigate to the folder where the downloaded file is located. Once you have selected the file, the window will show a preview of how the data will be imported. For this dataset, the data should appear as a neat, rectangular array. This means we are now ready to import the data. Other datasets might require more effort.

The Code Preview box gives a few lines of code that could be used to read in the data manually. Instead, simply click Import. Then R runs the code from the Code Preview box.

Alternatively, we could read in the file directly from the web. To do this, we need to load the tidyverse.

library(tidyverse)
dat_url = "https://nyc3.digitaloceanspaces.com/owid-public/data/energy/owid-energy-data.csv"
dat = read_csv(dat_url)

Exploring data

The View(dat) command produces a nice interactive spreadsheet representation of the data, which appears in a separate panel. There are many different variables - look through the codebook to understand what each variable means.

We’ll now see some tools for exploring the data and constructing some simple plots.

First, we’ll restrict our attention to the data relating to the United Kingdom. We define a new object by filtering the original data object.

dat_uk = filter(dat,
                country == "United Kingdom")

We can make a simple plot of how the UK population has changed over the time period recorded in the dataset. Note that we use $ to refer to the columns of our data.

plot(dat_uk$year,
     dat_uk$population)

For this data, perhaps it is more natural to plot a line than to plot individual points. We can do this by setting the type argument.

plot(dat_uk$year,
     dat_uk$population,
     type = "l")

This is better, although there is still more work to do before we have a plot that we would be happy including in a well-presented report.

`ggplot` for data-controlled graphics

The ggplot package allows very precise control of how plots are made. It is especially useful when working with large, structured datasets. The gg in`ggplot refers to the Grammar of Graphics, a framework for specifying precisely how elements of a data visualisation are built up on the page.

The syntax takes some getting used to. We’ll start by giving the full command, and then breaking it down.

ggplot( dat = dat_uk,
        mapping = aes(x = year,
                      y = population)) +
  geom_line()

The first argument is the data frame to be used for plotting: here, this is dat_uk. We then specify a mapping. This determines how variables in the dataset correspond to visual properties. Here, we specify that the x-axis is controlled by the year variable, and the y-axis by the population variable.

The term after the + specifies a new layer to be placed on the plotting environment. In this case, we specify a line.

We could alternatively plot individual data points.

ggplot( dat = dat_uk,
        mapping = aes(x = year,
                      y = population)) +
  geom_point()

We can control properties of the plotted points by specifying the appropriate aesthetics. You can obtain a full list of the properties from the help ?geom_point under aesthetics. Here we change the colour and size.

ggplot( dat = dat_uk,
        mapping = aes(x = year,
                      y = population)) +
  geom_point(colour = "red",
             size = 0.75)

One very useful feature of ggplot is the ability to modify the properties of points (or lines) using values in the data.

As a simple example, we define an indicator variable, which is TRUE if the year is later than 1975, and FALSE otherwise.

dat_uk$indicator = dat_uk$year > 1975

We then pass the indicator variable to geom_point as an aesthetic controlling the colour.

ggplot( dat = dat_uk,
        mapping = aes(x = year,
                      y = population)) +
  geom_point(aes(colour = indicator))

Now points are coloured differently according to whether or not they are later than 1975.

Modifying plot attributes

We’ll now make the plot easier to read. We make the following changes

Measure population in units of millions, by dividing the population variable by $10^6$ .
Add neat labels for the x- and y-axes, and a centred title.
Change the y-axis scale so that tick marks occur at evenly spaced values between 40 and 70.

ggplot( dat = dat_uk,
        mapping = aes(x = year,
                      y = population/10^6)) +
  geom_line() +
  scale_x_continuous("Year") +
  scale_y_continuous("Population (millions)",
                     breaks = seq(40, 70, 2)) +
  ggtitle( "UK Population from 1900 to 2021") +
  theme(plot.title = element_text(hjust = 0.5))

Comparing countries

Let’s now add data from some other countries to the plot. So that the plot isn’t too busy, we will just consider a few countries, with broadly similar populations.

Start by making a new data frame.

dat_small = filter(dat, country %in%
                          c("United Kingdom",
                            "France",
                            "Germany",
                            "Spain"))

Then plot, specifying in the aesthetic that colour is controlled by the variable country.

ggplot( dat = dat_small,
        mapping = aes(x = year,
                      y = population/10^6,
                      colour = country)) +
  geom_line() +
  scale_x_continuous("Year") +
  scale_y_continuous("Population (millions)",
                     breaks = seq(20, 100, 5)) +
  ggtitle( "UK Population from 1900 to 2021") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(colour = "Country")

Note two additional changes in the plot above: to the range of y-axis values, to account for the wider range of values, and to the specification of the legend label in the final labs argument.

Manipulating data frames

The two most important functions for manipulating data frames are filter and select. We have already seen that filter extracts observations (rows of the data frame). select extracts variables (columns of the data frame).

e.g. if we wanted to extract all variables relating to coal, we could use the following code

select(dat, contains("coal"))

## # A tibble: 22,343 × 12
##    coal_cons_change_pct coal_cons_change_twh coal_cons_per_cap… coal_consumption
##                   <dbl>                <dbl>              <dbl>            <dbl>
##  1                   NA                   NA                 NA               NA
##  2                   NA                   NA                 NA               NA
##  3                   NA                   NA                 NA               NA
##  4                   NA                   NA                 NA               NA
##  5                   NA                   NA                 NA               NA
##  6                   NA                   NA                 NA               NA
##  7                   NA                   NA                 NA               NA
##  8                   NA                   NA                 NA               NA
##  9                   NA                   NA                 NA               NA
## 10                   NA                   NA                 NA               NA
## # … with 22,333 more rows, and 8 more variables: coal_elec_per_capita <dbl>,
## #   coal_electricity <dbl>, coal_prod_change_pct <dbl>,
## #   coal_prod_change_twh <dbl>, coal_prod_per_capita <dbl>,
## #   coal_production <dbl>, coal_share_elec <dbl>, coal_share_energy <dbl>

Other options for working with select can be found in the help, ?select.

The pipe (`%>%`)

Often, we want to perform a sequence of manipulations, one after the other. e.g. we might be interested only in certain variables, and only in observations from a particular year.

This is exactly what is achieved by the code below. It extracts all observations from the year 2018, and then selects just the two variables population and gdp. This subset of the original data is assigned to the data frame dat_2018.

Note the use of %>%, which is called the pipe - it is helpful to read this as “and then”.

dat_2018 = dat %>%
  filter(year == 2018) %>%
  select(population, gdp)

We take the data frame dat, and then extract the observations from the year 2018, and then select the variables population and gdp.

Let’s now look at the relationship between population and gdp.

ggplot( data = dat_2018,
        mapping = aes(x = population,
                      y = gdp)) +
  geom_point()

There are a few things going wrong here. The first problem is the warning from R that some points could not be plotted, because of missing values.

It is always worth investigating missing values, to understand whether there is a common reason for missingness, but for now we’ll just remove any missing values.

dat_2018 = dat %>%
  filter(year == 2018) %>%
  select(population, gdp) %>%
  drop_na()

Now when we plot, there is no warning.

ggplot( data = dat_2018,
        mapping = aes(x = population,
                      y = gdp)) +
  geom_point()

However, the plot is still rather hard to read. The scale needed to display all of the data points means that most observations in are the bottom left corner, and it is hard to see the pattern.

For data that vary over many orders of magnitude, it helps to use a log scale

ggplot( data = dat_2018,
        mapping = aes(x = population,
                      y = gdp)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

Viewed on a log scale, a linear relationship is clearly visible. Adding a linear trendline guides the eye.

ggplot( data = dat_2018,
        mapping = aes(x = population,
                      y = gdp)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  stat_smooth(method = "lm")

The grey interval around the trendline represents a 95% confidence interval for average GDP for a country with a given population. Note that the uncertainty is smaller in the region where there is more data, which is intuitively reasonable.

Finally, we make some cosmetic changes to the axes so that the scales display properly. We use the breaks argument to specify where the axis labels should be, and labels to specify the format, as powers of 10.

library(scales)
ggplot( data = dat_2018,
        mapping = aes(x = population,
                      y = gdp)) +
  geom_point() +
    scale_x_continuous("Population",
                       trans = "log10",
                       breaks = trans_breaks('log10',
                                             function(x) 10^x),
                       labels = trans_format('log10',
                                           math_format(10^.x))) +
  scale_y_continuous("GDP (USD)",
                     trans = "log10",
                     breaks = trans_breaks('log10',
                                             function(x) 10^x),
                     labels = trans_format('log10',
                                           math_format(10^.x))) +
  stat_smooth(method = "lm") +
  ggtitle( "GDP vs Population in 2018") +
  theme(plot.title = element_text(hjust = 0.5))

Next steps

In this tutorial, we have seen how to take some first steps in exploring data. In the next tutorial, you will see how to

fit regression models in R.
interpret linear models in context.
understand the uncertainty in model predictions.