gapminder1960_2010 <- read_csv("data/gapminder1960to2010_socioeconomic.csv")
11 Looking for changes
- Be able to visualise changes in data
11.1 Libraries and functions
11.1.1 Libraries
11.1.2 Functions
11.2 Purpose and aim
In this section we’re going to look at dealing with data that changes. These can be changes over time or, for example, changes across treatments / regions / concentrations etc.
11.3 Loading data
We’ll be using a new data set for this section. It contains similar information to the gapminder data set we’ve used so far, but it has data for different years, covering 1960 to 2010.
11.4 Changes over time
Let’s say we’re interested in life expectancy. We now have data on this variable for 50 different years, so it’d be nice to see how life expectancy changed over time.
There are 193 countries in this data set, so it’s probably not a good idea to plot them all at once…
Let’s focus close to home and see how life expectancy changed in the United Kingdom over these years.
To do this, we first filter the data to keep only the rows for the United Kingdom, and then plot it.
gapminder1960_2010 %>%
  filter(country == "United Kingdom") %>%
  ggplot(aes(x = year,
             y = life_expectancy,
             group = country)) +
  geom_line()
We can see that life expectancy has increased markedly over the last 50 years. Notice that the y-axis covers a range of only around 70 to 85! If we changed the plot so that the y-axis started at zero, it would look rather different.
We can set the y-axis range or limits with ylim(), specifying the first and last value that we want in the plot:
gapminder1960_2010 %>%
  filter(country == "United Kingdom") %>%
  ggplot(aes(x = year,
             y = life_expectancy,
             group = country)) +
  geom_line() +
  ylim(0, 90)
These two plots show the same data, but the clarity of the message is rather different.
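One caveat worth knowing: ylim() sets the limits of the scale itself, so any observations outside the chosen range are dropped before the plot is drawn. If the aim is only to zoom the view, coord_cartesian() changes the visible window without removing data. The sketch below contrasts the two. The toy data frame and its values are made up for illustration and are not taken from the gapminder file.

```r
library(ggplot2)

# Hypothetical toy data (invented values, not from the gapminder file)
toy <- data.frame(
  year  = c(1960, 1970, 1980, 1990, 2000, 2010),
  value = c(50, 60, 70, 80, 90, 100)
)

base_plot <- ggplot(toy, aes(x = year, y = value)) +
  geom_line()

# ylim() sets the scale limits: the observation with value 100 falls
# outside 0-90 and is treated as missing
p_limits <- base_plot + ylim(0, 90)

# coord_cartesian() only zooms: every observation is kept,
# but part of the line lies outside the visible window
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 90))

# Count the non-missing y values that each plot would draw
n_limits <- sum(!is.na(ggplot_build(p_limits)$data[[1]]$y))
n_zoom   <- sum(!is.na(ggplot_build(p_zoom)$data[[1]]$y))
```

For the life expectancy plots in this section every value lies between 0 and 90, so both approaches would draw the same line. The difference only matters when the limits are narrower than the data.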
How you scale and define your axes matters, as you might have derived from the plots above. Have a look at the graphs below, which are based on exactly the same data:
Let’s assume that they were published in the campaign prospectus of the Republican and Democratic parties. Which one do you think ended up where?
These plots of course show data for only one country, so they don’t give us much context. How impressive is the increase in life expectancy in the United Kingdom compared to other countries? We know that, for example, the United States and China have had a lot of economic growth in the past 50 years, so let’s compare the United Kingdom with them.
We adjust the filter that we used earlier, to include the United States and China. We also colour the data by country, so that we can distinguish the three countries.
gapminder1960_2010 %>%
  filter(country %in% c("China", "United Kingdom", "United States")) %>%
  ggplot(aes(x = year,
             y = life_expectancy,
             colour = country,
             group = country)) +
  geom_line()
%in% syntax
We use %in% when we want to compare against a collection of values. Let’s look at a very simple data set called colours, which contains 5 different colour values:
colours
# A tibble: 5 × 1
value
<chr>
1 green
2 yellow
3 yellow
4 red
5 purple
If we wanted to keep only the yellow and purple values, we could filter like this:
filter(colours, value %in% c("yellow", "purple"))
# A tibble: 3 × 1
value
<chr>
1 yellow
2 yellow
3 purple
What happens is that R goes through each entry in the value column and checks whether it matches any of the items after %in%. So in this case green and red do not match and are dropped, whereas both yellow rows and the purple row match and are kept.
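To see what filter() receives, it can help to run %in% on its own. This is a minimal sketch, assuming the five colour values are stored in a plain character vector (called value here, as a stand-in for the column of the colours tibble):

```r
# The same five values as in the colours example, stored as a plain vector
value <- c("green", "yellow", "yellow", "red", "purple")

# %in% returns one TRUE/FALSE per element on its left-hand side:
# TRUE if that element appears anywhere in the collection on the right
matches <- value %in% c("yellow", "purple")
matches
# [1] FALSE  TRUE  TRUE FALSE  TRUE

# filter() keeps the rows where this vector is TRUE;
# subsetting the vector by hand gives the same three values
kept <- value[matches]
kept
# [1] "yellow" "yellow" "purple"
```

The result always has the same length as the left-hand side, which is why it can be used directly as a row filter.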
From this plot we can see that the United Kingdom and United States show very similar increases in life expectancy, roughly increasing by 10 years.
However, plotting this together with China’s life expectancy shows that China has seen a much larger increase over the past 50 years, since its life expectancy was only just above 30 years in 1960!
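To put numbers on what a plot like this suggests, the change between the first and last year can be computed per country. The sketch below runs on a tiny invented data set: the countries A and B and all of their values are made up, and the column name increase is an arbitrary choice. The same pipeline should work on gapminder1960_2010, provided each country of interest has a row for both 1960 and 2010.

```r
library(dplyr)

# Hypothetical toy data with made-up numbers, standing in for gapminder1960_2010
toy <- tibble(
  country         = c("A", "A", "B", "B"),
  year            = c(1960, 2010, 1960, 2010),
  life_expectancy = c(70, 80, 40, 75)
)

le_change <- toy %>%
  # keep only the first and last year of the period
  filter(year %in% c(1960, 2010)) %>%
  group_by(country) %>%
  # within each country, subtract the 1960 value from the 2010 value
  summarise(
    increase = life_expectancy[year == 2010] - life_expectancy[year == 1960]
  )

le_change
```

In the toy data, country A gains 10 years and country B gains 35, mirroring the kind of contrast visible between the lines in the plot.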
11.4.1 Exercises
11.5 Summary
- Visualising changes over time is a powerful tool to detect trends
- Decisions on axis limits can dramatically change the message