For our homework exercises, we will use a new dataset from the Gapminder Foundation, which gives access to global data as well as many tools to help explore it.
We will use data relating to socio-economic statistics for 2010. The columns in our data file are:
Column | Description |
---|---|
country | country name |
world_region | 6 world regions |
year | year that each datapoint refers to |
children_per_woman | total fertility rate |
life_expectancy | average number of years a newborn child would live if current mortality patterns were to stay the same |
income_per_person | gross domestic product per person adjusted for differences in purchasing power |
is_oecd | whether a country belongs to the OECD (TRUE) or not (FALSE) |
income_groups | categorical classification of income groups |
population | total number of a country’s population |
main_religion | religion of the majority of population in 2008 |
child_mortality | death of children under 5 years old per 1000 births |
life_expectancy_female | life expectancy at birth, females |
life_expectancy_male | life expectancy at birth, males |
Tip: this exercise builds on the skills gained in Getting Started.
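- Create a new directory called `gapminder-dataviz`.
- Inside it, create a sub-directory called `data_raw` for saving the raw data.
- Create an R project in the `gapminder-dataviz` directory you just created.

We create a new directory as well as sub-directories, shown here schematically: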
gapminder-dataviz
|_ data_raw
|_ data_processed
|_ fig_output
|_ scripts
We use `data_raw` to save the data file that we download with the link provided.
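The same layout can also be created from the R console rather than by hand. This is only a sketch: it assumes the project folder should live inside the current working directory, and the variable names `project_dir` and `sub_dirs` are chosen here for illustration.

```r
# Assumption: the project folder is created inside the current working directory
project_dir <- "gapminder-dataviz"
sub_dirs <- c("data_raw", "data_processed", "fig_output", "scripts")

# Create each sub-directory; recursive = TRUE also creates the parent folder
for (sub_dir in sub_dirs) {
  dir.create(file.path(project_dir, sub_dir),
             recursive = TRUE, showWarnings = FALSE)
}

# List the sub-directories that now exist inside the project folder
list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```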
Finally, we create an R project in the `gapminder-dataviz` directory that we’ve just created. RStudio should refresh itself and then indicate that the working directory has been set to the new folder. For example, you can run the command `getwd()` on the console to confirm that this is the case.
As with any dataset, you must first understand its content and formatting. Understanding what data you have will help you decide what you can learn from the data and how best to present it.
Create a new script to analyse these data and call it `01-gapminder_exploration.R`. Then, populate it with code to achieve the following:
- Read the `gapminder2010_socioeconomic.csv` file into a `data.frame`/`tibble` object called `gapminder`.
- Hint: use the `read_csv()` function. Remember to first load the `tidyverse` package with `library(tidyverse)`.
- Use `nrow()`, `ncol()`, `summary()` and `str()` to check data integrity.
We can read our data as follows:
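A minimal sketch of that call, assuming the file was saved as `data_raw/gapminder2010_socioeconomic.csv` inside the project (the path and the variable name `gapminder_file` are assumptions based on the directory layout described earlier):

```r
library(tidyverse)

# Assumed location of the downloaded file, relative to the project directory
gapminder_file <- "data_raw/gapminder2010_socioeconomic.csv"

# Read the file only if it is present, so the script does not stop with an error
if (file.exists(gapminder_file)) {
  gapminder <- read_csv(gapminder_file)
} else {
  message("File not found: ", gapminder_file)
}
```

When the file is found, `read_csv()` prints a column specification message like the following: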
#> Rows: 193 Columns: 13
#> ── Column specification ──────────────────
#> Delimiter: ","
#> chr (5): country, world_region, income_groups, main_religion, life_expectanc...
#> dbl (7): year, children_per_woman, life_expectancy, income_per_person, popul...
#> lgl (1): is_oecd
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To examine the contents of the data.frame we can use several functions. For example, to get the number of rows and columns:
ncol(gapminder) # number of columns in the data.frame
nrow(gapminder) # number of rows in the data.frame
The `str()` function gives a more comprehensive view of the contents of the data.frame, including the number of rows and columns as well as the type of variable each column was imported as by R.
The `summary()` function is also very useful, as it gives a quick overview of the types of variables as well as the mean and quantiles for numeric data.
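To see what these two functions report without needing the downloaded file, here is a small sketch on a made-up table. The object name `toy` and all of its values are invented for illustration and are not real Gapminder data. A column holding a placeholder such as "-" is stored as character, and an implausible value such as -999 stands out as the minimum in the numeric summary.

```r
# Made-up rows for illustration only (not real Gapminder values)
toy <- data.frame(
  country = c("A", "B", "C"),
  is_oecd = c(TRUE, FALSE, FALSE),
  life_expectancy_male = c(78.1, -999, 69.4),
  life_expectancy_female = c("82.3", "-", "73.0"),
  stringsAsFactors = FALSE
)

str(toy)      # dimensions plus the type of each column
summary(toy)  # per-column overview; min, mean and quantiles for numeric columns
```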
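The types of variables we have are:
- Categorical (unordered), e.g. `country` and `world_region`
- Categorical (ordered): `income_groups`
- Logical (TRUE/FALSE): `is_oecd`
- Continuous numeric, e.g. `income_per_person` or `life_expectancy`
- Discrete numeric: `year`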
From the output of the `summary()` function, we can notice a few issues with these data:
- Some `life_expectancy_male` values are invalid (recorded as -999).
- `life_expectancy_female` was imported as a character variable, but should be numeric.

If we look at the top few rows of the table, we can see that `life_expectancy_female` is showing the value “-” in the 4th row of data. Probably the person recording these data encoded missing values with the “-” symbol, but the `read_csv()` function did not recognise this as missing data. By default, `read_csv()` treats only empty cells and the string “NA” as missing data, so we should correct this to make sure all missing values are encoded in the same way.
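One way to handle this at import time is the `na` argument of `read_csv()`, which lists the strings to treat as missing. The sketch below uses a made-up three-row CSV passed as literal text via `I()` (this needs readr 2.0 or later), so the object names and values are illustrative assumptions rather than the real file.

```r
library(readr)

# Made-up CSV text (not real Gapminder values) in which "-" marks a missing value
toy_csv <- I("country,life_expectancy_female\nA,81.2\nB,-\nC,74.5\n")

# Default behaviour: "-" is kept as text, so the whole column becomes character
toy_default <- read_csv(toy_csv, show_col_types = FALSE)

# Declaring "-" as a missing-value code lets the column be parsed as numeric
toy_fixed <- read_csv(toy_csv, na = c("", "NA", "-"), show_col_types = FALSE)

class(toy_default$life_expectancy_female)
class(toy_fixed$life_expectancy_female)
sum(is.na(toy_fixed$life_expectancy_female))
```

The same `na` vector could be passed when reading the real file; whether other codes (such as -999) should be added there is a judgement call for the analyst.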
There are a few other issues in the `main_religion` column, which are a little harder to detect. If we look at the unique values of this column, we will notice different spellings/formats for some of its values. These types of spelling mistakes are very common, and it’s important to be aware that R is case-sensitive, so it would consider “muslim” and “Muslim” to be different words.
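The sketch below shows how `unique()` exposes such variants and one possible way to harmonise them. The vector is made up for illustration (the real column may contain different variants), and lower-casing is just one convention among several.

```r
library(stringr)

# Made-up values for illustration; the real column may contain other variants
main_religion <- c("Christian", "christian", "Muslim", "muslim", "Muslim ")

# Variants that differ only in case or spacing are reported as distinct values
unique(main_religion)

# One possible clean-up: trim stray spaces, then use a single case convention
main_religion_clean <- str_to_lower(str_squish(main_religion))
unique(main_religion_clean)
```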
#> Warning: Removed 6 rows containing missing values
#> (`geom_point()`).
The warning message we get is because 6 of the rows in the data frame do not have `life_expectancy` information (they are `NA`, i.e. missing data).
If this were a key variable for your analysis, you might want to remove the rows with missing data. In this case, we don’t mind having these missing data, so we can carry on with our analysis.
Note that ‘Warning messages’ are simply that: a warning, not an error. They are very helpful and always worth reading.
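If the missing values did matter, they could be counted and removed before plotting. The following is a sketch on made-up data (the `toy` table and its values are invented); `drop_na()` from tidyr is one of several ways to do this.

```r
library(tidyverse)

# Made-up rows for illustration only (not real Gapminder values)
toy <- tibble(
  country = c("A", "B", "C", "D"),
  life_expectancy = c(80.2, NA, 65.7, NA)
)

# How many rows lack a life_expectancy value?
sum(is.na(toy$life_expectancy))

# Keep only the rows where life_expectancy is present
toy_complete <- drop_na(toy, life_expectancy)
nrow(toy_complete)
```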
- Use the `filter()` function to subset the table to retain only rows where `world_region == "south_asia"`.
- Create a new column called `income_total`, which is the product of population and income per person (i.e. the total income of the country).
- Hint: use the `mutate()` function to create a new column.
To identify countries in South Asia, we can use `filter()` on the `world_region` column. The resulting table has 8 rows, so there are 8 countries in this part of the world.
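The filtering step can be sketched on a small made-up table standing in for the real `gapminder` object. The name `gapminder_toy` and its three rows are invented for illustration, so the row count here differs from the real result.

```r
library(tidyverse)

# Made-up rows for illustration only; the real table has 193 rows
gapminder_toy <- tibble(
  country = c("A", "B", "C"),
  world_region = c("south_asia", "europe_central_asia", "south_asia"),
  population = c(1000, 2000, 3000),
  income_per_person = c(5, 10, 20)
)

# Keep only the rows whose world_region is exactly "south_asia"
south_asia <- gapminder_toy %>%
  filter(world_region == "south_asia")

nrow(south_asia)  # number of countries retained
```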
To create the new column we can use the `mutate()` function. On its own, this only prints the result; if we want to save the new column in our table, we need to update the object using the `<-` assignment.
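Both variants are sketched below on a made-up two-row table (`gapminder_toy` and its values are invented for illustration); with the real data, the same `mutate()` call would be applied to `gapminder`.

```r
library(tidyverse)

# Made-up rows for illustration only (not real Gapminder values)
gapminder_toy <- tibble(
  country = c("A", "B"),
  population = c(1000, 2000),
  income_per_person = c(5, 10)
)

# Without assignment, the table with the new column is printed but not stored
gapminder_toy %>%
  mutate(income_total = population * income_per_person)

# Assigning the result back to the same name updates the object
gapminder_toy <- gapminder_toy %>%
  mutate(income_total = population * income_per_person)

names(gapminder_toy)  # now includes income_total
```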