3 Data

Detailed course materials can be found in this section, including exercises to practice. If you are a self-learner, make sure to check the setup page.

3.1 Data

All the data we will be using throughout the sessions are contained in a single ZIP file. They are all small CSV (comma-separated values) files. You can download the data below:

Warning

The data we use throughout the course are varied, covering many different topics. Some of the data on medical or socioeconomic topics may be uncomfortable for some people, since they can touch on diseases or death.

All the data are chosen for their pedagogical effectiveness.

3.2 Tidy data

For two samples the data can be stored in one of three formats:

  1. as two separate vectors,
  2. in a stacked data frame,
  3. or in an unstacked data frame/list.

The case of two separate vectors is (hopefully) obvious: each sample is stored in its own vector.

When using a data frame we have different options to organise our data. The best way of formatting data is by using the tidy data format.

Tidy data has the following properties:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

Stacked form (or long format) is where the data are arranged so that each variable (each thing that we measured) has its own column. If we consider a dataset containing meerkat weights (in g) from two different countries, then a stacked format of the data would look like:

# A tibble: 6 × 2
  country  weight
  <chr>     <dbl>
1 Botswana    514
2 Botswana    568
3 Botswana    519
4 Uganda      624
5 Uganda      662
6 Uganda      633
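As a minimal sketch of how the first two formats relate, the stacked table above can be built from two separate vectors. The object names (`botswana`, `uganda`, `meerkats`) are made up for illustration, and base R's `data.frame()` is used here rather than a tibble so that no extra packages are needed.

```r
# Two separate vectors: one per sample (weights in g, taken from the table above)
botswana <- c(514, 568, 519)
uganda   <- c(624, 662, 633)

# Stacked (long) format: one column per variable, one row per observation
meerkats <- data.frame(
  country = rep(c("Botswana", "Uganda"),
                times = c(length(botswana), length(uganda))),
  weight  = c(botswana, uganda)
)

meerkats
```

Each row is now a single measurement, and the country it belongs to is recorded as a value in its own column rather than in the name of a vector.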

In the unstacked (or wide format) form, a single variable (measured thing) is spread across more than one column. For example, let’s say we measured meerkat weight in two countries over a period of years. We could then organise our data in such a way that for each year the measured values are split by country:

# A tibble: 3 × 3
   year Botswana Uganda
  <dbl>    <dbl>  <dbl>
1  1990      514    624
2  1992      568    662
3  1995      519    633
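Wide data like this can be reshaped into the tidy (long) format. The sketch below assumes the tidyr package is installed and uses its `pivot_longer()` function. The object names (`meerkats_wide`, `meerkats_long`) are hypothetical, and this is one possible approach rather than the only one.

```r
library(tidyr)

# Unstacked (wide) format: weight is spread over two country columns
meerkats_wide <- data.frame(
  year     = c(1990, 1992, 1995),
  Botswana = c(514, 568, 519),
  Uganda   = c(624, 662, 633)
)

# Gather the country columns into a single 'country' column,
# with the measured values going into a single 'weight' column
meerkats_long <- pivot_longer(
  meerkats_wide,
  cols      = c(Botswana, Uganda),
  names_to  = "country",
  values_to = "weight"
)

meerkats_long
```

The result has one row per measurement (six rows) and three columns: `year`, `country` and `weight`, so each variable has its own column.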

Tidy data are the easiest to analyse in programming languages, and I would strongly encourage you all to start adopting this format as standard for data collection and processing.

3.3 Conditional operators

To set filtering conditions, use the following relational operators:

  • > is greater than
  • >= is greater than or equal to
  • < is less than
  • <= is less than or equal to
  • == is equal to
  • != is different from
  • %in% is contained in

To combine conditions, use the following logical operators:

  • & AND
  • | OR
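
To illustrate, here is a small sketch that applies these operators to the meerkat weights from the tidy data section. It uses base R's `subset()` so that it runs without extra packages; the same conditions can be used inside `dplyr::filter()`. The object names and the cut-off values are made up for illustration.

```r
meerkats <- data.frame(
  country = c("Botswana", "Botswana", "Botswana", "Uganda", "Uganda", "Uganda"),
  weight  = c(514, 568, 519, 624, 662, 633)
)

# Relational operators
heavy      <- subset(meerkats, weight > 600)          # weights above 600 g
not_uganda <- subset(meerkats, country != "Uganda")   # every country except Uganda

# %in% checks membership in a set of values
# ("Kenya" is not in the data, so it simply matches nothing)
selected <- subset(meerkats, country %in% c("Botswana", "Kenya"))

# Logical operators combine conditions
botswana_heavy <- subset(meerkats, country == "Botswana" & weight >= 519)  # both must hold
extremes       <- subset(meerkats, weight < 515 | weight > 660)            # either may hold
```

Note that `&` requires both conditions to be true for a row to be kept, whereas `|` keeps a row when at least one of them is true.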