12  Reshaping data

Learning objectives
  • Learn what long and wide data formats are.
  • Be able to recognise when best to use each.
  • Be able to switch from long to wide and back.

12.1 Context

So far, we have provided data in the most convenient format. In real life this is of course not always the case, because people collect data in a format that works best for them - not the computer. So, sometimes we need to change the shape of our data before we can do the calculations or create the visualisations we want.

12.2 Section setup

We’ll continue this section with the script named da4-09-reshaping-data.R. If needed, add the following code to the top of your script and run it.

# A collection of R packages designed for data science
library(tidyverse)

surveys_join <- read_csv("data/surveys_join.csv")

We’ll continue this section with the script named da4-09-reshaping-data.py. If needed, add the following code to the top of your script and run it.

# A Python data analysis and manipulation tool
import pandas as pd

# Python equivalent of `ggplot2`
from plotnine import *

surveys_join = pd.read_csv("data/surveys_join.csv")

12.3 Data reshaping

Let’s look at a hypothetical data set, based on the type of variables we’ve come across in the surveys data.

Figure 12.1: Wide and long data formats contain the same information

The data that is present in both tables is the same - it’s just encoded slightly differently.

  1. The “long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.
  2. The “wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our metric spread across multiple columns (a column each for a year).
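
To make this concrete, here is a minimal sketch of the same kind of data in both formats. The values are made up, but the column names match the examples used below:

# Long format: one column per variable (hypothetical values)
weight_long <- tibble(
  record_id = c(1, 1, 2, 2),
  year      = c(2021, 2022, 2021, 2022),
  weight    = c(10, 12, 8, 9)
)

# Wide format: the same weights, spread across one column per year
weight_wide <- tibble(
  record_id = c(1, 2),
  yr_2021   = c(10, 8),
  yr_2022   = c(12, 9)
)
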
Wide or long?

Neither of these formats is necessarily more correct than the other: it will depend on what analysis you intend on doing. However, it is worth mentioning that the “long” format is often preferred, as it is clearer how many distinct types of variables we have in the data.

To figure out which format you might need, it may help to think about which visualisations you want to build with ggplot() (or other packages, for that matter). Taking the above example:

  • If you were interested in looking at the change of weight across years for each individual, then the long format is more suitable to define each aesthetic of such a graph: aes(x = year, y = weight, colour = record_id).
  • If you were interested in the correlation of this metric between 2021 and 2022, then the wide format would be more suitable: aes(x = yr_2021, y = yr_2022, colour = record_id).

12.4 From long to wide

Let’s illustrate that with a dummy data set, called surveys_join. In this synthetic data set we have weight measurements for individuals across four years: 2021 - 2024. This means that there are four measurements for each record_id.

# Read in the data
surveys_join <- read_csv("data/surveys_join.csv")

# Look at the first few rows
head(surveys_join)
# A tibble: 6 × 3
  record_id  year weight
      <dbl> <dbl>  <dbl>
1       166  2021   195.
2       166  2022   190.
3       166  2023   184.
4       166  2024   182 
5       228  2021   192.
6       228  2022   189 

We can reshape our data from long to wide as follows. Here we do not overwrite the existing data, but instead pipe the result straight into the head() function, so we can see what the pivot_wider() function is doing:

surveys_join |> 
  pivot_wider(names_from = "year",
              values_from = "weight",
              names_prefix = "yr_") |>
  head()
# A tibble: 6 × 5
  record_id yr_2021 yr_2022 yr_2023 yr_2024
      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1       166    195.    190.    184.    182 
2       228    192.    189     188     179 
3       337    181.    176.    177     168.
4       330    172.    170.    165.    157.
5       205    177.    176.    167.    166.
6       878    170.    172     167.    161.

Let’s unpack that a bit.

The pivot_wider() function needs at least the first two of these arguments:

  1. names_from = the column whose values we want to use for our new column names (year)
  2. values_from = the column whose values we want to use to populate these new columns (weight)
  3. names_prefix = a prefix for our column names (optional)

Here we also add names_prefix = "yr_", otherwise the column names would consist of numbers only, which is not good practice.
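
As a quick illustration of why (a small sketch): without the prefix the new columns are named 2021, 2022 and so on, and have to be wrapped in backticks whenever we refer to them:

surveys_join |> 
  pivot_wider(names_from = "year",
              values_from = "weight") |> 
  select(record_id, `2021`) |>   # purely numeric column names need backticks
  head()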

Let’s assign it to a new object and then visualise some of the data.

surveys_wide <- surveys_join |> 
  pivot_wider(names_from = "year",
              values_from = "weight",
              names_prefix = "yr_")
# Read in the data
surveys_join = pd.read_csv("data/surveys_join.csv")

# Look at the first few rows
surveys_join.head()
   record_id  year  weight
0        166  2021   195.4
1        166  2022   189.5
2        166  2023   183.6
3        166  2024   182.0
4        228  2021   191.9
surveys_wide = surveys_join.pivot(
    index = "record_id",
    columns = "year",
    values = "weight"
)

# Add 'yr_' prefix to year columns
surveys_wide.columns = [f"yr_{col}" for col in surveys_wide.columns]

# Reset index to make 'record_id' a column again
surveys_wide = surveys_wide.reset_index()

surveys_wide.head()
   record_id  yr_2021  yr_2022  yr_2023  yr_2024
0        118    199.5    192.8    192.6    187.6
1        160    192.5    189.0    184.3    181.4
2        166    195.4    189.5    183.6    182.0
3        178    190.1    188.4    182.2    177.5
4        184    195.1    186.7    193.5    187.3

We can then use this to visualise possible relationships between the different years:

ggplot(surveys_wide, aes(x = yr_2021, y = yr_2022)) +
  geom_point()
Figure 12.2: Scatterplot of weight for 2021 and 2022

If you were interested in comparisons across all years, you would have to use the original, long format, because there isn't a single column in the wide table that contains all of the year information.

ggplot(surveys_join, aes(x = factor(year), y =  weight)) +
  geom_boxplot()
Figure 12.3: Boxplot of weight for 2021 - 2024
p = (ggplot(surveys_wide, aes(x = "yr_2021", y = "yr_2022")) +
    geom_point())
    
p.show()
Figure 12.4: Scatterplot of weight for 2021 and 2022
p = (ggplot(surveys_join, aes(x = "factor(year)", y = "weight")) +
    geom_boxplot())

p.show()
Figure 12.5: Boxplot of weight for 2021 - 2024

12.5 From wide to long

We can reshape our data from wide to long. This is more or less the inverse of what we did above.

surveys_wide |> 
  pivot_longer(cols = -record_id,
               names_to = "year",
               values_to = "weight",
               names_prefix = "yr_")
# A tibble: 200 × 3
   record_id year  weight
       <dbl> <chr>  <dbl>
 1       166 2021    195.
 2       166 2022    190.
 3       166 2023    184.
 4       166 2024    182 
 5       228 2021    192.
 6       228 2022    189 
 7       228 2023    188 
 8       228 2024    179 
 9       337 2021    181.
10       337 2022    176.
# ℹ 190 more rows

Here we use the following arguments:

  1. cols = tells pivot_longer() which columns to pivot - here we want all of them except the record_id column
  2. names_to = the name of the new column that will hold the original column names (e.g. yr_2021, yr_2022 …)
  3. values_to = the name of the new column that will contain the measured values (here, the weight measurements)
  4. names_prefix = tells it that all the column names share the prefix yr_, which gets removed before the new year column is populated
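
Note that in the output above the year column is now of type character (<chr>), whereas in the original data it was numeric. If you need it as a number again, one way to convert it (a small sketch) is:

surveys_wide |> 
  pivot_longer(cols = -record_id,
               names_to = "year",
               values_to = "weight",
               names_prefix = "yr_") |> 
  mutate(year = as.numeric(year))   # convert the year column back to numbers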

To go back to the long format, we use the .melt() function.

surveys_long = surveys_wide.melt(
    id_vars = "record_id",            # columns to keep fixed
    var_name = "year",                # name of the new 'year' column
    value_name = "weight"             # name of the new 'weight' column
)

surveys_long.head()
   record_id     year  weight
0        118  yr_2021   199.5
1        160  yr_2021   192.5
2        166  yr_2021   195.4
3        178  yr_2021   190.1
4        184  yr_2021   195.1

This uses the following arguments:

  1. id_vars = the identifier column(s) - record_id in our case, which does not get pivoted.
  2. var_name = the name of the new column that will hold the original column names (e.g. yr_2021, yr_2022 …)
  3. value_name = the name of the new column that will contain the measured values (here, the weight measurements)

This then creates a column year that contains the values yr_2021, yr_2022, ..., since we added the prefix. If we want to remove the prefix we can do the following:

# Remove 'yr_' prefix and convert year to integer
surveys_long["year"] = surveys_long["year"].str.replace("yr_", "").astype(int)

surveys_long.head()
   record_id  year  weight
0        118  2021   199.5
1        160  2021   192.5
2        166  2021   195.4
3        178  2021   190.1
4        184  2021   195.1

12.6 Exercises

12.6.1 Reshaping: auxin

Exercise 1 - Reshaping data

Level:

For this exercise we’ll be using the data from data/auxin.csv.

These are synthetic data that measure the plant height of Arabidopsis thaliana, in different genotypes (variants of the same species, here: control and mutant). It measures the effect the plant hormone auxin has on the growth of these plants.

This is done across different auxin concentrations (none, low, high).

Please do the following:

  1. Check the data structure.
  2. Create a “wide” table where, for each replicate_id / genotype pair, we have a column for each auxin concentration category. The data in these columns should be the plant_height measurements.
  3. Use this wide format to calculate the average plant height for each row.
  4. Change the data back to “long” format & check if you have your original data back.

First, let’s read in the data.

auxin <- read_csv("data/auxin.csv")
auxin = pd.read_csv("data/auxin.csv")

1. Data structure

Before we make any changes, it’s important to get a sense of how the data is currently shaped.

head(auxin)
# A tibble: 6 × 4
  replicate_id genotype concentration plant_height
         <dbl> <chr>    <chr>                <dbl>
1            1 control  high                  31.7
2            2 control  high                  33.6
3            3 control  high                  31.3
4            4 control  high                  30.5
5            5 control  high                  30.4
6            6 control  high                  30.1
auxin.head()
   replicate_id genotype concentration  plant_height
0             1  control          high          31.7
1             2  control          high          33.6
2             3  control          high          31.3
3             4  control          high          30.5
4             5  control          high          30.4

We have 56 distinct replicate_id values, each with at least 1 and up to 6 observations.

There are also three distinct concentration categories: high, low and none.

Important

Check this yourself!
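
One way to check this yourself (a sketch using dplyr):

# Number of distinct replicate_id values
auxin |> 
  distinct(replicate_id) |> 
  nrow()

# Number of observations for each replicate_id
auxin |> 
  count(replicate_id)

# The distinct concentration categories
auxin |> 
  distinct(concentration)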

2. From long to wide

So what we want to do is get a table where for each replicate_id and genotype combination we have a separate column for each concentration category. That would allow us, for example, to calculate the average plant height for each measurement across the different concentration types.

The data should look something like:

replicate_id genotype  high   low  none
           1 control   31.7  38.5  42.7
           2 control   33.6  35.9  46.8
           3 control   31.3  38.4  42.8

So, let’s do that.

auxin_wide <- pivot_wider(auxin,
            names_from = concentration,
            values_from = plant_height)

head(auxin_wide)
# A tibble: 6 × 5
  replicate_id genotype  high   low  none
         <dbl> <chr>    <dbl> <dbl> <dbl>
1            1 control   31.7  38.5  42.7
2            2 control   33.6  35.9  46.8
3            3 control   31.3  38.4  42.8
4            4 control   30.5  42.3  46.5
5            5 control   30.4  32.8  46.5
6            6 control   30.1  37.4  45.4
auxin_wide = auxin.pivot(
    index = ["replicate_id", "genotype"],
    columns = "concentration",
    values = "plant_height"
).reset_index()

auxin_wide.head()
concentration  replicate_id genotype  high   low  none
0                         1  control  31.7  38.5  42.7
1                         1   mutant  33.4  41.5  37.6
2                         2  control  33.6  35.9  46.8
3                         2   mutant  30.4  35.7  40.3
4                         3  control  31.3  38.4  42.8

3. Calculating average plant_height

Having the data in this format allows us to do:

auxin_wide |> 
  mutate(avg_height = round((high + low + none) / 3, 1)) |> 
  head()
# A tibble: 6 × 6
  replicate_id genotype  high   low  none avg_height
         <dbl> <chr>    <dbl> <dbl> <dbl>      <dbl>
1            1 control   31.7  38.5  42.7       37.6
2            2 control   33.6  35.9  46.8       38.8
3            3 control   31.3  38.4  42.8       37.5
4            4 control   30.5  42.3  46.5       39.8
5            5 control   30.4  32.8  46.5       36.6
6            6 control   30.1  37.4  45.4       37.6
auxin_wide.assign(
    avg_height = ((auxin_wide["high"] + auxin_wide["low"] + auxin_wide["none"]) / 3).round(1)
).head()
concentration  replicate_id genotype  high   low  none  avg_height
0                         1  control  31.7  38.5  42.7        37.6
1                         1   mutant  33.4  41.5  37.6        37.5
2                         2  control  33.6  35.9  46.8        38.8
3                         2   mutant  30.4  35.7  40.3        35.5
4                         3  control  31.3  38.4  42.8        37.5

4. From wide to long

We can also revert to our original “long” format. The data then has its original shape back, which looks like this:

replicate_id genotype concentration plant_height
           1 control  high                   31.7
           2 control  high                   33.6
           3 control  high                   31.3

So, let’s do that.

auxin_long <- pivot_longer(auxin_wide,
             cols = c("high", "low", "none"),
             names_to = "concentration",
             values_to = "plant_height")
auxin_long = auxin_wide.melt(
    id_vars = ["replicate_id", "genotype"],  # columns to keep as identifiers
    value_vars = ["high", "low", "none"],    # columns to unpivot
    var_name = "concentration",              # new column for old column names
    value_name = "plant_height"              # new column for values
)

However, the eagle-eyed among you might have noticed that there are more rows in our data than we started with:

nrow(auxin)
[1] 275
nrow(auxin_long)
[1] 324
auxin.shape[0]      # original data
275
auxin_long.shape[0] # wide-and-back
324

This is because some of the replicate_id / genotype combinations did not have measured values for all three concentration types. Pivoting to the wide format introduced missing values for these combinations, which then get carried over when going back to the long format.
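
If you want to see which rows are affected, one way to do so (a sketch using dplyr's if_any()) is to filter the wide table for rows with a missing value in any of the concentration columns:

# Rows of the wide table where pivot_wider() introduced missing values
auxin_wide |> 
  filter(if_any(c(high, low, none), is.na)) |> 
  head()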

So, to deal with this, we can simply remove the values where the plant_height value is missing:

auxin_long <- pivot_longer(auxin_wide,
             cols = c("high", "low", "none"),
             names_to = "concentration",
             values_to = "plant_height",
             values_drop_na = TRUE)

nrow(auxin_long)
[1] 275
auxin_long = (auxin_wide.melt(
    id_vars = ["replicate_id", "genotype"],  # columns to keep as identifiers
    value_vars = ["high", "low", "none"],    # columns to unpivot
    var_name = "concentration",              # new column for old column names
    value_name = "plant_height"              # new column for values
)
.dropna(subset = ["plant_height"])
)

auxin_long.shape[0]
275
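
Matching row counts are reassuring, but for a stricter check one possible approach (a sketch using dplyr's anti_join()) is to look for rows that occur in one table but not the other - both counts should be zero if the two tables contain the same observations:

# Rows in the original data that are missing from the reshaped data
auxin |> 
  anti_join(auxin_long,
            by = c("replicate_id", "genotype", "concentration", "plant_height")) |> 
  nrow()

# And the other way around
auxin_long |> 
  anti_join(auxin,
            by = c("replicate_id", "genotype", "concentration", "plant_height")) |> 
  nrow()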

Success!

12.7 Summary

Key points
  • We can reshape data, going from long to wide (and back).
  • Which format you use depends on the aim: consider how you’d plot the data.
  • We can use pivot_wider() (in R) or .pivot() (in pandas) to create a wide-format data set.
  • We can use pivot_longer() (in R) or .melt() (in pandas) to create a long-format data set.