14  Cleaning data

Learning objectives

14.1 Context

Data is often messy when you first receive it, so it is useful to know when and how to change it before you start your analysis.

14.2 Section setup

If you are working in R, we’ll continue this section with the script named da4-11-cleaning-data.R. If needed, add the following code to the top of your script and run it.

# A collection of R packages designed for data science
library(tidyverse)

surveys <- read_csv("data/surveys.csv")
plot_types <- read_csv("data/plots.csv")

If you are working in Python, we’ll continue this section with the script named da4-11-cleaning-data.py. If needed, add the following code to the top of your script and run it.

# A Python data analysis and manipulation tool
import pandas as pd

# Python equivalent of `ggplot2`
from plotnine import *

surveys = pd.read_csv("data/surveys.csv")
plot_types = pd.read_csv("data/plots.csv")

14.3 Cleaning data

14.3.1 Changing plot_id

In the earlier example we saw that using bare numbers for plot_id is not good practice: the numbers are labels and carry no numerical meaning.

It would be better to encode them in the format plot_xxx, where xxx is the plot number padded with leading zeros, so that the values still sort in numerical order when they are treated as text.

We can do that as follows:

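# Pad each plot number to three digits and add the "plot_" prefix
# (the result is printed here, not saved back to plot_types)
plot_types |>
  mutate(plot_id = paste0("plot_", sprintf("%03d", plot_id)))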
# A tibble: 24 × 2
   plot_id  plot_type                
   <chr>    <chr>                    
 1 plot_001 Spectab exclosure        
 2 plot_002 Control                  
 3 plot_003 Long-term Krat Exclosure 
 4 plot_004 Control                  
 5 plot_005 Rodent Exclosure         
 6 plot_006 Short-term Krat Exclosure
 7 plot_007 Rodent Exclosure         
 8 plot_008 Control                  
 9 plot_009 Spectab exclosure        
10 plot_010 Rodent Exclosure         
# ℹ 14 more rows
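
In Python, a roughly equivalent approach with pandas could look like the sketch below. It uses a small hand-made table in place of data/plots.csv (an assumption, so that the example runs on its own), and the names plot_types_clean, unpadded and padded are only for illustration.

```python
import pandas as pd

# Toy stand-in for the table read from data/plots.csv (assumption)
plot_types = pd.DataFrame({
    "plot_id": [1, 2, 10],
    "plot_type": ["Spectab exclosure", "Control", "Rodent Exclosure"],
})

# Convert the numbers to text, pad them to three digits and add the prefix
plot_types_clean = plot_types.assign(
    plot_id="plot_" + plot_types["plot_id"].astype(str).str.zfill(3)
)

# Without padding, text labels sort alphabetically, so plot_10 lands before plot_2
unpadded = sorted("plot_" + plot_types["plot_id"].astype(str))

# With padding, alphabetical order and numerical order agree
padded = sorted(plot_types_clean["plot_id"])

print(plot_types_clean)
print(unpadded)
print(padded)
```

As in the R version, the original plot_types table is left unchanged; the recoded values live in a new object.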

Note: if you want to combine the data from these tables, you also have to recode the plot_id column in the surveys data set in the same way. Otherwise the values in the two tables will no longer match.
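
One way to keep the two tables consistent is to apply the same recoding to both before combining them. The sketch below is a minimal illustration in pandas: format_plot_id is a hypothetical helper, and the two small tables are made-up stand-ins for surveys and plot_types rather than the real data.

```python
import pandas as pd


def format_plot_id(ids: pd.Series) -> pd.Series:
    """Hypothetical helper: turn plot numbers into 'plot_xxx' labels."""
    return "plot_" + ids.astype(str).str.zfill(3)


# Toy stand-ins for the real tables (assumption)
plot_types = pd.DataFrame({
    "plot_id": [1, 2],
    "plot_type": ["Spectab exclosure", "Control"],
})
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "plot_id": [2, 1, 2],
})

# Apply the same recoding to both tables so the keys still match
plot_types["plot_id"] = format_plot_id(plot_types["plot_id"])
surveys["plot_id"] = format_plot_id(surveys["plot_id"])

# Add the plot type to every survey record
combined = surveys.merge(plot_types, on="plot_id", how="left")
print(combined)
```

If only one of the tables were recoded, the merge would be comparing numbers with text labels, which is a quick way to end up with an error or with no matching rows.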

14.3.2 Variable naming

Learning objective: variable naming (janitor package)

14.3.3 Encoding issues

Learning objective: encoding issues

14.3.4 Missing data

Learning objective: dealing with missing data

14.4 Summary

Key points