# A collection of R packages designed for data science
library(tidyverse)
surveys <- read_csv("data/surveys.csv")
plot_types <- read_csv("data/plots.csv")
14 Cleaning data
14.1 Context
Data is often messy and needs cleaning before you can work with it, so it is useful to know when and how to make changes to your data.
14.2 Section setup
We’ll continue this section with the script named da4-11-cleaning-data.R. If needed, add the following code to the top of your script and run it.
We’ll continue this section with the script named da4-11-cleaning-data.py. If needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
surveys = pd.read_csv("data/surveys.csv")
plot_types = pd.read_csv("data/plots.csv")
14.3 Cleaning data
14.3.1 Changing plot_id
In the example above we saw that it wasn’t great practice to use bare numbers to indicate plot_id, since they are labels and have no numerical meaning. It would be better to encode them in the format plot_xxx, where xxx is a number with leading zeros (so that the values sort nicely). We can do that as follows:
plot_types |>
  mutate(plot_id = paste0("plot_", sprintf("%03d", plot_id)))
# A tibble: 24 × 2
plot_id plot_type
<chr> <chr>
1 plot_001 Spectab exclosure
2 plot_002 Control
3 plot_003 Long-term Krat Exclosure
4 plot_004 Control
5 plot_005 Rodent Exclosure
6 plot_006 Short-term Krat Exclosure
7 plot_007 Rodent Exclosure
8 plot_008 Control
9 plot_009 Spectab exclosure
10 plot_010 Rodent Exclosure
# ℹ 14 more rows
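The same change can be sketched in Python with pandas. This is a minimal example and not necessarily the approach used elsewhere in these materials: it assumes plot_id holds non-negative whole numbers, and it builds a small stand-in table instead of reading data/plots.csv so that it runs on its own.

```python
import pandas as pd

# Small stand-in for plot_types (assumption: the real table comes from data/plots.csv)
plot_types = pd.DataFrame({
    "plot_id": [1, 2, 10],
    "plot_type": ["Spectab exclosure", "Control", "Rodent Exclosure"],
})

# Convert the numbers to text, pad them to three digits, then add the prefix
plot_types["plot_id"] = "plot_" + plot_types["plot_id"].astype(str).str.zfill(3)

print(plot_types)

# Why the leading zeros matter: text labels sort character by character
print(sorted(["plot_1", "plot_2", "plot_10"]))  # unpadded: plot_10 lands before plot_2
print(sorted(plot_types["plot_id"]))            # padded: numeric order is preserved
```

Here str.zfill(3) plays the role of sprintf("%03d", ...) in the R version.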
Note: this means that you would also have to change the plot_id column values in the surveys data set if you wanted to combine the data from these tables!
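To illustrate the note, here is a hedged pandas sketch that applies the same recoding to both tables before combining them. The helper to_plot_label is hypothetical (it is not part of pandas or of the course code), and the two small tables, including the record_id column, are illustrative stand-ins for the real files.

```python
import pandas as pd

def to_plot_label(ids: pd.Series) -> pd.Series:
    """Hypothetical helper: turn numeric plot IDs into labels such as plot_007."""
    return "plot_" + ids.astype(str).str.zfill(3)

# Illustrative stand-ins for the real tables
plot_types = pd.DataFrame({
    "plot_id": [1, 2],
    "plot_type": ["Spectab exclosure", "Control"],
})
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "plot_id": [2, 1, 2],
})

# Recode the key column in BOTH tables so that the values still match
plot_types["plot_id"] = to_plot_label(plot_types["plot_id"])
surveys["plot_id"] = to_plot_label(surveys["plot_id"])

# A left join keeps every survey record and attaches its plot type
combined = surveys.merge(plot_types, on="plot_id", how="left")
print(combined)
```

If only one of the tables were recoded, the keys would no longer match: depending on the pandas version, the merge would either raise an error about mismatched column types or leave plot_type empty for every record.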
14.3.2 Variable naming
LO: variable naming (janitor package)
14.3.3 Encoding issues
LO: encoding issues
14.3.4 Missing data
LO: dealing with missing data