# install if needed
install.packages("naniar")
10 Manipulating rows
- Learn to order and arrange rows.
- Be able to find and retain unique rows.
- Understand how logical operators are used.
- Implement conditional statements to filter specific data.
- Be able to identify and decide how to deal with missing data.
10.1 Context
Data sets can contain large quantities of observations. Often we are only interested in part of the data at a given time. We can deal with this by manipulating rows.
10.2 Section setup
We’ll continue this section with the script named da3-07-manipulating-rows.R
. If needed, add the following code to the top of your script and run it.
In this section we will use a new package, called naniar
, to visualise where we have missing data. Install it as follows:
# A collection of R packages designed for data science
library(tidyverse)
# Package to visualise missing data
library(naniar)
<- read_csv("data/surveys.csv") surveys
We’ll continue this section with the script named da3-07-manipulating-rows.py
. If needed, add the following code to the top of your script and run it.
We’ll use a new package, so install if needed from the terminal:
conda install conda-forge::missingno
Then, if needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
# Load missingno
import missingno as msno
= pd.read_csv("data/surveys.csv") surveys
10.3 Manipulation of observations
10.3.1 Ordering rows
We often want to order data in a certain way, for example ordering by date or in alphabetically. The example below illustrates how we would order data based on weight
:
Let’s illustrate this with the surveys
data set, arranging the data based on year
.
|>
surveys arrange(year)
# A tibble: 35,549 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
If we’d want to arrange the data in descending order (most recent to oldest), we would employ the desc()
helper function:
|>
surveys arrange(desc(year))
# A tibble: 35,549 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 33321 1 12 2002 1 DM M 38 44
2 33322 1 12 2002 1 DO M 37 58
3 33323 1 12 2002 1 PB M 28 45
4 33324 1 12 2002 1 AB <NA> NA NA
5 33325 1 12 2002 1 DO M 35 29
6 33326 1 12 2002 2 OT F 20 26
7 33327 1 12 2002 2 OT M 20 24
8 33328 1 12 2002 2 OT F 21 22
9 33329 1 12 2002 2 DM M 37 47
10 33330 1 12 2002 2 DO M 35 51
# ℹ 35,539 more rows
We can read that bit of code as “take the surveys
data set, send it to the arrange()
function and ask it to arrange the data in descending order (using desc()
) based on the year
column”.
We can also combine this approach with multiple variables, for example arranging data based on descending year and (ascending) hindfoot length:
|>
surveys arrange(desc(year), hindfoot_length)
# A tibble: 35,549 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 33647 3 14 2002 3 PF M 9 8
2 35301 12 8 2002 4 PF M 13 6
3 35506 12 31 2002 6 PF M 13 8
4 34281 6 15 2002 23 RM F 14 9
5 34663 7 14 2002 16 RM M 14 7
6 35101 11 10 2002 9 PF M 14 7
7 35487 12 29 2002 23 RO F 14 13
8 33429 2 9 2002 3 PF M 15 8
9 33535 2 10 2002 13 PF F 15 7
10 33556 2 10 2002 5 RO M 15 9
# ℹ 35,539 more rows
= "year") surveys.sort_values(by
record_id month day year ... species_id sex hindfoot_length weight
0 1 7 16 1977 ... NL M 32.0 NaN
343 344 10 18 1977 ... NL NaN NaN NaN
342 343 10 18 1977 ... PF NaN NaN NaN
341 342 10 18 1977 ... DM M 34.0 25.0
340 341 10 18 1977 ... DS M 50.0 NaN
... ... ... ... ... ... ... ... ... ...
34063 34064 5 16 2002 ... PP F 23.0 18.0
34064 34065 5 16 2002 ... DO M 36.0 32.0
34065 34066 5 16 2002 ... DO M 37.0 29.0
34059 34060 5 16 2002 ... DM M 36.0 55.0
35548 35549 12 31 2002 ... NaN NaN NaN NaN
[35549 rows x 9 columns]
If we’d want to arrange the data in descending order (most recent to oldest), we would specify this with the ascending = False
argument:
= "year", ascending = False) surveys.sort_values(by
record_id month day year ... species_id sex hindfoot_length weight
35548 35549 12 31 2002 ... NaN NaN NaN NaN
34060 34061 5 16 2002 ... PB M 27.0 37.0
34066 34067 5 16 2002 ... DM F 36.0 50.0
34065 34066 5 16 2002 ... DO M 37.0 29.0
34064 34065 5 16 2002 ... DO M 36.0 32.0
... ... ... ... ... ... ... ... ... ...
341 342 10 18 1977 ... DM M 34.0 25.0
342 343 10 18 1977 ... PF NaN NaN NaN
343 344 10 18 1977 ... NL NaN NaN NaN
344 345 11 12 1977 ... DM F 34.0 45.0
0 1 7 16 1977 ... NL M 32.0 NaN
[35549 rows x 9 columns]
We can also combine this approach with multiple variables, for example arranging data based on descending year and (ascending) hindfoot length:
surveys.sort_values(= ["year", "hindfoot_length"],
by = [False, True]
ascending )
record_id month day year ... species_id sex hindfoot_length weight
33646 33647 3 14 2002 ... PF M 9.0 8.0
35300 35301 12 8 2002 ... PF M 13.0 6.0
35505 35506 12 31 2002 ... PF M 13.0 8.0
34280 34281 6 15 2002 ... RM F 14.0 9.0
34662 34663 7 14 2002 ... RM M 14.0 7.0
... ... ... ... ... ... ... ... ... ...
494 495 12 11 1977 ... NL NaN NaN NaN
496 497 12 11 1977 ... OT NaN NaN NaN
499 500 12 11 1977 ... OT NaN NaN NaN
500 501 12 11 1977 ... OT NaN NaN NaN
502 503 12 11 1977 ... NaN NaN NaN NaN
[35549 rows x 9 columns]
10.3.2 Finding unique values
Sometimes it is useful to retain rows with unique combinations of some of our variables (i.e. remove any duplicated rows).
This can be done with the distinct()
function.
|>
surveys distinct(species_id, year)
# A tibble: 535 × 2
species_id year
<chr> <dbl>
1 NL 1977
2 DM 1977
3 PF 1977
4 PE 1977
5 DS 1977
6 PP 1977
7 SH 1977
8 OT 1977
9 DO 1977
10 OX 1977
# ℹ 525 more rows
We can do this by specifying which column we’d like to get the unique values from (here, we’re using species_id
and year
as an example). We then use .drop_duplicates()
to remove all the duplicate values:
"species_id", "year"]].drop_duplicates() surveys[[
species_id year
0 NL 1977
2 DM 1977
5 PF 1977
6 PE 1977
10 DS 1977
... ... ...
34574 UR 2002
34778 OL 2002
35032 SF 2002
35510 DX 2002
35511 US 2002
[535 rows x 2 columns]
10.3.3 Filtering by condition
Often we want to filter our data based on specific conditions / properties in the data. For example, in our data set you might want to filter for certain years, a specific weight range or only get all the observations for the first 100 record IDs.
Before we delve into this, it is important to understand that when we set a condition like above, the output is a logical vector. Let’s see an example using a small vector.
<- c(1985, 1990, 1999, 1995, 2010, 2000)
some_years < 2000 some_years
[1] TRUE TRUE TRUE TRUE FALSE FALSE
For this example we’ll keep using Pandas, for consistency. Since we’re only dealing with a simple, one-dimensional bunch of data, we create a Series
:
= pd.Series([1985, 1990, 1999, 1995, 2010, 2000])
some_years = some_years < 2000
result
result
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
It is possible to combine several conditions by using the logical operators &
(AND) and |
(OR). For example, if we wanted the years between 1990 and 2000:
# both conditions have to be true
> 1990 & some_years < 2000 some_years
[1] FALSE FALSE TRUE TRUE FALSE FALSE
And if we wanted the years below 1990 or above 2000, then:
# only one or the other of the conditions has to be true
< 1990 | some_years > 2000 some_years
[1] TRUE FALSE FALSE FALSE TRUE FALSE
= (some_years > 1990) & (some_years < 2000)
result
result
0 False
1 False
2 True
3 True
4 False
5 False
dtype: bool
And if we wanted the years below 1990 or above 2000, then:
= (some_years < 1990) | (some_years > 2000)
result
result
0 True
1 False
2 False
3 False
4 True
5 False
dtype: bool
This concept is also applied to tables. We could filter across all rows in the surveys
data set, for a hindfoot_length
of larger than 31 mm:
hindfoot_length > 31
). If the outcome is TRUE
then the row is returned.
We could then do this as follows:
|>
surveys filter(hindfoot_length > 31)
# A tibble: 15,729 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 8 7 16 1977 1 DM M 37 NA
7 9 7 16 1977 1 DM F 34 NA
8 11 7 16 1977 5 DS F 53 NA
9 12 7 16 1977 7 DM M 38 NA
10 13 7 16 1977 3 DM M 35 NA
# ℹ 15,719 more rows
"hindfoot_length"] > 31] surveys[surveys[
record_id month day year ... species_id sex hindfoot_length weight
0 1 7 16 1977 ... NL M 32.0 NaN
1 2 7 16 1977 ... NL M 33.0 NaN
2 3 7 16 1977 ... DM F 37.0 NaN
3 4 7 16 1977 ... DM M 36.0 NaN
4 5 7 16 1977 ... DM M 35.0 NaN
... ... ... ... ... ... ... .. ... ...
35533 35534 12 31 2002 ... DM M 37.0 56.0
35534 35535 12 31 2002 ... DM M 37.0 53.0
35535 35536 12 31 2002 ... DM F 35.0 42.0
35536 35537 12 31 2002 ... DM F 36.0 46.0
35547 35548 12 31 2002 ... DO M 36.0 51.0
[15729 rows x 9 columns]
This then only keeps the observations where hindfoot_length > 31
, in this case 15729 observations.
To set filtering conditions, use the following relational operators:
>
is greater than>=
is greater than or equal to<
is less than<=
is less than or equal to==
is equal to!=
is different from%in%
is contained in
To combine conditions, use the following logical operators:
&
AND|
OR
Some functions return logical results and can be used in filtering operations:
is.na(x)
returns TRUE if a value in x is missing
The !
can be used to negate a logical condition:
!is.na(x)
returns TRUE if a value in x is NOT missing!(x %in% y)
returns TRUE if a value in x is NOT present in y
To set filtering conditions, use the following relational operators:
>
is greater than>=
is greater than or equal to<
is less than<=
is less than or equal to==
is equal to!=
is different from.isin([...])
is contained in
To combine conditions, use the following logical operators:
&
AND|
OR
Some functions return logical results and can be used in filtering operations:
df["x"].isna()
returns True if a value in x is missing
The ~
(bitwise NOT) can be used to negate a logical condition:
~df["x"].isna()
returns True if a value in x is NOT missing~df["x"].isin(["y"])
returns True if a value in x is NOT present in"y"
10.3.4 Missing data revisited
It’s important to carefully consider how to deal with missing data, as we have previously seen. It’s easy enough to filter out all rows that contain missing data, however this is rarely the best course of action, because you might accidentally throw out useful data in columns that you’ll need later.
Furthermore, it’s often a good idea to see if there is any structure in your missing data. Maybe certain variables are consistently absent, which could tell you something about your data.
We could filter out all the missing weight
values as follows:
|>
surveys filter(!is.na(weight))
# A tibble: 32,283 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 63 8 19 1977 3 DM M 35 40
2 64 8 19 1977 7 DM M 37 48
3 65 8 19 1977 4 DM F 34 29
4 66 8 19 1977 4 DM F 35 46
5 67 8 19 1977 7 DM M 35 36
6 68 8 19 1977 8 DO F 32 52
7 69 8 19 1977 2 PF M 15 8
8 70 8 19 1977 3 OX F 21 22
9 71 8 19 1977 7 DM F 36 35
10 74 8 19 1977 8 PF M 12 7
# ℹ 32,273 more rows
=["weight"]) surveys.dropna(subset
record_id month day year ... species_id sex hindfoot_length weight
62 63 8 19 1977 ... DM M 35.0 40.0
63 64 8 19 1977 ... DM M 37.0 48.0
64 65 8 19 1977 ... DM F 34.0 29.0
65 66 8 19 1977 ... DM F 35.0 46.0
66 67 8 19 1977 ... DM M 35.0 36.0
... ... ... ... ... ... ... .. ... ...
35540 35541 12 31 2002 ... PB F 24.0 31.0
35541 35542 12 31 2002 ... PB F 26.0 29.0
35542 35543 12 31 2002 ... PB F 27.0 34.0
35546 35547 12 31 2002 ... RM F 15.0 14.0
35547 35548 12 31 2002 ... DO M 36.0 51.0
[32283 rows x 9 columns]
We can combine this for multiple columns:
|>
surveys filter(!is.na(weight) & !is.na(hindfoot_length))
# A tibble: 30,738 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 63 8 19 1977 3 DM M 35 40
2 64 8 19 1977 7 DM M 37 48
3 65 8 19 1977 4 DM F 34 29
4 66 8 19 1977 4 DM F 35 46
5 67 8 19 1977 7 DM M 35 36
6 68 8 19 1977 8 DO F 32 52
7 69 8 19 1977 2 PF M 15 8
8 70 8 19 1977 3 OX F 21 22
9 71 8 19 1977 7 DM F 36 35
10 74 8 19 1977 8 PF M 12 7
# ℹ 30,728 more rows
= ["weight", "hindfoot_length"]) surveys.dropna(subset
record_id month day year ... species_id sex hindfoot_length weight
62 63 8 19 1977 ... DM M 35.0 40.0
63 64 8 19 1977 ... DM M 37.0 48.0
64 65 8 19 1977 ... DM F 34.0 29.0
65 66 8 19 1977 ... DM F 35.0 46.0
66 67 8 19 1977 ... DM M 35.0 36.0
... ... ... ... ... ... ... .. ... ...
35540 35541 12 31 2002 ... PB F 24.0 31.0
35541 35542 12 31 2002 ... PB F 26.0 29.0
35542 35543 12 31 2002 ... PB F 27.0 34.0
35546 35547 12 31 2002 ... RM F 15.0 14.0
35547 35548 12 31 2002 ... DO M 36.0 51.0
[30738 rows x 9 columns]
We can also combine that with other filters.
|>
surveys filter(!is.na(weight) & !is.na(hindfoot_length)) |>
filter(hindfoot_length > 40)
# A tibble: 2,045 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 357 11 12 1977 9 DS F 50 117
2 362 11 12 1977 1 DS F 51 121
3 367 11 12 1977 20 DS M 51 115
4 377 11 12 1977 9 DS F 48 120
5 381 11 13 1977 17 DS F 48 118
6 383 11 13 1977 11 DS F 52 126
7 385 11 13 1977 17 DS M 50 132
8 392 11 13 1977 11 DS F 53 122
9 394 11 13 1977 4 DS F 48 107
10 398 11 13 1977 4 DS F 50 115
# ℹ 2,035 more rows
To do this, we might prefer to use a slightly syntax: the .notna()
. This allows us to chain operations a bit cleaner, making our code easier to read:
surveys["weight"].notna()) &
(surveys["hindfoot_length"].notna()) &
(surveys["hindfoot_length"] > 40)
(surveys[ ]
record_id month day year ... species_id sex hindfoot_length weight
356 357 11 12 1977 ... DS F 50.0 117.0
361 362 11 12 1977 ... DS F 51.0 121.0
366 367 11 12 1977 ... DS M 51.0 115.0
376 377 11 12 1977 ... DS F 48.0 120.0
380 381 11 13 1977 ... DS F 48.0 118.0
... ... ... ... ... ... ... .. ... ...
29572 29573 5 15 1999 ... DS M 50.0 96.0
29705 29706 6 12 1999 ... DS M 49.0 102.0
30424 30425 3 4 2000 ... DO F 64.0 35.0
30980 30981 7 1 2000 ... DO F 42.0 46.0
33367 33368 1 12 2002 ... PB M 47.0 27.0
[2045 rows x 9 columns]
Often it’s not that easy to get a sense of how missing data are distributed in the data set. We can use summary statistics and visualisations to get a better sense.
The easiest way of getting some numbers on the missing data is by using the summary()
function, which will report the number of NA
’s for each column (for example: see the hindfoot_length
column):
summary(surveys)
record_id month day year plot_id
Min. : 1 Min. : 1.000 Min. : 1.00 Min. :1977 Min. : 1.0
1st Qu.: 8888 1st Qu.: 4.000 1st Qu.: 9.00 1st Qu.:1984 1st Qu.: 5.0
Median :17775 Median : 6.000 Median :16.00 Median :1990 Median :11.0
Mean :17775 Mean : 6.478 Mean :15.99 Mean :1990 Mean :11.4
3rd Qu.:26662 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1997 3rd Qu.:17.0
Max. :35549 Max. :12.000 Max. :31.00 Max. :2002 Max. :24.0
species_id sex hindfoot_length weight
Length:35549 Length:35549 Min. : 2.00 Min. : 4.00
Class :character Class :character 1st Qu.:21.00 1st Qu.: 20.00
Mode :character Mode :character Median :32.00 Median : 37.00
Mean :29.29 Mean : 42.67
3rd Qu.:36.00 3rd Qu.: 48.00
Max. :70.00 Max. :280.00
NA's :4111 NA's :3266
Often it’s nice to visualise where your missing values are, to see if there are any patterns that are obvious. There are several packages in R that can do this, of which naniar
is one.
# install if needed
install.packages("naniar")
# load the library
library(naniar)
The easiest way of counting the number of missing values in Python is by combining .isna()
and .sum()
:
# count missing values
sum() surveys.isna().
record_id 0
month 0
day 0
year 0
plot_id 0
species_id 763
sex 2511
hindfoot_length 4111
weight 3266
dtype: int64
Often it’s nice to visualise where your missing values are, to see if there are any patterns that are obvious. There are several packages in Python that can do this, of which missingno
is one.
import missingno as msno
10.4 Summary
- We use the
arrange()
function to order data, and can reverse the order usingdesc()
. - Unique rows can be retained with
distinct()
. - We use
filter()
to choose rows based on conditions. - Conditions can be set using several operators:
>
,>=
,<
,<=
,==
,!=
,%in%
. - Conditions can be combined using
&
(AND) and|
(OR). - The function
is.na()
can be used to identify missing values. It can be negated as!is.na()
to find non-missing values. - We can visualise missing data using the
vis_miss()
function from thenaniar
package.
- We can use the
.sort_values()
method to order data, specifying theascending =
argument asTrue
orFalse
to control the order. - The
.drop_duplicates()
method allows us to retain unique rows. - We use subsetting (e.g.
surveys[surveys["hindfoot_length"] > 31]
) together with conditions to filter data. – Conditions can be set using several operators:>
,>=
,<
,<=
,==
,!=
,.isin([...])
. - Conditions can be combined using
&
(AND) and|
(OR). - The function
.isna()
can be used to identify missing values. It can be negated using~
to find non-missing values (e.g.surveys[~surveys["weight"].isna()]
. - We can visualise missing data using the
msno.matrix
function from themissingno
package.