5  Data types & structures

Tip: Learning objectives
  • Become familiar with the most common data types.
  • Know about basic data structures.
  • Create, use and make changes to objects.
  • Create, use and make changes to collections of data.
  • Deal with missing data.

5.1 Context

We’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We’ll also briefly mention other data structures, such as matrices and arrays.

5.2 Explained: Data types & structures

Computers are picky when it comes to data and they like consistency. As such, it’s good to be aware of the fact that data can be viewed or interpreted in different ways by the computer.

For example, you might have research data where the presence or absence of a tumour is scored. This would often be recorded as 0 when absent and 1 when present. Your computer views these values as numbers and would happily calculate the average of those values. Not ideal, because a tumour being, on average, 0.3 present makes no sense!

So, it is important to spend a bit of time looking at your data to make sure that the computer sees it in the correct way.
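A minimal sketch of this pitfall in Python, using made-up tumour scores (the variable names and values are illustrative assumptions, not course data):

```python
from collections import Counter

# Hypothetical scores for 10 samples: 0 = tumour absent, 1 = tumour present
tumour_scores = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Python treats the scores as numbers and will average them without complaint
average = sum(tumour_scores) / len(tumour_scores)
print(average)

# Relabelling the scores as categories makes the intent explicit,
# and counting each category is a more meaningful summary
labels = ["present" if score == 1 else "absent" for score in tumour_scores]
counts = Counter(labels)
print(counts)
```

The average is technically computable, but the counts per category are what we would actually want to report.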

5.2.1 Quantitative data

Discrete data

Discrete data are numerical data that can only take distinct values. They can be counted and only take whole numbers. Examples of discrete data include the number of planets in a solar system or the number of questions answered on an exam.

Description
The number of questions answered on an exam (e.g. 12 out of 20)
If somebody has completed a survey (binary data; yes/no)
The number of students in a class (e.g. 20, 32)

Continuous data

Continuous data can take any value within a given range. These data can be measured and can include decimals or fractions.

Description
Temperature of a liquid (e.g. 20 °C)
Height of people in a cohort (e.g. 168 cm)
Average heart rate in a patient (e.g. 70 beats per minute)
Water levels in an aquifer (e.g. 2.4 metres)

5.2.2 Qualitative data

Qualitative data are data that describe qualities which can’t be measured or quantified numerically. We can roughly split these data into two types: ones with an inherent order to them, and ones without.

Nominal data: categories

These are categorical data that represent categories or distinct groups, without any inherent order or ranking.

Description
Eye colour (e.g. blue, brown)
Education level (e.g. primary school, secondary school)
Treatment group (e.g. control, treatment)

Ordinal data: categories with ranking or ordering

Ordinal data are similar to nominal data, in that they represent different categories or groups. However, these also have an inherent ordering to them.

Description
Rating scale (e.g., 1 to 5 stars for difficulty levels)
Rank or position (e.g., 1st, 2nd, 3rd place in a tournament)
Order or progression (e.g., low, medium, high priority)

5.2.3 Getting the computer to see data the right way

In general, computers can view these different types of data in specific ways.

R has the following main data types:

Data type   Description
numeric     Represents numbers; can be whole (integers) or decimals (e.g., 19 or 2.73).
integer     Specific type of numeric data; can only be an integer (e.g., 7L, where L indicates an integer).
character   Also called text or string (e.g., "Rabbits are great!").
logical     Also called boolean values; takes either TRUE or FALSE.
factor      A type of categorical data that can have inherent ordering (e.g., low, medium, high).

Python has the following main data types:

Data type   Description
int         Specific type of numeric data; can only be an integer (e.g., 7 or 56).
float       Decimal numbers (e.g., 3.92 or 9.824).
str         Text or string data (e.g., "Rabbits are great!").
bool        Logical or boolean values; takes either True or False.
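As a quick illustration of the Python types listed above, the following sketch asks Python how it classifies a few example values (the `examples` list is made up for this purpose):

```python
# Check how Python classifies a few example values
examples = [7, 3.92, "Rabbits are great!", True]

for value in examples:
    print(repr(value), "->", type(value).__name__)

# Collect the type names so we can inspect them in one go
type_names = [type(value).__name__ for value in examples]
print(type_names)
```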

5.2.4 Data structures

In the section on running code we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We’ve just seen that we can have different types of data. We can save these into different data structures. Which data structure you need is often determined by the type of data and the complexity.

In the following sections we look at simple data structures.

5.3 Objects

We can store values into objects. To do this, we assign values to them. An object acts as a container for that value.

To create an object, we need to give it a name followed by the assignment operator and the value we want to give it, for example:

temperature <- 23

We can read the code as: the value 23 is assigned (<-) to the object temperature. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).

When assigning a value to an object, R does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.

In Python, the equivalent is:

temperature = 23

We can read the code as: the value 23 is assigned (=) to the object temperature.

When assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.

Important: The assignment operator

We use an assignment operator to assign values on the right to objects on the left.

In R we use <- as the assignment operator.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

Note

Although R also supports the use of = as an assignment operator, there are some very slight differences in their use, as illustrated here. Generally, people just stick to <- for those reasons.

In Python we use = as the assignment operator.

Objects can be given almost any name such as x, current_temperature, or subject_id. You want the object names to be explicit and short. There are some exceptions / considerations (see below).

Warning: Restrictions on object names

Object names can contain letters, numbers and underscores. R also allows periods, but Python does not.

They cannot start with a number nor contain spaces. Different people use different conventions for long variable names, two common ones being:

Underscore: my_long_named_object

Camel case: myLongNamedObject

What you use is up to you, but be consistent. Programming languages are case-sensitive so temperature is different from Temperature.

  • Some names are reserved words or keywords, because they are part of the language itself (e.g., if, else, for, see R or Python for a complete list).
  • Avoid using the names of existing functions or built-in objects (e.g., c, T, mean, data, df, weights), even if allowed. If in doubt, check the help to see if the name is already in use.
  • Avoid full-stops (.) within an object name as in my.data. Full-stops often have meaning in programming languages, so it’s best to avoid them.
  • Use consistent styling.

Whatever style you use, be consistent!
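In Python, these naming rules can be checked with code. The sketch below uses the built-in `str.isidentifier()` method and the standard `keyword` module; the helper `is_valid_name` and the candidate names are hypothetical, chosen only for illustration. Note that it tests Python's rules only (R, for instance, does allow periods).

```python
import keyword

# Candidate object names (hypothetical examples)
candidates = ["subject_id", "myLongNamedObject", "2nd_value",
              "my data", "my.data", "for"]

def is_valid_name(name):
    """Return True if `name` can be used as a Python object name."""
    # Must follow the identifier rules and must not be a reserved keyword
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_valid_name(name) for name in candidates}
for name, ok in results.items():
    print(f"{name!r}: {'valid' if ok else 'not valid'}")
```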

5.3.1 Using objects

Now that we have the temperature in memory, we can use it to perform operations. For example, this might be the temperature in Celsius and we might want to convert it to Kelvin.

To do this, we need to add 273.15:

In R:

temperature + 273.15
[1] 296.15

In Python:

temperature + 273.15
296.15

We can change an object’s value by assigning a new one:

In R:

temperature <- 36
temperature + 273.15
[1] 309.15

In Python:

temperature = 36
temperature + 273.15
309.15

Finally, assigning a value to one object does not change the values of other objects. For example, let’s store the outcome in Kelvin into a new object temp_K:

In R:

temp_K <- temperature + 273.15

In Python:

temp_K = temperature + 273.15

Changing the value of temperature does not change the value of temp_K.

In R:

temperature <- 14
temp_K
[1] 309.15

In Python:

temperature = 14
temp_K
309.15

5.4 Collections of data

In the examples above we have stored single values into an object. Of course we often have to deal with more than that. Generally speaking, we can create collections of data. This enables us to organise our data, for example by creating a collection of numbers or text values.

Creating a collection of data is pretty straightforward, particularly if you are doing it manually. Conceptually, we can think of these collections in four distinct ways, based on the type of data they contain. We’ll cover tabular data in the next chapter.

Collection          R                   Python
1D homogeneous      vector              NumPy array (1D)
2D homogeneous      matrix / array      NumPy array (2D)
General container   list                list / tuple
Tabular (mixed)     data.frame/tibble   pandas DataFrame

Important: Having a type

Different data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.

We can use the class() function to check what type of object we’re dealing with.

class(temp_K)
[1] "numeric"

We can use the type() function to check what type of object we’re dealing with.

type(temp_K)
<class 'float'>

5.4.1 Homogeneous (1D)

The simplest collection of data in R is called a vector. This really is the workhorse of R.

A vector is composed of a series of values, which can be numbers, text or any of the data types described. However, they are expected to all be of the same type.

We can assign a series of values to a vector using the c() function. For example, we can create a vector of temperatures and assign it to a new object temp_c:

temp_c <- c(23, 24, 31, 27, 18, 21)

temp_c        # check object contents
[1] 23 24 31 27 18 21
class(temp_c) # check object type
[1] "numeric"

A vector can also contain text. For example, let’s create a vector that contains weather descriptions:

weather <- c("sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy")

weather        # check object contents 
[1] "sunny"         "cloudy"        "partial_cloud" "cloudy"       
[5] "sunny"         "rainy"        
class(weather) # check object type
[1] "character"

In Python, NumPy arrays are very efficient for numerical computing, so they are widely used. We can access NumPy as follows:

import numpy as np

Next, we can create a simple NumPy array that contains numbers.

temp_c = np.array([23, 24, 31, 27, 18, 21])

temp_c       # check object contents
array([23, 24, 31, 27, 18, 21])
type(temp_c) # check object type
<class 'numpy.ndarray'>

We can do something similar using text / character strings:

weather = np.array(["sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy"])

weather       # check object contents
array(['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy'],
      dtype='<U13')
type(weather) # check object type
<class 'numpy.ndarray'>

Note that when we define text (e.g. "cloudy" or "sunny"), we need to use quotes.

When we deal with numbers - whole or decimal (e.g. 23, 18.5) - we do not use quotes.

5.4.2 Homogeneous (2D)

Often we have more than just one set of data points. Following our temperature example, let’s say we measured minimum and maximum temperatures across a range of days.

We could arrange them in two columns, one with the minimum and one with the maximum values. All of the data is of the same type: numerical.

In R we can do this by creating an array / matrix.

temps <- array(c(18, 20, 25, 22, 15, 17,
                 23, 24, 31, 27, 18, 21),
               dim = c(6, 2))

temps        # check object contents
     [,1] [,2]
[1,]   18   23
[2,]   20   24
[3,]   25   31
[4,]   22   27
[5,]   15   18
[6,]   17   21
class(temps) # check object type
[1] "matrix" "array" 

In Python we can do this by creating a NumPy array:

temps = np.array([
    [18, 23],
    [20, 24],
    [25, 31],
    [22, 27],
    [15, 18],
    [17, 21]
])

temps       # check object contents
array([[18, 23],
       [20, 24],
       [25, 31],
       [22, 27],
       [15, 18],
       [17, 21]])
temps.shape # check object dimensions
(6, 2)
type(temps) # check object type
<class 'numpy.ndarray'>
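One reason to keep same-type data together in a 2D array is that we can do calculations on whole columns at once. The sketch below assumes, as in the text, that the first column holds the minimum and the second the maximum temperatures; the names `daily_range` and `column_means` are illustrative.

```python
import numpy as np

# Same values as the temps array: column 0 = minimum, column 1 = maximum
temps = np.array([[18, 23], [20, 24], [25, 31],
                  [22, 27], [15, 18], [17, 21]])

daily_range = temps[:, 1] - temps[:, 0]  # maximum minus minimum, per day
column_means = temps.mean(axis=0)        # mean of each column

print(daily_range)
print(column_means)
```

Here `temps[:, 0]` selects every row of the first column; remember that Python starts counting at zero.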

5.4.3 General container

In R we can use a list to store different types of data - which do not need to be of the same length (this is different to tabular data, which we’ll cover in the next chapter).

Have a look at the following example:

list_example <- list(
  temperature = c(18, 20, 25, 22, 15, 17),     # numeric vector
  weather     = c("sunny", "cloudy", "rainy"), # character vector
  flag        = TRUE,                          # logical
  note        = "Weather observations"         # string
)

list_example        # check object contents
$temperature
[1] 18 20 25 22 15 17

$weather
[1] "sunny"  "cloudy" "rainy" 

$flag
[1] TRUE

$note
[1] "Weather observations"
class(list_example) # check object type
[1] "list"

This returns all the individual parts of the list. We won’t work much with lists in this course, but you’re likely to encounter them in the future - for example if you’re doing statistical analysis.

General data containers in Python can be either a list or a tuple. Both can hold items of the same or different types. The difference between the two is that a list can be changed (mutable), whereas a tuple cannot be changed after it’s created (immutable).

We can assign a collection of numbers to a list:

temp_c = [23, 24, 31, 27, 18, 21]

temp_c       # check object contents
[23, 24, 31, 27, 18, 21]
type(temp_c) # check object type
<class 'list'>

A list can also contain text. For example, let’s create a list that contains weather descriptions:

weather = ["sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy"]

weather       # check object contents
['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
type(weather) # check object type
<class 'list'>

We can also create a tuple. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use ( ) round brackets instead of [ ] square brackets:

temp_c_tuple = (23, 24, 31, 27, 18, 21)

temp_c_tuple       # check object contents
(23, 24, 31, 27, 18, 21)
type(temp_c_tuple) # check object type
<class 'tuple'>
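To make the difference between mutable and immutable concrete, here is a small sketch that tries to change a value in both a list and a tuple. The `tuple_changed` flag is only there to record what happened.

```python
temp_c = [23, 24, 31, 27, 18, 21]        # list: mutable
temp_c_tuple = (23, 24, 31, 27, 18, 21)  # tuple: immutable

temp_c[0] = 25     # replacing a value in a list works
temp_c.append(19)  # so does adding a value to the end

try:
    temp_c_tuple[0] = 25  # the same operation on a tuple raises an error
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("Cannot change a tuple:", error)

print(temp_c)
print(temp_c_tuple)
```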

5.5 Type coercion

Type coercion occurs when there is more than one type of data (e.g. numerical, text, logical) in an object that expects all data to be of the same type. The computer then coerces all the data to a common type, chosen to avoid data loss.

Take a look at the following example, where we’re mixing different data types.

mixed_data <- c(264, NA, "Bob", 12)

mixed_data        # object contents
[1] "264" NA    "Bob" "12" 
class(mixed_data) # object type
[1] "character"

What has happened is that all values have been coerced to character.

In Python, lists are fine with different types of data. NumPy arrays however expect a single type.

mixed_data = np.array([264, None, "Bob", 12])

mixed_data       # object contents
array([264, None, 'Bob', 12], dtype=object)
type(mixed_data) # object type
<class 'numpy.ndarray'>

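We can see that the object type is a NumPy array, but at the output of the data we see dtype=object. This means the array stores generic Python objects: because None is neither a number nor text, NumPy keeps each value as it is instead of coercing them to a common type.

In the R example, coercion happens because a vector can only hold one data type and we gave it more than one (numerical, logical and text, in this case). To preserve as much data as possible, R converts everything to text. NumPy does the same when numbers and text are mixed without None.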
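A small sketch of how a Python list and a NumPy array treat the same mix of numbers and text (no None this time). The variable names are illustrative; `dtype.kind` is used because it reports the general kind of data without the platform-specific width.

```python
import numpy as np

mixed_list = [264, "Bob", 12]       # a list keeps each value's own type
mixed_array = np.array(mixed_list)  # a NumPy array coerces to one type

list_types = [type(value).__name__ for value in mixed_list]
print(list_types)

# kind "U" means Unicode text: the numbers have become strings
print(mixed_array, mixed_array.dtype.kind)
```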

5.5.1 Converting types

In some cases you might want to enforce a certain data type. If you do this, just be aware that some data could get lost.

Look at the following example, where we create a very simple 1D collection of data in which one number was accidentally entered in quotes, so it’s viewed as text. In this case, forcing all the data to be numeric fixes that error.

temp_error <- c(12, 23, "18", 26)

class(temp_error)
[1] "character"
temp_error <- as.numeric(temp_error)

class(temp_error)
[1] "numeric"

Here we create a NumPy array and check the data type:

temp_error = np.array([12, 23, "18", 26])

print(temp_error, temp_error.dtype) # check the contents and data type
['12' '23' '18' '26'] <U21

It gives us <U21 as a data type. This indicates that NumPy sized the array as a Unicode string array with a maximum of 21 characters. That’s quite a long-winded way of saying “they are not viewed as numbers”.

Thankfully we can fix that by converting the type to int (integers).

temp_error = temp_error.astype(int)

print(temp_error, temp_error.dtype)
[12 23 18 26] int64
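Conversion only works cleanly when every value can be read as a number. The sketch below shows what happens in NumPy when one cannot, and one possible way to handle it: the helper `to_float_or_nan` is a hypothetical function written for this example, not part of NumPy, and it replaces unreadable values with NaN (so that piece of data is lost).

```python
import numpy as np

def to_float_or_nan(value):
    """Hypothetical helper: convert to float, or return NaN if that fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan

raw = np.array(["12", "23", "unsure", "26"])

try:
    raw.astype(int)    # fails: "unsure" cannot be read as a number
    strict_ok = True
except ValueError:
    strict_ok = False

# Convert value by value, marking anything unreadable as missing
converted = np.array([to_float_or_nan(v) for v in raw])
print(strict_ok, converted)
```

In R, as.numeric() behaves more like the helper: values that cannot be converted become NA and a warning is shown.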

5.6 Making changes

Quite often we would want to make some changes to a collection of data. There are different ways we can do this.

Let’s say we gathered some new temperature data and wanted to add this to the original temp_c data.

In R, we’d use the c() function to combine the new data:

c(temp_c, 22, 34)
[1] 23 24 31 27 18 21 22 34

In Python, we take the original temp_c list and add the new values with the + operator:

temp_c + [22, 34]
[23, 24, 31, 27, 18, 21, 22, 34]

Let’s consider another scenario. Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called temp_new and wanted to add these to the original temp_c data.

temp_new <- c(5, 16, 8, 12)

Next, we wanted to combine these new data with the original data, which we stored in temp_c.

Again, we can use the c() function:

c(temp_c, temp_new)
 [1] 23 24 31 27 18 21  5 16  8 12

In Python, we store the new measurements in a list:

temp_new = [5, 16, 8, 12]

We can use the + operator to add the two lists together:

temp_c + temp_new
[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]
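One thing to watch out for in Python: the + operator joins lists, but on NumPy arrays it performs arithmetic instead. The sketch below contrasts the two, using shortened example data and illustrative variable names; `np.concatenate()` is one way to join arrays.

```python
import numpy as np

temp_list = [23, 24, 31]
new_list = [22, 34]

combined_list = temp_list + new_list  # lists: + joins them end to end

temp_array = np.array(temp_list)
new_array = np.array(new_list)

combined_array = np.concatenate([temp_array, new_array])  # arrays: join explicitly
shifted_array = temp_array + 1  # arrays: + does arithmetic on each value

print(combined_list, combined_array, shifted_array)
print(temp_list)  # the original list is unchanged unless we reassign it
```

As in the examples above, the combined result only persists if we assign it to an object.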

5.6.1 Number sequences

We often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. Run the following code to see the output.

1:10                                # integers from 1 to 10
10:1                                # integers from 10 to 1
seq(1, 10, by = 2)                  # from 1 to 10 by steps of 2
seq(10, 1, by = -0.5)               # from 10 to 1 by steps of -0.5
seq(1, 10, length.out = 21)         # 21 equally spaced values from 1 to 10

Python has some built-in functionality to deal with number sequences, but the numpy library is particularly helpful. We installed and loaded it previously, but if needed, re-run the following:

import numpy as np

Next, we can create several different number sequences:

list(range(1, 11))                 # integers from 1 to 10
list(range(10, 0, -1))             # integers from 10 to 1
list(range(1, 11, 2))              # from 1 to 10 by steps of 2
list(np.arange(10, 0.5, -0.5))     # from 10 to 1 by steps of -0.5 (the stop value itself is excluded)
list(np.linspace(1, 10, num = 21)) # 21 equally spaced values from 1 to 10
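A detail that is easy to miss: `range()` and `np.arange()` stop before the stop value, whereas `np.linspace()` includes both end points. The following sketch checks the last value of each sequence (the variable names are illustrative):

```python
import numpy as np

stops_early = np.arange(10, 1, -0.5)        # stop value (1) is excluded
includes_one = np.arange(10, 0.5, -0.5)     # move the stop past 1 to include it
evenly_spaced = np.linspace(1, 10, num=21)  # linspace includes both ends

print(stops_early[-1], includes_one[-1], evenly_spaced[-1])
print(list(range(1, 11))[-1])  # range(1, 11) ends at 10, not 11
```

This differs from R, where seq(10, 1, by = -0.5) includes the final value 1.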

5.7 Subsetting

Sometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we’ll see how to do this on the simple data structures we’ve covered so far.

For simple subsetting we can use square brackets [ ] to extract values. Let’s explore this using our weather object.

weather          # remind ourselves of the data
[1] "sunny"         "cloudy"        "partial_cloud" "cloudy"       
[5] "sunny"         "rainy"        
weather[2]       # extract the second value
[1] "cloudy"
weather[2:4]     # extract the second to fourth value
[1] "cloudy"        "partial_cloud" "cloudy"       
weather[c(3, 1)] # extract the third and first value
[1] "partial_cloud" "sunny"        
weather[-1]      # extract all apart from the first value
[1] "cloudy"        "partial_cloud" "cloudy"        "sunny"        
[5] "rainy"        

The equivalent in Python (note the different index numbers):

weather          # remind ourselves of the data
['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
weather[1]       # extract the second value
'cloudy'
weather[1:4]     # extract the second to fourth value (end index is exclusive)
['cloudy', 'partial_cloud', 'cloudy']
weather[1:]      # extract all apart from the first value
['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']

In the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it’s important to be aware of some key differences. One of them is indexing.

Each item in a collection of data has a number, called an index. Now, it would be great if this was consistent across all programming languages, but it’s not.

R uses 1-based indexing whereas Python uses zero-based indexing. What does this mean? Compare the following:

plants <- c("tree", "shrub", "grass") # the index of "tree" is 1, "shrub" is 2 etc.
plants = ["tree", "shrub", "grass"]   # the index of "tree" is 0, "shrub" is 1 etc.  

Behind the scenes of any programming language there is a lot of counting going on, so it matters whether you start counting at zero or one. If I asked:

“Hey, R - give me the items with index 1 and 2 in plants” then I’d get tree and shrub.

If I asked the same question in Python, then I’d get shrub and grass. Fun times.
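The Python side of this comparison can be sketched as follows. It also shows a second difference worth knowing: a negative index in Python counts from the end, whereas in R (as in the weather[-1] example earlier) it drops that item.

```python
plants = ["tree", "shrub", "grass"]

first = plants[0]          # Python counts from zero, so index 0 is the first item
one_and_two = plants[1:3]  # indices 1 and 2 (the end index 3 is excluded)
last = plants[-1]          # a negative index counts from the end in Python

print(first, one_and_two, last)
```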

5.8 Dealing with missing data

It may seem weird that you have to consider what isn’t there, but that’s exactly what we do when we have missing data. Ideally, when we’re collecting data we have entries for every single thing we measure. But, alas, life is messy. One patient may have missed an appointment, an Eppendorf tube may have been dropped, and so on.

R includes the concept of missing data, meaning we can specify that a data point is missing. Missing data are represented as NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This makes it harder to overlook the cases where you are dealing with missing data. This is a good thing!

For example, let’s look at the following data, where we have measured six different patients and recorded their systolic blood pressure.

systolic_pressure <- c(125, 134, NA, 145, NA, 141)

We can see that we’re missing measurements for two of them. If we want to calculate the average systolic blood pressure across these patients, then we could use the mean() function. However, this results in NA.

mean(systolic_pressure)
[1] NA

This happens because missing values are not numbers and, as such, the mean() function doesn’t know what to do with them.

To overcome this, we need to tell it to ignore missing values and then calculate the mean. We do this by adding the argument na.rm = TRUE to it. This argument works on many different functions and instructs it to remove missing values before any calculation takes place.

mean(systolic_pressure, na.rm = TRUE)
[1] 136.25

There are quite a few ways that you can deal with missing data and we’ll discuss more of them in later sessions.

We can also count the number of missing values we have, by using the is.na() function, together with sum(). Look at the following code:

is.na(systolic_pressure)
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE

For each value in systolic_pressure we get a TRUE or FALSE value. If the value is NA, it returns TRUE. If not missing, FALSE. Behind the scenes R sees TRUE as a value of 1 and FALSE as a value of 0. We can thus count the number of TRUE values with:

sum(is.na(systolic_pressure))
[1] 2

The built-in functionality of Python is not very good at dealing with missing data. This means that you normally need to deal with them manually.

One of the ways you can denote missing data in Python is with None or NaN (“Not A Number”). Let’s look at the following data, where we have measured six different patients and recorded their systolic blood pressure.

systolic_pressure = [125, 134, None, 145, None, 141]

Next, we’d have to filter out the missing values (don’t worry about the exact meaning of the code at this point):

filtered_data = [x for x in systolic_pressure if x is not None]

And lastly we would be able to calculate the mean value:

sum(filtered_data) / len(filtered_data)
136.25

There are quite a few (easier!) ways that you can deal with missing data and we’ll discuss more of them in later sessions, once we start dealing with tabular data.
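As a preview of one such approach, here is a sketch using NumPy, which we loaded earlier. It assumes the missing values are recorded as `np.nan` rather than None; `np.nanmean()` then calculates the mean while ignoring them, much like na.rm = TRUE in R.

```python
import numpy as np

# np.nan marks the missing measurements (NumPy stores the array as floats)
systolic_pressure = np.array([125, 134, np.nan, 145, np.nan, 141])

plain_mean = np.mean(systolic_pressure)               # nan: missing values propagate
n_missing = int(np.isnan(systolic_pressure).sum())    # count the missing values
mean_without_missing = np.nanmean(systolic_pressure)  # ignores nan values

print(plain_mean, n_missing, mean_without_missing)
```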

Note: To exclude or not to exclude?

It may be tempting to simply remove all observations that contain missing data. It often makes the analysis easier! However, there is good reason to be more careful: doing so can mean throwing away good data.

Let’s look at the following hypothetical data set, where we use NA to denote missing values. We are interested in the average weight and age across the patients.

patient_id    weight_kg   age
N982          72          47
N821          68          49
N082          NA          63
N651          78          NA

We could remove all the rows that contain any missing data, thereby getting rid of the last two observations. However, that would mean we’d lose data on age from the penultimate row, and data on weight_kg from the last row.

Instead, it would be better to tell the computer to ignore missing values on a variable-by-variable basis and calculate the averages on the data that is there.
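The two options can be compared with a short sketch. It uses the values from the table above, with `np.nan` standing in for NA; the variable names are illustrative and NumPy is just one possible tool for this.

```python
import numpy as np

# Values from the table above; np.nan stands in for NA
weight_kg = np.array([72, 68, np.nan, 78])
age = np.array([47, 49, 63, np.nan])

# Option 1: drop every patient with any missing value (keeps 2 of 4 rows)
complete = ~np.isnan(weight_kg) & ~np.isnan(age)
mean_weight_complete = weight_kg[complete].mean()
mean_age_complete = age[complete].mean()

# Option 2: ignore missing values one variable at a time (keeps 3 values each)
mean_weight_all = np.nanmean(weight_kg)
mean_age_all = np.nanmean(age)

print(mean_weight_complete, mean_age_complete)
print(round(mean_weight_all, 2), mean_age_all)
```

The second option uses three measurements per variable instead of two, so less of the collected data goes to waste.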

5.9 Exercises

5.9.1 Creating objects

Exercise 1 - Creating objects

Create an object that contains a sequence of even numbers between 1 and 21.

Using code, how many numbers are in the sequence?

We can do this as follows:

num_seq <- seq(2, 21, by = 2)

Determine the length:

length(num_seq)
[1] 10

In Python:

num_seq = list(range(2, 21, 2))

Determine the length:

len(num_seq)
10

There are 10 numbers in the sequence.

5.9.2 Summation

Exercise 2 - Summation

For this exercise, create a series of odd numbers between 10 and 30.

Using programming, answer the following: what is the sum of the resulting series of numbers?

num_seq <- seq(11, 29, by = 2)
sum(num_seq)
[1] 200

In Python:

num_seq = list(range(11, 30, 2))
sum(num_seq)
200

5.9.3 Data types

Exercise 3 - Data types

Programming languages have a habit of coercing data types. What data types do you expect the following collections to be?

ex1 <- c(22, 87, NA, 32)
ex2 <- c(22, 87, 96, "unsure")
ex3 <- c(22, 87, 96.8, 102)
ex4 <- c(89, "rain", 12, TRUE)
ex5 <- c(TRUE, FALSE, TRUE, TRUE, "1", TRUE)
ex6 <- c(TRUE, FALSE, TRUE, TRUE, 1, TRUE)

In Python:

ex1 = np.array([22, 87, None, 32])
ex2 = np.array([22, 87, 96, "unsure"])
ex3 = np.array([22, 87, 96.8, 102])
ex4 = np.array([89, "rain", 12, True])
ex5 = np.array([True, False, True, True, "1", True])
ex6 = np.array([True, False, True, True, 1, True])

In R, the classes are:

class(ex1)
[1] "numeric"
class(ex2)
[1] "character"
class(ex3)
[1] "numeric"
class(ex4)
[1] "character"
class(ex5)
[1] "character"
class(ex6)
[1] "numeric"

There are some perhaps unexpected data types in there, so let’s look at them a bit more closely.

Object   Output        Explanation
ex1      "numeric"     All numeric values + NA (which is allowed in numeric vectors).
ex2      "character"   Mixing numbers with a string ("unsure") coerces everything to character.
ex3      "numeric"     Integers and decimal values are both numeric in R (stored as doubles).
ex4      "character"   Presence of "rain" (a string) forces all elements to become character.
ex5      "character"   Mixing logical (TRUE/FALSE) with string ("1") coerces all to character.
ex6      "numeric"     Logical values (TRUE/FALSE) are coerced to 1 and 0, making the vector numeric.

In Python, the data types are:

ex1.dtype
dtype('O')
ex2.dtype
dtype('<U21')
ex3.dtype
dtype('float64')
ex4.dtype
dtype('<U21')
ex5.dtype
dtype('<U5')
ex6.dtype
dtype('int64')

There are some weird data types in there, so let’s unpack that a bit more.

dtype        Meaning                         Explanation
O            Object                          The array holds generic Python objects (mixed types, e.g. int + None). No coercion possible.
<U21, <U5    Unicode string                  The array holds text (U = Unicode). The number (21, 5) is the maximum string length in that array. NumPy converted everything to strings.
float64      64-bit floating-point numbers   Purely numeric (can hold integers and floats together).
int64        64-bit integers                 Purely integer values (booleans are treated as 1 and 0).

5.10 Summary

Tip: Key points
  • The most common data types include numerical, text and logical data.
  • We can store data in single objects, enabling us to use the data.
  • Multiple data points and types can be stored as different collections of data.
  • We can make changes to objects and collections of data.
  • We need to be explicit about missing data.