4  Data types & structures

Learning objectives
  • Create familiarity with the most common data types
  • Know about basic data structures
  • Create, use and make changes to objects
  • Create, use and make changes to collections of data
  • Deal with missing data

4.1 Context

We’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We’ll also briefly mention other data structures, such as matrices and arrays.

4.2 Explained: Data types & structures

Computers are picky when it comes to data and they like consistency. As such, it’s good to be aware of the fact that data can be viewed or interpreted in different ways by the computer.

For example, you might have research data where the presence or absence of a tumour is scored. This would often be recorded as 0 when absent and 1 when present. Your computer views these values as numbers and would happily calculate the average of those values. Not ideal, because a tumour being, on average, 0.3 present makes no sense!

So, it makes sense to spend a bit of time looking at your data and making sure that the computer sees it in the correct way.
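To make this concrete, here is a small Python sketch using made-up tumour scores (the values and object names are purely illustrative, not real data). It shows that the computer will happily average 0/1 scores, whereas counting how often each category occurs is usually closer to what we mean.

```python
from collections import Counter
from statistics import mean

# Hypothetical scores for ten samples: 0 = tumour absent, 1 = tumour present
tumour_scores = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

# Treated as numbers, the scores can be averaged
average_score = mean(tumour_scores)
print(average_score)   # 0.3

# Treated as categories, we count how often each outcome occurs
labels = ["present" if score == 1 else "absent" for score in tumour_scores]
tumour_counts = Counter(labels)
print(tumour_counts["absent"], tumour_counts["present"])   # 7 3
```

Don’t worry about the exact syntax at this point; the takeaway is that the same values can be summarised in very different ways depending on how the computer views them.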

4.2.1 Quantitative data

Discrete data

LO: discrete data

Continuous data

LO: continuous data

4.2.2 Qualitative data

Categories

LO: categories (nominal data)

Ordinal data

LO: factors (ordinal data)

4.2.3 Getting the computer to see the right way

In general, computers can view these different types of data in specific ways.

R has the following main data types:

Data type   Description
numeric     Represents numbers; can be whole (integers) or decimals (e.g., 19 or 2.73).
integer     Specific type of numeric data; can only be an integer (e.g., 7L, where L indicates an integer).
character   Also called text or string (e.g., "Rabbits are great!").
logical     Also called boolean values; takes either TRUE or FALSE.
factor      A type of categorical data that can have inherent ordering (e.g., low, medium, high).

Python has the following main data types:

Data type   Description
int         Specific type of numeric data; can only be an integer (e.g., 7 or 56).
float       Decimal numbers (e.g., 3.92 or 9.824).
str         Text or string data (e.g., "Rabbits are great!").
bool        Logical or boolean values; takes either True or False.
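
As a quick illustration, the sketch below stores one example value of each Python type and asks Python how it views them, using the built-in type() function (covered in more detail later in this chapter). The object names and values are made up for this example.

```python
# One example value for each of the main Python data types
n_patients = 7                    # int: whole number
body_temp = 36.6                  # float: decimal number
species = "Rabbits are great!"    # str: text
is_treated = True                 # bool: logical value

print(type(n_patients))   # <class 'int'>
print(type(body_temp))    # <class 'float'>
print(type(species))      # <class 'str'>
print(type(is_treated))   # <class 'bool'>

# Quotes matter: "7" is text, whereas 7 is a number
print(type("7"))          # <class 'str'>
```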

4.2.4 Data structures

In the section on running code we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We’ve just seen that we can have different types of data. We can save these into different data structures. Which data structure you need is often determined by the type of data and the complexity.

In the following sections we look at simple data structures.

4.3 Objects

We can store values into objects. To do this, we assign values to them. An object acts as a container for that value.

To create an object, we need to give it a name followed by the assignment operator and the value we want to give it, for example:

temperature <- 23

We can read the code as: the value 23 is assigned (<-) to the object temperature. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).

When assigning a value to an object, R does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.

temperature = 23

We can read the code as: the value 23 is assigned (=) to the object temperature.

When assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.

The assignment operator

We use an assignment operator to assign values on the right to objects on the left.

In R we use <- as the assignment operator.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

In Python we use = as the assignment operator.


Objects can be given almost any name such as x, current_temperature, or subject_id. You want the object names to be explicit and short. There are some exceptions / considerations (see below).

Restrictions on object names

Object names can contain letters, numbers and underscores. R also allows periods in object names; Python does not.

They cannot start with a number nor contain spaces. Different people use different conventions for long variable names, two common ones being:

Underscore: my_long_named_object

Camel case: myLongNamedObject

What you use is up to you, but be consistent. Programming languages are case-sensitive so temperature is different from Temperature.

  • Some names are reserved words or keywords, because they are the names of core functions (e.g., if, else, for, see R or Python for a complete list).
  • Avoid using function names (e.g., c, T, mean, data, df, weights), even if allowed. If in doubt, check the help to see if the name is already in use.
  • Avoid full-stops (.) within an object name as in my.data. Full-stops often have meaning in programming languages, so it’s best to avoid them.
  • Use consistent styling. In R, popular style guides are:

Whatever style you use, be consistent!
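
If you are unsure whether a name is allowed in Python, you can ask Python itself. The sketch below uses the standard keyword module and the str.isidentifier() method; the candidate names are just examples. Note that this only checks Python’s rules, which differ slightly from R’s.

```python
import keyword

# Reserved words cannot be used as object names
print(keyword.iskeyword("for"))          # True
print(keyword.iskeyword("temperature"))  # False

# str.isidentifier() checks whether text follows Python's naming rules
candidate_names = ["subject_id", "myLongNamedObject", "2nd_sample", "my data", "my.data"]
valid_names = [name for name in candidate_names if name.isidentifier()]
print(valid_names)  # ['subject_id', 'myLongNamedObject']

# Names are case-sensitive: these are two separate objects
temperature = 23
Temperature = 30
print(temperature == Temperature)  # False
```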

4.3.1 Using objects

Now that we have the temperature in memory, we can use it to perform operations. For example, this might be the temperature in Celsius and we might want to convert it to Kelvin.

To do this, we need to add 273.15:

temperature + 273.15
[1] 296.15

temperature + 273.15
296.15

We can change an object’s value by assigning a new one:

temperature <- 36
temperature + 273.15
[1] 309.15

temperature = 36
temperature + 273.15
309.15

Finally, assigning a value to one object does not change the values of other objects. For example, let’s store the outcome in Kelvin into a new object temp_K:

temp_K <- temperature + 273.15

temp_K = temperature + 273.15

Changing the value of temperature does not change the value of temp_K.

temperature <- 14
temp_K
[1] 309.15

temperature = 14
temp_K
309.15

4.3.2 Updating objects

LO: update objects in R LO: update objects in Python & demonstrate lack of updates in tuples

4.4 Collections of data

In the examples above we have stored single values into an object. Of course we often have to deal with more than that. Generally speaking, we can create collections of data. This enables us to organise our data, for example by creating a collection of numbers or text values.

4.4.1 Creating collections

Creating a collection of data is pretty straightforward, particularly if you are doing it manually.

The simplest collection of data in R is called a vector. This really is the workhorse of R.

A vector is composed of a series of values, which can be numbers, text or any of the data types described.

We can assign a series of values to a vector using the c() function. For example, we can create a vector of temperatures and assign it to a new object temp_c:

temp_c <- c(23, 24, 31, 27, 18, 21)

temp_c
[1] 23 24 31 27 18 21

A vector can also contain text. For example, let’s create a vector that contains weather descriptions:

weather <- c("sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy")

weather
[1] "sunny"         "cloudy"        "partial_cloud" "cloudy"       
[5] "sunny"         "rainy"        

The simplest collection of data in Python is either a list or a tuple. Both can hold items of the same or different types. Whereas a tuple cannot be changed after it’s created, a list can.

We can assign a collection of numbers to a list:

temp_c = [23, 24, 31, 27, 18, 21]

temp_c
[23, 24, 31, 27, 18, 21]

A list can also contain text. For example, let’s create a list that contains weather descriptions:

weather = ["sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy"]

weather
['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']

We can also create a tuple. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use ( ) round brackets instead of [ ] square brackets:

temp_c_tuple = (23, 24, 31, 27, 18, 21)
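
To see what “cannot be altered” means in practice, here is a minimal sketch that tries to replace the first value in a list and in a tuple. It uses separate, made-up object names (temps_list and temps_tuple) so that the objects used elsewhere in this chapter stay untouched. Index 0 refers to the first item (indexing is covered in the Subsetting section), and the try / except construct is only there to catch the error so the code keeps running.

```python
temps_list = [23, 24, 31, 27, 18, 21]     # list: can be changed
temps_tuple = (23, 24, 31, 27, 18, 21)    # tuple: cannot be changed

# Replacing the first value of the list works
temps_list[0] = 25
print(temps_list)    # [25, 24, 31, 27, 18, 21]

# Trying the same on the tuple raises a TypeError
try:
    temps_tuple[0] = 25
    tuple_changed = True
except TypeError:
    tuple_changed = False
    print("A tuple cannot be changed after it is created")

print(temps_tuple)   # (23, 24, 31, 27, 18, 21)
```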

Note that when we define text (e.g. "cloudy" or "sunny"), we need to use quotes.

When we deal with numbers - whole or decimal (e.g. 23, 18.5) - we do not use quotes.

Having a type

Different data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.

We can use the class() function to find out how R views our data. This function also works for more complex data structures.

Let’s do this for our examples:

class(temp_c)
[1] "numeric"
class(weather)
[1] "character"

We can use the type() function to find out how Python views our data. This function also works for more complex data structures.

Let’s do this for our examples:

type(temp_c)
<class 'list'>
type(weather)
<class 'list'>
type(temp_c_tuple)
<class 'tuple'>
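
Note that type() on a list or tuple reports the container, not what is inside it. As a small sketch, we can look at individual items instead (index 0 is the first item in Python; see the Subsetting section). The mixed list below is a made-up example.

```python
temp_c = [23, 24, 31, 27, 18, 21]
weather = ["sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy"]

# type() on the whole collection reports the container
print(type(temp_c))       # <class 'list'>

# type() on a single item reports the type of that value
print(type(temp_c[0]))    # <class 'int'>
print(type(weather[0]))   # <class 'str'>

# A list can mix types; here we collect the type name of each item
mixed = [23, 18.5, "sunny", True]   # hypothetical example values
mixed_types = [type(item).__name__ for item in mixed]
print(mixed_types)        # ['int', 'float', 'str', 'bool']
```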

4.4.2 Making changes

Quite often we would want to make some changes to a collection of data. There are different ways we can do this.

Let’s say we gathered some new temperature data and wanted to add this to the original temp_c data.

In R, we’d use the c() function to combine the new data:

c(temp_c, 22, 34)
[1] 23 24 31 27 18 21 22 34

In Python, we take the original temp_c list and add the new values with the + operator:

temp_c + [22, 34]
[23, 24, 31, 27, 18, 21, 22, 34]

Let’s consider another scenario. Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called temp_new and wanted to add these to the original temp_c data.

temp_new <- c(5, 16, 8, 12)

Next, we wanted to combine these new data with the original data, which we stored in temp_c.

Again, we can use the c() function:

c(temp_c, temp_new)
 [1] 23 24 31 27 18 21  5 16  8 12

temp_new = [5, 16, 8, 12]

We can use the + operator to add the two lists together:

temp_c + temp_new
[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]
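
One thing to keep in mind is that combining lists with + creates a new list and leaves the originals as they were. The sketch below illustrates this; the names temp_all and temp_extra are made up for this example. It also shows the list method append(), which does change an existing list.

```python
temp_c = [23, 24, 31, 27, 18, 21]
temp_new = [5, 16, 8, 12]

# Combining with + creates a new list; temp_c itself is unchanged
temp_c + temp_new
print(temp_c)        # [23, 24, 31, 27, 18, 21]

# Assign the result to an object to keep it
temp_all = temp_c + temp_new
print(temp_all)      # [23, 24, 31, 27, 18, 21, 5, 16, 8, 12]

# append() adds a single value to an existing list, changing it in place
temp_extra = [22, 34]    # hypothetical extra measurements
temp_extra.append(19)
print(temp_extra)        # [22, 34, 19]
```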

4.4.3 Number sequences

We often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. Run the following code to see the output.

1:10                                # integers from 1 to 10
10:1                                # integers from 10 to 1
seq(1, 10, by = 2)                  # from 1 to 10 by steps of 2
seq(10, 1, by = -0.5)               # from 10 to 1 by steps of -0.5
seq(1, 10, length.out = 20)         # 20 equally spaced values from 1 to 10

Python has some built-in functionality to deal with number sequences, but the numpy library is particularly helpful. We installed and loaded it previously, but if needed, re-run the following:

import numpy as np

Next, we can create several different number sequences:

list(range(1, 11))                 # integers from 1 to 10
list(range(10, 0, -1))             # integers from 10 to 1
list(range(1, 11, 2))              # from 1 to 10 by steps of 2
list(np.arange(10, 0.5, -0.5))     # from 10 to 1 by steps of -0.5 (the stop value is excluded)
list(np.linspace(1, 10, num = 20)) # 20 equally spaced values from 1 to 10
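
Both range() and np.arange() leave out the stop value, which is easy to trip over when coming from R. The sketch below uses only built-in Python (no numpy) to illustrate this, and shows one possible way of building a sequence with decimal steps by hand. The object names are made up for this example.

```python
# range() stops just before the end value, so use 11 to include 10
one_to_ten = list(range(1, 11))
print(one_to_ten)          # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# range() only accepts whole-number steps; for steps of -0.5 we can
# compute each value from a whole-number counter instead
n_values = 19              # 10, 9.5, ..., 1.0 is 19 values
ten_to_one = [10 - 0.5 * i for i in range(n_values)]
print(len(ten_to_one))                   # 19
print(max(ten_to_one), min(ten_to_one))  # 10.0 1.0
```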

4.4.4 Subsetting

Sometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we’ll see how to do this on the simple data structures we’ve covered so far.

In the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it’s important to be aware of some key differences. One of them is indexing.

Each item in a collection of data has a number, called an index. Now, it would be great if this was consistent across all programming languages, but it’s not.

R uses 1-based indexing whereas Python uses zero-based indexing. What does this mean? Compare the following:

plants <- c("tree", "shrub", "grass") # the index of "tree" is 1, "shrub" is 2 etc.
plants = ["tree", "shrub", "grass"]   # the index of "tree" is 0, "shrub" is 1 etc.  

Behind the scenes of any programming language there is a lot of counting going on. So, it matters if you count starting at zero or one. So, if I asked:

“Hey, R - give me the items with index 1 and 2 in plants” then I’d get tree and shrub.

If I asked that question in Python, then I’d get shrub and grass. Fun times.
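
The Python side of this comparison can be checked with a short sketch (the object names other than plants are made up for this example):

```python
plants = ["tree", "shrub", "grass"]

# Index 0 is the first item in Python
first_plant = plants[0]
print(first_plant)         # tree

# Asking for index 1 and 2 returns the second and third items
second_and_third = [plants[1], plants[2]]
print(second_and_third)    # ['shrub', 'grass']

# The last valid index is always one less than the number of items
last_index = len(plants) - 1
print(last_index)          # 2
```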

In R we can use square brackets [ ] to extract values. Let’s explore this using our weather object.

weather          # remind ourselves of the data
[1] "sunny"         "cloudy"        "partial_cloud" "cloudy"       
[5] "sunny"         "rainy"        
weather[2]       # extract the second value
[1] "cloudy"
weather[2:4]     # extract the second to fourth value
[1] "cloudy"        "partial_cloud" "cloudy"       
weather[c(3, 1)] # extract the third and first value
[1] "partial_cloud" "sunny"        
weather[-1]      # extract all apart from the first value
[1] "cloudy"        "partial_cloud" "cloudy"        "sunny"        
[5] "rainy"        

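In Python we also use square brackets [ ] to extract values, keeping in mind that counting starts at zero. Let’s explore this using our weather object.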

weather          # remind ourselves of the data
['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
weather[1]       # extract the second value
'cloudy'
weather[1:4]     # extract the second to fourth value (end index is exclusive)
['cloudy', 'partial_cloud', 'cloudy']
weather[2], weather[0] # extract the third and first value
('partial_cloud', 'sunny')
weather[1:]      # extract all apart from the first value
['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
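
A related difference worth knowing about is the meaning of negative indices. In R, weather[-1] drops the first value, whereas in Python a negative index counts backwards from the end. A small sketch, with made-up object names for the results:

```python
weather = ["sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy"]

# A negative index counts from the end; it does not exclude an item
last_value = weather[-1]
print(last_value)        # rainy

# The last two values
last_two = weather[-2:]
print(last_two)          # ['sunny', 'rainy']

# Everything except the last value
all_but_last = weather[:-1]
print(all_but_last)      # ['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny']
```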

4.5 Dealing with missing data

It may seem weird that you have to consider what isn’t there, but that’s exactly what we do when we have missing data. Ideally, when we’re collecting data we have entries for every single thing we measure. But, alas, life is messy. One patient may have missed an appointment, an Eppendorf tube may have been dropped, and so on.

R includes the concept of missing data, meaning we can specify that a data point is missing. Missing data are represented as NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This makes it harder to overlook the cases where you are dealing with missing data. This is a good thing!

For example, let’s look at the following data, where we have measured six different patients and recorded their systolic blood pressure.

systolic_pressure <- c(125, 134, NA, 145, NA, 141)

We can see that we’re missing measurements for two of them. If we want to calculate the average systolic blood pressure across these patients, then we could use the mean() function. However, this results in NA, because the data contain missing values.

mean(systolic_pressure)
[1] NA

You can add the argument na.rm = TRUE to various functions - including mean() - to calculate the result while ignoring the missing values. This stands for “remove missing values”.

mean(systolic_pressure, na.rm = TRUE)
[1] 136.25

There are quite a few ways that you can deal with missing data and we’ll discuss more of them in later sessions.

The built-in functionality of Python is not very good at dealing with missing data. This means that you normally need to deal with them manually.

One of the ways you can denote missing data in Python is with None. Let’s look at the following data, where we have measured six different patients and recorded their systolic blood pressure.

systolic_pressure = [125, 134, None, 145, None, 141]

Next, we’d have to filter out the missing values (don’t worry about the exact meaning of the code at this point):

filtered_data = [x for x in systolic_pressure if x is not None]

And lastly we would be able to calculate the mean value:

sum(filtered_data) / len(filtered_data)
136.25

There are quite a few (easier!) ways that you can deal with missing data and we’ll discuss more of them in later sessions, once we start dealing with tabular data.
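
For completeness, here is a slightly fuller sketch of the manual approach, using only built-in Python and the standard statistics module. It shows what happens if we forget to filter (Python raises an error rather than returning something like NA), counts the missing values, and then calculates the mean. The names total and n_missing are made up for this example.

```python
from statistics import mean

systolic_pressure = [125, 134, None, 145, None, 141]

# Arithmetic on None fails, so missing values cannot be ignored by accident
try:
    total = sum(systolic_pressure)
except TypeError:
    total = None
    print("Cannot sum a list that contains None")

# Count how many measurements are missing
n_missing = systolic_pressure.count(None)
print(n_missing)   # 2

# Filter out the missing values, then calculate the mean
filtered_data = [x for x in systolic_pressure if x is not None]
print(mean(filtered_data))   # 136.25
```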

To exclude or not exclude?

It may be tempting to simply remove all observations that contain missing data. It often makes the analysis easier! However, there is good reason to be more careful: doing so can mean throwing away good data.

Let’s look at the following hypothetical data set, where we use NA to denote missing values. We are interested in the average weight and age across the patients.

patient_id    weight_kg   age
N982          72          47
N821          68          49
N082          NA          63
N651          78          NA

We could remove all the rows that contain any missing data, thereby getting rid of the last two observations. However, that would mean we’d lose data on age from the penultimate row, and data on weight_kg from the last row.

Instead, it would be better to tell the computer to ignore missing values on a variable-by-variable basis and calculate the averages on the data that is there.
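
As a rough sketch of what that could look like in plain Python, the table above is written out by hand below as a list of dictionaries, with None marking missing values. The function mean_ignoring_missing() is a made-up helper for this example, not part of any library, and later sessions cover more convenient tools for tabular data.

```python
from statistics import mean

# The hypothetical data set from above, with None marking missing values
patients = [
    {"patient_id": "N982", "weight_kg": 72, "age": 47},
    {"patient_id": "N821", "weight_kg": 68, "age": 49},
    {"patient_id": "N082", "weight_kg": None, "age": 63},
    {"patient_id": "N651", "weight_kg": 78, "age": None},
]

def mean_ignoring_missing(rows, variable):
    """Average one variable, skipping rows where that variable is missing."""
    values = [row[variable] for row in rows if row[variable] is not None]
    return mean(values)

# Variable-by-variable: each average uses three patients
mean_weight = mean_ignoring_missing(patients, "weight_kg")
mean_age = mean_ignoring_missing(patients, "age")
print(round(mean_weight, 2), mean_age)   # 72.67 53

# Removing incomplete rows first leaves only two patients
complete_rows = [row for row in patients if None not in row.values()]
print(len(complete_rows))   # 2
```

With only the two complete rows, the averages would be based on fewer patients (weight 70 kg, age 48), which is why the variable-by-variable approach makes better use of the data.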

4.6 Summary

Key points
  • The most common data types include numerical, text and logical data.
  • We can store data in single objects, enabling us to use the data.
  • Multiple data points and types can be stored as different collections of data.
  • We can make changes to objects and collections of data.
  • We need to be explicit about missing data.