4 Data types & structures
- Create familiarity with the most common data types
- Know about basic data structures
- Create, use and make changes to objects
- Create, use and make changes to collections of data
- Deal with missing data
4.1 Context
We’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We’ll also briefly mention other data structures, such as matrices and arrays.
4.2 Explained: Data types & structures
Computers are picky when it comes to data and they like consistency. As such, it’s good to be aware of the fact that data can be viewed or interpreted in different ways by the computer.
For example, you might have research data where the presence or absence of a tumour is scored. This would often be recorded as 0 when absent and 1 when present. Your computer views these values as numbers and would happily calculate the average of those values. Not ideal, because a tumour being, on average, 0.3 present makes no sense!
So, it is important to spend a bit of time looking at your data to make sure that the computer sees it in the correct way.
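To make the tumour example concrete, here is a minimal Python sketch. The ten scores are made up for illustration, and the names `tumour_scores`, `average_score` and `label_counts` are assumptions rather than part of the course data. It shows that the computer will average the scores without complaint, whereas counting the categories gives a more sensible summary.

```python
from collections import Counter

# Hypothetical scores for ten samples: 0 = tumour absent, 1 = tumour present
tumour_scores = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

# Treated as numbers, Python happily computes an average
average_score = sum(tumour_scores) / len(tumour_scores)
print(average_score)

# Treated as categories, counting how often each label occurs is more meaningful
labels = ["present" if score == 1 else "absent" for score in tumour_scores]
label_counts = Counter(labels)
print(label_counts)
```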
4.2.1 Quantitative data
Discrete data
Discrete data are numerical data that can only take distinct values. They can be counted and only take whole numbers. Examples of discrete data include the number of planets in a solar system or the number of questions answered on an exam.
Description |
The number of questions answered on an exam (e.g. 12 out of 20) |
Whether somebody has completed a survey (binary data; yes/no) |
The number of students in a class (e.g. 20, 32) |
Continuous data
Continuous data can take any value within a given range. These data can be measured and can include decimals or fractions.
Description |
Temperature of a liquid (e.g. 20 °C) |
Height of people in a cohort (e.g. 168 cm) |
Average heart rate in a patient (e.g. 70 beats per minute) |
Water levels in an aquifer (e.g. 2.4 metres) |
4.2.2 Qualitative data
Qualitative data are data that describe qualities which can’t be measured or quantified numerically. We can roughly split these data into two types: ones with an inherent order to them, and ones without.
Nominal data: categories
These are categorical data that represent categories or distinct groups, without any inherent order or ranking.
Description |
Eye colour (e.g. blue, brown) |
Education level (e.g. primary school, secondary school) |
Treatment group (e.g. control, treatment) |
Ordinal data: categories with ranking or ordering
Ordinal data are similar to nominal data, in that they represent different categories or groups. However, these also have an inherent ordering to them.
Description |
Rating scale (e.g., 1 to 5 stars for difficulty levels) |
Rank or position (e.g., 1st, 2nd, 3rd place in a tournament) |
Order or progression (e.g., low, medium, high priority) |
4.2.3 Getting the computer to see the right way
In general, computers can view these different types of data in specific ways.
R has the following main data types:
Data type | Description |
numeric | Represents numbers; can be whole (integers) or decimals (e.g., 19 or 2.73). |
integer | Specific type of numeric data; can only be an integer (e.g., 7L where L indicates an integer). |
character | Also called text or string (e.g., "Rabbits are great!"). |
logical | Also called boolean values; takes either TRUE or FALSE. |
factor | A type of categorical data that can have inherent ordering (e.g., low, medium, high). |
Python has the following main data types:
Data type | Description |
int | Specific type of numeric data; can only be an integer (e.g., 7 or 56). |
float | Decimal numbers (e.g., 3.92 or 9.824). |
str | Text or string data (e.g., "Rabbits are great!"). |
bool | Logical or boolean values; takes either True or False. |
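As a quick check of the Python table, the sketch below creates one value of each type and asks Python how it sees it, using the built-in `type()` function. The variable names are illustrative assumptions. It also shows that quoting a number turns it into text, which changes how it behaves.

```python
# Illustrative values, one for each of the Python data types in the table
n_questions = 7                  # int: a whole number
measurement = 3.92               # float: a decimal number
message = "Rabbits are great!"   # str: text
is_complete = True               # bool: logical value

for value in (n_questions, measurement, message, is_complete):
    print(value, type(value))

# Quoting a number turns it into text, which changes how Python treats it
number_sum = 7 + 7      # numeric addition
text_sum = "7" + "7"    # text is joined together instead
print(number_sum, text_sum)
```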
4.2.4 Data structures
In the section on running code we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We’ve just seen that we can have different types of data. We can save these into different data structures. Which data structure you need is often determined by the type of data and the complexity.
In the following sections we look at simple data structures.
4.3 Objects
We can store values into objects. To do this, we assign values to them. An object acts as a container for that value.
To create an object, we need to give it a name followed by the assignment operator and the value we want to give it, for example:
temperature <- 23
We can read the code as: the value 23 is assigned (<-) to the object temperature. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).
When assigning a value to an object, R does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.
temperature = 23
We can read the code as: the value 23 is assigned (=) to the object temperature.
When assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.
We use an assignment operator to assign values on the right to objects on the left.
In R we use <- as the assignment operator.
In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.
In Python we use = as the assignment operator.
Objects can be given almost any name such as x, current_temperature, or subject_id. You want the object names to be explicit and short. There are some exceptions / considerations (see below).
Object names can contain letters, numbers and underscores (R also allows periods).
They cannot start with a number nor contain spaces. Different people use different conventions for long variable names, two common ones being:
Underscore: my_long_named_object
Camel case: myLongNamedObject
What you use is up to you, but be consistent. Programming languages are case-sensitive so temperature is different from Temperature.
- Some names are reserved words or keywords, because they are the names of core functions (see the R or Python documentation for a complete list).
- Avoid using function names, even if allowed. If in doubt, check the help to see if the name is already in use.
- Avoid full-stops (.) within an object name, as in my.data. Full-stops often have meaning in programming languages, so it’s best to avoid them.
- Use consistent styling, for example by following one of the popular R style guides.
Whatever style you use, be consistent!
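In Python, the naming rules above can be checked directly. The sketch below is only an illustration: the candidate names are made up, and it relies on the built-in string method `isidentifier()` and the standard-library `keyword` module.

```python
import keyword

# Candidate object names to check; the names are illustrative only
candidates = ["temperature", "current_temperature", "2nd_reading",
              "my temp", "my.data", "for"]

for name in candidates:
    # str.isidentifier() checks the character rules
    # (no spaces, no leading digit, no periods);
    # keyword.iskeyword() checks whether Python reserves the word for itself
    usable = name.isidentifier() and not keyword.iskeyword(name)
    print(f"{name!r}: {'usable' if usable else 'not usable'}")

# Names are case-sensitive, so these are two separate objects
temperature = 23
Temperature = 36
print(temperature == Temperature)
```

Only the first two candidates pass both checks. Note that this covers what Python allows, not what is good style.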
4.3.1 Using objects
Now that we have the temperature in memory, we can use it to perform operations. For example, this might be the temperature in Celsius and we might want to convert it to Kelvin.
To do this, we need to add 273.15:
temperature + 273.15
[1] 296.15
temperature + 273.15
We can change an object’s value by assigning a new one:
temperature <- 36
temperature + 273.15
[1] 309.15
temperature = 36
temperature + 273.15
Finally, assigning a value to one object does not change the values of other objects. For example, let’s store the outcome in Kelvin into a new object temp_K:
temp_K <- temperature + 273.15
temp_K = temperature + 273.15
Changing the value of temperature does not change the value of temp_K:
temperature <- 14
temp_K
[1] 309.15
temperature = 14
temp_K
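The whole sequence can be run in one go in Python. This is a small sketch using the same object names as above, plus one extra name, `temp_K_updated`, which is an assumption added for illustration. It shows that temp_K keeps its earlier value until we recompute it ourselves.

```python
# Store a temperature in Celsius and derive the Kelvin value from it
temperature = 36
temp_K = temperature + 273.15

# Reassign temperature; temp_K keeps the value computed earlier
temperature = 14
print(temperature)
print(temp_K)

# To get an up-to-date Kelvin value we have to recompute it explicitly
temp_K_updated = temperature + 273.15
print(temp_K_updated)
```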
4.4 Collections of data
In the examples above we have stored single values into an object. Of course we often have to deal with more than that. Generally speaking, we can create collections of data. This enables us to organise our data, for example by creating a collection of numbers or text values.
4.4.1 Creating collections
Creating a collection of data is pretty straightforward, particularly if you are doing it manually.
The simplest collection of data in R is called a vector. This really is the workhorse of R.
A vector is composed of a series of values, which can be numbers, text or any of the data types described.
We can assign a series of values to a vector using the c() function. For example, we can create a vector of temperatures and assign it to a new object temp_c:
temp_c <- c(23, 24, 31, 27, 18, 21)
temp_c
[1] 23 24 31 27 18 21
A vector can also contain text. For example, let’s create a vector that contains weather descriptions:
weather <- c("sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy")
weather
[1] "sunny" "cloudy" "partial_cloud" "cloudy"
[5] "sunny" "rainy"
The simplest collection of data in Python is either a list or a tuple. Both can hold items of the same or different types. Whereas a tuple cannot be changed after it’s created, a list can.
We can assign a collection of numbers to a list:
temp_c = [23, 24, 31, 27, 18, 21]
temp_c
[23, 24, 31, 27, 18, 21]
A list can also contain text. For example, let’s create a list that contains weather descriptions:
weather = ["sunny", "cloudy", "partial_cloud", "cloudy", "sunny", "rainy"]
weather
['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
We can also create a tuple. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use ( ) round brackets instead of [ ] square brackets:
temp_c_tuple = (23, 24, 31, 27, 18, 21)
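To see the difference between a list and a tuple in practice, the following sketch tries to change the first value of each. The flag `tuple_changed` is an assumption added only so the outcome can be checked; the `try` / `except` construct is used here to catch the error rather than stop the script.

```python
# The same measurements stored as a list (changeable) and as a tuple (fixed)
temp_c = [23, 24, 31, 27, 18, 21]
temp_c_tuple = (23, 24, 31, 27, 18, 21)

# A list can be changed in place
temp_c[0] = 25
print(temp_c)

# Trying the same on a tuple raises a TypeError
try:
    temp_c_tuple[0] = 25
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print(f"Cannot change a tuple: {error}")
```

A tuple can be a reasonable choice for values that should stay fixed, since accidental changes produce an error instead of silently altering the data.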
Note that when we define text (e.g. "cloudy" or "sunny"), we need to use quotes.
When we deal with numbers - whole or decimal (e.g. 23, 18.5) - we do not use quotes.
Different data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.
We can use the class() function to find out how R views our data. This function also works for more complex data structures.
Let’s do this for our examples:
class(temp_c)
[1] "numeric"
class(weather)
[1] "character"
We can use the type() function to find out how Python views our data. This function also works for more complex data structures.
Let’s do this for our examples:
type(temp_c)
<class 'list'>
type(weather)
<class 'list'>
type(temp_c_tuple)
<class 'tuple'>
4.4.2 Making changes
Quite often we would want to make some changes to a collection of data. There are different ways we can do this.
Let’s say we gathered some new temperature data and wanted to add this to the original temp_c data.
We’d use the c() function to combine the new data:
c(temp_c, 22, 34)
[1] 23 24 31 27 18 21 22 34
We take the original temp_c list and add the new values:
temp_c + [22, 34]
[23, 24, 31, 27, 18, 21, 22, 34]
Let’s consider another scenario. Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called temp_new and wanted to add these to the original temp_c data.
temp_new <- c(5, 16, 8, 12)
Next, we wanted to combine these new data with the original data, which we stored in temp_c. Again, we can use the c() function:
c(temp_c, temp_new)
[1] 23 24 31 27 18 21 5 16 8 12
temp_new = [5, 16, 8, 12]
We can use the + operator to add the two lists together:
temp_c + temp_new
[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]
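One detail worth noting is that combining collections in this way produces a new collection and leaves the original untouched, which is why temp_c still holds six values in each example above. The Python sketch below illustrates this; the names `combined` and `temp_all` are assumptions introduced for the example. It shows two ways of keeping the result.

```python
temp_c = [23, 24, 31, 27, 18, 21]
temp_new = [5, 16, 8, 12]

# The + operator builds a new list; temp_c itself is unchanged
combined = temp_c + temp_new
print(len(temp_c), len(combined))

# To keep the combined data, assign the result to a name ...
temp_all = temp_c + temp_new

# ... or change the original list in place with extend()
temp_c.extend(temp_new)
print(temp_c == temp_all)
```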
4.4.3 Number sequences
We often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. Run the following code to see the output.
1:10 # integers from 1 to 10
10:1 # integers from 10 to 1
seq(1, 10, by = 2) # from 1 to 10 by steps of 2
seq(10, 1, by = -0.5) # from 10 to 1 by steps of -0.5
seq(1, 10, length.out = 20) # 20 equally spaced values from 1 to 10
Python has some built-in functionality to deal with number sequences, but the numpy library is particularly helpful. We installed and loaded it previously, but if needed, re-run the following:
import numpy as np
Next, we can create several different number sequences:
list(range(1, 11)) # integers from 1 to 10
list(range(10, 0, -1)) # integers from 10 to 1
list(range(1, 11, 2)) # from 1 to 10 by steps of 2
list(np.arange(10, 0.5, -0.5)) # from 10 to 1 by steps of -0.5 (the stop value itself is excluded)
list(np.linspace(1, 10, num = 20)) # 20 equally spaced values from 1 to 10
4.4.4 Subsetting
Sometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we’ll see how to do this on the simple data structures we’ve covered so far.
In the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it’s important to be aware of some key differences. One of them is indexing.
Each item in a collection of data has a number, called an index. Now, it would be great if this was consistent across all programming languages, but it’s not.
R uses 1-based indexing whereas Python uses zero-based indexing. What does this mean? Compare the following:
plants <- c("tree", "shrub", "grass") # the index of "tree" is 1, "shrub" is 2 etc.
plants = ["tree", "shrub", "grass"] # the index of "tree" is 0, "shrub" is 1 etc.
Behind the scenes of any programming language there is a lot of counting going on. So, it matters if you count starting at zero or one. So, if I’d ask:
“Hey, R - give me the items with index 1 and 2 in plants” then I’d get tree and shrub.
If I’d ask that question in Python, then I’d get shrub and grass. Fun times.
In R we can use square brackets [ ] to extract values. Let’s explore this using our weather data:
weather # remind ourselves of the data
[1] "sunny" "cloudy" "partial_cloud" "cloudy"
[5] "sunny" "rainy"
weather[2] # extract the second value
[1] "cloudy"
weather[2:4] # extract the second to fourth value
[1] "cloudy" "partial_cloud" "cloudy"
weather[c(3, 1)] # extract the third and first value
[1] "partial_cloud" "sunny"
weather[-1] # extract all apart from the first value
[1] "cloudy" "partial_cloud" "cloudy" "sunny"
[5] "rainy"
In Python we also use square brackets [ ] to extract values. Let’s explore this using our weather data:
weather # remind ourselves of the data
['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
weather[1] # extract the second value
'cloudy'
weather[1:4] # extract the second to fourth value (end index is exclusive)
['cloudy', 'partial_cloud', 'cloudy']
weather[2], weather[0] # extract the third and first value
('partial_cloud', 'sunny')
weather[1:] # extract all apart from the first value
['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']
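The indexing rules for Python can be summarised in a short sketch using the plants example. The names `first`, `second_and_third`, `first_two` and `last` are assumptions chosen for illustration. Note in particular that a negative index means something different in Python than in R.

```python
plants = ["tree", "shrub", "grass"]

# Python counts from zero, so index 0 is the first item
first = plants[0]

# Asking for index 1 and 2 returns the second and third items
second_and_third = [plants[1], plants[2]]

# Slices include the start index but exclude the end index
first_two = plants[0:2]

# Negative indices count from the end (in R, -1 drops the first item instead)
last = plants[-1]

print(first, second_and_third, first_two, last)
```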
4.5 Dealing with missing data
It may seem weird that you have to consider what isn’t there, but that’s exactly what we do when we have missing data. Ideally, when we’re collecting data we have entries for every single thing we measure. But, alas, life is messy. That one patient may have missed an appointment, or one Eppendorf tube got dropped, and so on.
R includes the concept of missing data, meaning we can specify that a data point is missing. Missing data are represented as NA.
When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This makes it harder to overlook the cases where you are dealing with missing data. This is a good thing!
For example, let’s look at the following data, where we have measured six different patients and recorded their systolic blood pressure.
systolic_pressure <- c(125, 134, NA, 145, NA, 141)
We can see that we’re missing measurements for two of them. If we want to calculate the average systolic blood pressure across these patients, then we could use the mean() function. However, this results in NA:
mean(systolic_pressure)
[1] NA
You can add the argument na.rm = TRUE to various functions - including mean() - to calculate the result while ignoring the missing values. This stands for “remove missing values”.
mean(systolic_pressure, na.rm = TRUE)
[1] 136.25
There are quite a few ways that you can deal with missing data and we’ll discuss more of them in later sessions.
The built-in functionality of Python is not very good at dealing with missing data. This means that you normally need to deal with them manually.
One of the ways you can denote missing data in Python is with None. Let’s look at the following data, where we have measured six different patients and recorded their systolic blood pressure.
systolic_pressure = [125, 134, None, 145, None, 141]
Next, we’d have to filter out the missing values (don’t worry about the exact meaning of the code at this point):
filtered_data = [x for x in systolic_pressure if x is not None]
And lastly we would be able to calculate the mean value:
sum(filtered_data) / len(filtered_data)
There are quite a few (easier!) ways that you can deal with missing data and we’ll discuss more of them in later sessions, once we start dealing with tabular data.
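To illustrate why the manual filtering step is needed, the sketch below first tries to sum the raw list, which fails because None is not a number, and then wraps the filtering and averaging into a small function. The function `mean_ignoring_missing()` and the flag `raw_sum_worked` are hypothetical names written for this example, not part of Python itself.

```python
systolic_pressure = [125, 134, None, 145, None, 141]

# Built-in arithmetic cannot handle None, so summing the raw list fails
try:
    sum(systolic_pressure)
    raw_sum_worked = True
except TypeError:
    raw_sum_worked = False


def mean_ignoring_missing(values):
    """Hypothetical helper: average the values that are not None."""
    present = [value for value in values if value is not None]
    if not present:
        return None  # nothing to average
    return sum(present) / len(present)


mean_pressure = mean_ignoring_missing(systolic_pressure)
n_missing = systolic_pressure.count(None)
print(mean_pressure, n_missing)
```

Reporting the number of missing values alongside the mean keeps the missingness visible, much like R returning NA does.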
It may be tempting to simply remove all observations that contain missing data. It often makes the analysis easier! However, there is good reason to be more careful: doing so throws away good data.
Let’s look at the following hypothetical data set, where we use NA to denote missing values. We are interested in the average weight and age across the patients.
patient_id weight_kg age
N982 72 47
N821 68 49
N082 NA 63
N651 78 NA
We could remove all the rows that contain any missing data, thereby getting rid of the last two observations. However, that would mean we’d lose data on age from the penultimate row, and data on weight_kg from the last row.
Instead, it would be better to tell the computer to ignore missing values on a variable-by-variable basis and calculate the averages on the data that is there.
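The difference between the two approaches can be sketched in plain Python with the hypothetical data set above, using None in place of NA. The dictionary layout and the helper `mean_of_present()` are assumptions made for this illustration; later sessions cover more convenient tools for tabular data.

```python
# The hypothetical data set from above, with None marking missing values
patients = {
    "patient_id": ["N982", "N821", "N082", "N651"],
    "weight_kg": [72, 68, None, 78],
    "age": [47, 49, 63, None],
}


def mean_of_present(values):
    """Hypothetical helper: average only the values that are not None."""
    present = [value for value in values if value is not None]
    return sum(present) / len(present)


# Option 1: drop every patient (row) that has any missing value
complete_rows = [
    (weight, age)
    for weight, age in zip(patients["weight_kg"], patients["age"])
    if weight is not None and age is not None
]
mean_weight_complete = sum(w for w, _ in complete_rows) / len(complete_rows)
mean_age_complete = sum(a for _, a in complete_rows) / len(complete_rows)

# Option 2: ignore missing values variable by variable
mean_weight = mean_of_present(patients["weight_kg"])
mean_age = mean_of_present(patients["age"])

print(mean_weight_complete, mean_age_complete)  # each based on 2 patients
print(mean_weight, mean_age)                    # each based on 3 patients
```

Dropping incomplete rows leaves only two patients per average, whereas the variable-by-variable approach uses three for each, so the two options give different answers from the same data.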
4.6 Summary
- The most common data types include numerical, text and logical data.
- We can store data in single objects, enabling us to use the data.
- Multiple data points and types can be stored as different collections of data.
- We can make changes to objects and collections of data.
- We need to be explicit about missing data.