example_df <- data.frame(
id = 1:5,
age = c(35, 32, 26, 40, 34),
weight_kg = c(75, 64, 71, 45, 85)
)
example_df id age weight_kg
1 1 35 75
2 2 32 64
3 3 26 71
4 4 40 45
5 5 34 85
So far we have used individual vectors as data, but of course it is not particularly convenient to work with real data like that. Whether your data originally is stored in an Excel file or some sort of database system, it is usually stored in a tabular format, ideally with two key properties:
In the real world, data tends to come in all shapes and sizes, but for the purposes of analysis this is the shape to strive for.
Just like we made up some patient data to store as vectors, we can use the same approach to create tabular data, using the data.frame() function. Each argument of this function takes a vector just like we used before, so in a sense the data.frame is just a way to organize vectors that all must have the same length!
example_df <- data.frame(
id = 1:5,
age = c(35, 32, 26, 40, 34),
weight_kg = c(75, 64, 71, 45, 85)
)
example_df id age weight_kg
1 1 35 75
2 2 32 64
3 3 26 71
4 4 40 45
5 5 34 85
We manipulated vectors by indexing them with [ ], things are more complicated with data.frames but fundamentally the idea is the same!
We can access individual columns of the data.frame like the vectors they are using either $ or double brackets for indexing [[ ]] where the index can be the integer index or the name of the variable, so these are all identical and result in a vector:
[1] 35 32 26 40 34
[1] 35 32 26 40 34
[1] 35 32 26 40 34
[1] 1 2 3 4 5
[1] 75 64 71 45 85
Since this gives us a vector, we can do the usual vector things:
[1] 35 32 26 40 34
[1] 26
[1] 26
This becomes inconvenient really quickly though, which is why there are better ways we’ll see soon.
Note that this way we can only ever get one vector, so this does not work as we might expect:
We can however use single brackets to access one or more columns, rows, or both, and it get’s a little more complex:
id
1 1
2 2
3 3
4 4
5 5
id
1 1
2 2
3 3
4 4
5 5
id age weight_kg
3 3 26 71
[1] 5
[1] 5
When we use single brackets, the result is still a data.frame unless in the last case where the result is a scalar. This can become somewhat tricky, but if you insist on the result being a data.frame you can add drop = FALSE:
This is a “good to know and easy to forget” bit of information. Soon we will introduce mechanisms to make working with tabular data a bit more intuitive, but since you will most likely encounter this type of code in the wild, we’ll practice it a little soon.
R comes with some built-in datasets that are “just there”:
They are built-in in the sense that they are part of the R installation - you can not accidentally delete them and you don’t need to do anything special to access them, which makes them the standard case for tutorials and examples — and of course also a little boring!
We can also install packages that provide additional data, and many packages that provide some sort of statistical or visualization functionality often include toy datasets to illustrate their use.
We’ll see examples of that later as well!
A usual part of the workflow is to import data, clean it up, modify it in some way, and then storing it as a file for later use or to share it with collaborators.
There are many formats we could use, and they vary in how simple they are to use within R, how easy it is to read them with other software, and how certain you want to be that people are going to be able to read the data in 30 years.
For simplicity we’ll focus on two kinds of formats: R’s built-in format RDS, and plain-text delimited formats like CSV and TSV.
RDS has the benefit of being able to losslessly store any kind of R objects, which means that whatever you store as a file on your computer you’ll most likely be able to just read into R again and use it as if nothing happened.
This is why RDS is the most popular for “within R” workflow where you want to store intermediate results for example, or cleaned datasets you want to use later for analysis in a different R project.
We can store and restore data to RDS like this, using the example_df and the data directory in this project:
[1] "/Users/Lukas/repos/github/bips-hb/course_introduction_to_r"
We can check if the two are identical:
test_dfwith 3 variablesdata/test.rdsread_testtest_df <- data.frame(
letter = c("A", "B", "C", "hello"),
height_cm = c(123, 346, 123, 7457),
logical = c(TRUE, FALSE, TRUE, TRUE)
)
test_df letter height_cm logical
1 A 123 TRUE
2 B 346 FALSE
3 C 123 TRUE
4 hello 7457 TRUE
saveRDS(test_df, file = "data/test.rds")
read_test <- readRDS(file = "data/test.rds")
identical(test_df, read_test)[1] TRUE
Comma separated values (CSV) are maybe the most common way to share data that’s not large enough to require database or strong compression. If you want to share data in a way that others can import it into Excel without a big headache, this is probably the way to go!
Also if you want to make sure that your great-great-great-grandchildren are still able to read your data without any special software.
It’s the lowest common denominator in many ways, and it can be very annoying to deal with.
In the simplest case we can read and write CSVs with read.csv()
Note that ?read.csv and ?write.csv show a few option that will be useful at some point, like sep. By default, we write and read comma-delimited files, but we could also do sep = ";" to use ; as a separating character!
When you try to read data with Excel for example, then using ; might make things easier.
When it comes to reading external data, tons of things can go wrong and for now, we just assume that it went okay.
For future reference though:
?read.csv, if you run into errors or unexpected behaviorreadr package might be of interest herehaven has functions to read and write data in formats used by SAS and SPSSTo keep a project organized, it’s good practice to
data-raw and never modify it!dataThat way, you should always your original data, your final data, and most importantly: Reproducible R code you can share with others that explains how to get from point A to point B!
Before we move on, it’s time to take a look at this open source ecosystem we mentioned in the beginning. In R, this mostly means we can install packages. A package is a collection of functions and sometimes other objects which we can use for additional capabilities, and lets us benefit from other people’s work such that we don’t have to reinvent the wheel every time we do an analysis.
We can install packages using either RStudio’s “Packages” pane to the right, or we can use R functions directly.
For example, there is a package called "readxl" which allows us to Excel files (.xlsx), and we can install it like this:
We can then load the package using the library() function.
We can now use functions from the package, and if we don’t know where to start, usually ?readxl is a good entrypoint.
Keep in mind:
update.packages(ask = FALSE).As an example for what we would use readxl for, let’s read an example dataset stored on data-raw in this folder in the .xlsx format.
Here, the function read_excel() is provided by the package readxl and only available when that package is installed and loaded! An error message like
could not find function "read_excel"
indicated that you either have not loaded the required package or maybe mistyped the function name.
Once a package is installed, it is also possible to use one of its functions without loading it beforehand by using the :: syntax, like readxl::read_excel()
We won’t do that here yet, but in the wild you might find people referring to a function in the format somePackage::some_fun() to clarify they mean some_fun() from the package somePackage!
We will now move on to using more packages, which of course first need installation. Unless you have already done so, install the following packages:
Alternatively you can install the package pak first, and then install the packages above using pkg_install():
pak should be a bit faster and depending on your operating system also more robust, and it brings a few features not relevant yet.
If you use a Linux distribution like Ubuntu, pak can save you a lot of annoyances.
In any case, this might take a moment!
As mentioned, accessing tabular data is a bit more complex than it is for vectors, and in the next section we will learn about convenient tools for this case. However, there are some common built-in tools that R provides for this which you should be at least aware of.
We’ll start with a new dataset from one of the packages you just installed: gapminder This dataset is based on the Gapminder data, and also part of Hans Rosling’s almost 20 year old TED talk.
First we load the data which in this case means to just load the gapminder package to access the dataset with the same name (confusingly so)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
We can also view the dataset in RStudio with View(), and look at its documentation with ?:
To see the number of rows and columns, we can use nrow(), ncol(), or dim().
We often want to see the first or last few rows using head() and tail() (note: these also work on vectors!)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Zimbabwe Africa 1982 60.4 7636524 789.
2 Zimbabwe Africa 1987 62.4 9216418 706.
3 Zimbabwe Africa 1992 60.4 10704340 693.
4 Zimbabwe Africa 1997 46.8 11404948 792.
5 Zimbabwe Africa 2002 40.0 11926563 672.
6 Zimbabwe Africa 2007 43.5 12311143 470.
We can also use str() to get a basic overview of the data, including each variables type:
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Don’t forget we can also refer to the documentation using ?gapminder to learn about what each variable means!
There is also something new there: A type of data we have not encountered before: The factor! factors are for categorical data, particularly the kind where we know beforehand how many different categories there are and what they are called.
A quick example:
[1] apple lemon apple
Levels: apple lemon
[1] "apple" "lemon"
We can set the levels of a factor manually or change them later.
A factor can have a level that does not appear on the data, but it will “know” that there is an extra category, which helps us keep an overvoew on what we expect!
fruit <- factor(
c("orange", "apple", "apple", "orange", "apple", "orange"),
levels = c("apple", "banana", "orange")
)
fruit[1] orange apple apple orange apple orange
Levels: apple banana orange
[1] "apple" "banana" "orange"
The table() function gives us a simple frequency table of the data, and since it expects there to be "banana", it shows us a frequency of 0:
fruit
apple banana orange
3 0 3
fruit_char
apple orange
3 3
factors are like characters but with additional information stored in them, and dealing with them can be a bit counter-intuitive. For now, we’re happy with knowing that they exists and that they are useful for categorical data!
In gapminder, there are two factor variables as we have seen using str(): country and continent.
We can use the aforementioned table() to get a broad overview of what we are dealing with
We will get a better overview later, but for now we’ll practice the basics for a moment
Using the gapminder data, try to find out the following:
country, pop and lifeExpHint: The function unique() is similar to levels() for factors, but more general – both can be combined with length()!