3: Tabular Data

So far we have used individual vectors as data, but of course it is not particularly convenient to work with real data like that. Whether your data originally is stored in an Excel file or some sort of database system, it is usually stored in a tabular format, ideally with two key properties:

  1. Each column of your data is a variable (like age, weight, …)
  2. Each row of your data is an observation (like a single patient, a hospital, a country, …)

In the real world, data tends to come in all shapes and sizes, but for the purposes of analysis this is the shape to strive for.

Just like we made up some patient data to store as vectors, we can use the same approach to create tabular data, using the data.frame() function. Each argument of this function takes a vector just like we used before, so in a sense the data.frame is just a way to organize vectors that all must have the same length!

example_df <- data.frame(
  id = 1:5,
  age = c(35, 32, 26, 40, 34),
  weight_kg = c(75, 64, 71, 45, 85)
)

example_df
  id age weight_kg
1  1  35        75
2  2  32        64
3  3  26        71
4  4  40        45
5  5  34        85

Working with Tabular Data

We manipulated vectors by indexing them with [ ], things are more complicated with data.frames but fundamentally the idea is the same!

We can access individual columns of the data.frame like the vectors they are using either $ or double brackets for indexing [[ ]] where the index can be the integer index or the name of the variable, so these are all identical and result in a vector:

example_df$age
[1] 35 32 26 40 34
example_df[[2]]
[1] 35 32 26 40 34
example_df[["age"]]
[1] 35 32 26 40 34
example_df$id
[1] 1 2 3 4 5
example_df$weight_kg
[1] 75 64 71 45 85

Since this gives us a vector, we can do the usual vector things:

# First 5 age values
example_df$age[1:5]
[1] 35 32 26 40 34
# All age values under 30
example_df$age[example_df$age < 30]
[1] 26
# 2nd column, 3rd element
example_df[[2]][3]
[1] 26

This becomes inconvenient really quickly though, which is why there are better ways we’ll see soon.

Note that this way we can only ever get one vector, so this does not work as we might expect:

example_df[[c(1, 2)]]
example_df[[c("id", "age")]]

We can however use single brackets to access one or more columns, rows, or both, and it get’s a little more complex:

# First column, result is a data.frame
example_df[1]
  id
1  1
2  2
3  3
4  4
5  5
example_df["id"]
  id
1  1
2  2
3  3
4  4
5  5
# 3rd row
example_df[3, ]
  id age weight_kg
3  3  26        71
# First column, 5th row
example_df[5, 1]
[1] 5
example_df[5, "id"]
[1] 5

When we use single brackets, the result is still a data.frame unless in the last case where the result is a scalar. This can become somewhat tricky, but if you insist on the result being a data.frame you can add drop = FALSE:

example_df[5, 1]
[1] 5
example_df[5, 1, drop = FALSE]
  id
5  5

This is a “good to know and easy to forget” bit of information. Soon we will introduce mechanisms to make working with tabular data a bit more intuitive, but since you will most likely encounter this type of code in the wild, we’ll practice it a little soon.

example_df[example_df$age >= 30, ]
  id age weight_kg
1  1  35        75
2  2  32        64
4  4  40        45
5  5  34        85

External Data

Built-in

R comes with some built-in datasets that are “just there”:

sleep
mtcars
iris

They are built-in in the sense that they are part of the R installation - you can not accidentally delete them and you don’t need to do anything special to access them, which makes them the standard case for tutorials and examples — and of course also a little boring!

We can also install packages that provide additional data, and many packages that provide some sort of statistical or visualization functionality often include toy datasets to illustrate their use.
We’ll see examples of that later as well!

Reading and Writing Data from/to Files

A usual part of the workflow is to import data, clean it up, modify it in some way, and then storing it as a file for later use or to share it with collaborators.

There are many formats we could use, and they vary in how simple they are to use within R, how easy it is to read them with other software, and how certain you want to be that people are going to be able to read the data in 30 years.

For simplicity we’ll focus on two kinds of formats: R’s built-in format RDS, and plain-text delimited formats like CSV and TSV.

RDS

RDS has the benefit of being able to losslessly store any kind of R objects, which means that whatever you store as a file on your computer you’ll most likely be able to just read into R again and use it as if nothing happened.
This is why RDS is the most popular for “within R” workflow where you want to store intermediate results for example, or cleaned datasets you want to use later for analysis in a different R project.

We can store and restore data to RDS like this, using the example_df and the data directory in this project:

getwd()
[1] "/Users/Lukas/repos/github/bips-hb/course_introduction_to_r"
saveRDS(example_df, file = "data/example_df.rds")
example_rds <- readRDS(file = "data/example_df.rds")

We can check if the two are identical:

identical(example_df, example_rds)
[1] TRUE
  1. Create a data.frame called test_dfwith 3 variables
  2. Save to data/test.rds
  3. Read it back and name the variable read_test
test_df <- data.frame(
  letter = c("A", "B", "C", "hello"),
  height_cm = c(123, 346, 123, 7457),
  logical = c(TRUE, FALSE, TRUE, TRUE)
)
test_df
  letter height_cm logical
1      A       123    TRUE
2      B       346   FALSE
3      C       123    TRUE
4  hello      7457    TRUE
saveRDS(test_df, file = "data/test.rds")
read_test <- readRDS(file = "data/test.rds")

identical(test_df, read_test)
[1] TRUE

CSV

Comma separated values (CSV) are maybe the most common way to share data that’s not large enough to require database or strong compression. If you want to share data in a way that others can import it into Excel without a big headache, this is probably the way to go!
Also if you want to make sure that your great-great-great-grandchildren are still able to read your data without any special software.

It’s the lowest common denominator in many ways, and it can be very annoying to deal with.

In the simplest case we can read and write CSVs with read.csv()

write.csv(example_df, file = "data/example_csv123.csv", 
          row.names = FALSE)
example_csv <- read.csv(file = "data/example_csv123.csv")

Note that ?read.csv and ?write.csv show a few option that will be useful at some point, like sep. By default, we write and read comma-delimited files, but we could also do sep = ";" to use ; as a separating character!
When you try to read data with Excel for example, then using ; might make things easier.

When it comes to reading external data, tons of things can go wrong and for now, we just assume that it went okay.

For future reference though:

  • When in doubt, google around for how to best read the given data into R
  • Ensure to glance at the relevant documentation, e.g. for ?read.csv, if you run into errors or unexpected behavior
  • Once you’re used to installing packages, the readr package might be of interest here
  • Refer to this chapter in R4DS for more information and examples!
  • The package haven has functions to read and write data in formats used by SAS and SPSS
Tip

To keep a project organized, it’s good practice to

  • Store original, raw data in a dedicated directory like data-raw and never modify it!
  • Read it, clean it, and store a clean version of your data in data

That way, you should always your original data, your final data, and most importantly: Reproducible R code you can share with others that explains how to get from point A to point B!

Excursion: Packages

Before we move on, it’s time to take a look at this open source ecosystem we mentioned in the beginning. In R, this mostly means we can install packages. A package is a collection of functions and sometimes other objects which we can use for additional capabilities, and lets us benefit from other people’s work such that we don’t have to reinvent the wheel every time we do an analysis.

We can install packages using either RStudio’s “Packages” pane to the right, or we can use R functions directly.

For example, there is a package called "readxl" which allows us to Excel files (.xlsx), and we can install it like this:

install.packages("readxl")

We can then load the package using the library() function.

library(readxl)

We can now use functions from the package, and if we don’t know where to start, usually ?readxl is a good entrypoint.

Keep in mind:

  1. You have to install a package once into your package “library”
  2. You have to load the package every time you restart R
  3. You should occasionally update your installed packages using update.packages(ask = FALSE).

As an example for what we would use readxl for, let’s read an example dataset stored on data-raw in this folder in the .xlsx format.

example_df <- readxl::read_excel(path = "data-raw/example.xlsx")

Here, the function read_excel() is provided by the package readxl and only available when that package is installed and loaded! An error message like

could not find function "read_excel"

indicated that you either have not loaded the required package or maybe mistyped the function name.

Note

Once a package is installed, it is also possible to use one of its functions without loading it beforehand by using the :: syntax, like readxl::read_excel()

We won’t do that here yet, but in the wild you might find people referring to a function in the format somePackage::some_fun() to clarify they mean some_fun() from the package somePackage!

Your Turn

We will now move on to using more packages, which of course first need installation. Unless you have already done so, install the following packages:

  • “tidyverse”
  • “gapminder”
  • “gtsummary”
  • “table1”
  • “ggstatsplot”
  • “ggthemes”
  • “readr”

Alternatively you can install the package pak first, and then install the packages above using pkg_install():

library(pak)
pkg_install("tidyverse")

pak should be a bit faster and depending on your operating system also more robust, and it brings a few features not relevant yet.
If you use a Linux distribution like Ubuntu, pak can save you a lot of annoyances.

In any case, this might take a moment!

Knowing Your Data

As mentioned, accessing tabular data is a bit more complex than it is for vectors, and in the next section we will learn about convenient tools for this case. However, there are some common built-in tools that R provides for this which you should be at least aware of.

We’ll start with a new dataset from one of the packages you just installed: gapminder This dataset is based on the Gapminder data, and also part of Hans Rosling’s almost 20 year old TED talk.

First we load the data which in this case means to just load the gapminder package to access the dataset with the same name (confusingly so)

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

We can also view the dataset in RStudio with View(), and look at its documentation with ?:

View(gapminder)
?gapminder

To see the number of rows and columns, we can use nrow(), ncol(), or dim().

nrow(gapminder)
[1] 1704
ncol(gapminder)
[1] 6
dim(gapminder)
[1] 1704    6

We often want to see the first or last few rows using head() and tail() (note: these also work on vectors!)

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
tail(gapminder)
# A tibble: 6 × 6
  country  continent  year lifeExp      pop gdpPercap
  <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
1 Zimbabwe Africa     1982    60.4  7636524      789.
2 Zimbabwe Africa     1987    62.4  9216418      706.
3 Zimbabwe Africa     1992    60.4 10704340      693.
4 Zimbabwe Africa     1997    46.8 11404948      792.
5 Zimbabwe Africa     2002    40.0 11926563      672.
6 Zimbabwe Africa     2007    43.5 12311143      470.

We can also use str() to get a basic overview of the data, including each variables type:

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Don’t forget we can also refer to the documentation using ?gapminder to learn about what each variable means!

There is also something new there: A type of data we have not encountered before: The factor! factors are for categorical data, particularly the kind where we know beforehand how many different categories there are and what they are called.

A quick example:

fruit <- factor(c("apple", "lemon", "apple"))
fruit
[1] apple lemon apple
Levels: apple lemon
levels(fruit)
[1] "apple" "lemon"

We can set the levels of a factor manually or change them later.
A factor can have a level that does not appear on the data, but it will “know” that there is an extra category, which helps us keep an overvoew on what we expect!

fruit <- factor(
   c("orange", "apple", "apple", "orange", "apple", "orange"), 
   levels = c("apple", "banana", "orange")
)
fruit
[1] orange apple  apple  orange apple  orange
Levels: apple banana orange
levels(fruit)
[1] "apple"  "banana" "orange"

The table() function gives us a simple frequency table of the data, and since it expects there to be "banana", it shows us a frequency of 0:

table(fruit)
fruit
 apple banana orange 
     3      0      3 
fruit_char <- c("orange", "apple", "apple", "orange", "apple", "orange")

table(fruit_char)
fruit_char
 apple orange 
     3      3 

factors are like characters but with additional information stored in them, and dealing with them can be a bit counter-intuitive. For now, we’re happy with knowing that they exists and that they are useful for categorical data!

In gapminder, there are two factor variables as we have seen using str(): country and continent.

We can use the aforementioned table() to get a broad overview of what we are dealing with

table(gapminder$continent)

  Africa Americas     Asia   Europe  Oceania 
     624      300      396      360       24 

We will get a better overview later, but for now we’ll practice the basics for a moment

Your Turn

Using the gapminder data, try to find out the following:

  1. Select the variables country, pop and lifeExp
  2. Filter the dataset to the year 2007
  3. For the country of Mauritius, what is the population in the year 1997?
  4. How many different countries are in the dataset?

Hint: The function unique() is similar to levels() for factors, but more general – both can be combined with length()!