The Tidyverse is a collection of R packages created and maintained by the same group of people, largely from the same company that also created RStudio: Posit!
The Tidyverse is designed to make data science easier, more efficient, and more fun.
It is built around the idea of tidy data, which in essence just means that data is organized in a way that makes it easy to work with. In the previous section we have already heard the two core properties of tidy data:
Each column of your data is a variable
Each row of your data is an observation
From R4DS Chapter 5
The tidyverse was created in part to make it easy to get your data into that shape and to work with it once it is there. If R in general is a language, then we can think of the “tidyverse” as a kind of dialect: it’s still R, but it has a specific context and use case, and it is spoken by a community of people doing roughly similar kinds of data analysis.
Let’s start by loading the tidyverse meta-package.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You’ll see a lot of information there, and that’s nothing to worry about. What happened here is the following:
The tidyverse package’s main purpose is to load other packages, like dplyr, ggplot2, tidyr, and so on. These are the actual tidyverse packages that provide the functionality we want!
It warns you about conflicting function names, like filter and lag. R has built-in functions of the same name (in the stats package) that do very different things than their tidyverse counterparts. For the most part that is not a problem, unless you try to use filter from the dplyr package but forgot to load dplyr beforehand!
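To make the conflict concrete, here is a minimal sketch using the built-in mtcars data (chosen here purely for illustration, it is not part of the original material). Writing the package name in front of the function with :: removes any ambiguity about which filter() is meant:

```r
library(dplyr)

# dplyr's filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats' filter() does something entirely different: it applies a linear
# filter (here a 3-point moving average) to a numeric series
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```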
Note that in the tidyverse, the tibble enhances the data.frame!
Tibbles are similar to data.frames, but they look nice and avoid some potentially confusing issues.
For now we don’t need to know more, but for future reference: there is no need to worry when you encounter a tibble!
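One example of such a confusing issue, as a small sketch with made-up data: selecting a single column with [ turns a data.frame into a plain vector, while a tibble stays a tibble.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting a single column with [ silently drops a data.frame to a vector ...
class(df[, "x"])

# ... while a tibble always stays a tibble
class(tb[, "x"])
```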
Tidyverse Basics: Pipes and Verbs
We start by loading the gapminder package again for its dataset, just like we did before, but now we can glimpse it rather than str it:
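A minimal sketch of that step, assuming the gapminder package is installed. The base R subsetting at the end is a reconstruction and not necessarily the exact code used originally; it picks the countries in Asia in 1967 and three columns, which is consistent with the output shown next:

```r
library(dplyr)      # provides glimpse()
library(gapminder)  # provides the gapminder dataset

# glimpse() prints one line per column: name, type, and the first few values
glimpse(gapminder)

# Base R: subset rows with a logical vector and columns by (quoted) name
asia_1967 <- gapminder[
  gapminder$continent == "Asia" & gapminder$year == 1967,
  c("country", "lifeExp", "pop")
]
asia_1967
```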
# A tibble: 33 × 3
country lifeExp pop
<fct> <dbl> <int>
1 Afghanistan 34.0 11537966
2 Bahrain 59.9 202182
3 Bangladesh 43.5 62821884
4 Cambodia 45.4 6960067
5 China 58.4 754550000
6 Hong Kong, China 70 3722800
7 India 47.2 506000000
8 Indonesia 46.0 109343000
9 Iran 52.5 26538000
10 Iraq 54.5 8519282
# ℹ 23 more rows
In the tidyverse, or specifically dplyr syntax, we would write it like this:
gapminder |>
  filter(continent == "Asia" & year == 1967) |>
  select(country, lifeExp, pop)
# A tibble: 33 × 3
country lifeExp pop
<fct> <dbl> <int>
1 Afghanistan 34.0 11537966
2 Bahrain 59.9 202182
3 Bangladesh 43.5 62821884
4 Cambodia 45.4 6960067
5 China 58.4 754550000
6 Hong Kong, China 70 3722800
7 India 47.2 506000000
8 Indonesia 46.0 109343000
9 Iran 52.5 26538000
10 Iraq 54.5 8519282
# ℹ 23 more rows
What happened here? Let’s dissect this one by one:
We started with the gapminder dataset, and then, via the “pipe” |>,
We used filter() to select rows using a logical expression, and then
We used select() to select specific columns (without having to quote them with "!)
This is a combination of the pipe-syntax, which passes things down to the next function, and the most common tidyverse verbs that are at the core of most data operations.
The pipe |> works everywhere in R, and x |> foo() is just a different way to write foo(x):
x <- 1:10
length(x)
[1] 10
x |> length()
[1] 10
This becomes really powerful once we chain many functions.
Consider a made-up example:
x |>
  do_the_thing() |>
  do_the_other_thing() |>
  twist_it() |>
  shake_it_around() |>
  do_a_little_dance()
Here you can read the code top to bottom, and understand the sequence of events just by following the code. Consider what this would look like without the |>:
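The functions above are made up, so here is a runnable stand-in using base R functions (the particular functions are chosen only for illustration). Without the pipe, the same sequence has to be written inside out:

```r
x <- c(1, 4, 9, 16)

# Piped: read top to bottom
piped <- x |>
  sqrt() |>
  sum() |>
  round(digits = 1)

# Nested: the first step is innermost, so it reads inside out
nested <- round(sum(sqrt(x)), digits = 1)

identical(piped, nested)
```

Both versions compute the same value; the piped one simply lists the steps in the order they happen.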
# A tibble: 1,704 × 3
year continent country
<int> <fct> <fct>
1 1952 Asia Afghanistan
2 1952 Europe Albania
3 1952 Africa Algeria
4 1952 Africa Angola
5 1952 Americas Argentina
6 1952 Oceania Australia
7 1952 Europe Austria
8 1952 Asia Bahrain
9 1952 Asia Bangladesh
10 1952 Europe Belgium
# ℹ 1,694 more rows
Or sorting descendingly with the desc() helper function:
# A tibble: 1,704 × 3
year continent country
<int> <fct> <fct>
1 2007 Asia Afghanistan
2 2007 Europe Albania
3 2007 Africa Algeria
4 2007 Africa Angola
5 2007 Americas Argentina
6 2007 Oceania Australia
7 2007 Europe Austria
8 2007 Asia Bahrain
9 2007 Asia Bangladesh
10 2007 Europe Belgium
# ℹ 1,694 more rows
For a numeric variable like year we could also just sort by a negative of the variable:
# A tibble: 1,704 × 3
year continent country
<int> <fct> <fct>
1 2007 Asia Afghanistan
2 2007 Europe Albania
3 2007 Africa Algeria
4 2007 Africa Angola
5 2007 Americas Argentina
6 2007 Oceania Australia
7 2007 Europe Austria
8 2007 Asia Bahrain
9 2007 Asia Bangladesh
10 2007 Europe Belgium
# ℹ 1,694 more rows
But desc() has the benefit of also working for character (sorted alphabetically) or factor variables (sorted by their levels), which makes desc() applicable in more cases.
# A tibble: 1,704 × 3
year continent country
<int> <fct> <fct>
1 1952 Oceania Australia
2 1957 Oceania Australia
3 1962 Oceania Australia
4 1967 Oceania Australia
5 1972 Oceania Australia
6 1977 Oceania Australia
7 1982 Oceania Australia
8 1987 Oceania Australia
9 1992 Oceania Australia
10 1997 Oceania Australia
# ℹ 1,694 more rows
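The four sorted outputs above are consistent with arrange() calls along these lines. This is a sketch that assumes the three columns were picked with select() first; the exact original calls are not shown, and the object names are made up:

```r
library(dplyr)
library(gapminder)

gap_small <- gapminder |>
  select(year, continent, country)

# Ascending by year (the default)
by_year <- gap_small |> arrange(year)

# Descending by year, with desc() or by negating the numeric variable
by_year_desc <- gap_small |> arrange(desc(year))
by_year_neg  <- gap_small |> arrange(-year)

# desc() also works on factors: continents in reverse level order
by_continent_desc <- gap_small |> arrange(desc(continent))
```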
Note that in filter() you use logical expressions as we’ve seen in section 2!
You can combine multiple conditions by passing them as separate, comma-separated arguments, which inside filter() is the same as combining them with the logical AND operator &:
# Explicitly using AND & to combine year and country conditions
gapminder |>
  filter(year == 1997 & country == "Iceland")
# A tibble: 1 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Iceland Europe 1997 79.0 271192 28061.
# Comma-separated conditions are combined with AND in the same way (note the different year condition)
gapminder |>
  filter(
    year > 1990,
    country == "Iceland"
  )
# A tibble: 4 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Iceland Europe 1992 78.8 259012 25144.
2 Iceland Europe 1997 79.0 271192 28061.
3 Iceland Europe 2002 80.5 288030 31163.
4 Iceland Europe 2007 81.8 301931 36181.
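To check the equivalence directly, the following sketch applies the same two conditions once with & and once as comma-separated arguments, then compares the results. The last call shows the logical OR |, which has no comma shorthand. The object names are made up for this example:

```r
library(dplyr)
library(gapminder)

with_and   <- gapminder |> filter(year == 1997 & country == "Iceland")
with_comma <- gapminder |> filter(year == 1997, country == "Iceland")

# Same rows either way
all.equal(with_and, with_comma)

# OR has to be written out: rows from 1997 *or* rows for Iceland
either <- gapminder |> filter(year == 1997 | country == "Iceland")
nrow(either)
```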
As a more advanced example, we can use select()-helper functions to select variables that start with a certain pattern or contain a certain word:
gapminder |>
  select(starts_with("c"))
# A tibble: 1,704 × 2
country continent
<fct> <fct>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# ℹ 1,694 more rows
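Two further helpers work the same way. As a short sketch, contains() matches a piece of text anywhere in the column name and ends_with() matches the end of it; names() is used here only to keep the output short:

```r
library(dplyr)
library(gapminder)

# Columns whose name contains "Per"
per_cols <- gapminder |> select(contains("Per")) |> names()
per_cols

# Columns whose name ends in "p" (matching ignores case by default)
p_cols <- gapminder |> select(ends_with("p")) |> names()
p_cols
```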
# A tibble: 1,704 × 8
country continent year lifeExp pop gdpPercap pop_m gdpPercap_m
<fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 8.43 0
2 Afghanistan Asia 1957 30.3 9240934 821. 9.24 0
3 Afghanistan Asia 1962 32.0 10267083 853. 10.3 0
4 Afghanistan Asia 1967 34.0 11537966 836. 11.5 0
5 Afghanistan Asia 1972 36.1 13079460 740. 13.1 0
6 Afghanistan Asia 1977 38.4 14880372 786. 14.9 0
7 Afghanistan Asia 1982 39.9 12881816 978. 12.9 0
8 Afghanistan Asia 1987 40.8 13867957 852. 13.9 0
9 Afghanistan Asia 1992 41.7 16317921 649. 16.3 0
10 Afghanistan Asia 1997 41.8 22227415 635. 22.2 0
# ℹ 1,694 more rows
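The pop_m column in the output above is consistent with dividing pop by one million inside mutate(). A sketch of that step follows; the calculation behind gdpPercap_m is not shown in this section, so it is left out here, and the name gapminder_m is made up:

```r
library(dplyr)
library(gapminder)

gapminder_m <- gapminder |>
  mutate(pop_m = pop / 1e6)  # population in millions

gapminder_m
```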
Other Useful Verbs
Changing Shapes
A more advanced topic but often needed for data manipulation is reshaping your data using the tidyr package.
This might be best explained by example.
Say we want to calculate the difference in life expectancy between 1952 and 2007 for each country and continent, but using the verbs we’ve seen so far we can’t do that with a simple mutate(). We can instead use pivot_wider() which is a function that takes a “long” dataset and pivots it into a “wide” dataset:
# A tibble: 284 × 7
country continent year lifeExp pop gdpPercap pop_m
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 8.43
2 Albania Europe 1952 55.2 1282697 1601. 1.28
3 Algeria Africa 1952 43.1 9279525 2449. 9.28
4 Angola Africa 1952 30.0 4232095 3521. 4.23
5 Argentina Americas 1952 62.5 17876956 5911. 17.9
6 Australia Oceania 1952 69.1 8691212 10040. 8.69
7 Austria Europe 1952 66.8 6927772 6137. 6.93
8 Bahrain Asia 1952 50.9 120447 9867. 0.120
9 Bangladesh Asia 1952 37.5 46886859 684. 46.9
10 Belgium Europe 1952 68 8730405 8343. 8.73
# ℹ 274 more rows
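Starting from a two-year subset like the one above, a sketch of the reshaping step might look as follows. The column prefix lifeExp_ and the names lifeexp_change and lifeExp_diff are choices made here for illustration:

```r
library(dplyr)
library(tidyr)
library(gapminder)

lifeexp_change <- gapminder |>
  filter(year %in% c(1952, 2007)) |>
  # Keep only the identifying columns and the value to spread out;
  # other varying columns (pop, gdpPercap) would prevent rows from collapsing
  select(country, continent, year, lifeExp) |>
  pivot_wider(
    names_from = year,
    values_from = lifeExp,
    names_prefix = "lifeExp_"
  ) |>
  # With one row per country, the difference is a simple mutate()
  mutate(lifeExp_diff = lifeExp_2007 - lifeExp_1952)

lifeexp_change
```

After pivoting there is one row per country, with the two years side by side as columns, which is what makes the row-wise subtraction possible.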
bind_cols() combines datasets by columns, similar to base R’s cbind().
It can be used, for example, to create two year-specific variables similar to the pivot_wider() example above.
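A sketch of that approach, assuming the gapminder data as loaded earlier; the names life_1952, life_2007 and life_wide are made up for this example:

```r
library(dplyr)
library(gapminder)

life_1952 <- gapminder |>
  filter(year == 1952) |>
  select(country, continent, lifeExp_1952 = lifeExp)

life_2007 <- gapminder |>
  filter(year == 2007) |>
  select(lifeExp_2007 = lifeExp)

# bind_cols() matches rows purely by position, so both pieces must have
# the same number of rows in the same order
life_wide <- bind_cols(life_1952, life_2007)
life_wide
```

Because nothing checks that row i refers to the same country in both pieces, this only works here because both subsets keep the original country ordering.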