More Tidyverse

library(tidyverse)
set.seed(1234)

The tidyverse is a collection of packages built around a common style of data analysis. The idea is simple: keep data tidy (one variable per column, one observation per row), then solve tasks by chaining clearly named verbs.

Some data to work on

n <- 1000
df <- tibble(
  id = 1:n,
  date = sample(
    seq(ymd("2024-01-01"), ymd("2024-12-31"), by = "days"), 
    n, replace = TRUE
  ),
  age = rnorm(n, mean = 40, sd = 10) |> round(),
  sex = sample(
    c("M", "F"), 
    n, replace = TRUE
  ),
  weight = rnorm(n, mean = 75, sd = 10) |> round(),
  height = (
    rnorm(n, mean = 163, sd = 7) +
      as.numeric(date - ymd("2024-01-01")) / 36.5 +  # upward drift of about 10 cm across the year
      (sex == "M") * 7                               # males are 7 cm taller on average
  ) |> round(),
  note = sample(
    c("new-patient", "follow-up", "urgent"), 
    n, replace = TRUE
  ) |> factor()
)

df |> head(10) |> gt::gt(id = "patients_tbl")
id date age sex weight height note
1 2024-10-10 49 F 59 174 new-patient
2 2024-12-01 43 M 87 175 urgent
3 2024-04-10 41 M 68 174 new-patient
4 2024-04-20 34 F 84 167 new-patient
5 2024-05-12 42 F 74 170 follow-up
6 2024-04-07 38 F 76 164 urgent
7 2024-04-12 41 M 80 174 urgent
8 2024-08-01 34 M 66 170 urgent
9 2024-03-30 52 F 74 159 follow-up
10 2024-11-21 51 M 68 177 new-patient

dplyr in one minute

{dplyr} manipulates data with readable verbs. The most common pattern is a pipeline: data |> filter(...) |> mutate(...) |> summarise(...).
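As a minimal sketch of that pattern, the block below runs filter(), mutate() and summarise() on a small made-up table. The visits tibble, the bmi column and the adult cut-off of 18 are illustrative assumptions, not part of the lesson data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for a patient table
visits <- tibble(
  sex    = c("F", "F", "M", "M", "M"),
  age    = c(34, 52, 41, 17, 60),
  weight = c(60, 72, 80, 55, 90),      # kg
  height = c(165, 170, 180, 160, 175)  # cm
)

bmi_by_sex <- visits |>
  filter(age >= 18) |>                        # keep rows: adults only
  mutate(bmi = weight / (height / 100)^2) |>  # add a derived column
  group_by(sex) |>
  summarise(n = n(), mean_bmi = mean(bmi))    # one row per group

bmi_by_sex
```

Each verb takes a data frame and returns a data frame, which is what makes the steps chainable with |>.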

tidyr: reshape data

{tidyr} helps move between wide and long data and fill missing values.
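The reshaping functions are covered below. For the missing-value side, here is a small sketch using fill() and replace_na(); the lab_sheet table and its columns are invented for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical sheet where the patient id is written only on each patient's first row
lab_sheet <- tibble(
  patient = c("p1", NA, NA, "p2", NA),
  visit   = c(1, 2, 3, 1, 2),
  status  = c("ok", NA, "ok", NA, "ok")
)

lab_filled <- lab_sheet |>
  fill(patient, .direction = "down") |>   # carry the last seen id downwards
  replace_na(list(status = "unknown"))    # make remaining gaps explicit

lab_filled
```

fill() is appropriate only when a blank really means "same as above"; replace_na() is the safer choice when a gap should stay visibly distinct.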

pivot_longer() and pivot_wider()

These are ways of reshaping data. Let’s start with a dataset in wide format:

mortality_df <- tibble(
  group = c("A", "B"),
  mort_6mo = c(0.1,0.15),
  mort_1yr = c(0.2,0.25),
  mort_2yr = c(0.3,0.35),
  mort_5yr = c(0.5,0.55)
)
mortality_df |> gt::gt()
group mort_6mo mort_1yr mort_2yr mort_5yr
A 0.10 0.20 0.30 0.50
B 0.15 0.25 0.35 0.55

Now let’s make it long:

mortality_df_long <- mortality_df |> pivot_longer(
  cols = c(mort_6mo, mort_1yr, mort_2yr, mort_5yr),
  names_to = "time",
  values_to = "mortality",
  names_pattern = "mort_(.*)"
)

mortality_df_long |> gt::gt()
group time mortality
A 6mo 0.10
A 1yr 0.20
A 2yr 0.30
A 5yr 0.50
B 6mo 0.15
B 1yr 0.25
B 2yr 0.35
B 5yr 0.55

To go the other way, we can widen it back. Note that we control how values in the time column are turned into column names (here via names_glue), just as names_pattern controlled how column names became values above:

mortality_df_long |> pivot_wider(
  names_from = time,
  names_glue = "{time}_{.value}",
  values_from = mortality
)  |> gt::gt()
group 6mo_mortality 1yr_mortality 2yr_mortality 5yr_mortality
A 0.10 0.20 0.30 0.50
B 0.15 0.25 0.35 0.55
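If the goal is an exact round trip back to the original column names, the names_prefix argument is one option on both sides. The sketch below uses a cut-down, two-column version of the mortality table; the object names are illustrative.

```r
library(tidyr)
library(tibble)

mortality_wide <- tibble(
  group    = c("A", "B"),
  mort_6mo = c(0.10, 0.15),
  mort_1yr = c(0.20, 0.25)
)

mortality_long <- mortality_wide |>
  pivot_longer(
    cols = starts_with("mort_"),
    names_to = "time",
    values_to = "mortality",
    names_prefix = "mort_"   # strip the prefix instead of writing a regex
  )

wide_again <- mortality_long |>
  pivot_wider(
    names_from = time,
    values_from = mortality,
    names_prefix = "mort_"   # put the prefix back
  )

wide_again
```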

separate_wider_delim()

A simple way of splitting a column into multiple columns based on a delimiter:

data.frame(
  x = c("A_1", "B_2", "C_1", "D_2", "E_2")
) |> gt::gt()
x
A_1
B_2
C_1
D_2
E_2
data.frame(
  x = c("A_1", "B_2", "C_1", "D_2", "E_2")
) |> separate_wider_delim(
  x, delim = "_",
  names = c("letter", "number")
) |> gt::gt()
letter number
A 1
B 2
C 1
D 2
E 2
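Real columns are rarely this regular. As a hedged sketch, the too_few and too_many arguments of separate_wider_delim() control what happens when a value has fewer or more pieces than expected (by default both raise an error). The codes table is made up for illustration.

```r
library(tidyr)
library(tibble)

codes <- tibble(x = c("A_1", "B", "C_3_extra"))

codes_split <- codes |>
  separate_wider_delim(
    x, delim = "_",
    names = c("letter", "number"),
    too_few  = "align_start",  # "B" becomes letter = "B", number = NA
    too_many = "merge"         # "C_3_extra" becomes number = "3_extra"
  )

codes_split
```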

Other tidyverse packages (quick tour)

readr (import text data)

Purpose: fast, friendly file import. Useful functions: read_csv(), read_tsv(), write_csv().
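A minimal round trip, assuming a temporary file in place of a real path. Spelling out col_types is optional, but it avoids relying on type guessing (for example, a column holding only "F" and "T" would otherwise be guessed as logical).

```r
library(readr)
library(tibble)

patients <- tibble(id = 1:3, sex = c("F", "M", "F"), weight = c(59, 87, 68))

path <- tempfile(fileext = ".csv")  # hypothetical stand-in for a real file path
write_csv(patients, path)

patients_back <- read_csv(
  path,
  col_types = cols(
    id     = col_integer(),
    sex    = col_character(),
    weight = col_double()
  )
)

patients_back
```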

Tip

Related packages outside the core tidyverse: {readxl} for Excel files, {googlesheets4} for Google Sheets, and {tidyxl} for Excel files that are not laid out as a simple table.

tibble (modern data frame)

Tibbles are a special type of data.frame that provides a few benefits: cleaner printing and safer column handling. In addition, tibble() builds columns sequentially, so a column can refer to columns created earlier in the same tibble() call.

tibble(
  x = 1:3,
  y = x*3
) |> gt::gt()
x y
1 3
2 6
3 9
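To make "safer column handling" concrete, the sketch below contrasts a base data.frame with a tibble on two well-known points: partial matching with $ and single-column subsetting with [. The object names are illustrative.

```r
library(tibble)

base_df <- data.frame(weight = c(59, 87), height = c(174, 175))
tbl     <- as_tibble(base_df)

# Partial matching: a data.frame silently returns `weight` for `$wei`
base_df$wei
# A tibble returns NULL (with a warning) instead of guessing
partial <- suppressWarnings(tbl$wei)

# Single-column subsetting: a data.frame drops to a vector, a tibble stays a tibble
class(base_df[, "weight"])
class(tbl[, "weight"])
```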

{tibble} also provides add_row(), which appends a row; columns you do not supply are filled with NA:

df |> 
  head(5) |>
  add_row(id = -1, date = ymd("2024-05-05"), sex = "M") |> 
  gt::gt()
id date age sex weight height note
1 2024-10-10 49 F 59 174 new-patient
2 2024-12-01 43 M 87 175 urgent
3 2024-04-10 41 M 68 174 new-patient
4 2024-04-20 34 F 84 167 new-patient
5 2024-05-12 42 F 74 170 follow-up
-1 2024-05-05 NA M NA NA NA

purrr (iterate without explicit loops)

A clean way of applying a function to each element of a list or vector, for example to each subset of a dataset (see the previous lesson on Loops and parallelisation). Useful functions: map(), map_dbl(), map2().

df |> 
  select(weight, sex) |>
  group_by(sex) |>
  summarise(mean_weight = mean(weight)) |>
  gt::gt()
sex mean_weight
F 74.91634
M 75.69136

The same group means via split() and {purrr} (map_dfr() has been superseded since purrr 1.0.0 but still works):

split(df$weight, df$sex) |> 
  purrr::map_dfr(mean) |> gt::gt()
F M
74.91634 75.69136
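map_dbl() and map2() are listed above but not shown, so here is a small sketch of both. The weights are made-up values, not taken from df.

```r
library(purrr)

weights_by_sex <- list(F = c(59, 84, 74), M = c(87, 68, 80))  # hypothetical values

# map() always returns a list; map_dbl() returns a named numeric vector
means <- map_dbl(weights_by_sex, mean)

# map2() walks two inputs in parallel; here each group is centred on its own mean
centred <- map2(weights_by_sex, means, \(w, m) w - m)

means
centred
```

The typed variants (map_dbl(), map_chr(), map_lgl(), map_int()) fail loudly if a result is not a single value of the expected type, which is often what you want in a pipeline.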

stringr (string handling)

Purpose: consistent string tools. Useful functions: str_detect(), str_replace(), str_extract().

c("word", "hyphenated-word", "unhyphenated word") |> str_detect("-")
[1] FALSE  TRUE FALSE
df |> head(5) |> gt::gt()
id date age sex weight height note
1 2024-10-10 49 F 59 174 new-patient
2 2024-12-01 43 M 87 175 urgent
3 2024-04-10 41 M 68 174 new-patient
4 2024-04-20 34 F 84 167 new-patient
5 2024-05-12 42 F 74 170 follow-up
df |> 
  filter(
    str_detect(note, "-")
  ) |> 
  mutate(
    note_clean = str_replace(note, "-", " ")
  ) |> 
  head(5) |> gt::gt()
id date age sex weight height note note_clean
1 2024-10-10 49 F 59 174 new-patient new patient
3 2024-04-10 41 M 68 174 new-patient new patient
4 2024-04-20 34 F 84 167 new-patient new patient
5 2024-05-12 42 F 74 170 follow-up follow up
9 2024-03-30 52 F 74 159 follow-up follow up
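str_extract() is listed above but not demonstrated; the sketch below shows it next to the difference between str_replace() and str_replace_all(). The notes vector simply repeats the three note values used in df.

```r
library(stringr)

notes <- c("new-patient", "follow-up", "urgent")

# str_extract() returns the first match of a regular expression, or NA if there is none
before_hyphen <- str_extract(notes, "^[a-z]+")
after_hyphen  <- str_extract(notes, "(?<=-)[a-z]+")

# str_replace() changes only the first match; str_replace_all() changes every match
first_only <- str_replace("a-b-c", "-", " ")
every_one  <- str_replace_all("a-b-c", "-", " ")
```

For the notes in df the two replace functions give the same result, because each note contains at most one hyphen.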

forcats (factor handling)

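Purpose: robust factor workflows. Useful functions: fct_relevel(), fct_reorder(), and the fct_lump_*() family (fct_lump_n(), fct_lump_min(), fct_lump_prop()), which replaces the older fct_lump().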

df$note |> levels()
[1] "follow-up"   "new-patient" "urgent"     
df_releveled <- df |> mutate(
  # after = Inf moves the level to the end, however many levels there are
  note = fct_relevel(note, "follow-up", after = Inf)
)
df_releveled$note |> levels()
[1] "new-patient" "urgent"      "follow-up"  
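fct_relevel() sets the order by hand; fct_reorder() derives it from another variable, which is handy for ordering categories in plots. A minimal sketch, assuming a made-up wait_days vector that is not part of df:

```r
library(forcats)

note      <- factor(c("urgent", "follow-up", "new-patient", "urgent", "follow-up"))
wait_days <- c(1, 20, 10, 3, 30)  # hypothetical waiting times in days

# Order the levels by the median waiting time within each level (ascending)
note_by_wait <- fct_reorder(note, wait_days, .fun = median)

levels(note)          # alphabetical default
levels(note_by_wait)  # ordered by median wait
```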

lubridate (dates and times)

Purpose: parse and manipulate dates. Useful functions: ymd(), wday(), floor_date().

df |>
  select(id, date) |>
  mutate(
    weekday = wday(date, label = TRUE),
    prior_monday = floor_date(date, "week", week_start = 1)
  ) |> 
  head(5) |> gt::gt()
id date weekday prior_monday
1 2024-10-10 Thu 2024-10-07
2 2024-12-01 Sun 2024-11-25
3 2024-04-10 Wed 2024-04-08
4 2024-04-20 Sat 2024-04-15
5 2024-05-12 Sun 2024-05-06
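A few more date operations, sketched on single dates rather than on df. The specific dates are arbitrary examples; the last two lines illustrate a common surprise when adding months to an end-of-month date.

```r
library(lubridate)

d <- ymd("2024-10-10")

same_day    <- dmy("10/10/2024")      # parsers are named after the order of the components
month_start <- floor_date(d, "month") # first day of the month
next_month  <- d + months(1)
weekday_num <- wday(d, week_start = 1) # Monday = 1, so Thursday = 4

# 31 January plus one month does not exist, so plain `+` gives NA;
# %m+% rolls back to the last valid day of the month instead
plain_add  <- ymd("2024-01-31") + months(1)
rolled_add <- ymd("2024-01-31") %m+% months(1)
```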

ggplot2: a future lesson topic

library(wesanderson)
df |>
  ggplot(aes(x = date, y = height, color = sex, fill = sex)) +
  geom_point(size = 1.5, alpha = 0.5, shape = 20) +
  geom_smooth(method = "loess", formula = y~x, linewidth = 1, alpha = 0.5) +
  labs(
    title = "A plot of height over time",
    subtitle = "The grammar of graphics—building plots by adding layers",
    x = "Date",
    y = "Height (cm)",
    color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "top",
    axis.title.x = element_text(margin = margin(t = 20)),
    axis.title.y = element_text(margin = margin(r = 20))
  ) +
  scale_color_manual(
    values = wes_palette("Royal1"),
    labels = c("F" = "Female", "M" = "Male")
  ) +
  scale_fill_manual(
    values = wes_palette("Royal1"),
    labels = c("F" = "Female", "M" = "Male"),
    name = NULL
  )

Further reading

R for Data Science: Data transform

R for Data Science: Tidy data

Tidyverse documentation