More Tidyverse

library(tidyverse)
set.seed(1234)

The tidyverse is a collection of packages built for a common style of data analysis. The idea is simple: keep data in tidy columns, then solve tasks by chaining clear verbs.

Some data to work on

n <- 1000
df <- tibble(
  id = 1:n,
  date = sample(
    seq(ymd("2024-01-01"), ymd("2024-12-31"), by = "days"), 
    n, replace = TRUE
  ),
  age = rnorm(n, mean = 40, sd = 10) |> round(),
  sex = sample(
    c("M", "F"), 
    n, replace = TRUE
  ),
  weight = rnorm(n, mean = 75, sd = 10) |> round(),
  height = (
    rnorm(n, mean = 163, sd = 7) +
      (date - ymd("2024-01-01")) / 36.5 +  # upward drift of about 10 cm over the year
      (sex == "M") * 7                     # males are 7 cm taller on average
  ) |> as.numeric() |> round(),
  note = sample(
    c("new-patient", "follow-up", "urgent"), 
    n, replace = TRUE
  ) |> factor()
)

df |> head(10) |> gt::gt(id = "patients_tbl")

id	date	age	sex	weight	height	note
1	2024-10-10	49	F	59	174	new-patient
2	2024-12-01	43	M	87	175	urgent
3	2024-04-10	41	M	68	174	new-patient
4	2024-04-20	34	F	84	167	new-patient
5	2024-05-12	42	F	74	170	follow-up
6	2024-04-07	38	F	76	164	urgent
7	2024-04-12	41	M	80	174	urgent
8	2024-08-01	34	M	66	170	urgent
9	2024-03-30	52	F	74	159	follow-up
10	2024-11-21	51	M	68	177	new-patient

`dplyr` in one minute

{dplyr} is for data manipulation with readable verbs. The most common pattern is a pipeline that keeps rows, adds columns, and then collapses to a summary: data |> filter(...) |> mutate(...) |> summarise(...).
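As a minimal sketch of that pattern, the snippet below filters rows, derives a column, and summarises by group. The `visits` tibble and its values are made up for illustration and are not the lesson's `df`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (assumption: not part of the lesson's df)
visits <- tibble(
  sex    = c("F", "M", "F", "M", "F"),
  age    = c(34, 52, 41, 29, 60),
  weight = c(62, 85, 70, 78, 66),      # kg
  height = c(165, 180, 170, 175, 160)  # cm
)

bmi_by_sex <- visits |>
  filter(age >= 30) |>                        # keep only rows that match a condition
  mutate(bmi = weight / (height / 100)^2) |>  # add a derived column
  group_by(sex) |>
  summarise(n = n(), mean_bmi = mean(bmi))    # collapse to one row per group

bmi_by_sex
```

Each verb takes a data frame and returns a data frame, which is what makes the steps easy to chain with the pipe.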

`tidyr`: reshape data

{tidyr} helps you reshape data between wide and long layouts and fill in missing values.
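Reshaping is covered in the next section. For the missing-value side, here is a small sketch using fill() and replace_na(). The `ward_log` tibble is hypothetical, and replacing missing scores with 0 is only there to show the mechanics, not a recommendation.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: the ward label is recorded only on the first row of each block
ward_log <- tibble(
  ward  = c("A", NA, NA, "B", NA),
  score = c(3, NA, 5, NA, 2)
)

ward_filled <- ward_log |>
  fill(ward, .direction = "down") |>  # carry the last non-missing ward forward
  replace_na(list(score = 0))         # replace missing scores with a fixed value

ward_filled
```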

`pivot_longer()` and `pivot_wider()`

pivot_longer() turns columns into rows, and pivot_wider() does the reverse. Let’s start with a dataset in wide format:

mortality_df <- tibble(
  group = c("A", "B"),
  mort_6mo = c(0.1,0.15),
  mort_1yr = c(0.2,0.25),
  mort_2yr = c(0.3,0.35),
  mort_5yr = c(0.5,0.55)
)
mortality_df |> gt::gt()

group	mort_6mo	mort_1yr	mort_2yr	mort_5yr
A	0.10	0.20	0.30	0.50
B	0.15	0.25	0.35	0.55

Now let’s make it long:

mortality_df_long <- mortality_df |> pivot_longer(
  cols = c(mort_6mo, mort_1yr, mort_2yr, mort_5yr),
  names_to = "time",
  values_to = "mortality",
  names_pattern = "mort_(.*)"
)

mortality_df_long |> gt::gt()

group	time	mortality
A	6mo	0.10
A	1yr	0.20
A	2yr	0.30
A	5yr	0.50
B	6mo	0.15
B	1yr	0.25
B	2yr	0.35
B	5yr	0.55

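To go the other way, we can widen it back. Both functions let you control how column names map to values: names_pattern above stripped the mort_ prefix, and names_glue here builds each new column name from the time value and the name of the value column.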

mortality_df_long |> pivot_wider(
  names_from = time,
  names_glue = "{time}_{.value}",
  values_from = mortality
)  |> gt::gt()

group	6mo_mortality	1yr_mortality	2yr_mortality	5yr_mortality
A	0.10	0.20	0.30	0.50
B	0.15	0.25	0.35	0.55

`separate_wider_delim()`

A simple way of splitting a column into multiple columns based on a delimiter:

data.frame(
  x = c("A_1", "B_2", "C_1", "D_2", "E_2")
) |> gt::gt()

x
A_1
B_2
C_1
D_2
E_2

data.frame(
  x = c("A_1", "B_2", "C_1", "D_2", "E_2")
) |> separate_wider_delim(
  x, delim = "_",
  names = c("letter", "number")
) |> gt::gt()

letter	number
A	1
B	2
C	1
D	2
E	2

Other tidyverse packages (quick tour)

`readr` (import text data)

Purpose: fast, friendly file import. Useful functions: read_csv(), read_tsv(), write_csv().
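A short round trip gives a feel for these functions. This is a sketch on made-up data written to a temporary file; the `patients` tibble and the `path` variable are illustrative names only.

```r
library(readr)
library(tibble)

# Hypothetical toy data and a temporary file path
patients <- tibble(id = 1:3, sex = c("F", "M", "F"), weight = c(62.5, 85, 70))
path <- tempfile(fileext = ".csv")

write_csv(patients, path)

# col_types states the expected column types instead of relying on guessing
patients_back <- read_csv(
  path,
  col_types = cols(
    id     = col_integer(),
    sex    = col_character(),
    weight = col_double()
  )
)

patients_back
```

Spelling out col_types is optional, but it makes the import fail loudly if a file does not have the columns you expect.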

Tip

Other useful packages are {readxl} for Excel files, {tidyxl} for non-tabular Excel files, and {googlesheets4} for Google Sheets.

`tibble` (modern data frame)

Tibbles are a special type of data.frame with a few benefits: cleaner printing and safer column handling. In addition, tibble() builds columns sequentially, so you can refer to columns created earlier in the same tibble() call.

tibble(
  x = 1:3,
  y = x*3
) |> gt::gt()

x	y
1	3
2	6
3	9

{tibble} also provides add_row(), which appends rows; any column you do not supply is filled with NA:

df |> 
  head(5) |>
  add_row(id = -1, date = ymd("2024-05-05"), sex = "M") |> 
  gt::gt()

id	date	age	sex	weight	height	note
1	2024-10-10	49	F	59	174	new-patient
2	2024-12-01	43	M	87	175	urgent
3	2024-04-10	41	M	68	174	new-patient
4	2024-04-20	34	F	84	167	new-patient
5	2024-05-12	42	F	74	170	follow-up
-1	2024-05-05	NA	M	NA	NA	NA

`purrr` (iterate without explicit loops)

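Purpose: apply a function to each element of a list or vector and collect the results, without writing an explicit loop (see the previous lesson on Loops and parallelisation). Useful functions: map(), map_dbl(), map2(). The two snippets below compute the mean weight by sex, first with {dplyr} and then with {purrr}.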

df |> 
  select(weight, sex) |>
  group_by(sex) |>
  summarise(mean_weight = mean(weight)) |>
  gt::gt()

sex	mean_weight
F	74.91634
M	75.69136

split(df$weight, df$sex) |> 
  purrr::map_dfr(mean) |> gt::gt()

F	M
74.91634	75.69136
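The typed variants are often a simpler fit when each call returns a single number. The sketch below uses map_dbl() and map2_dbl() on made-up data; `weights` and `reference` are hypothetical. As far as I know, map_dfr() is marked as superseded since purrr 1.0.0, so a typed map such as map_dbl() may be the safer habit for this kind of task.

```r
library(purrr)

# Hypothetical toy data: weights (kg) split by sex
weights <- list(F = c(62, 70, 66), M = c(85, 78))

# map_dbl() returns a named numeric vector, one value per list element
mean_weights <- map_dbl(weights, mean)

# map2_dbl() walks two inputs in parallel; here each mean minus a made-up reference value
reference <- c(F = 65, M = 80)
diff_from_ref <- map2_dbl(mean_weights, reference, \(m, ref) m - ref)

mean_weights
diff_from_ref
```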

`stringr` (string handling)

Purpose: consistent string tools. Useful functions: str_detect(), str_replace(), str_extract().

c("word", "hyphenated-word", "unhyphenated word") |> str_detect("-")

[1] FALSE  TRUE FALSE

df |> head(5) |> gt::gt()

id	date	age	sex	weight	height	note
1	2024-10-10	49	F	59	174	new-patient
2	2024-12-01	43	M	87	175	urgent
3	2024-04-10	41	M	68	174	new-patient
4	2024-04-20	34	F	84	167	new-patient
5	2024-05-12	42	F	74	170	follow-up

df |> 
  filter(
    str_detect(note, "-")
  ) |> 
  mutate(
    note_clean = str_replace(note, "-", " ")
  ) |> 
  head(5) |> gt::gt()

id	date	age	sex	weight	height	note	note_clean
1	2024-10-10	49	F	59	174	new-patient	new patient
3	2024-04-10	41	M	68	174	new-patient	new patient
4	2024-04-20	34	F	84	167	new-patient	new patient
5	2024-05-12	42	F	74	170	follow-up	follow up
9	2024-03-30	52	F	74	159	follow-up	follow up
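str_extract() is listed above but not shown, so here is a small sketch on the three note values. It also illustrates a detail that is easy to miss: str_replace() changes only the first match, while str_replace_all() changes every match. The `notes` vector and the "a-b-c" string are illustrative.

```r
library(stringr)

notes <- c("new-patient", "follow-up", "urgent")

# str_extract() returns the first match, or NA where the pattern does not match;
# the pattern takes the lowercase letters that sit directly before a hyphen
first_part <- str_extract(notes, "^[a-z]+(?=-)")
first_part

# Only the first hyphen is replaced
first_only <- str_replace("a-b-c", "-", " ")
# Every hyphen is replaced
all_matches <- str_replace_all("a-b-c", "-", " ")

first_only
all_matches
```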

`forcats` (factor handling)

Purpose: robust factor workflows. Useful functions: fct_relevel(), fct_reorder(), fct_lump().

df$note |> levels()

[1] "follow-up"   "new-patient" "urgent"

z <- df |> mutate(
  note = fct_relevel(note, "follow-up", after = Inf)  # after = Inf moves the level to the end
)
z$note |> levels()

[1] "new-patient" "urgent"      "follow-up"
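The other two helpers can be sketched on made-up data. Here `note` and `wait_days` are hypothetical vectors, and fct_lump_n() stands in for fct_lump(), on the assumption (from the forcats documentation as I recall it) that the more specific fct_lump_*() variants are now preferred.

```r
library(forcats)

# Hypothetical toy data
note <- factor(c("urgent", "follow-up", "new-patient", "follow-up", "urgent", "urgent"))
wait_days <- c(1, 30, 14, 28, 2, 3)

# fct_reorder(): order the levels by a summary (the median by default) of another variable
by_wait <- fct_reorder(note, wait_days)
levels(by_wait)

# fct_lump_n(): keep the n most frequent levels and collapse the rest into "Other"
lumped <- fct_lump_n(note, n = 1)
table(lumped)
```

Reordering like this is mostly useful before plotting or tabulating, where level order controls the display order.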

`lubridate` (dates and times)

Purpose: parse and manipulate dates. Useful functions: ymd(), wday(), floor_date().

df |>
  select(id, date) |>
  mutate(
    weekday = wday(date, label = TRUE),
    prior_monday = floor_date(date, "week", week_start = 1)
  ) |> 
  head(5) |> gt::gt()

id	date	weekday	prior_monday
1	2024-10-10	Thu	2024-10-07
2	2024-12-01	Sun	2024-11-25
3	2024-04-10	Wed	2024-04-08
4	2024-04-20	Sat	2024-04-15
5	2024-05-12	Sun	2024-05-06

`ggplot2`: a future lesson topic

library(wesanderson)
df |>
  ggplot(aes(x = date, y = height, color = sex, fill = sex)) +
  geom_point(size = 1.5, alpha = 0.5, shape = 20) +
  geom_smooth(method = "loess", formula = y~x, linewidth = 1, alpha = 0.5) +
  labs(
    title = "A plot of height over time",
    subtitle = "The grammar of graphics—building plots by adding layers",
    x = "Date",
    y = "Height (cm)",
    color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "top",
    axis.title.x = element_text(margin = margin(t = 20)),
    axis.title.y = element_text(margin = margin(r = 20))
  ) +
  scale_color_manual(values = wes_palette("Royal1"), labels = c("F" = "Female", "M" = "Male")) +
  scale_fill_manual(values = wes_palette("Royal1"), labels = c("F" = "Female", "M" = "Male"), name = NULL)
