Loops and parallelisation

library(dplyr)
library(purrr)
set.seed(1234)

Loops are one of the most fundamental tools in programming. They allow you to repeat a block of code — either a fixed number of times, or until some condition is met.

Some data to work on

patients <- tibble(
  id       = 1:8,
  age      = c(45, 67, 53, 71, 38, 60, 55, 49),
  systolic = c(120, 145, 132, 158, 112, 140, 128, 119)
)
patients |> gt::gt(id = "patients")
id age systolic
1 45 120
2 67 145
3 53 132
4 71 158
5 38 112
6 60 140
7 55 128
8 49 119

for loops

A for loop runs a block of code once for every element in a vector; i.e., it runs a fixed/prespecified number of times:

for (i in 1:5) {
  print(
    paste0(
      "Age of row ", i, ": ", 
      patients$age[i]
    )
  )
}
[1] "Age of row 1: 45"
[1] "Age of row 2: 67"
[1] "Age of row 3: 53"
[1] "Age of row 4: 71"
[1] "Age of row 5: 38"
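
One caveat with index-based loops, sketched here with a hypothetical empty vector called empty (not part of the patients data): 1:length(x) counts down from 1 to 0 when x is empty, so the loop body runs twice with indices that do not exist. seq_along() returns a zero-length sequence instead, and the loop is skipped.

```r
empty <- numeric(0)  # hypothetical empty input

# 1:length(empty) is 1:0, i.e. c(1, 0), so this loop runs twice
runs_colon <- 0
for (i in 1:length(empty)) {
  runs_colon <- runs_colon + 1
}

# seq_along(empty) has length zero, so this loop never runs
runs_seq <- 0
for (i in seq_along(empty)) {
  runs_seq <- runs_seq + 1
}

c(colon = runs_colon, seq_along = runs_seq)
```

For row indices of a data frame, seq_len(nrow(x)) is the equivalent safe form.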

You can loop over any vector — including character vectors:

columns <- c("age", "systolic")
for (col in columns) {
  print(paste0("Mean ", col, ": ", mean(patients[[col]])))
}
[1] "Mean age: 54.75"
[1] "Mean systolic: 131.75"

A common pattern is to build up a result vector as you go:

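high_bp_ages <- c()

for (i in seq_len(nrow(patients))) {
  # keep the age only when systolic pressure is 130 or above
  if (patients$systolic[i] >= 130) {
    high_bp_ages <- c(high_bp_ages, patients$age[i])
  }
}

high_bp_ages
[1] 67 53 71 60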

This pattern is useful for bootstrapping, for instance:

results <- c()

mean_age <- function(df) mean(df$age)

for (i in 1:10000) {
  # resample the rows of `patients` with replacement
  dataset_i <- slice_sample(patients, n = nrow(patients), replace = TRUE)
  result_i <- mean_age(dataset_i) # or any prespecified analysis function
  results <- c(results, result_i)
}
conf_int <- quantile(results, probs = c(0.025, 0.975))

paste0("Mean age: ", mean(patients$age))
conf_int
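
Growing a vector with c() allocates a new, longer vector on every iteration, which gets slow for many iterations. A minimal sketch of the usual alternative, preallocating the result and filling it by index, is shown below. It uses base R's sample() on a plain vector of the same ages rather than slice_sample(), and the names ages, n_boot and boot_means are made up for this illustration.

```r
set.seed(1234)

ages <- c(45, 67, 53, 71, 38, 60, 55, 49)  # same ages as in `patients`
n_boot <- 1000                             # illustrative number of resamples

boot_means <- numeric(n_boot)              # preallocate the full result vector

for (i in seq_len(n_boot)) {
  resample <- sample(ages, size = length(ages), replace = TRUE)
  boot_means[i] <- mean(resample)          # fill slot i instead of growing
}

quantile(boot_means, probs = c(0.025, 0.975))
```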

while loops

A while loop keeps running as long as a condition (in this example x < 1000) is TRUE:

x <- 2
while (x < 1000) {
  x <- x^2
  print(x)
}
[1] 4
[1] 16
[1] 256
[1] 65536

Warning

Be careful: if the condition never becomes FALSE, the loop runs forever.
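
One common safeguard, sketched here as a suggestion rather than a rule, is to count iterations and break out once a cap is reached. The cap of 100 and the names iterations and max_iterations are arbitrary choices for this example.

```r
x <- 2
iterations <- 0
max_iterations <- 100  # arbitrary safety cap

while (x < 1000) {
  x <- x^2
  iterations <- iterations + 1

  # stop even if the condition is still TRUE after too many rounds
  if (iterations >= max_iterations) {
    warning("Stopped after ", max_iterations, " iterations")
    break
  }
}

c(x = x, iterations = iterations)
```

Here the condition becomes FALSE after four iterations, so the cap is never hit; it only matters when the loop would otherwise not terminate.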

The apply() family

for loops are clear but can be verbose. The apply() family lets you apply a function to every element (or row/column) of an object in one go.

sapply() and lapply()

Both apply a function to each element of a vector or list. The difference is the output format.

lapply() always returns a list

df <- patients |> select(-id)

lapply(X = df, FUN = mean)
$age
[1] 54.75

$systolic
[1] 131.75

sapply() tries to simplify to a vector or matrix

sapply(X = df, FUN = mean)
     age systolic 
   54.75   131.75 

For this input, that is the same as calling lapply() and then unlist():

lapply(X = df, FUN = mean) |> unlist()
     age systolic 
   54.75   131.75 

The apply() family also works with custom functions:

squared <- function(x) {
  x^2
}
sapply(X = 1:5, FUN = squared)
[1]  1  4  9 16 25

apply() for matrices/data frames

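apply() works on the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix. A data frame is converted to a matrix first, so this works best when all columns have the same type: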

mat <- matrix(1:12, nrow = 3)
mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
print("Row sums:")
[1] "Row sums:"
apply(X = mat, MARGIN = 1, FUN = sum)
[1] 22 26 30
print("Column sums:")
[1] "Column sums:"
apply(X = mat, MARGIN = 2, FUN = sum)
[1]  6 15 24 33
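
For sums and means specifically, base R already provides dedicated functions, so apply() is not needed. A short sketch using the same matrix:

```r
mat <- matrix(1:12, nrow = 3)

rowSums(mat)   # same values as apply(mat, 1, sum)
colSums(mat)   # same values as apply(mat, 2, sum)
rowMeans(mat)  # same values as apply(mat, 1, mean)
```

apply() remains the general tool when the function is something other than a sum or a mean.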

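The problem with the apply() family is inconsistency: argument names differ between functions (X and FUN in sapply(), plus MARGIN in apply()), and sapply() may return a vector, a matrix, or a list depending on what the function returns. That makes them hard to remember and hard to rely on inside other code.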
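
To make the point about output types concrete, here is a small sketch with toy inputs (not the patients data). It also shows vapply(), the base R variant in which you state the expected type yourself.

```r
# sapply() picks the output type from whatever the function returns
as_vector <- sapply(1:3, function(i) i^2)         # numeric vector
as_matrix <- sapply(1:3, function(i) c(i, i^2))   # 2 x 3 matrix
as_list   <- sapply(integer(0), function(i) i^2)  # empty list

# vapply() makes the expected type explicit and errors if it is violated
checked <- vapply(1:3, function(i) i^2, FUN.VALUE = numeric(1))
empty   <- vapply(integer(0), function(i) i^2, FUN.VALUE = numeric(1))

str(list(as_vector, as_matrix, as_list, checked, empty))
```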

purrr::map()

The {purrr} package gives you a consistent, predictable alternative. map() always returns a list; suffix variants guarantee the output type:

Function     Output
map()        list
map_dbl()    numeric (double) vector
map_chr()    character vector
map_lgl()    logical vector
map_df()     data frame (superseded since purrr 1.0.0; prefer map() |> list_rbind())

map() always returns a list

map(df, mean)
$age
[1] 54.75

$systolic
[1] 131.75

Want a numeric vector? Be explicit:

map_dbl(df, mean)
     age systolic 
   54.75   131.75 
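
The sketch below illustrates two things: passing extra arguments through an anonymous function, and what "guarantee the output type" means in practice. The vitals data frame is hypothetical (it contains a missing value on purpose), and the \(x) shorthand assumes R 4.1 or later.

```r
library(purrr)

# hypothetical data with a missing value, not the patients table
vitals <- data.frame(
  age      = c(45, 67, NA),
  systolic = c(120, 145, 132)
)

# anonymous function that passes an extra argument to mean()
means <- map_dbl(vitals, \(x) mean(x, na.rm = TRUE))

# map_dbl() refuses a result that is not a single number
bad <- tryCatch(
  map_dbl(vitals, \(x) "not a number"),
  error = function(e) "error"
)

means
bad
```

With sapply(), the second call would quietly return a character vector; map_dbl() stops with an error instead.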

map() works naturally with pipes

df |> map(mean)
$age
[1] 54.75

$systolic
[1] 131.75

map2() — mapping over two inputs at once

map2() pairs its inputs element by element: element i of .x is matched with element i of .y, so you do not get every possible combination of .x and .y. Both inputs must have the same length (an input of length 1 is recycled).

names <- c("Alice", "Bob", "Carol")
scores <- c(88, 95, 72)

myPaste <- function(nm, scr) paste0(nm, " scored ", scr)

map2_chr(.x = names, .y = scores, .f = myPaste)
[1] "Alice scored 88" "Bob scored 95"   "Carol scored 72"
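
A short sketch of the length rule, using a hypothetical students vector: a length-1 input is recycled, while inputs of lengths 3 and 2 cannot be paired and produce an error.

```r
library(purrr)

students <- c("Alice", "Bob", "Carol")  # hypothetical example data

# a length-1 .y is recycled to match .x
recycled <- map2_chr(students, "passed", \(nm, status) paste(nm, status))

# lengths 3 and 2 cannot be paired, so map2_chr() signals an error
mismatch <- tryCatch(
  map2_chr(students, c(88, 95), \(nm, scr) paste(nm, scr)),
  error = function(e) "error"
)

recycled
mismatch
```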

For all combinations, try something like:

with(
  expand.grid(names, scores),
  paste0(Var1, " scored ", Var2)
)
[1] "Alice scored 88" "Bob scored 88"   "Carol scored 88" "Alice scored 95"
[5] "Bob scored 95"   "Carol scored 95" "Alice scored 72" "Bob scored 72"  
[9] "Carol scored 72"

Vectorize()

Last session we wrote functions. Most built-in R functions are already vectorised — they work on a whole vector at once:

sqrt(c(4, 9, 16))
[1] 2 3 4

But if you write your own function with an if / else inside, it usually expects a single value:

bp_category <- function(systolic) {
  if (systolic >= 130) "High" else "Normal"
}

c(120, 145, 132) |> bp_category() # doesn't work on a vector
Error in `if (systolic >= 130) ...`:
! the condition has length > 1

Vectorize() wraps your function so it loops automatically:

bp_category <- bp_category |> Vectorize()

c(120, 145, 132) |> bp_category()
[1] "Normal" "High"   "High"  

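This is equivalent to writing a for loop, but in one line. Vectorize() is a wrapper around mapply(), so it still calls your function once per element: it is convenient, not faster. (For the tidyverse crowd, dplyr::case_when() is often a cleaner solution inside mutate().)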
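
For a simple two-way split like this one, base R's ifelse() is already vectorised and avoids the wrapper altogether. A minimal sketch, with the function name bp_category_vec chosen only to avoid clashing with bp_category():

```r
# hypothetical name, to avoid clashing with bp_category()
bp_category_vec <- function(systolic) {
  # ifelse() tests every element and picks "High" or "Normal" for each
  ifelse(systolic >= 130, "High", "Normal")
}

bp_category_vec(c(120, 145, 132))
```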

Parallel loops with foreach and %dopar%

So far, all loops have run sequentially, one step at a time. For computationally expensive work (e.g., bootstrapping, simulations), you can run iterations in parallel across multiple CPU cores, using {foreach} for the for-like syntax and {doParallel} as the parallel backend.

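library(foreach)
library(doParallel)

# leave one core free; fall back to 1 worker if detection fails
n_workers <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
cl <- makeCluster(n_workers)
registerDoParallel(cl)

mean_age <- function(df) mean(df$age)

# set.seed() in the main session does not seed the workers, so these
# results will differ between runs
results <- foreach(i = 1:1000,
                   .packages = "dplyr",
                   .combine = c) %dopar% {
  dataset_i <- slice_sample(df, n = nrow(df), replace = TRUE)
  mean_age(dataset_i)  # the value of the last expression is returned
}

stopCluster(cl)  # always shut the workers down when you are done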

results |> quantile(probs = c(0.025,0.975))
    2.5%    97.5% 
47.75000 61.87812 

.combine = c means that the results of the individual iterations are concatenated into a single vector with c(). Without .combine, foreach() returns a list with one element per iteration.
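
The effect of .combine is easiest to see on a toy loop. The sketch below uses the sequential %do% operator, so no cluster is needed; the same argument works with %dopar%.

```r
library(foreach)

# default: one list element per iteration
as_list <- foreach(i = 1:3) %do% {
  i^2
}

# .combine = c: results concatenated into a vector
as_vector <- foreach(i = 1:3, .combine = c) %do% {
  i^2
}

# .combine = rbind: results stacked as the rows of a matrix
as_matrix <- foreach(i = 1:3, .combine = rbind) %do% {
  c(i, i^2)
}

str(as_list)
as_vector
dim(as_matrix)
```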

Tip

For simpler parallelism inside purrr, the {furrr} package lets you swap map() for future_map() with almost no code change, once you have chosen a parallel backend with future::plan().
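
A minimal sketch of what that could look like for the bootstrap, assuming {furrr} and {future} are installed. The two workers, the 200 resamples and the names ages and boot_means are arbitrary choices for this illustration, and base R's sample() stands in for slice_sample().

```r
library(furrr)

# start two background R sessions (an arbitrary number for this sketch)
future::plan(future::multisession, workers = 2)

ages <- c(45, 67, 53, 71, 38, 60, 55, 49)  # same ages as in `patients`

boot_means <- future_map_dbl(
  1:200,
  \(i) mean(sample(ages, replace = TRUE)),
  .options = furrr_options(seed = TRUE)  # parallel-safe random numbers
)

future::plan(future::sequential)  # return to sequential processing

quantile(boot_means, probs = c(0.025, 0.975))
```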

Honourable mention: Recursion

A function can call itself. This is called recursion, and it’s another way to repeat an operation. Classic example: factorials.

# named my_factorial() so it does not mask base R's factorial()
my_factorial <- function(n) {
  if (n <= 1) return(1)
  n * my_factorial(n - 1)
}
my_factorial(5)
[1] 120

Recursion is elegant for certain problems (tree traversal, divide-and-conquer algorithms) but can be slow and hard to debug in R: every call adds a frame to the call stack, and very deep recursion stops with an error. Loops or map() are almost always what you need.
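
For comparison, here is the same calculation written as a plain loop. This is only a sketch; the name factorial_loop is hypothetical and chosen so that it does not mask base R's factorial().

```r
# hypothetical name, to avoid masking base R's factorial()
factorial_loop <- function(n) {
  result <- 1
  # seq_len(0) is empty, so factorial_loop(0) correctly returns 1
  for (k in seq_len(n)) {
    result <- result * k
  }
  result
}

factorial_loop(5)
```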

Further reading

R for Data Science — Iteration

purrr documentation

foreach + doParallel vignette