Loops and parallelisation

library(dplyr)
library(purrr)
set.seed(1234)

Loops are one of the most fundamental tools in programming. They allow you to repeat a block of code — either a fixed number of times, or until some condition is met.

Some data to work on

patients <- tibble(
  id       = 1:8,
  age      = c(45, 67, 53, 71, 38, 60, 55, 49),
  systolic = c(120, 145, 132, 158, 112, 140, 128, 119)
)
patients |> gt::gt(id = "patients")

id	age	systolic
1	45	120
2	67	145
3	53	132
4	71	158
5	38	112
6	60	140
7	55	128
8	49	119

`for` loops

A for loop runs a block of code once for every element of a vector, so the number of iterations is fixed before the loop starts:

for (i in 1:5) {
  print(
    paste0(
      "Age of row ", i, ": ", 
      patients$age[i]
    )
  )
}

[1] "Age of row 1: 45"
[1] "Age of row 2: 67"
[1] "Age of row 3: 53"
[1] "Age of row 4: 71"
[1] "Age of row 5: 38"

You can loop over any vector — including character vectors:

columns <- c("age", "systolic")
for (col in columns) {
  print(paste0(
    "Mean ", col, ": ",
    mean(patients[[col]]))
  )
}

[1] "Mean age: 54.75"
[1] "Mean systolic: 131.75"
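One pitfall when looping by index: 1:length(x) counts backwards when x is empty, so the body runs twice instead of not at all. The sketch below uses a hypothetical empty vector, `empty`, to show why seq_along() is usually the safer way to build the index:

```r
empty <- character(0)   # hypothetical empty input

# 1:length(empty) is 1:0, i.e. c(1, 0): the body runs twice
runs_colon <- 0
for (i in 1:length(empty)) runs_colon <- runs_colon + 1

# seq_along(empty) is integer(0): the body is skipped entirely
runs_seq <- 0
for (i in seq_along(empty)) runs_seq <- runs_seq + 1

c(colon = runs_colon, seq_along = runs_seq)
```

The same reasoning applies to seq_len(nrow(data)) versus 1:nrow(data) when a data frame might have zero rows.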

A common pattern is to build up a result vector as you go:

ages <- c()

for (i in seq_len(nrow(patients))) {
  ages <- c(ages, patients$age[i])  # append the age from row i
}

ages

[1] 45 67 53 71 38 60 55 49
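Appending with c() copies the whole vector on every iteration, which becomes slow in long loops. If the number of iterations is known up front, one alternative is to preallocate the result and fill it by position. A minimal sketch, with the ages copied from the table so the block runs on its own (`over_60` is a name made up for this example):

```r
age <- c(45, 67, 53, 71, 38, 60, 55, 49)  # copied from the patients table

n <- length(age)
over_60 <- logical(n)            # preallocate a logical vector of length n

for (i in seq_len(n)) {
  over_60[i] <- age[i] >= 60     # fill slot i instead of appending
}

over_60
```

For other result types, numeric(n), character(n), or vector("list", n) play the same role.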

This might be useful if you’re bootstrapping, for instance:

results <- c()

mean_age <- function(df) mean(df$age)

for (i in 1:10000) {
  # resample the rows of `patients` with replacement
  dataset_i <- slice_sample(patients, n = nrow(patients), replace = TRUE)
  result_i <- mean_age(dataset_i) # or any prespecified analysis function
  results <- c(results, result_i)
}
conf.int <- quantile(results, probs = c(0.025, 0.975))

paste0("Mean age: ", mean(patients$age))
conf.int
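Because the analysis step can be any function, the bootstrap loop can be wrapped in a small reusable helper. The sketch below is one possible way to do it in base R. The name `boot_ci`, its arguments, and the default of 2000 resamples are assumptions made for illustration, not part of any package:

```r
# Hypothetical helper: percentile bootstrap interval for any statistic of a vector
boot_ci <- function(x, stat = mean, n_boot = 2000, level = 0.95) {
  estimates <- numeric(n_boot)                 # preallocated result vector
  for (b in seq_len(n_boot)) {
    resample <- sample(x, size = length(x), replace = TRUE)
    estimates[b] <- stat(resample)
  }
  alpha <- 1 - level
  quantile(estimates, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(1234)
age <- c(45, 67, 53, 71, 38, 60, 55, 49)       # ages from the patients table

ci_mean   <- boot_ci(age)                      # interval for the mean
ci_median <- boot_ci(age, stat = median)       # same loop, different statistic
ci_mean
ci_median
```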

`while` loops

A while loop keeps running as long as a condition (in this example x < 1000) is TRUE:

x <- 2
while (x < 1000) {
  x <- x^2
  print(x)
}

[1] 4
[1] 16
[1] 256
[1] 65536

Warning

Be careful: if the condition never becomes FALSE, the loop runs forever.
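One common safeguard is to count iterations and stop at a cap, so a condition that never turns FALSE cannot hang the session. A minimal sketch, where the cap of 100 is an arbitrary value chosen for illustration:

```r
x <- 2
iterations <- 0
max_iterations <- 100   # arbitrary safety cap (an assumption; tune as needed)

while (x < 1000 && iterations < max_iterations) {
  x <- x^2
  iterations <- iterations + 1
}

if (iterations == max_iterations) {
  warning("Stopped after reaching max_iterations")
}

c(x = x, iterations = iterations)
```

Here the loop finishes on its own after four iterations, so the cap never triggers.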

The `apply()` family

for loops are clear but can be verbose. The apply() family lets you apply a function to every element (or row/column) of an object in one go.

`sapply()` and `lapply()`

Both apply a function to each element of a vector or list. The difference is the output format.

lapply() always returns a list

df <- patients |> select(-id)

lapply(X = df, FUN = mean)

$age
[1] 54.75

$systolic
[1] 131.75

sapply() tries to simplify to a vector or matrix

sapply(X = df, FUN = mean)

     age systolic 
   54.75   131.75

Here it gives the same result as calling lapply() and then unlist():

lapply(X = df, FUN = mean) |> unlist()

     age systolic 
   54.75   131.75

The apply() family also works easily with custom functions:

squared <- function(x) {
  x^2
}
sapply(X = c(1:5), FUN = squared)

[1]  1  4  9 16 25
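Because sapply() only tries to simplify, its return type depends on the data it receives. Base R's vapply() takes a template for each result and raises an error if a result does not match it. The sketch below uses small made-up lists (`values` and the unnamed uneven list) to show the difference:

```r
values <- list(a = 1:3, b = 4:6)

# sapply() simplifies to a named numeric vector here ...
simplified <- sapply(values, mean)

# ... but falls back to a list when the results have different lengths
uneven <- sapply(list(a = 1:3, b = 1:5), function(v) v[v > 2])

# vapply() checks every result against the template: one numeric value
checked <- vapply(values, mean, FUN.VALUE = numeric(1))

simplified
class(uneven)
checked
```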

`apply()` for matrices/data frames

apply() works on rows (MARGIN = 1) or columns (MARGIN = 2):

mat <- matrix(1:12, nrow = 3)
mat

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

print("Row sums:")

[1] "Row sums:"

apply(X = mat, MARGIN = 1, FUN = sum)

[1] 22 26 30

print("Column sums:")

[1] "Column sums:"

apply(X = mat, MARGIN = 2, FUN = sum)

[1]  6 15 24 33
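For sums and means specifically, base R has dedicated functions (rowSums(), colSums(), rowMeans(), colMeans()) that give the same values without an explicit apply() call. apply() is still the general tool for functions that lack such a shortcut. A short sketch on the same matrix:

```r
mat <- matrix(1:12, nrow = 3)

row_totals <- rowSums(mat)    # same values as apply(mat, 1, sum)
col_totals <- colSums(mat)    # same values as apply(mat, 2, sum)

# No shortcut exists for range(), so apply() is used;
# the result has one column of (min, max) per row of mat
row_ranges <- apply(mat, MARGIN = 1, FUN = range)

row_totals
col_totals
row_ranges
```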

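The problem with the apply() family is that argument names and output types vary from function to function (sapply(), for example, may return a vector, a matrix, or a list depending on its input), which makes them hard to remember and hard to rely on.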

`purrr::map()`

The {purrr} package gives you a consistent, predictable alternative. map() always returns a list; suffix variants guarantee the output type:

Function	Output
`map()`	list
`map_dbl()`	numeric vector
`map_chr()`	character vector
`map_lgl()`	logical vector
`map_df()`	data frame

map() always returns a list

map(df, mean)

$age
[1] 54.75

$systolic
[1] 131.75

Want a numeric vector? Be explicit:

map_dbl(df, mean)

     age systolic 
   54.75   131.75

map() works naturally with pipes

df |> map(mean)

$age
[1] 54.75

$systolic
[1] 131.75
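The mapped function does not have to be a named one. The sketch below passes anonymous functions written with the \(x) shorthand (available from R 4.1) to the typed variants. The data frame is rebuilt here so the block runs on its own, and the summaries computed are just illustrative choices:

```r
library(purrr)

df <- data.frame(
  age      = c(45, 67, 53, 71, 38, 60, 55, 49),
  systolic = c(120, 145, 132, 158, 112, 140, 128, 119)
)

# One number per column: the spread between largest and smallest value
spread <- map_dbl(df, \(col) max(col) - min(col))

# One string per column: "min-max"
labels <- map_chr(df, \(col) paste0(min(col), "-", max(col)))

spread
labels
```

If the function returned something other than a single number, map_dbl() would stop with an error rather than silently change the output type.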

`map2()` — mapping over two inputs at once

This function requires both inputs to be the same length (or one of them to be length 1, in which case it is recycled). Element i of .x is paired with element i of .y; that is, you don’t get every possible combination of .x and .y.

names <- c("Alice", "Bob", "Carol")
scores <- c(88, 95, 72)

myPaste <- function(nm, scr) paste0(nm, " scored ", scr)

map2_chr(.x = names, .y = scores, .f = myPaste)

[1] "Alice scored 88" "Bob scored 95"   "Carol scored 72"

For all combinations, try something like:

with(
  expand.grid(names, scores),
  paste0(Var1, " scored ", Var2)
)

[1] "Alice scored 88" "Bob scored 88"   "Carol scored 88" "Alice scored 95"
[5] "Bob scored 95"   "Carol scored 95" "Alice scored 72" "Bob scored 72"  
[9] "Carol scored 72"
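map2() stops at two inputs. For three or more, {purrr} provides pmap() and its typed variants, which take a list of equal-length inputs and pair them element by element in the same way. A minimal sketch, where the third input `unit` is a hypothetical addition for illustration:

```r
library(purrr)

people <- list(
  nm   = c("Alice", "Bob", "Carol"),
  scr  = c(88, 95, 72),
  unit = c("points", "points", "points")   # hypothetical third input
)

# The list names are matched to the function's argument names
sentences <- pmap_chr(
  people,
  \(nm, scr, unit) paste0(nm, " scored ", scr, " ", unit)
)

sentences
```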

`Vectorize()`

Last session we wrote functions. Most built-in R functions are already vectorised — they work on a whole vector at once:

sqrt(c(4, 9, 16))

[1] 2 3 4

But if you write your own function with an if / else inside, it usually expects a single value:

bp_category <- function(systolic) {
  if (systolic >= 130) "High" else "Normal"
}

c(120, 145, 132) |> bp_category() # doesn't work on a vector

Error in `if (systolic >= 130) ...`:
! the condition has length > 1

Vectorize() wraps your function so it loops automatically:

bp_category <- bp_category |> Vectorize()

c(120, 145, 132) |> bp_category()

[1] "Normal" "High"   "High"

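This is equivalent to writing a for loop, but in one line. Vectorize() is a convenience wrapper around mapply(), so it saves typing rather than computing time. (For the tidyverse crowd, dplyr::case_when() is often a cleaner solution inside mutate().)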
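For a simple threshold like this one, the function can also be written in vectorised form from the start. The sketch below shows base R's ifelse() and dplyr::case_when() on the same input. The extra 140 cut-off in the second call is a hypothetical threshold added only to show more than two categories:

```r
library(dplyr)

systolic <- c(120, 145, 132)

# ifelse() tests every element at once
two_levels <- ifelse(systolic >= 130, "High", "Normal")

# case_when() checks the conditions in order and scales to more categories
# (the 140 cut-off is a hypothetical extra threshold)
three_levels <- case_when(
  systolic >= 140 ~ "Very high",
  systolic >= 130 ~ "High",
  TRUE            ~ "Normal"
)

two_levels
three_levels
```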

Parallel loops with `foreach` and `%dopar%`

So far, every loop has run sequentially, one iteration at a time. For computationally expensive work (e.g., bootstrapping, simulations), you can run iterations in parallel across multiple CPU cores, using {doParallel} as the backend and the for-like syntax of foreach().

library(foreach)
library(doParallel)

n_workers <- max(1, parallel::detectCores() - 1, na.rm = TRUE)  # leave one core free
cl <- makeCluster(n_workers)   # open the workers
registerDoParallel(cl)

mean_age <- function(df) mean(df$age)

results <- foreach(i = 1:1000, 
                   .packages = "dplyr",
                   .combine = c) %dopar% {
  dataset_i <- slice_sample(df, n = nrow(df), replace = TRUE) 
  result_i <- mean_age(dataset_i)
  return(result_i)
}

stopCluster(cl)                # shut the workers down when done

results |> quantile(probs = c(0.025,0.975))

    2.5%    97.5% 
47.75000 61.87812

.combine = c means the per-iteration results are concatenated with c() into a single vector. Without .combine, foreach() returns a list with one element per iteration.
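The effect of .combine is easiest to see on a tiny loop. The sketch below uses the sequential %do% operator so that no cluster is needed. The squared values are an arbitrary stand-in for a real computation:

```r
library(foreach)

# Default: a list with one element per iteration
as_list <- foreach(i = 1:3) %do% {
  i^2
}

# .combine = c: results concatenated into a vector
as_vector <- foreach(i = 1:3, .combine = c) %do% {
  i^2
}

# .combine = rbind: one row per iteration, useful when each
# iteration returns several values
as_matrix <- foreach(i = 1:3, .combine = rbind) %do% {
  c(i = i, square = i^2)
}

str(as_list)
as_vector
as_matrix
```

Swapping %do% for %dopar% (with a registered backend) runs the same loops in parallel.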

Tip

For simpler parallelism inside purrr, the {furrr} package lets you swap map() for future_map() with almost no code change.
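As a rough sketch of what that swap can look like, the bootstrap below uses future_map_dbl() on two background R sessions. The worker count, the number of resamples, and the seed are arbitrary choices for illustration, and the ages are copied in so the block is self-contained:

```r
library(future)
library(furrr)

plan(multisession, workers = 2)   # 2 background R sessions (arbitrary choice)

age <- c(45, 67, 53, 71, 38, 60, 55, 49)   # ages from the patients table

boot_means <- future_map_dbl(
  1:200,
  \(i) mean(sample(age, size = length(age), replace = TRUE)),
  .options = furrr_options(seed = 1234)    # parallel-safe random numbers
)

plan(sequential)                  # switch back to ordinary sequential execution

quantile(boot_means, probs = c(0.025, 0.975))
```

Without the plan() call, future_map_dbl() runs sequentially, so the same code works on any machine.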

Honourable mention: Recursion

A function can call itself. This is called recursion, and it’s another way to repeat an operation. Classic example: factorials.

my_factorial <- function(n) {   # named so it does not mask base::factorial()
  if (n <= 1) return(1)
  else return(n * my_factorial(n - 1))
}
my_factorial(5)

[1] 120

Recursion is elegant for certain problems (tree traversal, divide-and-conquer algorithms) but can be slow and hard to debug in R. Loops or map() are almost always what you need.
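For comparison, the same calculation can be written with a loop or with a single vectorised call. `factorial_loop` is a name made up for this sketch:

```r
# Hypothetical helper: factorial via a loop instead of recursion
factorial_loop <- function(n) {
  result <- 1
  for (k in seq_len(n)) {   # seq_len(0) is empty, so factorial_loop(0) is 1
    result <- result * k
  }
  result
}

factorial_loop(5)     # 120
prod(seq_len(5))      # same value, vectorised
```

The loop version avoids the nested function calls that recursion builds up, which matters once n gets large.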
