library(dplyr)
library(purrr)
set.seed(1234)

Loops and parallelisation
Loops are one of the most fundamental tools in programming. They allow you to repeat a block of code — either a fixed number of times, or until some condition is met.
Some data to work on
patients <- tibble(
  id = 1:8,
  age = c(45, 67, 53, 71, 38, 60, 55, 49),
  systolic = c(120, 145, 132, 158, 112, 140, 128, 119)
)
patients |> gt::gt(id = "patients")

| id | age | systolic |
|---|---|---|
| 1 | 45 | 120 |
| 2 | 67 | 145 |
| 3 | 53 | 132 |
| 4 | 71 | 158 |
| 5 | 38 | 112 |
| 6 | 60 | 140 |
| 7 | 55 | 128 |
| 8 | 49 | 119 |
for loops
A for loop runs a block of code once for every element in a vector; i.e., it runs a fixed/prespecified number of times:
for (i in 1:5) {
  print(
    paste0(
      "Age of row ", i, ": ",
      patients$age[i]
    )
  )
}
[1] "Age of row 1: 45"
[1] "Age of row 2: 67"
[1] "Age of row 3: 53"
[1] "Age of row 4: 71"
[1] "Age of row 5: 38"
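One thing to watch with index-based loops: `1:n` counts backwards when `n` is 0, so a loop written as `1:length(x)` runs twice on an empty vector instead of not at all. The sketch below counts iterations for both spellings; the empty vector `empty` is made up for illustration.

```r
# Hypothetical empty input, e.g. the result of a filter that matched nothing
empty <- numeric(0)

# 1:length(empty) is 1:0, i.e. c(1, 0), so the body runs twice
n_colon <- 0
for (i in 1:length(empty)) {
  n_colon <- n_colon + 1
}

# seq_along(empty) is integer(0), so the body never runs
n_seq <- 0
for (i in seq_along(empty)) {
  n_seq <- n_seq + 1
}

print(c(colon = n_colon, seq_along = n_seq))
```

For loops over row numbers, `seq_len(nrow(x))` gives the same protection.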
You can loop over any vector — including character vectors:
columns <- c("age", "systolic")
for (col in columns) {
  print(paste0(
    "Mean ", col, ": ",
    mean(patients[[col]])
  ))
}
[1] "Mean age: 54.75"
[1] "Mean systolic: 131.75"
A common pattern is to build up a result vector as you go:
ages <- c() # start with an empty vector
for (i in seq_len(nrow(patients))) {
  ages <- c(ages, patients$age[i]) # append the age of row i
}
ages
[1] 45 67 53 71 38 60 55 49
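Growing a vector with c() copies it on every iteration, which gets slow in long loops. A common alternative, sketched here on a small stand-in data frame (`toy` is made up for illustration), is to create the result vector at its final length first and then fill it in by index.

```r
# Stand-in data; any data frame with an age column would do
toy <- data.frame(age = c(45, 67, 53, 71))

# Preallocate a numeric vector with one slot per row
ages <- numeric(nrow(toy))

for (i in seq_len(nrow(toy))) {
  ages[i] <- toy$age[i] # fill slot i instead of growing the vector
}

ages
```

The result is the same either way; only the bookkeeping differs. Collecting one result per iteration is the core of the pattern.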
This might be useful if you’re bootstrapping, for instance:
results <- c()
mean_age <- function(df) mean(df$age)
for (i in 1:10000) {
  # resample the rows of patients with replacement
  dataset_i <- slice_sample(patients, n = nrow(patients), replace = TRUE)
  result_i <- mean_age(dataset_i) # or any prespecified analysis function
  results <- c(results, result_i)
}
conf.int <- quantile(results, probs = c(0.025, 0.975))
paste0("Mean age: ", mean(patients$age))
conf.int

while loops
A while loop keeps running as long as a condition (in this example x < 1000) is TRUE:
x <- 2
while (x < 1000) {
  x <- x^2
  print(x)
}
[1] 4
[1] 16
[1] 256
[1] 65536
Be careful: if the condition never becomes FALSE, the loop runs forever.
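One way to guard against that is to count iterations and break out once a limit is reached. The sketch below deliberately starts from a value for which the condition never changes; the limit `max_iter` is a made-up safety setting, not something R provides.

```r
x <- 1            # 1^2 is still 1, so x < 1000 stays TRUE forever
iter <- 0
max_iter <- 100   # hypothetical safety limit; pick one that suits the problem

while (x < 1000) {
  x <- x^2
  iter <- iter + 1
  if (iter >= max_iter) {
    message("Giving up after ", max_iter, " iterations")
    break # leave the loop even though the condition is still TRUE
  }
}

c(x = x, iterations = iter)
```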
The apply() family
for loops are clear but can be verbose. The apply() family lets you apply a function to every element (or row/column) of an object in one go.
sapply() and lapply()
Both apply a function to each element of a vector or list. The difference is the output format.
lapply() always returns a list
df <- patients |> select(-id)
lapply(X = df, FUN = mean)
$age
[1] 54.75
$systolic
[1] 131.75
sapply() tries to simplify to a vector or matrix
sapply(X = df, FUN = mean)
     age systolic 
   54.75   131.75 
This is the same as manually doing lapply + unlist:
lapply(X = df, FUN = mean) |> unlist()
     age systolic 
   54.75   131.75 
The apply() family also works easily with custom functions:
squared <- function(x) {
  x^2
}
sapply(X = c(1:5), FUN = squared)
[1]  1  4  9 16 25
apply() for matrices/data frames
apply() works on rows (MARGIN = 1) or columns (MARGIN = 2):
mat <- matrix(1:12, nrow = 3)
mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
print("Row sums:")
[1] "Row sums:"
apply(X = mat, MARGIN = 1, FUN = sum)
[1] 22 26 30
print("Column sums:")
[1] "Column sums:"
apply(X = mat, MARGIN = 2, FUN = sum)
[1]  6 15 24 33
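The drawback of the apply() family is inconsistency: argument names and order differ between functions (lapply(X, FUN), apply(X, MARGIN, FUN), mapply(FUN, ...)), and sapply() may return a vector, a matrix, or a list depending on what the function returns. That makes them hard to remember and easy to trip over.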
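To make the output-type issue concrete, here is a small sketch in which the same sapply() call shape produces three different kinds of object. It also shows vapply(), a base R relative not covered above, which asks you to state the expected type and length per element and errors if a result does not match.

```r
# Same call shape, three different output types
squares   <- sapply(1:3, function(i) i^2)        # every result has length 1
pairs_out <- sapply(1:3, function(i) c(i, i^2))  # every result has length 2
sequences <- sapply(1:3, function(i) seq_len(i)) # results differ in length

is.numeric(squares)  # simplified to a plain numeric vector
is.matrix(pairs_out) # simplified to a 2 x 3 matrix
is.list(sequences)   # left as a list, because nothing could be simplified

# vapply() checks each result against the template in FUN.VALUE
checked <- vapply(1:3, function(i) i^2, FUN.VALUE = numeric(1))
checked
```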
purrr::map()
The {purrr} package gives you a consistent, predictable alternative. map() always returns a list; suffix variants guarantee the output type:
| Function | Output |
|---|---|
| map() | list |
| map_dbl() | numeric vector |
| map_chr() | character vector |
| map_lgl() | logical vector |
| map_df() | data frame |
map() always returns a list
map(df, mean)
$age
[1] 54.75
$systolic
[1] 131.75
Want a numeric vector? Be explicit:
map_dbl(df, mean)
     age systolic 
   54.75   131.75 
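The typed variants are strict about what the function returns, which is easiest to see when it goes wrong. The following sketch uses a small stand-in data frame (`toy`, made up here with the same column names as above) and catches the error that map_dbl() raises when the function returns strings.

```r
library(purrr)

# Stand-in for the df used above (assumed columns: age, systolic)
toy <- data.frame(
  age = c(45, 67, 53),
  systolic = c(120, 145, 132)
)

# One number per column, so map_dbl() succeeds
ranges <- map_dbl(toy, function(col) max(col) - min(col))

# One string per column, so map_chr() is the matching variant
types <- map_chr(toy, class)

# Asking for doubles but returning strings is an error, not a silent conversion
wrong <- tryCatch(
  map_dbl(toy, class),
  error = function(e) "map_dbl() refused a character result"
)

ranges
types
wrong
```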
map() works naturally with pipes
df |> map(mean)
$age
[1] 54.75
$systolic
[1] 131.75
map2() — mapping over two inputs at once
map2() requires both inputs to have the same length (or one of them to have length 1). Element i of .x is paired with element i of .y; you do not get every possible combination of .x and .y.
names <- c("Alice", "Bob", "Carol")
scores <- c(88, 95, 72)
myPaste <- function(nm, scr) paste0(nm, " scored ", scr)
map2_chr(.x = names, .y = scores, .f = myPaste)
[1] "Alice scored 88" "Bob scored 95"   "Carol scored 72"
For all combinations, try something like:
with(
  expand.grid(names, scores),
  paste0(Var1, " scored ", Var2)
)
[1] "Alice scored 88" "Bob scored 88"   "Carol scored 88" "Alice scored 95"
[5] "Bob scored 95" "Carol scored 95" "Alice scored 72" "Bob scored 72"
[9] "Carol scored 72"
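The element-by-element pairing of map2() extends to three or more inputs with pmap(), which takes the inputs as a list. A minimal sketch, assuming a third vector `grades` that is invented here purely for illustration:

```r
library(purrr)

names  <- c("Alice", "Bob", "Carol")
scores <- c(88, 95, 72)
grades <- c("B", "A", "C") # hypothetical third input, not part of the data above

# The list names must match the argument names of the function
report <- pmap_chr(
  list(nm = names, scr = scores, grd = grades),
  function(nm, scr, grd) paste0(nm, " scored ", scr, " (grade ", grd, ")")
)
report
```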
Vectorize()
Last session we wrote functions. Most built-in R functions are already vectorised — they work on a whole vector at once:
sqrt(c(4, 9, 16))[1] 2 3 4
But if you write your own function with an if / else inside, it usually expects a single value:
bp_category <- function(systolic) {
  if (systolic >= 130) "High" else "Normal"
}
c(120, 145, 132) |> bp_category() # doesn't work on a vector
Error in `if (systolic >= 130) ...`:
! the condition has length > 1
Vectorize() wraps your function so it loops automatically:
bp_category <- bp_category |> Vectorize()
c(120, 145, 132) |> bp_category()
[1] "Normal" "High"   "High"  
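This is equivalent to writing a for loop, but in one line: Vectorize() is a wrapper around mapply(), so the function is still called once per element and is no faster than an explicit loop. (For the tidyverse crowd, dplyr::case_when() is often a cleaner solution inside mutate().)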
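For comparison, here is a sketch of two alternatives that test the whole vector at once: base R's ifelse(), and case_when() inside mutate(). The tibble `toy` is a made-up stand-in with a single systolic column.

```r
library(dplyr)

systolic <- c(120, 145, 132)

# ifelse() evaluates the condition for every element, so no wrapper is needed
bp_base <- ifelse(systolic >= 130, "High", "Normal")

# The same rule inside mutate(), using case_when()
toy <- tibble(systolic = systolic) |>
  mutate(
    bp = case_when(
      systolic >= 130 ~ "High",
      TRUE ~ "Normal" # fallback for everything else
    )
  )

bp_base
toy
```

case_when() pays off once there are more than two categories, since each extra rule is just one more line.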
Parallel loops with foreach and %dopar%
So far, all loops run sequentially — one step at a time. For computationally expensive work (e.g., bootstrapping, simulations), you can run iterations in parallel across multiple CPU cores using {doParallel} with the for-like syntax of foreach().
library(foreach)
library(doParallel)
cl <- makeCluster(10) # open 10 workers
registerDoParallel(cl)
mean_age <- function(df) mean(df$age)
results <- foreach(i = 1:1000,
                   .packages = "dplyr",
                   .combine = c) %dopar% {
  dataset_i <- slice_sample(df, n = nrow(df), replace = TRUE)
  result_i <- mean_age(dataset_i)
  return(result_i)
}
stopCluster(cl)
closeAllConnections()
results |> quantile(probs = c(0.025, 0.975))
    2.5%    97.5% 
47.75000 61.87812 
.combine = c means that the results of the individual iterations are concatenated with c() into a single vector. Without .combine, foreach() returns a list with one element per iteration.
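The effect of .combine is easiest to see on a tiny loop. This sketch uses %do%, the sequential counterpart of %dopar%, so it runs without a cluster; the toy computation (squaring 1 to 3) is only for illustration.

```r
library(foreach)

# Without .combine, foreach() returns a list with one element per iteration
as_list <- foreach(i = 1:3) %do% {
  i^2
}

# .combine = c concatenates the per-iteration results into one vector
as_vector <- foreach(i = 1:3, .combine = c) %do% {
  i^2
}

# .combine = rbind stacks them as rows instead
as_matrix <- foreach(i = 1:3, .combine = rbind) %do% {
  c(i = i, square = i^2)
}

str(as_list)
as_vector
as_matrix
```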
For simpler parallelism inside purrr, the {furrr} package lets you swap map() for future_map() with almost no code change.
Honourable mention: Recursion
A function can call itself. This is called recursion, and it’s another way to repeat an operation. Classic example: factorials.
factorial <- function(n) {
  if (n <= 1) return(1)
  else return(n * factorial(n - 1))
}
factorial(5)
[1] 120
Recursion is elegant for certain problems (tree traversal, divide-and-conquer algorithms) but can be slow and hard to debug in R. Loops or map() are almost always what you need.
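As a point of comparison, the same calculation can be written as a plain loop. In this sketch the function names `fact_recursive` and `fact_loop` are made up so that base R's own factorial() is not masked and can be used as a check.

```r
# Recursive version: the function calls itself with a smaller n
fact_recursive <- function(n) {
  if (n <= 1) return(1)
  n * fact_recursive(n - 1)
}

# Loop version: multiply a running product by 1, 2, ..., n
fact_loop <- function(n) {
  result <- 1
  for (k in seq_len(n)) {
    result <- result * k
  }
  result
}

c(
  recursive = fact_recursive(5),
  loop = fact_loop(5),
  base = factorial(5)
)
```

The loop version does not add a new function call per step, so it cannot run into R's limit on nested calls for large n.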