```r
library(dplyr)
library(data.table)
set.seed(12345)
```

## Coding paradigms
A coding paradigm is a style of coding that follows a specific set of principles; think of them as different dialects of the same language. R has three common ones for data manipulation:
| Paradigm | Main package(s) | Tagline |
|---|---|---|
| Base R | Built-in | Explicit, no dependencies |
| Tidyverse | {dplyr}, {tidyr} | Readable, pipe-friendly |
| data.table | {data.table} | Concise, fast, memory-efficient |
What’s “best” depends on the use case. You’ll encounter all three in the wild, although many guides use the Tidyverse.
## Some data to work on
```r
df <- tibble(
  id = 1:10,
  age = rnorm(10, mean = 40, sd = 10) |> round(),
  sex = sample(c("M", "F"), 10, replace = TRUE),
  weight = rnorm(10, mean = 75, sd = 10) |> round(),
  height = rnorm(10, mean = 1.66, sd = 0.1) |> round(2)
)
DT <- as.data.table(df)  # data.table copy
```
```r
df |> gt::gt(id = "df_table")
```

| id | age | sex | weight | height |
|---|---|---|---|---|
| 1 | 46 | F | 83 | 1.84 |
| 2 | 47 | M | 66 | 1.61 |
| 3 | 39 | F | 72 | 1.72 |
| 4 | 35 | F | 86 | 1.72 |
| 5 | 46 | M | 78 | 1.64 |
| 6 | 22 | F | 83 | 1.74 |
| 7 | 46 | F | 90 | 1.88 |
| 8 | 37 | F | 69 | 1.86 |
| 9 | 37 | F | 59 | 1.82 |
| 10 | 31 | F | 59 | 1.69 |
## Indexing styles

### Base R

Base R uses explicit indexing, with `$` for columns and `[]` for rows and columns. You need to specify exactly what you want, which can be verbose. `data.frame[rows, columns]` is the general form.
```r
df[df$age <= 35, "weight"]
#> # A tibble: 3 × 1
#>   weight
#>    <dbl>
#> 1     86
#> 2     83
#> 3     59
```
Alternatively, you can use `$` to select a column, but this gives you the column as a standalone vector rather than as a one-column dataframe:
```r
df[df$age <= 35, ]$weight
#> [1] 86 83 59
```
Or the double-bracket `[[` to select a column by name, which also returns a vector; this can be useful when you can't write the column name explicitly but have it stored in a variable:
```r
df[df$age <= 35, ][["weight"]]
#> [1] 86 83 59

column_names <- "weight"
df[df$age <= 35, ][[column_names]]
#> [1] 86 83 59
```
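Base R also ships a convenience helper, `subset()`, which combines row filtering and column selection in one call. A quick sketch, with the toy data recreated so the snippet runs on its own:

```r
set.seed(12345)
df <- data.frame(
  id = 1:10,
  age = round(rnorm(10, mean = 40, sd = 10)),
  sex = sample(c("M", "F"), 10, replace = TRUE),
  weight = round(rnorm(10, mean = 75, sd = 10)),
  height = round(rnorm(10, mean = 1.66, sd = 0.1), 2)
)

# subset() filters rows and selects columns at once;
# like df[rows, "weight", drop = FALSE], it returns a one-column data frame
subset(df, age <= 35, select = weight)
```

Note that `subset()` itself relies on non-standard evaluation, so (as its own documentation warns) it is better suited to interactive use than to programming.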
### Tidyverse
The tidyverse uses something called non-standard evaluation, allowing you to write column names directly without quoting. The pipe operator `|>` (or the older `%>%` from {magrittr}) lets you chain operations together in a readable way. You subset data using `filter()` (rows) and `select()` (columns).
```r
df |> filter(age <= 35) |> select(weight)
#> # A tibble: 3 × 1
#>   weight
#>    <dbl>
#> 1     86
#> 2     83
#> 3     59
```
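The tidyverse counterpart of `$`/`[[` is `dplyr::pull()`, which extracts a single column as a plain vector rather than a one-column tibble. A minimal sketch, with the data recreated so the chunk stands alone:

```r
library(dplyr)

set.seed(12345)
df <- tibble(
  id = 1:10,
  age = rnorm(10, mean = 40, sd = 10) |> round(),
  sex = sample(c("M", "F"), 10, replace = TRUE),
  weight = rnorm(10, mean = 75, sd = 10) |> round(),
  height = rnorm(10, mean = 1.66, sd = 0.1) |> round(2)
)

# select() keeps a one-column tibble; pull() gives a plain vector
df |> filter(age <= 35) |> pull(weight)
```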
### data.table

{data.table} extends the `df[]` bracket notation into `DT[i, j, by]`, where:

- `i` filters rows,
- `j` specifies an operation (select / compute columns),
- `by` groups the data.
For example:
```r
DT[age <= 35, weight]
#> [1] 86 83 59
```
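As with base R's `$`, selecting a bare column in `j` returns a vector; wrap it in `.()` if you want a one-column data.table instead. A sketch with a small hand-made table:

```r
library(data.table)

DT <- data.table(
  age = c(46, 47, 39, 35, 22),
  weight = c(83, 66, 72, 86, 83)
)

DT[age <= 35, weight]      # plain vector: 86 83
DT[age <= 35, .(weight)]   # one-column data.table
```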
## Add a column

### Base R
In base R, you can add a new column by direct assignment:
```r
df$bmi <- (df$weight / df$height^2) |> round(1)
head(df)
#> # A tibble: 6 × 6
#>      id   age sex   weight height   bmi
#>   <int> <dbl> <chr>  <dbl>  <dbl> <dbl>
#> 1     1    46 F         83   1.84  24.5
#> 2     2    47 M         66   1.61  25.5
#> 3     3    39 F         72   1.72  24.3
#> 4     4    35 F         86   1.72  29.1
#> 5     5    46 M         78   1.64  29
#> 6     6    22 F         83   1.74  27.4
```
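`transform()` is a base alternative to direct assignment: it evaluates the expression inside the data frame's scope, so you avoid repeating `df$`. Unlike `$<-`, it returns a modified copy rather than editing the object. A sketch with data recreated so it runs standalone:

```r
set.seed(12345)
df <- data.frame(
  weight = round(rnorm(10, mean = 75, sd = 10)),
  height = round(rnorm(10, mean = 1.66, sd = 0.1), 2)
)

# columns are referenced without the df$ prefix inside transform()
df <- transform(df, bmi = round(weight / height^2, 1))
head(df)
```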
### Tidyverse
In the tidyverse, you can use mutate() to add a new column:
```r
df <- df |> mutate(
  bmi = (weight / height^2) |> round(1)
)
head(df)
#> # A tibble: 6 × 6
#>      id   age sex   weight height   bmi
#>   <int> <dbl> <chr>  <dbl>  <dbl> <dbl>
#> 1     1    46 F         83   1.84  24.5
#> 2     2    47 M         66   1.61  25.5
#> 3     3    39 F         72   1.72  24.3
#> 4     4    35 F         86   1.72  29.1
#> 5     5    46 M         78   1.64  29
#> 6     6    22 F         83   1.74  27.4
```
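A nice property of `mutate()` is that one call can create several columns, and later expressions can reference columns defined just above in the same call. A sketch with a small hand-made tibble:

```r
library(dplyr)

df <- tibble(
  weight = c(83, 66, 72),
  height = c(1.84, 1.61, 1.72)
)

df |> mutate(
  bmi = round(weight / height^2, 1),
  overweight = bmi >= 25   # refers to the bmi column created just above
)
```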
### data.table

In data.table, you can add a new column using `:=`:
```r
DT[, bmi := (weight / height^2) |> round(1)]
head(DT)
#>       id   age    sex weight height   bmi
#>    <int> <num> <char>  <num>  <num> <num>
#> 1:     1    46      F     83   1.84  24.5
#> 2:     2    47      M     66   1.61  25.5
#> 3:     3    39      F     72   1.72  24.3
#> 4:     4    35      F     86   1.72  29.1
#> 5:     5    46      M     78   1.64  29.0
#> 6:     6    22      F     83   1.74  27.4
```
`:=` modifies the table in place (no copy is made), which is why data.table is so memory-efficient.
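Because `:=` edits the existing object, any other variable pointing at the same data.table sees the change too; `data.table::copy()` makes an explicit deep copy when that is not what you want. A small demonstration:

```r
library(data.table)

DT <- data.table(weight = c(83, 66), height = c(1.84, 1.61))
alias <- DT                        # just another name for the same table
DT[, bmi := round(weight / height^2, 1)]
"bmi" %in% names(alias)            # TRUE: the alias sees the new column

DT2 <- copy(DT)                    # an explicit deep copy
DT2[, bmi := NULL]                 # drop bmi from the copy only
"bmi" %in% names(DT)               # TRUE: the original is untouched
```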
## Group-wise operations

### Base R

Base R doesn't have a dedicated grouped-pipeline interface, but you can use `aggregate()` with a formula to summarise by group:
```r
aggregate(weight ~ sex, data = df, FUN = mean)
#>   sex weight
#> 1   F 75.125
#> 2   M 72.000
```
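`tapply()` is another built-in option; it returns a named vector rather than a data frame. A sketch on a small hand-made data frame:

```r
df <- data.frame(
  sex = c("F", "M", "F", "F", "M"),
  weight = c(83, 66, 72, 86, 78)
)

# split weight by sex, then apply mean to each group
tapply(df$weight, df$sex, mean)   # named vector: F ~ 80.33, M = 72
```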
### Tidyverse
In the tidyverse, you can easily group by a variable and summarise:
```r
df |>
  group_by(sex) |>
  summarise(mean_weight = mean(weight))
#> # A tibble: 2 × 2
#>   sex   mean_weight
#>   <chr>       <dbl>
#> 1 F            75.1
#> 2 M            72
```
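`summarise()` can compute several statistics at once, and `n()` gives the group size; recent dplyr (1.1.0 and later) also accepts a per-call `.by` argument in place of `group_by()`. A sketch on a small hand-made tibble:

```r
library(dplyr)

df <- tibble(
  sex = c("F", "M", "F", "F", "M"),
  weight = c(83, 66, 72, 86, 78)
)

df |>
  summarise(
    mean_weight = mean(weight),
    n = n(),        # rows per group
    .by = sex       # per-call grouping, dplyr >= 1.1.0
  )
```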
### data.table

In {data.table}, you can specify the grouping variable very naturally in the `by` argument:
```r
DT[, .(mean_weight = mean(weight)), by = sex]
#>       sex mean_weight
#>    <char>       <num>
#> 1:      F      75.125
#> 2:      M      72.000
```
Here, `.()` is shorthand for `list()` and lets you name the output columns.
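The special symbol `.N` (the number of rows in the current group) pairs naturally with grouped `j`; a sketch computing group means and sizes together:

```r
library(data.table)

DT <- data.table(
  sex = c("F", "M", "F", "F", "M"),
  weight = c(83, 66, 72, 86, 78)
)

# .N counts rows per group alongside the grouped mean
DT[, .(mean_weight = mean(weight), n = .N), by = sex]
```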
## When to use which?

### Base R
Base R uses explicit indexing and standard evaluation. What you see is what you get — there are no hidden conventions or non-standard evaluation tricks.
**Pros:** Zero dependencies; great for simple scripts, package development, and teaching fundamentals.

**Cons:** Can get messy for complex data manipulation; less efficient on large datasets due to copying.
### Tidyverse
The tidyverse (centred on {dplyr}) is declarative and uses non-standard evaluation: you write column names without quoting them or prefixing them with `$`. Operations chain together naturally with the pipe `|>`, reading top to bottom like a sentence. Each verb takes a data frame as its first argument and returns a data frame, which makes verbs easy to compose.
**Pros:** The most readable option, ideal when clarity matters; a rich ecosystem of packages covers everything from data manipulation to visualisation.

**Cons:** Like base R, it can be slower than data.table on large datasets due to copying.
### data.table
data.table is concise and fast. It brings the `DT[i, j, by]` syntax, allowing you to filter rows, operate on columns, and group by variables all in one place. It modifies data in place, making it memory-efficient. It also uses non-standard evaluation, so you can write column names directly without quoting.
**Pros:** Best when working with millions of rows; in-place modification and optimised grouping make it the fastest and most memory-efficient option for large datasets. It also has powerful features for complex data manipulation.

**Cons:** The concise syntax can be less intuitive for beginners, and the learning curve is steeper. Speed gains are sometimes overstated, as base R and the tidyverse have seen performance improvements in recent years.
Most real-world R code mixes paradigms. You might load data with base R, clean it with {dplyr}, and run a heavy aggregation with {data.table}. Pick the right tool for each job.
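As a toy illustration of that mixing, here is a pipeline that builds data with base R, derives a column with {dplyr}, and aggregates with {data.table}; it is purely illustrative, since any single paradigm could do all three steps:

```r
library(dplyr)
library(data.table)

set.seed(1)
# base R: build the data
df <- data.frame(
  sex = sample(c("M", "F"), 100, replace = TRUE),
  weight = round(rnorm(100, mean = 75, sd = 10)),
  height = round(rnorm(100, mean = 1.66, sd = 0.1), 2)
)

# {dplyr}: derive a column
df <- df |> mutate(bmi = round(weight / height^2, 1))

# {data.table}: grouped aggregation
as.data.table(df)[, .(mean_bmi = mean(bmi), n = .N), by = sex]
```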