Coding paradigms

library(dplyr)
library(data.table)
set.seed(12345)

A coding paradigm is a style of coding that follows a specific set of principles; think of them as different dialects of the same language. R has three common paradigms for data manipulation:

Paradigm     Main package(s)     Tagline
Base R       Built-in            Explicit, no dependencies
Tidyverse    {dplyr}, {tidyr}    Readable, pipe-friendly
data.table   {data.table}        Concise, fast, memory-efficient

What’s “best” depends on the use case. You’ll encounter all three in the wild, although many guides use the Tidyverse.

Some data to work on

df <- tibble(
  id = 1:10,
  age = rnorm(10, mean = 40, sd = 10) |> round(),
  sex = sample(c("M", "F"), 10, replace = TRUE),
  weight = rnorm(10, mean = 75, sd = 10) |> round(),
  height = rnorm(10, mean = 1.66, sd = 0.1) |> round(2)
)

DT <- as.data.table(df)   # data.table copy

df |> gt::gt(id = "df_table")
id age sex weight height
1 46 F 83 1.84
2 47 M 66 1.61
3 39 F 72 1.72
4 35 F 86 1.72
5 46 M 78 1.64
6 22 F 83 1.74
7 46 F 90 1.88
8 37 F 69 1.86
9 37 F 59 1.82
10 31 F 59 1.69

Indexing styles

Base R

Base R uses explicit indexing with $ for columns and [] for rows and columns. You need to specify exactly what you want, which can be verbose. data.frame[rows, columns] is the general form.

df[df$age <= 35, "weight"]
# A tibble: 3 × 1
  weight
   <dbl>
1     86
2     83
3     59

Alternatively, you can use $ to select a column, but this gives you the column as a standalone vector rather than as a one-column data frame:

df[df$age <= 35,]$weight
[1] 86 83 59

Or the double-bracket [[ to select a column by name, which also returns a vector; this is useful when you can’t write the column name literally but have it stored in a variable:

df[df$age <= 35,][["weight"]]
[1] 86 83 59
column_names <- "weight"
df[df$age <= 35,][[column_names]]
[1] 86 83 59
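Base R’s subset() offers a more readable alternative that combines the row filter and the column selection in one call. A minimal sketch, hard-coding the example data from the table above so it runs standalone:

```r
# The example data, written out explicitly (values from the table above)
df <- data.frame(
  id = 1:10,
  age = c(46, 47, 39, 35, 46, 22, 46, 37, 37, 31),
  sex = c("F", "M", "F", "F", "M", "F", "F", "F", "F", "F"),
  weight = c(83, 66, 72, 86, 78, 83, 90, 69, 59, 59),
  height = c(1.84, 1.61, 1.72, 1.72, 1.64, 1.74, 1.88, 1.86, 1.82, 1.69)
)

# subset() filters rows and selects columns in one call;
# with select =, the result stays a (one-column) data frame
subset(df, age <= 35, select = weight)
```

Note that subset() itself relies on non-standard evaluation, which is why its help page recommends it for interactive use rather than inside functions.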

Tidyverse

The tidyverse uses something called non-standard evaluation, which lets you write column names directly without quoting. The pipe operator |> (or its {magrittr} predecessor %>%) chains operations together in a readable way. You subset data using filter() (rows) and select() (columns).

df |> filter(age <= 35) |> select(weight)
# A tibble: 3 × 1
  weight
   <dbl>
1     86
2     83
3     59
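If you want the result as a bare vector, the tidyverse analogue of $ or [[ above, {dplyr} provides pull(). A short sketch, again with the example data written out explicitly:

```r
library(dplyr)

# The example data from the table above
df <- tibble(
  id = 1:10,
  age = c(46, 47, 39, 35, 46, 22, 46, 37, 37, 31),
  sex = c("F", "M", "F", "F", "M", "F", "F", "F", "F", "F"),
  weight = c(83, 66, 72, 86, 78, 83, 90, 69, 59, 59),
  height = c(1.84, 1.61, 1.72, 1.72, 1.64, 1.74, 1.88, 1.86, 1.82, 1.69)
)

# pull() extracts a single column as a vector, unlike select()
df |> filter(age <= 35) |> pull(weight)
#> [1] 86 83 59
```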

data.table

{data.table} extends the df[] bracket notation into DT[i, j, by], where:

  • i filters rows,
  • j selects or computes columns,
  • by groups the data.

For example:

DT[age <= 35, weight]
[1] 86 83 59
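j is not limited to a single column: wrapping names in .() returns a data.table rather than a vector. A sketch, hard-coding the example data so it runs standalone:

```r
library(data.table)

# The example data from the table above
DT <- data.table(
  id = 1:10,
  age = c(46, 47, 39, 35, 46, 22, 46, 37, 37, 31),
  sex = c("F", "M", "F", "F", "M", "F", "F", "F", "F", "F"),
  weight = c(83, 66, 72, 86, 78, 83, 90, 69, 59, 59),
  height = c(1.84, 1.61, 1.72, 1.72, 1.64, 1.74, 1.88, 1.86, 1.82, 1.69)
)

# A bare column name in j returns a vector; .() keeps a data.table
DT[age <= 35, .(weight, height)]
```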

Add a column

Base R

In base R, you can add a new column by direct assignment:

df$bmi <- (df$weight / df$height^2) |> round(1)
head(df)
# A tibble: 6 × 6
     id   age sex   weight height   bmi
  <int> <dbl> <chr>  <dbl>  <dbl> <dbl>
1     1    46 F         83   1.84  24.5
2     2    47 M         66   1.61  25.5
3     3    39 F         72   1.72  24.3
4     4    35 F         86   1.72  29.1
5     5    46 M         78   1.64  29  
6     6    22 F         83   1.74  27.4

Tidyverse

In the tidyverse, you can use mutate() to add a new column:

df <- df |> mutate(
  bmi = (weight / height^2) |> round(1)
)
head(df)
# A tibble: 6 × 6
     id   age sex   weight height   bmi
  <int> <dbl> <chr>  <dbl>  <dbl> <dbl>
1     1    46 F         83   1.84  24.5
2     2    47 M         66   1.61  25.5
3     3    39 F         72   1.72  24.3
4     4    35 F         86   1.72  29.1
5     5    46 M         78   1.64  29  
6     6    22 F         83   1.74  27.4

data.table

In data.table, you can add a new column using :=:

DT[, bmi := (weight / height^2) |> round(1)]
head(DT)
      id   age    sex weight height   bmi
   <int> <num> <char>  <num>  <num> <num>
1:     1    46      F     83   1.84  24.5
2:     2    47      M     66   1.61  25.5
3:     3    39      F     72   1.72  24.3
4:     4    35      F     86   1.72  29.1
5:     5    46      M     78   1.64  29.0
6:     6    22      F     83   1.74  27.4

:= modifies the table in place (no copy is made), which is why data.table is so memory-efficient.
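One consequence worth knowing: after DT2 <- DT, both names refer to the same underlying table, so modifying through one name with := changes the other too. data.table provides copy() for a genuinely independent copy. A minimal sketch with a toy table:

```r
library(data.table)

DT <- data.table(x = 1:3)

DT2 <- DT          # NOT a copy: both names point at the same table
DT2[, y := x * 2]  # ...so this adds y to DT as well
"y" %in% names(DT)
#> [1] TRUE

DT3 <- copy(DT)    # a true, independent copy
DT3[, z := 0]      # this does NOT touch DT
"z" %in% names(DT)
#> [1] FALSE
```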

Group-wise operations

Base R

Base R has no dedicated grouping syntax, but the built-in aggregate() summarises by group via a formula interface:

aggregate(weight ~ sex, data = df, FUN = mean)
  sex weight
1   F 75.125
2   M 72.000
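The formula interface also handles several response columns at once via cbind(). A sketch with the relevant columns of the example data written out explicitly:

```r
# The sex, weight, and height columns from the example data above
df <- data.frame(
  sex = c("F", "M", "F", "F", "M", "F", "F", "F", "F", "F"),
  weight = c(83, 66, 72, 86, 78, 83, 90, 69, 59, 59),
  height = c(1.84, 1.61, 1.72, 1.72, 1.64, 1.74, 1.88, 1.86, 1.82, 1.69)
)

# Mean weight AND mean height per sex, in one call
aggregate(cbind(weight, height) ~ sex, data = df, FUN = mean)
```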

Tidyverse

In the tidyverse, you can easily group by a variable and summarise:

df |>
  group_by(sex) |>
  summarise(mean_weight = mean(weight))
# A tibble: 2 × 2
  sex   mean_weight
  <chr>       <dbl>
1 F            75.1
2 M            72  
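summarise() accepts any number of name = expression pairs, and helpers such as n() give group sizes. A sketch extending the summary above, with the relevant data hard-coded:

```r
library(dplyr)

# The sex and weight columns from the example data above
df <- tibble(
  sex = c("F", "M", "F", "F", "M", "F", "F", "F", "F", "F"),
  weight = c(83, 66, 72, 86, 78, 83, 90, 69, 59, 59)
)

df |>
  group_by(sex) |>
  summarise(
    n = n(),                     # group size
    mean_weight = mean(weight),
    max_weight = max(weight)
  )
```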

data.table

In {data.table}, you can specify the grouping variable very naturally in the by argument:

DT[, .(mean_weight = mean(weight)), by = sex]
      sex mean_weight
   <char>       <num>
1:      F      75.125
2:      M      72.000

Here, .() is shorthand for list() and names the output columns.
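j can also use special symbols such as .N, the number of rows in the current group, so counts and summaries come out together. A sketch with the relevant data hard-coded:

```r
library(data.table)

# The sex and weight columns from the example data above
DT <- data.table(
  sex = c("F", "M", "F", "F", "M", "F", "F", "F", "F", "F"),
  weight = c(83, 66, 72, 86, 78, 83, 90, 69, 59, 59)
)

# .N is the row count within each group
DT[, .(n = .N, mean_weight = mean(weight)), by = sex]
```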

When to use which?

Base R

Base R uses explicit indexing and standard evaluation. What you see is what you get — there are no hidden conventions or non-standard evaluation tricks.

Pros: Zero dependencies; great for simple scripts, package development, and teaching fundamentals.
Cons: Can get messy for complex data manipulation; less efficient on large datasets due to copying.

Tidyverse

The tidyverse (centred on {dplyr}) is declarative and uses non-standard evaluation: you write column names without quoting them or prefixing them with $. Operations chain naturally with the pipe |>, reading top to bottom like a sentence. Each dplyr verb takes a data frame as its first argument and returns a data frame, which makes the verbs easy to compose.

Pros: The most readable option, with a rich ecosystem of packages covering everything from data manipulation to visualisation.
Cons: Like base R, it can be slower than data.table for large datasets due to copying.

data.table

data.table is concise and fast. Its DT[i, j, by] syntax lets you filter rows, operate on columns, and group by variables all in one place. It modifies data in place, making it memory-efficient, and it also uses non-standard evaluation, so you can write column names directly without quoting.

Pros: Best when working with millions of rows; in-place modification and optimised grouping make it the fastest and most memory-efficient option for large datasets. It also has powerful features for complex data manipulation.
Cons: The concise syntax can be less intuitive for beginners, giving it a steeper learning curve. The speed gains are also sometimes overstated, since base R and the tidyverse have seen performance improvements in recent years.

Tip

Most real-world R code mixes paradigms. You might load data with base R, clean it with {dplyr}, and run a heavy aggregation with {data.table}. Pick the right tool for each job.

Further reading

R intro (CRAN)

Tidyverse paper

dplyr intro vignette

data.table intro vignette