Minimal, reproducible example

Where to ask for help

Consider Stack Overflow! Always look for existing answers to your question first, then post. AI tools (LLMs) can be a good help for simple cases, but for more complex problems they will often give you code that breaks easily. That code can also be esoteric and difficult to debug.

There is no shame in using LLMs as long as one is aware of these issues. In fact, LLMs can be a great way of turning your headache into a well-formed question.

What is a minimal, reproducible example (MRE)

When presenting your problem to others, it can be very helpful to move away from the full dataset and ALL the code, and create a miniature version of your problem: an MRE.

  1. Minimal: The example should use as little code and data as possible to produce the problem.

  2. Complete: Your question should contain ALL the information needed to reproduce the problem.

  3. Reproducible: Make sure the code and data provided ACTUALLY reproduce the same problem (and not a different one).

Create the code example by building it up step by step until the problem appears. Alternatively, start from the whole code and remove bits one at a time until the problem disappears, then reinsert the last part that was removed.
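As a rough illustration of the three criteria, an MRE could look like the sketch below. The data frame `toy` and the question being asked (why mean() returns NA) are invented for this example; dput() and sessionInfo() are base R functions that help make a question complete.

```r
# Hypothetical question: "Why does mean() return NA for my column?"

# Minimal: a tiny, hand-made data frame instead of the full dataset
toy <- data.frame(
  id    = 1:4,
  value = c(1.5, NA, 3.2, 4.1)
)

# Reproducible: the single call that shows the problem
result <- mean(toy$value)
print(result)  # NA, because one value is missing

# Complete: dput() prints code that recreates the data exactly,
# and sessionInfo() reports the R version and loaded packages
dput(toy)
sessionInfo()
```

Pasting the output of dput() into the question lets others recreate the data without needing access to your files.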

How to simulate a dataset

Here are some good functions to know when creating a mock dataset.

Setting the random seed

Before doing any random operations, setting the seed to a fixed value will ensure that the code produces exactly the same output every time:

set.seed(12345)
rpois(5, 10)
[1] 11 12  9  8 11
set.seed(12345)
rpois(5, 10) 
[1] 11 12  9  8 11
set.seed(23)
rpois(5, 10)
[1] 10  8 12 15 13
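A quick way to confirm this behaviour is to compare two draws with identical(). This is a small sketch using the same seed as above; the names `first_draw` and `second_draw` are only for illustration.

```r
set.seed(12345)
first_draw <- rpois(5, 10)

# Reset to the same seed before drawing again
set.seed(12345)
second_draw <- rpois(5, 10)

# TRUE: the same seed gives the same sequence of random numbers
identical(first_draw, second_draw)
```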

Sampling from a distribution

N <- 20

The normal distribution

# The normal distribution
rnorm(n = N, mean = 65, sd = 7)
 [1] 72.75243 63.05340 72.13444 65.31806 76.03046 66.52802 57.67425 62.97918
 [9] 68.37085 56.48537 67.15696 61.35875 61.90380 60.80481 74.06204 70.84774
[17] 61.03789 70.51894 56.83849 61.28426

The binomial distribution

# size = number of flips of the coin. E.g., size = 1 and prob = 0.6 gives you a binary variable, where approx. 60% of values are 1 and the remaining values are 0.
rbinom(n = N, size = 1, prob = 0.6)
 [1] 1 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 1 0 1 0
# size = 3, prob = 0.5 counts how many heads you'd get if you flipped a fair coin three times (P(0) = 1/8, P(1) = 3/8, P(2) = 3/8, P(3) = 1/8)
rbinom(n = N, size = 3, prob = 0.5)
 [1] 2 2 3 2 0 1 2 1 2 3 2 0 3 2 2 0 1 3 0 1
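The probabilities quoted in the comment above can be checked with dbinom(), the binomial probability mass function. A small sketch; the variable name `probs` is just for illustration.

```r
# Probability of 0, 1, 2 and 3 heads in three flips of a fair coin
probs <- dbinom(0:3, size = 3, prob = 0.5)
probs
# 0.125 0.375 0.375 0.125, i.e. 1/8, 3/8, 3/8, 1/8

# The probabilities of all possible outcomes sum to 1
sum(probs)
```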

The Poisson distribution

rpois(n = N, lambda = 15)
 [1]  8 15 13 15 10 14  7 17 13 15 15 16 13 16 18 12 15 22 10 16

And whatever else

Read about other distributions using e.g., ?rnorm, and notice the other ways of extracting info about a distribution (dnorm, pnorm, qnorm for the normal density function, cumulative distribution function and quantile function, respectively).

?rweibull
?rcauchy
?rchisq
# ...
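As a short sketch of how the d/p/q functions relate to each other, here they are for the normal distribution used earlier (mean 65, sd 7). The variable names are illustrative only.

```r
mu <- 65
sigma <- 7

# Density at the mean: 1 / (sigma * sqrt(2 * pi))
dens_at_mean <- dnorm(65, mean = mu, sd = sigma)

# Cumulative probability at the mean: half of the values lie below it
p_below_mean <- pnorm(65, mean = mu, sd = sigma)  # 0.5

# 97.5% quantile: about 78.7 (mu + 1.96 * sigma)
q975 <- qnorm(0.975, mean = mu, sd = sigma)

# qnorm() is the inverse of pnorm(), so this returns 0.975
pnorm(q975, mean = mu, sd = sigma)
```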

An example dataset

The simple way:

df <- data.frame(
  trt = rbinom(n = 5, size = 1, prob = 0.5),
  age = rnorm(n = 5, mean = 65, sd = 7) |> round(1),
  tte = rpois(n = 5, 15)
)
df |> gt::gt()
trt   age  tte
  0  70.1   16
  1  62.0   19
  1  64.1   20
  0  72.2   18
  0  65.0   14

Tibbles allow you to build up columns sequentially; i.e. use info from one column in building the next:

library(tibble)
df_tibble <- tibble::tibble(
  trt = rbinom(n = 5, size = 1, prob = 0.5),
  age = rnorm(n = 5, mean = 65, sd = 7) |> round(1),
  tte_death = rpois(n = 5, lambda = 15+trt),
  tte_censor = rpois(n = 5, lambda = 13),
  tte = pmin(tte_death, tte_censor),
  event = ifelse(tte_death <= tte_censor, 1, 0)
)
df_tibble |> gt::gt()
trt   age  tte_death  tte_censor  tte  event
  0  69.8         13          14   13      1
  0  74.4         13          16   13      1
  1  76.8         12           6    6      0
  0  69.4         17          14   14      0
  0  57.9         17          11   11      0
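If you would rather keep an example free of package dependencies, roughly the same dataset can be built in base R by adding one column at a time. This is a sketch of an alternative, not the approach used above; the name `df_base` and the seed value are arbitrary choices.

```r
set.seed(2024)
n <- 5

# Start with one column, then add columns that depend on earlier ones
df_base <- data.frame(trt = rbinom(n = n, size = 1, prob = 0.5))
df_base$age <- round(rnorm(n = n, mean = 65, sd = 7), 1)
df_base$tte_death <- rpois(n = n, lambda = 15 + df_base$trt)
df_base$tte_censor <- rpois(n = n, lambda = 13)

# Observed time is whichever comes first: death or censoring
df_base$tte <- pmin(df_base$tte_death, df_base$tte_censor)

# event = 1 if death was observed, 0 if the observation was censored
df_base$event <- ifelse(df_base$tte_death <= df_base$tte_censor, 1, 0)

df_base
```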

Sampling from a set of values

Sample from a vector of values using sample().

vec <- c(1,5,8)
sample(
  x = vec,
  size = 10,
  replace = TRUE,
  prob = c(0.2,0.3,0.5)
)
 [1] 5 5 8 5 8 8 5 5 8 8
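To see the effect of the prob argument, one option is to draw a much larger sample and tabulate the observed proportions, which should land close to 0.2, 0.3 and 0.5. A rough sketch; the seed, the sample size and the names `big_sample` and `props` are arbitrary.

```r
set.seed(1)
vec <- c(1, 5, 8)

big_sample <- sample(
  x = vec,
  size = 100000,
  replace = TRUE,
  prob = c(0.2, 0.3, 0.5)
)

# Share of draws equal to 1, 5 and 8, respectively
props <- prop.table(table(big_sample))
round(props, 2)
```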

This can also be used to sample row-indices to extract entire rows from a dataset.

row_indices <- sample(
  x = seq_len(nrow(df_tibble)),
  size = 3
)
row_indices
[1] 2 4 5
df_tibble[row_indices,] |> gt::gt()
trt   age  tte_death  tte_censor  tte  event
  0  74.4         13          16   13      1
  0  69.4         17          14   14      0
  0  57.9         17          11   11      0

Or use the {dplyr} function sample_n() to sample rows from a table directly (in recent versions of {dplyr}, sample_n() is superseded by slice_sample()):

dplyr::sample_n(df_tibble, size = 3) |> gt::gt()
trt   age  tte_death  tte_censor  tte  event
  0  57.9         17          11   11      0
  0  69.4         17          14   14      0
  0  74.4         13          16   13      1