set.seed(7)
sample(3)
[1] 2 1 3
Lucy D’Agostino McGowan
January 22, 2018
Recently I tweeted a small piece of advice re: when to set a seed in your script:
tip: always
set.seed()
AFTER loading ggplot2
Jenny pointed out that this may be blog post-worthy, so here we are!
A little bit about where this tweet came from: I was out to lunch with my friend Jonathan – over our scrumptious Thai food, he mentioned he was having some simulation trouble. In particular, some of his simulations were breaking, but when he tried to run the script locally, everything seemed fine. He was using the same seed in both instances, and in fact was running the exact same script – what could be causing this discrepancy! I recalled that ggplot2 would generate a random message about 10% the time, and asked whether he thought loading this package could somehow be affecting his simulation results. After some further investigation 🕵️♂️, it looked like indeed this may be the culprit!
The set.seed()
function sets the starting number used to generate a sequence of random numbers – it ensures that you get the same result if you start with that same seed each time you run the same process. For example, if I use the sample()
function immediately after setting a seed, I will always get the same sample.
If I run sample()
twice after setting a seed, however, I would not expect them to be the same. I’d expect the first result to match those above, and the second to be different.
The second is different because I have already performed one random process, so now my starting point prior to running the latter sample()
function is no longer 1
.
There is a small function in ggplot2 that runs when the library is loaded for the first time.
Note: This is a little different than this code looks now, it has been updated since this discussion began!
.onAttach <- function(...) {
if (!interactive() || stats::runif(1) > 0.1) return()
tips <- c(
"RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.",
"Find out what's changed in ggplot2 at https://github.com/tidyverse/ggplot2/releases.",
"Use suppressPackageStartupMessages() to eliminate package startup messages.",
"Need help? Try Stackoverflow: https://stackoverflow.com/tags/ggplot2.",
"Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/",
"Want to understand how all the pieces fit together? See the R for Data Science book: http://r4ds.had.co.nz/"
)
tip <- sample(tips, 1)
packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
}
This function has a sample()
call, which will move the starting place of your random sequence of numbers. The main piece that caused Jonathan some 😫 is the !interactive()
logic, which only runs the remainder of the code if the session is interactive. Another thing that can cause a bit of confusion is this .onAttach()
function is only run the first time the library is loaded, so if I run what looks like the exact same code twice during the same session, I can get different results. For example,
Note: If you are quite clever, you will notice that this is not an interactive document and therefore neither the first nor the second chunk should run .onAttach()
– this is true! I’ve run them interactively and included the output for demonstration purposes 🙊.
[1] 3 2 1
Notice the second chunk gave us the same result as above, but the first chunk was different. That is because the first chunk runs the .onAttach()
function between when I set my seed and when I drew my sample.
It turns out this isn’t cause for concern re: reproducibility, thanks to Jim Hester’s new patch to ggplot2 phew! Prior to this update, non-interactive scripts will be reproducible (if always run non-interactively), however interactive scripts can cause some issues, as shown above. In general, I like to provide future Lucy with as much assistance as possible, so I will likely avoid setting seeds prior to loading packages on the off chance that it will make my debugging trickier in the future.
withr::with_preserve_seed
, love it!), so you can go download the development version and set your seed wherever you please.withr::with_preserve_seed()
function 🎉