A tale of two data analysis workflows

Author

Håvard Karlsen

Published

July 25, 2023

I’m taking a break from my usual zealous proselytising of the virtues of R. Instead, let me tell you a story of two researchers.

Miles’ tale

We start with Miles, who’s doing a PhD on psychopenguinology — the study of penguins’ mental health. Currently he’s working on a paper that looks at penguinian depression as a function of body mass and sex. For this paper he collaborates with his supervisor, Dr Olivia, and he uses data from penguins collected by Dr Gorman1. Let’s look at Miles’ workflow.

The to-do list looks like this:

  1. Create composite scores for anxiety and depression
  2. Filter dataset to only retain the Gentoo and Adélie penguins
  3. Run regression analysis
  4. Create table showing the results of the regression analysis
  5. Write out the results in the paper

Initial data wrangling

Miles starts by loading the data into Microsoft Excel.

Importing data in Excel

Anxiety and depression have been measured using 14 items from the Hospital Anxiety and Depression Scale for Penguins (HADS), and he needs to calculate composite depression and anxiety scores from them. While this is possible to do in his favourite program for statistical solutions (FPSS), it’s far easier to do in Excel. Miles creates the composite scores by taking the mean of the 7 items that belong to anxiety and the mean of the 7 that belong to depression.

Creating a composite variable in Excel

Further data wrangling

Now Miles loads the revised dataset into his FPSS, SPSS.

Importing data into SPSS

He uses the filter function to filter out any penguin that has the species type “Chinstrap”. To do this, he clicks on

  • Data,
  • Select cases,
  • If,
    • specifies the filter condition,
  • Continue, and finally
  • OK.

Applying a filter in SPSS

Data analysis

Now for the fun part, the analysis! Still in SPSS, Miles clicks on

  • Analyze,
  • Regression,
  • Linear,
  • the dependent variable,
  • first right arrow,
  • the first predictor,
  • second right arrow,
  • the second predictor,
  • second right arrow, and finally
  • the OK button.

Create table

The tables have to be created in different software. Although SPSS claims to be able to output in the APA2 style, Miles has found that he still needs to edit the tables extensively to make them consistent with it. After some harrowing experiences with creating tables in Word, Miles now creates all his tables in Excel and then copy-pastes them into Word documents. This usually works fine, apart from when it doesn’t.

He writes up the results from the SPSS output in a table and formats the text in it.

Reporting results in text

The final step. Miles opens the Word document that he uses to write the paper. In the Results section he writes text explaining the effects he found and repeating some of the key numbers from the analysis.

Submit, get accepted, publish, win at life?

Having sent the paper to his supervisor for final approval, Miles receives a reply from Dr Olivia that his method for creating composite scores of anxiety and depression is wrong and has to be corrected.

Miles clears his Friday afternoon, goes back to the initial Excel file with his data, and creates new composite scores. Then he imports the data back into SPSS, recreates the filter, does the data analysis again, puts the new results into a table, and updates the text in the results section of his paper.

The paper is ready to be submitted when Miles discovers a potential error. Some of the penguins have missing values on species.

Miles’ filter excluded the Chinstrap penguins, assuming that the rest would be Gentoo or Adélie. With penguins of unknown species in the data, he cannot be sure that his analysis contains only Gentoo and Adélie penguins. Realising the flaw in his logic, he improves the filter by explicitly selecting only the rows where species is Gentoo or Adélie.

Of course, now he has to repeat the rest of the steps after this: re-run the analysis, re-input the values in the table, and update the text in the results section.

Late Friday night, just as Miles has finished all this, an e-mail from Dr Olivia ticks in:

I just found this extra part of the dataset hiding under my sofa. We have to include it in the paper as well.

Miles sighs, pours himself a glass of akvavit and accepts his fate.

He goes back to the initial Excel file. Here he adds the extra cases to the dataset and repeats the process from before: create composite variables, import to SPSS, add filter, run analysis, create table, update results, call a doctor to get a diagnosis of carpal tunnel syndrome.

What happens next in the story of Miles? Who knows. Upon submitting he’ll probably receive major revisions, with reviewer 2 suggesting he perform a multilevel model, necessitating another trip back to SPSS and a thousand clicks through menus and submenus.

Gwen’s tale

We move across the multiverse and end up in a universe quite like that of Miles. Gwen is working on the same PhD, under similar conditions. The only real difference between Gwen and Miles is that Gwen’s supervisor, Dr Otto, convinced her of the wonder that is the R3 programming language. Let’s take a look at Gwen’s workflow.

Initial data wrangling

Gwen loads the data into R.

# Import data
peng <- read.csv(file = "penguins.csv") 
# Take a look at the data
head(peng)
  X species    island bill_length_mm bill_depth_mm flipper_length_mm
1 1  Adelie Torgersen           40.3          18.0               195
2 2  Adelie Torgersen             NA            NA                NA
3 3  Adelie Torgersen           36.7          19.3               193
4 4  Adelie Torgersen           39.3          20.6               190
5 5  Adelie Torgersen           38.9          17.8               181
6 6  Adelie Torgersen           39.2          19.6               195
  body_mass_g    sex year hads1 hads2 hads3 hads4 hads5 hads6 hads7 hads8 hads9
1        3250 female 2007     3     1     2     0     1     0     0     1     0
2          NA   <NA> 2007     1     1     0     3     0     1     2     3     0
3        3450 female 2007     3     1     1     0     3     1     0     2     0
4        3650   male 2007     2     0     1     3     2     3     0     0     2
5        3625 female 2007     1     0     1     1     1     1     2     1     2
6        4675   male 2007     1     2     1     2     2     2     2     2     2
  hads10 hads11 hads12 hads13 hads14
1      2      0      0      2      0
2      1      3      2      2      2
3      3      1      3      0      2
4      3      1      1      1      3
5      0      3      3      1      0
6      1      1      0      1      3

She creates composite scores for anxiety and depression.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Create composite scores
peng <- peng |> 
  rowwise() |> 
  mutate(
    anxiety = mean(c(hads1, hads3, hads5, hads7, hads9, hads11, hads13)),
    depression = mean(c(hads2, hads4, hads6, hads8, hads10, hads12, hads14))
  ) |> 
  ungroup() |> 
  select(-c(hads1:hads14))
glimpse(peng)
Rows: 304
Columns: 11
$ X                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, …
$ bill_depth_mm     <dbl> 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, …
$ flipper_length_mm <int> 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 191…
$ body_mass_g       <int> 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, …
$ sex               <chr> "female", NA, "female", "male", "female", "male", NA…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ anxiety           <dbl> 1.142857, 1.142857, 1.142857, 1.285714, 1.571429, 1.…
$ depression        <dbl> 0.5714286, 1.8571429, 1.7142857, 1.8571429, 0.857142…
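As an aside, the same composites can be computed without `rowwise()` by using base R’s `rowMeans()` on the item columns before they are dropped. A sketch of an alternative, not what Gwen actually does:

```r
# Alternative: vectorised row means over the item columns.
# Assumes the hads columns are still present in peng.
anx_items <- paste0("hads", seq(1, 13, by = 2))  # hads1, hads3, ..., hads13
dep_items <- paste0("hads", seq(2, 14, by = 2))  # hads2, hads4, ..., hads14
peng$anxiety    <- rowMeans(peng[anx_items])
peng$depression <- rowMeans(peng[dep_items])
```

Because `rowMeans()` is vectorised, it also avoids the per-row overhead of `rowwise()` on larger datasets.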

Further data wrangling

She filters out the Chinstrap penguins.

# Filter out the Chinstrap penguins
peng <- peng |> 
  filter(species != "Chinstrap")
glimpse(peng)
Rows: 229
Columns: 11
$ X                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, …
$ bill_depth_mm     <dbl> 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, …
$ flipper_length_mm <int> 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 191…
$ body_mass_g       <int> 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, …
$ sex               <chr> "female", NA, "female", "male", "female", "male", NA…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ anxiety           <dbl> 1.142857, 1.142857, 1.142857, 1.285714, 1.571429, 1.…
$ depression        <dbl> 0.5714286, 1.8571429, 1.7142857, 1.8571429, 0.857142…
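A detail worth knowing about this step: `dplyr::filter()` keeps only rows where the condition evaluates to `TRUE`, so rows where `species` is `NA` (for which `species != "Chinstrap"` is `NA`) are dropped too. A small illustration, not from the original post:

```r
library(dplyr)

# filter() drops rows where the condition is NA, not just FALSE
toy <- data.frame(species = c("Adelie", "Chinstrap", NA))
toy |> filter(species != "Chinstrap")
#   species
# 1  Adelie
```

This behaviour becomes relevant later in the story, when the filter is rewritten to name the species explicitly.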

Data analysis

She runs the analysis, regressing depression on body mass and sex.

# Predict depression from body mass and sex
fit <- lm(data = peng,
          formula = depression ~ body_mass_g + sex)

summary(fit)

Call:
lm(formula = depression ~ body_mass_g + sex, data = peng)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.06768 -0.23027 -0.01457  0.24301  0.93671 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.401e+00  1.376e-01  10.183   <2e-16 ***
body_mass_g 1.462e-05  3.358e-05   0.435    0.664    
sexmale     3.484e-02  5.651e-02   0.617    0.538    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.374 on 216 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.004658,  Adjusted R-squared:  -0.004558 
F-statistic: 0.5054 on 2 and 216 DF,  p-value: 0.604

Create table

She passes the model fit from the analysis to a function that creates a table from it.

# Create and save table
apaTables::apa.reg.table(fit, 
                         table.number = 1,
                         filename = "table1.doc")
A table created with the apaTables package

Ideally you’d want to fiddle slightly more with the knobs on the package than I did here.

Report the results

Gwen is the kind of person who takes efficiency far beyond the stage where it is actually efficient. Thus, she writes her paper in rmarkdown and uses the model fit object directly when reporting her findings. Let me try to show how that works.

She writes this monstrous mess in her rmarkdown-document:

The model explained only 
`r round(summary(fit)$r.squared * 100, 1)` percent of the variance 
in penguin depression. 
Neither body mass 
(b  = `r format(unname(fit$coefficients[2]), scientific = FALSE)`, 
*p* = `r round(summary(fit)$coefficients[2, 4], 3)`) nor sex 
(b  = `r format(unname(fit$coefficients[3]), scientific = FALSE)`, 
*p* = `r round(summary(fit)$coefficients[3, 4], 3)`) had 
significant effects on depression.

And it produces this:

The model explained only 0.5 percent of the variance in penguin depression. Neither body mass (b = 0.00001461875, p = 0.664) nor sex (b = 0.03483884, p = 0.538) had significant effects on depression.
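One way to tame the mess (my sketch, not part of Gwen’s actual workflow) is to compute the numbers once in a setup chunk and give them readable names, then reference those names inline:

```r
# Pull the reported numbers out of the model fit once, with readable names
s <- summary(fit)
r2_pct <- round(s$r.squared * 100, 1)
b_mass <- format(unname(coef(fit)["body_mass_g"]), scientific = FALSE)
p_mass <- round(s$coefficients["body_mass_g", "Pr(>|t|)"], 3)
b_sex  <- format(unname(coef(fit)["sexmale"]), scientific = FALSE)
p_sex  <- round(s$coefficients["sexmale", "Pr(>|t|)"], 3)
```

The inline text then shrinks to `` `r r2_pct` ``, `` `r b_mass` `` and so on.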

Submit, get accepted, publish, win at life?

What happens when Dr Otto grumbles about the correct way to create composite scores for the HADS factors? Gwen updates a line in her code and re-runs the entire thing. (The change is in the second paragraph of the code below.)

# Import data
peng <- read.csv(file = "penguins.csv")

# Create sum scores
# Updated to multiply the result by 7 (equivalent to summing the items), as per instructions.
peng <- peng |> 
  rowwise() |> 
  mutate(
    anxiety = mean(c(hads1, hads3, hads5, hads7, hads9, hads11, hads13) * 7),
    depression = mean(c(hads2, hads4, hads6, hads8, hads10, hads12, hads14) * 7)
  ) |> 
  ungroup() |> 
  select(-c(hads1:hads14))

# Filter out the Chinstrap penguins
peng <- peng |> 
  filter(species != "Chinstrap")

# Predict depression from body mass and sex
fit <- lm(data = peng,
          formula = depression ~ body_mass_g + sex)

# Create and save table
apaTables::apa.reg.table(fit, 
                         table.number = 1,
                         filename = "table1.doc")

What happens when she discovers the error in the filter? She updates a line in her code and re-runs the entire thing. (The change is in the third paragraph of the code below.)

# Import data
peng <- read.csv(file = "penguins.csv")

# Create sum scores
# Updated to multiply the result by 7 (equivalent to summing the items), as per instructions.
peng <- peng |> 
  rowwise() |> 
  mutate(
    anxiety = mean(c(hads1, hads3, hads5, hads7, hads9, hads11, hads13) * 7),
    depression = mean(c(hads2, hads4, hads6, hads8, hads10, hads12, hads14) * 7)
  ) |> 
  ungroup() |> 
  select(-c(hads1:hads14))

# Filter updated to include only Gentoo or Adélie penguins
peng <- peng |> 
  filter(species == "Gentoo" | species == "Adelie")

# Predict depression from body mass and sex
fit <- lm(data = peng,
          formula = depression ~ body_mass_g + sex)

# Create and save table
apaTables::apa.reg.table(fit, 
                         table.number = 1,
                         filename = "table1.doc")
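A minor idiom, not a change in behaviour: the two-condition filter can also be written with `%in%`, which scales more gracefully if further species ever need to be included:

```r
# Equivalent filter using %in% (NA species are excluded either way)
peng <- peng |> 
  filter(species %in% c("Gentoo", "Adelie"))
```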

What happens when Dr Otto discovers that the pointy object under his sofa pillow is a missing piece of their dataset? She adds a couple of lines to her code and re-runs the entire thing. (The change is in the second paragraph of the code below.)

# Import data
peng <- read.csv(file = "penguins.csv")

# Updated to import new data, then merge with existing data.
new_peng <- read.csv(file = "new_penguins.csv")
peng <- rbind(peng, new_peng)

# Create sum scores
# Updated to multiply the result by 7 (equivalent to summing the items), as per instructions.
peng <- peng |> 
  rowwise() |> 
  mutate(
    anxiety = mean(c(hads1, hads3, hads5, hads7, hads9, hads11, hads13) * 7),
    depression = mean(c(hads2, hads4, hads6, hads8, hads10, hads12, hads14) * 7)
  ) |> 
  ungroup() |> 
  select(-c(hads1:hads14))

# Filter updated to include only Gentoo or Adélie penguins
peng <- peng |> 
  filter(species == "Gentoo" | species == "Adelie")

# Predict depression from body mass and sex
fit <- lm(data = peng,
          formula = depression ~ body_mass_g + sex)

# Create and save table
apaTables::apa.reg.table(fit, 
                         table.number = 1,
                         filename = "table1.doc")

Every time she reruns the commands, the table and text are also updated to reflect the updated model fit and regression coefficients.
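For the record, “re-running the entire thing” is a single command when the paper lives in an rmarkdown document (the file name here is hypothetical):

```r
# Re-knit the paper: all chunks re-run, so the table and the
# inline numbers regenerate from the updated model fit.
rmarkdown::render("paper.Rmd")
```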

She finishes the entire thing before happy hour starts Friday evening, goes to her local baR and high-fives all the R aficionados she meets there.

Epilogue

Before I say anything else, let me just acknowledge that, yes, the above text was obviously hyperbole. A supervisor would never get back to you on the same day you emailed them.

With that caveat, what I’m trying to show here is that a workflow where you create code instead of static objects has potential for strong gains. Gwen is not penalised whenever something happens further up in her workflow, like adding more data. She simply slots in the adjustment where applicable and re-runs the entire thing. Miles has to start from scratch every time.4

You can recreate a lot of Gwen’s workflow with SPSS’ syntax. But why pay5 to use a cumbersome proprietary coding language when you can adopt a fun, free, more intuitive open source coding language?

Lest it be construed that my whole point in this post was to make fun of people who use Excel and SPSS: I started my data workflow as Miles. The road that eventually led me to where I am today started with SPSS syntax, went into Stata, took a detour through Python and eventually landed me in R.

Resources

If you are a Miles who’s looking to turn more into a Gwen, you’ll be wanting some tips and resources to help you on your way. There’s an abundance of resources online, so here’s just a smattering.

  • Hadley Wickham’s R for Data Science is a kind introduction to R. It focuses on the tidyverse packages. Hadley is a wonderfully pedagogical teacher.
  • An introduction to R from CRAN is a slightly more technical introduction and focuses more on base R.
  • If your workflow involves a lot of Excel files, Bruno Rodrigues has many helpful blog posts for you.
  • If you know Norwegian, you can check out my R guide cryptically called R for byplankontoret.
  • Follow the hashtag #rstats on Fosstodon or whatever the name is of that website that Elon Musk owns. You’ll discover lots of cool uses of R.

Appendix: Recreating the penguin dataset

# Install the palmerpenguin package if you haven't already.
# install.packages("palmerpenguins")

penguins <- palmerpenguins::penguins

# Introduce missing values on species at random rows ----
# To challenge myself I did this in base R.

set.seed(24)
# Create a random logical vector marking which rows get NA on species
drop_here <- sample(c(TRUE, FALSE), 
                    nrow(penguins), 
                    replace = TRUE, 
                    prob = c(0.05, 0.95))

# Insert NA in species at the index positions indicated by the vector.
penguins$species[drop_here == TRUE] <- NA

# Add HADS items ----

# Item values
# Generated by rounding a uniform random number between 0 and 3.
populate_items <- function(n) {
  round(runif(n, 0, 3))
}
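A small statistical aside about this helper: rounding a uniform draw makes the end points 0 and 3 half as likely as 1 and 2. That’s harmless for a toy dataset, but if uniform item values were wanted, sampling directly would do it. A hedged alternative, not what the appendix uses:

```r
# Alternative that makes 0, 1, 2 and 3 equally likely
populate_items_uniform <- function(n) {
  sample(0:3, n, replace = TRUE)
}
```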

# I'm already missing dplyr and purrr.
library(dplyr)
library(purrr)

# Creates 14 new columns that are populated by the function we just made.
penguins <- penguins |> 
  mutate(
    map_dfc(1:14, ~tibble("hads{.x}" := populate_items(nrow(penguins))))
  ) 

# Split off a small group for later addition ----

set.seed(42)

# Randomly mark penguins for relocation.
penguins$split_off <- sample(c(TRUE, FALSE), 
                             nrow(penguins), 
                             replace = TRUE, 
                             prob = c(0.1, 0.90))

# Put those marked into a new group
new_penguins <- penguins |> 
  filter(split_off == TRUE) |> 
  select(-split_off)

# Remove those marked from the old group, and delete the marking variable.
penguins <- penguins |> 
  filter(split_off == FALSE) |> 
  select(-split_off)

# Clean up 
rm(drop_here, populate_items)

Footnotes

  1. The data is the Palmer penguins dataset that I’ve modified for this example. For code to reproduce the dataset, see the appendix at the bottom.↩︎

  2. the American Penguin Association↩︎

  3. Busted, this post really is about R after all.↩︎

  4. Miles could of course save versions of his dataset whenever he made changes to them. This is also problematic. You lose the trace from the data’s initial state. If you make a mistake in calculating the sum scores of depression initially, you might never discover this unless you were to re-create them later. In Gwen’s approach the recipe she used is always visible and easy to scrutinise and adjust. Not so in Miles’ approach. If you never make any mistakes in your data handling, this is fine. Does this describe you? If so, please get in touch as I’m planning an experiment on people lacking self-awareness.↩︎

  5. Or let your institution spend its budget/the taxpayers’ money on it?↩︎