What is the hardest question you can ask on a survey?

“Which date is it today?”, apparently.

Author: Håvard Karlsen

Published: June 9, 2022

I was browsing some survey results related to a project I worked on some time ago. We had surveyed some students in various contexts. It was important to know the date on which the students were surveyed, but the survey tool did not include a way to record this meta-data. I won’t shame the tool by naming it, mostly because I can’t rule out that the function did in fact exist and I just couldn’t find it.

As a consequence, we had to include this rather easy question at the start:

Which date is it today?

As I looked through the data, I was surprised by one respondent’s answer:

library(tidyverse)
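# df holds the anonymised responses: a single column, date, with each
# respondent's answer to "Which date is it today?"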
df %>% tail(15)
## # A tibble: 15 x 1
##    date      
##    <date>    
##  1 2021-11-15
##  2 2021-11-15
##  3 2021-11-15
##  4 2021-11-15
##  5 2002-08-13
##  6 2021-11-15
##  7 2021-11-20
##  8 2021-11-20
##  9 2021-11-20
## 10 2021-11-20
## 11 2021-11-20
## 12 2021-11-20
## 13 2021-11-27
## 14 2021-11-27
## 15 2021-11-27

They had gotten today’s date wrong by about twenty years. Upon further inspection, I was surprised by how many failed this question and by how far off from the true date they were. Take a look at the six earliest dates in the data:

df %>% count(date) %>% head()
## # A tibble: 6 x 2
##   date           n
##   <date>     <int>
## 1 1994-05-02     1
## 2 1995-08-29     1
## 3 1998-11-21     1
## 4 1998-12-08     1
## 5 1999-09-19     1
## 6 1999-10-19     1

These people all seem to live twenty years in the past. How could this be? To be clear, everyone had access to the clock in the computer’s taskbar, and none showed the tell-tale signs of having recently time-travelled.

Pictured: People showing signs of recent time travel.

A closer look

How many got the answer wrong? No surveys were sent out before 2021, so let’s count the number of people who gave a date earlier than that:

df %>% count(date < "2021-01-01")
## # A tibble: 2 x 2
##   `date < "2021-01-01"`     n
##   <lgl>                 <int>
## 1 FALSE                   178
## 2 TRUE                     13

13 people. That means 6.8% of the 191 respondents got it wrong. That is a surprisingly large percentage.
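(Side note: rather than computing the percentage by hand, you can take the mean of the logical condition directly, since the mean of a logical vector is the share of TRUEs:)

df %>% summarise(share_wrong = mean(date < "2021-01-01"))  # 13/191, about 0.068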

Let’s look at which dates these participants reported instead of today’s date:

df %>% filter(date < "2021-01-01")
## # A tibble: 13 x 1
##    date      
##    <date>    
##  1 2000-09-01
##  2 2001-06-03
##  3 2000-09-14
##  4 1999-09-19
##  5 1998-12-08
##  6 2001-02-09
##  7 1999-10-19
##  8 1995-08-29
##  9 1998-11-21
## 10 2020-11-01
## 11 2000-02-23
## 12 1994-05-02
## 13 2002-08-13

With the background knowledge that this is a student sample (i.e. I know their expected age range), it’s easy to figure out what happened here. They skimmed the question and filled out the date they are most used to entering in the endless sea of online surveys they encounter: their birthday. They didn’t actually get the question wrong; they didn’t even read the question. Only one person genuinely got the date wrong: they did what everyone does in the month of January and reported the previous year instead of the current one.

df %>% filter(date < "2021-01-01" & date > "2003-01-01")
## # A tibble: 1 x 1
##   date      
##   <date>    
## 1 2020-11-01
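If the birthday hypothesis holds, the implied ages should fit a student sample. A quick sketch to check it, using the first survey date as the reference point (2021-11-15 is my assumption here, not something recorded in the data):

df %>%
  filter(date < "2003-01-01") %>%  # the twelve birthday-like dates
  mutate(implied_age = floor(as.numeric(as.Date("2021-11-15") - date) / 365.25)) %>%
  arrange(implied_age)

The implied ages come out between roughly 19 and 27, exactly the range you would expect in a student sample.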

Implications

I sympathise with the participants. Undoubtedly I’ve also filled out surveys on auto-pilot and gotten obvious things wrong. Yet it’s a challenge to the validity of a survey if you cannot even assume that the participants are reading the questions. It’s such a basic assumption of survey work that most people take it as a given. It’s also hard to get around: if your participants are not reading the text put in front of them, you can’t put more text there imploring them to read it.

Looking at the last 15 years of digital behavioural trends, it seems to me that there’s been an increase in survey bombardment. I can hardly buy a toothbrush online without being peppered with marketing e-mails asking me to fill out a questionnaire on the user interface of the online storefront I used. If I make a reservation at a restaurant, the same happens. If you have kids at university, you can bet you’ll be asked to fill out some online survey for their coursework. No wonder your eyes start to glaze over as you skim the “information for participants” page.

Some proposed solutions: taking a page from the social desirability and faking playbook, you could include a control question to root out those not paying attention. Something as simple as an item that requires a minimal amount of cognitive work and won’t let you proceed until you get it right. Oh no, I guess I just advocated the use of CAPTCHAs in surveys. Disregard!
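(Joking aside, even if the survey tool won’t gate on such an item, you can screen on it afterwards. A minimal sketch, assuming a hypothetical attention_check column that stores the answer to an instructed-response item like “Please select ‘Agree’”:)

# drop respondents who failed the hypothetical control item
df_screened <- df %>%
  filter(attention_check == "Agree")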

To be honest, I don’t think this is a big problem. My impression is that most survey participants do read the text in front of them. At least 93.2% of the above did. Yet it’s helpful to sometimes stop and question the basic, underlying assumptions of your data collection. On commercial platforms like Amazon’s mTurk, where people are paid to fill out surveys, are the participants not incentivised to prioritise speed over honesty? How does one know that they are answering in line with their convictions? I assume Amazon is aware of this and has measures in place. Most of the work my students and I do involves surveys with no immediate benefit from participating. Sometimes we’ll scrape together the money for some gift cards that go to a lucky few participants.

The easiest solution, of course, would be for the survey software to simply record what would be obvious to anyone apart from the makers of the nameless software I’m stuck with: meta-data about when the survey was completed!
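(If you ever control the collection step yourself, say with a homegrown form, the fix is a single line. A sketch of that scenario, not the nameless tool’s API:)

# hypothetical submission handler: stamp each response on arrival
save_response <- function(answers) {
  answers$completed_at <- Sys.time()  # the server clock, not the respondent's memory
  answers
}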

About the data

The data is based on real participant data, but is anonymised. I have scrambled the dates by adding a random integer to each date. It’s still representative of the real data.
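For the curious, the scrambling could look roughly like this. A single shared offset preserves the clustering of survey dates you see above; the seed and window below are illustrative, not the actual values:

set.seed(2022)                 # illustrative seed
offset <- sample(-180:180, 1)  # one shared offset keeps the date clusters intact
df_public <- df %>% mutate(date = date + offset)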