

A Guide to Analyzing Experimental Data

How to overcome the start-up panic, and deliver insightful results from your experiments

Martina Pocchiari
Towards Data Science
10 min read · Jul 26, 2021



Have you ever run an experimental study, or performed some A/B testing? If so, you should be familiar with the pre-analysis panic: how can you make the data reveal whether your experiment has worked? Every day — in economics, public policy, marketing, and business analytics — we face the challenges that come from running experiments and analyzing what comes out of them.

As researchers who have struggled to build a clean and efficient experimental workflow ourselves, we have decided to share with you a practical guide, complete with all the steps you need to follow when you want to analyze experimental data. We cannot promise that the journey will be short, but we assure you it will be fun!

As usual, you can find all the code used in this guide in its GitHub repository.

Who Can Benefit from this Guide?

This guide specifically develops a protocol for the analysis of experimental data, and is especially helpful if you often find yourself blanking in front of your laptop.

We will provide a brief description of what an experiment is and why — if well designed — it overcomes the common problems of observational studies. After that, we will generate our simulated data with a simple experimental flow on Qualtrics (it does not really matter which software you use to generate the data).

Finally, we will write a ready-to-use and pretty flexible R code to apply to your experimental data. In particular, we will emphasize the mapping between the standard experimental analysis you have in mind, and the code to write to achieve your most desired outputs.

Ideally, anyone who conducts or is planning to conduct experiments could have a look at our guide!

The Experimental Flow

The main advantage of using an experiment compared to observational data is that well-designed experiments allow you to measure causal effects.

The key feature of an experiment that ensures the estimate of a causal effect is that people are randomly allocated to the experimental conditions. This feature is called “randomization”, and it prevents people with certain characteristics from self-selecting into the Treatment or the Control group.

Treatment and Control groups follow the same experimental flow, but the participants in the Treatment group face a different step compared to the Control: the manipulation of the independent variable.

By comparing how the outcome variable changes across Treatment and Control, we will assess whether the manipulation of the independent variable causally affects the outcome variable.
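To make this concrete, here is a minimal sketch of random allocation in R (the sample size and condition labels are hypothetical):

# randomly allocate 500 hypothetical participants to two conditions
set.seed(42)  # for reproducibility
condition <- sample(c('Treatment', 'Control'), size = 500, replace = TRUE)
table(condition)  # the two groups should be roughly equal in size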

And Now, Let’s Code!

For this analysis, we will use a few R packages to import, reshape, visualize, and analyze our (simulated) experimental data. Remember: if you have not installed the packages already, install them with “install.packages()”, as illustrated in the commented line below.

# install.packages('packagename')
library(tidyverse)   # data wrangling and reshaping
library(data.table)  # fast data import and manipulation
library(ggplot2)     # plotting
library(QuantPsyc)   # to generate results
library(grf)         # generalized random forests
library(viridis)     # for aesthetics
library(stargazer)   # optional, for reporting

Step 0: Import and Inspect data

The data we will use in this tutorial are generated with Qualtrics, a popular website used for designing questionnaires and experimental surveys.

We developed an experimental survey based on the flow we described earlier. Then, we generated 500 automated (“test”) responses for the purpose of our analysis.

We downloaded the test responses in CSV format, and are now importing them in R. We are also renaming the most important variables, and transforming numeric variables into — well, numeric quantities:

dt <- fread('ExperimentalAnalysis_TestData.csv', stringsAsFactors = F, header = T) %>%
  as_tibble()

dt <- dt %>%
  rename(Treatment = FL_8_DO,
         Outcome_T = Q3_1,
         Outcome_C = Q7_1,
         Initial_Beliefs = Q15_1,
         AttnCheck = Q16,
         Age = Q8,
         Nationality = Q9,
         Internet_use = Q10,
         Last_election = Q11,
         Which_politician = Q12)

dt <- dt %>%
  mutate_at(.vars = vars(Outcome_T, Outcome_C, Initial_Beliefs), .funs = as.numeric)

We will be working with the “tibble” data structure, because our dataset is reasonably small, and tibble allows us to nicely preview the data directly in the console. For larger datasets (i.e., datasets exceeding a few million rows), you may want to switch to a “data.table” structure.
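To illustrate the switch, a one-line sketch (data.table is already loaded above):

# create a data.table copy for faster wrangling on large data
dt_big <- as.data.table(dt)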

In reality, you may be working not only with different data sizes, but also with different software, or with hundreds or thousands of responses. It does not matter much, as long as you can import your data matrix in R, and recover the Treatment, Control, and Outcome variables from your data.

Initial Inspection

Upon first inspection, we need to remove the first two rows, which are not useful for our analysis (with the slice function). Before dropping them, however, we will save the first row (slice(1)), because it contains the descriptions of our variables.

dt
colnames(dt)

# save the first row, which contains the question labels
dt_question_labels <- dt %>% slice(1)

dt_question_labels <- dt_question_labels %>%
  t() %>%
  as.data.frame() %>%
  mutate(Label = rownames(.))

# drop the first two metadata rows
(dt <- dt %>%
    slice(-c(1, 2)))

Incomplete Cases

We need to make sure that we only analyze data from respondents who completed the survey. If you notice that many respondents in your sample did not complete the survey (as a rule of thumb, up to about 5% of incomplete cases is tolerable), this may be a sign that something is wrong with your survey, with your import routine, or with both.

It is good practice to check whether the participants allocated to the Treatment and the Control group have the same completion rate. If this is not the case, the internal validity of your experiment might be in danger.
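As a minimal sketch of this check, using the columns we renamed above:

# completion rate by experimental condition
dt %>%
  group_by(Treatment) %>%
  summarise(completion_rate = mean(Finished == 'True'))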

dt %>%
  group_by(Finished) %>%
  summarise(n = n())

# If you have 'False' in the result, kick out those people
dt <- dt %>%
  filter(Finished == 'True')

Test Responses

We also need to remove the answers that were given in any “Preview” mode or “Test” mode, because those were not generated from your experimental sample. (Note: in our case, all answers were generated in “Test” mode).

dt <- dt %>%
  # filter(!(Status %in% c('Survey Preview', 'Survey Test')))
  filter(!(Status %in% c('Survey Preview')))

Missing Answers

We need to check that there are no missing data in the treatment or outcome variables. Here, you can find the responses with missing data in the treatment or outcome, and identify their ResponseId. These responses are not usable, since there is no treatment.

If you find missing data, you need to understand the cause of the missing data as soon as possible. It may be a technical error, or a systematic problem with your survey.

(test_NA <- dt %>%
   # select respondent IDs, outcome variables, and treatment indicator
   select(ResponseId, Treatment, Outcome_C, Outcome_T) %>%
   # coerce the outcomes and the treatment indicator into numeric
   mutate_at(., vars(Outcome_C, Outcome_T), as.numeric) %>%
   mutate(Treatment_numeric = as.numeric(as.factor(Treatment))) %>%
   # if a cell is empty/missing, flag it with NA
   mutate_all(., function(x) ifelse(x == '', NA, x)) %>%
   # add an artificial all-NA row so the filter below always returns a row
   bind_rows(tibble(Outcome_C = NA, Outcome_T = NA, Treatment_numeric = NA)) %>%
   rowwise() %>%
   mutate(sum = sum(Outcome_C, Outcome_T, Treatment_numeric, na.rm = T)) %>%
   filter(sum == 0))

# If the only ResponseId shown is NA (the artificial row), you're fine and you
# can skip this part. If you have missing data, you can collect the IDs here:
if (nrow(test_NA) > 1) {

  message("Your data has missing values! Find the cause ASAP.")

  Missing_ids <- dt %>%
    # condition indicator columns from the Qualtrics export (names will vary)
    select(ResponseId, Treatment1, Treatment2, ControlGroup) %>%
    bind_rows(tibble(Treatment1 = NA, Treatment2 = NA, ControlGroup = NA)) %>%
    rowwise() %>%
    mutate(sum = sum(Treatment1, Treatment2, ControlGroup, na.rm = T)) %>%
    filter(sum == 0)

  Missing_ids <- Missing_ids$ResponseId

  # and remove them from the sample
  dt <- dt %>%
    filter(!(ResponseId %in% Missing_ids))
}

Failed Attention Checks

We also need to exclude respondents who failed our attention checks (Label AttnCheck in our dataset). In our simulated sample, 339 bots failed the attention check. We are left with 161 usable responses.

Once again, it is good practice to check whether specific categories of people failed the attention check, in order to preserve the internal validity of the study.
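For example, a quick sketch comparing failure rates across age groups, run before filtering out the failures:

# share of respondents failing the attention check, by age group
dt %>%
  group_by(Age) %>%
  summarise(failure_rate = mean(AttnCheck != 'A'))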

# How many people failed the attention check?
dt %>%
  filter(AttnCheck != 'A') %>%
  tally()

dt <- dt %>%
  filter(AttnCheck == 'A')

Outliers

As a final health check on our data, we will check whether some respondents took an unusually long or short time to complete our survey. We adopt an arbitrary but common threshold: we will flag and exclude respondents whose completion time is more than 3 standard deviations away from the mean.

In our test data, the completion time of 6 bots falls more than 3 standard deviations away from the mean.

# make sure completion time is numeric
dt <- dt %>%
  mutate(Duration = as.numeric(`Duration (in seconds)`))

# How many outliers are there?
dt %>%
  filter(abs(Duration - mean(Duration)) > 3 * sd(Duration)) %>%
  tally()

# keep only respondents within 3 standard deviations of the mean
dt <- dt %>%
  filter(abs(Duration - mean(Duration)) <= 3 * sd(Duration))

Outcome variable

If, while coding the Qualtrics survey, you measured the outcome variable separately for the Treatment and the Control groups of respondents, then we also need to combine the two measures into a single Outcome column.

dt <- dt %>%
  rowwise() %>%
  mutate(Outcome = sum(Outcome_T, Outcome_C, na.rm = T))

Control variables

In our very last data-preparation step, we transform categorical variables into factors. This step is not strictly necessary, but it will be helpful in the data visualization and analysis.

dt <- dt %>%
  mutate_at(c('Age', 'Nationality', 'Internet_use', 'Last_election'), as.factor)

Integrity of randomization

We need to ensure that the participants in the two groups, Treatment and Control, share similar characteristics on average — in other words, that the participants are randomly allocated to the experimental conditions.

We can run a two-sample independent t-test to assess whether binary variables and continuous variables differ across conditions.
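For example, a minimal sketch for a continuous covariate such as Initial_Beliefs, assuming the Treatment column has exactly two levels:

# compare initial beliefs across conditions
t.test(Initial_Beliefs ~ Treatment, data = dt)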

When the variables are categorical, then we will use a Chi-square test to compare how these variables differ across groups. For example:

chisq.test(dt$Treatment, dt$Age, correct=FALSE)

If the p-values of these tests are above 0.05, we cannot reject the null hypothesis that the groups are similar, which is consistent with a successful randomization.

Step 1: Data Visualization

Before formally analyzing the experimental data, it is important that we visualize it. Visualization is a powerful tool to spot problematic patterns, such as a failed randomization, a failed manipulation, or ceiling and floor effects, and to get an initial sense of the effect's direction.

First, we will visualize the distribution of the outcome variable across our experimental conditions. We are particularly interested in a few summary statistics of the distribution, such as the mean and the median.

dt %>%
  group_by(Treatment) %>%
  summarise(mean_T = mean(Outcome),
            median_T = median(Outcome))

dt %>%
  group_by(Treatment) %>%
  mutate(mean_T = mean(Outcome),
         median_T = median(Outcome)) %>%
  ggplot(aes(x = Outcome)) +
  geom_histogram() +
  geom_vline(aes(xintercept = mean_T), color = 'red', linetype = 2) +
  geom_vline(aes(xintercept = median_T), color = 'blue', linetype = 2) +
  facet_wrap(. ~ Treatment) +
  theme_bw() +
  labs(caption = 'Red line: mean; Blue line: median')

From the distribution, we can see that the average of the outcome variable in this simulated dataset is slightly higher in the treatment group (51.53) than in the control group (47.95).

Alternatively, we can directly plot the mean by group, along with bootstrapped confidence intervals for the means:

dt %>%
  group_by(Treatment) %>%
  mutate(mean_T = mean(Outcome),
         median_T = median(Outcome)) %>%
  ungroup %>%
  ggplot(aes(x = Treatment, y = Outcome)) +
  stat_summary(fun.data = mean_cl_boot) +
  theme_bw()

We can also check how the outcome changes across people with different background characteristics. Let’s look, for example, at age:

dt <- dt %>%
  mutate(Age = factor(Age, levels = c('18 or younger', '19-24', '25-35',
                                      '36-45', '46-55', '56 or older')))

dt %>%
  ggplot(aes(x = Age, y = Outcome, color = Treatment)) +
  stat_summary(fun.data = mean_cl_boot, position = position_dodge(width = .5)) +
  theme_bw() +
  scale_color_viridis(discrete = T) +
  coord_flip()

Step 2: Analysis

If all the checks so far gave us confidence about the randomization, comparing how the outcome variable changes between Treatment and Control will give us a causal effect.

Non-parametric Analysis

If we cannot safely assume that the data we are analyzing follow a normal distribution, we need to use a non-parametric test to study how the outcome variable varies, on average, between Treatment and Control. The test that serves this purpose is the Wilcoxon rank-sum test (also known as the Mann-Whitney U test).
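A minimal sketch, assuming the Treatment column has exactly two levels:

# Wilcoxon rank-sum (Mann-Whitney) test of the outcome across conditions
wilcox.test(Outcome ~ Treatment, data = dt)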

Parametric Analysis

If the data are approximately normally distributed (see the Jarque-Bera test below), we can use an independent two-sample t-test to compare the outcome variable across Treatment and Control.
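For the normality check, a minimal sketch using jarque.bera.test() from the tseries package (an extra dependency, not loaded above):

library(tseries)  # assumed extra dependency for the Jarque-Bera test
jarque.bera.test(dt$Outcome[dt$Treatment == 'Control'])
jarque.bera.test(dt$Outcome[dt$Treatment == 'Treatment1'])

If normality looks plausible, we proceed with the t-test: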

t.test(dt$Outcome[dt$Treatment == 'Control'],
       dt$Outcome[dt$Treatment == 'Treatment1'],
       alternative = "two.sided", paired = F)

In addition to the t-test, we will perform a linear regression of the outcome variable on a constant, the independent variable (treatment), encoded as a binary variable equal to 1 if the participant belongs to the Treatment group and 0 otherwise, and all the other control variables.

A regression analysis is recommended for two reasons:

  1. To assess whether the treatment has any effect on the outcome variable, keeping everything else constant;
  2. To assess the impact of the treatment on the outcome variable if the randomization did not work properly.

fit <- lm(Outcome ~ 1 + Treatment + Age + Internet_use, data = dt)
summary(fit)

Step 3: An Alternative (Non-Parametric) Analysis

We also recommend trying Generalized Random Forests (GRF) for the estimation of heterogeneous treatment effects. Treatment effects are heterogeneous when they differ across strata of your data, in other words, when the treatment effect varies substantially across different sub-groups of respondents.

To use GRF for the estimation of heterogeneous treatment effects, we need to code the Treatment (W), the Outcome (Y), and the heterogeneity variables (X) as three separate matrices.

W <- dt[, 'Treatment'] %>%
  mutate(Treatment = ifelse(Treatment == 'Treatment1', 1, 0)) %>%
  as.matrix()

X <- model.matrix(~ 0 + ., dt[c('Age', 'Nationality', 'Internet_use')])

Y <- dt %>%
  select(Outcome) %>%
  as.matrix()

cf <- causal_forest(X, Y, W)

Finally, we can estimate the average treatment effect on the full sample (ATE), and the average treatment effect on the treated sample (ATT).

average_treatment_effect(cf, target.sample = "all")
average_treatment_effect(cf, target.sample = "treated")

Conclusions

If you made it this far: congratulations on performing your analysis of experimental data!

What would you have done differently? What other important steps would you add to the workflow? Let us know in the comments.

About the Authors

Francesco Capozza is a PhD candidate at the Erasmus School of Economics. He uses online and field experiments to understand human choices about politics and labor.

Martina Pocchiari is a PhD candidate at the Rotterdam School of Management, Erasmus University. Martina runs lab and field experiments to research people’s behavior in online communities.
