Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained - Microsoft, 2012

This is a fun paper from Microsoft talking about some of the interesting online experiments they ran.

  • “Only one third of ideas tested at Microsoft improved the metric(s) they were designed to improve”
  • The authors advocate the use of an A/A test, also called a null test
    • Both arms get the identical experience, which lets you validate the experimentation system itself (a quick simulation of the idea is sketched after this section)
    • They found it incredibly helpful
      • I thought this was an interesting idea
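A minimal sketch of how an A/A test validates the pipeline (the metric, its Poisson distribution, and all parameters here are my own illustrative assumptions, not the paper's data): if the system is healthy, roughly 5% of A/A tests should come out "significant" at α = 0.05.

```python
import numpy as np
from scipy import stats

# Simulate many A/A tests on a made-up per-user metric (clicks per user,
# modeled as Poisson -- an illustrative assumption, not the paper's data).
rng = np.random.default_rng(42)
n_tests, users_per_arm, alpha = 1000, 10_000, 0.05

false_positives = 0
for _ in range(n_tests):
    # Both arms come from the same distribution: there is no real effect.
    control = rng.poisson(lam=2.0, size=users_per_arm)
    treatment = rng.poisson(lam=2.0, size=users_per_arm)
    _, p_value = stats.ttest_ind(control, treatment)
    false_positives += p_value < alpha

# A healthy pipeline rejects close to alpha of the time; a much higher or
# lower rate points at broken bucketing, logging, or variance calculations.
print(f"False positive rate: {false_positives / n_tests:.3f} (expect ~{alpha})")
```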
  • OEC - Overall Evaluation Criterion
    • Has to be chosen correctly
    • Difficult because it's much harder to measure long-term effects
    • They describe a Bing experiment in which showing worse search results to users actually increased their key metrics of query share and revenue per search
    • Turns out that if the results are bad, users need to search more
    • Key lesson: short-term metrics are not always in line with long-term goals
  • Click Tracking
    • A piece of JavaScript was added that executed every time a user clicked a result, to send a tracking beacon
    • This slowed the user experience
    • However, the logs showed users clicking more
    • Turns out the extra delay gave the tracking beacons more time to complete, so more of the existing clicks got recorded; users weren't actually clicking more
    • They noticed that IE users' click behavior did not change
    • Firefox, Chrome and Safari are aggressive about stopping requests once a user navigates away from the page
    • In general, if you notice the effect differs between browsers, take a look at your instrumentation (a per-browser segmentation check is sketched below)
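A hedged sketch of such a check (the data layout, metric, and function name are my own assumptions): compute the treatment-vs-control delta separately for each browser and see whether they disagree.

```python
import numpy as np
from scipy import stats

def effect_by_browser(clicks, arm, browser):
    """Per-browser treatment effect on a per-user click metric.

    clicks: per-user click counts; arm: 'control' or 'treatment'; browser: labels.
    """
    clicks, arm, browser = map(np.asarray, (clicks, arm, browser))
    for b in np.unique(browser):
        seg = browser == b
        c = clicks[seg & (arm == "control")]
        t = clicks[seg & (arm == "treatment")]
        _, p = stats.ttest_ind(c, t)
        print(f"{b:>8}: delta = {t.mean() - c.mean():+.3f} clicks/user, p = {p:.3f}")

# If IE shows no lift but Chrome/Firefox/Safari show a big one (or vice versa),
# suspect click-loss or other instrumentation differences before trusting the
# overall result.
```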
  • Primacy and Novelty
    • Sometimes when you run an experiment the initial results differ from the long-term results as users adjust to the change
    • However, they show with an A/A test that the way metrics are presented (e.g., graphs of the cumulative effect over time) can cause experimenters to believe there are primacy or novelty effects when none are present (simulated below)
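A small simulation of that artifact (all numbers are my own toy assumptions): in a pure A/A setup with no effect at all, the cumulative day-by-day delta is noisiest in the first days and then settles toward zero, which is easy to misread as a novelty effect wearing off.

```python
import numpy as np

rng = np.random.default_rng(7)
days, users_per_day = 14, 2_000

control, treatment = [], []
for day in range(1, days + 1):
    # Identical distributions in both arms: there is no true effect.
    control.append(rng.normal(loc=10.0, scale=5.0, size=users_per_day))
    treatment.append(rng.normal(loc=10.0, scale=5.0, size=users_per_day))
    # Cumulative effect, computed the way day-over-day dashboards usually show it.
    delta = np.concatenate(treatment).mean() - np.concatenate(control).mean()
    print(f"day {day:2d}: cumulative delta = {delta:+.3f}")
```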
  • Experiment Length
    • We often feel that increasing the experiment length increases its power
    • However, in many metrics, “the confidence interval width does not change much over time”
    • To combat this you would need to put more users per day into the experiment (i.e., a larger share of traffic), not just run it longer (a rough illustration follows this section)
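A rough illustration of why longer isn't automatically better (the growth exponents and numbers are my own toy assumptions, not the paper's): CI width scales like σ/√(unique users), but unique users grow sublinearly over time because the same users keep returning, and for count metrics like sessions per user, σ itself grows as users accumulate more activity.

```python
import numpy as np

days = np.array([7, 14, 28])
unique_users = 1e6 * days ** 0.6   # distinct users grow sublinearly with time
sigma = 1.0 * days ** 0.2          # per-user std dev grows as usage accumulates
ci_width = 1.96 * sigma * np.sqrt(2 / unique_users)

for d, w in zip(days, ci_width):
    print(f"{d:2d} days: relative CI width = {w:.4f}")
# Running 4x longer barely narrows the interval; admitting a larger fraction
# of traffic each day is what actually buys statistical power.
```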
  • Carryover Effects
    • Many experimentation systems use buckets of users that change infrequently
    • They found that reusing the same bucket splits across consecutive experiments often leads to biased results
    • An experiment with the same bucket can cause the effects of the previous experiment to leak into the new one
      • Even after several months of the first experiment being turned off!
    • You can detect this with A/A experiments before launching the new experiment; re-randomizing users for each experiment avoids it (a salted-hash assignment sketch follows this section)
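One common way to get a fresh randomization per experiment is to hash the user ID with an experiment-specific salt; a minimal sketch (the function name, salt format, and bucket count are my own assumptions, not the paper's system):

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    # Hash the user ID together with an experiment-specific salt so each
    # experiment gets an independent split instead of inheriting old buckets.
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # deterministic bucket in [0, 10000)
    return "treatment" if bucket < treatment_share * 10_000 else "control"

# The same user can land in different arms of different experiments, but always
# sees a consistent arm within a single experiment.
print(assign_arm("user-123", "experiment-A"))
print(assign_arm("user-123", "experiment-B"))
```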
  • Summary:
    • Our experiment instrumentation is not as precise as we hope
    • Offline techniques do not always work well online
    • We need to understand the results

    “Anyone can run online controlled experiments and generate numbers with six digits after the decimal point … the real challenge is in understanding when the results are invalid.”