Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained - Microsoft, 2012

This is a fun paper from Microsoft talking about some of the interesting online experiments they ran.

  • “Only one third of ideas tested at Microsoft improved the metric(s) they were designed to improve”
  • The authors advocate the use of an A/A test, also called a null test
    • Both arms get the identical experience, which lets you validate the experimentation system itself (a quick simulation of the idea is sketched after this section)
    • They found it incredibly helpful
      • I thought this was an interesting idea
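A minimal sketch of how an A/A test validates the pipeline (the metric, its Poisson distribution, and all parameters here are my own illustrative assumptions, not the paper's data): if the system is healthy, roughly 5% of A/A tests should come out "significant" at α = 0.05.

```python
import numpy as np
from scipy import stats

# Simulate many A/A tests on a made-up per-user metric (clicks per user,
# modeled as Poisson -- an illustrative assumption, not the paper's data).
rng = np.random.default_rng(42)
n_tests, users_per_arm, alpha = 1000, 10_000, 0.05

false_positives = 0
for _ in range(n_tests):
    # Both arms come from the same distribution: there is no real effect.
    control = rng.poisson(lam=2.0, size=users_per_arm)
    treatment = rng.poisson(lam=2.0, size=users_per_arm)
    _, p_value = stats.ttest_ind(control, treatment)
    false_positives += p_value < alpha

# A healthy pipeline rejects close to alpha of the time; a much higher or
# lower rate points at broken bucketing, logging, or variance calculations.
print(f"False positive rate: {false_positives / n_tests:.3f} (expect ~{alpha})")
```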
  • OEC - Overall Evaluation Criterion
    • Has to be chosen correctly
    • Difficult because it's much harder to measure long-term effects
    • They describe a Bing experiment in which showing worse search results to users actually increased their key metrics of query share and revenue per search
    • Turns out that if the results are bad, users need to search more
    • Key lesson: short-term metrics are not always in line with long-term goals
  • Click Tracking
    • A piece of JavaScript was added that executed every time a user clicked a result, to send a tracking beacon
    • This slowed the user experience
    • However, the logs showed users clicking more
    • Turns out the extra delay gave the tracking beacons more time to complete, so more of the existing clicks got recorded; users weren't actually clicking more
    • They noticed that IE users' click behavior did not change
    • Firefox, Chrome and Safari are aggressive about stopping requests once a user navigates away from the page
    • In general, if you notice the effect differs between browsers, take a look at your instrumentation (a per-browser segmentation check is sketched below)
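A hedged sketch of such a check (the data layout, metric, and function name are my own assumptions): compute the treatment-vs-control delta separately for each browser and see whether they disagree.

```python
import numpy as np
from scipy import stats

def effect_by_browser(clicks, arm, browser):
    """Per-browser treatment effect on a per-user click metric.

    clicks: per-user click counts; arm: 'control' or 'treatment'; browser: labels.
    """
    clicks, arm, browser = map(np.asarray, (clicks, arm, browser))
    for b in np.unique(browser):
        seg = browser == b
        c = clicks[seg & (arm == "control")]
        t = clicks[seg & (arm == "treatment")]
        _, p = stats.ttest_ind(c, t)
        print(f"{b:>8}: delta = {t.mean() - c.mean():+.3f} clicks/user, p = {p:.3f}")

# If IE shows no lift but Chrome/Firefox/Safari show a big one (or vice versa),
# suspect click-loss or other instrumentation differences before trusting the
# overall result.
```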
  • Primacy and Novelty
    • Sometimes when you run an experiment the initial results differ from the long-term results as users adjust to the change
    • However, they show with an A/A test that the way metrics are presented (e.g., graphs of the cumulative effect over time) can cause experimenters to believe there are primacy or novelty effects when none are present (simulated below)
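A small simulation of that artifact (all numbers are my own toy assumptions): in a pure A/A setup with no effect at all, the cumulative day-by-day delta is noisiest in the first days and then settles toward zero, which is easy to misread as a novelty effect wearing off.

```python
import numpy as np

rng = np.random.default_rng(7)
days, users_per_day = 14, 2_000

control, treatment = [], []
for day in range(1, days + 1):
    # Identical distributions in both arms: there is no true effect.
    control.append(rng.normal(loc=10.0, scale=5.0, size=users_per_day))
    treatment.append(rng.normal(loc=10.0, scale=5.0, size=users_per_day))
    # Cumulative effect, computed the way day-over-day dashboards usually show it.
    delta = np.concatenate(treatment).mean() - np.concatenate(control).mean()
    print(f"day {day:2d}: cumulative delta = {delta:+.3f}")
```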
  • Experiment Length
    • We often feel that increasing the experiment length increases its power
    • However, in many metrics, “the confidence interval width does not change much over time”
    • To combat this you would need to put more users per day into the experiment (i.e., a larger share of traffic), not just run it longer (a rough illustration follows this section)
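A rough illustration of why longer isn't automatically better (the growth exponents and numbers are my own toy assumptions, not the paper's): CI width scales like σ/√(unique users), but unique users grow sublinearly over time because the same users keep returning, and for count metrics like sessions per user, σ itself grows as users accumulate more activity.

```python
import numpy as np

days = np.array([7, 14, 28])
unique_users = 1e6 * days ** 0.6   # distinct users grow sublinearly with time
sigma = 1.0 * days ** 0.2          # per-user std dev grows as usage accumulates
ci_width = 1.96 * sigma * np.sqrt(2 / unique_users)

for d, w in zip(days, ci_width):
    print(f"{d:2d} days: relative CI width = {w:.4f}")
# Running 4x longer barely narrows the interval; admitting a larger fraction
# of traffic each day is what actually buys statistical power.
```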
  • Carryover Effects
    • Many experimentation systems use buckets of users that change infrequently
    • They found that reusing the same bucket splits across consecutive experiments often leads to biased results
    • An experiment with the same bucket can cause the effects of the previous experiment to leak into the new one
      • Even after several months of the first experiment being turned off!
    • You can detect this with A/A experiments before launching the new experiment; re-randomizing users for each experiment avoids it (a salted-hash assignment sketch follows this section)
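One common way to get a fresh randomization per experiment is to hash the user ID with an experiment-specific salt; a minimal sketch (the function name, salt format, and bucket count are my own assumptions, not the paper's system):

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    # Hash the user ID together with an experiment-specific salt so each
    # experiment gets an independent split instead of inheriting old buckets.
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # deterministic bucket in [0, 10000)
    return "treatment" if bucket < treatment_share * 10_000 else "control"

# The same user can land in different arms of different experiments, but always
# sees a consistent arm within a single experiment.
print(assign_arm("user-123", "experiment-A"))
print(assign_arm("user-123", "experiment-B"))
```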
  • Summary:
    • Our experiment instrumentation is not as precise as we hope
    • Offline techniques do not always work well online
    • We need to understand the results

    “Anyone can run online controlled experiments and generate numbers with six digits after the decimal point … the real challenge is in understanding when the results are invalid.”