This is a fun paper from Microsoft talking about some of the interesting online
experiments they ran.
- “Only one third of ideas tested at Microsoft improved the metric(s) they were
designed to improve”
- The authors advocate the use of an A/A test, also called a Null Test
- Allows you to test the experimentation system itself: both groups get the
identical experience, so any statistically significant difference it reports
is a false positive
- They found it incredibly helpful
- I thought this was an interesting idea (a quick simulation of the check is
sketched below)
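A minimal sketch of the check, assuming a simple t-test pipeline (my simulation, not code from the paper): run many A/A splits where both arms see the identical experience. At a 5% significance level, roughly 5% of runs should come back "significant"; a much higher rate means the experimentation system itself is producing effects.

```python
# A/A (null) test sketch: both arms get the identical experience, so a
# healthy experimentation system should flag ~5% of runs as significant
# at p < 0.05. A much higher rate implicates the system, not the product.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
runs, n = 1_000, 5_000
false_positives = 0

for _ in range(runs):
    a = rng.normal(1.0, 0.5, n)   # same distribution for both arms
    b = rng.normal(1.0, 0.5, n)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"significant A/A runs: {false_positives / runs:.1%} (expect ~5%)")
```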
- OEC - Overall Evaluation Criterion
- Has to be chosen correctly
- Difficult because it's much harder to measure long-term effects
- They describe one experiment in Bing where showing worse results to users
actually increased their key metrics of query share and revenue per search
- Turns out that if the results are bad, a user needs to search more
- Key lesson: short-term metrics are not always in line with long-term
goals (the toy decomposition below makes this concrete)
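A toy decomposition, with invented numbers rather than anything from the paper, shows how the short-term metric can move the wrong way: monthly query volume factors roughly into users × sessions/user × queries/session, and degraded results inflate queries/session immediately, while the damage to sessions/user (people abandoning the engine) only shows up over months.

```python
# Toy illustration (invented numbers): why degraded search results can
# *raise* short-term query volume while hurting the long-term business.
users = 1_000_000

# Baseline: good results, users find what they want quickly.
baseline = users * 10 * 2.0       # sessions/user * queries/session

# Degraded results, short term: users reformulate and search again,
# but have not yet abandoned the engine.
degraded_short = users * 10 * 2.6

# Degraded results, long term: frustrated users leave; sessions drop.
degraded_long = users * 7 * 2.6

print(f"baseline:        {baseline:>12,.0f} queries/month")
print(f"degraded, short: {degraded_short:>12,.0f}  <- metric looks better!")
print(f"degraded, long:  {degraded_long:>12,.0f}  <- the real outcome")
```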
- Click Tracking
- They added a piece of JavaScript that executed every time a user clicked a
result
- This slowed the user experience
- However, measured clicks went up
- Turns out the slowdown gave the tracking beacons more time to complete
before the browser navigated away
- They noticed that IE users' click behavior did not change
- Firefox, Chrome and Safari are aggressive about stopping requests once a
user navigates away from the page
- In general, if you notice the effect differs between browsers, you
should take a look at your instrumentation (a per-browser breakdown like the
sketch below is a cheap check)
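A cheap diagnostic in that spirit; the data layout and numbers here are hypothetical, not from the paper. The idea is to compute the treatment effect per browser and eyeball the disagreement: a real UX change should move users similarly across browsers, while an instrumentation artifact often will not.

```python
# Sanity-check sketch: compare the treatment effect per browser.
# A large split between browsers often points at instrumentation
# (e.g. lost click beacons), not user behavior. Hypothetical data.
import pandas as pd

logs = pd.DataFrame({
    "browser": ["IE", "IE", "Chrome", "Chrome", "Firefox", "Firefox"],
    "variant": ["control", "treatment"] * 3,
    "clicks_per_user": [1.20, 1.21, 1.10, 1.35, 1.08, 1.31],
})

pivot = logs.pivot(index="browser", columns="variant",
                   values="clicks_per_user")
pivot["lift_%"] = 100 * (pivot["treatment"] - pivot["control"]) / pivot["control"]
print(pivot)
# If Chrome/Firefox show a ~20% lift while IE shows ~0%, suspect the
# click-tracking code path, not the users.
```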
- Primacy and Novelty
- Sometimes when you run an experiment, the initial results differ from the
long-term results as users adjust to the change
- However, they show with an A/A test that the way metrics are presented can
cause experimenters to believe there are primacy or novelty effects when
none are present (see the simulation below)
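One way the presentation can fool you, sketched with simulated data (my illustration, not the paper's): plotting the *cumulative* treatment effect day by day means the early points come from tiny samples, so even a pure A/A comparison tends to show a sizable early "effect" that decays toward zero as data accumulates, which is easy to misread as novelty wearing off.

```python
# Simulated A/A experiment: no true effect exists, but the cumulative
# daily estimate can look like a "novelty effect decaying" because the
# early days carry huge sampling noise.
import numpy as np

rng = np.random.default_rng(7)
days, users_per_day = 28, 2_000

control   = rng.normal(1.0, 0.5, size=(days, users_per_day))
treatment = rng.normal(1.0, 0.5, size=(days, users_per_day))  # same dist: A/A

for day in (1, 2, 3, 7, 14, 28):
    # Effect estimated on all data collected so far (cumulative view).
    diff = treatment[:day].mean() - control[:day].mean()
    print(f"day {day:2d}: cumulative effect = {diff:+.4f}")
# The early "effect" shrinking toward zero is purely an artifact of
# how the metric is presented, not users adjusting.
```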
- Experiment Length
- We often feel that increasing the experiment length increases its power
- However, in many metrics, “the confidence interval width does not change
much over time”
- To combat this, you would need to run more users per day in the experiment
(illustrated in the sketch below)
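One mechanism consistent with this, shown as a hedged simulation (my numbers, not the paper's): for a per-user metric like sessions/user, a longer experiment mostly re-observes the same users, whose activity levels differ persistently, so between-user variance grows along with the mean and puts a floor under the relative CI width. Adding users per day, by contrast, does narrow it.

```python
# Why running longer may barely shrink the CI for sessions-per-user:
# users differ persistently in activity, so between-user variance grows
# with duration and puts a floor under the relative CI width.
# Illustrative simulation, not the paper's data.
import numpy as np

rng = np.random.default_rng(0)

def relative_ci(n_users, n_days):
    # Each user has a persistent daily session rate (mean 1/day); the
    # metric is each user's total sessions over the experiment.
    rates = rng.gamma(shape=4.0, scale=0.25, size=n_users)
    totals = rng.poisson(rates[:, None], size=(n_users, n_days)).sum(axis=1)
    halfwidth = 1.96 * totals.std(ddof=1) / np.sqrt(n_users)
    return halfwidth / totals.mean()   # CI width relative to the mean

for n_days in (7, 14, 28):
    print(f"{n_days:2d} days, 10k users: relative CI ±{relative_ci(10_000, n_days):.4f}")
print(f" 7 days, 40k users: relative CI ±{relative_ci(40_000, 7):.4f}  <- more users/day helps")
```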
- Carryover Effects
- Many experimentation systems use buckets of users that change infrequently
- They found that reusing the same bucket splits often leads to bad results
- Running a new experiment on the same bucket can cause the effects of the
previous experiment to leak into the new one
- Even months after the first experiment has been turned off!
- You can measure this with A/A experiments (one standard mitigation is
sketched below)
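A standard mitigation, sketched here with hypothetical names rather than Microsoft's actual system: re-randomize per experiment by hashing the user ID together with an experiment-specific salt, so no experiment inherits another's split.

```python
# Per-experiment re-randomization sketch: hashing the user id together
# with an experiment-specific salt gives each experiment an independent
# bucketing, so one experiment's split can't leak into the next.
# Names here are hypothetical, not from the paper.
import hashlib

def assign_bucket(user_id: str, experiment_salt: str, n_buckets: int = 100) -> int:
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# The same user lands in unrelated buckets across experiments:
for salt in ("exp-ranker-v2", "exp-ads-layout"):   # hypothetical experiment ids
    print(salt, assign_bucket("user-42", salt))
```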
- Summary:
- Our experiment instrumentation is not as precise as we hope
- Offline techniques do not always work well online
- We need to understand the results
“Anyone can run online controlled experiments and generate numbers with six
digits after the decimal point … the real challenge is in understanding
when the results are invalid.”