[Book Notes 1] Trustworthy Online Controlled Experiments
Table of Contents
Introduction
Step-by-step walkthrough of A/B Testing
How to interpret A/B testing results?
Experimentation Trustworthiness
Experiment Deployment
Metrics
Randomization Unit
Introduction
“Trustworthy Online Controlled Experiments” is a practical guide to A/B testing. The book gives a great overview of how companies like Google, Amazon, Microsoft, and LinkedIn use online experimentation to improve their products. It is an excellent resource for anyone eager to learn more about A/B testing, especially those who have limited opportunities to practice it in their current work. (And that’s why I wanted to read this book!) It is also useful for leaders or analysts who would like to refine their A/B testing practices, as it covers numerous advanced topics.
What are Online Controlled Experiments?
- They are usually called A/B tests. An A/B test compares two versions of a webpage, feature, or element to determine which one performs better on the metrics you set.
- They are one of the best scientific ways to establish causality with high probability.
- They are able to detect small changes that are harder to detect with other techniques, such as changes over time (sensitivity).
Important Elements of Controlled Experiments:
- Overall Evaluation Criterion (OEC): A quantitative measure of the experiment’s objective, e.g., revenue, pageviews-per-user, click-through rate, or conversion rate.
- Parameter (= factor, variable): A controllable experimental variable that is thought to influence the OEC or other metrics of interest. In a basic A/B test, there is a single parameter with two values. In multivariate tests, there are multiple parameters involved.
- Variant: In a simple A/B test, the two variants are typically referred to as Control (A) and Treatment (B).
- Randomization Unit: The unit (e.g., a user or a session) that is randomly assigned to a variant. Randomization is important to ensure that the populations assigned to the different variants are statistically similar. Common randomization units are users and sessions.
(chapter 1)
Step-by-step walkthrough of A/B Testing
Design Online Controlled Experiments:
Now, let’s review the process of a practical online controlled experiment:
1. Set up the example (the feature we want to change): We want to evaluate the impact of simply adding a coupon code field to checkout on an e-commerce site.
2. The process that a user will experience: It is useful to think about the online shopping process as a funnel.
3. Randomization unit: Let’s assume users.
4. Define goal metrics: Set up metrics that can measure the impact of the change. For instance, revenue-per-user as a success metric in this example.
5. Which population of randomization units to target: These are the segments of users that can be considered:
- All users who visited the site
- Only users who complete the purchase process
- Only users who start the purchase process
In this case, targeting ‘Only users who start the purchase process’ is the best choice: it includes all potentially affected users and excludes users who never start checking out, who would dilute our results. We can also run the experiment for users with particular characteristics (e.g., geo, platform, or device type).
6. Size of the experiment: Here, the number of users. Essentially, we need to consider the statistical power and the size of the change we want to detect to determine the appropriate sample size. A larger experiment with more users becomes necessary when we want to detect a small change or be more confident in the conclusion.
7. Length of the experiment: Here are some factors to consider:
- More users: The user accumulation rate over time is likely to be sub-linear given that the same user may return.
- Day-of-week effect: It is important to ensure the experiment captures the weekly cycle.
- Seasonality
- Primacy and Novelty effects
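To make the sample size step concrete, here is a minimal sketch using the standard normal-approximation formula for a two-sided, two-sample comparison of means. The function name and the input numbers (a $30 standard deviation and a $1 minimum detectable change in revenue-per-user) are my own assumptions for illustration, not figures from the book.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(std_dev, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate number of users needed in EACH variant (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / min_detectable_effect ** 2
    return ceil(n)


# Hypothetical inputs: revenue-per-user has a standard deviation of $30
# and we want to be able to detect a $1 change.
n_users = sample_size_per_variant(std_dev=30.0, min_detectable_effect=1.0)
print(n_users)

# Halving the detectable change roughly quadruples the required sample size.
print(sample_size_per_variant(std_dev=30.0, min_detectable_effect=0.5))
```

The quadratic dependence on the detectable change is why detecting small changes needs much larger experiments.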
How to interpret A/B testing results?
Once we have the results from the experiment, the goal is to translate them into a launch/no-launch decision. The book provides a practical framework for this decision:
- Statistical Significance: P-value + Confidence Interval
- Practical Significance: a threshold you set on the metric (e.g., a >= 1% increase in revenue-per-user)
Scenarios (V = significant, X = not significant, △ = possibly significant):
1. X Statistically, X Practically → Do not launch.
2. V Statistically, V Practically → Launch!
3. V Statistically, X Practically → We are confident about the magnitude of the change, but the magnitude may not be sufficient to outweigh other factors such as cost.
4. X Statistically, and the confidence interval extends beyond what is practically significant → The result could either increase or decrease revenue; repeat the test with more units to gain statistical power.
5. X Statistically, △ Practically → The change may have an impact, but it may also have no impact at all; repeat the test with greater power.
6. V Statistically, △ Practically → Like scenario 5, it is possible that the change is not practically significant, so repeat the test with greater power. If a decision has to be made now, choosing to launch is reasonable.
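The scenarios above can be sketched as a small decision function. This is only my reading of the framework, assuming we have a 95% confidence interval for the relative change and a practical significance threshold; the function name, the return strings, and the handling of significantly negative results are my own assumptions.

```python
def launch_decision(ci_low, ci_high, threshold):
    """Classify a result from the 95% CI of the relative change in the metric.

    threshold is the smallest change considered practically significant
    (e.g., 0.01 for a 1% increase in revenue-per-user).
    """
    if ci_high < 0:
        # Assumption: a significantly negative change is never launched.
        return "do not launch: significant negative change"
    if ci_low > 0:  # statistically significant positive change
        if ci_low >= threshold:
            return "launch"
        if ci_high < threshold:
            return "weigh against cost: real but small change"
        return "likely launch, or rerun with more power"
    # Not statistically significant: the interval contains zero.
    if -threshold < ci_low and ci_high < threshold:
        return "do not launch"
    if ci_low <= -threshold and ci_high >= threshold:
        return "rerun with more power: too wide to call"
    if ci_high >= threshold:
        return "rerun with more power: possibly meaningful"
    return "do not launch: possible harm, no meaningful gain"


# Hypothetical intervals with a 1% practical significance threshold.
print(launch_decision(0.015, 0.030, 0.01))
print(launch_decision(-0.050, 0.050, 0.01))
```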
(chapter 2)
Experimentation Trustworthiness
Twyman’s Law states that “Any statistic that appears interesting is almost certainly a mistake”. In the same spirit, experience has shown that many extreme results are more likely to be the result of an error in the test than a real effect.
Here are some common errors in interpreting the statistics behind A/B testing that the book highlights:
Misinterpreting P-values: The correct definition of the p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that the Null hypothesis is true. The conditioning on the Null hypothesis is critical. One common misinterpretation is that the p-value is the probability that the Null hypothesis is true given the data observed.
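Misinterpreting Confidence Intervals: A 95% confidence interval is a range that, across repeated experiments, covers the true difference 95% of the time. It corresponds to a p-value threshold of 0.05: the difference is statistically significant at the 0.05 level exactly when the 95% confidence interval does not contain zero.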
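As a rough illustration of how the p-value and the confidence interval relate, here is a minimal sketch of a large-sample two-sided z-test on the difference in means. The function name and all summary statistics are hypothetical; a real analysis may need a different test.

```python
from math import sqrt
from statistics import NormalDist


def compare_means(mean_c, var_c, n_c, mean_t, var_t, n_t, alpha=0.05):
    """Two-sided z-test and (1 - alpha) confidence interval for mean_t - mean_c."""
    delta = mean_t - mean_c
    std_err = sqrt(var_c / n_c + var_t / n_t)
    z = delta / std_err
    # Probability of a result at least this extreme, ASSUMING the Null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * std_err
    return delta, p_value, (delta - half_width, delta + half_width)


# Hypothetical revenue-per-user summaries for Control and Treatment.
delta, p_value, ci = compare_means(3.20, 900.0, 100_000, 3.60, 900.0, 100_000)
print(f"delta={delta:.2f}, p-value={p_value:.4f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

With these inputs the p-value is below 0.05 and the interval excludes zero, which is the equivalence described above.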
Sample Ratio Mismatch (SRM): The observed allocation of users between the variants differs significantly from the allocation the experiment design specifies (the sample ratio). For example, if only the Treatment group goes through an additional browser redirect or delay, it can cause a Sample Ratio Mismatch.
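One common way to check for an SRM is a chi-squared goodness-of-fit test on the user counts. Below is a minimal sketch for two variants using only the standard library; the function name and the user counts are illustrative assumptions, and the cutoff you alert on is a choice your team has to make.

```python
from math import erfc, sqrt


def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """P-value of a chi-squared goodness-of-fit test for a two-variant split."""
    total = control_users + treatment_users
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((control_users - expected_c) ** 2 / expected_c
            + (treatment_users - expected_t) ** 2 / expected_t)
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


# Illustrative counts for a 50/50 design: the split looks close to even,
# yet the p-value is tiny, which points to a problem in the experiment.
print(srm_p_value(821_588, 815_482))
# A split like this one is well within what chance alone produces.
print(srm_p_value(50_000, 50_100))
```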
Simpson’s Paradox: Treatment may be better than Control in the first phase and in the second phase, but worse overall when the two periods are combined. This can happen when the share of traffic assigned to each variant differs between the phases, for example during a ramp-up.
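Here is a small numeric sketch of the paradox. The conversion counts are illustrative and assume a ramp-up in which Treatment receives 1% of traffic in phase 1 and 50% in phase 2.

```python
# Illustrative (conversions, users) per variant and phase.
phases = {
    "phase 1": {"control": (20_000, 990_000), "treatment": (230, 10_000)},
    "phase 2": {"control": (5_000, 500_000), "treatment": (6_000, 500_000)},
}


def rate(conversions, users):
    return conversions / users


def pooled_rate(variant):
    """Conversion rate after naively combining both phases."""
    conversions = sum(groups[variant][0] for groups in phases.values())
    users = sum(groups[variant][1] for groups in phases.values())
    return conversions / users


for name, groups in phases.items():
    print(name,
          f"control={rate(*groups['control']):.2%}",
          f"treatment={rate(*groups['treatment']):.2%}")

print("combined",
      f"control={pooled_rate('control'):.2%}",
      f"treatment={pooled_rate('treatment'):.2%}")
```

Treatment wins within each phase, but because most of its users come from the lower-converting phase 2, it loses in the combined numbers.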
(chapter 3)
Experiment Deployment
Experiment deployment usually involves two components:
- Variant Assignment: The validity of A/B testing rests on the assumption that units are assigned to variants at random. Variant assignment answers the question: given a user request and its attributes (e.g., country, OS, platform), which experiment and variant combinations is that request assigned to? Generally, a user ID is needed to keep the assignment consistent for a user.
- Production code, system parameters, and values: Production code changes that implement the variant behavior according to the assignment and ensure the user receives the appropriate experience.
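A common way to get a random but consistent assignment is to hash the user ID together with an experiment identifier. The sketch below is one possible approach, not the book's implementation; the function name, the experiment ID, and the choice of SHA-256 are my own assumptions.

```python
import hashlib


def assign_variant(user_id, experiment_id, treatment_share=0.5):
    """Deterministically map a user to a variant for one experiment."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "treatment" if bucket < treatment_share else "control"


# The same user always gets the same variant for the same experiment.
print(assign_variant("user-42", "coupon-code-field"))
print(assign_variant("user-42", "coupon-code-field"))
```

Production code can then branch on the returned variant to show the matching experience.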
Scaling Experimentation
To scale the number of experiments, users must be in multiple experiments. How does this work?
In the Single-Layer method, each user is in only one experiment at a time. The advantage is that it is simple and still allows multiple experiments to run simultaneously, each on its own slice of traffic. The main drawback is that we may not have enough traffic/users for adequate power if several concurrent experiments are running.
To enable scalable experimentation, companies often adopt a concurrent experiment system, allowing users to participate in multiple experiments simultaneously. One approach is the Multiple Experiment Layer method, where a user is included in all experiments concurrently. Each experiment is assigned a unique layer ID, and iterations of the same experiments share a consistent user ID to ensure a seamless user experience. However, this design has a potential downside: it may lead to collisions, where the combination of Treatments from different experiments results in a suboptimal user experience. For instance, if Experiment One tests blue text and Experiment Two tests a blue background, users assigned to both Treatments would have a poor experience.
The main question is whether the reduction in statistical power from splitting up traffic (Single-Layer method) outweighs the potential concern of interactions (Multi-Layer method).
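To make the tradeoff concrete, here is a rough sketch that contrasts the two designs with hash-based bucketing. Everything here (the experiment names, the bucket count, and hashing the user ID together with a layer-specific salt) is an assumption for illustration rather than the book's exact design.

```python
import hashlib

EXPERIMENTS = ["text_color", "background_color"]  # hypothetical experiments


def bucket(user_id, salt, n_buckets=1000):
    """Hash a user into one of n_buckets; different salts give independent buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def single_layer(user_id):
    """Each user lands in exactly one experiment, so experiments split the traffic."""
    b = bucket(user_id, "single-layer")
    experiment = EXPERIMENTS[b % len(EXPERIMENTS)]
    variant = "treatment" if (b // len(EXPERIMENTS)) % 2 else "control"
    return {experiment: variant}


def multi_layer(user_id):
    """Each experiment is its own layer, so every user is in every experiment."""
    return {exp: "treatment" if bucket(user_id, exp) % 2 else "control"
            for exp in EXPERIMENTS}


print(single_layer("user-42"))  # one experiment, half the traffic per experiment
print(multi_layer("user-42"))   # both experiments, full traffic per experiment
```

In the multi-layer version, roughly a quarter of users receive both Treatments at once, which is where collisions such as blue text on a blue background can appear.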
(chapter 4)
Metrics
Organizational Metrics:
There are three main types of organizational metrics:
- Goal Metrics: What the company ultimately cares about. Goal Metrics should be simple and stable.
- Driver Metrics: Driver metrics are leading indicators for goal metrics. They tend to be short-term.
- Guardrail Metrics: Guardrail metrics are designed to alert experimenters about a violated assumption. They measure aspects of a product or business that may be impacted by optimizing around your success metrics, and they can prevent things from going terribly wrong. There are two types of guardrail metrics: trustworthiness-related guardrail metrics and organizational guardrail metrics. Common organizational guardrail metrics include latency, HTML response size per page, JavaScript errors per page, revenue-per-user, pageviews-per-user, and crash rate.
Tips to formulate metrics:
- Take ‘Quality’ into consideration: For example, a click on a search result is a ‘bad’ click if the user clicks the back button right away; a ‘good’ LinkedIn profile contains sufficient information to represent the user, such as education and employment history.
- It’s important to consider both user value and user actions when formulating metrics.
- Take ‘Related Metrics’ into account: A metric’s movement can often be explained by other related metrics. For instance, when click-through rate (CTR) is up, is it because clicks are up or because pageviews are down? Another example is metrics with high variance such as revenue. Having a more sensitive, lower variance version such as trimmed revenue or other indicators, allows more informed decisions.
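As a small illustration of the lower-variance idea, here is a sketch that caps per-user revenue before aggregating. Capping at a fixed value is just one simple way to tame outliers; the cap of 100 and the revenue numbers are hypothetical.

```python
from statistics import mean, pvariance


def cap_revenue(revenues, cap):
    """Limit each user's revenue to at most `cap` to reduce the effect of outliers."""
    return [min(r, cap) for r in revenues]


# Hypothetical revenue per user: most users spend little, one spends a lot.
revenue_per_user = [0, 0, 12, 0, 35, 8, 0, 20, 0, 4_000]
capped = cap_revenue(revenue_per_user, cap=100)

print("raw:   ", mean(revenue_per_user), pvariance(revenue_per_user))
print("capped:", mean(capped), pvariance(capped))
```

A single large purchase dominates the raw variance, while the capped version stays far more stable and therefore more sensitive to real changes.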
Tips to deal with multiple metrics:
When faced with multiple goal and driver metrics, there are steps you can take to address the situation. One approach is to combine them into an Overall Evaluation Criterion (OEC), where the OEC is calculated as the weighted sum of the normalized metrics. By doing so, you can classify your decisions into four main groups:
- If all key metrics are flat (not statistically significant) or positive (statistically significant), with at least one metric positive → ship the change
- If all key metrics are flat or negative, with at least one metric negative → don’t ship the change
- If all key metrics are flat → don’t ship the change or increase the experiment power
- If some key metrics are positive and some key metrics are negative → decide based on the tradeoffs
*a “flat” result indicates that there is no statistically significant difference, while a positive or negative result indicates a statistically significant finding.
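The weighted-sum OEC and the four decision groups above can be sketched as follows. The metric names, weights, and normalized values are hypothetical, and the labels 'positive', 'negative', and 'flat' are assumed to come from each metric's significance test.

```python
def oec(normalized_metrics, weights):
    """Weighted sum of metrics that were already normalized to a common scale."""
    return sum(weights[name] * value for name, value in normalized_metrics.items())


def ship_decision(results):
    """results maps each key metric to 'positive', 'negative', or 'flat'."""
    outcomes = set(results.values())
    if outcomes == {"flat"}:
        return "don't ship, or increase the experiment power"
    if "negative" not in outcomes:
        return "ship"
    if "positive" not in outcomes:
        return "don't ship"
    return "decide based on the tradeoffs"


# Hypothetical normalized metrics and weights.
score = oec({"revenue": 0.5, "engagement": 1.0}, {"revenue": 0.6, "engagement": 0.4})
print(score)
print(ship_decision({"revenue": "positive", "engagement": "flat"}))
print(ship_decision({"revenue": "positive", "engagement": "negative"}))
```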
(chapter 6,16)
Randomization Unit
The selection of the randomization unit plays a crucial role in experiment design as it has implications for both the user experience and the metrics that can be utilized to assess the impact of the experiment.
Granularities of randomization unit:
- Page-level: Each new web page viewed on a site is a unit
- Session-level: The group of webpages viewed on a single visit
- User-level: All events from a single user form a unit. User-level randomization is the most common, as it avoids an inconsistent user experience.
When deciding on the granularity, there are two questions to consider:
1. How important is the consistency of the user experience?
For example, when the experiment changes the font color, page-level randomization could change the font color on every page and create a bad, inconsistent user experience.
2. Which metrics matter?
In general, it is advisable to use the same unit for randomization as the unit of analysis for the metrics of interest, as this simplifies the interpretation. For example, if your metrics include click-through rate (clicks-per-pageview), it is recommended to randomize at the page level. On the other hand, if your metrics involve clicks-per-user or pageviews-per-user, randomizing at the user level would be more appropriate.
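To see how the unit of analysis differs between these metrics, here is a tiny sketch that computes both from the same log. The log format of (user ID, clicked) pairs, one per pageview, is a made-up example.

```python
# Hypothetical log: one (user_id, clicked) entry per pageview.
pageviews = [
    ("u1", 1), ("u1", 0), ("u1", 1),
    ("u2", 0),
    ("u3", 0), ("u3", 1),
]

clicks = sum(clicked for _, clicked in pageviews)
users = {user_id for user_id, _ in pageviews}

# Unit of analysis is the pageview: clicks divided by pageviews.
click_through_rate = clicks / len(pageviews)
# Unit of analysis is the user: clicks divided by distinct users.
clicks_per_user = clicks / len(users)

print(click_through_rate)  # 0.5
print(clicks_per_user)     # 1.0
```

The click-through rate averages over pageviews while clicks-per-user averages over users, which is why the matching randomization unit differs for the two metrics.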
(chapter 14)
📚
I hope you found this overview of the basic components and general process of A/B testing informative and helpful. In my upcoming article, I will delve into more advanced topics from the same book, so stay tuned for further exploration and deeper insights into this fascinating subject!
ABOUT ME
Thank you so much for reading my article! You are welcome to follow me and give the article some claps if you find it inspiring :) I am Branda, a Data Analyst working in the entertainment tech industry and a graduate of the MSc Business Analytics program at UT Austin. Don’t hesitate to email me at branda.huang@utexas.edu or connect with me on LinkedIn to discuss more interesting ideas!