Fundamentals of large-scale sequential experimentation
This is the supporting webpage for the tutorial at KDD on Aug 4, 2019 in Anchorage, Alaska. Location: Kahtnu 1, Level 2, Dena'ina. Download Slides (updated Aug 4).
Large-scale sequential hypothesis testing (A/B-testing) is rampant in the tech industry,
with internet companies running hundreds of thousands of tests per year. About 6 years ago,
Microsoft claimed that such experiments on Bing increased revenues on the order of hundreds
of millions of dollars (Kohavi et al., 2013), and even 9 years ago, Google claimed that such
experimentation was basically a mantra (Tang et al., 2010). This experimentation is actually
“doubly-sequential”, since it consists of a sequence of sequential experiments.
In this tutorial, the audience will learn about the various problems encountered in large-scale, asynchronous, doubly-sequential experimentation, both for the inner sequential process
(a single sequential test) and for the outer sequential process (the sequence of tests), and learn
about recently developed principles to tackle these problems. We will discuss error metrics
both within and across experiments, and present state-of-the-art methods that provably control these errors, both with and without resorting to parametric or asymptotic assumptions. In
particular, we will demonstrate how the practices of peeking and marginal testing
fail to control errors both within and across experiments, and how these failures can be alleviated using
simple yet nuanced changes to the experimentation setup. We will also briefly discuss the role
of multi-armed bandit methods for testing hypotheses (as opposed to minimizing regret), and
the potential pitfalls due to selection bias introduced by adaptive sampling.
This tutorial is timely because while almost every single internet company runs such tests,
most practitioners in the tech industry focus mainly on how to run a single test correctly.
However, ignoring the interplay with the outer sequential process could unknowingly inflate
the number of false discoveries, as we will carefully explain in the second half of the tutorial.
Technically, everything in this tutorial will be applicable outside the tech industry, for example
in the applied sciences (psychology, clinical trials, etc.), because they too typically sequentially
collect data within an experiment subject-by-subject, and often run a series of experiments. The
tutorial is critical because, unlike the heavily regulated pharmaceutical industry, experimentation
in the tech industry is far more flexible, with almost no oversight. Adopting the recent advances made in
these areas, as covered in the tutorial, could greatly improve the reproducibility and reliability of sequential experimentation in the tech industry.
Objectives
- To educate the community broadly about the problem of peeking at p-values (and confidence intervals) and the perils of marginal testing that ignores multiplicity.
- To summarize advances made over the last few years in the data mining, ML and Statistics literatures to: (a) allow for optional stopping and continuation of experiments, (b) permit online corrections for multiple testing, and (c) provide a unified framework for doubly-sequential experimentation with end-to-end guarantees on error rates.
- To encourage immediate adoption by practitioners of several concrete mature ideas, as well as suggest avenues for short-term future investigation by researchers.
Target audience
We expect broad interest from academics (students, professors), researchers working with applied
scientists, and industry participants (researchers and practitioners) alike. There will be enough new
concepts for all attendees, and connections to other areas of data mining and machine learning
research will be pointed out as appropriate. The mathematical background required would be
basic probability and statistics. Most topics will be introduced as needed at both an intuitive and
mathematical level. No special expertise or prior knowledge is needed to understand this tutorial,
and we expect that even beginning masters or PhD students will be able to grasp the material.
Tutorial outline (4 hours)
Introduction [15 mins]: introduction to doubly-sequential experimentation, motivations, and outline of the tutorial.
I. The inner process: a single experiment [1 hr]
- Hypothesis testing basics: null/alternative, p-value, type-1 error (false positives), power (recall), duality with confidence intervals (15 mins).
- The threat caused by peeking: why repeatedly running batch tests on accumulating data invalidates the p-value and inflates type-1 error; the same holds for confidence intervals (15 mins).
- How sequential tests avoid the peeking problem, SPRT (sequential probability ratio tests), constructing always-valid p-values (see the simulation sketch after this list) (15 mins).
- Constructing confidence sequences (uniformly valid sequences of confidence intervals), nonparametric extensions to the SPRT (15 mins).
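To make the peeking problem concrete, here is a minimal simulation sketch, assuming i.i.d. N(0,1) data under the null with known variance: recomputing a fixed-sample z-test after every observation grossly inflates the type-1 error, whereas an always-valid sequential test built from Robbins' Gaussian-mixture likelihood-ratio martingale (a special case of the mixture SPRT) keeps it below α. The prior variance tau2, the horizon, and the number of runs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, tau2 = 0.05, 1.0        # target level and mixture-prior variance
n_steps, n_runs = 1000, 2000   # peeks per experiment, simulated experiments

naive_fp, seq_fp = 0, 0
for _ in range(n_runs):
    x = rng.standard_normal(n_steps)   # data generated under the null N(0, 1)
    s = np.cumsum(x)
    t = np.arange(1, n_steps + 1)
    # Peeking: a fixed-sample two-sided z-test recomputed after every observation.
    naive_fp += np.any(np.abs(s / np.sqrt(t)) > 1.96)
    # Always-valid test: Robbins' Gaussian-mixture martingale M_t; by Ville's
    # inequality, P(exists t: M_t >= 1/alpha) <= alpha under the null.
    log_m = tau2 * s**2 / (2 * (1 + t * tau2)) - 0.5 * np.log(1 + t * tau2)
    seq_fp += np.any(log_m > np.log(1 / alpha))

print(f"type-1 error with peeking z-test:    {naive_fp / n_runs:.3f}")  # well above 0.05
print(f"type-1 error with always-valid test: {seq_fp / n_runs:.3f}")    # at most 0.05
```

The running minimum of 1 / max_{s≤t} M_s is an always-valid p-value: it can be monitored continuously and acted upon at any data-dependent stopping time without inflating the type-1 error.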
-------Minor questions [5 mins]-------half hour coffee break -------(1 hr 50 min mark)-------
II. The outer process: a sequence of experiments [40 mins]
- The issue of multiplicity: why running thousands of experiments necessitates a correction, even if the experiments and hypotheses are all independent; the same holds for confidence intervals (15 mins).
- False discovery rate (FDR) and false coverage rate (FCR) as natural error metrics (5 mins).
- Why Benjamini-Hochberg cannot be used online (5 mins).
- A simple algorithm for online control of the FDR and the FCR (a sketch follows this list) (10 mins).
- Handling local dependence (5 mins).
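As a concrete illustration of such an algorithm, here is a sketch of the LORD++ rule (Javanmard and Montanari, 2018; Ramdas et al., 2017), which controls the FDR at level α when the p-values are independent. The γ sequence, the initial wealth w0, and the toy p-value stream are illustrative choices, not prescriptions.

```python
import numpy as np

def lord_plus_plus(pvals, alpha=0.05, w0=0.025):
    """Sketch of the LORD++ online FDR rule: each p-value is tested at a level
    alpha_t computed from the times of past rejections; FDR <= alpha holds for
    independent p-values."""
    n = len(pvals)
    # Any nonnegative sequence summing to (at most) 1 works; this is one simple choice.
    gamma = 1.0 / np.arange(1, n + 1) ** 1.5
    gamma /= gamma.sum()
    rejected = np.zeros(n, dtype=bool)
    rej_times = []   # 0-indexed times of past rejections
    for t in range(n):
        alpha_t = gamma[t] * w0
        if rej_times:  # wealth earned at rejection time tau is spent via gamma[t - tau - 1]
            alpha_t += (alpha - w0) * gamma[t - rej_times[0] - 1]
            alpha_t += alpha * sum(gamma[t - tau - 1] for tau in rej_times[1:])
        if pvals[t] <= alpha_t:
            rejected[t] = True
            rej_times.append(t)
    return rejected

# Toy stream: 100 null p-values mixed with 20 strong signals.
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=100), rng.uniform(0, 1e-4, size=20)])
rng.shuffle(p)
print(f"{lord_plus_plus(p).sum()} discoveries on a stream of {len(p)} p-values")
```

More powerful adaptive variants (SAFFRON, ADDIS) change how this alpha-wealth is earned and spent; the onlineFDR package listed below implements many of them.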
III. The joint doubly-sequential process [10 mins]
- Putting the two modular pieces together: algorithms for fully asynchronous doubly-sequential experimentation (a simplified sketch follows).
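For intuition, here is a deliberately simplified sketch that runs the experiments one at a time (the fully asynchronous case needs extra care; see Zrnic et al., 2018): the outer LORD++-style rule hands each new experiment a level alpha_t, and the inner always-valid test from Part I spends exactly that level. The effect size, effect frequency, and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, w0, tau2 = 0.05, 0.025, 1.0
n_exps, horizon = 200, 2000

gamma = 1.0 / np.arange(1, n_exps + 1) ** 1.5
gamma /= gamma.sum()

rej_times = []
for t in range(n_exps):
    # Outer process: LORD++-style level for experiment t (see the Part II sketch).
    alpha_t = gamma[t] * w0
    if rej_times:
        alpha_t += (alpha - w0) * gamma[t - rej_times[0] - 1]
        alpha_t += alpha * sum(gamma[t - tau - 1] for tau in rej_times[1:])
    # Inner process: mixture-SPRT run at level alpha_t (see the Part I sketch);
    # crossing M_n >= 1/alpha_t is the same as the always-valid p-value
    # dropping below alpha_t. Stop early on rejection, else give up at the horizon.
    mu = 0.0 if rng.uniform() < 0.8 else 0.5   # 20% of experiments carry a real effect
    s, log_thresh = 0.0, np.log(1 / alpha_t)
    for n in range(1, horizon + 1):
        s += rng.normal(mu)
        log_m = tau2 * s**2 / (2 * (1 + n * tau2)) - 0.5 * np.log(1 + n * tau2)
        if log_m > log_thresh:
            rej_times.append(t)
            break

print(f"{len(rej_times)} discoveries across {n_exps} experiments")
```

Because each experiment's always-valid p-value remains super-uniform at any stopping time, the outer FDR guarantee is inherited end-to-end in this synchronous setting.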
-------Minor questions [5 mins]-------(2hr 45 min mark)-------
IV. Advanced topics (inner sequential process) [25 mins]
- The use of multi-armed bandits (MAB) as adaptive hypothesis tests (a simulation of the resulting selection bias follows this list) (5 mins).
- Sequential estimation of quantiles: why this is a good idea (10 mins).
- Running minimum of p-values or running intersection of intervals: pros and cons (5 mins).
- Sequential average treatment effect estimation in A/B tests (or equivalently randomized clinical trials) with adaptive randomization (5 mins).
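To illustrate the selection-bias pitfall mentioned earlier, here is a minimal sketch, assuming two Gaussian arms with identical true mean 0: greedily sampling the empirically better arm makes the per-arm sample means biased (here negatively; Shin et al., 2019 show the sign depends on the sampling rule). The numbers of pulls and runs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n_runs, n_pulls = 5000, 50
arm0_means = []
for _ in range(n_runs):
    counts, sums = np.zeros(2), np.zeros(2)
    for t in range(n_pulls):
        # Greedy policy: pull each arm once, then always pull the arm
        # with the higher empirical mean.
        arm = t if t < 2 else int(np.argmax(sums / counts))
        sums[arm] += rng.standard_normal()   # both arms have true mean 0
        counts[arm] += 1
    arm0_means.append(sums[0] / counts[0])   # sample mean of arm 0 at the end

print(f"average sample mean of arm 0: {np.mean(arm0_means):+.3f} (true mean: 0)")
```

One remedy discussed in the tutorial is to account for the adaptivity, for example by maintaining always-valid confidence sequences for each arm, which remain valid under adaptive sampling.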
V. Advanced topics (outer sequential process) [15 mins]
- Decaying memory (forgetting the past gracefully).
- Utilizing prior information about hypotheses.
- Post-hoc analysis: answering questions in hindsight.
- Online control of the false sign rate.
-------Minor questions [5 mins]-------(3hr 30 min mark)-------
Open directions for theory and practice [5 mins]: hierarchical error rates for large organizations, utilizing contextual information, systems that fail loudly.
Summary [15 mins]: Historical context, recap of tutorial, takeaway messages, software.
-------Longer questions [10 mins]-------End of tutorial--------(4hr mark)-------
Tutor biography
Aaditya Ramdas (aramdas@cmu.edu) is an assistant professor in the Department of Statistics and Data Science, and affiliated with the Machine Learning Department, at Carnegie Mellon University. Previously, he was a postdoctoral researcher in Statistics and EECS at UC Berkeley from 2015 to 2018, hosted by Martin Wainwright and Michael Jordan. He finished his PhD at CMU in Statistics and Machine Learning with Aarti Singh and Larry Wasserman, winning the Umesh K. Gavaskar Memorial Thesis Award. His undergraduate degree was in Computer Science from IIT Bombay. A lot of his research focuses on modern aspects of reproducibility in science and technology, involving statistical testing and false discovery rate control in static and dynamic settings. He also works on problems in sequential decision-making and online uncertainty quantification. Curriculum Vitae
The tutor is incredibly grateful to some excellent coauthors on the topic of this tutorial, including but not limited to (in alphabetical order): S. Howard, K. Jamieson, J. Tian, A. Weinstein, F. Yang, T. Zrnic.
Some references (not exhaustive)
- The inner sequential process (confidence sequences and always-valid p-values):
- Confidence sequences for mean, variance, and median. D. Darling and H. Robbins. Proceedings of the National Academy of Sciences, 1967.
- On confidence sequences. T. Lai. The Annals of Statistics, 1976.
- Always valid inference: bringing sequential analysis to A/B testing. R. Johari, L. Pekelis, and D. Walsh. arXiv, 2015.
- Peeking at A/B tests: why it matters, and what to do about it. R. Johari, P. Koomen, L. Pekelis, and D. Walsh. 23rd International Conference on Knowledge Discovery and Data Mining, 2017.
- Uniform, nonparametric, non-asymptotic confidence sequences. S. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. arXiv, 2018.
- Sequential estimation of quantiles with applications to A/B-testing and best-arm identification. S. Howard and A. Ramdas. arXiv, 2019.
- Safe Testing. P. Grünwald, R. de Heide, W. Koolen. arXiv:1906.07801, 2019.
- Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. A. Deng, J. Lu, S. Chen. DSAA, 2016.
- Why optional stopping is a problem for Bayesians. R. de Heide, P. Grünwald. arXiv, 2017.
- The bias of the sample mean in multi-armed bandits can be positive or negative. J. Shin, A. Ramdas, A. Rinaldo. arXiv, 2019.
- The outer sequential process (online multiple testing):
- α-investing: a procedure for sequential control of expected false discoveries. D. Foster and R. Stine. Journal of the Royal Statistical Society, Series B, 2008.
- Generalized α-investing: definitions, optimality results and application to public databases. E. Aharoni and S. Rosset. Journal of the Royal Statistical Society, Series B, 2014.
- Online rules for control of false discovery rate and false discovery exceedance. A. Javanmard and A. Montanari. The Annals of Statistics, 2018.
- Online control of the false discovery rate with decaying memory. A. Ramdas, F. Yang, M. Wainwright, and M. Jordan. 31st Conference on Neural Information Processing Systems, 2017.
- SAFFRON: an adaptive algorithm for online control of the false discovery rate. A. Ramdas, T. Zrnic, M. Wainwright, and M. Jordan. 35th International Conference on Machine Learning, 2018.
- Online control of the false coverage rate and false sign rate. A. Weinstein and A. Ramdas. arXiv, 2019.
- ADDIS: an adaptive discarding algorithm for online FDR control with conservative nulls. J. Tian, A. Ramdas. arXiv, 2019.
- Handling some doubly-sequential aspects:
- A framework for Multi-A(rmed)/B(andit) testing with online FDR control. F. Yang, A. Ramdas, K. Jamieson, and M. Wainwright. 31st Conference on Neural Information Processing Systems, 2017.
- Asynchronous online testing of multiple hypotheses. T. Zrnic, A. Ramdas, and M. Jordan. arXiv:1812.05068, 2018.
- Other tutorials on A/B testing:
- Tutorial by Yandex at The Web Conference 2018.
- Tutorial by Microsoft at The Web Conference 2019.
Publicly available code and packages
- An R package onlineFDR that has implementations of most online multiple testing procedures (the outer-process references above). Maintained by D. Robertson.
- A Python package confseq that has implementations of most always-valid confidence sequences for means and quantiles (the inner-process references above). Maintained by S. Howard.
- Python code to reproduce all plots in the SAFFRON paper. Maintained by T. Zrnic.
- Python code to reproduce all plots in the ADDIS paper. Maintained by J. Tian.
- Python code to reproduce all plots in the Multi-A(rmed)/B(andit) testing paper. Maintained by F. Yang.
- Python code to reproduce all plots in the asynchronous online testing paper. Maintained by T. Zrnic.