## Bayesian A/B Tests

05/21/2013 • Topics: Bayesian, Engineering Culture

Here at RichRelevance we regularly run live tests to ensure that our algorithms are providing top-notch performance. Our RichRecs engine, for example, displays personalized product recommendations to consumers, and the tests we run pit our recommendations against recommendations generated by a competing algorithm, or no recommendations at all. We test for metrics like click-through-rate, average order value, and revenue per session. Historically, we have used null hypothesis tests to analyze the results of our tests, but are now looking ahead to the next generation of statistical models. Frequentist is out, and Bayesian is in!

Why are null hypothesis tests under fire? There are many reasons [e.g. here or here], and a crucial one is that null hypothesis tests and p-values are hard to understand and hard to explain. There are arbitrary thresholds (0.05?) and the results are binary - you can either reject the null hypothesis or fail to reject the null hypothesis. And is that what you really care about? Which of these two statements is more appealing:

(1) "We rejected the null hypothesis that $A = B$ with a p-value of 0.043."

(2) "There is an 85% chance that $A$ has a 5% lift over $B$."

Bayesian modeling can answer questions like (2) directly.

What's Bayesian, anyway? Here's a short but thorough summary [source]:

The Bayesian approach is to write down exactly the probability we want to infer, in terms only of the data we know, and directly solve the resulting equation [...] One distinctive feature of a Bayesian approach is that if we need to invoke uncertain parameters in the problem, we do not attempt to make point estimates of these parameters; instead, we deal with uncertainty more rigorously, by integrating over all possible values a parameter might assume.

Let's think this through with an example. Assume your parameter-of-interest is click-through rate (CTR), and your test is pitting two different product recommendation engines against one another. With null hypothesis testing, you assume that there exist true-but-unknown click-through rates for $A$ and $B$, which we will write as $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$, and the goal is to figure out if they are different or not.

With Bayesian statistics we will instead model $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ as *random variables*, and specify their entire distributions (I'll go through this example in more detail in the next section). $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ are no longer two numbers, but are now two distributions.

Here's a quick dictionary of Bayesian terms:

- **prior** - a distribution that encodes your prior belief about the parameter-of-interest
- **likelihood** - a function that encodes how likely your data is given a range of possible parameters
- **posterior** - a distribution of the parameter-of-interest given your data, combining the prior and likelihood

So forget everything you know about statistical testing for now. Let's start from scratch and answer our customer's most important question directly: what is the probability that $\mathrm{CTR}_A$ is larger than $\mathrm{CTR}_B$ given the data from the experiment (i.e. a sequence of 0s and 1s in the case of click-through-rate)?

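To compute this probability, we'll first need to find the joint distribution (a.k.a. the posterior):

$$P(\mathrm{CTR}_A, \mathrm{CTR}_B \mid \text{data})$$

and then integrate across the area of interest. What does that mean? Well, $P(\mathrm{CTR}_A, \mathrm{CTR}_B \mid \text{data})$ is a two-dimensional function of $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$. So to find $P(\mathrm{CTR}_A > \mathrm{CTR}_B \mid \text{data})$ we have to add up all the probabilities in the region where $\mathrm{CTR}_A > \mathrm{CTR}_B$:

$$P(\mathrm{CTR}_A > \mathrm{CTR}_B \mid \text{data}) = \iint_{\mathrm{CTR}_A > \mathrm{CTR}_B} P(\mathrm{CTR}_A, \mathrm{CTR}_B \mid \text{data}) \, d\mathrm{CTR}_A \, d\mathrm{CTR}_B$$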

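To actually calculate this integral will require a few insights. The first is that for many standard tests, $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ are independent because they are observed by non-overlapping populations. Keeping this in mind, we have:

$$P(\mathrm{CTR}_A, \mathrm{CTR}_B \mid \text{data}) = P(\mathrm{CTR}_A \mid \text{data}) \, P(\mathrm{CTR}_B \mid \text{data})$$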

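This means we can do our computations separately for $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ and then combine them at the very end to find the probability that $\mathrm{CTR}_A > \mathrm{CTR}_B$. Then, applying Bayes rule to both $P(\mathrm{CTR}_A \mid \text{data})$ and $P(\mathrm{CTR}_B \mid \text{data})$, we get:

$$P(\mathrm{CTR}_A, \mathrm{CTR}_B \mid \text{data}) = \frac{P(\text{data} \mid \mathrm{CTR}_A) \, P(\mathrm{CTR}_A)}{P(\text{data})} \cdot \frac{P(\text{data} \mid \mathrm{CTR}_B) \, P(\mathrm{CTR}_B)}{P(\text{data})} \propto P(\text{data} \mid \mathrm{CTR}_A) \, P(\mathrm{CTR}_A) \, P(\text{data} \mid \mathrm{CTR}_B) \, P(\mathrm{CTR}_B)$$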

The next step is to define the models $P(\text{data} \mid \mathrm{CTR}_A)$ and $P(\mathrm{CTR}_A)$. (We don't need a model for $P(\text{data})$ because, in practice, we'll never have to use it to compute the probabilities of interest.) The models are different for every type of test, and the simplest is...

### BINARY A/B TESTS

If your data is a sequence of 0s and 1s, a binomial coin-flip model is appropriate. In this case we can summarize each side of the test by the parameters $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$, where $\mathrm{CTR}_A$ is the probability of a 1 on the $A$ side.

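We'll need some more notation. Let $\text{clicks}_A$ and $\text{views}_A$ be the number of clicks and the total number of views, respectively, on the $A$ side. The likelihood is then:

$$P(\text{data} \mid \mathrm{CTR}_A) \propto \mathrm{CTR}_A^{\text{clicks}_A} \, (1 - \mathrm{CTR}_A)^{\text{views}_A - \text{clicks}_A}$$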

with a similar looking equation for the $B$ side. Choosing the prior is a bit of a black art, but let's just use the conjugate Beta distribution for mathematical & computational convenience (see here and here for more about conjugate priors). Also, for the sake of fairness, we will use the same prior for $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ (unless there is a good reason to think otherwise):

$$P(\mathrm{CTR}_A) = \frac{\mathrm{CTR}_A^{\alpha - 1} \, (1 - \mathrm{CTR}_A)^{\beta - 1}}{B(\alpha, \beta)}$$

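where $B(\alpha, \beta)$ is the beta function (confusingly, not the same as a Beta distribution), and $\alpha$ and $\beta$ can be set to reflect your prior belief on what $\mathrm{CTR}_A$ should be. Note that $P(\mathrm{CTR}_A)$ has the same form as $P(\text{data} \mid \mathrm{CTR}_A)$ - that's precisely the meaning of conjugacy - and we can now write the posterior probability directly as:

$$P(\mathrm{CTR}_A \mid \text{data}) = \mathrm{Beta}(\text{clicks}_A + \alpha, \; \text{views}_A - \text{clicks}_A + \beta)$$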
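To spell out the conjugacy step, here is a short worked derivation. It assumes the binomial likelihood and the Beta prior described above, writes $\text{clicks}_A$ and $\text{views}_A$ for the click and view counts on the $A$ side, and drops every factor that does not depend on $\mathrm{CTR}_A$:

$$
\begin{aligned}
P(\mathrm{CTR}_A \mid \text{data}) &\propto P(\text{data} \mid \mathrm{CTR}_A) \, P(\mathrm{CTR}_A) \\
&\propto \mathrm{CTR}_A^{\text{clicks}_A} (1 - \mathrm{CTR}_A)^{\text{views}_A - \text{clicks}_A} \cdot \mathrm{CTR}_A^{\alpha - 1} (1 - \mathrm{CTR}_A)^{\beta - 1} \\
&= \mathrm{CTR}_A^{\text{clicks}_A + \alpha - 1} (1 - \mathrm{CTR}_A)^{\text{views}_A - \text{clicks}_A + \beta - 1}
\end{aligned}
$$

The last line is the unnormalized density of a Beta distribution with parameters $\text{clicks}_A + \alpha$ and $\text{views}_A - \text{clicks}_A + \beta$, so the posterior must be exactly that Beta distribution.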

(In practice it doesn't really matter what prior we choose - we have so much experimental data that the likelihood will overwhelm the prior easily. But we chose the Beta prior because it simplifies the math and computations.)
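As a rough illustration of how quickly the likelihood overwhelms the prior, the sketch below compares posterior means under two priors: a flat Beta(1, 1) and the Beta(1.1, 14.2) used as the example prior in this post. The "large" counts are the side-A example numbers from this post; the "small" counts, the helper `posterior_mean`, and the dictionary names are hypothetical and only there for illustration.

```python
def posterior_mean(clicks, views, alpha, beta):
    """Mean of the Beta(clicks + alpha, views - clicks + beta) posterior."""
    return (clicks + alpha) / (views + alpha + beta)

# Two candidate priors: flat Beta(1, 1) and the example Beta(1.1, 14.2).
priors = {"flat": (1.0, 1.0), "example": (1.1, 14.2)}

# A large test (side-A example numbers) vs. a hypothetical early snapshot.
datasets = {"large": (450, 56000), "small": (4, 50)}

gaps = {}
for name, (clicks, views) in datasets.items():
    means = {
        prior_name: posterior_mean(clicks, views, a, b)
        for prior_name, (a, b) in priors.items()
    }
    # Absolute disagreement between the two posterior means.
    gaps[name] = abs(means["flat"] - means["example"])
    print(name, means, "gap:", gaps[name])
```

With tens of thousands of views the two posterior means agree very closely, while with only 50 views the choice of prior visibly moves the estimate. Early in a test the prior still matters.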

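Now we have two Beta distributions, the product of which is proportional to our posterior - what's next? We can numerically compute the integral we wrote down earlier! In particular, noting that $P(\mathrm{CTR}_A, \mathrm{CTR}_B \mid \text{data}) = P(\mathrm{CTR}_A \mid \text{data}) \, P(\mathrm{CTR}_B \mid \text{data})$, let's find

$$P(\mathrm{CTR}_A > \mathrm{CTR}_B \mid \text{data}) = \iint_{\mathrm{CTR}_A > \mathrm{CTR}_B} P(\mathrm{CTR}_A \mid \text{data}) \, P(\mathrm{CTR}_B \mid \text{data}) \, d\mathrm{CTR}_A \, d\mathrm{CTR}_B$$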

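To do so, just draw independent samples of $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ (Monte Carlo style) from

$$\mathrm{Beta}(\text{clicks}_A + \alpha, \; \text{views}_A - \text{clicks}_A + \beta)$$

and

$$\mathrm{Beta}(\text{clicks}_B + \alpha, \; \text{views}_B - \text{clicks}_B + \beta)$$

as follows (in Python):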

    from numpy.random import beta as beta_dist
    import numpy as np

    N_samp = 10000  # number of samples to draw
    clicks_A = 450  # insert your own data here
    views_A = 56000
    clicks_B = 345  # ditto
    views_B = 49000
    alpha = 1.1  # just for the example - set your own!
    beta = 14.2
    A_samples = beta_dist(clicks_A + alpha, views_A - clicks_A + beta, N_samp)
    B_samples = beta_dist(clicks_B + alpha, views_B - clicks_B + beta, N_samp)

Now you can compute the posterior probability that $\mathrm{CTR}_A > \mathrm{CTR}_B$ given the data simply as:

    np.mean(A_samples > B_samples)
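If you want a sanity check on the Monte Carlo estimate, one option is to add up the joint posterior on a grid, which is the double integral from earlier done by brute force. This is only a sketch: the grid range and resolution are assumptions picked to fit the example numbers, and `beta_pdf` is a hypothetical helper written here with the standard library rather than a function from this post.

```python
import math
import numpy as np

def beta_pdf(x, a, b):
    """Beta(a, b) density, computed in log space for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log1p(-x))

clicks_A, views_A = 450, 56000
clicks_B, views_B = 345, 49000
alpha, beta = 1.1, 14.2

# Grid over the range where both posteriors have essentially all their mass
# (an assumption that fits these numbers; widen it for other data).
grid = np.linspace(0.003, 0.013, 1001)
dx = grid[1] - grid[0]

pdf_A = beta_pdf(grid, clicks_A + alpha, views_A - clicks_A + beta)
pdf_B = beta_pdf(grid, clicks_B + alpha, views_B - clicks_B + beta)

# Joint posterior mass per grid cell: rows index CTR_A, columns index CTR_B.
joint = np.outer(pdf_A, pdf_B) * dx * dx

# Sum the cells where CTR_A > CTR_B (strictly below the diagonal),
# counting the diagonal cells as half in, half out.
prob_A_beats_B = np.tril(joint, k=-1).sum() + 0.5 * np.trace(joint)
total_mass = joint.sum()  # should be very close to 1 if the grid is wide enough
print(prob_A_beats_B, total_mass)
```

The grid answer should land close to the sampled one. The sampling approach is still the more convenient of the two, because the same samples answer other questions (like lift) with no extra work.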

Or maybe you're interested in computing the probability that the lift of $A$ relative to $B$ is at least 3%. Easy enough:

    np.mean(100. * (A_samples - B_samples) / B_samples > 3)

Pretty neat, eh? Stay tuned for the next blog post where I will cover Bayesian A/B tests for Log-normal data!

P.S. How should you set your $\alpha$ and $\beta$ in the Beta prior? You can set them both to be 1 - that's like throwing your hands up and saying "all values are equally likely!" Alternatively you can set $\alpha$ and $\beta$ such that the mean or mode of the Beta prior is roughly where you expect $\mathrm{CTR}_A$ and $\mathrm{CTR}_B$ to be.
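One possible way to turn "roughly where you expect the CTR to be" into numbers is to pick a target mean (or mode) plus a prior "strength" equal to $\alpha + \beta$, which acts like a count of pseudo-views. The helper names and the 1% / 100 values below are hypothetical, chosen only to illustrate the arithmetic; the formulas are the standard Beta mean $\alpha / (\alpha + \beta)$ and mode $(\alpha - 1) / (\alpha + \beta - 2)$.

```python
import math

def beta_prior_from_mean(mean, strength):
    """Return (alpha, beta) with the given prior mean and alpha + beta = strength."""
    return mean * strength, (1.0 - mean) * strength

def beta_prior_from_mode(mode, strength):
    """Return (alpha, beta) with the given prior mode and alpha + beta = strength.

    Uses mode = (alpha - 1) / (alpha + beta - 2), which needs strength > 2.
    """
    if strength <= 2:
        raise ValueError("strength must exceed 2 for this mode formula")
    return mode * (strength - 2.0) + 1.0, (1.0 - mode) * (strength - 2.0) + 1.0

# Hypothetical belief: CTR is around 1%, worth about 100 pseudo-views.
alpha_mean, beta_mean = beta_prior_from_mean(0.01, 100)
alpha_mode, beta_mode = beta_prior_from_mode(0.01, 100)
print(alpha_mean, beta_mean)
print(alpha_mode, beta_mode)
```

A larger strength makes the prior more opinionated, so it takes more real data to move the posterior away from it.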

Reference: Bayesian Data Analysis, Chapter 2

## 9 Comments

Your posts are brilliant! I became a fan of yours through Quora, and these are indeed interesting problems to work on!

Thanks Sergey - super interesting post.

I am implementing a simple experiment platform for my team. There are two metrics I'd appreciate your advice on calculating:

1. With a 95% credible interval, the lift that group A has over group B (different perspective to the lift calc you describe above)

2. Probability that Group X has the best conversion rate (with >2 groups)

Many thanks,

Mark

Hi Mark,

Since we're dealing with probabilities, lift isn't a single number - it's a probability distribution. So you have to formulate your questions in ways that allow you to compute with the two posteriors you have.

Similarly, for the second question, there is no "best" conversion rate. Each group has a distribution over conversion rates. Try to write the metrics you are interested in terms of the posteriors p(A|data) and p(B|data).
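For instance, one way to phrase the second question in terms of posteriors is "what fraction of joint posterior draws has group X on top?" The sketch below is an illustration under assumptions, not a prescribed method: the three groups and their counts are hypothetical, and every group gets the same Beta(1.1, 14.2) example prior from the post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_samples = 10000

# Hypothetical (clicks, views) per group; replace with your own data.
groups = {"A": (450, 56000), "B": (345, 49000), "C": (410, 52000)}
alpha, beta = 1.1, 14.2  # same Beta prior for every group

# One column of posterior CTR samples per group.
samples = np.column_stack([
    rng.beta(clicks + alpha, views - clicks + beta, n_samples)
    for clicks, views in groups.values()
])

# For each joint draw, record which group has the highest CTR,
# then turn the win counts into probabilities.
winners = samples.argmax(axis=1)
prob_best = np.bincount(winners, minlength=len(groups)) / n_samples

for name, p in zip(groups, prob_best):
    print(name, p)
```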

Thanks Sergey! Will research

Any advice on how to compute credible intervals for p(CTR(A) > CTR(B))?

Sure, just sample 10000 pairs of (CTR(A), CTR(B)) from the two (independent) Beta posteriors, and then find the interval where (the middle) 95% of the samples fall. That will give you a 95% credible interval.
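Here is a minimal sketch of that recipe. It assumes the quantity you want an interval for is the percentage lift of A over B (the recipe works the same way for the plain difference), and it reuses the example counts and prior from the post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_samples = 10000

clicks_A, views_A = 450, 56000
clicks_B, views_B = 345, 49000
alpha, beta = 1.1, 14.2

# Sample pairs (CTR_A, CTR_B) from the two independent Beta posteriors.
ctr_A = rng.beta(clicks_A + alpha, views_A - clicks_A + beta, n_samples)
ctr_B = rng.beta(clicks_B + alpha, views_B - clicks_B + beta, n_samples)

# Percentage lift of A over B for every sampled pair.
lift_pct = 100.0 * (ctr_A - ctr_B) / ctr_B

# The middle 95% of the sampled lifts is a 95% credible interval.
lower, upper = np.percentile(lift_pct, [2.5, 97.5])
print(lower, upper)
```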

Hi Sergey,

Thanks for this great post! I have some questions regarding your approach:

- There is no notion of "sample size" in Bayesian AB testing. So when can we say that the results are significant enough? During the first few days of the experiments, the shape of the posterior is mainly due to the prior: do you have any idea of how many impressions (roughly) are necessary to assume that the prior is negligible?

- As you stated in your previous comment, CTRA and CTRB are both distributions. In order to evaluate whether those two distributions are "statistically significantly" different, do you define a form of "distance" between them?

Thanks in advance for your answers!

Aymen

I haven't worked on this myself, but check out this blog post: http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing/