## Bayesian A/B Tests

Here at RichRelevance we regularly run live tests to ensure that our algorithms are providing top-notch performance.  Our RichRecs engine, for example, displays personalized product recommendations to consumers, and the $A/B$ tests we run pit our recommendations against recommendations generated by a competing algorithm, or no recommendations at all.  We test for metrics like click-through-rate, average order value, and revenue per session.  Historically, we have used null hypothesis tests to analyze the results of our tests, but are now looking ahead to the next generation of statistical models.  Frequentist is out, and Bayesian is in!

Why are null hypothesis tests under fire?  There are many reasons [e.g. here or here], and a crucial one is that null hypothesis tests and p-values are hard to understand and hard to explain.  There are arbitrary thresholds (0.05?) and the results are binary - you can either reject the null hypothesis or fail to reject the null hypothesis.  And is that what you really care about? Which of these two statements is more appealing:

(1) "We rejected the null hypothesis that $A = B$ with a p-value of 0.043."

(2) "There is an 85% chance that $A$ has a 5% lift over $B$."

Bayesian modeling can answer questions like (2) directly.

What's Bayesian, anyway?  Here's a short but thorough summary [source]:

The Bayesian approach is to write down exactly the probability we want to infer, in terms only of the data we know, and directly solve the resulting equation [...] One distinctive feature of a Bayesian approach is that if we need to invoke uncertain parameters in the problem, we do not attempt to make point estimates of these parameters; instead, we deal with uncertainty more rigorously, by integrating over all possible values a parameter might assume.

Let's think this through with an example.  Assume your parameter-of-interest is click-through rate (CTR), and your  $A/B$ test is pitting two different product recommendation engines against one another.  With null hypothesis testing, you assume that there exist true-but-unknown click-through rates for $A$ and $B,$ which we will write as $\text{CTR}_A$ and $\text{CTR}_B,$ and the goal is to figure out if they are different or not.

With Bayesian statistics we will instead model $\text{CTR}_A$ and $\text{CTR}_B$ as random variables, and specify their entire distributions (I'll go through this example in more detail in the next section).  $\text{CTR}_A$ and $\text{CTR}_B$ are no longer two numbers, but are now two distributions.

Here's a quick dictionary of Bayesian terms:

• prior - a distribution that encodes your prior belief about the parameter-of-interest
• likelihood - a function that encodes how likely your data is given a range of possible parameters
• posterior - a distribution of the parameter-of-interest given your data, combining the prior and likelihood
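
To make these three terms concrete, here is a minimal sketch that builds all of them on a grid of candidate CTR values. It assumes the coin-flip model for clicks that is introduced later in the post; the counts (3 clicks in 20 views) and the flat prior are made up for illustration.

```python
import numpy as np

# Hypothetical toy data (not from the post): 3 clicks out of 20 views.
clicks, views = 3, 20

# Grid of candidate CTR values strictly inside (0, 1).
ctr_grid = np.linspace(0.001, 0.999, 999)

# Prior: a flat belief, every candidate CTR is equally plausible.
prior = np.ones_like(ctr_grid)

# Likelihood: how probable the observed data is for each candidate CTR.
# The binomial coefficient does not depend on CTR, so it is dropped.
likelihood = ctr_grid**clicks * (1.0 - ctr_grid)**(views - clicks)

# Posterior: prior times likelihood, normalized to sum to 1 over the grid.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

# With a flat prior the posterior peaks at the observed rate clicks / views.
posterior_mode = ctr_grid[np.argmax(posterior)]
```

The grid is only a teaching device. The rest of the post uses a closed-form posterior instead, which avoids discretizing the CTR at all.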

So forget everything you know about statistical testing for now.  Let's start from scratch and answer our customer's most important question directly: what is the probability that $\text{CTR}_A$ is larger than $\text{CTR}_B$ given the data from the experiment (i.e. a sequence of 0s and 1s in the case of click-through-rate)?

To compute this probability, we'll first need to find the joint distribution (a.k.a. the posterior):

$$P(\text{CTR}_A,\text{CTR}_B|\text{data})$$

and then integrate over the region of interest.  What does that mean?  Well, $P(\text{CTR}_A,\text{CTR}_B|\text{data})$ is a two-dimensional function of $\text{CTR}_A$ and $\text{CTR}_B.$   So to find $P(\text{CTR}_A>\text{CTR}_B|\text{data})$ we have to add up all the probabilities in the region where $\text{CTR}_A>\text{CTR}_B$ :

$$P(\text{CTR}_A>\text{CTR}_B|\text{data}) = \iint_{\text{CTR}_A>\text{CTR}_B} P(\text{CTR}_A,\text{CTR}_B|\text{data}) \, d\text{CTR}_A \, d\text{CTR}_B$$

To actually calculate this integral will require a few insights.  The first is that for many standard $A/B$ tests, $A$ and $B$ are independent because they are observed by non-overlapping populations.  Keeping this in mind, we have:

$$P(\text{CTR}_A,\text{CTR}_B|\text{data}) = P(\text{CTR}_A|\text{data}) \, P(\text{CTR}_B|\text{data})$$

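This means we can do our computations separately for $\text{CTR}_A$ and $\text{CTR}_B$ and then combine them at the very end to find the probability that $\text{CTR}_A > \text{CTR}_B.$   Then, applying Bayes rule to both $P(\text{CTR}_A|\text{data})$ and $P(\text{CTR}_B|\text{data}),$ we get:

$$P(\text{CTR}_A,\text{CTR}_B|\text{data}) = \frac{P(\text{data}|\text{CTR}_A) \, P(\text{CTR}_A)}{P(\text{data})} \cdot \frac{P(\text{data}|\text{CTR}_B) \, P(\text{CTR}_B)}{P(\text{data})}$$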

The next step is to define the models $P(\text{data}|\cdot)$ and $P(\cdot).$   (We don't need a model for $P(\text{data})$ because, in practice, we'll never have to use it to compute the probabilities of interest.)  The models are different for every type of test, and the simplest is...

### BINARY A/B TESTS

If your data is a sequence of 0s and 1s, a binomial coin-flip model is appropriate.  In this case we can summarize each side of the test by the parameters $\text{CTR}_A$ and $\text{CTR}_B,$ where $\text{CTR}_A$ is the probability of a 1 on the $A$ side.

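We'll need some more notation.  Let $\text{clicks}_A$ and $\text{views}_A$ be the number of clicks and the total number of views, respectively, on the $A$ side.  The likelihood is then:

$$P(\text{views}_A, \text{clicks}_A | \text{CTR}_A) = \binom{\text{views}_A}{\text{clicks}_A} \text{CTR}_A^{\text{clicks}_A} (1-\text{CTR}_A)^{\text{views}_A-\text{clicks}_A}$$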

with a similar-looking equation for the $B$ side.  Choosing the prior $P(\text{CTR}_A)$ is a bit of a black art, but let's just use the conjugate Beta distribution for mathematical & computational convenience (see here and here for more about conjugate priors).  Also, for the sake of fairness, we will use the same prior for $\text{CTR}_A$ and $\text{CTR}_B$ (unless there is a good reason to think otherwise):

$$P(\text{CTR}_A) = \frac{\text{CTR}_A^{\alpha-1} (1-\text{CTR}_A)^{\beta-1}}{B(\alpha,\beta)}$$

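where $B$ is the beta function (confusingly, not the same as a Beta distribution), and $\alpha$ and $\beta$ can be set to reflect your prior belief on what $\text{CTR}$ should be.  Note that $P(\text{CTR}_A)$ has the same form as $P(\text{views}_A, \text{clicks}_A | \text{CTR}_A )$ - that's precisely the meaning of conjugacy - and we can now write the posterior probability $P(\text{CTR}_A |\text{views}_A, \text{clicks}_A)$ directly as:

$$P(\text{CTR}_A | \text{views}_A, \text{clicks}_A) = \frac{\text{CTR}_A^{\text{clicks}_A+\alpha-1} (1-\text{CTR}_A)^{\text{views}_A-\text{clicks}_A+\beta-1}}{B(\text{clicks}_A+\alpha, \, \text{views}_A-\text{clicks}_A+\beta)}$$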
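
The conjugate update can be checked in a few lines, assuming the binomial likelihood and Beta prior described above. Multiplying the likelihood by the prior and dropping every factor that does not depend on $\text{CTR}_A$ gives:

$$
\begin{aligned}
P(\text{CTR}_A | \text{views}_A, \text{clicks}_A)
&\propto P(\text{views}_A, \text{clicks}_A | \text{CTR}_A) \, P(\text{CTR}_A) \\
&\propto \text{CTR}_A^{\text{clicks}_A} (1-\text{CTR}_A)^{\text{views}_A-\text{clicks}_A} \cdot \text{CTR}_A^{\alpha-1} (1-\text{CTR}_A)^{\beta-1} \\
&= \text{CTR}_A^{\text{clicks}_A+\alpha-1} (1-\text{CTR}_A)^{\text{views}_A-\text{clicks}_A+\beta-1}
\end{aligned}
$$

This is the unnormalized density of a Beta distribution with parameters $\text{clicks}_A+\alpha$ and $\text{views}_A-\text{clicks}_A+\beta$. The normalizing constant follows from requiring the density to integrate to 1, which is also why $P(\text{data})$ never has to be modeled explicitly.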

(In practice it doesn't really matter what prior we choose - we have so much experimental data that the likelihood will overwhelm the prior easily.  But we chose the Beta prior because it simplifies the math and computations.)
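
As a rough illustration of how little the prior matters at this scale, the sketch below compares the posterior mean of $\text{CTR}_A$ under three priors, using the example counts that appear later in the post. The "strong" prior is a made-up assumption, and `posterior_mean` is a helper name introduced here, not something from a library.

```python
# Example counts for the A side (the same numbers used later in the post).
clicks_A, views_A = 450, 56000

def posterior_mean(alpha, beta, clicks, views):
    """Mean of the Beta(clicks + alpha, views - clicks + beta) posterior."""
    return (clicks + alpha) / (views + alpha + beta)

# Three priors as (alpha, beta): flat, the post's example values, and a
# hypothetical stronger prior centered near 5% (illustrative assumption).
priors = {"flat": (1.0, 1.0), "post": (1.1, 14.2), "strong": (5.0, 95.0)}

means = {
    name: posterior_mean(a, b, clicks_A, views_A)
    for name, (a, b) in priors.items()
}

# Largest disagreement between any two priors' posterior means.
spread = max(means.values()) - min(means.values())

for name, value in means.items():
    print(f"{name:>6}: {value:.6f}")
print(f"spread: {spread:.6f}")
```

With tens of thousands of views, all three posterior means land within about one ten-thousandth of each other. Early in a test, when views are in the tens or hundreds, the same comparison would show the prior pulling the estimates apart.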

Now we have two Beta distributions, the product of which is proportional to our posterior - what's next?  We can numerically compute the integral we wrote down earlier!  In particular, noting that $\text{data} = \{ \text{views}_A, \text{clicks}_A, \text{views}_B, \text{clicks}_B\},$ let's find

$$P(\text{CTR}_A > \text{CTR}_B | \text{data}).$$

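To do so just draw independent samples of $\text{CTR}_A$ and $\text{CTR}_B$ (Monte Carlo style) from

$$P(\text{CTR}_A|\text{data}) = \text{Beta}(\text{clicks}_A+\alpha, \, \text{views}_A-\text{clicks}_A+\beta)$$

and

$$P(\text{CTR}_B|\text{data}) = \text{Beta}(\text{clicks}_B+\alpha, \, \text{views}_B-\text{clicks}_B+\beta)$$

as follows (in Python):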

```python
import numpy as np
from numpy.random import beta as beta_dist

N_samp = 10000   # number of samples to draw
clicks_A = 450   # insert your own data here
views_A = 56000
clicks_B = 345   # ditto
views_B = 49000
alpha = 1.1      # just for the example - set your own!
beta = 14.2

# Draw from the Beta posteriors of CTR_A and CTR_B
A_samples = beta_dist(clicks_A + alpha, views_A - clicks_A + beta, N_samp)
B_samples = beta_dist(clicks_B + alpha, views_B - clicks_B + beta, N_samp)
```

Now you can compute the posterior probability that $\text{CTR}_A > \text{CTR}_B$ given the data simply as:

```python
np.mean(A_samples > B_samples)
```
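
Because this probability is itself a Monte Carlo estimate, it carries some sampling noise. The sketch below, which assumes NumPy's seeded `default_rng` generator rather than the global random state used above, reports the estimate together with its approximate standard error so you can judge whether `N_samp` is large enough.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is reproducible

N_samp = 10000
clicks_A, views_A = 450, 56000
clicks_B, views_B = 345, 49000
alpha, beta = 1.1, 14.2

# Independent draws from the two Beta posteriors.
A_samples = rng.beta(clicks_A + alpha, views_A - clicks_A + beta, N_samp)
B_samples = rng.beta(clicks_B + alpha, views_B - clicks_B + beta, N_samp)

# Fraction of sampled pairs in which A beats B.
p_hat = np.mean(A_samples > B_samples)

# Standard error of a proportion estimated from N_samp independent draws.
mc_std_err = np.sqrt(p_hat * (1.0 - p_hat) / N_samp)

print(f"P(CTR_A > CTR_B | data) ~ {p_hat:.3f} +/- {mc_std_err:.3f}")
```

The standard error shrinks like $1/\sqrt{N}$, so quadrupling `N_samp` roughly halves the noise in the reported probability.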

Or maybe you're interested in computing the probability that the lift of $A$ relative to $B$ is at least 3%.  Easy enough:

```python
np.mean(100. * (A_samples - B_samples) / B_samples > 3)
```
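
The same samples can also summarize the lift as a range rather than a single threshold. This is a minimal sketch, assuming an equal-tailed 95% credible interval taken from sample percentiles and a seeded `default_rng` generator. Other interval definitions, such as the highest-density interval, are equally valid choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is reproducible

N_samp = 10000
clicks_A, views_A = 450, 56000
clicks_B, views_B = 345, 49000
alpha, beta = 1.1, 14.2

A_samples = rng.beta(clicks_A + alpha, views_A - clicks_A + beta, N_samp)
B_samples = rng.beta(clicks_B + alpha, views_B - clicks_B + beta, N_samp)

# Lift of A relative to B, in percent, for every sampled pair.
lift_pct = 100.0 * (A_samples - B_samples) / B_samples

# Equal-tailed 95% credible interval: the middle 95% of the lift samples.
lower, upper = np.percentile(lift_pct, [2.5, 97.5])

print(f"95% credible interval for lift: [{lower:.1f}%, {upper:.1f}%]")
```

If the interval comfortably excludes 0, the data favors A. If it straddles 0, the direction of the effect is still uncertain.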

Pretty neat, eh?  Stay tuned for the next blog post where I will cover Bayesian A/B tests for Log-normal data!

PS, How should you set your $\alpha$ and $\beta$ in the Beta prior?  You can set them both to be 1 - that's like throwing your hands up and saying "all values are equally likely!"  Alternatively you can set $\alpha$ and $\beta$ such that the mean or mode of the Beta prior is roughly where you expect $\text{CTR}_A$ and $\text{CTR}_B$ to be.
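
One way to turn "roughly where you expect the CTR to be" into numbers is to pick a prior mean and a prior strength, then solve for $\alpha$ and $\beta$. The sketch below uses the fact that a Beta distribution has mean $\alpha/(\alpha+\beta)$ and that $\alpha+\beta$ behaves like a number of pseudo-observations. The helper name `beta_prior_from_mean` and the example values are hypothetical.

```python
def beta_prior_from_mean(mean_ctr, strength):
    """Return (alpha, beta) for a Beta prior with the given mean.

    mean_ctr: expected CTR, strictly between 0 and 1.
    strength: alpha + beta, i.e. how many pseudo-views the prior is worth.
    """
    if not 0.0 < mean_ctr < 1.0:
        raise ValueError("mean_ctr must be strictly between 0 and 1")
    if strength <= 0.0:
        raise ValueError("strength must be positive")
    alpha = mean_ctr * strength
    beta = (1.0 - mean_ctr) * strength
    return alpha, beta

# Hypothetical example: expect about a 1% CTR, worth about 100 prior views.
alpha, beta = beta_prior_from_mean(0.01, 100.0)
print(alpha, beta)
```

A small `strength` keeps the prior weak, so it is quickly overwhelmed once real views accumulate.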

Reference: Bayesian Data Analysis, Chapter 2

Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.

• Praveen Kumar says:

Your posts are brilliant! I became a fan of yours from Quora and these are indeed interesting problems to work on!

• Mark says:

Thanks Sergey - super interesting post.

I am implementing a simple experiment platform for my team. There are two metrics I'd appreciate your advice on calculating:
1. With a 95% credible interval, the lift that group A has over group B (different perspective to the lift calc you describe above)
2. Probability that Group X has the best conversion rate (with >2 groups)

Many thanks,
Mark

• Sergey Feldman says:

Hi Mark,

Since we're dealing with probabilities, lift isn't a single number - it's a probability distribution. So you have to formulate your questions in ways that allow you to compute with the two posteriors you have.

Similarly, for the second question, there is no "best" conversion rate. Each group has a distribution over conversion rates. Try to write the metrics you are interested in terms of the posteriors p(A|data) and p(B|data).

• Mark says:

Thanks Sergey! Will research

• Bogdan says:

Any advice on how to compute credible intervals for p(CTR(A) > CTR(B))?

• Sergey Feldman says:

Sure, just sample 10000 pairs of (CTR(A), CTR(B)) from the two (independent) Beta posteriors, and then find the interval where (the middle) 95% of the samples fall. That will give you a 95% credible interval.

• Aymen says:

Hi Sergey,

Thanks for this great post! I have some questions regarding your approach:
- There is no notion of "sample size" in Bayesian AB testing. So when can we say that the results are significant enough? During the first few days of the experiments, the shape of the posterior is mainly due to the prior: do you have any idea of how many impressions (roughly) are necessary to assume that the prior is negligible?
- As you stated in your previous comment, $\text{CTR}_A$ and $\text{CTR}_B$ are both distributions. In order to evaluate whether those two distributions are "statistically significantly" different, do you define a form of "distance" between them?