Bayesian A/B Testing with a Log-Normal Model

In my first blog post I introduced the general framework of Bayesian A/B testing, and went into the gory details on how to make it happen with binary data: 0, 1, 0, 1, 0, 0, 0...

In the second blog post I went over how to do these kinds of tests on normally distributed data.  But what if you have other types of data?  Like, say, order values, revenue per session, or time spent on page?  Notice that all of these measurements are non-negative or positive, as are many of the metrics we measure & report on at RichRelevance.  This means we shouldn't model the data as being drawn from the battle-tested Gaussian (normal) distribution, because the Gaussian's support stretches from negative infinity to positive infinity, while we know for a fact that negative values appear with probability zero.  (Though, to be fair, human height is strictly positive and still famously well-modeled by a normal.)

In the realm of strictly positive data (no zeroes!), the log-normal distribution is very common across the sciences and in finance.  And, in practice, it fits our data better than other distributions on the same domain (I tried a good portion of these).  Here is a histogram of some real order values (excluding zeros), and the log-normal fit:

[Figure: histogram of real order values (zeros excluded) with the log-normal fit overlaid]

Look at that well-fit tail!  Also, notice that a Pareto (power law) distribution is not quite right here because the histogram rises at first and only then tapers off.

So, what next?  We now know all about Bernoullis, Binomials, and Gaussians, but log-normals?  The easiest way to deal with log-normal data is to remember why the distribution is called log-normal in the first place.  Quoth Wikipedia:

If X \sim \operatorname{Log-\mathcal{N}}(\mu, \sigma^2) is distributed log-normally, then \ln(X) \sim \mathcal{N}(\mu, \sigma^2) is a normal random variable.

So contrary to what the name might lead you to conclude, a log-normal distribution is not the log of a normal distribution but rather a distribution whose log is normally distributed.

And another clarification: the mean of a log-normal is not \mu like you would expect, but it is actually e^{\mu+\sigma^2/2} ; we will use this fact soon.

Just to make sure we believe that Wikipedia quote, I drew samples from a log-normal distribution in Python, plotted its histogram, and then also plotted the histogram of the log of each of the samples [here is the code]:

[Figure: histogram of log-normal samples (left) and histogram of the log of those samples (right)]

As expected, after taking the log of some log-normal samples, the result is normally distributed!
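If you'd like to reproduce that check without digging up the linked code, here is a minimal sketch (it assumes matplotlib is available, and it also verifies the mean formula mentioned above):

from numpy import exp, log, mean
from numpy.random import lognormal
import matplotlib.pyplot as plt

# draw from a log-normal whose parameters we know
mu, sigma = 4.0, 1.0
samples = lognormal(mean=mu, sigma=sigma, size=100000)

# the log of the samples should look like a normal centered at mu
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(samples, bins=100)        # heavily right-skewed
ax2.hist(log(samples), bins=100)   # bell-shaped around 4.0
plt.show()

# and the sample mean should be near exp(mu + sigma^2/2), not exp(mu)
print(mean(samples), exp(mu + sigma**2/2))  # both approximately 90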

 

A/B TEST OVERVIEW

Keeping that in mind, here is the outline of our approach to A/B testing with log-normal models:

1) Take the log of the data samples.

2) Treat the data as normally distributed, and specify posterior distributions of the parameters \mu_A and \sigma_A^2 using conjugate priors & a Gaussian likelihood.  The details and code are in the previous blog post.

3) Draw random samples of \mu_A and \sigma_A^2 from the posteriors, and transform each of them back into the domain of the original data.  The transformation is chosen depending on what you're comparing between the A side and B side.  If, for example, the comparison is between means of the log-normal, then the transformation is e^{\mu_A+\sigma_A^2/2}.  If the comparison is between modes of the log-normal, then use e^{\mu_A-\sigma_A^2}.  You can also compare medians, variances, and even entropies!  (A few of these transformations are sketched in code right after this list.)

4) Repeat steps (1) - (3) for the B side, and then compute the probability you're interested in by numerical integration (just like in the first post).
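To make step (3) a bit more concrete, here is a minimal sketch of some common back-transformations.  The helper names are mine, not from the earlier posts; each takes arrays of posterior samples of \mu and \sigma^2 and returns samples of the corresponding log-normal summary statistic:

from numpy import exp

def lognormal_mean(mu_samples, sig_sq_samples):
    # mean of a log-normal is exp(mu + sigma^2/2)
    return exp(mu_samples + sig_sq_samples/2)

def lognormal_median(mu_samples, sig_sq_samples):
    # median is just exp(mu)
    return exp(mu_samples)

def lognormal_mode(mu_samples, sig_sq_samples):
    # mode is exp(mu - sigma^2)
    return exp(mu_samples - sig_sq_samples)

def lognormal_variance(mu_samples, sig_sq_samples):
    # variance is (exp(sigma^2) - 1) * exp(2*mu + sigma^2)
    return (exp(sig_sq_samples) - 1) * exp(2*mu_samples + sig_sq_samples)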

 

THE MATH & THE CODE

Let's take a concrete example: running an A/B test on the average order value (AOV) of an online retailer.  We have AOV_A on the A side and AOV_B on the B side, and want to answer the question: "What is the probability that AOV_A is larger than AOV_B given the data from the experiment?"  Or, in math (after log-transforming the data):

p(AOV_A > AOV_B | \log(\text{data})).

How do we answer this question?  If we model the order values as being drawn from a log-normal distribution, then the AOV is just the mean of that distribution!  And, recalling an earlier point, for a log-normal distribution \mu is not the mean and \sigma^2 is not the variance!  The mean of the log-normal distribution is a function of its \mu and \sigma^2 parameters; in particular, the mean is

e^{\mu+\sigma^2/2}.

Knowing all that, we can rewrite the probability we're interested in as:

p(AOV_A > AOV_B | \log(\text{data})) = p(e^{\mu_A+\sigma_A^2/2} > e^{\mu_B+\sigma_B^2/2} | \log(\text{data})).

Where the right-hand side is a gnarly-looking integral of the joint posterior (don't stare at it for too long):

\iint\limits_{e^{\mu_A+\sigma_A^2/2}>e^{\mu_B+\sigma_B^2/2}} p(\mu_A,\sigma^2_A,\mu_B,\sigma^2_B|\log(\text{data})) \, d\mu_A \, d\sigma^2_A \, d\mu_B \, d\sigma^2_B.

Despite looking somewhat evil, this integral isn't that bad!  Below the integral signs I have lazily put the "region of interest."  It says: take the integral in the region where e^{\mu_A+\sigma_A^2/2} is larger than e^{\mu_B+\sigma_B^2/2}.  I have no idea what that region looks like, and I won't ever need to because we are going to use sampling to do the integration for us.  As a quick reminder, here is how numerical integration of a probability distribution works:

  1. Draw lots of random samples of (\mu_A,\sigma^2_A,\mu_B,\sigma^2_B)  from p(\mu_A,\sigma^2_A,\mu_B,\sigma^2_B|\log(\text{data})).
  2. Count the number of those samples that satisfy the condition e^{\mu_A+\sigma_A^2/2}>e^{\mu_B+\sigma_B^2/2}.
  3. The probability of interest is just the count from step (2) divided by total number of samples drawn.  That's it!
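Here is a minimal sketch of those three steps.  The posterior samples below are fabricated placeholders just so the snippet runs end to end; drawing the real ones is exactly what the rest of this section covers.

from numpy import exp, mean
from numpy.random import normal

# fabricated stand-ins for posterior samples of (mu, sigma^2) on each side
n = 10000
mu_A, sig_sq_A = normal(4.05, 0.05, n), abs(normal(1.0, 0.05, n))
mu_B, sig_sq_B = normal(4.00, 0.05, n), abs(normal(1.0, 0.05, n))

# step 2: check which samples land in the region of interest
in_region = exp(mu_A + sig_sq_A/2) > exp(mu_B + sig_sq_B/2)

# step 3: the probability of interest is the fraction of samples in that region
print(mean(in_region))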

So now all we have to do is figure out how to draw those samples.  As before, independence will be very helpful:

p(\mu_A,\sigma^2_A,\mu_B,\sigma^2_B|\log(\text{data})) = p(\mu_A,\sigma^2_A|\log(\text{data})) p(\mu_B,\sigma^2_B|\log(\text{data})).

Hey look!  Since we have already established that \log(\text{data}) is distributed according to a normal distribution, that means p(\mu_A,\sigma^2_A|\log(\text{data})) and p(\mu_B,\sigma^2_B|\log(\text{data})) are both posteriors of the Gaussian model from the previous blog post (its first and second equations).  And we already know how to sample from those!  Here is a function that (1) log-transforms the data, (2) gets posterior samples of \mu and \sigma^2 using the function draw_mus_and_sigmas from the previous blog post, and (3) transforms these parameters back into the log-normal domain:

from numpy import exp, log, mean

def draw_log_normal_means(data,m0,k0,s_sq0,v0,n_samples=10000):
    # log transform the data
    log_data = log(data)
    # get samples from the posterior
    mu_samples, sig_sq_samples = draw_mus_and_sigmas(log_data,m0,k0,s_sq0,v0,n_samples)
    # transform into log-normal means
    log_normal_mean_samples = exp(mu_samples + sig_sq_samples/2)
    return log_normal_mean_samples
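For reference, here is a minimal sketch of what draw_mus_and_sigmas could look like, using the standard normal-inverse-chi-squared conjugate update (see Gelman et al.).  The previous blog post has the authoritative version, so treat this as a stand-in:

from numpy import mean, sqrt
from numpy.random import gamma, normal

def draw_mus_and_sigmas(data, m0, k0, s_sq0, v0, n_samples=10000):
    N = len(data)
    the_mean = mean(data)
    SSD = ((data - the_mean)**2).sum()  # sum of squared deviations

    # posterior hyperparameters of the normal-inverse-chi-squared
    kN = float(k0 + N)
    mN = (k0/kN)*m0 + (N/kN)*the_mean
    vN = v0 + N
    vN_times_s_sqN = v0*s_sq0 + SSD + (N*k0/kN)*(the_mean - m0)**2

    # sigma^2 | data  ~  inverse-gamma(vN/2, vN*s_sqN/2)
    alpha, beta = vN/2., vN_times_s_sqN/2.
    sig_sq_samples = beta / gamma(alpha, 1., n_samples)

    # mu | sigma^2, data  ~  normal(mN, sigma^2/kN)
    mu_samples = normal(mN, sqrt(sig_sq_samples/kN), n_samples)

    return mu_samples, sig_sq_samples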

And here is some code that runs an A/B test on simulated data:

from numpy.random import lognormal

# step 1: define prior parameters for the normal and the inverse gamma
m0 = 4.     # guess about the mean of the log of the data
k0 = 1.     # certainty about m0.  compare with number of data samples
s_sq0 = 1.  # guess about sigma squared of the log of the data
v0 = 1.     # certainty about s_sq0.  compare with number of data samples

# step 2: get some random data, with slightly different means
A_data = lognormal(mean=4.05, sigma=1.0, size=500)
B_data = lognormal(mean=4.00,  sigma=1.0, size=500) 

# step 3: get posterior samples
A_posterior_samples = draw_log_normal_means(A_data,m0,k0,s_sq0,v0)
B_posterior_samples = draw_log_normal_means(B_data,m0,k0,s_sq0,v0)

# step 4: perform numerical integration
prob_A_greater_B = mean(A_posterior_samples > B_posterior_samples)
print(prob_A_greater_B)

# or you can do more complicated lift calculations
diff = A_posterior_samples - B_posterior_samples
print(mean(100.*diff/B_posterior_samples > 1))
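One more option, not in the snippet above: instead of asking about a single lift threshold, you can summarize the whole posterior distribution of the relative lift, for example with a credible interval:

from numpy import percentile

# continuing from the snippet above
lift_samples = 100.*(A_posterior_samples - B_posterior_samples)/B_posterior_samples
print(percentile(lift_samples, [2.5, 50., 97.5]))  # median and 95% credible interval of the lift, in percent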

 

PUTTING IT ALL TOGETHER

We now have models to perform Bayesian A/B tests for three types of data: Bernoulli data, normal data, and log-normal data.  That covers a great many of the data streams we encounter, and you can even combine them without too much trouble.  For example, in computing revenue per session (RPS), most of the samples are zero, and zeros are not modeled by a log-normal model (its support is strictly positive).  To deal with this kind of data, you can simply model zero vs. not-zero with the Bernoulli/Beta model, (separately) model the non-zero values with the log-normal model, and then multiply the posterior samples together.  An example combining the functions from all three blog posts:

from numpy.random import beta as beta_dist
from numpy import mean, concatenate, zeros

# some random log-normal data
A_log_norm_data = lognormal(mean=4.10, sigma=1.0, size=100)
B_log_norm_data = lognormal(mean=4.00, sigma=1.0, size=100)
# appending many many zeros
A_data = concatenate([A_log_norm_data,zeros((10000))])
B_data = concatenate([B_log_norm_data,zeros((10000))])

# modeling zero vs. non-zero
non_zeros_A = sum(A_data > 0)
total_A = len(A_data)
non_zeros_B = sum(B_data > 0)
total_B = len(B_data)
alpha = 1 # uniform prior
beta = 1

n_samples = 10000 # number of samples to draw
A_conv_samps = beta_dist(non_zeros_A+alpha, total_A-non_zeros_A+beta, n_samples)
B_conv_samps = beta_dist(non_zeros_B+alpha, total_B-non_zeros_B+beta, n_samples)

# modeling the non-zeros with a log-normal
A_non_zero_data = A_data[A_data > 0]
B_non_zero_data = B_data[B_data > 0]

m0 = 4. 
k0 = 1. 
s_sq0 = 1. 
v0 = 1. 

A_order_samps = draw_log_normal_means(A_non_zero_data,m0,k0,s_sq0,v0)
B_order_samps = draw_log_normal_means(B_non_zero_data,m0,k0,s_sq0,v0)

# combining the two
A_rps_samps = A_conv_samps*A_order_samps
B_rps_samps = B_conv_samps*B_order_samps

# the result
print(mean(A_rps_samps > B_rps_samps))

 

PRIORS ARE YOUR FRIENDS

One final note: this code works extremely poorly if you only have a few samples of data. For example, set the data to be:

from numpy import array
A_data = array([0,0,0,0,0,0,0,0,0,0,10,20,200,0,0,0,0,0])
B_data = array([0,0,0,0,0,0,0,0,0,0,30,10,100,0,0,0,0,0])

And then run the snippet again.  The code will run without errors or warnings, but type max(A_rps_samps) into the console and see what pops out.  I see: 2.14e+31!  What's going on?  Well, the less data you have, the more uncertainty there is in both \mu and \sigma^2.  In this case it is \sigma^2 that's the culprit.  Because the log-normal mean samples are \exp(\mu+\sigma^2/2), it doesn't take much of an increase in \sigma^2 for the exponential to blow up.  And the inverse gamma posterior over \sigma^2 can get quite fat-tailed when you have too few samples (fewer than about 50).

But that doesn't mean we can't use this methodology in cases of small sample size. Why? Because we can set the prior parameters as we see fit!  If you change v0 from 1 to 25, then the max of A_rps_samps drops down to 109.15, a totally reasonable value given that the maximum of our original data is 200.  So, don't just ignore your prior parameters! As its name suggests, the prior distribution is there so you can encode your prior knowledge into the model. In this example, setting v0 to 25 is equivalent to saying "Hey model, there is less uncertainty in the \sigma^2 parameter than it may seem."
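Here is a minimal sketch of that fix applied to the sparse-data example above, re-running the combined pipeline with v0 bumped up to 25 (the exact numbers will vary from run to run):

from numpy import array, mean
from numpy.random import beta as beta_dist

A_data = array([0,0,0,0,0,0,0,0,0,0,10,20,200,0,0,0,0,0])
B_data = array([0,0,0,0,0,0,0,0,0,0,30,10,100,0,0,0,0,0])

# same priors as before, except for a much stronger prior on sigma^2
m0, k0, s_sq0, v0 = 4., 1., 1., 25.

A_conv_samps = beta_dist(sum(A_data > 0) + 1, sum(A_data == 0) + 1, 10000)
B_conv_samps = beta_dist(sum(B_data > 0) + 1, sum(B_data == 0) + 1, 10000)

A_rps_samps = A_conv_samps * draw_log_normal_means(A_data[A_data > 0], m0, k0, s_sq0, v0)
B_rps_samps = B_conv_samps * draw_log_normal_means(B_data[B_data > 0], m0, k0, s_sq0, v0)

print(max(A_rps_samps))  # now on the same order as the data, not ~1e31
print(mean(A_rps_samps > B_rps_samps))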

That's it for Bayesian A/B tests. The next blog post will probably be less statistics-esque and more machine learning-y.

About:

Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.

4 Comments

  • Justin Gough says:

    Great stuff - very clear and well explained - thanks!

    I have some questions on applying this to time on page:

    Chao Liu and colleagues (reported by Jakob Nielsen) report a Weibull distribution for time on page. You mention time on page in talking about log normal, but the example you work is revenue. Can you comment on the fit of log normal for time on page?

    Analytics tools do not generally report individual samples of time on page, but only a calculated average. Am I right in thinking that this analysis requires row-level data, ie individual sessions, (because you need to get μ and σ and N for the log of the samples)? Is there a way to extend this analysis to situations where you can only get the average, and counts of samples in various ranges (for example by the Google Analytics API), or would it be simpler to get a different measurement tool before embarking on such testing?

    • Sergey Feldman says:

      Hi Justin,

      Thanks for the comment.

      From my experience, it makes sense that time spent on page would be distributed Weibull or Gamma. The Weibull and the Gamma are both generalizations of the Exponential, which "describes events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate."

      Let me make sure I understand clearly what your data is. You have 1000 sessions, each of which has some number of samples of time on page.

      If you did have access to the individual samples, you'd just dump them all in a big bag (for each treatment), and then fit a model for the A side and the B side, etc.

      But you don't have individual samples. You only have averages from each session. That's certainly tougher. In the Gaussian case, if x is distributed with mean \mu and standard deviation \sigma , then a sample mean made out of n samples is also distributed Gaussian with mean \mu and standard deviation \sigma/\sqrt{n} .

      But if you have many sample means, each with a different n , I am not sure whether there is a simple closed-form distribution. And I'm even less sure about the Weibull/Gamma case.

      If I was in your shoes, I'd get a bunch of samples and try out different distributions. If something looks like it fits, just use it. Hope that's helpful!

  • Steven Ou says:

    Wow, this is super helpful. I've been trying to wrap my head around A/B testing non-binary data and this nailed it on the head. I'm used to A/B testing using the z-score. I would love it if you could help me understand this Bayesian method a little better.

    In your above example is prob_A_greater_B >= 0.95 or prob_A_greater_B <= 0.05 equivalent to a "95% confidence level"?

    Also you have the "more complicated lift calculations" at the end of the example. I just want to make sure I understand that calculation: are you calculating the probability that A is 1% (or more) better than B?

    If so, is there an easy way to do this calculation "backwards"? For example, calculate the highest lift with a minimum probability of 95%.

    Thanks!
