## Recommendations with Thompson Sampling

Jun 05, 2014

Topics: Bayesian, Big Data, Data Science

*This is the second in a series of three blog posts on bandits for recommendation systems.*

If you read the last blog post, you should now have a good idea of the challenges in building a good algorithm for dishing out recommendations in the bandit setting. The most important challenge is to balance *exploitation* with *exploration*. That is, we have two somewhat conflicting goals: (a) quickly find the best arm to pull and (b) pull the best arm as often as possible. What I dubbed the naive algorithm in the preceding blog post fulfilled these two goals in a direct way: explore for a while, and then exploit forever. It was an OK approach, but we found that more sophisticated approaches, like the UCB family of bandit algorithms, had significantly better performance and no parameters to tune.

In this post, we'll introduce another technique: *Thompson sampling* (also known as probability matching). This has been well covered elsewhere, but mostly for the binary reward case (zeros and ones). I'll also go over Thompson in the log-normal reward case, and offer some approximations that can work for any reward distribution.

Before I define Thompson sampling, let's build up some intuition. If we had an infinite number of pulls, we would know exactly what the expected rewards are, and there would be no reason to ever explore. With a finite number of pulls, we have to explore, and the reason is that we are *uncertain* of which arm is best. The right machinery to quantify uncertainty is the probability distribution. The UCB algorithms are implicitly using a probability distribution, but only one number from it: the upper confidence bound. In Bayesian thinking, we want to use the entire probability distribution. In the preceding post I defined p_a(r), the probability distribution from which arm a's rewards are drawn. That's what controls the bandit. It would be great if we could estimate this entire distribution, but we don't need to. Why? Because all we care about is its mean μ_a. What we will do is encode our uncertainty about μ_a in the probability distribution P(μ_a | data), and then use that probability distribution to decide when to explore and when to exploit.

You may be confused by now because there are two related probability distributions floating around, so let's review:

- p_a(r) - the probability distribution that bandit a uses to generate rewards when it is pulled.
- P(μ_a | data) - the probability distribution of where we think the mean μ_a of p_a(r) is after observing some data.

With Thompson sampling you keep around a probability distribution that encodes your belief about where the expected reward μ_a is for arm a. For the simple coin-flip case, we can use the convenient Beta-Bernoulli model, and the distribution at round t after seeing s_{a,t} successes and f_{a,t} failures for arm a is simply:

P(μ_a | data) = Beta(s_{a,t} + 1, f_{a,t} + 1)

where the added 1's are convenient priors.

So now we have a distribution that encodes our uncertainty of where the true expected reward is. What's the actual algorithm? Here it is, in all its simple glory:

- Draw a random sample from P(μ_a | data) for each arm a.
- Pull the arm a which has the largest drawn sample.

That's it! It turns out that this approach is a very natural way to balance exploration and exploitation. Here is the same simulation from last time, comparing the algorithms from the preceding blog post to Thompson Sampling:
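For concreteness, here is a minimal sketch of the two-step algorithm above in Python with a Beta-Bernoulli model; the three arms and their click rates are made up for illustration:

```python
import random

def thompson_sample(successes, failures):
    """Pick an arm by Thompson sampling with a Beta-Bernoulli model.

    successes[a] and failures[a] are the observed counts for arm a;
    the +1's are the uniform Beta(1, 1) prior from the formula above.
    """
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    # Pull the arm with the largest drawn sample.
    return max(range(len(draws)), key=lambda a: draws[a])

# Simulate a 3-armed Bernoulli bandit with made-up click rates.
true_rates = [0.04, 0.05, 0.10]
successes = [0, 0, 0]
failures = [0, 0, 0]
random.seed(0)
for _ in range(10000):
    arm = thompson_sample(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
```

After enough rounds, nearly all pulls concentrate on the best arm, while the occasional wide posterior draw keeps a trickle of exploration going.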

## Continuous Variables and the Normal Approximation

The Bernoulli case is well known and well understood. But what happens when you want to maximize, say, revenue instead of click-through rate? To find out, I coded up a log-normal bandit where each arm pays out strictly positive rewards drawn from a log-normal distribution (the code is messy, so I won't be posting it).

For Thompson sampling, I used a full posterior with priors over the log-normal parameters μ and σ² as described here (note that these are *not* the mean and standard deviation of the log-normal; read the linked blog post for a lot more info).

Finding that the full posterior was unwieldy and slow, I also implemented a normal approximation to exact Thompson sampling (using Central Limit Theorem arguments), which draws a sample for each arm from:

N(μ̂_{a,t}, σ̂²_{a,t} / n_{a,t})

where μ̂_{a,t} and σ̂²_{a,t} are the sample mean and sample variance, respectively, of the rewards observed from arm a at round t, and n_{a,t} is the number of times arm a has been pulled at round t.
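A sketch of that normal approximation in Python; the rule of forcing a pull on any arm seen fewer than twice (so the sample variance is defined) is my own choice, not something from the original simulation:

```python
import math
import random

def normal_approx_thompson(means, variances, counts):
    """Normal-approximation Thompson sampling: draw one sample from
    N(sample_mean, sample_variance / n) per arm and pull the argmax.

    means, variances, counts hold the per-arm sample mean, sample
    variance, and pull count. Arms pulled fewer than twice are pulled
    immediately so their sample variance is defined.
    """
    for a, n in enumerate(counts):
        if n < 2:
            return a
    draws = [random.gauss(m, math.sqrt(v / n))
             for m, v, n in zip(means, variances, counts)]
    return max(range(len(draws)), key=lambda a: draws[a])
```

Because it only needs running means and variances, this works for any reward distribution with finite variance, not just the log-normal.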

I compared against two UCB algorithms. For a model-specific UCB, I used the modified Cox method of computing confidence bounds for the log-normal distribution from here. The UCB normal approximation is identical to that in the previous simulation.
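For reference, here is a sketch of the (unmodified) Cox upper bound on the log-normal mean as I understand it from the literature; the one-sided 95% z-value is my choice, and the modified version swaps the z-value for a Student-t quantile:

```python
import math

def cox_ucb(log_rewards, z=1.645):
    """Cox-method upper confidence bound on the mean of a log-normal.

    Computes exp(ybar + s2/2 + z * sqrt(s2/n + s2**2 / (2*(n-1)))),
    where ybar and s2 are the sample mean and (unbiased) sample
    variance of the *logs* of the observed rewards.
    """
    n = len(log_rewards)
    ybar = sum(log_rewards) / n
    s2 = sum((y - ybar) ** 2 for y in log_rewards) / (n - 1)
    return math.exp(ybar + s2 / 2 + z * math.sqrt(s2 / n + s2 ** 2 / (2 * (n - 1))))
```

Each round, the model-specific UCB pulls the arm whose `cox_ucb` value is largest.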

For all algorithms, I used the log of the observed rewards to compute sufficient statistics. The results:

Observations:

- Epsilon-Greedy is still the worst performer.
- The normal approximations are slightly worse than their hand-designed counterparts - green is worse than orange and gray is worse than purple.
- UCB is doing better than Thompson sampling over this horizon, but Thompson sampling may be poised to do better in the long run.

## The Trouble with UCB

In the above experiments UCB beat out Thompson sampling. It sounds like a great algorithm and performs well in simulations, but it has a key weakness when you actually get down to productionizing your bandit algorithm. Let's say that you aren't Google and you have limited computational resources, which means that you can only update your observed data in batch every 2 hours. For the delayed batch case, UCB will pull the same arm every time for those 2 hours because it is *deterministic* in the absence of immediate updates. While Thompson sampling relies on random samples, which will be different every time even if the distributions aren't updated for a while, UCB needs the distributions to be updated every single round to work properly.
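To see the difference concretely, here is a toy sketch (the counts are invented) showing that UCB's choice is deterministic on frozen statistics, while Thompson sampling keeps varying:

```python
import math
import random

random.seed(0)

# Frozen sufficient statistics for 3 arms: no updates during the batch window.
successes = [12, 30, 25]
failures = [88, 170, 175]
total_pulls = sum(successes) + sum(failures)

def ucb_choice():
    # UCB1-style score on frozen counts: deterministic, same arm every time.
    scores = [s / (s + f) + math.sqrt(2 * math.log(total_pulls) / (s + f))
              for s, f in zip(successes, failures)]
    return max(range(3), key=lambda a: scores[a])

def thompson_choice():
    # Random Beta draws: the choice varies even though the counts are frozen.
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(3), key=lambda a: draws[a])

batch_ucb = {ucb_choice() for _ in range(1000)}      # a single arm
batch_ts = {thompson_choice() for _ in range(1000)}  # a mix of arms
```

During a two-hour batch window, the UCB scores never change, so every visitor sees the same arm; the Thompson draws spread pulls across the plausible arms.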

Given that the simulated performance differences between Thompson sampling and UCB are small, I heartily recommend Thompson sampling over UCB; it will work in a larger variety of practical cases.

## Avoiding Trouble with Thompson Sampling

RichRelevance sees gobs of data. This is usually great, but for Thompson sampling this may mean a subtle pitfall. To understand this point, first note that the variance of a Beta distribution with parameters α and β is:

Var = αβ / ((α + β)² (α + β + 1))

For our recommender system, α is the number of successes (clicks) and β is the number of failures (non-clicks). As the amount of total data goes up, the variance shrinks, and quickly. After a while, the posteriors will be so narrow that exploration will effectively cease. This may sound good - after all, didn't we learn what the best lever to pull is? In practice, we're dealing with a moving target, so it is a good idea to put an upper bound on α + β so that exploration can continue indefinitely. For details, see Section 7.2.3 here.
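One simple way to implement such a bound (my own sketch; the linked reference may do it differently) is to rescale the Beta counts whenever their sum exceeds a cap:

```python
def capped_update(alpha, beta, reward, cap=1000.0):
    """Beta posterior update that rescales the counts so alpha + beta
    never exceeds `cap`, keeping the posterior variance (and thus
    exploration) from shrinking to zero. `cap` is a tuning knob.
    """
    alpha += reward        # reward is 1 for a click, 0 otherwise
    beta += 1 - reward
    total = alpha + beta
    if total > cap:
        # Rescaling preserves the posterior mean alpha / (alpha + beta)
        # while keeping the variance bounded away from zero.
        scale = cap / total
        alpha *= scale
        beta *= scale
    return alpha, beta
```

The cap trades off stability against responsiveness: a smaller cap means wider posteriors and faster adaptation to a moving target.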

Analogously, if you're optimizing for revenue instead of click-through rate and using a Normal approximation, you can compute sample means and sample variances in an incremental fashion, using decay as per the last page here. This will ensure that older samples have less influence than newer ones and allow you to track changing means and variances.
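Since the linked page may not be handy, here is one standard way to decay the sufficient statistics: Welford's online algorithm with a decay factor γ (a parameter I'm introducing; γ = 1 recovers the ordinary running statistics):

```python
class DecayedStats:
    """Exponentially decayed running mean and variance, so older
    rewards have less influence than newer ones."""

    def __init__(self, gamma=0.999):
        self.gamma = gamma
        self.weight = 0.0   # decayed effective sample size
        self.mean = 0.0
        self._m2 = 0.0      # decayed sum of squared deviations

    def update(self, x):
        self.weight = self.gamma * self.weight + 1.0
        delta = x - self.mean
        self.mean += delta / self.weight
        self._m2 = self.gamma * self._m2 + delta * (x - self.mean)

    @property
    def variance(self):
        # Population-style estimate; undefined until at least two updates.
        return self._m2 / self.weight if self.weight > 1.0 else 0.0
```

Feeding these decayed means and variances into the normal approximation above lets the bandit track expected rewards that drift over time.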

Coming up next: contextual bandits!

## 19 Comments

I'm curious, would it be possible to probabilistically quantify the rate at which your "moving-target" is moving and use this to choose a time-window for inclusion of recent data in the estimation?

Good idea! Maybe try to fit some kind of continuous time Markov chain? Or a Kalman filter? I messed around with tracking a bit, but it went poorly...

How do you deal with position bias? If a list of recommendations is shown, the first one is more likely to get clicked than a lower one, even if their relevance is comparable?

Good point. Every strategy has this same issue, and we are only choosing between strategies. This means that position bias doesn't affect our ability to choose the right strategy.

Hey,

I have two questions,

First, how does the Cox method of computing confidence bounds for the log-normal distribution work into the UCB1 ranking formula x̄_j + sqrt(2 log(n) / n_j)?

Second, sorry if this is trivial, but what is the formula for N(a,b) that you used for the normal approximation?

Hi Daniel,

(1) Erm I should have made this clearer. For the log-normal case, I am not using the UCB1 formula. I am just doing a more standard UCB, by choosing the bandit with the highest 95% upper confidence bound of the estimate of the mean. The Cox method gives me that upper confidence bound for each bandit. That's why the legend on the second plot just says UCB instead of UCB1. I am not sure if the UCB1 formula works for non-Bernoulli cases, so I didn't use it.

(2) N(a,b) is just the ordinary one-variable Normal distribution with mean a and variance b. http://en.wikipedia.org/wiki/Normal_distribution

So for each Round, you pull the K arms with the greatest Value yielded from equn (4) (from the Cox method paper)?

Also, if N(a,b) is just the ordinary one-variable Normal distribution with mean a and variance b, doesn't this give a distribution over X and not a distinct probability?

Yes that is the distribution you pull samples from. It's a distribution over where you think the true mean is. It is analogous to the Beta distribution sampled from in the Bernoulli case.

The total number of arms available is K. In these posts, I am assuming you have to pull exactly one arm, and, yes, you pull the one with the largest value from eq (4) in the Cox paper.

So once you have the normal distribution, using the Cox method, the mean of this distribution is P(Ua|data)?

The Cox method is for UCB. The Normal distribution is for Thompson sampling!

Love your articles. Please keep them coming.

Nice follow up from the first article, thanks Sergey!

We've also found that Thompson sampling results in the lowest amount of cumulative regret in the majority of cases because it is an almost purely dynamic way of updating the rate at which each arm is selected. I appreciate your additional note on the advantage of Thompson sampling in the context of limited computing power.

For those that are interested, here's an article we published on how we architected our own Thompson sampling bandit, which includes some custom rules for when to stop an experiment and roll out a 'winner' to all: http://splitforce.com/resources/auto-optimization/

Some areas we've considered for improvement are implementing a 'memory loss' function, perhaps dynamically tied to regret, which essentially allows for greater exploitation over time and can be a good way to handle changes due to seasonality, consumer preferences, etc. - Would be interested in hearing your thoughts on this last point.

Hi Sergey,

Great article! I've been successfully using Thompson sampling and I love it.

Now I am trying to see if there is a way to extend this method for optimizing a continuous variable.

My problem is this:

I need an algorithm to optimize the time of the week that I show a message to a user to ensure the highest probability that the user will click the message.

When the message is shown, a database entry will be updated with the day/time and whether or not the user clicked. The goal is to maximize the click through rate.

I could break up the time into discrete arms and use regular Thompson sampling, but in this case, I resent the loss of information between adjacent arms.

One person had recommended that I distribute rewards between adjacent arms, such that when you get a click at 8:05, you will attribute it heavily to the 8am model, a lesser amount to the 9am one, an even lesser amount to the 10am one and so on.

However, even with this idea I'm struggling with how I would do the weighting and how I would choose the samples.

Do you have any ideas on how to accomplish this???

Warmest regards,

-Justin

For continuous variables, I would try an Upper Confidence Bound (UCB) approach with Gaussian Process Regression (GPR). A GPR model will let you model the input-output relationship between your continuous variable and the (random) reward, and provides a probability distribution at each point. The only problem is that vanilla GPR is computationally intensive. So it should work if you don't have that much data (fewer than 1000 samples total), or you may have to find some way to do a fast approximation.

If you have a ton of data, then I think discretizing by 1-minute or 5-minute intervals and providing rewards to adjacent arms according to some function (kernel) sounds like a reasonable place to start. You will have to mess around to figure out the right kernel, but GPRs have the same problem - you have to specify a covariance kernel there, it has parameters to tune, etc, etc.