## Bandits for Recommendation Systems

06/02/2014 • Topics: Bayesian, Big data, Data Science

*This is the first in a series of three blog posts on bandits for recommendation systems. *

In this blog post, we will discuss the bandit problem and how it relates to online recommender systems. Then, we'll cover some classic algorithms and see how well they do in simulation.

A common problem for internet-based companies is: *which piece of content should we display?* Google has this problem (which ad to show), Facebook has this problem (which friend's post to show), and RichRelevance has this problem (which product recommendation to show). Many of the promising solutions come from the study of the *multi-armed bandit* problem. A one-armed "bandit" is another way to say slot machine (probably because both will leave you with empty pockets). Here is a description that I hijacked from Wikipedia:

"The **multi-armed bandit problem **is the problem a gambler faces at a row of slot machines when deciding which machines to play, how many times to play each machine and in which order to play them.^{ }When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls."

Let's rewrite this in retail language. Each time a shopper looks at a webpage, we have to show them one of product recommendations. They either click on it or do not, and we log this (binary) reward. Next, we proceed to either the next shopper or the next page view of this shopper and have to choose one of product recommendations again. (Actually, we have to choose multiple recommendations per page, and our 'reward' could instead be sales revenue, but let's ignore these aspects for now.)

Multi-armed bandits come in two flavors: stochastic and adversarial. The stochastic case is where each bandit doesn't change in response to your actions, while in the adversarial case the bandits learn from your actions and adjust their behavior to minimize your rewards. We care about the stochastic case, and our goal is to find the arm which has the largest *expected reward*. I will index the arms by , and the probability distribution over possible rewards for each arm can be written as . We have to find the arm with the largest mean reward

as quickly as possible while accumulating the most rewards along the way. One important point is that in practice are *non-stationary*, that is, rewards change over time, and we have to take that into account when we design our algorithms.

## Approach #1: A Naive Algorithm

We need to figure out the mean reward (expected value) of each arm. So, let's just try each arm 100 times, take the sample mean of the rewards we get back, and then pull the arm with the best sample mean forever more. Problem solved?

Not exactly. This approach will get you in trouble in a few key ways:

- If is of even moderate size (10-100), you'll spend a long time gathering data before you can actually benefit from feedback.
- Is 100 samples for each arm enough? How many should we use? This is an arbitrary parameter that will require experimentation to determine.
- If after 100 samples (or however many), the arm you settle on is not actually the optimal one, you can never recover.
- In practice, the reward distribution is likely to change over time, and we should use an algorithm that can take that into account.

OK, so maybe the naive approach won't work. Let's move on to a few that are actually used in practice.

## Approach #2: The Greedy Algorithm

What we'd like to do is start using what we think is the best arm as soon as possible, and adjust course when information to the contrary becomes available. And we don't want to get stuck in a sub-optimal state forever. The " greedy" algorithm addresses both of these concerns. Here is how it works: with probability pull the arm with the best current sample mean reward, and otherwise pull a random other arm (uniformly). The advantages over the naive are:

- Guaranteed to not get stuck in a suboptimal state forever.
- Will use the current best performing arm a large proportion of the time.

But setting is *hard*. If it’s too small, learning is slow at the start, and you will be slow to react to changes. If we happen to sample, say, the second-best arm the first few times, it may take a long time to discover that another arm is actually better. If is too big, you’ll waste many trials pulling random arms without gaining much. After a while, we'll have enough samples to be pretty sure which is best, but we will still be wasting an of our traffic exploring other options. In short, is a parameter that gives poor performance at the extremes, and we have little guidance as to how to set it.

## Approach #3: Upper Confidence Bound Algorithms

In the world of statistics, whenever you estimate some unknown parameter (such as the mean of a distribution) using random samples, there is a way to quantify the uncertainty inherent in your estimate. For example, the true mean of a fair six-sided die is 3.5. But if you only roll it once and get a 2, your best estimate of the mean is just 2. Obviously that estimate is not very good, and we can quantify just how variable it is. There are *confidence bounds* which can be written, for example, as: "The mean of this die is 2, with a 95-th percentile lower bound of 1.4 and a 95-th percentile upper bound of 5.2."

The upper confidence bound (UCB) family of algorithms, as its name suggests, simply selects the arm with the largest upper confidence bound at each round. The intuition is this: the more times you roll the die, the tighter the confidence bounds. If your roll the die an infinite number of times then the width of the confidence bound is zero, and before you ever roll it the width of the confidence bound is the largest it will ever be. So, as the number of rolls increases, the uncertainty decreases, and so does the width of the confidence bound.

In the bandit case, imagine that you have to introduce a brand new choice to the set of choices a week into your experiment. The greedy algorithm would keep chugging along, showing this new choice rarely (if the initial mean is defined to be 0). But the upper confidence bound of this new choice will be very large because of the uncertainty that results from us never having pulled it. So UCB will choose this new arm until its upper bound is below the upper bound of the more established choices.

So, the advantages are:

- Take uncertainty of sample mean estimate into account in a smart way.
- No parameters to validate.

And the major disadvantage is that the confidence bounds designed in the machine learning literature require heuristic adjustments. One way to get around having to wade through heuristics is to recall the central limit theorem. I'll skip the math but it says that the distribution of the sample mean computed from samples from *any* distribution converges to a Normal (Gaussian) as the number of samples increases (and fairly quickly). Why does that matter here? Because we are estimating the true expected reward for each arm with a sample mean. Ideally we want a posterior of where the true mean is, but that's hard in non-Bernoulli and non-Gaussian cases. So we will instead content ourselves with an approximation and use a Gaussian distribution centered at the sample mean instead. We can thus always use a, say, 95% upper confidence bound, and be secure in the knowledge that it will become more and more accurate the more samples we get. I will discuss this in more detail in the next blog post.

## Simulation Comparison

So how do these 3 algorithms perform? To find out, I ran a simple simulation 100 times with and binary rewards (aka a Bernoulli bandit). Here are the 5 algorithms compared:

- Random - just pick a random arm each time without learning anything.
- Naive - with 100 samples of each arm before committing to the best one
- Greedy - with
- UCB - with (1 - 1/t) bounds (heuristic modification of UCB1)
- UCB - with 95% bounds

The metric used to compare these algorithms is average (over all the trials) *expected regret* (lower is better), which quantifies how much reward we missed out on by pulling the suboptimal arm at each time step. The Python code is here and the results are in the plot below.

What can we conclude from this plot?

- Naive is as bad as random for the first rounds, but then has effectively flat performance. In the real world, the arms have shifting rewards, so this algorithm is impractical because it over-commits
- greedy is OK but without a decaying we're still wasting 1% of traffic on exploration when it may no longer be necessary.
- The UCB algorithms are great. It's not clear which one is the winner in this limited horizon, but both handily beat all of the other algorithms.

Now you know all about bandits, and have a good idea of how they might be relevant to online recommender systems. But there's more to do before we have a system that is really up to the job.

Coming up next: let's get Bayesian with Thompson Sampling!

## 19 Comments

Great article. I really like the upper confidence band approach. I have a question: what would you do in a situation where you have delayed feedback on successes/failures? E.g. lets say you have to pick 1,000,000 actions, observe their outcomes, and then set your next 1,000,000 actions?

Initially, an UCB algorithm would probably work well, but if you ever introduced a new lever you would spend a way too many pulls discovering it was sub-optimal.

Hi Zach,

Thanks for the comment. In the next two blog posts of this series I will discuss the delayed feedback case. The short answer is: don't use UCB. Use Thompson Sampling instead. It works just fine in between updates.

Thanks for the great article, very clean an easy. This was time well spent!

Nice, concise post. Any chance you'll be discussing Bayesian bandits in a follow up?

Thomas: yup! The second blog post will be about Thompson Sampling, which is effectively a Bayesian Bandit.

Great! I've been playing around with them a bit, but it's always good to hear another's take on it.

Thank you so much for the python code! numpy dependence notwithstanding, your variable names & clear explanation helped me in translating this to Scala. Thanks again Sir!

Hey,

My main curiosity is a difference in assumptions btwn MAB and recommending products, which is you can pull a slot machine any number of times, while you can only show an individual product, every K times. Basically product x can only be shown once per user session.

In regards to this i'm curious about how you measured regret in your python code. For the K products the algorithms choose (Native, UCB1 etc), when you calculate the total regret do you take the above into account? IE, the K product choices shouldn't be measured against the maximum product (estimated) expected value, but the K max maximum product (estimated) expected values. Do you think this would have any effect on your measure of cumulative expected regret?

Also would the UCB account for this fact? That products can not be shown repeatedly to a single user.

I hope that I havnt been too all over the place.

Hi Daniel,

First, let me say that there are certainly bandit models in the literature that take into account "resource constraints".

In the code & experiments I posted, there is no notion of a session at all. Each view counts as independent and separate, and a product can be shown to the same person as often as the algorithm chooses. These assumptions do not reflect reality, and would result in the same products being shown over and over again. In third blog post, I will describe what exactly we use bandits for and the point you bring up will be less pressing (preview: we don't choose between products, but between

strategieswhich choose sets of products to show).I personally think that taking session constraints into account is a good idea, but it is hard to do right. For example, in the middle of a session we have no idea how much longer the session will last (apart from middling guesses). What if we show all the reasonable products in the first 10 views, but there are another 50 views coming up? In that case, the restriction that we show each product only once will mean we start showing irrelevant products, and risk annoying the shopper. I'm sure there is room to develop models that act on a session level, but these blog posts are more basic than a true in-production model would require.

One thing that does seem to help is to increase the amount of exploration that you do. Then the algorithm is more likely to show a diverse array of products, and this can help for longer sessions.

Hope that helps!

Nice post, Sergey. We've been using Thompson Sampling to deal with delayed feedback, as we're dealing with data coming from mobile applications which are not always connected. The results have been pretty good, here's a breakdown of how it works if you're interested: https://splitforce.com/resources/auto-optimization/

Have you thought about how to deal with changes in environmental factors over relatively longer periods of time? For example, seasonality or changes in popular taste.

What a great article. Rarely I find such enjoyable reads These Days. Will definitely check in here more frequently.

Didn't expect to see this performance from UCB, it's above the rest which is interesting.

>First, let me say that there are certainly bandit models in the literature that take into account "resource constraints".

Do you know of any off the top of your head?

I trust these authors, but haven't read the paper: http://arxiv.org/pdf/1402.6779.pdf

I just wanted to say that this is a very well-written blog article on this topic. Most articles I read on statistics, probability, and machine learning present things in a very opaque manner. Thanks for the clear blog post.