## Personalization with Contextual Bandits

Jun 11, 2014

/Topics: Bayesian, Big data, Data Science

*This is the third in a series of three blog posts about bandits for recommendation systems.*

In the first and second blog posts we covered the bandit problem, and some ways to solve it. But in doing so, we ignored two critical challenges to dishing out recommendations in the real world. They are:

1. What if you have not potential items to recommend but , as may well be the case in retail? It is tricky to learn this many distributions quickly. We need to get good recommendations going as quickly as possible, and it would take a while to explore properly.

2. Different recommendations are appropriate for different contexts. For example, if you're on a search page at an online retailer, the recommendations should be related to what you searched for, while if you're on a specific item page, the recommendations should be related to the item you're looking at.

The solution to the first problem is RichRelevance's use of *strategies*. There is a wealth of information about this approach elsewhere but the short version is: instead of choosing between individual items to recommend, we choose between strategies that themselves do the recommending. One strategy may use collaborative filtering approaches (users who bought X also bought Y), while another might use more general trends (these are the most popular items in this category over the past N days). The number of these strategies is much smaller than the number of items, and their recommendations (unlike the bandit algorithm's) are computed offline in batch, where computational resources are larger.

The solution to the second problem is the topic of this blog post. Let's dive in and define an example from online shopping so that we have something to work with. Each time we have a chance to provide a recommendation to a shopper, the specific context can be described as a combination of customer features and webpage features. We will encode these features in a *binary *feature vector (I'll describe why later). Here are some possible features:

- - Whether the customer's IP is in an urban center or not.
- - Whether the webpage is a specific product page or not.
- - Whether the webpage is cart-related or not.
- - Whether we have seen this customer before.
- - Whether this customer has bought something before.

And so on. In practice, there can be hundreds or thousands of features, and they don't have to be binary. Our technique will work as long the features are encoded in such a way that a distance metric in the feature space makes sense; that is, if the distance between two feature vectors is small, then they represent semantically similar contexts. The reason I am choosing to use a small number of binary features is that we end up with a relatively small number of unique contexts (feature vectors), and this will be necessary to some of the models we will be comparing. Speaking of which...

## Thompson Sampling for Contextual Bandits

Let's say that we have a total of 10 binary features, and strategies. The total number of unique contexts is thus Here are some Thompson sampling options for this scenario:

- Ignore context entirely and build probability distributions, one for each arm.
- For each context, build probability distributions, for a total of
- Something fancy like this.

Option (1) is a good baseline to compare against. Option (2) is doable because for each particular context, you still only have to draw samples at runtime. Option (3) I leave for the reader to try out and let me know if it's any good.

## The Contender: Linear Contextual Bandits

The second simplest model is linear (the simplest is constant), so let's lean on Occam's razor and model our reward as a linear function of the context as follows:

where is an unknown weight vector. Why are we modeling the expected reward in this way? Because it results in a model that we can actually learn quickly and efficiently. Now all we have to do is learn a weight vector for each strategy. This problem is actually very similar to the *multi-class classification* problem in machine learning. When you have to classify, you are presented with some feature vector and you have to guess its class. A linear classifier would do it like this:

In the contextual bandit case, you are presented with some feature vector and you have to guess which strategy will yield the largest payout (in expectation). Because we are modeling our expectation linearly, we can choose which strategy to use this fashion:

(Note that I will refer to the strategy chosen by this rule as the *arg max strategy*.) These two criteria look basically identical, so why can't we just throw a good classifier at the problem? Here's why:

*(1) The training data for classification has exactly one class that gives a positive "reward" and the other classes give exactly zero "reward".* In the bandit case, any number of strategies may give rewards of any value. We still only care about guessing which arm will give the *largest reward*, but we have to take into account relative values. For example, if strategies 1, 2 and 3 have expected rewards of 10, 100, and 101 respectively for a particular context, we do not care too much if strategy 2 or 3 is chosen, but we really don't want to choose strategy 1.

*(2) The training data for classification has full information, and the bandit case doesn't. *When you're training a bandit, you see the rewards of only the strategy you chose and none other. The complete reward vector might have looked like this: , but all you saw was that last How do we train a classifier with partial feedback?

*(3) Taking the arg max is all well and good for exploitation, but how do we explore? *If we always choose the strategy that has the largest discriminant , we may get too little feedback about many of the rarely-selected strategies to learn properly.

Now we're starting to see that this problem is more complicated than just an online classifier. But there are ways to solve all three of the challenges outlined above. The first challenge of non-zero rewards can be solved by using* cost-sensitive classifiers.* These take into account the cost of mis-classifying each sample. (Note that to use these algorithms, we will have to convert rewards into costs, but this is straightforward to do.)

But we don't know all of the costs because of the partial feedback! We'll have to estimate the *unobserved costs* for each sample. How? This part is complicated, but there has been a good amount of research done on this topic, and we can just use it.

The final challenge is how to explore. Despite all of my bad-mouthing of algorithms in the first blog post, it's going to make an appearance again here: we'll use the arg max strategy with probability, and a random other strategy otherwise.

## Simulations

For this blog post, unlike the preceding two, we will use *batch* updates. The reason is simple: this is more realistic. It is too computationally intensive to update the model after seeing every sample. We have to focus all of our resources on serving recommendations in an efficient, timely manner, so we will simulate batch offline updates as follows:

- Show completely random recommendations during the first batch.
- Use the resulting feedback data from the first batch to initially train the models.
- Publish the models, and use them to serve recommendations for the second batch.
- Use the resulting feedback data from the second batch to update the models.
- Repeat (3) and (4) with subsequent batches.

The data is generated from Normal distributions (one per strategy) where the true-but-unknown mean is The code for data generation is here if you'd like to run your own experiments.

Since we're dealing with a more complex model in batch mode, I'm going to report results as average reward for samples in each batch, consisting of 1000 samples. The results are additionally averaged over 10 separate trials where for each trial underlying parameters are randomly generated (fewer than blog posts 1 and 2, but this experiment is much more computationally intensive).

Some observations:

- Contextless Thompson sampling learns quickly, but its ultimate performance is not good.
- Contextual Thompson sampling is
*slow*. This makes sense because there are about 1000 contexts to learn and so it sees on average one sample per batch. - The linear contextual bandit is the clear winner. It starts to learn quickly and continues to get better over time.

I believe my work here is done... "But Sergey," an observant reader might interrupt, "your simulation is not convincing." Why not? "Because the samples are drawn from a distribution that has a linear mean, and your function is also linear, so your model is the exact right one for the application. Surely, reality isn't so convenient." OK, you got me. To address this point, I ran another simulation where the means of the underlying samples is:

Here are the results:

Now contextual Thompson sampling eventually *surpasses* the linear bandit. This makes sense since the linear bandit is somewhat mismatched, so it makes errors by trying to model the entire function, while contextual Thompson sampling learns independently for each unique context. However, it's still slow, taking over 30000 samples to match the linear bandit! And, as I mentioned before, if we drop the requirement that have few binary features, then contextual Thompson sampling quickly becomes a headache.

Here's a summary of what you should consider using:

- If you have no contexts or relatively few unique contexts, then go ahead and use Thompson sampling. It's simple to understand, implement, debug, and it works in batch mode (unlike UCB).
- If you have high-dimensional or continuous context features, then the linear bandit is the way to go. In this post, I used an strategy to explore, but there are probably better ways. (Let me know if you think of some.)

In the preceding three blog posts we: learned about the bandit problem, applied it to online recommendation system, discussed many of details needed to actually making it work, and ran simulations to build up our intuition. Now go out there and recommend!

## 2 Comments

Well, the obvious immediate next step would be an annealing epsilon-greedy. It may also be interesting to look at time-varying windows, or having the exploration coefficient tied to the number of "losses."