Parameter Optimization with DAPS!
Oct 13, 2014
Consider a situation where you have a dial to tweak, and this dial setting may influence a reward of some kind. For example, the dial may be a weight used in a personalization algorithm, and the reward may be clickthrough or revenue. The problem is, we don't know beforehand how the dial affects the reward, and the reward behavior may be noisy. How then can we choose a dial setting that maximizes the reward?
Now let's get a little more concrete: let's say we have a parameter X that can take on any real value between 0 and 1, and the reward depends on X in some way that we don't know beforehand. Additionally, the reward is noisy, so the result of choosing a particular parameter value isn't deterministic.
You may have read a previous post by Sergey in which he describes a similar problem: how to pick from several discrete choices ('bandits') in a situation where the reward depends on the choice you make. The difference here is that instead of several discrete choices, we are picking a value from a continuous range. Therefore:
- There is an infinite number of possible choices.
- In typical cases, one may expect nearby parameter values to produce similar reward; for instance, parameter values 0.4 and 0.4001 should have similar results. In other words, there is a reasonable assumption of smoothness.
A handy way to approach this problem is to model the unknown reward function as an instance of a Gaussian Process - this method is called kriging or Gaussian Process Regression (GPR). Kriging/GPR literature and online tutorials are widely available, as are open-source implementations such as the excellent scikit-learn package by Dubourg & Vanderplas. I don't want to duplicate the math here, but I do want to convey an intuition for how kriging works and how we can use it.
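To make "an instance of a Gaussian Process" concrete, here is a small sketch using scikit-learn (the modern API, which may differ from the version current when this was written). It draws a few functions from a GP prior with a smooth RBF kernel; each sample is one plausible shape the unknown reward curve could take. The length scale of 0.2 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Before any data arrives, the GP prior is a distribution over smooth
# functions; each sample below is one plausible reward curve.
grid = np.linspace(0, 1, 100).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2))
samples = gp.sample_y(grid, n_samples=3, random_state=0)  # shape (100, 3)
```

Plotting the columns of `samples` against `grid` gives a feel for the family of functions the smoothness assumption admits.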
Let's say that in a given situation, we have already made a few (noisy) observations - that is, we have seen what reward we obtained for a few different choices of parameter value. Our challenge is what parameter value to pick next.
The figure below shows a simulated experiment after three data points have been collected. The parameter range is shown along the x-axis and reward is plotted on the y-axis. The nominal reward function is known to us (indicated by the dashed red line), but not to the kriging algorithm. All it knows are the actual observations, indicated by the solid red dots. Since the reward is noisy, you will notice that the red dots do not fall exactly on the dashed red line. Our algorithm will need to deal with the noise appropriately.
The three data points don't give us a ton of information, but do provide some basis for inference. Using them, along with a Gaussian Process model with an appropriate kernel, we can establish a 95% confidence region for the "true" function; this confidence region is shown in purple. At the three points where we have already tried a particular choice of parameter, we have a good degree of confidence about where the reward should be, so the confidence band is narrow at these places. Not pinched to a point, because we know there is noise, but narrow. And because of the smoothness assumption mentioned earlier, the confidence band remains fairly narrow close to the known data points.
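A sketch of how such a confidence region can be computed with scikit-learn's `GaussianProcessRegressor`; the toy reward function, kernel, and noise levels here are illustrative assumptions, not the ones we use in production:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(3, 1))                    # three parameter choices
y = np.sin(3 * X).ravel() + rng.normal(0, 0.1, 3)     # noisy observed rewards

# RBF encodes smoothness; WhiteKernel models the observation noise, so
# the fit is not forced to pass exactly through the data points.
kernel = 1.0 * RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

grid = np.linspace(0, 1, 200).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std   # 95% confidence band
```

The `WhiteKernel` term is what keeps the band from pinching to zero width at the observed points.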
This is a good place to remind ourselves that our goal is not to accurately determine the reward function everywhere. Rather, our goal is to maximize reward. However, since we don't know the reward function a priori, we need to explore. How do we manage the exploration-exploitation tradeoff?
There is a simple and elegant way to do this: pick the parameter value where the upper confidence bound (UCB) is highest. This approach is well described by Srinivas et al. In the case above, it suggests the next parameter value we should pick, shown by the dashed green line. Then we gain another observation, repeat the process, and so on. This is thus an online learning method; we are learning as we go. The movie below shows this in action:
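The whole loop can be sketched as follows, again with scikit-learn and a made-up single-bump reward function; the UCB coefficient of 1.96, the kernel settings, and the iteration count are illustrative choices, not DAPS internals:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)

def true_reward(x):
    # Hidden from the optimizer: a single bump peaking at x = 0.7.
    return np.exp(-(x - 0.7) ** 2 / 0.02)

def observe(x):
    # Noisy measurement of the reward at parameter value x.
    return true_reward(x) + rng.normal(0, 0.05)

grid = np.linspace(0, 1, 200).reshape(-1, 1)
kernel = 1.0 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01)

# Seed with a few random probes, then repeatedly refit the GP and
# try the grid point whose upper confidence bound is highest.
X = [float(x) for x in rng.uniform(0, 1, size=3)]
y = [observe(x) for x in X]
for _ in range(15):
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.array(X).reshape(-1, 1), y)
    mean, std = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(mean + 1.96 * std), 0]   # the UCB rule
    X.append(float(x_next))
    y.append(observe(x_next))

best = X[int(np.argmax(y))]   # best parameter value observed so far
```

Early on, the standard-deviation term dominates and the rule explores; as the posterior tightens around promising regions, the mean term takes over and the rule exploits.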
Some care is needed when designing an optimization algorithm based on this method. For one thing, it is important to use an appropriate noise formulation for the case at hand. Another thing to consider is drift: the reward function itself may change over time (and usually does in our world at RichRelevance). One way to deal with this is to treat recent observations as more important than old ones, but this must be handled carefully as well.
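One hypothetical way to down-weight stale observations (a sketch, not necessarily what DAPS does) is to inflate the per-sample noise variance for older data via scikit-learn's `alpha` parameter, so the posterior trusts recent points more; the age schedule below is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(8, 1))
y = np.sin(3 * X).ravel() + rng.normal(0, 0.1, size=8)

# ages[i] = how many rounds ago observation i was made; older points
# get a larger added noise variance, so the posterior trusts them less.
ages = np.arange(8)[::-1].astype(float)      # oldest observation first
alpha = 0.01 * np.exp(ages / 4.0)            # per-sample noise inflation
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=alpha)
gp.fit(X, y)
```

Too aggressive a decay throws away information; too gentle a decay lets the model keep chasing a reward function that no longer exists.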
Based on the ideas above, we created a parameter optimization engine for our use case, which we call DAPS (for Dynamic Algorithm Parameter Selection). DAPS is robust, executes in real-time and deals with live incoming data.
DAPS has been deployed to work with one of our major recommendation strategies, and the results are compelling: it produces a reliable revenue lift of 1-2% for this strategy. If you are a RichRelevance client, you are already seeing the benefits of this technology.
References:
- Carl E. Rasmussen and Christopher K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. (Also available online.)
- Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012, pp. 515-542.
- Chris Fonnesbeck, Non-parametric Bayes: Gaussian Processes (online tutorial)
- Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger, "Gaussian process optimization in the bandit setting: no regret and experimental design", Proc. International Conference on Machine Learning (ICML), 2010. (Also available online.)