## NIPS Data Science Review Part 1: Similarities Between Histograms

Sometimes when you have histograms you want to find similarities between them. A classic machine learning example is when you need a kernel (a fancy name for a kind of similarity function) for a SVM classifier [see here and here], and your data happens to have histograms as features. Sure you could still use the favorite RBF (Gaussian) kernel and it should work, but there are kernels specifically designed to measure how similar two normalized histograms are (vectors where all of the elements are non-negative and sum to one).  My favorite has been the histogram intersection kernel (HIK). Given two empirical histograms $\mathbf{p}$ and $\mathbf{q}$ (the $i$ th entry in the vector is the normalized count for the $i$ th bin), the HIK similarity is:

Let's break that down.  The HIK is bounded between 0 and 1, and we can treat it as a measure of how much probability mass $\mathbf{p}$ and $\mathbf{q}$ have in common.  It will be exactly 1 only when  $\mathbf{p} = \mathbf{q}$ , and will be zero if there are exactly zero bins for which both  $\mathbf{p}$ and $\mathbf{q}$ have non-zero entries.  Recalling that the inner product can be written $<\mathbf{p},\mathbf{q}> =\sum_i p_iq_i$ , you can think of the HIK as a kind of non-linearized inner product that takes into account the meaning of probability. And it happens to be a mathematically valid kernel!

With that background out of the way, take a gander at Sign Cauchy Projections and Chi-Square Kernel by Li et al [here].  The paper is about much more than similarities between histograms, but it does begin by discussing a similarity I had not heard of it before.  Namely, the  $\chi^2$ similarity, denoted $\rho_{\chi^2}$ :

which is also bounded between 0 and 1 for normalized histograms.  The paper goes on to compare a number of kernels in a kernelized SVM, demonstrating that the chi-square kernel performs well.  But they do not mention the HIK.  So I got to wondering: how do the two compare?

A practical way to compare kernels is to look at what kind of classification accuracy they provide when used in a kernelized SVM. However, the kernels under discussion here are designed specifically for normalized histograms, and most data sets do not have histograms as their native feature representations. But that can be fixed; I implemented a simplified version of the Coates et al k-means encoding pipeline, which turns an image into a histogram of cluster memberships. That means we can use our histogram kernels!  Which I did!  I compared the HIK, the chi-square kernel, the RBF kernel, and a linear kernel with respect to classification accuracy on three different datasets: Digits, Olivetti Faces, and (a subset of) MNSIT.

The results are pictured in the three plots below.  The $x$ -axis is $C$ , the SVM regularization parameter, and the $y$ -axis is classification accuracy on a held-out test set.  Briefly, the larger the $C$ is, the harder the classifier tries to find a wider boundary between classes; the parameter represents a trade-off between minimizing error on the training set and creating a boundary that we expect will generalize better in the future.  At first glance it looks like that bigger $C$ is better, but note that for two of the datasets, the best accuracy is achieved  somewhere in the middle of the curve. If you make $C$ too large, then the classifier doesn't try hard enough to fit the model such that error is minimized on the training data.

The code is here if you want to look at all the details (it will need a heck of a lot of ram to run).

Interestingly, it looks like the chi-square kernel is slightly better than the HIK for all three datasets, and both of the histogram-oriented kernels tend to be better than the RBF kernel and the linear kernel for a wider range of the $C$ parameter.

Conclusion: the chi-square kernel looks like a winner (at least with respect to this particular application).