Turning Theory into Algorithms

Jonathan Huggins · December 21, 2012

Machine Learning

Some of the common complaints I hear about (learning) theoretical work run along the lines of "those bounds are meaningless in practice," "that result doesn't apply to any algorithm someone would actually use," and "you lost me as soon as martingales/Banach spaces/measure-theoretic niceties/... got involved." I don't have a good answer for the latter concern, but a very nice paper by Sasha Rakhlin, Ohad Shamir, and Karthik Sridharan at NIPS this year goes some ways toward address the first two criticisms. Their paper, "Relax and Randomize: From Value to Algorithms," (extended version here) is concerned with transforming non-constructive online regret bounds into useful algorithms. The paper builds off previous statistical learning theory results for worst-case online learning by authors together with Ambuj Tewari. This preprint provides a summary of results for learning from sequential, dependent data. Their results mirror those for the i.i.d. setting, and include necessary and sufficient conditions for uniform convergence of empirical processes with dependent data; notions of covering numbers, fat-shattering dimension, Rademacher complexity, and other measures of function class complexity in the sequential setting; and a variety of regret bounds based on these quantities.

The essential idea of Rahklin et al.'s approach to statistical learning theory for dependent sequences is to view learning as a 2-player game between the learner and Nature. This hearkens back to a classic way of doing worst-case analysis of online algorithms. In that scenario, the algorithm chooses some strategy $i$ and Nature (a.k.a. the Adversary) chooses some sequence $j$ to give the algorithm. This results in some loss $a_{ij}$ for the algorithm. Forming a (possibly infinite) matrix $A = (a_{ij})$, if the algorithm chooses a strategy according to $p$ (distribution over rows of $A$) and Nature according to $q$ (distribution of columns), then we are interested in the value of the game,

\[ V = \min_p \max_q pAq = \max_q \min_p pAq.\]

Here, the algorithm tries to minimize its loss while Nature tries to maximize the algorithm's loss. The value of the game gives an upper bound for the loss (and hence performance) of the algorithm.

The situation is a little more complicated for online learning. Let $\mathcal F$ be the moves of the learner and $\mathcal X$ be the moves of Nature. For a finite number of rounds $t = 1,\dots, T$, the learner and Nature choose $f_t \in \mathcal F$ and $x_t \in \mathcal X$ and then observe each other's choices. These move pairs are used to define a loss function $\ell : \mathcal F \times \mathcal X \to \mathbb R$. The overall loss for the game is the regret

\[\textbf{Reg}_{T} := \textstyle\sum_{t=1}^{T} \ell(f_{t}, x_{t}) - \inf_{f \in \mathcal F} \sum_{t=1}^{T} \ell(f, x_{t}).\]

The second term relativizes the loss so that the algorithm is compared to how well you could have done if you had chosen the best possible $f \in \mathcal F$ from the start. Unlike in the algorithms case, the value of the online learning "game" has interleaved min's and max's (well, inf's and sup's), since there are $T$ rounds in the game:

\[\mathcal{V}_T(\mathcal F) = \inf_{p_1 \in \Delta(\mathcal F)} \sup_{x_1 \in \mathcal X} \mathbb E_{f_1 \sim p_1} \cdots \inf_{p_T \in \Delta(\mathcal F)} \sup_{x_T \in \mathcal X} \mathbb E_{f_T \sim p_T} \textbf{Reg}_T,\]

where $\Delta(\mathcal F)$ is the set of all distributions over $\mathcal F$. In the learning scenario, the expectation plays the role as $pAq$ and Nature's move can be deterministic while still giving the same value.

What Rahklin et al. do is consider relaxations, i.e. upper bounds, to $\mathcal{V}_T(\mathcal F)$ (well, actually the bound a conditional form of $\mathcal{V}_T(\mathcal F)$). These relaxations are chosen to be more manageable to deal with analytically. At the $t$-th stage of the game, instead of finding $p_t$ based on the expression for $\textbf{Reg}_T$, an approximate $p_t$ is calculated. This $p_t$ is the distribution over $\mathcal F$ that minimizes an expression that involves the maximum of the relaxation plus an expected loss term. Since the relaxation can be handled analytically, one now has an actual algorithm for calculating $p_t$, which can be used to make a prediction. The authors derive sometimes improved versions of classic algorithms as well as some new ones. For example, they give a parameter-free version of exponential weights which optimizes the learning rate in an online manner. Specifically, in round $t$, for each $f \in \mathcal F$

\[ p_t(f) \propto \exp(-\lambda_{t-1}^* \textstyle\sum_{s=1}^{t-1} \ell(f, x_s)) \]

where

\[ \lambda_{t-1}^* = \arg\min_{\lambda > 0} \left\{ \frac{1}{\lambda} \log \left (\sum_{f \in \mathcal F} \left(- \lambda\textstyle\sum_{s=1}^{t-1} \ell(f, x_s) \right)\right) + 2\lambda (T - t) \right \} \]

is the optimal learning rate. The relaxation can then be used to bound the regret of the algorithm by $2 \sqrt{2 T \log |\mathcal F|}$. There are many other algorithms discussed, including mirror descent, follow the perturbed leader, and a matrix completion method. I highly recommend checking out the paper for the rest of the details and other examples.

Share: Twitter, Facebook