The essential idea of Rahklin et al.’s approach to statistical learning theory for dependent sequences is to view learning as a 2-player game between the learner and Nature. This hearkens back to a classic way of doing worst-case analysis of online algorithms. In that scenario, the algorithm chooses some strategy $i$ and Nature (a.k.a. the Adversary) chooses some sequence $j$ to give the algorithm. This results in some loss $a_{ij}$ for the algorithm. Forming a (possibly infinite) matrix $A = (a_{ij})$, if the algorithm chooses a strategy according to $p$ (distribution over rows of $A$) and Nature according to $q$ (distribution of columns), then we are interested in the *value *of the game,

\[ V = \min_p \max_q pAq = \max_q \min_p pAq.\]

Here, the algorithm tries to minimize its loss while Nature tries to maximize the algorithm’s loss. The value of the game gives an upper bound for the loss (and hence performance) of the algorithm.

The situation is a little more complicated for online learning. Let $\mathcal F$ be the moves of the learner and $\mathcal X$ be the moves of Nature. For a finite number of rounds $t = 1,\dots, T$, the learner and Nature choose $f_t \in \mathcal F$ and $x_t \in \mathcal X$ and then observe each other’s choices. These move pairs are used to define a loss function $\ell : \mathcal F \times \mathcal X \to \mathbb R$. The overall loss for the game is the *regret *

\[\textbf{Reg}_{T} := \textstyle\sum_{t=1}^{T} \ell(f_{t}, x_{t}) – \inf_{f \in \mathcal F} \sum_{t=1}^{T} \ell(f, x_{t}).\]

The second term relativizes the loss so that the algorithm is compared to how well you could have done if you had chosen the best possible $f \in \mathcal F$ from the start. Unlike in the algorithms case, the value of the online learning “game” has interleaved min’s and max’s (well, inf’s and sup’s), since there are $T$ rounds in the game:

\[\mathcal{V}_T(\mathcal F) = \inf_{p_1 \in \Delta(\mathcal F)} \sup_{x_1 \in \mathcal X} \mathbb E_{f_1 \sim p_1} \cdots \inf_{p_T \in \Delta(\mathcal F)} \sup_{x_T \in \mathcal X} \mathbb E_{f_T \sim p_T} \textbf{Reg}_T,\]

where $\Delta(\mathcal F)$ is the set of all distributions over $\mathcal F$. In the learning scenario, the expectation plays the role as $pAq$ and Nature’s move can be deterministic while still giving the same value.

What Rahklin et al. do is consider *relaxations*, i.e. upper bounds, to $\mathcal{V}_T(\mathcal F)$ (well, actually the bound a conditional form of $\mathcal{V}_T(\mathcal F)$). These relaxations are chosen to be more manageable to deal with analytically. At the $t$-th stage of the game, instead of finding $p_t$ based on the expression for $\textbf{Reg}_T$, an approximate $p_t$ is calculated. This $p_t$ is the distribution over $\mathcal F$ that minimizes an expression that involves the maximum of the relaxation plus an expected loss term. Since the relaxation can be handled analytically, one now has an actual algorithm for calculating $p_t$, which can be used to make a prediction. The authors derive sometimes improved versions of classic algorithms as well as some new ones. For example, they give a parameter-free version of exponential weights which optimizes the learning rate in an online manner. Specifically, in round $t$, for each $f \in \mathcal F$

\[ p_t(f) \propto \exp(-\lambda_{t-1}^* \textstyle\sum_{s=1}^{t-1} \ell(f, x_s)) \]

where

\[ \lambda_{t-1}^* = \arg\min_{\lambda > 0} \left\{ \frac{1}{\lambda} \log \left (\sum_{f \in \mathcal F} \left(- \lambda\textstyle\sum_{s=1}^{t-1} \ell(f, x_s) \right)\right) + 2\lambda (T – t) \right \} \]

is the optimal learning rate. The relaxation can then be used to bound the regret of the algorithm by $2 \sqrt{2 T \log |\mathcal F|}$. There are many other algorithms discussed, including mirror descent, follow the perturbed leader, and a matrix completion method. I highly recommend checking out the paper for the rest of the details and other examples.