The Poisson Estimator

Ryan Adams · December 19, 2012

Much of what we do when we analyze data and invent algorithms is think about estimators for unknown quantities, even when we don't directly phrase things this way.  One type of estimator that we commonly encounter is the Monte Carlo estimator, which approximates expectations via the sample mean.  That is, many problems in which we are interested involve a distribution $\pi$ on a space $\mathcal{X}$, where we wish to calculate the expectation of a function $f(x)$:

\[\hat{f}_{\pi} = \int_{\mathcal{X}} \pi(x)\,f(x)\,\mathrm{d}x \approx \frac{1}{N}\sum_{n=1}^N f(x_n) \text{ where } x_n \sim \pi.\]
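As a concrete sketch (my own example, not from the original post), here is the sample-mean estimator in NumPy for a case where the true answer is known: take $\pi$ to be a standard normal and $f(x) = x^2$, so that $\hat{f}_{\pi} = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# pi is a standard normal and f(x) = x^2, so the true expectation is 1.
N = 100_000
x = rng.standard_normal(N)   # draw x_n ~ pi
estimate = np.mean(x ** 2)   # (1/N) * sum_n f(x_n)
print(estimate)
```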

This is very nice because it gives you an unbiased estimator of $\hat{f}_\pi$.  That is, the expectation of this estimator is the desired quantity.  However, one issue that comes up very often is that we want to find an unbiased estimator of a quantity that is a function of an expectation.  Of course, we know that the expectation of a function is not in general a function of the expectation, so we can't do the easy thing that we'd like to do and wind up with an unbiased estimator.  For me, this is most salient when one wants to compute an unbiased estimate of a geometric mean.  Let's imagine that we have a set of $M$ (where $M$ is very large) positive quantities $\{\gamma_m\}^M_{m=1}$ and we are interested in estimating the geometric mean:

\[\hat{\gamma} = \left(\prod_{m=1}^M\gamma_m\right)^{1/M}.\]

You might reasonably examine this and observe that you can get an unbiased estimator of the log of this quantity with a randomly-chosen subset of $N$ of the $\gamma_m$, i.e.,

\[\log\hat{\gamma} = \frac{1}{M}\sum^M_{m=1}\log \gamma_m \approx \frac{1}{N}\sum^N_{n=1}\log \gamma_n,\]

where the $N$ terms in the second sum are drawn uniformly at random from the full set of $M$.

You might now imagine exponentiating this to get an estimator, but it will be biased: by Jensen's inequality, $\mathbb{E}[e^Z] \geq e^{\mathbb{E}[Z]}$, so on average you will overestimate.  Is there anything to be done?  Are we sunk?  Remarkably, it turns out that there are ways to fix up such estimators so that they are unbiased.  One such example is the Poisson estimator, which I learned about from Fearnhead et al. (2010) and which they attribute to Wagner (1987).  I particularly recommend the general treatment in Papaspiliopoulos (2009). Continuing with the example above, we want to know $\hat{\gamma}$, but we can only estimate $\log\hat{\gamma}$.  Let's imagine that $M$ is so much larger than $N$ that we can get i.i.d. unbiased estimates of $\log\hat{\gamma}$, which I will denote $\lambda_j$.
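A quick numerical illustration of this bias (a sketch with made-up numbers, not from the original post): draw $M$ lognormal values, repeatedly exponentiate the mean log of a random size-$N$ subset, and compare the average of those naive estimates to the true geometric mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: M positive quantities with a computable geometric mean.
M, N = 10_000, 10
gamma = rng.lognormal(mean=0.0, sigma=1.0, size=M)
true_geo_mean = np.exp(np.mean(np.log(gamma)))

# Naive estimator: exponentiate the subsample mean of the logs.
trials = 5_000
naive = np.empty(trials)
for t in range(trials):
    sub = rng.choice(gamma, size=N, replace=False)
    naive[t] = np.exp(np.mean(np.log(sub)))

# Jensen's inequality: E[exp(Z)] >= exp(E[Z]), so the naive estimator
# overshoots the geometric mean on average.
bias = np.mean(naive) - true_geo_mean
print(bias)  # positive
```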

The trick is to compute the randomized product of your estimators:

\[e^{\delta} \prod^J_{j=1} \frac{\lambda_j}{\delta}, \text{ where } J \sim \textrm{Poisson}(\delta)\]

where $\delta > 0$ is a free parameter.  This has expectation $\hat{\gamma} = e^{\hat{\lambda}}$, where $\hat{\lambda} = \mathbb{E}[\lambda_j]$.  You can see this by unrolling the expectation of the estimator, where $\Pr(J=0) = e^{-\delta}$, $\Pr(J=1) = \delta e^{-\delta}$, $\Pr(J=2)=\delta^2 e^{-\delta}/2$, and so on according to the Poisson distribution

\[\Pr(J=j\,|\,\delta) = \frac{1}{j!}e^{-\delta}\delta^j.\]

Unrolling the expectation, the $j$th term contributes $\Pr(J=j)\,e^{\delta}\,(\hat{\lambda}/\delta)^j = \hat{\lambda}^j/j!$, so the expectation of the estimator becomes:

\[1 + \hat{\lambda} + \hat{\lambda}^2/2 + \hat{\lambda}^3/6 + \cdots + \hat{\lambda}^j/j! + \cdots\]

If all of the $\lambda_j$ are independent, then we can see that this is the power series expansion of $e^{\hat{\lambda}}$, so the estimator is unbiased.  I think this is a very cool trick.  In practice, one can introduce another free parameter $c \in \mathbb{R}$ to reduce the variance, so that the final estimator is:

\[e^{\delta + c} \prod^J_{j=1} \frac{\lambda_j - c}{\delta}, \text{ where } J \sim \textrm{Poisson}(\delta).\]
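Putting the pieces together, here is a small NumPy sketch of this estimator (the function name and the choice of Gaussian $\lambda_j$ are illustrative assumptions, not from the original post); averaged over many samples it should recover $e^{\hat{\lambda}}$:

```python
import numpy as np

def poisson_estimator(draw_lambda, delta, c, rng):
    """One sample of the Poisson estimator of exp(E[lambda_j]).

    draw_lambda(rng) must return an independent, unbiased estimate
    lambda_j of lambda_hat; delta > 0 and c are the free parameters.
    (Names here are illustrative, not from the original post.)
    """
    J = rng.poisson(delta)          # J ~ Poisson(delta)
    prod = 1.0
    for _ in range(J):
        prod *= (draw_lambda(rng) - c) / delta
    return np.exp(delta + c) * prod

# Sanity check with noisy estimates lambda_j ~ Normal(0.5, 0.5^2),
# so the target is exp(0.5) ~ 1.6487.
rng = np.random.default_rng(2)
draw = lambda r: 0.5 + 0.5 * r.standard_normal()
samples = [poisson_estimator(draw, delta=2.0, c=0.0, rng=rng)
           for _ in range(100_000)]
estimate = np.mean(samples)
print(estimate)
```

Note that individual samples can be negative even though the target is positive; the estimator is only unbiased in expectation, which hints at the variance problems discussed below.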

Max Welling and his student Anoop Korattikara, as well as some other folks, have been thinking about how this could be used with pseudo-marginal Markov chain Monte Carlo (Andrieu and Roberts, 2009) to get "batch MCMC" that only computes the likelihood for a subset of the data but nevertheless has the full posterior as its equilibrium distribution.  However, Anoop and Max report that the variance of such estimators is sufficiently high that the Markov chain is very likely to get stuck in a bad location due to the occasional overestimation of the likelihood.  It's too bad, because this seems like it could overcome some of the frustrations intrinsic to Monte Carlo inference with very large data sets.
