The log marginal likelihood is a central object for Bayesian inference with latent variable models: $\log p(x \mid \theta) = \log \int p(x, z \mid \theta)\, dz$, where $x$ are observations, $z$ are latent variables, and $\theta$ are parameters. Variational inference tackles this problem by approximating the posterior over $z$ with a simpler density $q(z)$. Often this density has a factored structure, for example $q(z) = \prod_i q_i(z_i)$. The approximating density is fit by maximizing a lower bound on the log marginal likelihood, or “evidence” (hence ELBO = evidence lower bound): $\log p(x \mid \theta) \geq \mathbb{E}_{q(z)}[\log p(x, z \mid \theta) - \log q(z)]$. The hope is that this bound will be tight enough to serve as a proxy for the marginal likelihood when reasoning about $\theta$. The ELBO is typically derived in one of two ways: via Jensen’s inequality or by writing down the …
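To make the bound concrete, here is a minimal sketch on a toy model chosen purely for illustration (it is an assumption, not taken from the post): $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, with no free parameters. In this model the evidence is $\mathcal{N}(x; 0, 2)$ and the ELBO for a Gaussian $q(z) = \mathcal{N}(m, s^2)$ has a closed form, so the gap can be checked directly. The function names `log_evidence` and `elbo` are hypothetical.

```python
import numpy as np

def log_evidence(x):
    """Exact log p(x) for z ~ N(0, 1), x | z ~ N(z, 1), i.e. x ~ N(0, 2)."""
    return -0.5 * np.log(4.0 * np.pi) - x ** 2 / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for q(z) = N(m, s2) under the toy model above."""
    # E_q[log p(z)] with p(z) = N(0, 1)
    expected_log_prior = -0.5 * np.log(2.0 * np.pi) - 0.5 * (m ** 2 + s2)
    # E_q[log p(x | z)] with p(x | z) = N(z, 1)
    expected_log_lik = -0.5 * np.log(2.0 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Entropy of q, equal to -E_q[log q(z)]
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * s2)
    return expected_log_prior + expected_log_lik + entropy

x = 1.3
# The exact posterior is N(x / 2, 1 / 2); there the bound is tight.
gap_at_posterior = log_evidence(x) - elbo(x, x / 2.0, 0.5)
# Any other q leaves a positive gap, equal to KL(q || posterior).
gap_elsewhere = log_evidence(x) - elbo(x, 0.0, 1.0)
print(gap_at_posterior, gap_elsewhere)
```

The gap vanishes only when $q$ matches the exact posterior, which is why maximizing the ELBO over $q$ doubles as posterior approximation.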
The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz. Suppose that we have a sequence of real-valued random variables . Define the random variable (1) to be a scaled sum of the first variables in the sequence. Now, we would like to make interesting statements about the sequence (2)
It often comes up in neural networks, generalized linear models, topic models, and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation: $\pi_k = \exp(x_k) / \sum_{j=1}^K \exp(x_j)$, where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_k \pi_k = 1$. The $x_k$ parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such …
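A minimal sketch of the "softmax, then draw" recipe described above, assuming NumPy. The names `softmax` and `sample_discrete` and the example vector are illustrative choices. Subtracting the maximum before exponentiating is a standard numerical-stability step and also shows the non-uniqueness: shifting every entry by a constant leaves the distribution unchanged.

```python
import numpy as np

def softmax(x):
    """Map an unconstrained real vector to a point on the simplex."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)      # shift by a constant for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def sample_discrete(x, size, rng):
    """Draw category indices from the discrete distribution softmax(x)."""
    pi = softmax(x)
    return rng.choice(len(pi), size=size, p=pi)

rng = np.random.default_rng(0)
x = np.array([2.0, -1.0, 0.5])   # unconstrained: entries may be negative
pi = softmax(x)
draws = sample_discrete(x, 10_000, rng)
freqs = np.bincount(draws, minlength=len(pi)) / draws.size
print(pi, freqs)
```

The empirical frequencies should sit close to `pi`, and `softmax(x + c)` returns the same vector for any constant `c`.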
This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi(x)$, with $x \in \mathcal{X}$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state $x$ to a proposed state $x'$ with probability $\min(1, \pi(x') / \pi(x))$. But what if we cannot evaluate $\pi(x)$ exactly? Such a situation might arise if we are given a joint density function $\pi(x, z)$, with $z \in \mathcal{Z}$, and we must marginalize out $z$ in order to compute $\pi(x)$. In this situation, we may only be able to approximate
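The setup above can be sketched in a few lines. This is an illustration under stated assumptions rather than the post's own example: the joint density is a toy one ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$, so the true marginal is $\mathcal{N}(0, 2)$), the marginal is estimated by a simple unbiased Monte Carlo average, and the names `estimate_marginal` and `pseudo_marginal_mh` are hypothetical. The one essential detail is that the noisy estimate for the current state is stored and reused, not recomputed.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def estimate_marginal(x, rng, num_z=10):
    """Unbiased, nonnegative Monte Carlo estimate of the marginal density at x."""
    z = rng.standard_normal(num_z)          # z ~ N(0, 1)
    return normal_pdf(x, z, 1.0).mean()     # average of p(x | z) over draws of z

def pseudo_marginal_mh(num_steps, step_size, rng):
    x = 0.0
    density_hat = estimate_marginal(x, rng)  # estimate attached to the current state
    chain = np.empty(num_steps)
    for t in range(num_steps):
        x_prop = x + step_size * rng.standard_normal()   # symmetric proposal
        density_hat_prop = estimate_marginal(x_prop, rng)
        # Metropolis-Hastings ratio with estimates in place of exact densities.
        if rng.random() < density_hat_prop / density_hat:
            x, density_hat = x_prop, density_hat_prop     # keep the new estimate
        chain[t] = x
    return chain

rng = np.random.default_rng(0)
chain = pseudo_marginal_mh(20_000, 2.0, rng)
print(chain.mean(), chain.var())   # compare with the true marginal N(0, 2)
```

Re-estimating the density of the current state at every iteration would give a different, generally biased, algorithm; carrying the old estimate along is what keeps the exact marginal as the stationary distribution.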
If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff’s bound or its cousins). My favorite form of Chernoff’s bound is: for independent binary random variables, and , and , then Note that are not necessarily identically distributed, they just have to be independent. In practice, we often care about significant deviation from the mean, so is typically larger than . In the standard applications, the stochastic system has size and an event of interest, , has expectation . The … Read More
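As a quick sanity check of the "average behavior plus concentration" recipe, the sketch below simulates a sum of independent, non-identically distributed binary variables and compares its empirical upper tail with a Chernoff bound. The specific form used, $P(X \geq (1+\delta)\mu) \leq \left(e^{\delta} / (1+\delta)^{1+\delta}\right)^{\mu}$ for $\delta > 0$, is one standard multiplicative version and is an assumption here; it may not be the exact form the post has in mind. The success probabilities and sizes are arbitrary illustrative choices.

```python
import numpy as np

def chernoff_upper_tail(mu, delta):
    """One standard multiplicative Chernoff bound on P(X >= (1 + delta) * mu)."""
    return (np.exp(delta) / (1.0 + delta) ** (1.0 + delta)) ** mu

rng = np.random.default_rng(0)
n, trials = 200, 20_000
p = rng.uniform(0.05, 0.3, size=n)   # independent but not identically distributed
mu = p.sum()                          # expectation of the sum

# Each row is one realization of the n binary variables; sum across columns.
X = (rng.random((trials, n)) < p).sum(axis=1)

results = []
for delta in (0.25, 0.5, 1.0):
    empirical = float(np.mean(X >= (1.0 + delta) * mu))
    bound = float(chernoff_upper_tail(mu, delta))
    results.append((delta, empirical, bound))
    print(f"delta={delta}: empirical tail {empirical:.5f}, bound {bound:.5f}")
```

The bound is loose for small deviations and tightens quickly as `delta` grows, which matches the observation that the interesting regime is a significant deviation from the mean.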
I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic. The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form $p(x) \propto \mathcal{N}(x; \mu, \Sigma)\, L(x)$ (1), where $\mathcal{N}(x; \mu, \Sigma)$ is a multivariate Gaussian prior with mean $\mu$ and covariance matrix $\Sigma$, and $L(x)$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution as (2)
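For background, here is a minimal sketch of one update of standard elliptical slice sampling (the original algorithm of Murray, Adams, and MacKay, not the generalized or parallel version from the paper). It assumes NumPy, a Gaussian prior given by its mean and a Cholesky factor of its covariance, and a user-supplied log-likelihood; the function name `elliptical_slice_step` and the toy target are illustrative.

```python
import numpy as np

def elliptical_slice_step(x, log_lik, mu, chol, rng):
    """One elliptical slice sampling update for p(x) ∝ N(x; mu, Sigma) L(x).

    chol is a lower-triangular factor with chol @ chol.T == Sigma.
    """
    nu = mu + chol @ rng.standard_normal(x.shape[0])   # auxiliary prior draw
    log_y = log_lik(x) + np.log(rng.random())          # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)              # initial angle
    lo, hi = theta - 2.0 * np.pi, theta                # bracket containing 0
    while True:
        # Point on the ellipse through the current state x and the draw nu.
        x_new = mu + (x - mu) * np.cos(theta) + (nu - mu) * np.sin(theta)
        if log_lik(x_new) > log_y:
            return x_new
        # Rejected: shrink the bracket toward theta = 0 (the current state).
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# Toy target: prior N(0, I), likelihood N(y; x, I), so the posterior is N(y / 2, I / 2).
rng = np.random.default_rng(0)
y = np.array([1.0, -2.0])
mu, chol = np.zeros(2), np.eye(2)
log_lik = lambda x: -0.5 * np.sum((y - x) ** 2)

x = mu.copy()
samples = np.empty((5000, 2))
for t in range(5000):
    x = elliptical_slice_step(x, log_lik, mu, chol, rng)
    samples[t] = x
print(samples.mean(axis=0), samples.var(axis=0))
```

The update has no step size to tune and never rejects outright: the bracket always contains the angle 0, which corresponds to the current state, so the loop terminates.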
As the title of the post suggests, this week I will discuss a geometric intuition for Markov’s inequality, which for a nonnegative random variable $X$ and any $a > 0$ states $P(X \geq a) \leq \mathbb{E}[X] / a$. This is a simple result in basic probability that still felt surprising every time I used it… until very recently. (Warning: Basic measure theoretic probability lies ahead. These notes look like they provide sufficient background if this post is confusing and you are sufficiently motivated!)
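A small numerical check of Markov's inequality, $P(X \geq a) \leq \mathbb{E}[X] / a$ for nonnegative $X$ and $a > 0$, may help before the geometric picture. The sketch assumes NumPy; the exponential distribution and the thresholds are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=100_000)   # nonnegative draws
mean = samples.mean()

thresholds = [1.0, 2.0, 4.0, 8.0]
tail = [float(np.mean(samples >= a)) for a in thresholds]   # empirical P(X >= a)
bound = [float(mean / a) for a in thresholds]               # Markov bound E[X] / a

for a, t, b in zip(thresholds, tail, bound):
    print(f"a={a}: tail {t:.4f} <= bound {b:.4f}")
```

The inequality holds for the empirical distribution itself, so the comparison never fails by sampling noise; what the numbers show is how loose the bound can be for a light-tailed distribution.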
Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation.
An exponential family parametrized by $\theta$ is the set of probability distributions that can be expressed as $p(x \mid \theta) = \frac{1}{Z(\theta)} h(x) \exp\left(\theta^\top \phi(x)\right)$ for given functions $Z(\theta)$ (the partition function), $h(x)$, and $\phi(x)$ (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified.
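The "fixed-size vector of sufficient statistics" property can be illustrated with a univariate Gaussian, whose sufficient statistics are $x$ and $x^2$. The sketch below, assuming NumPy and using hypothetical helper names, streams data in batches, keeps only three running numbers (count, sum, sum of squares), and recovers the same maximum-likelihood estimates as a pass over the full dataset.

```python
import numpy as np

def sufficient_stats(batch):
    """Count, sum of x, and sum of x^2 for a batch of scalar observations."""
    batch = np.asarray(batch, dtype=float)
    return np.array([batch.size, batch.sum(), np.sum(batch ** 2)])

def gaussian_mle(stats):
    """Maximum-likelihood mean and variance from accumulated statistics."""
    n, sum_x, sum_x2 = stats
    mean = sum_x / n
    var = sum_x2 / n - mean ** 2
    return mean, var

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

# Process the data in 50 batches; memory use stays at three numbers.
stats = np.zeros(3)
for batch in np.array_split(data, 50):
    stats += sufficient_stats(batch)

mean, var = gaussian_mle(stats)
print(mean, var)
```

However many observations arrive, the summary stays the same size, and statistics from separate batches combine by simple addition.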