# Hashing, streaming and sketching

One of the questions in the air at NIPS 2012 was: how do we make machine learning algorithms scale to large datasets? There are two main approaches: (1) developing parallelizable ML algorithms and integrating them with large parallel systems and (2) developing more efficient algorithms. More often than not, the latter approach requires some sort of relaxation of the underlying task. Hashing, streaming algorithms and sketching are increasingly employed to obtain efficient approximate algorithms for problems that arise in ML tasks. Below, I highlight a few examples, mostly from NIPS 2012, with several coming from the Big Learning workshop.

Nearest neighbor search (or similarity search) appears in many “meta” ML tasks such as information retrieval and near-duplicate detection. Many approximate approaches are based on locality-sensitive hashing (LSH). The basic idea with LSH is to choose a hash function that maps similar items to the same bucket with high probability, so that candidate neighbors are found by looking inside a bucket instead of computing some distance between all pairs of items. For example, some natural language processing (NLP) algorithms depend on nearest-neighbor graphs; LSH can be exploited to efficiently produce approximate nearest-neighbor graphs for large text datasets, e.g., Goyal et al. (2012). While LSH has been around for more than a decade, LSH algorithms are still being developed, e.g., Li et al. (2012) use one permutation hashing to compete with (k-permutation) minwise hashing. A recent area of interest has been in learning a good LSH function automatically from a given dataset; Zheng and Yeung (2012) present one such routine designed for multimodal data.
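To make the bucketing idea concrete, here is a minimal sketch of LSH built on k-permutation minwise hashing with banding. It is only an illustration under my own assumptions, not the method of any paper cited above: the function names, the `(a * x + b) % PRIME` hash family, and the use of `zlib.crc32` to turn tokens into integers are all hypothetical choices.

```python
import random
import zlib
from collections import defaultdict

PRIME = (1 << 61) - 1  # large prime modulus for the illustrative hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining num_hashes hash functions x -> (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    """Return the minwise-hash signature of a non-empty collection of tokens."""
    # crc32 gives a stable integer id per token (Python's hash() is salted per run).
    ids = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def lsh_buckets(signatures, bands, rows):
    """Split each signature into bands; items sharing a whole band share a bucket."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    return buckets


def candidate_pairs(buckets):
    """Collect every pair of items that collide in at least one bucket."""
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Only the pairs returned by `candidate_pairs` need an exact similarity check, which is where the savings over all-pairs comparison come from. Using more rows per band makes collisions between dissimilar items rarer, while using more bands makes it less likely that a truly similar pair is missed.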

When a dataset is too large to fit in memory, it may be possible to process the data sequentially with only a limited amount of memory. Streaming is precisely this computational model, in which algorithms are allowed a single pass or a small number of passes over the data; an excellent tutorial-style overview is by Muthukrishnan (2003). Note that a data stream can often be interpreted as a sequence of ‘slices’ in time or space. Further, a data set may be thought of as streaming along multiple dimensions, e.g., both time and space, or over some arbitrary serialization of data points, e.g., some ordering of the nodes and edges in a large graph. There is active ML research in both applying existing streaming algorithm results and designing new streaming algorithms motivated by ML problems. Examples include an implementation of expectation maximization for a data stream that is discretized into a series of high-frequency batch computations (Hunter et al., 2012) and streaming algorithms for balanced graph partitioning (Stanton, 2012).
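As a small illustration of the one-pass, limited-memory model, here is a sketch of the Misra-Gries frequent-items algorithm. This is a classic streaming algorithm that I am swapping in as an example; it is not one of the methods cited above, and the function name and parameter `k` are my own choices.

```python
def misra_gries(stream, k):
    """One pass over `stream`, keeping at most k - 1 counters.

    Any item that occurs more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The stored counts are
    lower bounds on the true counts, not exact values.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# The stream is consumed exactly once, so a generator or file iterator works too.
heavy = misra_gries(iter("abacabadabaeabaf"), k=3)
```

Memory use depends on `k` and not on the length of the stream. If exact counts are needed, a second pass over the data can verify the surviving candidates.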

A sketch is a compact representation of data, useful for processing large amounts of data or data streams. It is typically lossy and/or focuses on important statistical features of the data (Cormode, 2011). Li et al. (2006) use sketches to construct conditional random samples for sparse data. A sketch is different from a sample; the latter considers only a portion of the entire data, while the former is computed over all the data. For example, the count-min sketch is able to characterize rare events that would be missed by a sample from a long-tailed distribution (Cormode and Muthukrishnan, 2004). Instead of limiting the size of an input dataset to an ML algorithm by sampling, we can design algorithms that work with sketches. For example, Goyal et al. (2012), already mentioned above for their use of LSH, use a count-min sketch to store the approximate counts of words and other entities of interest for an NLP application.
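A count-min sketch fits in a few lines. The version below is a minimal sketch under my own assumptions, not the implementation used by Goyal et al. (2012): the class and method names are hypothetical, and I use salted `hashlib.blake2b` digests as a stand-in for the pairwise-independent hash functions in the original analysis.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using a depth x width table of integers."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting blake2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item` (count must be non-negative)."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return an estimate that never falls below the true count."""
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


cms = CountMinSketch(width=64, depth=4)
for word in "the cat sat on the mat with the hat".split():
    cms.add(word)
```

Every item in the data touches the table, which is why rare items still leave a trace. Collisions can only inflate a cell, so taking the minimum across rows gives an estimate that errs on the high side. A wider table reduces the size of that over-count, and a deeper one makes a large over-count less likely.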

P.S. Special thanks to Zhenming Liu for his very helpful feedback on this post and Michael Mitzenmacher for CS 222, where I developed an appreciation for topics like hashing, streaming and sketching!