Hashing#

Hashing is a method for compressing information from a high-dimensional space into a smaller one. It is used throughout computer science for many tasks. For instance, if two documents are (randomly) hashed to the same code, it is very likely that they are exactly the same. In computer vision, images are sometimes hashed in clever ways so that similar or related images can be found through their codes.

Hashing goes all the way back to Shannon, the father of information theory, who looked at random hashes in his source coding theorem. There are also interesting connections to compressed sensing which have not been fully explored as yet.

In deep learning and machine learning, there has been increasing interest in applying hashing to speed up computation. Here are some examples.

Hashing Trick#

Wikipedia article on Feature Hashing

Let us suppose that we represent documents as bags of words. Essentially, a bag of words is a large vector whose \(n\)-th coordinate counts the number of times the \(n\)-th word of a given dictionary occurs in the document. Doing classification with such large vectors can be computationally intensive, so the hashing trick randomly maps each word in the dictionary down to a smaller set of indices. The bag of words is then mapped into a smaller vector by adding the coordinates of words that hash to the same index. One can show that if the original bag of words is sparse, then the probability of a collision in the hash is very small. Moreover, for many linear machine learning methods such as linear regression and support vector machines, the performance of the algorithm degrades only very slightly under the hash, while the savings in computation time are substantial.
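
As a rough illustration, here is a minimal Python sketch of the hashing trick. The function name `hashed_bag_of_words`, the bucket count, and the use of MD5 as the underlying hash are illustrative assumptions, not taken from any particular library.

```python
# A minimal sketch of the hashing trick for bag-of-words vectors.
import hashlib
from collections import Counter

def hashed_bag_of_words(tokens, n_buckets=2**10):
    """Map a list of word tokens to a vector of length n_buckets by
    summing the counts of words that hash to the same index."""
    vec = [0] * n_buckets
    for word, count in Counter(tokens).items():
        # Hash the word to a bucket index in [0, n_buckets).
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        idx = int(digest, 16) % n_buckets
        vec[idx] += count
    return vec

doc = "the quick brown fox jumps over the lazy dog".split()
x = hashed_bag_of_words(doc, n_buckets=32)
print(sum(x))  # total token count (9) is preserved under the hash
```

When the document touches far fewer words than there are buckets, most words land in their own bucket, which is the sparsity argument made above.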

Vowpal Wabbit#

Github Wiki for Vowpal Wabbit

This is not a kind of hash, but a software package for machine learning that uses the hashing trick extensively. The hash used acts as a type of random projection (MD5).

Minwise Hashing#

Hashing Algorithms for Large Scale Learning

The paper above studies \(b\)-bit minwise hashing, a variant of the minwise hash. In the minwise hash, we first fix permutations \(\pi_1, \pi_2, \ldots, \pi_k\) of the dictionary. Given a document \(S\), we compute the minimum nonzero index of \(\pi_i(S)\) for each \(i\), storing each index using 64 bits. This collection of indices is the minwise hash. The \(b\)-bit version stores only the lowest \(b\) bits of each minimum index instead of the full 64 bits. The authors apply their hash to SVMs and other linear methods, and they claim that it performs better than the random projections used by Vowpal Wabbit.
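
Here is a minimal Python sketch of such a \(b\)-bit signature. The function name, the parameter defaults, and the use of explicitly shuffled permutations (rather than the faster hash-based approximations used in practice) are assumptions for illustration only.

```python
# A minimal sketch of b-bit minwise hashing, assuming a document is
# represented as a set of word indices into a fixed dictionary.
import random

def minwise_signature(doc_indices, dict_size, k=4, b=8, seed=0):
    """Return a b-bit minwise signature of a set of dictionary indices."""
    rng = random.Random(seed)
    signature = []
    for _ in range(k):
        # pi_i: a random permutation of the dictionary positions.
        perm = list(range(dict_size))
        rng.shuffle(perm)
        # Minimum permuted index over the words present in the document.
        min_index = min(perm[i] for i in doc_indices)
        # b-bit version: keep only the lowest b bits of that index.
        signature.append(min_index & ((1 << b) - 1))
    return signature

# Similar documents agree on many minimum indices, hence on much of the signature.
doc_a = {0, 3, 7, 12, 25}
doc_b = {0, 3, 7, 12, 30}
print(minwise_signature(doc_a, dict_size=50))
print(minwise_signature(doc_b, dict_size=50))
```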

Compressing Neural Networks#

Compressing Neural Networks with the Hashing Trick

Binary Embeddings with Structured Hash Projections

A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding

One important problem is fitting the large neural networks trained for facial or image recognition onto mobile phones. The challenge lies in storing the many neural weights. One approach is to randomly pick matrix structures in which the values of some entries are tied together, and to use such matrices as the weight matrices when training the network. Optimal values of the tied weights are then learned by the training algorithm. Viewed from this perspective, the minwise hash or the random hashes used by Vowpal Wabbit correspond to fat \(0-1\) matrices in which each column (one per input neuron) has exactly one nonzero entry, whose position is chosen at random. If we multiply a square matrix of free weights on the left of this fat matrix, we obtain a fat matrix whose entries must come from the original square matrix.
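
To make the last point concrete, the following NumPy sketch builds such a fat \(0-1\) matrix and multiplies a square matrix of free weights on its left. The shapes and variable names are illustrative assumptions, not taken from the cited papers.

```python
# A minimal sketch of the weight-tying view described above.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 16, 4  # many input neurons hashed down to a few rows

# Fat 0-1 hash matrix H: each column picks one of the n_hidden rows at random.
H = np.zeros((n_hidden, n_in))
H[rng.integers(0, n_hidden, size=n_in), np.arange(n_in)] = 1.0

# Square matrix W of free (learnable) weights.
W = rng.standard_normal((n_hidden, n_hidden))

# Effective weight matrix: column j of (W @ H) is a column of W,
# so all of its entries are tied to the small set of free parameters in W.
V = W @ H
assert V.shape == (n_hidden, n_in)
print(np.unique(np.round(V, 6)).size, "distinct values among", V.size, "entries")
```

Every column of `V` is a copy of some column of `W`, so the full weight matrix is stored implicitly through the small square of free parameters together with the random column assignments.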

Acknowledgements#

I would like to thank Binghao, Sai Ganesh, Georgios, Yeow Khiang and Zuozhu for insightful discussions on this topic.