Abstract
In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used Hyper Log Log algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue of k-partition, is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. A simple and efficient construction for invertible bloom filters and uniform hashing on a given set.
Original language | English |
---|---|
Title of host publication | Proceedings - 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015 |
Number of pages | 19 |
Volume | 2015-December |
Publisher | IEEE Computer Society Press |
Publication date | 11 Dec 2015 |
Pages | 1292-1310 |
Article number | 7354457 |
ISBN (Electronic) | 9781467381918 |
DOIs | |
Publication status | Published - 11 Dec 2015 |
Event | 56th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2015 - Berkeley, United States Duration: 18 Oct 2015 → 20 Oct 2015 Conference number: 56 |
Conference
Conference | 56th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2015 |
---|---|
Number | 56 |
Country/Territory | United States |
City | Berkeley |
Period | 18/10/2015 → 20/10/2015 |
Sponsor | IEEE |
Keywords
- concentration bounds
- constant moments
- Hashing
- invertible bloom filters
- joint distributions
- k-partition
- peelability
- tabulation hashing
- uniform hashing