Hashing for Statistics over K-Partitions

Soren Dahlgaard, Mathias Baek Tejs Knudsen, Eva Rotenberg, Mikkel Thorup

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

Abstract

In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used Hyper Log Log algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue of k-partition, is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. A simple and efficient construction for invertible bloom filters and uniform hashing on a given set.

Original languageEnglish
Title of host publicationProceedings - 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015
Number of pages19
Volume2015-December
PublisherIEEE Computer Society Press
Publication date11 Dec 2015
Pages1292-1310
Article number7354457
ISBN (Electronic)9781467381918
DOIs
Publication statusPublished - 11 Dec 2015
Event56th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2015 - Berkeley, United States
Duration: 18 Oct 201520 Oct 2015
Conference number: 56

Conference

Conference56th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2015
Number56
Country/TerritoryUnited States
CityBerkeley
Period18/10/201520/10/2015
SponsorIEEE

Keywords

  • concentration bounds
  • constant moments
  • Hashing
  • invertible bloom filters
  • joint distributions
  • k-partition
  • peelability
  • tabulation hashing
  • uniform hashing

Fingerprint

Dive into the research topics of 'Hashing for Statistics over K-Partitions'. Together they form a unique fingerprint.

Cite this