Cuckoo hashing

From Wikipedia, the free encyclopedia

Jump to: navigation, search
Cuckoo hashing example. The arrows show the alternative location of each key. A new item would be inserted in the location of A by moving A to its alternative location, currently occupied by B, and moving B to its alternative location which is currently vacant. Insertion of a new item in the location of H would not succeed: Since H is part of a cycle (together with W), the new item would get kicked out again.

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table. Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001.[1] The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches.

Contents

[edit] Theory

The basic idea is to use two hash functions instead of only one. This provides two possible locations in the hash table for each key.

When a new key is inserted, a greedy algorithm is used: The new key is inserted in one of its two possible locations, "kicking out", that is, displacing any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place[2] using new hash functions.

Lookup requires inspection of just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup.

It can also be shown that insertions succeed in expected constant time,[1] even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.

Generalizations of cuckoo hashing that use more than 2 alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%. Another generalization of cuckoo hashing consists in using more than one key per bucket. Using just 2 keys per bucket permits a load factor above 80%.

Other algorithms that use multiple hash functions include the Bloom filter. Cuckoo hashing can be used to implement a data structure equivalent to a Bloom filter.

A study by Zukowski et al.[3] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[4] has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. However as of 2007 cuckoo hashing remains largely unknown outside the research community.

The performance of the bucketized cuckoo hash table for maintaining a large dictionary of integers in memory was investigated empirically by Askitis [5], who compared its performance against alternative hashing schemes tables, including linear probing and the cache-conscious array hash table. When storing a large set of distinct keys, bucketized cuckoo hashing was observed to be faster to build, search and delete than the equivalent chained and array hash tables. However, the bucketized cuckoo hash table was space-intensive relative to the array hash table. In addition, it was not as scalable, since the total number of distinct keys storeable is bounded by the number of slots available, their capacity, and the extra vacant slots needed to prevent an irresolvable collision. Hence, in order to cater for an unexpected increase in the number of distinct keys processed, for example, the bucketized cuckoo hash table will need to be re-sized which can be both expensive and space-intensive, particularly in a dynamic environment.

Askitis also tested the performance of these hash tables under heavy skew access (i.e., when few keys are repeatedly searched over a period of time, which is typically observed in practice). In this case, and despite its constant worst-case look-up cost, the bucketized cuckoo hash table was demonstrated to be the slowest hash table, far inferior to both the chained and array hash tables. Indeed, the fastest hash table for both distinct and skew data distribution was linear probing, though it too is not a scalable option relative to the array hash table; and although more space-efficient than bucketized cuckoo hashing, it also requires a surplus of empty slots.

[edit] See also

[edit] References

  1. ^ a b Pagh, Rasmus; Rodler, Flemming Friche (2001) (PDF, PS). Cuckoo Hashing. doi:10.1.1.25.4189. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.4189. Retrieved on 2008-10-16. 
  2. ^ Pagh and Rodler: "There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table."
  3. ^ Zukowski, Marcin; Heman, Sandor; Boncz, Peter (2006-06) (pdf). Architecture-Conscious Hashing. Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). http://www.cs.cmu.edu/~damon2006/pdf/zukowski06archconscioushashing.pdf. Retrieved on 2008-10-16. 
  4. ^ Ross, Kenneth (2006-11-08) (pdf). Efficient Hash Probes on Modern Processors. IBM Research Report RC24100. RC24100. http://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf. Retrieved on 2008-10-16. 
  5. ^ Askitis, Nikolas (2009), "Fast and Compact Hash Tables for Integer Keys", Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009) 91: 113-122, ISBN 978-1-920682-72-9, http://www.crpit.com/VolumeIndexU.html#Vol91 

[edit] External links

Personal tools
Languages