Gilbert Sebastian Yogiana

Hash Tables

A hash table is a data structure that implements a map.

It uses an array to store key-value pairs, and a hash function that, given a key, computes an index into the array where the associated value can be found.

Operations

Insert: Insert or replace key-value pair
Lookup: Given a key, get its associated value
Delete: Given a key, delete its key-value pair

Performance

Average-case: O(1)
- Assuming good hash function and appropriate resizing
Worst-case: O(n)
- If all keys hash to the same value (extremely unlikely with good hash)

Hashing

Hashing is the process of mapping data of arbitrary size to fixed-size values using a hash function

A hash function:

Maps a key to an index in the range [0, N − 1]
- where N is the size of the array
Must be cheap to compute
Is deterministic
- Given the same key, will always return the same index
Ideally, maps keys uniformly over the range of indices

A hash collision occurs when for two keys x and y, x $\neq$ y, but h(x) = h(y).

Collision Resolution

Important statistic: load factor (α)

Ratio of items to slots; α = M/N
Useful when analysing collision resolution methods

Separate Chaining

Resolve collisions by having multiple items per array slot. Each array slot contains a linked list of items that are hashed to that index.

Cost analysis:

N array slots, M items
Average list length L = M/N
Best case: Items evenly distributed, so maximum list length is $\lceil M/N \rceil$
- Cost of insert/lookup/delete: $O(M/N)$
Worst case: One list of length M
- Cost of insert/lookup/delete: $O(M)$

Average costs:

If good hash and α ≤ 1, cost is $O(1)$
If good hash and α > 1, cost is $O(M/N)$
- To avoid degrading perfomance, hash table should be resized when α ≈ 1

Linear Probing

Resolve collisions by finding a new slot for the item

Each array slot stores a single item (unlike separate chaining)
On a hash collision, try next slot, then next, until an empty slot is found
Insert item into empty slot

Two primary methods for deletion:

Backshift

Remove and re-insert all items between the deleted item and the next empty slot

Tombstone

Replace the deleted item with a “deleted” marker (AKA a tombstone) that:
Is treated as empty during insertion
Is treated as occupied during lookup

Backshift method:

Moves items closer to their hash index
- Thus reducing the length of their probe path
Deletion becomes more expensive

Tombstone method:

Fast
But does not reduce probe path length
Large number of deletions will cause tombstones to build up

Problem with linear probing: clustering

Items tend to cluster together into long runs
Long runs are a problem:
- Insertions must travel to the end of a run
- Lookups of non-existent keys must travel to the end of a run

Double Hashing

Double hashing improves on linear probing:

By using an increment which…
- is based on a secondary hash of the key
- ensures that all slots will be visited (by using an increment which is relatively prime to N)
Tends to reduce clustering ⇒ shorter probe paths

To generate relatively prime number:

Set table size to prime, e.g., N = 127
Ensure secondary hash function returns number in range [1, N − 1]

Summary of Collision Resolution

Collision resolution approaches:

Separate chaining: Easy to implement, allows α > 1
Linear probing: Fast if α << 1, complex deletion
Double hashing: Avoids clustering issues with linear probing

All approaches can be used to achieve O(1) performance on average, assuming

good hash function
table is appropriately resized if load factor exceeds threshold

Implementation Details

How do we resize a hash table?

Hash function depends on the number of slots
- Items may not belong at the same index after resizing
So all items must be re-inserted
Good strategy is to roughly double the number of slots every resizing