As one of implementation of dictionary.

Trivial Idea

Direct addressing

Keys are from \(0,\cdots,m-1\), memory size \(\geq m\).

Insert pair \((k,v)\): arr[k]=v

Problem: Memory limits

Dictionary with Hash Tables

Suppose keys \(\in S\), a very large domain.

The number of keys \(=n<<|S|\)

Memory size \(m=O(n)\) with \(m\geq n\).

How to map keys to hash table? (How to insert \((k,v)\)?) Use hash function!

Hash Functions

Hash function \(h:k\in S\to\{0,\cdots,m-1\}\)

When inserting, we need to store key-value pair, not just the value!

Hash Collision

By pigeonhole principle, it is not hard to see hash collision is unavoidable in practice.

The number of hash collisions depend on hash function \(h\).

Desired properties of hash function \(h\):

\(h\) must be computably fast (ideally \(O(1)\)).
\(h\) must be deterministic function, i.e., for the same value of key, the hash function always return the same value.
\(h\) should behave like a random function.

Ways to deal with hash collision: Separate Chaining and Open Addressing.

Separate Chaining

When collision happens, we just add the key-value pair on the linked list from the same entry of hash value.

Assume \(h\) behaves like a random function.

\(\Pr\{\)key ends up at a position\(\}=\frac{1}{m}\). Thus expected length of linked list is \(\frac{n}{m}=O(1)\).

Insert: \(O(1)\)

Find and delete: \(O(1+\text{len(linked list)})\), worst case is \(O(n)\) with very low probability.

Open Addressing

When collision happens, we try to find a different free position on the hash table.

How? Linear probing, quadratic probing.

Linear Probing

Insert: If \(h(x)\) is occupied, try the next position until we find a free one.

Find: Calculate \(h(x)\), if the key on entry \(h(x)\) is not the same, move to next entry, until we find the same key value or empty entry.

When deleting, we just replace the data by a special marker! Otherwise, finding the data may fail.

Issues of linear probing: The larger consecutive cluster is, the higher probability the hash value will fall in the cluster. This makes the worst case \(O(n)\).

In expectation, find/insert/delete should be \(O(n)\).

Advantages:

Easy to implement.
Cache friendly.

Quadratic Probing

Instead of trying consecutive position, we jump.

Rule: At the \(i\)-th trial, starting from \(0\), we try the position \((h(x)+c_1i+c_2i^2)\mod n\) on the hash table, where \(c_1,c_2\) is constant.