Hash Tables
A hash table maps keys to values using a hash function to compute an array index. Python’s built-in dict is a highly optimised hash table that gives you O(1) average-case insert, lookup, and delete.
What Is a Hash Table?
Internally a hash table is just an array of buckets. To store a key-value pair, the hash function converts the key into an integer index, and the value is placed in that bucket.
# Python dict is a hash table
phonebook = {}
phonebook["Alice"] = "555-0100"
phonebook["Bob"] = "555-0200"
print(phonebook["Alice"]) # 555-0100
print("Bob" in phonebook) # True
print(phonebook.get("Charlie", "not found")) # not found
# Dict literal
scores = {"Alice": 95, "Bob": 87, "Carol": 92}
print(scores["Carol"]) # 92
Hash Functions
A hash function takes a key and returns an integer. The table index is then hash(key) % capacity. Python’s built-in hash() works on any immutable object.
# Built-in hash
print(hash("Alice")) # some large integer (changes between runs in Python 3.3+)
print(hash(42)) # 42 (integers hash to themselves)
print(hash((1, 2, 3))) # tuple is hashable (immutable)
# Lists are NOT hashable — mutable objects cannot be dict keys
try:
hash([1, 2, 3])
except TypeError as e:
print(e) # unhashable type: 'list'
# A minimal hash table — illustrative only, not production code
class MinimalHashTable:
def __init__(self, capacity=16):
self.capacity = capacity
self.buckets = [[] for _ in range(capacity)]
def _index(self, key):
return hash(key) % self.capacity
def put(self, key, value):
idx = self._index(key)
for pair in self.buckets[idx]:
if pair[0] == key:
pair[1] = value
return
self.buckets[idx].append([key, value])
def get(self, key):
idx = self._index(key)
for k, v in self.buckets[idx]:
if k == key:
return v
raise KeyError(key)
Collision: Chaining
A collision occurs when two different keys hash to the same index. Separate chaining stores all colliding entries in a linked list (or dynamic array) at that bucket.
# Illustrating chaining — each bucket holds multiple pairs
# Capacity 4, keys "a" and "e" both hash to index 1 (hypothetical)
# Index 0: []
# Index 1: [("a", 1), ("e", 5)] ← chain of length 2
# Index 2: [("b", 2)]
# Index 3: [("c", 3)]
# Lookup traverses the chain at the computed index
# Average chain length ≈ n / capacity (load factor)
# All operations degrade to O(n) if all keys hash to same bucket
CPython’s dict uses a compact open-addressing scheme with pseudo-random probing, not linked-list chains. This keeps entries cache-friendly in a single array.
Collision: Open Addressing
Instead of a chain, open addressing probes for the next empty slot within the array itself. Three common probing strategies:
# Linear probing — try index+1, index+2, ...
# Quadratic probing — try index+1², index+2², ...
# Double hashing — second hash function determines step size
# Pseudocode for linear-probe insert
def linear_probe_put(table, capacity, key, value):
idx = hash(key) % capacity
while table[idx] is not None:
if table[idx][0] == key: # update existing
table[idx] = (key, value)
return
idx = (idx + 1) % capacity # linear probe
table[idx] = (key, value)
# Problem: "clustering" — long runs of occupied slots slow lookup
Python dict Internals
Since Python 3.7, dict preserves insertion order and uses a two-array layout: a compact indices array and a dense entries array. This reduces memory compared to a sparse array of slots.
d = {"x": 10, "y": 20, "z": 30}
# Common dict methods
print(list(d.keys())) # ['x', 'y', 'z'] — insertion order
print(list(d.values())) # [10, 20, 30]
print(list(d.items())) # [('x', 10), ('y', 20), ('z', 30)]
# Iteration
for key, val in d.items():
print(f"{key} -> {val}")
# Merge (Python 3.9+)
a = {"x": 1}
b = {"y": 2}
merged = a | b # {'x': 1, 'y': 2}
# Dict comprehension
squares = {n: n**2 for n in range(1, 6)}
# {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
Load Factor & Resizing
The load factor α = n / capacity (entries / buckets). When α gets too high, collisions become frequent. CPython resizes the dict when it is more than two-thirds full, doubling capacity and rehashing every entry.
import sys
# Watch memory grow as dict expands
d = {}
prev_size = sys.getsizeof(d)
for i in range(10):
d[i] = i
new_size = sys.getsizeof(d)
if new_size != prev_size:
print(f"Resize at n={i+1}: {prev_size} → {new_size} bytes")
prev_size = new_size
# Each resize is O(n) but happens so rarely that
# amortized cost per insert is still O(1)
Complexity Table
| Operation | Average | Worst | Note |
|---|---|---|---|
| Insert / update | O(1) | O(n) | Worst case: all keys collide |
| Lookup | O(1) | O(n) | Rare with a good hash function |
| Delete | O(1) | O(n) | Must mark slot as “deleted” |
| Iteration | O(n) | O(n) | Visits all entries once |
| Space | O(n) | O(n) | Plus overhead for empty slots |