Data Structures

Hash Tables

A hash table maps keys to values using a hash function to compute an array index. Python’s built-in dict is a highly optimised hash table that gives you O(1) average-case insert, lookup, and delete.

What Is a Hash Table?

Internally a hash table is just an array of buckets. To store a key-value pair, the hash function converts the key into an integer index, and the value is placed in that bucket.

Python
# Python dict is a hash table
phonebook = {}
phonebook["Alice"] = "555-0100"
phonebook["Bob"]   = "555-0200"

print(phonebook["Alice"])          # 555-0100
print("Bob" in phonebook)          # True
print(phonebook.get("Charlie", "not found"))  # not found

# Dict literal
scores = {"Alice": 95, "Bob": 87, "Carol": 92}
print(scores["Carol"])             # 92

Hash Functions

A hash function takes a key and returns an integer. The table index is then hash(key) % capacity. Python’s built-in hash() works on any immutable object.

Python
# Built-in hash
print(hash("Alice"))    # some large integer (changes between runs in Python 3.3+)
print(hash(42))         # 42  (integers hash to themselves)
print(hash((1, 2, 3)))  # tuple is hashable (immutable)

# Lists are NOT hashable — mutable objects cannot be dict keys
try:
    hash([1, 2, 3])
except TypeError as e:
    print(e)            # unhashable type: 'list'

# A minimal hash table — illustrative only, not production code
class MinimalHashTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        idx = self._index(key)
        for pair in self.buckets[idx]:
            if pair[0] == key:
                pair[1] = value
                return
        self.buckets[idx].append([key, value])

    def get(self, key):
        idx = self._index(key)
        for k, v in self.buckets[idx]:
            if k == key:
                return v
        raise KeyError(key)

Collision: Chaining

A collision occurs when two different keys hash to the same index. Separate chaining stores all colliding entries in a linked list (or dynamic array) at that bucket.

Python
# Illustrating chaining — each bucket holds multiple pairs
# Capacity 4, keys "a" and "e" both hash to index 1 (hypothetical)

#  Index 0: []
#  Index 1: [("a", 1), ("e", 5)]   ← chain of length 2
#  Index 2: [("b", 2)]
#  Index 3: [("c", 3)]

# Lookup traverses the chain at the computed index
# Average chain length ≈ n / capacity (load factor)
# All operations degrade to O(n) if all keys hash to same bucket
💡
Python uses open addressing, not chaining

CPython’s dict uses a compact open-addressing scheme with pseudo-random probing, not linked-list chains. This keeps entries cache-friendly in a single array.

Collision: Open Addressing

Instead of a chain, open addressing probes for the next empty slot within the array itself. Three common probing strategies:

Python
# Linear probing — try index+1, index+2, ...
# Quadratic probing — try index+1², index+2², ...
# Double hashing — second hash function determines step size

# Pseudocode for linear-probe insert
def linear_probe_put(table, capacity, key, value):
    idx = hash(key) % capacity
    while table[idx] is not None:
        if table[idx][0] == key:   # update existing
            table[idx] = (key, value)
            return
        idx = (idx + 1) % capacity  # linear probe
    table[idx] = (key, value)

# Problem: "clustering" — long runs of occupied slots slow lookup

Python dict Internals

Since Python 3.7, dict preserves insertion order and uses a two-array layout: a compact indices array and a dense entries array. This reduces memory compared to a sparse array of slots.

Python
d = {"x": 10, "y": 20, "z": 30}

# Common dict methods
print(list(d.keys()))     # ['x', 'y', 'z']  — insertion order
print(list(d.values()))   # [10, 20, 30]
print(list(d.items()))    # [('x', 10), ('y', 20), ('z', 30)]

# Iteration
for key, val in d.items():
    print(f"{key} -> {val}")

# Merge (Python 3.9+)
a = {"x": 1}
b = {"y": 2}
merged = a | b            # {'x': 1, 'y': 2}

# Dict comprehension
squares = {n: n**2 for n in range(1, 6)}
# {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

Load Factor & Resizing

The load factor α = n / capacity (entries / buckets). When α gets too high, collisions become frequent. CPython resizes the dict when it is more than two-thirds full, doubling capacity and rehashing every entry.

Python
import sys

# Watch memory grow as dict expands
d = {}
prev_size = sys.getsizeof(d)
for i in range(10):
    d[i] = i
    new_size = sys.getsizeof(d)
    if new_size != prev_size:
        print(f"Resize at n={i+1}: {prev_size} → {new_size} bytes")
        prev_size = new_size

# Each resize is O(n) but happens so rarely that
# amortized cost per insert is still O(1)

Complexity Table

OperationAverageWorstNote
Insert / updateO(1)O(n)Worst case: all keys collide
LookupO(1)O(n)Rare with a good hash function
DeleteO(1)O(n)Must mark slot as “deleted”
IterationO(n)O(n)Visits all entries once
SpaceO(n)O(n)Plus overhead for empty slots