अभिव्यक्ति

Building Dynamo from scratch in Go: Part 2

Partitioning with Consistent Hashing

This series

Code for this series

This is Part 2 of a 6-part series walking through a from-scratch Go implementation of Amazon's Dynamo. In Part 1, we built a single-node key-value store with a pluggable storage engine. Now we make it aware of multiple nodes by introducing consistent hashing - the mechanism that determines which node is responsible for which key.

The Problem with Naive Hashing

The most obvious way to distribute keys across N nodes is to hash the key and take the modulo:

node = hash(key) % N

This works fine as long as N never changes. But the moment you add or remove a node - which in any real cluster happens constantly - the value of N changes, and almost every key maps to a different node. If you had 10 nodes and added an 11th, roughly 90% of keys would need to move. For a system managing millions of keys, that's a catastrophic amount of data shuffling, and it all has to happen at once.

Dynamo needs to add nodes one at a time without system-wide disruption. The paper calls this incremental scalability. Consistent hashing is the technique that makes it possible.
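To see the scale of the problem, here's a quick standalone simulation (key names are invented; the hash is MD5 truncated to 32 bits, matching the scheme used later in this post) that counts how many of 10,000 keys land on a different node when a 10-node cluster grows to 11 under modulo placement:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// hash32 maps a key to a uint32 by truncating its MD5 digest.
func hash32(key string) uint32 {
	sum := md5.Sum([]byte(key))
	return binary.BigEndian.Uint32(sum[:4])
}

// countMoved returns how many of numKeys keys map to a different node
// when the cluster grows from oldN to newN nodes under modulo placement.
func countMoved(oldN, newN, numKeys int) int {
	moved := 0
	for i := 0; i < numKeys; i++ {
		h := hash32(fmt.Sprintf("key%d", i))
		if h%uint32(oldN) != h%uint32(newN) {
			moved++
		}
	}
	return moved
}

func main() {
	moved := countMoved(10, 11, 10000)
	fmt.Printf("%d of 10000 keys moved (%.1f%%)\n", moved, float64(moved)/100)
}
```

A key stays put only when its hash gives the same remainder mod 10 and mod 11, which happens for about 1 in 11 keys - so roughly 90% move.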

Consistent Hashing: The Ring

The core idea is simple and elegant. Instead of mapping keys to nodes via modulo, we map both keys and nodes onto the same circular hash space - a ring of numbers from 0 to 2³² − 1.

To find which node owns a key, you hash the key to a position on the ring, then walk clockwise until you hit a node. That node is the coordinator - the primary owner of that key.

        Node A (pos: 100)
           ·
         /   \
        /     \
  Node C      Node B
 (pos: 700)   (pos: 400)

Key "user:123" hashes to position 250
  → walk clockwise → hits Node B at 400
  → Node B is the coordinator

The crucial property: when a node joins or leaves, only the keys in the range it's responsible for need to move. Adding a node between positions 400 and 700 only affects keys in that arc - everything else stays put. On average, only K/N keys move (where K is total keys and N is total nodes), compared to nearly all keys with modulo hashing.

The Ring Implementation

Let's look at the data structures. The Ring needs to track virtual nodes (positions on the ring), physical nodes (actual machines), and a sorted list of positions for efficient lookup:

// pkg/ring/consistent_hash.go

type Ring struct {
    vnodes     map[uint32]*VirtualNode  // position → vnode
    nodes      map[string]*PhysicalNode // nodeID → physical node
    sortedKeys []uint32                 // sorted positions for binary search
    replicas   int                      // replication factor N (used later)
    mu         sync.RWMutex
}

type VirtualNode struct {
    NodeID   string
    Position uint32
}

type PhysicalNode struct {
    NodeID string
    VNodes []uint32  // all positions this node occupies
}

Two maps and a slice working together. vnodes maps a ring position to the virtual node sitting there. nodes maps a physical node's ID to all of its positions. And sortedKeys is a sorted slice of all occupied positions, enabling binary search for key lookups.

The replicas field stores the replication factor N - we won't use it in this post (every key lives on exactly one node for now), but it's there in the struct because GetPreferenceList will need it starting in Part 4 when we introduce quorum-based replication.

Construction is straightforward:

func NewRing(replicas, vnodesPerNode int) *Ring {
    // vnodesPerNode is part of the signature for symmetry with the config,
    // but positions are only claimed later, when AddNode is called with a count.
    return &Ring{
        vnodes:   make(map[uint32]*VirtualNode),
        nodes:    make(map[string]*PhysicalNode),
        replicas: replicas,
    }
}

Adding a Node to the Ring

When a node joins the cluster, it needs to claim positions on the ring. The implementation hashes the node ID combined with an index to generate deterministic, well-distributed positions:

func (r *Ring) AddNode(nodeID string, vnodesCount int) []uint32 {
    r.mu.Lock()
    defer r.mu.Unlock()

    tokens := make([]uint32, vnodesCount)
    physNode := &PhysicalNode{
        NodeID: nodeID,
        VNodes: tokens,
    }

    for i := 0; i < vnodesCount; i++ {
        // Generate a deterministic token position by hashing "nodeID:i";
        // strconv.Itoa gives a distinct, readable suffix for every index
        hash := md5.Sum([]byte(nodeID + ":" + strconv.Itoa(i)))
        position := binary.BigEndian.Uint32(hash[:4])

        tokens[i] = position
        r.vnodes[position] = &VirtualNode{
            NodeID:   nodeID,
            Position: position,
        }
    }

    r.nodes[nodeID] = physNode
    r.rebuildSortedKeys()

    return tokens
}

A few things worth noting here.

The hash function is MD5, truncated to the first 4 bytes (32 bits). The full MD5 output is 128 bits, but 2³² positions on the ring is more than sufficient for a toy implementation - it gives us roughly 4 billion positions to spread nodes across. The paper uses the full 128-bit hash space, but the principle is identical.

The token generation is deterministic: given the same nodeID and the same number of virtual nodes, you'll always get the same positions. This matters for cluster recovery - a restarting node reclaims exactly the same ring positions it had before.
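To see the determinism concretely, here's a tiny standalone sketch (not the repo's code) that derives positions the same way AddNode does, using strconv.Itoa for the index suffix; calling it twice with the same inputs yields identical tokens:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"strconv"
)

// tokensFor derives vnode positions by hashing "nodeID:index" and taking
// the first 4 bytes of the digest as a uint32 - the same scheme AddNode uses.
func tokensFor(nodeID string, count int) []uint32 {
	tokens := make([]uint32, count)
	for i := 0; i < count; i++ {
		sum := md5.Sum([]byte(nodeID + ":" + strconv.Itoa(i)))
		tokens[i] = binary.BigEndian.Uint32(sum[:4])
	}
	return tokens
}

func main() {
	a := tokensFor("node1", 4)
	b := tokensFor("node1", 4) // same inputs → identical positions
	fmt.Println(a)
	fmt.Println(b)
}
```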

The method returns the generated tokens. This is important because when a node starts, it stores these tokens in its membership record and shares them with other nodes via gossip (which we'll cover in Part 5). Other nodes call AddNodeWithTokens to place the joining node on their local copy of the ring:

func (r *Ring) AddNodeWithTokens(nodeID string, tokens []uint32) {
    r.mu.Lock()
    defer r.mu.Unlock()

    physNode := &PhysicalNode{
        NodeID: nodeID,
        VNodes: tokens,
    }

    for _, pos := range tokens {
        r.vnodes[pos] = &VirtualNode{
            NodeID:   nodeID,
            Position: pos,
        }
    }

    r.nodes[nodeID] = physNode
    r.rebuildSortedKeys()
}

After any change to the ring, rebuildSortedKeys reconstructs the sorted position list:

func (r *Ring) rebuildSortedKeys() {
    r.sortedKeys = make([]uint32, 0, len(r.vnodes))
    for k := range r.vnodes {
        r.sortedKeys = append(r.sortedKeys, k)
    }
    sort.Slice(r.sortedKeys, func(i, j int) bool {
        return r.sortedKeys[i] < r.sortedKeys[j]
    })
}

This is a full rebuild on every mutation, which is fine - node additions and removals are rare operations (minutes to hours apart), while key lookups happen thousands of times per second. Optimizing the rare path at the cost of the hot path would be the wrong trade-off.

Looking Up a Key: The Preference List

This is the heart of consistent hashing - given a key, find the node(s) responsible for it. The method is called GetPreferenceList because in Dynamo, a key isn't owned by just one node; it's replicated across N nodes. The preference list is the ordered list of those N nodes.

For now, think of it as being called with n=1 (single owner). In Part 4, we'll call it with n=N for quorum replication.

func (r *Ring) GetPreferenceList(key string, n int) []string {
    r.mu.RLock()
    defer r.mu.RUnlock()

    if len(r.sortedKeys) == 0 {
        return nil
    }

    // Hash the key to a position on the ring
    hash := md5.Sum([]byte(key))
    position := binary.BigEndian.Uint32(hash[:4])

    // Binary search: find the first vnode at or after this position
    idx := sort.Search(len(r.sortedKeys), func(i int) bool {
        return r.sortedKeys[i] >= position
    })

    // Wrap around if we've gone past the end of the ring
    if idx == len(r.sortedKeys) {
        idx = 0
    }

    // Walk clockwise, collecting N unique physical nodes
    seen := make(map[string]bool)
    list := make([]string, 0, n)

    for len(list) < n && len(seen) < len(r.nodes) {
        vnode := r.vnodes[r.sortedKeys[idx]]
        if !seen[vnode.NodeID] {
            list = append(list, vnode.NodeID)
            seen[vnode.NodeID] = true
        }
        idx = (idx + 1) % len(r.sortedKeys)
    }

    return list
}

Let's walk through this step by step.

Hash the key. The key "user:123" is hashed with MD5, and the first 4 bytes are taken as a uint32. This gives us a position on the ring - say, position 250.

Binary search. We use sort.Search to find the first entry in sortedKeys that is ≥ 250. This is O(log V) where V is the total number of virtual nodes across all physical nodes. With 10 physical nodes at 256 virtual nodes each, that's a binary search over 2,560 entries - essentially instant.

Wrap around. If the key's position is beyond the last virtual node on the ring, we wrap to position 0. This is the "ring" property - the hash space is circular.

Walk clockwise. Starting from the found position, we walk clockwise collecting node IDs. The critical detail is the seen map - we skip virtual nodes that belong to a physical node we've already collected. Without this, if Node A has virtual nodes at positions 260 and 270, both would appear in the preference list, which defeats the purpose of replicating across different physical machines. The walk continues until we have n unique physical nodes.
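To make the dedup step concrete, here's a self-contained toy version of the walk with hard-coded positions (node names and positions are invented for illustration). Node A owns 260 and 270; without the seen map it would appear twice in the list:

```go
package main

import (
	"fmt"
	"sort"
)

// preferenceList is a stripped-down clockwise walk: binary-search to the
// first position at or after keyPos, then collect n unique node IDs.
func preferenceList(sorted []uint32, owner map[uint32]string, keyPos uint32, n int) []string {
	// Count distinct physical nodes so the walk can terminate early.
	distinct := map[string]bool{}
	for _, id := range owner {
		distinct[id] = true
	}

	idx := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= keyPos })
	if idx == len(sorted) {
		idx = 0 // wrap around the ring
	}

	seen := map[string]bool{}
	list := []string{}
	for len(list) < n && len(seen) < len(distinct) {
		id := owner[sorted[idx]]
		if !seen[id] {
			list = append(list, id)
			seen[id] = true
		}
		idx = (idx + 1) % len(sorted)
	}
	return list
}

func main() {
	owner := map[uint32]string{100: "A", 260: "A", 270: "A", 400: "B"}
	sorted := []uint32{100, 260, 270, 400}
	// Key at 250 first hits A's vnodes at 260 and 270, then B at 400;
	// A appears only once despite owning two consecutive positions.
	fmt.Println(preferenceList(sorted, owner, 250, 2)) // [A B]
}
```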

There's also a convenience method that returns just the first node - the coordinator:

func (r *Ring) GetCoordinator(key string) string {
    list := r.GetPreferenceList(key, 1)
    if len(list) == 0 {
        return ""
    }
    return list[0]
}

Why Virtual Nodes?

If each physical node occupied just one position on the ring, the load distribution would be poor. With 3 nodes, you'd have 3 positions on a ring with 4 billion slots. The arcs between those positions would be wildly uneven, meaning one node might own 50% of the key space while another owns 10%.

Virtual nodes solve this by giving each physical node many positions on the ring. With 256 virtual nodes per physical node, a 3-node cluster has 768 positions spread across the ring. The law of large numbers kicks in - the more positions you scatter randomly, the more evenly the arcs average out.

Without virtual nodes (3 nodes, 3 positions):
  Node A owns: ~45% of keys
  Node B owns: ~35% of keys
  Node C owns: ~20% of keys    ← very uneven

With virtual nodes (3 nodes, 768 positions):
  Node A owns: ~33.8% of keys
  Node B owns: ~32.5% of keys
  Node C owns: ~33.7% of keys  ← much better
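You can reproduce numbers like these with a quick standalone simulation (a sketch, not the repo's code) that builds a toy ring with hash-derived tokens and measures each node's share of 10,000 keys:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
	"strconv"
)

func hash32(s string) uint32 {
	sum := md5.Sum([]byte(s))
	return binary.BigEndian.Uint32(sum[:4])
}

// spread builds a toy ring with vnodesPerNode tokens per node, assigns
// 10,000 keys by clockwise lookup, and returns the largest and smallest
// fraction of keys owned by any single node.
func spread(nodes, vnodesPerNode int) (maxShare, minShare float64) {
	owner := map[uint32]string{}
	for n := 0; n < nodes; n++ {
		id := "node" + strconv.Itoa(n)
		for i := 0; i < vnodesPerNode; i++ {
			owner[hash32(id+":"+strconv.Itoa(i))] = id
		}
	}
	sorted := make([]uint32, 0, len(owner))
	for p := range owner {
		sorted = append(sorted, p)
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	counts := map[string]int{}
	for k := 0; k < 10000; k++ {
		pos := hash32("key" + strconv.Itoa(k))
		idx := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= pos })
		if idx == len(sorted) {
			idx = 0
		}
		counts[owner[sorted[idx]]]++
	}

	minShare = 1.0
	for _, c := range counts {
		share := float64(c) / 10000
		if share > maxShare {
			maxShare = share
		}
		if share < minShare {
			minShare = share
		}
	}
	return maxShare, minShare
}

func main() {
	max1, min1 := spread(3, 1)
	max256, min256 := spread(3, 256)
	fmt.Printf("1 vnode/node:    max %.1f%%, min %.1f%%\n", max1*100, min1*100)
	fmt.Printf("256 vnodes/node: max %.1f%%, min %.1f%%\n", max256*100, min256*100)
}
```

With a single token per node the shares are at the mercy of three random positions; with 256 tokens each, every node's share lands within a few percent of one third.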

The implementation's test suite validates this. The load distribution test creates 10 nodes with 256 virtual nodes each, distributes 10,000 keys, and logs a warning if any node's share deviates more than 30% from the average:

func TestRingLoadDistribution(t *testing.T) {
    r := ring.NewRing(3, 256)

    for i := 0; i < 10; i++ {
        r.AddNode(fmt.Sprintf("node%d", i), 256)
    }

    keyDistribution := make(map[string]int)
    for i := 0; i < 10000; i++ {
        key := fmt.Sprintf("key%d", i)
        coordinator := r.GetCoordinator(key)
        keyDistribution[coordinator]++
    }

    avgKeys := 10000 / 10
    for nodeID, count := range keyDistribution {
        deviation := float64(count-avgKeys) / float64(avgKeys)
        if deviation > 0.3 || deviation < -0.3 {
            t.Logf("Warning: Node %s has %d keys (%.1f%% deviation)",
                nodeID, count, deviation*100)
        }
    }
}

Virtual nodes also give us two other important properties.

Heterogeneity. A powerful machine can be assigned more virtual nodes than a weaker one, taking on a proportionally larger share of the data. This is the paper's answer to running on heterogeneous hardware - you don't need identical machines.

Graceful failure handling. When a physical node with 256 virtual nodes goes down, its load is spread across up to 256 different nodes (in a large enough cluster), rather than dumping everything onto a single successor. This prevents cascading overload.

The default configuration uses 256 virtual nodes per physical node. The paper discusses this trade-off: more virtual nodes means better distribution but also more metadata to track (more entries in the ring, more tokens to gossip). 256 is a pragmatic sweet spot for clusters of up to a few hundred nodes.

Removing a Node

When a node leaves the cluster - whether gracefully or due to failure - its positions are removed from the ring:

func (r *Ring) RemoveNode(nodeID string) {
    r.mu.Lock()
    defer r.mu.Unlock()

    node := r.nodes[nodeID]
    if node == nil {
        return
    }

    for _, pos := range node.VNodes {
        delete(r.vnodes, pos)
    }

    delete(r.nodes, nodeID)
    r.rebuildSortedKeys()
}

After removal, any key that was assigned to the departing node now falls through to the next node clockwise. The data that was on the departed node needs to be re-replicated to maintain the desired number of copies - that's handled by hinted handoff (Part 4) and anti-entropy (Part 5). The ring itself just handles the routing - figuring out who's responsible.
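The fallthrough is easy to see in a toy sketch (hard-coded positions and invented node names, mirroring the diagram from earlier):

```go
package main

import (
	"fmt"
	"sort"
)

// coordinator finds the first position at or after pos, wrapping at the end.
func coordinator(sorted []uint32, owner map[uint32]string, pos uint32) string {
	idx := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= pos })
	if idx == len(sorted) {
		idx = 0
	}
	return owner[sorted[idx]]
}

func main() {
	owner := map[uint32]string{100: "A", 400: "B", 700: "C"}
	sorted := []uint32{100, 400, 700}
	fmt.Println(coordinator(sorted, owner, 250)) // key at 250 → B

	// Remove B: delete its position and rebuild the sorted slice.
	// The same key now falls through to the next node clockwise.
	delete(owner, 400)
	sorted = []uint32{100, 700}
	fmt.Println(coordinator(sorted, owner, 250)) // same key now → C
}
```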

The test for this verifies that after removing a node, the preference list degrades gracefully:

func TestRingRemoveNode(t *testing.T) {
    r := ring.NewRing(3, 10)
    r.AddNode("node1", 10)
    r.AddNode("node2", 10)

    r.RemoveNode("node1")

    preferenceList := r.GetPreferenceList("test-key", 3)
    // Only 1 node left - can't fill a list of 3
    if len(preferenceList) != 1 {
        t.Errorf("Expected 1 node in preference list after removal")
    }
}

This is important: GetPreferenceList doesn't panic or return an error if there aren't enough nodes. It returns as many unique physical nodes as it can find, up to n. The caller (the coordinator, in later posts) decides what to do if the list is shorter than expected - typically returning a quorum error.

Evolving the Node

Now we wire the ring into our Node. The struct grows by one field:

type Node struct {
    id      string
    address string
    config  *Config
    storage storage.Storage
    ring    *ring.Ring       // ← NEW

    stopCh chan struct{}
    mu     sync.RWMutex
}

And Config gains the VirtualNodes setting:

type Config struct {
    StorageEngine string
    StoragePath   string
    VirtualNodes  int          // ← NEW
}

func DefaultConfig() *Config {
    return &Config{
        StorageEngine: "memory",
        StoragePath:   "./data",
        VirtualNodes:  256,
    }
}

NewNode now initializes the ring:

func NewNode(id, address string, config *Config) (*Node, error) {
    if config == nil {
        config = DefaultConfig()
    }

    node := &Node{
        id:      id,
        address: address,
        config:  config,
        stopCh:  make(chan struct{}),
    }

    // Initialize storage engine
    var err error
    node.storage, err = storage.NewStorage(config.StorageEngine)
    if err != nil {
        return nil, fmt.Errorf("failed to initialize storage: %w", err)
    }

    // Initialize the consistent hash ring
    node.ring = ring.NewRing(1, config.VirtualNodes)  // replicas=1 for now

    return node, nil
}

And Start places this node onto the ring:

func (n *Node) Start() error {
    // Add self to the ring
    n.ring.AddNode(n.id, n.config.VirtualNodes)
    return nil
}

At this stage, the ring exists but doesn't do much - we can ask it which node should own a key, but there's no mechanism to actually route requests to remote nodes. The node still reads and writes locally. What we've established is the routing table that the entire distributed system will rely on.

To see it working:

func main() {
    config := DefaultConfig()
    node, _ := NewNode("node1", "localhost:8001", config)
    node.Start()

    // Who owns this key?
    coordinator := node.ring.GetCoordinator("user:123")
    fmt.Println(coordinator) // "node1" - we're the only node

    // Simulate learning about another node. In a real cluster these tokens
    // arrive via gossip from node2's own AddNode; here they're arbitrary
    // placeholder positions.
    someTokens := []uint32{1 << 28, 1 << 30, 3 << 30}
    node.ring.AddNodeWithTokens("node2", someTokens)

    coordinator = node.ring.GetCoordinator("user:123")
    fmt.Println(coordinator) // might now be "node2"
}

The Partitioning Strategy

The Dynamo paper (Section 6.2) discusses three partitioning strategies and evaluates their trade-offs. This implementation uses Strategy 3: a fixed number of equal-sized partitions, with each node assigned Q/S tokens (where Q is the total number of partitions and S is the number of nodes).

The key insight is that Strategy 3 decouples partitioning from placement. The hash space is divided into Q partitions once, and partition-to-node assignment changes only when nodes join or leave. This has practical advantages for Merkle tree-based anti-entropy (Part 5) - each partition has a fixed range, so its Merkle tree doesn't need to be recalculated when membership changes.

In the implementation, each node gets 256 tokens (the default VirtualNodes value), so Q = 256 × S. One caveat: because the token positions here are hash-derived rather than evenly spaced, the resulting ranges are only approximately equal-sized - strictly speaking this sits between the paper's Strategy 1 (random tokens) and Strategy 3 (fixed, equal partitions). For a 3-node cluster:

Hash space: [0, 2³²)
Total tokens: 768 (= 256 × 3)
Per node: 256 tokens

Node A: tokens [45, 187, 234, ...]     (256 tokens)
Node B: tokens [12, 98, 201, ...]      (256 tokens)
Node C: tokens [5, 156, 243, ...]      (256 tokens)

When a 4th node joins, it generates 256 new tokens that interleave with the existing 768. Each new token splits an arc previously owned by one of the three existing nodes - roughly 85 arcs per existing node - and the newcomer ends up with about a quarter of the key space. Only about 1/N of the total data moves, which is the theoretical minimum.
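A standalone simulation (a sketch using the same deterministic token derivation, not the repo's code) confirms this: because existing nodes keep their exact tokens when a 4th node joins, only keys whose successor is now one of the newcomer's tokens change coordinator - roughly a quarter of them:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
	"strconv"
)

func hash32(s string) uint32 {
	sum := md5.Sum([]byte(s))
	return binary.BigEndian.Uint32(sum[:4])
}

// buildRing creates a toy ring with 256 hash-derived tokens per node.
// Token derivation is deterministic, so node0..node2 hold identical
// positions in both the 3-node and 4-node rings.
func buildRing(nodes int) ([]uint32, map[uint32]string) {
	owner := map[uint32]string{}
	for n := 0; n < nodes; n++ {
		id := "node" + strconv.Itoa(n)
		for i := 0; i < 256; i++ {
			owner[hash32(id+":"+strconv.Itoa(i))] = id
		}
	}
	sorted := make([]uint32, 0, len(owner))
	for p := range owner {
		sorted = append(sorted, p)
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted, owner
}

func coordinatorOf(sorted []uint32, owner map[uint32]string, key string) string {
	pos := hash32(key)
	idx := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= pos })
	if idx == len(sorted) {
		idx = 0
	}
	return owner[sorted[idx]]
}

// countMovedOnJoin counts keys whose coordinator changes when the
// cluster grows from oldNodes to newNodes.
func countMovedOnJoin(oldNodes, newNodes, keys int) int {
	s1, o1 := buildRing(oldNodes)
	s2, o2 := buildRing(newNodes)
	moved := 0
	for k := 0; k < keys; k++ {
		key := "key" + strconv.Itoa(k)
		if coordinatorOf(s1, o1, key) != coordinatorOf(s2, o2, key) {
			moved++
		}
	}
	return moved
}

func main() {
	moved := countMovedOnJoin(3, 4, 10000)
	fmt.Printf("%d of 10000 keys moved (~%.0f%%)\n", moved, float64(moved)/100)
}
```

Contrast this with the modulo scheme from the start of the post, where the same membership change would move roughly 75% of the keys.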

What's Next

We can now partition data across nodes - given any key, we know exactly which node (or nodes) should own it. But we're still storing plain byte slices, and we have no way to detect when two clients have written conflicting values to the same key.

In Part 3, we'll introduce vector clocks - the versioning mechanism that lets Dynamo track causality between writes, detect conflicts, and enable applications to resolve them with semantic merge logic. The storage layer will evolve from storing raw bytes to storing VersionedValue structs, each carrying a vector clock that encodes its causal history.

The full source code is available at github.com/tripab/toy-dynamo. The ring implementation lives in pkg/ring/consistent_hash.go.