Building Dynamo from scratch in Go: Part 3
Vector Clocks and Conflict Resolution
This series
- Part 1: The Big Picture
- Part 2: Partitioning with Consistent Hashing
- Part 3: Vector Clocks and Conflict Resolution (this post)
- Part 4: Request Coordination, Quorums, and Failure Handling
- Part 5: Membership, Failure Detection, and Replica Synchronization
- Part 6: Production Concerns - RPC, Resilience, Performance, and Testing
This is Part 3 of a 6-part series walking through a from-scratch Go implementation of Amazon's Dynamo. In Part 1 we built a single-node store, and in Part 2 we introduced consistent hashing to partition keys across nodes. Now we tackle one of the hardest problems in distributed systems: detecting and resolving conflicting writes.
Outline
- The Problem with Wall Clocks
- Vector Clocks: The Intuition
- The VectorClock Struct
- VersionedValue: Pairing Data with Its Clock
- Reconciling Concurrent Versions
- Application-Level Conflict Resolution
- Evolving the Storage Layer
- Pruning: Keeping Vector Clocks Bounded
- Tombstones: Deletes Are Writes
- Evolving the Config
- What's Next
The Problem with Wall Clocks
When two clients write to the same key at roughly the same time through different nodes, we have a conflict. The naive instinct is to use timestamps - just keep the version with the later timestamp. But this breaks in distributed systems for a fundamental reason: clocks on different machines are not perfectly synchronized.
Even with NTP, clock skew between machines ranges from milliseconds to seconds. Worse, clocks can jump backwards (after an NTP correction), or drift at different rates. A write that physically happened first might carry a later timestamp than a write that happened second, simply because one machine's clock ran fast.
The Dynamo paper rejects wall-clock ordering entirely. Instead, it uses vector clocks - a logical clock mechanism that tracks the actual causal relationships between writes without relying on physical time at all.
Vector Clocks: The Intuition
A vector clock is a map from node IDs to counters. Each node maintains its own counter, incrementing it every time it coordinates a write. When you look at a vector clock, it tells you: "this version incorporates the effects of writes 1 through X from Node A, writes 1 through Y from Node B, and so on."
Two vector clocks can be compared to determine one of four relationships:
- Equal - they're identical. Same version.
- Before (ancestor) - one is a direct ancestor of the other. Every counter in the older clock is ≤ the corresponding counter in the newer clock, with at least one strictly less.
- After (descendant) - the reverse of Before.
- Concurrent - neither dominates the other. This means the two versions evolved independently, incorporating different sets of writes. This is a conflict.
Let's trace through a concrete example:
1. Client writes "Alice" via Node A
Clock: {A:1} → stored as version v1
2. Client reads v1, then writes "Alice Smith" via Node A
Clock: {A:2} → stored as version v2
v2 descends from v1 (A:2 > A:1), so v1 is safely replaced
3. Meanwhile, another client reads v1 and writes "Alice Jones" via Node B
Clock: {A:1, B:1} → stored as version v3
Compare v2 {A:2} and v3 {A:1, B:1}:
Node A: v2 has 2, v3 has 1 → v2 is ahead
Node B: v2 has 0, v3 has 1 → v3 is ahead
Neither dominates → CONCURRENT → conflict!
The system now holds two versions of the same key. The application must decide how to merge them (more on this shortly).
The VectorClock Struct
Here's the implementation:
// pkg/versioning/vector_clock.go
type VectorClock struct {
Versions map[string]uint64 // nodeID → counter
Timestamp time.Time // for pruning (not for ordering!)
}
func NewVectorClock() *VectorClock {
return &VectorClock{
Versions: make(map[string]uint64),
Timestamp: time.Now(),
}
}
Note the Timestamp field. This is not used for causal ordering - that's the whole point of having vector clocks. The timestamp exists purely as a tiebreaker for pruning: when the clock gets too large and we need to discard old entries, we use the timestamp to decide which entries are oldest. We'll return to pruning later in this post.
Increment
When a node coordinates a write, it bumps its own counter:
func (vc *VectorClock) Increment(nodeID string) {
vc.Versions[nodeID]++
vc.Timestamp = time.Now()
}
Simple: one node, one counter, one increment per write. If Node A coordinates three writes in a row, the clock goes {A:1} → {A:2} → {A:3}. If Node B then coordinates a write (having seen {A:3}), the clock becomes {A:3, B:1}.
Copy
Vector clocks are passed around with values, so we need deep copies to avoid aliasing bugs:
func (vc *VectorClock) Copy() *VectorClock {
newVC := NewVectorClock()
for k, v := range vc.Versions {
newVC.Versions[k] = v
}
newVC.Timestamp = vc.Timestamp
return newVC
}
Compare
This is the core operation. Given two vector clocks, determine their causal relationship:
type Ordering int
const (
Before Ordering = iota // this clock is older (ancestor)
After // this clock is newer (descendant)
Equal // identical
Concurrent // conflict - independent histories
)
func (vc *VectorClock) Compare(other *VectorClock) Ordering {
if vc == nil || other == nil {
return Concurrent
}
vcGreater := false
otherGreater := false
// Gather all node IDs from both clocks
allNodes := make(map[string]bool)
for node := range vc.Versions {
allNodes[node] = true
}
for node := range other.Versions {
allNodes[node] = true
}
// Compare each node's counter
for node := range allNodes {
vcVal := vc.Versions[node] // 0 if absent
otherVal := other.Versions[node] // 0 if absent
if vcVal > otherVal {
vcGreater = true
} else if otherVal > vcVal {
otherGreater = true
}
}
if vcGreater && !otherGreater {
return After // vc is strictly newer
} else if otherGreater && !vcGreater {
return Before // vc is strictly older
} else if !vcGreater && !otherGreater {
return Equal
} else {
return Concurrent // conflict
}
}
The algorithm walks through every node ID that appears in either clock. For each, it checks which clock has the higher counter. If vc is ahead on at least one node and behind on none, it's After (newer). If behind on at least one and ahead on none, it's Before (older). If both have exactly the same counters, they're Equal. And if vc is ahead on some nodes and behind on others - the Concurrent case - that's a conflict.
This is an O(N) comparison where N is the number of distinct node IDs across both clocks, which in practice is small (a few to a few dozen).
The test suite exercises each case explicitly:
func TestVectorClockCompare(t *testing.T) {
vc1 := versioning.NewVectorClock()
vc1.Versions["node1"] = 1
vc1.Versions["node2"] = 2
vc2 := versioning.NewVectorClock()
vc2.Versions["node1"] = 1
vc2.Versions["node2"] = 3
// vc1 {node1:1, node2:2} vs vc2 {node1:1, node2:3}
// node1: equal. node2: vc2 ahead.
// → vc1 is Before (older)
ordering := vc1.Compare(vc2)
if ordering != versioning.Before {
t.Errorf("Expected vc1 Before vc2")
}
}
func TestVectorClockConcurrent(t *testing.T) {
vc1 := versioning.NewVectorClock()
vc1.Versions["node1"] = 2
vc1.Versions["node2"] = 1
vc2 := versioning.NewVectorClock()
vc2.Versions["node1"] = 1
vc2.Versions["node2"] = 2
// node1: vc1 ahead. node2: vc2 ahead. → Concurrent!
ordering := vc1.Compare(vc2)
if ordering != versioning.Concurrent {
t.Errorf("Expected Concurrent, got %v", ordering)
}
}
Merge
When a client resolves a conflict (reconciling multiple concurrent versions into one), the resulting value needs a vector clock that reflects all the causal history it incorporates. This is done by taking the component-wise maximum:
func (vc *VectorClock) Merge(other *VectorClock) *VectorClock {
merged := vc.Copy()
for node, count := range other.Versions {
if count > merged.Versions[node] {
merged.Versions[node] = count
}
}
if other.Timestamp.After(merged.Timestamp) {
merged.Timestamp = other.Timestamp
}
return merged
}
After merging {A:2} and {A:1, B:1}, the result is {A:2, B:1} - a clock that dominates both parents, so any future write based on this merged clock will properly supersede both conflicting versions.
The test verifies:
func TestVectorClockMerge(t *testing.T) {
    vc1 := versioning.NewVectorClock()
    vc1.Versions["node1"] = 2
    vc1.Versions["node2"] = 1
    vc2 := versioning.NewVectorClock()
    vc2.Versions["node1"] = 1
    vc2.Versions["node2"] = 3
    vc2.Versions["node3"] = 1
    merged := vc1.Merge(vc2)
    // max(2,1)=2, max(1,3)=3, max(0,1)=1
    // → {node1:2, node2:3, node3:1}
    want := map[string]uint64{"node1": 2, "node2": 3, "node3": 1}
    for node, count := range want {
        if merged.Versions[node] != count {
            t.Errorf("merged[%s] = %d, want %d", node, merged.Versions[node], count)
        }
    }
}
VersionedValue: Pairing Data with Its Clock
Every value stored in Dynamo now carries its vector clock:
type VersionedValue struct {
Data []byte
VectorClock *VectorClock
IsTombstone bool
}
The IsTombstone field is for deletes - we'll cover it later in this post. For now, the important thing is that Data and VectorClock travel together as a unit. You never store raw bytes anymore; you always store a VersionedValue.
Reconciling Concurrent Versions
When a node collects responses from multiple replicas during a read (which we'll implement in Part 4), it may receive several versions of the same key. Some may be ancestors of others, and some may be concurrent. The ReconcileConcurrent function strips out dominated versions, keeping only the concurrent ones:
func ReconcileConcurrent(versions []VersionedValue) []VersionedValue {
if len(versions) <= 1 {
return versions
}
concurrent := []VersionedValue{}
for i := range versions {
isDominated := false
for j := range versions {
if i == j {
continue
}
ordering := versions[i].VectorClock.Compare(versions[j].VectorClock)
if ordering == Before {
// versions[i] is an ancestor of versions[j] - discard it
isDominated = true
break
}
}
if !isDominated {
concurrent = append(concurrent, versions[i])
}
}
return concurrent
}
This is an O(n²) comparison across all collected versions, which is fine because concurrent versions are rare in practice - the paper reports that only about 0.06% of reads saw more than one version in Amazon's production deployment.
The algorithm's logic: for each version, check if any other version strictly dominates it (is a descendant). If so, the version is an ancestor and can be safely discarded. What remains is the set of versions that are either equal to or concurrent with each other - these are the versions the application needs to see.
If only one version survives reconciliation, there's no conflict and the application gets a clean read. If multiple survive, the application must decide how to merge them.
Application-Level Conflict Resolution
This is one of Dynamo's most distinctive design choices. Instead of the storage system picking a winner (like "last write wins"), it surfaces all concurrent versions to the application and lets the application decide how to merge them.
The implementation provides a Reconciler interface:
// pkg/versioning/reconciler.go
type Reconciler interface {
Resolve(values []VersionedValue) []byte
}
With two built-in strategies. First, the simplest possible default - last-write-wins using timestamps:
type LastWriteWinsReconciler struct{}
func (r *LastWriteWinsReconciler) Resolve(values []VersionedValue) []byte {
if len(values) == 0 {
return nil
}
latest := values[0]
for _, v := range values[1:] {
if v.VectorClock.Timestamp.After(latest.VectorClock.Timestamp) {
latest = v
}
}
return latest.Data
}
And second, an adapter for custom application logic:
type ApplicationReconciler struct {
ResolveFn func([]VersionedValue) []byte
}
func (r *ApplicationReconciler) Resolve(values []VersionedValue) []byte {
    if len(values) == 0 {
        return nil
    }
    if r.ResolveFn != nil {
        return r.ResolveFn(values)
    }
    return values[0].Data
}
The Shopping Cart Example
The paper's motivating example is Amazon's shopping cart. When two concurrent writes add different items to the same cart, the correct merge strategy is to take the union of all items - you never want to silently drop something a customer added.
The examples/shopping_cart directory demonstrates this:
type ShoppingCart struct {
UserID string `json:"user_id"`
Items []string `json:"items"`
}
func MergeShoppingCarts(values []versioning.VersionedValue) []byte {
allItems := make(map[string]bool)
var userID string
for _, v := range values {
    var cart ShoppingCart
    if err := json.Unmarshal(v.Data, &cart); err != nil {
        continue // skip versions that fail to parse
    }
    userID = cart.UserID
    for _, item := range cart.Items {
        allItems[item] = true
    }
}
merged := ShoppingCart{
UserID: userID,
Items: make([]string, 0, len(allItems)),
}
for item := range allItems {
merged.Items = append(merged.Items, item)
}
data, _ := json.Marshal(merged)
return data
}
The reconciliation flow in practice looks like this:
// Read cart - might get multiple concurrent versions
result, err := node.Get(ctx, "cart:user123")
if len(result.Values) > 1 {
// Conflict! Merge using application logic
merged := MergeShoppingCarts(result.Values)
// Write back the merged version with the merged context
// This creates a new version that dominates both parents
err = node.Put(ctx, "cart:user123", merged, result.Context)
}
The result.Context is important - it contains the merged vector clock from all the concurrent versions. When the merged value is written back using this context, its vector clock dominates both parents, so subsequent reads will see a single, reconciled version.
Evolving the Storage Layer
The Storage interface and MemoryStorage from Parts 1 and 2 dealt with raw bytes. Now they need to handle VersionedValue:
// pkg/storage/interface.go - evolved
type Storage interface {
    Get(key string) ([]VersionedValue, error)
    Put(key string, value VersionedValue) error
    Delete(key string) error
    GetRange(start, end string) (map[string][]VersionedValue, error)
    GetAllKeys() ([]string, error)
    Close() error
}
The core changes: Get returns a slice of VersionedValue (because a key can have multiple concurrent versions), and Put takes a single VersionedValue. We also add two helpers: GetRange, needed by the anti-entropy system (Part 5) to fetch all keys in a hash range for Merkle tree comparison, and GetAllKeys, which the tombstone compactor (covered later in this post) uses to scan for removable keys.
The MemoryStorage implementation changes significantly. The internal map now holds slices of versioned values, and Put performs reconciliation on every write:
// pkg/storage/memory.go - evolved
type MemoryStorage struct {
data map[string][]versioning.VersionedValue
mu sync.RWMutex
}
func (m *MemoryStorage) Get(key string) ([]versioning.VersionedValue, error) {
m.mu.RLock()
defer m.mu.RUnlock()
values, exists := m.data[key]
if !exists || len(values) == 0 {
return nil, ErrKeyNotFound
}
result := make([]versioning.VersionedValue, len(values))
copy(result, values)
return result, nil
}
func (m *MemoryStorage) Put(key string, value versioning.VersionedValue) error {
m.mu.Lock()
defer m.mu.Unlock()
existing := m.data[key]
// Combine new value with existing and reconcile
combined := append(existing, value)
reconciled := versioning.ReconcileConcurrent(combined)
m.data[key] = reconciled
return nil
}
The critical line is in Put: rather than blindly overwriting, we combine the new value with all existing versions and run ReconcileConcurrent. This ensures that ancestors are automatically discarded. If the new value descends from all existing versions, only the new value survives. If it's concurrent with an existing version, both are kept. The storage layer never loses a concurrent version - that decision belongs to the application.
Pruning: Keeping Vector Clocks Bounded
There's a subtle problem with vector clocks: they grow without bound. Every node that ever coordinates a write adds an entry to the clock. Over time, as nodes join and leave the cluster, clocks can accumulate entries for nodes that haven't been active in weeks.
The paper addresses this with pruning - when a vector clock exceeds a threshold size (default: 10 entries), the oldest entries are removed:
func (vc *VectorClock) Prune(maxSize int) {
if len(vc.Versions) <= maxSize {
return
}
type entry struct {
node string
count uint64
}
entries := make([]entry, 0, len(vc.Versions))
for node, count := range vc.Versions {
entries = append(entries, entry{node, count})
}
// Sort by counter value (lowest first - these are the oldest)
sort.Slice(entries, func(i, j int) bool {
return entries[i].count < entries[j].count
})
// Remove the oldest entries
toRemove := len(entries) - maxSize
for i := 0; i < toRemove; i++ {
delete(vc.Versions, entries[i].node)
}
}
This is a trade-off. Removing an entry means losing some causal information - the system might fail to detect that two versions are related, treating them as concurrent when one is actually an ancestor. The paper acknowledges this but notes that in practice, with a threshold of 10, the issue rarely arises because most keys are written by a small number of nodes.
The implementation takes a simplified approach: it removes entries with the lowest counter values (assuming those are the stalest). A production implementation would track per-entry timestamps for more accurate age-based eviction. The VectorClockMaxSize configuration parameter (default: 10) controls the threshold.
Tombstones: Deletes Are Writes
In a distributed system, you can't just delete a key by removing it from local storage. If you did, the key's replicas on other nodes would still have the data, and the next anti-entropy synchronization would restore the deleted key - a resurrection bug.
The solution is to treat deletes as writes. Instead of removing the key, we write a special marker called a tombstone - a VersionedValue with nil data and IsTombstone: true:
tombstone := versioning.VersionedValue{
Data: nil,
VectorClock: newClock,
IsTombstone: true,
}
The tombstone carries a vector clock just like any other value. It participates in the same reconciliation logic - a tombstone that descends from a regular value replaces it, and a tombstone concurrent with a regular value results in both being kept (the application sees both and can decide).
The test suite exercises this nuance:
func TestConcurrentTombstoneAndWrite(t *testing.T) {
    store := storage.NewMemoryStorage()
    // Node 1 writes a value: clock {node1:1}
    vc1 := versioning.NewVectorClock()
    vc1.Increment("node1")
    store.Put("key1", versioning.VersionedValue{
        Data: []byte("value1"), VectorClock: vc1,
    })
    // Node 2 writes concurrently: clock {node2:1}
    vc2 := versioning.NewVectorClock()
    vc2.Increment("node2")
    store.Put("key1", versioning.VersionedValue{
        Data: []byte("value2"), VectorClock: vc2,
    })
    // Both are concurrent - 2 versions stored
    values, _ := store.Get("key1")
    if len(values) != 2 {
        t.Fatalf("expected 2 concurrent versions, got %d", len(values))
    }
    // Node 1 deletes (based on its own version): clock {node1:2}
    vc3 := vc1.Copy()
    vc3.Increment("node1")
    store.Put("key1", versioning.VersionedValue{
        VectorClock: vc3, IsTombstone: true,
    })
    // The tombstone {node1:2} dominates "value1" {node1:1}
    // but is concurrent with "value2" {node2:1}
    // → 2 versions remain: the tombstone and "value2"
    values, _ = store.Get("key1")
    if len(values) != 2 {
        t.Fatalf("expected tombstone + value2, got %d versions", len(values))
    }
}
This is correct behavior. Node 1's delete only "saw" its own version, not Node 2's concurrent write. The tombstone properly supersedes Node 1's original value but can't supersede a write it never knew about. The application would receive both during a read and decide what to do.
Tombstone Compaction
Tombstones can't live forever - they'd accumulate indefinitely and consume storage. But they can't be deleted immediately either, because they need time to propagate to all replicas. The TombstoneCompactor handles this lifecycle:
type TombstoneCompactor struct {
storage Storage
ttl time.Duration // default: 7 days
}
func (tc *TombstoneCompactor) Compact() (int, error) {
    keys, err := tc.storage.GetAllKeys()
    if err != nil {
        return 0, err
    }
    cutoff := time.Now().Add(-tc.ttl)
    removedCount := 0
    for _, key := range keys {
        values, err := tc.storage.Get(key)
        if err != nil {
            continue // key may have been removed concurrently
        }
        // Only remove if ALL versions are old tombstones
        allOldTombstones := true
        for _, v := range values {
            if !v.IsTombstone || v.VectorClock.Timestamp.After(cutoff) {
                allOldTombstones = false
                break
            }
        }
        if allOldTombstones {
            if err := tc.storage.Delete(key); err != nil {
                return removedCount, err
            }
            removedCount++
        }
    }
    return removedCount, nil
}
The compactor runs periodically (default: every hour) and removes keys where all versions are tombstones older than the TTL (default: 7 days). The 7-day window gives ample time for tombstones to propagate through gossip, anti-entropy, and hinted handoff. Only when we're confident that every replica has seen the tombstone is it safe to physically delete the key.
Evolving the Config
The Config struct gains three new fields for versioning:
type Config struct {
StorageEngine string
StoragePath string
VirtualNodes int
VectorClockMaxSize int // ← NEW (default: 10)
TombstoneTTL time.Duration // ← NEW (default: 7 days)
TombstoneCompactionInterval time.Duration // ← NEW (default: 1 hour)
}
The Node struct doesn't change in this post - the versioning layer is entirely within the Storage and versioning packages. The node stores and retrieves VersionedValue through its storage engine, but the version management logic is encapsulated below it. In Part 4, when we introduce the coordinator, the Node will gain fields and methods that actively manipulate vector clocks during distributed reads and writes.
What's Next
We now have the ability to track the causal history of every value, detect conflicts when they arise, and let applications resolve them with domain-specific merge logic. But we're still missing the distributed coordination that makes conflicts possible in the first place - the mechanism that fans out writes to N replicas, waits for W acknowledgments, and collects R responses during reads.
In Part 4, we'll build the Coordinator - the state machine at the heart of Dynamo's request processing. We'll introduce quorum parameters (N, R, W), implement the full distributed read and write paths, add read repair for opportunistic consistency, and integrate hinted handoff so writes succeed even when nodes are temporarily down.
The full source code is available at github.com/tripab/toy-dynamo. The versioning implementation lives in pkg/versioning/, with vector clocks in vector_clock.go and reconciliation logic in reconciler.go.