Building Dynamo from scratch in Go: Part 6
Production Concerns - RPC, Resilience, Performance, and Testing
This series
- Part 1: The Big Picture
- Part 2: Partitioning with Consistent Hashing
- Part 3: Vector Clocks and Conflict Resolution
- Part 4: Request Coordination, Quorums, and Failure Handling
- Part 5: Membership, Failure Detection, and Replica Synchronization
- Part 6: Production Concerns - RPC, Resilience, Performance, and Testing (this post)
This is the final post in a 6-part series walking through a from-scratch Go implementation of Amazon's Dynamo. In previous posts, we built the consistent hash ring (Part 2), vector clock versioning (Part 3), quorum-based coordination with hinted handoff (Part 4), and gossip, failure detection, and anti-entropy (Part 5). Now we cover the infrastructure that makes the whole thing robust, observable, and testable.
Outline
- What This Part Covers
- The RPC Layer
- Circuit Breaker
- Retry with Exponential Backoff
- Admission Control
- Coordinator Selection
- Metrics
- Storage Engines Beyond Memory
- Testing Strategy
- Some Features We Could Add
- Wrapping Up
What This Part Covers
The core distributed algorithms are in place. But a distributed system isn't production-ready just because the algorithms are correct. You also need: a reliable communication layer between nodes, mechanisms to prevent cascade failures when one node is slow or down, protection for user-facing latency when background work gets heavy, observability into what the system is actually doing, and confidence that the whole thing works under failure conditions.
This post brings the Node struct to its final form:
type Node struct {
	id                  string
	address             string
	config              *Config
	storage             storage.Storage
	ring                *ring.Ring
	coordinator         *Coordinator
	membership          *membership.Membership
	hintedHandoff       *replication.HintedHandoff
	failDetector        *membership.FailureDetector
	antiEntropy         *synchronization.AntiEntropy
	rpcClient           *rpc.Client
	rpcServer           *rpc.Server
	admissionController *AdmissionController        // NEW
	coordinatorSelector *CoordinatorSelector        // NEW
	metrics             *metrics.Collector          // NEW
	tombstoneCompactor  *storage.TombstoneCompactor // NEW
	stopCh              chan struct{}
	wg                  sync.WaitGroup
	mu                  sync.RWMutex
}
And Config reaches its final form with fields for circuit breakers, retry, admission control, coordinator selection, and metrics.
The RPC Layer
Every cross-node operation - reads, writes, gossip exchanges, hint deliveries, anti-entropy syncs - flows through the RPC layer. The implementation uses HTTP/JSON for simplicity and debuggability. It's not the fastest choice (gRPC with protobuf would be more efficient), but it's easy to inspect with curl, easy to debug with any HTTP tool, and easy to replace later.
The Protocol
The protocol consists of five RPC endpoints, each with a typed request/response pair, plus health and metrics endpoints:
POST /rpc/get -> GetRequest / GetResponse
POST /rpc/put -> PutRequest / PutResponse
POST /rpc/gossip -> GossipRequest / GossipResponse
POST /rpc/sync -> SyncRequest / SyncResponse
POST /rpc/hint -> HintRequest / HintResponse
GET /health -> 200 OK
GET /metrics -> Prometheus text format
Each DTO (Data Transfer Object) is a JSON-serializable version of the internal types. For example, VersionedValueDTO carries the data, vector clock, and tombstone flag in a format that can cross the wire:
type VersionedValueDTO struct {
	Data        []byte            `json:"data,omitempty"`
	VectorClock map[string]uint64 `json:"vector_clock"`
	Timestamp   time.Time         `json:"timestamp"`
	IsTombstone bool              `json:"is_tombstone,omitempty"`
}
Conversion functions bridge between internal types and DTOs:
func FromVersionedValue(vv versioning.VersionedValue) VersionedValueDTO {
	vcMap := make(map[string]uint64)
	if vv.VectorClock != nil {
		for nodeID, counter := range vv.VectorClock.Versions {
			vcMap[nodeID] = counter
		}
	}
	return VersionedValueDTO{
		Data:        vv.Data,
		VectorClock: vcMap,
		Timestamp:   time.Now(),
		IsTombstone: vv.IsTombstone,
	}
}
This separation between internal types and wire types is deliberate. Internal types can evolve (e.g., adding fields to VectorClock) without breaking the RPC protocol, and vice versa.
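The post shows only the internal-to-DTO direction; the reverse conversion used by `ToVersionedValue` in the server handlers is symmetric. Here is a sketch, with the internal types stubbed out locally for illustration (the stub field names are assumptions based on the snippets above, not the actual package definitions):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local stand-ins for the internal types (field names assumed).
type VectorClock struct {
	Versions map[string]uint64
}

type VersionedValue struct {
	Data        []byte
	VectorClock *VectorClock
	IsTombstone bool
}

type VersionedValueDTO struct {
	Data        []byte            `json:"data,omitempty"`
	VectorClock map[string]uint64 `json:"vector_clock"`
	Timestamp   time.Time         `json:"timestamp"`
	IsTombstone bool              `json:"is_tombstone,omitempty"`
}

// ToVersionedValue converts a wire DTO back to the internal representation.
func (dto VersionedValueDTO) ToVersionedValue() VersionedValue {
	return VersionedValue{
		Data:        dto.Data,
		VectorClock: &VectorClock{Versions: dto.VectorClock},
		IsTombstone: dto.IsTombstone,
	}
}

func main() {
	// Round-trip a value through JSON, as it would cross the wire.
	dto := VersionedValueDTO{
		Data:        []byte("hello"),
		VectorClock: map[string]uint64{"node-1": 3},
	}
	b, _ := json.Marshal(dto)
	var decoded VersionedValueDTO
	json.Unmarshal(b, &decoded)
	vv := decoded.ToVersionedValue()
	fmt.Println(string(vv.Data), vv.VectorClock.Versions["node-1"]) // hello 3
}
```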
The Server
The RPC server is a thin HTTP wrapper that delegates to a NodeOperations interface:
type NodeOperations interface {
	LocalGet(key string) ([]versioning.VersionedValue, error)
	LocalPut(key string, value versioning.VersionedValue) error
	HandleGossip(members []MemberDTO) []MemberDTO
	HandleSync(req *SyncRequest) *SyncResponse
	HandleHint(req *HintRequest) error
	GetNodeID() string
}

type Server struct {
	node    NodeOperations
	address string
	server  *http.Server
	mux     *http.ServeMux
}
Handler registration maps each endpoint to its handler:
func (s *Server) registerHandlers() {
	s.mux.HandleFunc("/rpc/get", s.handleGet)
	s.mux.HandleFunc("/rpc/put", s.handlePut)
	s.mux.HandleFunc("/rpc/gossip", s.handleGossip)
	s.mux.HandleFunc("/rpc/sync", s.handleSync)
	s.mux.HandleFunc("/rpc/hint", s.handleHint)
	s.mux.HandleFunc("/health", s.handleHealth)
}
Each handler follows the same pattern: decode JSON request, call the node operation, encode JSON response. The handlePut handler, for instance:
func (s *Server) handlePut(w http.ResponseWriter, r *http.Request) {
	var req PutRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid request")
		return
	}
	value := req.Value.ToVersionedValue()
	err := s.node.LocalPut(req.Key, value)
	resp := PutResponse{Success: err == nil}
	if err != nil {
		resp.Error = err.Error()
	}
	json.NewEncoder(w).Encode(resp)
}
Note that the server calls LocalPut, not Put. This is a local storage operation, not a coordinated write. The coordinator on the originating node handles the quorum logic; the remote nodes just store what they're told. This avoids recursive coordination - a coordinator that sends a write to a remote node doesn't want that node to run its own quorum protocol.
The server is configured with sensible timeouts:
s.server = &http.Server{
	Addr:         s.address,
	Handler:      s.mux,
	ReadTimeout:  10 * time.Second,
	WriteTimeout: 10 * time.Second,
	IdleTimeout:  60 * time.Second,
}
The Client
The RPC client makes HTTP calls with context-based timeouts and connection pooling:
type Client struct {
	httpClient *http.Client
	timeout    time.Duration
}

func NewClient(timeout time.Duration) *Client {
	return &Client{
		httpClient: &http.Client{
			Transport: &http.Transport{
				MaxIdleConns:        100,
				MaxIdleConnsPerHost: 10,
				IdleConnTimeout:     90 * time.Second,
			},
		},
		timeout: timeout,
	}
}
Connection pooling (MaxIdleConns, MaxIdleConnsPerHost) is important. Without it, every RPC call would open a new TCP connection, which adds latency from the three-way handshake and puts pressure on the OS's file descriptor limit. With pooling, connections are reused across calls to the same node.
Circuit Breaker
When a remote node is slow or failing, repeatedly sending it requests doesn't just waste time - it can create a cascade. Your goroutines pile up waiting for timeouts, your thread pool fills, and soon your node is slow too. The circuit breaker pattern prevents this by failing fast once a node is determined to be unhealthy.
The implementation follows the standard three-state model:
type CircuitBreaker struct {
	config          *CircuitBreakerConfig
	state           CircuitState // Closed, Open, HalfOpen
	failures        int
	successes       int // in half-open state
	halfOpenReqs    int // probes admitted in half-open state
	lastFailureTime time.Time
	mu              sync.Mutex
}
Closed (normal operation): Requests flow through normally. Each failure increments a counter. When failures hit the threshold (default: 5 consecutive), the circuit opens.
Open (fail-fast): All requests are immediately rejected with ErrCircuitOpen. No RPC call is made. After a cooldown period (ResetTimeout, default: 30 seconds), the circuit transitions to half-open.
Half-open (recovery testing): A limited number of requests (default: 3) are allowed through as probes. If enough succeed (default: 2), the circuit closes and normal operation resumes. If any fail, the circuit opens again.
func (cb *CircuitBreaker) Allow() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	switch cb.state {
	case StateClosed:
		return true
	case StateOpen:
		if time.Since(cb.lastFailureTime) >= cb.config.ResetTimeout {
			cb.transitionToHalfOpen()
			return true
		}
		return false
	case StateHalfOpen:
		if cb.halfOpenReqs < cb.config.HalfOpenMaxRequests {
			cb.halfOpenReqs++
			return true
		}
		return false
	}
	return false
}
The CircuitBreakerManager maintains a per-address circuit breaker, so one failing node doesn't affect calls to healthy nodes:
type CircuitBreakerManager struct {
	config   *CircuitBreakerConfig
	breakers map[string]*CircuitBreaker // address -> breaker
	mu       sync.RWMutex
}
The manager uses a double-checked locking pattern to minimize contention - a fast path with a read lock, falling back to a write lock only to create new breakers.
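The double-checked pattern is worth seeing concretely. A sketch of what such a lookup could look like (the method name `GetBreaker` is an assumption; the breaker body is elided):

```go
package main

import (
	"fmt"
	"sync"
)

type CircuitBreaker struct{ /* state elided */ }

type CircuitBreakerManager struct {
	breakers map[string]*CircuitBreaker
	mu       sync.RWMutex
}

// GetBreaker returns the breaker for an address, creating it on first use.
// Fast path: read lock only. Slow path: write lock, re-check, then create.
func (m *CircuitBreakerManager) GetBreaker(address string) *CircuitBreaker {
	m.mu.RLock()
	cb, ok := m.breakers[address]
	m.mu.RUnlock()
	if ok {
		return cb
	}

	m.mu.Lock()
	defer m.mu.Unlock()
	// Re-check: another goroutine may have created it between the two locks.
	if cb, ok := m.breakers[address]; ok {
		return cb
	}
	cb = &CircuitBreaker{}
	m.breakers[address] = cb
	return cb
}

func main() {
	m := &CircuitBreakerManager{breakers: make(map[string]*CircuitBreaker)}
	a := m.GetBreaker("10.0.0.1:8080")
	b := m.GetBreaker("10.0.0.1:8080")
	fmt.Println(a == b) // true: same breaker instance for the same address
}
```

The re-check under the write lock is what makes the pattern safe: without it, two goroutines racing on a cold address would each install their own breaker, and failure counts would split between them.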
Retry with Exponential Backoff
Transient failures (a brief network hiccup, a garbage collection pause on the remote node) shouldn't cause permanent errors. The retry layer wraps RPC calls with configurable retry logic:
type RetryConfig struct {
	MaxRetries        int           // default: 3
	InitialBackoff    time.Duration // default: 100ms
	MaxBackoff        time.Duration // default: 5s
	BackoffMultiplier float64       // default: 2.0
	Jitter            float64       // default: 0.2 (20%)
	RetryableErrors   []error       // which errors trigger retry
}
The backoff calculation uses exponential growth with jitter:
func (r *Retryer) calculateBackoff(attempt int) time.Duration {
	// Exponential: 100ms, 200ms, 400ms, 800ms, ...
	backoff := float64(r.config.InitialBackoff) *
		math.Pow(r.config.BackoffMultiplier, float64(attempt))
	// Jitter: ±20% randomization
	if r.config.Jitter > 0 {
		jitterRange := backoff * r.config.Jitter
		backoff = backoff + (rand.Float64()*2-1)*jitterRange
	}
	// Cap at max
	if backoff > float64(r.config.MaxBackoff) {
		backoff = float64(r.config.MaxBackoff)
	}
	return time.Duration(backoff)
}
The jitter is crucial. Without it, all clients that failed at the same time would retry at the same time, creating a thundering herd that overwhelms the recovering node. Random jitter spreads retries across time, giving the node breathing room.
Composing Circuit Breaker and Retry
The RetryableOperation struct composes both patterns into a single call:
func (ro *RetryableOperation) Execute(ctx context.Context,
	operation func() error) error {
	// Check circuit breaker first - fail fast if open
	if ro.circuitBreaker != nil && !ro.circuitBreaker.Allow() {
		return ErrCircuitOpen
	}
	// Execute with retry
	err := ro.retryer.Do(ctx, operation)
	// Update circuit breaker based on result
	if ro.circuitBreaker != nil {
		if err != nil {
			ro.circuitBreaker.RecordFailure()
		} else {
			ro.circuitBreaker.RecordSuccess()
		}
	}
	return err
}
The order matters: circuit breaker check before retry. If the circuit is open, we don't even attempt the operation. This prevents retries from piling up against a known-dead node.
Admission Control
Background tasks - gossip, anti-entropy, hinted handoff, tombstone compaction - consume CPU, memory, and network bandwidth. Under light load, this is fine. But during peak traffic, these background tasks compete with user-facing reads and writes for the same resources, potentially inflating tail latency.
The admission controller monitors foreground request latency and throttles background work when latency rises above a threshold:
type AdmissionController struct {
	latencyThreshold time.Duration   // default: 100ms (p99)
	maxSlots         int             // default: 10
	minSlots         int             // default: 1
	windowSize       int             // default: 1000 samples
	backgroundSlots  int             // current allowed background tasks
	latencies        []time.Duration // rolling window
	latencyIndex     int
	latencyCount     int
	mu               sync.RWMutex
}
Every foreground request records its latency:
func (ac *AdmissionController) RecordLatency(latency time.Duration) {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	ac.latencies[ac.latencyIndex] = latency
	ac.latencyIndex = (ac.latencyIndex + 1) % ac.windowSize
	if ac.latencyCount < ac.windowSize {
		ac.latencyCount++
	}
}
When a background task wants to run, it checks AllowBackground:
func (ac *AdmissionController) AllowBackground() bool {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	if ac.latencyCount < 10 {
		return true // not enough data yet
	}
	p99 := ac.calculateP99Locked()
	if p99 > ac.latencyThreshold {
		// Foreground under stress - reduce background slots
		ac.backgroundSlots = max(ac.minSlots, ac.backgroundSlots-1)
	} else {
		// Foreground healthy - allow more background work
		ac.backgroundSlots = min(ac.maxSlots, ac.backgroundSlots+1)
	}
	return ac.backgroundSlots > 0
}
The mechanism is adaptive. Background slots gradually decrease when latency is high and gradually increase when latency is healthy. The minSlots floor (default: 1) ensures that background work never stops entirely - you always need at least some gossip and anti-entropy running to maintain cluster health.
The p99 calculation sorts the rolling window samples and picks the 99th percentile - a stand-in for the 99.9th-percentile SLA metric the Dynamo paper emphasizes:
func (ac *AdmissionController) calculateP99Locked() time.Duration {
	samples := make([]time.Duration, ac.latencyCount)
	copy(samples, ac.latencies[:ac.latencyCount])
	sortDurations(samples)
	p99Index := int(float64(len(samples)-1) * 0.99)
	return samples[p99Index]
}
Every background loop checks admission control before proceeding:
func (n *Node) gossipLoop() {
	// ...
	for {
		select {
		case <-ticker.C:
			if n.admissionController != nil &&
				!n.admissionController.AllowBackground() {
				continue // skip this round
			}
			n.membership.Gossip()
		case <-n.stopCh:
			return
		}
	}
}
Coordinator Selection
By default, the first node in the preference list coordinates the write. But that node might be slower than others due to load, garbage collection, or network conditions. The coordinator selector tracks per-node latencies and reorders the preference list so the fastest node coordinates the write:
type CoordinatorSelector struct {
	trackers   map[string]*LatencyTracker
	windowSize int
	mu         sync.RWMutex
}

type LatencyTracker struct {
	samples    []time.Duration
	index      int
	count      int
	windowSize int // default: 100
	mu         sync.Mutex
}
ReorderByLatency sorts the preference list by measured p99, with nodes that have latency data sorted ahead of unknowns:
func (cs *CoordinatorSelector) ReorderByLatency(
	preferenceList []string) []string {
	// ... build entries with p99 from trackers ...
	sort.SliceStable(entries, func(i, j int) bool {
		// Known-fast nodes first
		if entries[i].hasData != entries[j].hasData {
			return entries[i].hasData
		}
		// Among known nodes, sort by latency
		if entries[i].hasData && entries[j].hasData {
			return entries[i].latency < entries[j].latency
		}
		// Unknown nodes keep original order
		return entries[i].originalRank < entries[j].originalRank
	})
	// ...
}
This is a performance optimization targeted at the 99.9th percentile - exactly the metric the paper says Amazon optimizes for. By routing writes away from slow nodes, tail latency drops significantly.
Metrics
You can't manage what you can't measure. The metrics collector provides Prometheus-compatible observability across every subsystem:
type Collector struct {
	nodeID string

	// Request metrics
	RequestDuration *Histogram
	RequestsTotal   *Counter

	// Quorum metrics
	QuorumTotal            *Counter
	DivergentVersionsTotal *Counter

	// Hinted handoff metrics
	HintsPending     *Gauge
	HintsDelivered   *Counter
	HintsStoredTotal *Counter

	// Gossip metrics
	GossipRoundsTotal  *Counter
	GossipDuration     *Histogram
	ClusterMemberCount *Gauge

	// Anti-entropy metrics
	AntiEntropySyncsTotal *Counter
	AntiEntropyKeysSynced *Counter
	AntiEntropyDuration   *Histogram

	// Read repair metrics
	ReadRepairsTotal  *Counter
	ReadRepairsFailed *Counter

	// Background throttling
	BackgroundThrottled *Gauge
}
The Render method produces Prometheus text exposition format, served at /metrics:
# HELP dynamo_request_duration_seconds Latency of client requests in seconds.
# TYPE dynamo_request_duration_seconds histogram
dynamo_request_duration_seconds_bucket{le="0.001"} 42
dynamo_request_duration_seconds_bucket{le="0.005"} 187
...
dynamo_request_duration_seconds_count 1523
dynamo_request_duration_seconds_sum 12.847
# HELP dynamo_hints_pending Number of hints pending delivery per target node.
# TYPE dynamo_hints_pending gauge
dynamo_hints_pending 3
Three metric types - counters (monotonically increasing), gauges (current value), and histograms (distribution with configurable buckets) - cover the essentials. The coordinator instruments every operation:
func (c *Coordinator) Put(ctx context.Context, ...) error {
	start := time.Now()
	// ... quorum write ...
	duration := time.Since(start)
	if c.node.metrics != nil {
		c.node.metrics.RequestDuration.Observe(duration.Seconds())
		c.node.metrics.RequestsTotal.Inc()
	}
	if c.node.admissionController != nil {
		c.node.admissionController.RecordLatency(duration)
	}
}
These metrics feed dashboards, alerts, and capacity planning. In a real deployment, you'd scrape /metrics with Prometheus and visualize with Grafana.
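The two simplest metric types are tiny when built on `sync/atomic`. A minimal sketch of what `Counter` and `Gauge` could look like (method names beyond `Inc` are assumptions; the real collector also carries names and help text for rendering):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is monotonically increasing; safe for concurrent use.
type Counter struct{ value atomic.Uint64 }

func (c *Counter) Inc()          { c.value.Add(1) }
func (c *Counter) Value() uint64 { return c.value.Load() }

// Gauge holds a current value that can move both ways.
type Gauge struct{ value atomic.Int64 }

func (g *Gauge) Set(v int64)  { g.value.Store(v) }
func (g *Gauge) Value() int64 { return g.value.Load() }

func main() {
	var requests Counter
	for i := 0; i < 3; i++ {
		requests.Inc()
	}
	var pending Gauge
	pending.Set(7)
	// Prometheus text exposition for these two types is just "name value":
	fmt.Printf("dynamo_requests_total %d\n", requests.Value())
	fmt.Printf("dynamo_hints_pending %d\n", pending.Value())
}
```

Atomics keep the hot path lock-free; histograms are the only type that needs more machinery (a bucket array plus a sum and count).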
Storage Engines Beyond Memory
Parts 1–5 used MemoryStorage — a map with reconciliation. It's perfect for testing but obviously won't survive a restart. The implementation includes three additional storage backends.
Log-Structured Storage (LSS)
The LSS engine is a custom log-structured storage system built from scratch:
type LSSEngine struct {
	config    *Config
	index     *Index     // in-memory key -> segment+offset
	segments  []*Segment // immutable data segments
	activeWAL *Segment   // current write-ahead log
	nextSegID uint64
	compactor *Compactor
}
Writes go to the active WAL (Write-Ahead Log) — an append-only segment. When the WAL reaches its size limit, it's sealed as an immutable segment and a new WAL starts. Reads consult the in-memory index, which maps each key to the segment and offset where its latest value lives.
The Compactor runs in the background, merging smaller segments into larger ones to reclaim space from overwritten or deleted keys. The Recovery module rebuilds the index on startup by replaying all segments.
This is a simplified version of the storage architecture used by LevelDB, RocksDB, and Cassandra. It's optimized for write-heavy workloads - writes are always sequential (append-only), and reads are usually a single index lookup plus a seek.
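The read/write path is easier to see in miniature. This toy version keeps the "log" in memory rather than on disk, so it illustrates the shape of the technique (sequential appends, index pointing at the latest offset, compaction dropping garbage) rather than the real engine:

```go
package main

import "fmt"

type entry struct {
	key   string
	value []byte
}

// LogStore: an in-memory stand-in for WAL + sealed segments.
type LogStore struct {
	log   []entry        // append-only
	index map[string]int // key -> offset of latest value in the log
}

func NewLogStore() *LogStore {
	return &LogStore{index: make(map[string]int)}
}

// Put appends (sequential write) and points the index at the new offset.
func (s *LogStore) Put(key string, value []byte) {
	s.log = append(s.log, entry{key, value})
	s.index[key] = len(s.log) - 1
}

// Get is one index lookup plus one log access.
func (s *LogStore) Get(key string) ([]byte, bool) {
	off, ok := s.index[key]
	if !ok {
		return nil, false
	}
	return s.log[off].value, true
}

// Compact rewrites the log keeping only the live entry for each key,
// reclaiming space left by overwritten values.
func (s *LogStore) Compact() {
	var compacted []entry
	newIndex := make(map[string]int)
	for key, off := range s.index {
		compacted = append(compacted, s.log[off])
		newIndex[key] = len(compacted) - 1
	}
	s.log, s.index = compacted, newIndex
}

func main() {
	s := NewLogStore()
	s.Put("k", []byte("v1"))
	s.Put("k", []byte("v2")) // overwrite appends; the old value becomes garbage
	before := len(s.log)
	s.Compact()
	v, _ := s.Get("k")
	fmt.Println(before, len(s.log), string(v)) // 2 1 v2
}
```

The real engine adds durability (fsync'd WAL), sealed on-disk segments, and recovery-by-replay, but the invariant is the same: the index always points at the newest copy of each key.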
BoltDB and BadgerDB
For users who want mature, battle-tested storage without building their own, the implementation provides adapters for BoltDB and BadgerDB:
func NewStorage(engineType string, path string, nodeID string) (Storage, error) {
	switch engineType {
	case "memory":
		return NewMemoryStorage(), nil
	case "lss":
		return lss.NewLSSEngine(lss.DefaultConfig(path))
	case "boltdb":
		return NewBoltDBStorage(path + "/" + nodeID + ".db")
	case "badger":
		return NewBadgerStorage(path + "/" + nodeID)
	default:
		return NewMemoryStorage(), nil
	}
}
All four backends implement the same Storage interface. The rest of the system - the coordinator, replicator, anti-entropy, hinted handoff - doesn't know or care which engine is underneath. You can swap engines by changing a single config field.
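The post doesn't reproduce the interface itself, so here is a plausible shape (the method set is an assumption; `VersionedValue` is stubbed locally to keep the sketch self-contained), along with the compile-time conformance check each backend file can carry:

```go
package main

import "fmt"

// VersionedValue stands in for versioning.VersionedValue.
type VersionedValue struct {
	Data []byte
}

// Storage: a plausible shape for the shared interface all four engines
// implement (method set assumed for illustration).
type Storage interface {
	Get(key string) ([]VersionedValue, error)
	Put(key string, value VersionedValue) error
	Close() error
}

// MemoryStorage: the trivial backend, reduced to the essentials.
type MemoryStorage struct {
	data map[string][]VersionedValue
}

func NewMemoryStorage() *MemoryStorage {
	return &MemoryStorage{data: make(map[string][]VersionedValue)}
}

func (m *MemoryStorage) Get(key string) ([]VersionedValue, error) {
	return m.data[key], nil
}

func (m *MemoryStorage) Put(key string, value VersionedValue) error {
	m.data[key] = []VersionedValue{value}
	return nil
}

func (m *MemoryStorage) Close() error { return nil }

// Compile-time check: fails to build if MemoryStorage drifts from the interface.
var _ Storage = (*MemoryStorage)(nil)

func main() {
	var s Storage = NewMemoryStorage() // callers only ever see the interface
	s.Put("k", VersionedValue{Data: []byte("v")})
	vals, _ := s.Get("k")
	fmt.Println(string(vals[0].Data)) // v
}
```

The `var _ Storage = (*T)(nil)` idiom is the standard Go way to make engine/interface drift a build error instead of a runtime surprise.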
Testing Strategy
A distributed system needs testing at multiple levels. The implementation has three tiers.
Unit Tests
Per-component tests that verify each subsystem in isolation. The ring tests check load distribution, preference list correctness, and node removal behavior. Vector clock tests exercise all four comparison outcomes (Before, After, Equal, Concurrent) and merge semantics. The coordinator tests run a real single-node cluster and verify end-to-end read-modify-write cycles:
func TestCoordinatorSequentialWrites_VersionsIncrement(t *testing.T) {
	node, _ := startTestNode(t, "coord-seq-1")
	defer node.Stop()
	ctx := context.Background()
	var lastCtx *dynamo.Context
	for i := 1; i <= 5; i++ {
		node.Put(ctx, "counter", []byte(fmt.Sprintf("v%d", i)), lastCtx)
		result, _ := node.Get(ctx, "counter")
		lastCtx = result.Context
	}
	result, _ := node.Get(ctx, "counter")
	// Expect 1 version with vector clock counter == 5
}
Circuit breaker and retry tests verify state transitions and backoff calculation. Admission control tests inject latency samples and verify throttling behavior.
Integration Tests
Multi-node cluster tests that verify the system works end-to-end. Five test files cover the major scenarios:
cluster_test.go — Creates a 3-node cluster, writes data through one node, reads from another, verifies replication.
consistency_test.go — Writes with different quorum settings, verifies that R + W > N guarantees strong consistency while R + W ≤ N allows stale reads.
failure_detection_test.go — Starts a cluster, kills a node, verifies it's detected as dead, restarts it, verifies recovery.
hinted_handoff_test.go — Writes while a node is down, verifies hints are stored, brings the node back, verifies hints are delivered.
partition_test.go — Simulates network partitions between groups of nodes, verifies that writes succeed on the majority side and that partitioned nodes rejoin cleanly.
Performance Benchmarks
Go benchmarks measure the latency of individual operations:
func BenchmarkPut(b *testing.B) {
	node, _ := dynamo.NewNode("bench-node", "localhost:9000", config)
	node.Start()
	defer node.Stop()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		key := fmt.Sprintf("key%d", i)
		node.Put(ctx, key, []byte("value"), nil)
	}
}

func BenchmarkGet(b *testing.B) {
	// Pre-populate 1000 keys, then benchmark reads
}

func BenchmarkVectorClockCompare(b *testing.B) {
	// Benchmark the hot-path comparison operation
}

func BenchmarkMerkleTreeBuild(b *testing.B) {
	// Benchmark tree construction with varying dataset sizes
}
These benchmarks help identify regressions and guide optimization. The vector clock comparison and Merkle tree construction benchmarks are particularly important because they run on the hot path of every read and every anti-entropy cycle.
Some Features We Could Add
The implementation is a faithful rendition of the core Dynamo paper, but production systems go further. Here are some natural extensions:
gRPC migration. Replace HTTP/JSON with gRPC and protobuf for lower serialization overhead and HTTP/2 multiplexing. The NodeOperations interface makes this a drop-in replacement.
Compression. Compress values before replication to reduce network bandwidth. Snappy or LZ4 add minimal CPU overhead for substantial bandwidth savings.
Encryption. TLS for inter-node traffic, encryption at rest for storage engines. The paper assumes a trusted internal network, but real deployments often need encryption.
Cross-datacenter replication. The paper mentions multi-datacenter deployment but doesn't detail it. A production extension would add datacenter-aware preference lists and asynchronous cross-DC replication.
Dynamic quorum. Instead of fixed N/R/W, adjust quorum sizes based on observed availability. If one datacenter is unreachable, temporarily lower W to maintain write availability.
Read-your-writes consistency. Track which vector clock a client has seen and ensure subsequent reads from that client return at least that version, even if routed to a different node.
Wrapping Up
Over six posts, we've built a complete distributed key-value store from the ground up:
| Part | Component | What It Does |
|---|---|---|
| 1 | Node, Storage | Single-node key-value store with pluggable backends |
| 2 | Ring | Consistent hashing for data partitioning across nodes |
| 3 | VectorClock, VersionedValue | Causality tracking and conflict detection |
| 4 | Coordinator, HintedHandoff | Distributed reads/writes with quorum semantics |
| 5 | Membership, FailureDetector, AntiEntropy | Gossip discovery, phi accrual detection, Merkle tree sync |
| 6 | RPC, CircuitBreaker, AdmissionController, Metrics | Communication, resilience, performance, observability |
Each subsystem is relatively simple on its own. The consistent hash ring is a sorted slice with binary search. Vector clocks are maps of counters. The coordinator is a stateless fan-out with a timeout. Gossip is a random peer exchange. The Merkle tree is recursive hashing. The circuit breaker is a three-state machine.
What makes Dynamo powerful is how these pieces compose. A write fans out through the coordinator, which consults the ring for placement, increments a vector clock for versioning, uses the RPC layer with circuit breakers for communication, stores hints via hinted handoff for failed nodes, and gets replicated to surviving nodes. A failure is detected by the phi accrual detector, propagated by gossip, worked around by sloppy quorum, and repaired by anti-entropy with Merkle trees - all while the admission controller protects foreground latency and the metrics collector records everything for operators.
The full system is about 8,000 lines of Go. Not small, but not enormous either - and every line maps to a concept from the paper. If you've followed along, you now have a working understanding of how one of the most influential distributed systems papers translates into real, running code.
The full source code is available at github.com/tripab/toy-dynamo.