Building Dynamo from scratch in Go: Part 6
Production Concerns - RPC, Resilience, Performance, and Testing
This series
- Part 1: The Big Picture
- Part 2: Partitioning with Consistent Hashing
- Part 3: Vector Clocks and Conflict Resolution
- Part 4: Request Coordination, Quorums, and Failure Handling
- Part 5: Membership, Failure Detection, and Replica Synchronization
- Part 6: Production Concerns - RPC, Resilience, Performance, and Testing (this post)
This is the final post in a 6-part series walking through a from-scratch Go implementation of Amazon's Dynamo. In previous posts, we built the consistent hash ring (Part 2), vector clock versioning (Part 3), quorum-based coordination with hinted handoff (Part 4), and gossip, failure detection, and anti-entropy (Part 5). Now we cover the infrastructure that makes the whole thing robust, observable, and testable.
Outline
- What This Part Covers
- The RPC Layer
- Circuit Breaker
- Retry with Exponential Backoff
- Admission Control
- Coordinator Selection
- Metrics
- Storage Engines Beyond Memory
- Testing Strategy
- Some Features We Could Add
- Wrapping Up
What This Part Covers
The core distributed algorithms are in place. But a distributed system isn't production-ready just because the algorithms are correct. You also need: a reliable communication layer between nodes, mechanisms to prevent cascade failures when one node is slow or down, protection for user-facing latency when background work gets heavy, observability into what the system is actually doing, and confidence that the whole thing works under failure conditions.
This post brings the Node struct to its final form:
type Node struct {
	id                  string
	address             string
	config              *Config
	storage             storage.Storage
	ring                *ring.Ring
	coordinator         *Coordinator
	membership          *membership.Membership
	hintedHandoff       *replication.HintedHandoff
	failDetector        *membership.FailureDetector
	antiEntropy         *synchronization.AntiEntropy
	rpcClient           *rpc.Client
	rpcServer           *rpc.Server
	admissionController *AdmissionController        // NEW
	coordinatorSelector *CoordinatorSelector        // NEW
	metrics             *metrics.Collector          // NEW
	tombstoneCompactor  *storage.TombstoneCompactor // NEW
	stopCh              chan struct{}
	wg                  sync.WaitGroup
	mu                  sync.RWMutex
}
And Config reaches its final form with fields for circuit breakers, retry, admission control, coordinator selection, and metrics.
The RPC Layer
Every cross-node operation - reads, writes, gossip exchanges, hint deliveries, anti-entropy syncs - flows through the RPC layer. The implementation uses HTTP/JSON for simplicity and debuggability. It's not the fastest choice (gRPC with protobuf would be more efficient), but it's easy to inspect with curl, easy to debug with any HTTP tool, and easy to replace later.
The Protocol
The protocol consists of five RPC endpoints, each with a typed request/response pair, plus health and metrics endpoints:
POST /rpc/get -> GetRequest / GetResponse
POST /rpc/put -> PutRequest / PutResponse
POST /rpc/gossip -> GossipRequest / GossipResponse
POST /rpc/sync -> SyncRequest / SyncResponse
POST /rpc/hint -> HintRequest / HintResponse
GET /health -> 200 OK
GET /metrics -> Prometheus text format
Each DTO (Data Transfer Object) is a JSON-serializable version of the internal types. For example, VersionedValueDTO carries the data, vector clock, and tombstone flag in a format that can cross the wire:
type VersionedValueDTO struct {
	Data        []byte            `json:"data,omitempty"`
	VectorClock map[string]uint64 `json:"vector_clock"`
	Timestamp   time.Time         `json:"timestamp"`
	IsTombstone bool              `json:"is_tombstone,omitempty"`
}
Conversion functions bridge between internal types and DTOs:
func FromVersionedValue(vv versioning.VersionedValue) VersionedValueDTO {
	vcMap := make(map[string]uint64)
	if vv.VectorClock != nil {
		for nodeID, counter := range vv.VectorClock.Versions {
			vcMap[nodeID] = counter
		}
	}
	return VersionedValueDTO{
		Data:        vv.Data,
		VectorClock: vcMap,
		Timestamp:   time.Now(),
		IsTombstone: vv.IsTombstone,
	}
}
This separation between internal types and wire types is deliberate. Internal types can evolve (e.g., adding fields to VectorClock) without breaking the RPC protocol, and vice versa.
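The post shows only the internal-to-DTO direction; the reverse conversion used by `ToVersionedValue` in the server handlers is symmetric. Here is a sketch, with the internal types stubbed out locally for illustration (the stub field names are assumptions based on the snippets above, not the actual package definitions):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local stand-ins for the internal types (field names assumed).
type VectorClock struct {
	Versions map[string]uint64
}

type VersionedValue struct {
	Data        []byte
	VectorClock *VectorClock
	IsTombstone bool
}

type VersionedValueDTO struct {
	Data        []byte            `json:"data,omitempty"`
	VectorClock map[string]uint64 `json:"vector_clock"`
	Timestamp   time.Time         `json:"timestamp"`
	IsTombstone bool              `json:"is_tombstone,omitempty"`
}

// ToVersionedValue converts a wire DTO back to the internal representation.
func (dto VersionedValueDTO) ToVersionedValue() VersionedValue {
	return VersionedValue{
		Data:        dto.Data,
		VectorClock: &VectorClock{Versions: dto.VectorClock},
		IsTombstone: dto.IsTombstone,
	}
}

func main() {
	// Round-trip a value through JSON, as it would cross the wire.
	dto := VersionedValueDTO{
		Data:        []byte("hello"),
		VectorClock: map[string]uint64{"node-1": 3},
	}
	b, _ := json.Marshal(dto)
	var decoded VersionedValueDTO
	json.Unmarshal(b, &decoded)
	vv := decoded.ToVersionedValue()
	fmt.Println(string(vv.Data), vv.VectorClock.Versions["node-1"]) // hello 3
}
```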
The Server
The RPC server is a thin HTTP wrapper that delegates to a NodeOperations interface:
type NodeOperations interface {
	LocalGet(key string) ([]versioning.VersionedValue, error)
	LocalPut(key string, value versioning.VersionedValue) error
	HandleGossip(members []MemberDTO) []MemberDTO
	HandleSync(req *SyncRequest) *SyncResponse
	HandleHint(req *HintRequest) error
	GetNodeID() string
}

type Server struct {
	node    NodeOperations
	address string
	server  *http.Server
	mux     *http.ServeMux
}
Handler registration maps each endpoint to its handler:
func (s *Server) registerHandlers() {
	s.mux.HandleFunc("/rpc/get", s.handleGet)
	s.mux.HandleFunc("/rpc/put", s.handlePut)
	s.mux.HandleFunc("/rpc/gossip", s.handleGossip)
	s.mux.HandleFunc("/rpc/sync", s.handleSync)
	s.mux.HandleFunc("/rpc/hint", s.handleHint)
	s.mux.HandleFunc("/health", s.handleHealth)
}
Each handler follows the same pattern: decode JSON request, call the node operation, encode JSON response. The handlePut handler, for instance:
func (s *Server) handlePut(w http.ResponseWriter, r *http.Request) {
	var req PutRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid request")
		return
	}
	value := req.Value.ToVersionedValue()
	err := s.node.LocalPut(req.Key, value)
	resp := PutResponse{Success: err == nil}
	if err != nil {
		resp.Error = err.Error()
	}
	json.NewEncoder(w).Encode(resp)
}
Note that the server calls LocalPut, not Put. This is a local storage operation, not a coordinated write. The coordinator on the originating node handles the quorum logic; the remote nodes just store what they're told. This avoids recursive coordination - a coordinator that sends a write to a remote node doesn't want that node to run its own quorum protocol.
The server is configured with sensible timeouts:
s.server = &http.Server{
	Addr:         s.address,
	Handler:      s.mux,
	ReadTimeout:  10 * time.Second,
	WriteTimeout: 10 * time.Second,
	IdleTimeout:  60 * time.Second,
}
The Client
The RPC client makes HTTP calls with context-based timeouts and connection pooling:
type Client struct {
	httpClient *http.Client
	timeout    time.Duration
}

func NewClient(timeout time.Duration) *Client {
	return &Client{
		httpClient: &http.Client{
			Transport: &http.Transport{
				MaxIdleConns:        100,
				MaxIdleConnsPerHost: 10,
				IdleConnTimeout:     90 * time.Second,
			},
		},
		timeout: timeout,
	}
}
Connection pooling (MaxIdleConns, MaxIdleConnsPerHost) is important. Without it, every RPC call would open a new TCP connection, which adds latency from the three-way handshake and puts pressure on the OS's file descriptor limit. With pooling, connections are reused across calls to the same node.
Circuit Breaker
When a remote node is slow or failing, repeatedly sending it requests doesn't just waste time - it can create a cascade. Your goroutines pile up waiting for timeouts, your thread pool fills, and soon your node is slow too. The circuit breaker pattern prevents this by failing fast once a node is determined to be unhealthy.
The implementation follows the standard three-state model:
type CircuitBreaker struct {
	config          *CircuitBreakerConfig
	state           CircuitState // Closed, Open, HalfOpen
	failures        int
	successes       int // in half-open state
	halfOpenReqs    int // probes admitted in half-open state
	lastFailureTime time.Time
	mu              sync.Mutex
}
Closed (normal operation): Requests flow through normally. Each failure increments a counter. When failures hit the threshold (default: 5 consecutive), the circuit opens.
Open (fail-fast): All requests are immediately rejected with ErrCircuitOpen. No RPC call is made. After a cooldown period (ResetTimeout, default: 30 seconds), the circuit transitions to half-open.
Half-open (recovery testing): A limited number of requests (default: 3) are allowed through as probes. If enough succeed (default: 2), the circuit closes and normal operation resumes. If any fail, the circuit opens again.
func (cb *CircuitBreaker) Allow() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	switch cb.state {
	case StateClosed:
		return true
	case StateOpen:
		if time.Since(cb.lastFailureTime) >= cb.config.ResetTimeout {
			cb.transitionToHalfOpen()
			return true
		}
		return false
	case StateHalfOpen:
		if cb.halfOpenReqs < cb.config.HalfOpenMaxRequests {
			cb.halfOpenReqs++
			return true
		}
		return false
	}
	return false
}
The CircuitBreakerManager maintains a per-address circuit breaker, so one failing node doesn't affect calls to healthy nodes:
type CircuitBreakerManager struct {
	config   *CircuitBreakerConfig
	breakers map[string]*CircuitBreaker // address -> breaker
	mu       sync.RWMutex
}
The manager uses a double-checked locking pattern to minimize contention - a fast path with a read lock, falling back to a write lock only to create new breakers.
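The double-checked pattern is worth seeing concretely. A sketch of what such a lookup could look like (the method name `GetBreaker` is an assumption; the breaker body is elided):

```go
package main

import (
	"fmt"
	"sync"
)

type CircuitBreaker struct{ /* state elided */ }

type CircuitBreakerManager struct {
	breakers map[string]*CircuitBreaker
	mu       sync.RWMutex
}

// GetBreaker returns the breaker for an address, creating it on first use.
// Fast path: read lock only. Slow path: write lock, re-check, then create.
func (m *CircuitBreakerManager) GetBreaker(address string) *CircuitBreaker {
	m.mu.RLock()
	cb, ok := m.breakers[address]
	m.mu.RUnlock()
	if ok {
		return cb
	}

	m.mu.Lock()
	defer m.mu.Unlock()
	// Re-check: another goroutine may have created it between the two locks.
	if cb, ok := m.breakers[address]; ok {
		return cb
	}
	cb = &CircuitBreaker{}
	m.breakers[address] = cb
	return cb
}

func main() {
	m := &CircuitBreakerManager{breakers: make(map[string]*CircuitBreaker)}
	a := m.GetBreaker("10.0.0.1:8080")
	b := m.GetBreaker("10.0.0.1:8080")
	fmt.Println(a == b) // true: same breaker instance for the same address
}
```

The re-check under the write lock is what makes the pattern safe: without it, two goroutines racing on a cold address would each install their own breaker, and failure counts would split between them.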
Retry with Exponential Backoff
Transient failures (a brief network hiccup, a garbage collection pause on the remote node) shouldn't cause permanent errors. The retry layer wraps RPC calls with configurable retry logic:
type RetryConfig struct {
	MaxRetries        int           // default: 3
	InitialBackoff    time.Duration // default: 100ms
	MaxBackoff        time.Duration // default: 5s
	BackoffMultiplier float64       // default: 2.0
	Jitter            float64       // default: 0.2 (20%)
	RetryableErrors   []error       // which errors trigger retry
}
The backoff calculation uses exponential growth with jitter:
func (r *Retryer) calculateBackoff(attempt int) time.Duration {
	// Exponential: 100ms, 200ms, 400ms, 800ms, ...
	backoff := float64(r.config.InitialBackoff) *
		math.Pow(r.config.BackoffMultiplier, float64(attempt))
	// Jitter: ±20% randomization
	if r.config.Jitter > 0 {
		jitterRange := backoff * r.config.Jitter
		backoff = backoff + (rand.Float64()*2-1)*jitterRange
	}
	// Cap at max
	if backoff > float64(r.config.MaxBackoff) {
		backoff = float64(r.config.MaxBackoff)
	}
	return time.Duration(backoff)
}
The jitter is crucial. Without it, all clients that failed at the same time would retry at the same time, creating a thundering herd that overwhelms the recovering node. Random jitter spreads retries across time, giving the node breathing room.
Composing Circuit Breaker and Retry
The RetryableOperation struct composes both patterns into a single call:
func (ro *RetryableOperation) Execute(ctx context.Context,
	operation func() error) error {
	// Check circuit breaker first - fail fast if open
	if ro.circuitBreaker != nil && !ro.circuitBreaker.Allow() {
		return ErrCircuitOpen
	}
	// Execute with retry
	err := ro.retryer.Do(ctx, operation)
	// Update circuit breaker based on result
	if ro.circuitBreaker != nil {
		if err != nil {
			ro.circuitBreaker.RecordFailure()
		} else {
			ro.circuitBreaker.RecordSuccess()
		}
	}
	return err
}
The order matters: circuit breaker check before retry. If the circuit is open, we don't even attempt the operation. This prevents retries from piling up against a known-dead node.
Admission Control
Background tasks - gossip, anti-entropy, hinted handoff, tombstone compaction - consume CPU, memory, and network bandwidth. Under light load, this is fine. But during peak traffic, these background tasks compete with user-facing reads and writes for the same resources, potentially inflating tail latency.
The admission controller monitors foreground request latency and throttles background work when latency rises above a threshold:
type AdmissionController struct {
	latencyThreshold time.Duration   // default: 100ms (p99)
	maxSlots         int             // default: 10
	minSlots         int             // default: 1
	windowSize       int             // default: 1000 samples
	backgroundSlots  int             // current allowed background tasks
	latencies        []time.Duration // rolling window
	latencyIndex     int
	latencyCount     int
	mu               sync.RWMutex
}
Every foreground request records its latency:
func (ac *AdmissionController) RecordLatency(latency time.Duration) {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	ac.latencies[ac.latencyIndex] = latency
	ac.latencyIndex = (ac.latencyIndex + 1) % ac.windowSize
	if ac.latencyCount < ac.windowSize {
		ac.latencyCount++
	}
}
When a background task wants to run, it checks AllowBackground:
func (ac *AdmissionController) AllowBackground() bool {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	if ac.latencyCount < 10 {
		return true // not enough data yet
	}
	p99 := ac.calculateP99Locked()
	if p99 > ac.latencyThreshold {
		// Foreground under stress - reduce background slots
		ac.backgroundSlots = max(ac.minSlots, ac.backgroundSlots-1)
	} else {
		// Foreground healthy - allow more background work
		ac.backgroundSlots = min(ac.maxSlots, ac.backgroundSlots+1)
	}
	return ac.backgroundSlots > 0
}
The mechanism is adaptive. Background slots gradually decrease when latency is high and gradually increase when latency is healthy. The minSlots floor (default: 1) ensures that background work never stops entirely - you always need at least some gossip and anti-entropy running to maintain cluster health.
The p99 calculation sorts the rolling window samples and picks the 99th percentile - a stand-in for the 99.9th-percentile SLA metric the Dynamo paper emphasizes:
func (ac *AdmissionController) calculateP99Locked() time.Duration {
	samples := make([]time.Duration, ac.latencyCount)
	copy(samples, ac.latencies[:ac.latencyCount])
	sortDurations(samples)
	p99Index := int(float64(len(samples)-1) * 0.99)
	return samples[p99Index]
}
Every background loop checks admission control before proceeding:
func (n *Node) gossipLoop() {
	// ...
	for {
		select {
		case <-ticker.C:
			if n.admissionController != nil &&
				!n.admissionController.AllowBackground() {
				continue // skip this round
			}
			n.membership.Gossip()
		case <-n.stopCh:
			return
		}
	}
}
Coordinator Selection
By default, the first node in the preference list coordinates the write. But that node might be slower than others due to load, garbage collection, or network conditions. The coordinator selector tracks per-node latencies and reorders the preference list so the fastest node coordinates the write:
type CoordinatorSelector struct {
	trackers   map[string]*LatencyTracker
	windowSize int
	mu         sync.RWMutex
}

type LatencyTracker struct {
	samples    []time.Duration
	index      int
	count      int
	windowSize int // default: 100
	mu         sync.Mutex
}
ReorderByLatency sorts the preference list by measured p99, with nodes that have latency data sorted ahead of unknowns:
func (cs *CoordinatorSelector) ReorderByLatency(
	preferenceList []string) []string {
	// ... build entries with p99 from trackers ...
	sort.SliceStable(entries, func(i, j int) bool {
		// Known-fast nodes first
		if entries[i].hasData != entries[j].hasData {
			return entries[i].hasData
		}
		// Among known nodes, sort by latency
		if entries[i].hasData && entries[j].hasData {
			return entries[i].latency < entries[j].latency
		}
		// Unknown nodes keep original order
		return entries[i].originalRank < entries[j].originalRank
	})
	// ...
}
This is a performance optimization targeted at the 99.9th percentile - exactly the metric the paper says Amazon optimizes for. By routing writes away from slow nodes, tail latency drops significantly.
Metrics
You can't manage what you can't measure. The metrics collector provides Prometheus-compatible observability across every subsystem:
type Collector struct {
	nodeID string

	// Request metrics
	RequestDuration *Histogram
	RequestsTotal   *Counter

	// Quorum metrics
	QuorumTotal            *Counter
	DivergentVersionsTotal *Counter

	// Hinted handoff metrics
	HintsPending     *Gauge
	HintsDelivered   *Counter
	HintsStoredTotal *Counter

	// Gossip metrics
	GossipRoundsTotal  *Counter
	GossipDuration     *Histogram
	ClusterMemberCount *Gauge

	// Anti-entropy metrics
	AntiEntropySyncsTotal *Counter
	AntiEntropyKeysSynced *Counter
	AntiEntropyDuration   *Histogram

	// Read repair metrics
	ReadRepairsTotal  *Counter
	ReadRepairsFailed *Counter

	// Background throttling
	BackgroundThrottled *Gauge
}
The Render method produces Prometheus text exposition format, served at /metrics:
# HELP dynamo_request_duration_seconds Latency of client requests in seconds.
# TYPE dynamo_request_duration_seconds histogram
dynamo_request_duration_seconds_bucket{le="0.001"} 42
dynamo_request_duration_seconds_bucket{le="0.005"} 187
...
dynamo_request_duration_seconds_count 1523
dynamo_request_duration_seconds_sum 12.847
# HELP dynamo_hints_pending Number of hints pending delivery per target node.
# TYPE dynamo_hints_pending gauge
dynamo_hints_pending 3
Three metric types - counters (monotonically increasing), gauges (current value), and histograms (distribution with configurable buckets) - cover the essentials. The coordinator instruments every operation:
func (c *Coordinator) Put(ctx context.Context, ...) error {
	start := time.Now()
	// ... quorum write ...
	duration := time.Since(start)
	if c.node.metrics != nil {
		c.node.metrics.RequestDuration.Observe(duration.Seconds())
		c.node.metrics.RequestsTotal.Inc()
	}
	if c.node.admissionController != nil {
		c.node.admissionController.RecordLatency(duration)
	}
}
These metrics feed dashboards, alerts, and capacity planning. In a real deployment, you'd scrape /metrics with Prometheus and visualize with Grafana.
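The two simplest metric types are tiny when built on `sync/atomic`. A minimal sketch of what `Counter` and `Gauge` could look like (method names beyond `Inc` are assumptions; the real collector also carries names and help text for rendering):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is monotonically increasing; safe for concurrent use.
type Counter struct{ value atomic.Uint64 }

func (c *Counter) Inc()          { c.value.Add(1) }
func (c *Counter) Value() uint64 { return c.value.Load() }

// Gauge holds a current value that can move both ways.
type Gauge struct{ value atomic.Int64 }

func (g *Gauge) Set(v int64)  { g.value.Store(v) }
func (g *Gauge) Value() int64 { return g.value.Load() }

func main() {
	var requests Counter
	for i := 0; i < 3; i++ {
		requests.Inc()
	}
	var pending Gauge
	pending.Set(7)
	// Prometheus text exposition for these two types is just "name value":
	fmt.Printf("dynamo_requests_total %d\n", requests.Value())
	fmt.Printf("dynamo_hints_pending %d\n", pending.Value())
}
```

Atomics keep the hot path lock-free; histograms are the only type that needs more machinery (a bucket array plus a sum and count).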
Storage Engines Beyond Memory
Parts 1–5 used MemoryStorage — a map with reconciliation. It's perfect for testing but obviously won't survive a restart. The implementation includes three additional storage backends.
Log-Structured Storage (LSS)
The LSS engine is a custom log-structured storage system built from scratch:
type LSSEngine struct {
	config    *Config
	index     *Index     // in-memory key -> segment+offset
	segments  []*Segment // immutable data segments
	activeWAL *Segment   // current write-ahead log
	nextSegID uint64
	compactor *Compactor
}
Writes go to the active WAL (Write-Ahead Log) — an append-only segment. When the WAL reaches its size limit, it's sealed as an immutable segment and a new WAL starts. Reads consult the in-memory index, which maps each key to the segment and offset where its latest value lives.
The Compactor runs in the background, merging smaller segments into larger ones to reclaim space from overwritten or deleted keys. The Recovery module rebuilds the index on startup by replaying all segments.
This is a simplified version of the storage architecture used by LevelDB, RocksDB, and Cassandra. It's optimized for write-heavy workloads - writes are always sequential (append-only), and reads are usually a single index lookup plus a seek.
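The read/write path is easier to see in miniature. This toy version keeps the "log" in memory rather than on disk, so it illustrates the shape of the technique (sequential appends, index pointing at the latest offset, compaction dropping garbage) rather than the real engine:

```go
package main

import "fmt"

type entry struct {
	key   string
	value []byte
}

// LogStore: an in-memory stand-in for WAL + sealed segments.
type LogStore struct {
	log   []entry        // append-only
	index map[string]int // key -> offset of latest value in the log
}

func NewLogStore() *LogStore {
	return &LogStore{index: make(map[string]int)}
}

// Put appends (sequential write) and points the index at the new offset.
func (s *LogStore) Put(key string, value []byte) {
	s.log = append(s.log, entry{key, value})
	s.index[key] = len(s.log) - 1
}

// Get is one index lookup plus one log access.
func (s *LogStore) Get(key string) ([]byte, bool) {
	off, ok := s.index[key]
	if !ok {
		return nil, false
	}
	return s.log[off].value, true
}

// Compact rewrites the log keeping only the live entry for each key,
// reclaiming space left by overwritten values.
func (s *LogStore) Compact() {
	var compacted []entry
	newIndex := make(map[string]int)
	for key, off := range s.index {
		compacted = append(compacted, s.log[off])
		newIndex[key] = len(compacted) - 1
	}
	s.log, s.index = compacted, newIndex
}

func main() {
	s := NewLogStore()
	s.Put("k", []byte("v1"))
	s.Put("k", []byte("v2")) // overwrite appends; the old value becomes garbage
	before := len(s.log)
	s.Compact()
	v, _ := s.Get("k")
	fmt.Println(before, len(s.log), string(v)) // 2 1 v2
}
```

The real engine adds durability (fsync'd WAL), sealed on-disk segments, and recovery-by-replay, but the invariant is the same: the index always points at the newest copy of each key.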
BoltDB and BadgerDB
For users who want mature, battle-tested storage without building their own, the implementation provides adapters for BoltDB and BadgerDB:
func NewStorage(engineType string, path string, nodeID string) (Storage, error) {
	switch engineType {
	case "memory":
		return NewMemoryStorage(), nil
	case "lss":
		return lss.NewLSSEngine(lss.DefaultConfig(path))
	case "boltdb":
		return NewBoltDBStorage(path + "/" + nodeID + ".db")
	case "badger":
		return NewBadgerStorage(path + "/" + nodeID)
	default:
		return NewMemoryStorage(), nil
	}
}
All four backends implement the same Storage interface. The rest of the system - the coordinator, replicator, anti-entropy, hinted handoff - doesn't know or care which engine is underneath. You can swap engines by changing a single config field.
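The post doesn't reproduce the interface itself, so here is a plausible shape (the method set is an assumption; `VersionedValue` is stubbed locally to keep the sketch self-contained), along with the compile-time conformance check each backend file can carry:

```go
package main

import "fmt"

// VersionedValue stands in for versioning.VersionedValue.
type VersionedValue struct {
	Data []byte
}

// Storage: a plausible shape for the shared interface all four engines
// implement (method set assumed for illustration).
type Storage interface {
	Get(key string) ([]VersionedValue, error)
	Put(key string, value VersionedValue) error
	Close() error
}

// MemoryStorage: the trivial backend, reduced to the essentials.
type MemoryStorage struct {
	data map[string][]VersionedValue
}

func NewMemoryStorage() *MemoryStorage {
	return &MemoryStorage{data: make(map[string][]VersionedValue)}
}

func (m *MemoryStorage) Get(key string) ([]VersionedValue, error) {
	return m.data[key], nil
}

func (m *MemoryStorage) Put(key string, value VersionedValue) error {
	m.data[key] = []VersionedValue{value}
	return nil
}

func (m *MemoryStorage) Close() error { return nil }

// Compile-time check: fails to build if MemoryStorage drifts from the interface.
var _ Storage = (*MemoryStorage)(nil)

func main() {
	var s Storage = NewMemoryStorage() // callers only ever see the interface
	s.Put("k", VersionedValue{Data: []byte("v")})
	vals, _ := s.Get("k")
	fmt.Println(string(vals[0].Data)) // v
}
```

The `var _ Storage = (*T)(nil)` idiom is the standard Go way to make engine/interface drift a build error instead of a runtime surprise.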
Testing Strategy
A distributed system needs testing at multiple levels. The implementation has three tiers.
Unit Tests
Per-component tests that verify each subsystem in isolation. The ring tests check load distribution, preference list correctness, and node removal behavior. Vector clock tests exercise all four comparison outcomes (Before, After, Equal, Concurrent) and merge semantics. The coordinator tests run a real single-node cluster and verify end-to-end read-modify-write cycles:
func TestCoordinatorSequentialWrites_VersionsIncrement(t *testing.T) {
	node, _ := startTestNode(t, "coord-seq-1")
	defer node.Stop()
	ctx := context.Background()
	var lastCtx *dynamo.Context
	for i := 1; i <= 5; i++ {
		node.Put(ctx, "counter", []byte(fmt.Sprintf("v%d", i)), lastCtx)
		result, _ := node.Get(ctx, "counter")
		lastCtx = result.Context
	}
	result, _ := node.Get(ctx, "counter")
	// Expect 1 version with vector clock counter == 5
}
Circuit breaker and retry tests verify state transitions and backoff calculation. Admission control tests inject latency samples and verify throttling behavior.
Integration Tests
Multi-node cluster tests that verify the system works end-to-end. Five test files cover the major scenarios:
cluster_test.go — Creates a 3-node cluster, writes data through one node, reads from another, verifies replication.
consistency_test.go — Writes with different quorum settings, verifies that R + W > N guarantees strong consistency while R + W ≤ N allows stale reads.
failure_detection_test.go — Starts a cluster, kills a node, verifies it's detected as dead, restarts it, verifies recovery.
hinted_handoff_test.go — Writes while a node is down, verifies hints are stored, brings the node back, verifies hints are delivered.
partition_test.go — Simulates network partitions between groups of nodes, verifies that writes succeed on the majority side and that partitioned nodes rejoin cleanly.
Performance Benchmarks
Go benchmarks measure the latency of individual operations:
func BenchmarkPut(b *testing.B) {
	node, _ := dynamo.NewNode("bench-node", "localhost:9000", config)
	node.Start()
	defer node.Stop()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		key := fmt.Sprintf("key%d", i)
		node.Put(ctx, key, []byte("value"), nil)
	}
}

func BenchmarkGet(b *testing.B) {
	// Pre-populate 1000 keys, then benchmark reads
}

func BenchmarkVectorClockCompare(b *testing.B) {
	// Benchmark the hot-path comparison operation
}

func BenchmarkMerkleTreeBuild(b *testing.B) {
	// Benchmark tree construction with varying dataset sizes
}
These benchmarks help identify regressions and guide optimization. The vector clock comparison and Merkle tree construction benchmarks are particularly important because they run on the hot path of every read and every anti-entropy cycle.
Some Features We Could Add
The implementation is a faithful rendition of the core Dynamo paper, but production systems go further. Here are some natural extensions:
gRPC migration. Replace HTTP/JSON with gRPC and protobuf for lower serialization overhead and HTTP/2 multiplexing. The NodeOperations interface makes this a drop-in replacement.
Compression. Compress values before replication to reduce network bandwidth. Snappy or LZ4 add minimal CPU overhead for substantial bandwidth savings.
Encryption. TLS for inter-node traffic, encryption at rest for storage engines. The paper assumes a trusted internal network, but real deployments often need encryption.
Cross-datacenter replication. The paper mentions multi-datacenter deployment but doesn't detail it. A production extension would add datacenter-aware preference lists and asynchronous cross-DC replication.
Dynamic quorum. Instead of fixed N/R/W, adjust quorum sizes based on observed availability. If one datacenter is unreachable, temporarily lower W to maintain write availability.
Read-your-writes consistency. Track which vector clock a client has seen and ensure subsequent reads from that client return at least that version, even if routed to a different node.
Wrapping Up
Over six posts, we've built a complete distributed key-value store from the ground up:
| Part | Component | What It Does |
|---|---|---|
| 1 | Node, Storage | Single-node key-value store with pluggable backends |
| 2 | Ring | Consistent hashing for data partitioning across nodes |
| 3 | VectorClock, VersionedValue | Causality tracking and conflict detection |
| 4 | Coordinator, HintedHandoff | Distributed reads/writes with quorum semantics |
| 5 | Membership, FailureDetector, AntiEntropy | Gossip discovery, phi accrual detection, Merkle tree sync |
| 6 | RPC, CircuitBreaker, AdmissionController, Metrics | Communication, resilience, performance, observability |
Each subsystem is relatively simple on its own. The consistent hash ring is a sorted slice with binary search. Vector clocks are maps of counters. The coordinator is a stateless fan-out with a timeout. Gossip is a random peer exchange. The Merkle tree is recursive hashing. The circuit breaker is a three-state machine.
What makes Dynamo powerful is how these pieces compose. A write fans out through the coordinator, which consults the ring for placement, increments a vector clock for versioning, uses the RPC layer with circuit breakers for communication, stores hints via hinted handoff for failed nodes, and gets replicated to surviving nodes. A failure is detected by the phi accrual detector, propagated by gossip, worked around by sloppy quorum, and repaired by anti-entropy with Merkle trees - all while the admission controller protects foreground latency and the metrics collector records everything for operators.
The full system is about 8,000 lines of Go. Not small, but not enormous either - and every line maps to a concept from the paper. If you've followed along, you now have a working understanding of how one of the most influential distributed systems papers translates into real, running code.
The full source code is available at github.com/tripab/toy-dynamo.