Lesson 34: Cross-Region Read Replica Lag and Failover Read Paths for the RPG Metrics Warehouse
Lesson 33 capped spend. This lesson answers the next crisis that only shows up when someone opens the same dashboard from another continent: the numbers look fine locally but wrong or stale abroad, or they vanish entirely during a regional outage.
You will not implement vendor-specific replication knobs here. You will add observable lag, reader routing, and a failover read contract that still respects the read-only principals from Lesson 32.

What you will build
By the end of this lesson, you will have:
- A
replication_freshness_secondsexposure on hot-path views or snapshot tables so every chart states how old the read is - A regional reader routing table that maps teams to endpoints without giving analysts the writer
- A one-page failover runbook for switching read traffic when a region degrades, including rollback and budget re-checks from Lesson 33
- A stale-read policy tied to Lesson 28 go/yellow/red language so stale dashboards cannot silently promote a ship decision
Step 1 - Name the three clocks
Keep three timestamps beside every executive-facing row:
| Clock | Meaning | Typical owner |
|---|---|---|
event_time |
when the build or incident happened in production | game services |
warehouse_ingest_time |
when facts landed in the warehouse | ingest from Lesson 31 |
replica_read_time |
when this replica last applied upstream changes | infra |
Replication lag is roughly replica_read_time - warehouse_ingest_time, surfaced as replication_freshness_seconds in views consumed by BI tools.
If you only show event_time, executives will argue about ghosts instead of replication.
Step 2 - Publish freshness beside the rolling window
Extend the rolling train views from Lesson 30 with:
max_replication_freshness_secondsfor the query scope- a traffic-light footer rule: green under your SLA (for example ninety seconds on hot path), yellow between SLA and ten minutes, red above ten minutes
Pro tip: during release week, tighten the SLA for vw_release_train_current_window only, not for every historical drill-down, so you do not pay Lesson 33 cold-path costs for hot-path paranoia.
Step 3 - Route readers without routing writers
Create analytics_reader_routing.md with three columns:
| Team region | Primary read endpoint | Failover read endpoint |
|---|---|---|
| US West analysts | reader-usw-prod |
reader-use-prod |
| EU Central analysts | reader-euc-prod |
reader-euw-prod |
| Automation jobs | scheduled reader pool | same-region secondary |
Rules:
- writers stay on the single promoted primary documented in Lesson 32
- humans and dashboards use readers only
- cross-region reads for convenience are allowed only when labeled with Step 2 freshness
Step 4 - Failover read path that preserves contracts
When a region browns out, your runbook should do four things in order:
- Confirm writer health — if the writer is fine, this is a read outage, not a game outage. Say that out loud in the war room.
- Flip BI connection strings per the routing table, not ad hoc copies of credentials.
- Re-run Lesson 33 budget probes on the failover region for twenty-four hours — cross-region egress can spike when everyone pivots.
- Post an incident note with start time, endpoints touched, and expected maximum staleness until replication catches up.
Common mistake: promoting a read replica to writer “because dashboards are broken.” That is a different disaster. Keep failover on the read plane unless game telemetry itself is blocked.
Step 5 - Drill lag with one synthetic marker
Add a lightweight ping_fact row written every five minutes by a trusted job:
- includes
ingest_batch_idstyle lineage from Lesson 31 - read from each regional dashboard and compare
now() - ping_time
If the marker disappears in one region only, you have a routing or partition problem, not a mysterious “AI RPG metrics feel off” vibe.
Step 6 - Align exec language with confidence lessons
Cross-link Lesson 26 analytics confidence so owners know stale reads downgrade confidence scores the same way missing telemetry does.
If replication lag exceeds red thresholds during a Lesson 28 briefing, the packet should move to yellow even when game servers are healthy.
Mini challenge
- Add
replication_freshness_secondsto one materialized snapshot from Lesson 33. - Fill one row in
analytics_reader_routing.mdfor your largest analyst team. - Run the ping drill in two regions and capture screenshots with timestamps.
FAQ
Do we need multi-region writes for game telemetry?
Usually no for this course track. Game events should still land in one primary region; replicas serve read scale and resilience for analytics, not split-brain writes.
How is this different from caching dashboards?
Caches hide lineage. Freshness columns and ping facts keep honest uncertainty in the same frame as the metric.
What if our warehouse vendor hides replica lag?
Approximate with snapshot table versions and ingestion watermark tables from Lesson 31, then label dashboards “at least as fresh as watermark X.”
Should players ever see replication lag?
No. This lesson is for internal live-ops and leadership surfaces, not HUD elements.
Lesson recap
You can now:
- expose replication lag without drowning dashboards in infra jargon
- route regional analysts to readers while keeping writers locked down
- execute a read-only failover path that triggers Lesson 33 cost review
- stop stale charts from smuggling false confidence into ship decisions
Next lesson teaser
Continue with Lesson 35: Warehouse Data Residency, Deletion SLAs, and PII Minimization for Cross-Border RPG Live-Ops for residency maps, deletion SLA matrices, pii_inventory.csv, and failover pre-flight rules that keep Lesson 34 read paths from violating policy. If filenames drifted while you added warehouse chapters instead, return to Lesson 21 syllabus reconciliation.
Related learning
- Lesson 33: Warehouse Query Budgets and Cost Guardrails
- Lesson 32: Read-Only Analytics Warehouse Contracts and Service Accounts
- Lesson 24: Degraded-Mode Playtest Script for AI Dialogue Reliability — same “label uncertainty instead of hiding it” habit for player-facing systems.
If this lesson prevented a midnight “promote the replica” panic, file the runbook beside your Lesson 21 launch control references so the next regional blip has a boring, rehearsed answer.