Lesson 34: Cross-Region Read Replica Lag and Failover Read Paths for the RPG Metrics Warehouse

Lesson 33 capped spend. This lesson answers the next crisis that only shows up when someone opens the same dashboard from another continent: the numbers look fine locally but wrong or stale abroad, or they vanish entirely during a regional outage.

You will not implement vendor-specific replication knobs here. You will add observable lag, reader routing, and a failover read contract that still respects the read-only principals from Lesson 32.

What you will build

By the end of this lesson, you will have:

  1. A replication_freshness_seconds column exposed on hot-path views or snapshot tables so every chart states how old its data is
  2. A regional reader routing table that maps teams to endpoints without giving analysts the writer
  3. A one-page failover runbook for switching read traffic when a region degrades, including rollback and budget re-checks from Lesson 33
  4. A stale-read policy tied to Lesson 28 go/yellow/red language so stale dashboards cannot silently promote a ship decision

Step 1 - Name the three clocks

Keep three timestamps beside every executive-facing row:

Clock                 | Meaning                                            | Typical owner
event_time            | when the build or incident happened in production  | game services
warehouse_ingest_time | when facts landed in the warehouse                 | ingest from Lesson 31
replica_read_time     | when this replica last applied upstream changes    | infra

Replication lag is roughly replica_read_time - warehouse_ingest_time, surfaced as replication_freshness_seconds in views consumed by BI tools.
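
As a minimal sketch of that subtraction, assuming both clocks arrive as timezone-aware datetimes: the function name mirrors the column name from this lesson, while the clamp to zero is an added assumption to absorb small clock skew between hosts, not something the lesson prescribes.

```python
from datetime import datetime, timezone


def replication_freshness_seconds(warehouse_ingest_time: datetime,
                                  replica_read_time: datetime) -> float:
    """Rough lag per this lesson: replica_read_time - warehouse_ingest_time."""
    for ts in (warehouse_ingest_time, replica_read_time):
        if ts.tzinfo is None:
            # Naive timestamps from different regions cannot be compared safely.
            raise ValueError("timestamps must be timezone-aware")
    lag = (replica_read_time - warehouse_ingest_time).total_seconds()
    # Assumption: clock skew can make the difference slightly negative; clamp to 0.
    return max(lag, 0.0)


ingest = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
applied = datetime(2024, 5, 1, 12, 1, 15, tzinfo=timezone.utc)
print(replication_freshness_seconds(ingest, applied))  # 75.0
```

In a real warehouse this would more likely live in the view definition itself; the Python form only makes the arithmetic and the skew handling explicit.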

If you only show event_time, executives will argue about ghosts instead of replication.

Step 2 - Publish freshness beside the rolling window

Extend the rolling train views from Lesson 30 with:

  • max_replication_freshness_seconds for the query scope
  • a traffic-light footer rule: green under your SLA (for example ninety seconds on hot path), yellow between SLA and ten minutes, red above ten minutes
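
The footer rule can be sketched as a small classifier. This is an illustration under stated assumptions: the ninety-second SLA and ten-minute ceiling are the example values above, the function name is hypothetical, and the boundary handling (exactly at the SLA counts as yellow, exactly ten minutes still counts as yellow) is one reading of "under" and "above".

```python
from typing import Iterable


def freshness_light(freshness_seconds: Iterable[float],
                    sla_seconds: float = 90.0,
                    red_after_seconds: float = 600.0) -> str:
    """Classify the worst (maximum) freshness in the query scope."""
    worst = max(freshness_seconds)  # max_replication_freshness_seconds for the scope
    if worst < 0:
        raise ValueError("freshness cannot be negative")
    if worst < sla_seconds:
        return "green"
    if worst <= red_after_seconds:
        return "yellow"
    return "red"


print(freshness_light([12.0, 40.5, 88.0]))   # green
print(freshness_light([12.0, 240.0]))        # yellow
print(freshness_light([12.0, 601.0]))        # red
```

Classifying on the maximum rather than the average keeps one badly lagging partition from hiding behind many fresh ones.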

Pro tip: during release week, tighten the SLA for vw_release_train_current_window only, not for every historical drill-down, so you do not pay Lesson 33 cold-path costs for hot-path paranoia.

Step 3 - Route readers without routing writers

Create analytics_reader_routing.md with three columns:

Team region         | Primary read endpoint  | Failover read endpoint
US West analysts    | reader-usw-prod        | reader-use-prod
EU Central analysts | reader-euc-prod        | reader-euw-prod
Automation jobs     | scheduled reader pool  | same-region secondary

Rules:

  • writers stay on the single promoted primary documented in Lesson 32
  • humans and dashboards use readers only
  • cross-region reads for convenience are allowed only when labeled with Step 2 freshness
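
The routing table and the readers-only rule can be sketched as a lookup that refuses to return anything that is not a reader. The endpoint names come from the table above; the `reader-` prefix check, the `ReadRoute` type, and the function name are assumptions for illustration. The automation row is left out because its entries are descriptions rather than endpoint names.

```python
from typing import Dict, NamedTuple


class ReadRoute(NamedTuple):
    primary: str
    failover: str


# Mirrors analytics_reader_routing.md; no writer endpoint is ever listed here.
ROUTES: Dict[str, ReadRoute] = {
    "US West analysts": ReadRoute("reader-usw-prod", "reader-use-prod"),
    "EU Central analysts": ReadRoute("reader-euc-prod", "reader-euw-prod"),
}


def resolve_read_endpoint(team: str, use_failover: bool = False) -> str:
    """Return the read endpoint for a team; raises KeyError for unlisted teams."""
    route = ROUTES[team]
    endpoint = route.failover if use_failover else route.primary
    # Assumption: reader endpoints follow a "reader-" naming convention.
    if not endpoint.startswith("reader-"):
        raise ValueError(f"{endpoint!r} is not a reader endpoint")
    return endpoint


print(resolve_read_endpoint("US West analysts"))                       # reader-usw-prod
print(resolve_read_endpoint("EU Central analysts", use_failover=True)) # reader-euw-prod
```

Failing loudly on an unlisted team is deliberate: a missing row should send someone to update the routing file, not to copy a connection string from a colleague.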

Step 4 - Failover read path that preserves contracts

When a region browns out, your runbook should do four things in order:

  1. Confirm writer health — if the writer is fine, this is a read outage, not a game outage. Say that out loud in the war room.
  2. Flip BI connection strings per the routing table, not ad hoc copies of credentials.
  3. Re-run Lesson 33 budget probes on the failover region for twenty-four hours — cross-region egress can spike when everyone pivots.
  4. Post an incident note with start time, endpoints touched, and expected maximum staleness until replication catches up.

Common mistake: promoting a read replica to writer “because dashboards are broken.” That is a different disaster. Keep failover on the read plane unless game telemetry itself is blocked.
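
The runbook order can be sketched as a function that refuses to plan a read failover when the writer is unhealthy and otherwise produces the incident note. Everything named here (`IncidentNote`, `plan_read_failover`, the field names) is hypothetical; only the ordering, the twenty-four hour budget window, and the note contents come from the steps above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Tuple


@dataclass(frozen=True)
class IncidentNote:
    start_time: datetime
    endpoints_touched: Tuple[str, str]
    expected_max_staleness_seconds: int
    budget_probe_until: datetime  # end of the Lesson 33 re-check window


def plan_read_failover(writer_healthy: bool, primary: str, failover: str,
                       started_at: datetime,
                       expected_max_staleness_seconds: int) -> IncidentNote:
    # Step 1: a sick writer is a telemetry incident, not a read failover.
    if not writer_healthy:
        raise RuntimeError("writer unhealthy: escalate, do not treat as a read outage")
    # Steps 2-4: endpoints come from the routing table; probes run for 24 hours.
    return IncidentNote(
        start_time=started_at,
        endpoints_touched=(primary, failover),
        expected_max_staleness_seconds=expected_max_staleness_seconds,
        budget_probe_until=started_at + timedelta(hours=24),
    )


start = datetime(2024, 5, 1, 3, 30, tzinfo=timezone.utc)
note = plan_read_failover(True, "reader-usw-prod", "reader-use-prod", start, 600)
print(note.budget_probe_until.isoformat())  # 2024-05-02T03:30:00+00:00
```

Note that the sketch has no code path that promotes a replica; that decision stays with humans and outside the read-plane runbook.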

Step 5 - Drill lag with one synthetic marker

Add a lightweight ping_fact row, written every five minutes by a trusted job:

  • the row carries ingest_batch_id-style lineage from Lesson 31
  • read the row from each regional dashboard and compare now() - ping_time across regions

If the marker disappears in one region only, you have a routing or partition problem, not a mysterious “AI RPG metrics feel off” vibe.
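
A minimal sketch of the drill comparison, assuming you have already fetched the latest ping_time seen through each regional reader (None when the marker is absent). The function names and the use of reader names as dictionary keys are illustrative choices, not part of the lesson.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def ping_lag_by_region(latest_ping: Dict[str, Optional[datetime]],
                       now: datetime) -> Dict[str, Optional[float]]:
    """now() - ping_time per region; None means the marker was not found."""
    return {
        region: None if ping is None else (now - ping).total_seconds()
        for region, ping in latest_ping.items()
    }


def regions_missing_marker(lags: Dict[str, Optional[float]]) -> List[str]:
    """Regions where the marker is absent: suspect routing or partitions."""
    return sorted(region for region, lag in lags.items() if lag is None)


now = datetime(2024, 5, 1, 12, 10, 0, tzinfo=timezone.utc)
seen = {
    "reader-usw-prod": datetime(2024, 5, 1, 12, 9, 30, tzinfo=timezone.utc),
    "reader-euc-prod": None,  # marker not visible through this reader
}
lags = ping_lag_by_region(seen, now)
print(lags["reader-usw-prod"])       # 30.0
print(regions_missing_marker(lags))  # ['reader-euc-prod']
```

Because the job writes every five minutes, a healthy region should never show a lag much above three hundred seconds plus replication delay; anything larger is worth a look even when the marker is present.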

Step 6 - Align exec language with confidence lessons

Cross-link Lesson 26 analytics confidence so owners know stale reads downgrade confidence scores the same way missing telemetry does.

If replication lag exceeds red thresholds during a Lesson 28 briefing, the packet should move to yellow even when game servers are healthy.
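
That downgrade rule can be sketched as follows. The status names follow the go/yellow/red language referenced for Lesson 28; the function name and the ten-minute default are assumptions carried over from the Step 2 example thresholds.

```python
def briefing_status(packet_status: str,
                    max_freshness_seconds: float,
                    red_after_seconds: float = 600.0) -> str:
    """Downgrade a 'go' packet to 'yellow' when replication lag is in the red band."""
    if packet_status not in ("go", "yellow", "red"):
        raise ValueError(f"unknown packet status: {packet_status!r}")
    if packet_status == "go" and max_freshness_seconds > red_after_seconds:
        # Stale reads lower confidence even when game servers are healthy.
        return "yellow"
    # Never upgrade: yellow and red packets keep their status.
    return packet_status


print(briefing_status("go", 900))   # yellow
print(briefing_status("go", 45))    # go
print(briefing_status("red", 45))   # red
```

The rule only ever moves a packet toward caution, so fresh replicas cannot rescue a briefing that is already yellow or red for other reasons.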

Mini challenge

  1. Add replication_freshness_seconds to one materialized snapshot from Lesson 33.
  2. Fill one row in analytics_reader_routing.md for your largest analyst team.
  3. Run the ping drill in two regions and capture screenshots with timestamps.

FAQ

Do we need multi-region writes for game telemetry?

Usually no for this course track. Game events should still land in one primary region; replicas serve read scale and resilience for analytics, not split-brain writes.

How is this different from caching dashboards?

Caches hide lineage. Freshness columns and ping facts keep honest uncertainty in the same frame as the metric.

What if our warehouse vendor hides replica lag?

Approximate with snapshot table versions and ingestion watermark tables from Lesson 31, then label dashboards “at least as fresh as watermark X.”

Should players ever see replication lag?

No. This lesson is for internal live-ops and leadership surfaces, not HUD elements.

Lesson recap

You can now:

  • expose replication lag without drowning dashboards in infra jargon
  • route regional analysts to readers while keeping writers locked down
  • execute a read-only failover path that triggers Lesson 33 cost review
  • stop stale charts from smuggling false confidence into ship decisions

Next lesson teaser

Continue with Lesson 35: Warehouse Data Residency, Deletion SLAs, and PII Minimization for Cross-Border RPG Live-Ops. It covers residency maps, deletion SLA matrices, pii_inventory.csv, and failover pre-flight rules that keep Lesson 34 read paths from violating policy. If filenames drifted while you added warehouse chapters, return to the Lesson 21 syllabus reconciliation first.

Related learning

If this lesson prevented a midnight “promote the replica” panic, file the runbook beside your Lesson 21 launch control references so the next regional blip has a boring, rehearsed answer.