Lesson 76: Waiver Renewal Decision Replay Checklist for Post-Promote Telemetry Slice Divergence in RPG Live-Ops

Lesson 75 locked promotion packets into audit exports. Exports preserve what reviewers believed; they do not prove live behavior still matches that belief an hour later.

This lesson adds a decision replay checklist: a bounded, repeatable comparison between packet fields (scorecard, playbook, corrective acceptance) and post-promote telemetry slices so teams catch silent divergence while mitigation is still cheap.

What this lesson solves

You need:

  1. A single checklist document operators run after promote, not an ad-hoc dashboard tour
  2. Explicit slice definitions (time window, cohort, metric set) so two replays are comparable
  3. A divergence log that links back to promotion_packet_id and, when applicable, export_batch_id from Lesson 75

Prerequisites: Lessons 74 (promotion packet row) and 75 (audit export log).
Expected time: 90-110 minutes including one dry-run replay on a past promote.

For governance vocabulary that overlaps release gates, keep wording aligned with "18 Free Release Gate Evidence Packet Templates for Indie Teams (2026 Q4)" so replay findings do not invent a second taxonomy.

What you will build

  1. A waiver_renewal_decision_replay_checklist_policy.md contract
  2. A waiver_renewal_decision_replay_log.csv append-only schema
  3. One reference telemetry slice profile (JSON or table) your team reuses across promotes

Step 1 - Define replay policy

Create waiver_renewal_decision_replay_checklist_policy.md and specify:

  • When replay is mandatory (for example: every promote, every hotfix that touches waiver lanes, or within N hours of deploy for regulated retention classes)
  • Maximum latency between deploy complete and replay complete
  • Who may sign a replay pass versus who must only observe
  • Escalation rule when any checklist row is divergent or unknown (default: hold new waiver relaxations until resolved)
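
The policy bullets above can also be expressed as a small machine-checkable gate. This is a minimal sketch, not a prescribed implementation: the `ReplayPolicy` class, the `replay_gate` function, the reason strings, and the lane name `liveops_lead` are all hypothetical names introduced for illustration. It assumes row verdicts arrive as plain strings and that both `divergent` and `unknown` block new waiver relaxations, as in the default escalation rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ReplayPolicy:
    # Maximum allowed gap between deploy complete and replay complete.
    max_latency: timedelta
    # Lanes allowed to sign a replay pass; everyone else only observes.
    signers: frozenset


def replay_gate(policy, deploy_completed_at, replay_completed_at, signer, row_verdicts):
    """Return (may_relax_waivers, reasons) for one replay execution."""
    reasons = []
    if replay_completed_at - deploy_completed_at > policy.max_latency:
        reasons.append("replay_late")
    if signer not in policy.signers:
        reasons.append("signer_not_authorized")
    # Default escalation: any divergent or unknown row holds new relaxations.
    if any(v in ("divergent", "unknown") for v in row_verdicts):
        reasons.append("unresolved_rows")
    return (not reasons, reasons)


if __name__ == "__main__":
    policy = ReplayPolicy(max_latency=timedelta(hours=2),
                          signers=frozenset({"liveops_lead"}))  # hypothetical lane
    t0 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(replay_gate(policy, t0, t0 + timedelta(hours=1),
                      "liveops_lead", ["aligned", "aligned"]))
```

Returning the list of reasons rather than a bare boolean keeps the escalation record explainable when the gate refuses a relaxation.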

Step 2 - Map packet fields to measurable signals

For each promotion packet column your team trusts, define one telemetry binding:

| packet field family | example live signals |
|---|---|
| closure scorecard | lane error rate, p95 latency, saturation index for that lane |
| playbook row completion | feature flag state, config version, job success ratio tied to that mitigation |
| corrective acceptance | test gate status, canary cohort health, debt burn metric |
| executive exception | cap counters, exposure meters, budget telemetry tied to the exception |

If a field has no honest signal, mark it non_replayable in policy and require a human attestation row instead of pretending dashboards cover it.
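
One way to keep the binding table honest is to store it as data and derive the attestation list from it. The sketch below assumes a plain dictionary keyed by packet field family; the signal identifiers are shorthand for the examples in the table, and `legal_signoff_note` is a hypothetical field invented here to show the `non_replayable` case.

```python
# Version string recorded in packet_to_telemetry_mapping_version (value is illustrative).
BINDINGS_VERSION = "v1"

# Packet field family -> telemetry signals. An empty list means no honest signal exists.
BINDINGS = {
    "closure_scorecard": ["lane_error_rate", "p95_latency", "saturation_index"],
    "playbook_row_completion": ["feature_flag_state", "config_version", "job_success_ratio"],
    "corrective_acceptance": ["test_gate_status", "canary_cohort_health", "debt_burn_metric"],
    "executive_exception": ["cap_counters", "exposure_meters", "budget_telemetry"],
    "legal_signoff_note": [],  # hypothetical field: non_replayable, needs attestation
}


def classify_fields(bindings):
    """Split fields into replayable ones and ones that need a human attestation row."""
    replayable = sorted(f for f, signals in bindings.items() if signals)
    non_replayable = sorted(f for f, signals in bindings.items() if not signals)
    return replayable, non_replayable


if __name__ == "__main__":
    replayable, non_replayable = classify_fields(BINDINGS)
    print("replayable:", replayable)
    print("non_replayable (attestation required):", non_replayable)
```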

Step 3 - Author waiver_renewal_decision_replay_log.csv

Append one row per replay execution. Suggested columns:

| column | purpose |
|---|---|
| replay_row_id | monotonic id |
| promotion_packet_id | Lesson 74 reference |
| export_batch_id | Lesson 75 pointer when export exists |
| deploy_marker | build id, git sha, or release tag |
| replay_slice_id | named profile (for example post_promote_t0_plus_2h_core_cohort) |
| slice_window_start_utc | inclusive |
| slice_window_end_utc | exclusive |
| replay_started_at_utc | UTC timestamp when the replay run began |
| replay_completed_at_utc | UTC timestamp when the replay run finished |
| replay_operator_ack | who ran it |
| packet_to_telemetry_mapping_version | version of your binding table |
| overall_replay_verdict | aligned, divergent, inconclusive |
| divergence_summary | short text when not aligned |
| followup_ticket_id | empty when aligned |
| replay_signoff_lane | owner lane for the verdict |

Treat the log as append-only. Corrections never edit an existing row; they add a new row that points at the superseded one through a correction_of_replay_row_id column, if your tooling supports it.
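
A minimal sketch of an append-only writer for this log follows, using only the Python standard library. The function name `append_replay_row` and the validation rules (reject unknown columns, require `divergence_summary` when the verdict is not `aligned`) are assumptions layered on the suggested schema, not requirements from the lesson.

```python
import csv
import os

COLUMNS = [
    "replay_row_id", "promotion_packet_id", "export_batch_id", "deploy_marker",
    "replay_slice_id", "slice_window_start_utc", "slice_window_end_utc",
    "replay_started_at_utc", "replay_completed_at_utc", "replay_operator_ack",
    "packet_to_telemetry_mapping_version", "overall_replay_verdict",
    "divergence_summary", "followup_ticket_id", "replay_signoff_lane",
    "correction_of_replay_row_id",  # optional pointer to a superseded row
]
VERDICTS = {"aligned", "divergent", "inconclusive"}


def append_replay_row(path, row):
    """Append one replay row; existing rows are never rewritten."""
    unknown = set(row) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    verdict = row.get("overall_replay_verdict")
    if verdict not in VERDICTS:
        raise ValueError(f"invalid verdict: {verdict!r}")
    if verdict != "aligned" and not row.get("divergence_summary"):
        raise ValueError("divergence_summary is required when not aligned")
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Mode "a" only ever adds to the end of the file.
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)  # columns missing from row are written as empty strings
```

A correction is then just another call with `correction_of_replay_row_id` set to the id of the row being superseded.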

Step 4 - Build one reusable slice profile

Document replay_slice_id profiles so operators do not improvise windows under pressure. Each profile should list:

  • cohort keys (region, platform, account tier, or percentage canary)
  • metric list with query anchors or dashboard deep links
  • expected stability assumptions (weekday vs weekend, event blackout)

Pro tip: Keep the first profile narrow. A two-hour window on your core paying cohort beats a twenty-four-hour global aggregate that hides regressions behind volume.
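
As an illustration, a slice profile can live as a small data structure with a helper that derives the window from the deploy timestamp. Everything beyond the `replay_slice_id` example is an assumption made for this sketch: the key names, the cohort values, the placeholder query anchor, and the choice to start the window at deploy complete.

```python
from datetime import datetime, timedelta, timezone

PROFILE = {
    "replay_slice_id": "post_promote_t0_plus_2h_core_cohort",
    "profile_version": 1,                      # hypothetical versioning field
    "window_hours": 2,
    "cohort": {"account_tier": "paying", "platform": "all"},  # illustrative keys
    "metrics": [
        {"name": "lane_error_rate", "query_anchor": "replace-with-your-query-link"},
    ],
    "stability_assumptions": ["weekday", "no_event_blackout"],
}


def slice_window(profile, deploy_completed_at):
    """Return (start_utc, end_utc); start is inclusive, end is exclusive."""
    if deploy_completed_at.tzinfo is None:
        raise ValueError("deploy_completed_at must be timezone-aware")
    start = deploy_completed_at.astimezone(timezone.utc)
    end = start + timedelta(hours=profile["window_hours"])
    return start, end


def in_window(ts, start, end):
    """Half-open membership test matching the log's inclusive/exclusive columns."""
    return start <= ts < end


if __name__ == "__main__":
    t0 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    start, end = slice_window(PROFILE, t0)
    print(start.isoformat(), end.isoformat())
```

Deriving the window from the profile, rather than typing timestamps by hand, is what keeps two replays with the same `replay_slice_id` comparable.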

Step 5 - Run a tabletop dry-run

Pick a historical promote with known outcome. Replay using only artifacts you would still have (packet row, export pointer, telemetry snapshots). Note every gap. Update policy and bindings before you rely on this for a live gate.

Common mistakes

  • Mistake: Replay becomes a generic health review. Fix: bind each step to a specific packet field; if it does not map, mark it non_replayable.
  • Mistake: Slices drift between replays. Fix: freeze replay_slice_id versions and bump packet_to_telemetry_mapping_version when queries change.
  • Mistake: Green replay while export is missing. Fix: Lesson 75 export completeness is a prerequisite row in your policy for regulated classes.
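
To make slice drift detectable rather than a matter of discipline, one option is to fingerprint the binding table and compare it with the fingerprint recorded when the current mapping version was frozen. This is a sketch under assumptions: the function names are hypothetical, and it presumes the bindings are JSON-serializable (for example, a dictionary of field names to query strings).

```python
import hashlib
import json


def mapping_fingerprint(bindings):
    """Stable SHA-256 hash of the binding table, independent of key order."""
    canonical = json.dumps(bindings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_version_bump(recorded_fingerprint, bindings):
    """True when queries changed since the recorded mapping version was frozen."""
    return mapping_fingerprint(bindings) != recorded_fingerprint


if __name__ == "__main__":
    frozen = {"closure_scorecard": "error_rate{lane='waiver'}"}   # illustrative query
    recorded = mapping_fingerprint(frozen)
    edited = {"closure_scorecard": "error_rate{lane='waiver',region='eu'}"}
    print(needs_version_bump(recorded, frozen))   # unchanged bindings
    print(needs_version_bump(recorded, edited))   # query was edited
```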

Mini challenge

  1. Take one live promotion_packet_id.
  2. List five packet fields and the exact telemetry query or dashboard tile that proves each.
  3. Identify one field that is only provable by human attestation and write the attestation wording.

FAQ

Is this redundant with canary analysis?

Canary analysis proves rollout safety for the binary. Replay proves decision documentation still matches observed live posture for waiver-specific claims.

How soon after promote should replay run?

Policy decision. Many teams run a first pass within two hours for fast feedback, then a second daily pass for slower-moving debt metrics.

What if telemetry is temporarily incomplete?

Log inconclusive with reason, open a follow-up ticket, and treat that as yellow for new waiver relaxations until signals recover.
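
The verdict roll-up can be sketched as a small function. The precedence used here (any divergent row wins, then any inconclusive row, and an empty replay counts as inconclusive) and the status labels `green` and `hold` are assumptions for illustration; only the `yellow` treatment of inconclusive replays and the hold on relaxations for divergent rows come from this lesson.

```python
def overall_verdict(row_results):
    """Roll per-field results up into overall_replay_verdict.

    row_results maps packet field -> 'aligned' | 'divergent' | 'inconclusive'.
    """
    values = set(row_results.values())
    if "divergent" in values:
        return "divergent"
    if "inconclusive" in values or not values:
        return "inconclusive"
    return "aligned"


def waiver_relaxation_status(verdict):
    """Map the overall verdict to the stance on new waiver relaxations."""
    return {"aligned": "green", "inconclusive": "yellow", "divergent": "hold"}[verdict]


if __name__ == "__main__":
    results = {"closure_scorecard": "aligned", "corrective_acceptance": "inconclusive"}
    verdict = overall_verdict(results)
    print(verdict, waiver_relaxation_status(verdict))
```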

Lesson recap

You now have a decision replay checklist pattern that compares waiver promotion packets to bounded telemetry slices, logs divergence early, and stays linked to export and packet identifiers for audit continuity.

Next lesson teaser

Continue to Lesson 77: Waiver Renewal Replay Divergence Triage Queue with SLA, Severity Rubric, and Re-Promote Gates in RPG Live-Ops, which routes divergent and inconclusive replay rows into lane owners with SLA, severity, and explicit re-promote gates.
