ElevenLabs Conversational AI NPC with Local Ollama Fallback Architecture - 2026
Your trailer shows a voice-reactive NPC. In playtest, ElevenLabs returns 502 during the second line, the dialogue box shows cloud text, and the speaker goes silent. You added Ollama as “fallback” but Unity freezes on the first token because the integration still shells ollama run with blocked stdout.
That is not an AI feature—it is two half-wired paths pretending to be resilience.
Mid-2026 moved thousands of indie teams onto Conversational AI voice tiers after price drops, while the same quarter brought 502 spikes on shared edges and partner reviewers who disable network mid-dialogue. This AI Integration / Workflow post documents a dual-stack architecture: ElevenLabs for primary voice + cloud dialogue when healthy, Ollama HTTP for local text when cloud LLM or voice fails, Piper (or cached clips) when voice must stay local, and voice_fallback_receipt_v1.json so cert lanes see proof—not promises.
Non-repetition note: Help pages fix 502 retry and Ollama Process hang in isolation. The beginner LLM fallback net covers text-only paths. This URL owns ElevenLabs + Ollama combined architecture for voice-first NPCs—not Steam metadata, not Construct saves.
Why this matters now (May 2026)
- 502 evening peaks — ElevenLabs Unity SDK help documents the spike; architecture prevents silence after retries exhaust.
- Ollama adoption without HTTP discipline — Process redirect hang is the dominant Windows 11 failure when teams copy CLI tutorials.
- Partner cert airplane mode — Reviewers expect deterministic lines when cloud fails; voice can degrade but gameplay must continue.
- Steam AI disclosure — Store disclosure checklist needs honest “online voice + local text fallback” wording.
- Cost predictability — Conversational AI billed per minute; Ollama is “free at runtime” but not free in engineering—architecture separates budgets.
Direct answer: Implement a Dialogue Director state machine with three lanes: Primary (ElevenLabs voice + cloud LLM if used), Text fallback (Ollama /api/chat streaming), Voice degrade (Piper or pre-baked WAV). Log lane switches in voice_fallback_receipt_v1.json during QA week.
Who this is for
- Unity 6 or Godot 4.5 teams shipping voice-forward fest demos
- Engineers who fixed 502 retries but still have silent NPCs on spike
- Producers preparing Q3 partner packets beside AI disclosure sprints
- Beginners who completed the hard fallback net tutorial and add ElevenLabs voice layer
Time to wire MVP: ~2–3 evenings after a working dialogue tree exists.
Architecture overview
Player selects dialogue option
│
v
+------------------+
| Dialogue Director |
+--------+---------+
│
health + budget check
│
+----+----+----+
v v v |
Primary Text Voice
(EL+cloud) (Ollama) (Piper/cache)
│ │ │
+----+----+
v
Same UI + subtitle pipeline
v
Log lane + latency in receipt
Lane definitions
| Lane | When | Output |
|---|---|---|
| Primary | ElevenLabs healthy, circuit closed, token budget OK | Streamed voice + optional cloud LLM line |
| Text fallback | Voice 502 or LLM 429/timeout | Ollama HTTP line + subtitle; Piper or silent VO |
| Voice degrade | Text OK but voice unhealthy | Cached WAV / Piper on Ollama text |
| Hard canned | Ollama down | Authored tree only—never silent |
Rule: Hard canned is mandatory; Ollama is optional enhancement; ElevenLabs is never the only path.
Beginner path (first evening)
- Keep authored dialogue tree as source of truth for story beats.
- Add
DialogueDirector.cs(or Godot autoload) with enumLane { Primary, TextFallback, VoiceDegrade, Canned }. - Health-check
http://127.0.0.1:11434/api/tagsbefore first NPC prompt. - Wrap ElevenLabs calls per 502 help—three retries then open circuit.
- On circuit open → call Ollama HTTP → Piper on resulting text → log lane in receipt JSON.
Success check: Airplane mode mid-conversation produces spoken or subtitled line within 3 seconds—no infinite spinner.
Developer path (production)
- Separate secrets — ElevenLabs API key in env/ScriptableObject; never commit.
- Token budget per scene—see OpenAI 429 help patterns for cloud text if paired.
- Warm voice cache — hash
(voice_id, text)→ WAV on disk for fest builds. - Circuit breaker — 3 consecutive 5xx in 30s → Primary disabled 60s.
- Evidence folder —
release-evidence/ai/voice_fallback_receipt_v1.jsonper milestone build.
ElevenLabs Conversational AI (primary lane)
Responsibilities
- Voice synthesis for NPC lines (Conversational AI agent or TTS endpoint per your SDK version).
- Optional agent session if you use their full conversational stack—document which subset you ship (voice-only vs full agent).
- Latency budget: target <2.5s to first audio byte on LAN; log p95 in receipt.
Retry and circuit (summary)
| Event | Action |
|---|---|
| 502/503/504 | Exponential backoff ×3 per help article |
| 401/403 | No retry—fix key, fall through |
| Circuit open | Skip Primary 60s; route Text fallback |
| Success after open | Half-open: one trial line |
Do not implement unbounded retry loops—fest traffic makes spikes worse for everyone.
Cache warming table
| Line type | Warm? |
|---|---|
| Tutorial barks | Yes—ship in build |
| Branching shop | Top 20 lines |
| Procedural LLM | No—expect fallback lane |
Store under StreamingAssets/voice_cache/{hash}.wav with manifest voice_cache_manifest.json listing text_hash, voice_id, semver.
Ollama HTTP (text fallback lane)
Why HTTP only
The Ollama Process hang fix is non-negotiable on Windows 11 Unity paths. Production uses:
POST http://127.0.0.1:11434/api/chat
{ "model": "llama3.2:3b", "stream": true, "messages": [...] }
Parse NDJSON chunks; append to UI subtitle buffer; trigger Piper when line complete.
Prompt contract (keep NPC safe)
| Field | Purpose |
|---|---|
system |
Persona + content policy + max sentences |
context |
Quest state from authored variables only |
player_choice |
Selected option text |
max_tokens |
Hard cap (e.g. 120) |
Never pass raw player chat into Ollama on fest demos—selected options only.
Health check before scene load
curl -s http://127.0.0.1:11434/api/tags
Unity: UnityWebRequest.Get("http://127.0.0.1:11434/api/tags") with 2s timeout. Failure → skip Ollama lane, go Hard canned immediately.
Model choice (2026 realistic)
| Hardware | Model hint |
|---|---|
| Dev laptop 16GB | llama3.2:3b |
| Player PC mid | Optional smaller quant in settings |
| Steam Deck | Canned only—do not assume Ollama |
Document in settings: Local dialogue: Off / Auto / Required.
Piper voice degrade (voice when ElevenLabs down)
When Ollama returns text but ElevenLabs circuit is open:
- Feed text to Piper offline TTS (or play gender-matched cached narrator WAV).
- Show subtitles always—accessibility + cert reviewers read text.
- Log
lane=VoiceDegradein receipt.
Honest limit: Piper does not clone ElevenLabs celebrity voices—disclose in AI storefront disclosure bullet.
Dialogue Director state machine (pseudo)
OnPlayerChoose(optionId):
if CircuitBreaker.Open: goto FallbackText
if TokenBudget.Exhausted: goto FallbackText
try Primary = ElevenLabs.SynthesizeAsync(line)
on success: PlayAudio(Primary); return
on retryable5xx after retries: CircuitBreaker.RecordFailure()
FallbackText:
if OllamaHealthy:
text = Ollama.StreamChat(prompt)
if VoiceDegradeEnabled: PlayPiper(text)
else: ShowSubtitleOnly(text)
else:
PlayCanned(optionId)
Godot: same logic in autoload with await and signals—avoid blocking main thread.
voice_fallback_receipt_v1.json
Store under release-evidence/ai/:
{
"receipt_type": "voice_fallback_receipt_v1",
"build_id": "fest-demo-20260522",
"tests": {
"primary_lane_smoke": "pass",
"airplane_mode_mid_dialogue": "pass",
"ollama_stopped_mid_dialogue": "pass",
"elevenlabs_502_simulated": "pass"
},
"fallback_pass": true,
"lanes_observed": ["Primary", "TextFallback", "VoiceDegrade", "Canned"],
"p95_primary_latency_ms": 2100,
"observed_date_utc": "2026-05-22"
}
QA scenarios (mandatory)
| # | Setup | Pass |
|---|---|---|
| 1 | Normal network | Primary plays |
| 2 | Airplane mode before line | Canned or Ollama+Piper <3s |
| 3 | Kill Ollama mid-scene | Canned |
| 4 | Mock 502 (proxy or test hook) | Degrade after retries |
| 5 | Token budget = 0 | No cloud spend; fallback |
Record in release-evidence/ai/qa-voice-fallback.md with build_id.
Unity vs Godot wiring notes
| Concern | Unity 6 | Godot 4.5 |
|---|---|---|
| HTTP | UnityWebRequest + coroutine | HTTPRequest node |
| Audio | AudioSource clip queue | AudioStreamPlayer |
| Threading | Main thread parse NDJSON | Signals to main |
| Secrets | ScriptableObject + env | .env not in export |
Both engines: one Dialogue Director—do not duplicate logic per NPC.
Cost and budget table (indicative)
| Path | Cost driver | Control |
|---|---|---|
| ElevenLabs | Characters / minutes | Scene cap + cache |
| Cloud LLM (if any) | Tokens | Per-session budget |
| Ollama | Dev time | Optional player setting |
| Piper | Disk | Ship voice model in build |
Publish max minutes per playtest hour in producer sheet—prevents surprise invoice week before fest.
Disclosure and store copy alignment
Store page must match runtime:
- Online: ElevenLabs voice generation (and cloud LLM if used).
- Offline: Local text model (Ollama) + offline TTS (Piper) + canned lines.
Mismatch triggers refund dashboard store-copy tags when trailer implies always-neural voice.
Link Steam Play AI disclosure checklist before upload.
Pairing with Sentis and other AI stacks
Teams mixing Unity Sentis for classification and ElevenLabs for voice should keep receipts separate:
sentis_deploy_receipt_v1.jsonfor ONNXvoice_fallback_receipt_v1.jsonfor NPC lanes
Do not conflate “AI feature” paragraphs in partner README—list each subsystem.
Resource roundup cross-links
- 15 free LLM dialogue fallback resources
- 50 free AI tools refresh — ElevenLabs Conversational AI row
- Live-ops prompt registry sprint for prompt semver beside voice
Common mistakes
- Ollama via Process on Windows—freeze.
- ElevenLabs only—silent fest demos on 502.
- LLM writes quest state—narrative breaks; context from authored vars only.
- No canned path—cert fail airplane mode.
- Unbounded retries—worse spikes + bill shock.
- Same voice ID across Piper and ElevenLabs without disclosure.
- Blocking main thread on HTTP—stutter even when fallback works.
- No receipt—partners cannot verify.
- Player free-text chat on demo—moderation risk.
- Skipping cache warm for tutorial lines—every line hits 502 surface.
Proof table
| Claim | Evidence | Pass |
|---|---|---|
| 502 handled | Simulated test log | Degrade <3s |
| Ollama HTTP | curl + in-game | NDJSON streams |
| Airplane mode | QA scenario 2 | Canned plays |
| Disclosure | Store bullet | Matches lanes |
| Cost cap | Budget sheet | Scene limits set |
Key takeaways
- 2026 voice NPCs need three lanes—Primary, Ollama text, Piper/cache degrade.
- Never shell
ollama runfrom Unity on Windows—use HTTP per help fix. - Circuit breaker after ElevenLabs 502—protect players and provider.
- Hard canned tree is cert-non-negotiable.
voice_fallback_receipt_v1.jsondocuments QA for partners.- Pairs with existing help fixes and beginner fallback net—this is combined architecture.
- AI Integration category after Construct trilogy—diversifies blog mix.
- 8 backlog pitches remain.
- Align Steam AI disclosure with actual lanes.
- Subtitles always-on for accessibility and degrade mode.
FAQ
Do I need both ElevenLabs and Ollama?
Primary voice wants ElevenLabs (or similar). Ollama covers text when cloud fails; you can ship canned-only without Ollama but not cloud-only without canned.
Can Ollama drive voice directly?
Not production-quality alone—pair with Piper or subtitles.
Godot on Steam Deck?
Assume Canned default; optional Ollama off.
Does Conversational AI agent replace dialogue tree?
No—tree owns beats; AI fills delivery within bounds.
How does this relate to Inworld/Convai?
Same director pattern—vendor SDKs still need canned + degrade lanes.
502 fix article enough?
Help fixes retries; this post wires full lane graph.
Anthropic/Gemini instead of Ollama?
Architecture identical—swap HTTP client; keep canned.
Latency SLA for fest?
Target p95 <3s to first feedback (audio or subtitle).
Legal on voice cloning?
Use licensed voice IDs; disclose in store AI section.
Conclusion
Voice-first NPCs are a system design problem in 2026—not a plugin purchase. Wire ElevenLabs behind a circuit breaker, Ollama behind HTTP, Piper behind degrade, and canned lines behind everything else. Log the receipt before you mark the demo fest-ready.
Next reads: 502 retry help, Ollama HTTP help, and beginner fallback net.
Ninety-minute architecture sketch
| Minute | Task |
|---|---|
| 0–15 | Draw Dialogue Director diagram on paper |
| 15–30 | Add canned + circuit enum stubs |
| 30–50 | Ollama health GET + one streamed line |
| 50–70 | ElevenLabs wrap with 3 retries |
| 70–90 | Write receipt JSON + airplane test |
Sketch night does not replace five QA scenarios—it starts the graph.
SEO and discovery note
Targets elevenlabs conversational ai unity npc and ollama fallback architecture 2026—distinct from help fix URLs and beginner tutorial intent.
Evidence folder layout
release-evidence/ai/
voice_fallback_receipt_v1.json
qa-voice-fallback.md
voice_cache_manifest.json
prompt_registry_semver.txt
Aligns with release evidence taxonomy and prompt registry sprint when live-ops edits ship mid-fest.
NDJSON streaming parse (Unity sketch)
Ollama returns one JSON object per line when "stream": true. Do not wait for full body on main thread.
IEnumerator StreamOllamaLine(string prompt, Action<string> onLineDone)
{
var body = JsonUtility.ToJson(new ChatRequest { model = "llama3.2:3b", stream = true, messages = BuildMessages(prompt) });
using var req = new UnityWebRequest("http://127.0.0.1:11434/api/chat", "POST");
req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
req.downloadHandler = new DownloadHandlerBuffer();
req.SetRequestHeader("Content-Type", "application/json");
yield return req.SendWebRequest();
if (req.result != UnityWebRequest.Result.Success) { onLineDone(null); yield break; }
foreach (var line in req.downloadHandler.text.Split('\n'))
{
if (string.IsNullOrWhiteSpace(line)) continue;
var chunk = JsonUtility.FromJson<ChatChunk>(line);
if (chunk.done) onLineDone(chunk.message.content);
}
}
Godot 4.5: use HTTPRequest with request_completed, split buffer by newline, emit dialogue_line_ready signal. Same contract—no Process.
Latency measurement (receipt input)
Log per line in qa-voice-fallback.md:
| Metric | How |
|---|---|
t_request |
Player confirms option |
t_first_token |
First NDJSON chunk or first audio byte |
t_playback |
AudioSource.Play |
Compute p95 over 20 scripted lines in Primary lane and 10 in Text fallback. If Text fallback p95 > 4s on target min-spec PC, shrink model or shorten max_tokens.
Moderation and content bounds (2026 policy reality)
| Risk | Mitigation |
|---|---|
| Model invents quest | Context from authored flags only |
| Unsafe text | Blocklist filter on output string |
| Player harassment | No free-text to cloud on demo |
| Voice deepfake confusion | Store disclosure + subtitle |
Run human sign-off on prompt registry per 14-day sprint before changing system prompts mid-fest.
Fest-week operations calendar
| Day | Action |
|---|---|
| Mon | Warm voice cache for tutorial NPC |
| Wed | Check ElevenLabs status page; lower scene token cap if spikes |
| Fri | Re-run airplane mode QA; update receipt |
| Upload eve | Freeze prompt_registry semver |
Pair with BUILD_RECEIPT notes: voice_fallback_pass=true.
Troubleshooting matrix
| Symptom | Check | Fix |
|---|---|---|
| Silent after 502 | Circuit stuck open | Half-open trial |
| Editor freeze | Process Ollama | HTTP only |
| Subtitle only | Piper path disabled | Enable VoiceDegrade |
| Wrong persona | Prompt registry drift | Freeze semver |
| High bill | Cache miss | Warm lines |
| Deck stutter | Ollama on Deck | Default Canned |
| Duplicate lines | Retry without idempotency | Hash request id |
Vendor-neutral comparison (architecture level)
| Capability | ElevenLabs Conversational AI | Ollama local | Canned tree |
|---|---|---|---|
| Voice quality | High | N/A (use Piper) | Fixed WAV |
| Latency | Network | GPU/CPU bound | Instant |
| Cost | Per use | Dev time | Authored time |
| Cert offline | Degrade | Yes with HTTP | Yes |
| Narrative control | Medium | Medium | Full |
Architecture stays valid if you swap ElevenLabs for another TTS—keep lane graph.
Worked scenario — 502 during shopkeeper haggle
- Player selects “Haggle” option
opt_haggle_3. - Director checks circuit—closed. Calls ElevenLabs.
- Provider returns 502 after retries. Circuit records failure #1.
- Director routes Text fallback—Ollama returns shorter line with price hint from
quest.haggle_count. - Voice degrade—Piper speaks line; subtitle shows same text.
- Receipt logs
lane=TextFallback+VoiceDegrade,latency_ms=2400. - Player hears voice—not silence. Refund risk avoided.
If Ollama also down, step 4 becomes Canned shopkeeper_haggle_fail_02.wav—still no silence.
Integration with playtest CSV
Add columns to 18 playtest tools sheet:
voice_lane(Primary / TextFallback / VoiceDegrade / Canned)elevenlabs_502_Y/Nollama_up_Y/N
Playtesters reproduce architecture bugs faster than “voice felt weird.”
Corporate laptop note
Ollama help mentions AppLocker. Architecture default: detect blocked daemon → Canned without hanging. Log ollama_blocked=true in receipt for partner laptops.
Stretch goals
- Ship in-game AI status icon (cloud / local / offline).
- Add Anthropic 529 queue help as alternate cloud text behind same Director.
- Automate receipt generation in validate-packet script grep.
Found this useful? Bookmark the two help fixes and implement the Director once—502 and Ollama hang stops being two separate firefights.
Security checklist (ship gate)
- [ ] API keys in CI secrets—not repo
- [ ] Ollama bound to
127.0.0.1only in shipping build - [ ] No player PII in prompts
- [ ] Log redaction for dialogue text in public builds
- [ ] Rate limit per NPC to prevent spam-click cost explosion
- [ ] Disable debug “force cloud” cheat in release binaries
Dual-cloud text optional layer
Some teams use ElevenLabs for voice and OpenAI/Anthropic for line generation on Primary lane. Architecture unchanged:
- Cloud text failure (429) → open text circuit → Ollama
- Cloud text success + voice 502 → Ollama not required if cached WAV exists for that line hash
Document which cloud APIs you actually call in disclosure—three vendors is still one Director.
Audio mixer routing
Route Primary and Piper into same NPC bus with shared reverb send so degrade mode does not sound like a different character from another room. Subtitle styling stays identical across lanes—players perceive one NPC with a bad connection, not two different systems.
Build flavors table
| Flavor | Primary | Ollama | Canned |
|---|---|---|---|
DEVELOPMENT |
On | On | On |
FEST_DEMO |
On | Auto | On |
STEAM_DECK |
On | Off | On |
REVIEWER_OFFLINE |
Off | Off | On only |
Use scripting define symbols—do not #if inside Dialogue Director without tests per flavor.
Partner README one-liner
Voice: ElevenLabs online with Piper/canned fallback. Dialogue text: cloud when healthy, Ollama HTTP locally, canned tree always. Evidence:
release-evidence/ai/voice_fallback_receipt_v1.json.
Paste into partner ZIP README beside hash manifests from cold-hash challenge.
Idempotency and duplicate line prevention
When retries fire after a timeout, the player must not hear two lines for one choice. Hash (npc_id, option_id, attempt_id) and discard duplicate audio starts within 500 ms. Log duplicate_suppressed=true in QA when testing flaky Wi-Fi—proves Director is production-safe, not demo-only.
Accessibility requirements (2026 baseline)
- Subtitles for every lane including Primary.
- Subtitle speed independent of audio length when Piper runs faster than ElevenLabs.
- Visual indicator when voice is synthetic vs recorded—icons help disclosure honesty.
- Reduce motion setting should not disable fallback subtitles.
Fest reviewers increasingly check accessibility alongside airplane mode—architecture treats both as first-class QA scenarios.
When to skip ElevenLabs entirely
| Project signal | Recommendation |
|---|---|
| No VO budget | Canned + subtitles only |
| Web-only tiny scope | Text + Piper; skip cloud voice |
| Strict offline SKU | No Primary lane in that SKU |
| Narration-heavy AA | Recorded WAV primary; AI for barks only |
Architecture doc still applies—lanes shrink but Director remains. Ship the graph once; swap vendors later without rewriting NPC scenes. That is the difference between a voice feature and a shippable NPC system.