Local Whisper CUDA Silent CPU Fallback on Windows 11 Playtest VOD Batch - How to Fix

Problem: Your overnight playtest VOD batch uses local Whisper. ffmpeg extracts audio fine. python -c "import torch; print(torch.cuda.is_available())" prints True, but a 2-hour clip still takes all night—Task Manager shows Python on CPU, not your NVIDIA GPU.

Who is affected now: Indies adopting the May 2026 local Whisper playtest triage pipeline on Windows 11 laptops. The dominant failure is silent CPU fallback: CUDA looks available, but Whisper never binds the GPU.

Fastest safe fix: Run device probes in the same venv as whisper, reinstall the matching torch CUDA wheel, pass device="cuda" explicitly (or use faster-whisper with device="cuda"), log device_used on the first segment in playtest_vod_triage_receipt_v1.json, and fall back to base on CPU with an explicit receipt flag—not an accidental overnight stall.

Direct answer

torch.cuda.is_available() only proves that the interpreter you ran it in has a CUDA build of torch and can see a GPU. It does not prove that the Whisper process loaded its weights on that GPU. A wrong venv, CPU-only torch in the batch interpreter, hybrid graphics routing, an empty CUDA_VISIBLE_DEVICES, or a copied device="cpu" snippet all produce roughly 10× slower batches with no obvious error line. Fix the PyTorch + CUDA pairing first; then force device logging so receipts prove which path ran.
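A minimal sketch of a probe that records which interpreter answered, not just whether CUDA is visible. The function name `probe_environment` and the dictionary keys are illustrative assumptions; the torch calls are the same ones used in Step 1. Run it with the exact interpreter the batch uses.

```python
import json
import os
import sys


def probe_environment():
    """Collect the facts that decide whether Whisper can bind the GPU here."""
    facts = {
        "python_executable": sys.executable,  # which interpreter is answering
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),  # None = unset
        "torch_version": None,
        "torch_cuda_build": None,
        "cuda_available": False,
        "gpu_name": None,
    }
    try:
        import torch  # imported lazily so the probe still reports when torch is missing
    except (ImportError, OSError):
        return facts
    facts["torch_version"] = torch.__version__
    facts["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    facts["cuda_available"] = bool(torch.cuda.is_available())
    if facts["cuda_available"]:
        try:
            facts["gpu_name"] = torch.cuda.get_device_name(0)
        except (RuntimeError, AssertionError) as exc:
            facts["gpu_name"] = f"error: {exc}"
    return facts


if __name__ == "__main__":
    print(json.dumps(probe_environment(), indent=2))
```

If `python_executable` in this output differs from the path your scheduled task calls, the True you saw earlier came from a different install.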

Why this issue spikes in 2026

Overnight playtest VOD triage replaced “watch forty Discord clips” workflows.
Teams install CPU torch by default (pip install torch) then wonder why CUDA is “broken.”
Laptop hybrid graphics may assign python.exe to the iGPU profile unless the NVIDIA control panel prefers discrete. CUDA work itself can only run on the NVIDIA GPU, so treat this as a secondary cause.
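Blog snippets omit --device cuda on the CLI. Both the whisper CLI and whisper.load_model default to cuda only when torch.cuda.is_available() is true in that interpreter and otherwise pick cpu without raising, which surprises on mixed installs.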

Pair with 15 Free Local Whisper and ffmpeg Playtest VOD Triage Tools, OBS Replay Buffer zero-duration audio when transcripts are empty but video plays, FsCheck Editor hang help when the same machine also runs save fuzz overnight, and Ollama first-token hang for local LLM summarize passes on transcripts.

Symptoms and search phrases

torch.cuda.is_available() → True, GPU usage 0% during whisper run.
First 60 s of audio takes minutes to transcribe.
Worked on a teammate’s desktop; fails on Windows 11 laptop.
nvidia-smi shows no Python process while batch runs.
Upgraded GPU driver; batch got slower, not faster.
Multiple Python installs—where python ≠ venv used in batch script.

Root causes (check in order)

CPU-only PyTorch in the active venv (+cpu wheel).
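NVIDIA driver too old for the CUDA runtime bundled in the installed torch build (pip wheels ship their own CUDA runtime, so a separate CUDA toolkit install is usually not the issue).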
Wrong venv — system Python runs Whisper, venv has CUDA torch (or reverse).
CUDA_VISIBLE_DEVICES empty or set to invalid index.
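Hybrid graphics: Windows assigns python.exe to the integrated GPU profile (CUDA work itself only runs on the NVIDIA GPU, so check this after the causes above).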
Explicit device="cpu" or missing device in copied script.
Whisper subprocess uses different interpreter than your probe command.
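The ordered checks above can be sketched as a pure function over probe facts. The key names (`torch_version`, `expected_python`, `requested_device`, and so on) and the returned labels are hypothetical, and the CUDA_VISIBLE_DEVICES check is pulled ahead of the driver check because an empty value would otherwise look like a driver problem.

```python
def diagnose(facts):
    """Return the first root cause that matches, checked roughly in list order."""
    version = facts.get("torch_version")
    if not version:
        return "torch_not_importable"
    # A "+cpu" suffix or a missing CUDA build string means a CPU-only wheel.
    if "+cpu" in version or facts.get("torch_cuda_build") is None:
        return "cpu_only_torch"
    # Set-but-empty hides every GPU; None means the variable is unset.
    visible = facts.get("cuda_visible_devices")
    if visible is not None and visible.strip() == "":
        return "cuda_visible_devices_empty"
    if not facts.get("cuda_available"):
        return "driver_or_runtime_mismatch"
    # Windows paths compare case-insensitively.
    expected = facts.get("expected_python")
    actual = facts.get("python_executable")
    if expected and actual and expected.lower() != actual.lower():
        return "wrong_interpreter"
    if facts.get("requested_device") == "cpu":
        return "script_requests_cpu"
    # Nothing measurable left: the remaining causes need a manual look.
    return "check_hybrid_graphics_and_subprocess"
```

Writing the returned label into the batch log gives a later reader the reason for a slow night without rerunning the probes.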

Fastest safe fix path

Step 1 — Prove the venv and GPU name (same shell as batch)

cd C:\path\to\your\playtest-vod-project
.\.venv\Scripts\Activate.ps1
python -c "import torch; print('cuda_available', torch.cuda.is_available()); print('device', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')"
python -c "import torch; print('torch_version', torch.__version__); print('cuda', torch.version.cuda)"
where.exe python  # PowerShell aliases bare "where" to Where-Object, so call where.exe

Pass: GPU name prints (e.g. NVIDIA GeForce RTX 4060 Laptop GPU).
Fail: cuda_available True but device errors → driver/CUDA runtime broken.
Fail: +cpu in torch.__version__ (or torch.version.cuda prints None) → reinstall CUDA wheel (Step 2).

Step 2 — Reinstall matching PyTorch CUDA wheel

From pytorch.org pick Windows + Pip + CUDA matching your driver (example CUDA 12.x):

pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

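Re-run Step 1. Pin versions in requirements.txt, including the index URL so pip can resolve the +cu124 build:

--extra-index-url https://download.pytorch.org/whl/cu124
torch==2.5.1+cu124
openai-whisper==20240930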
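A pinned file only helps if nobody quietly unpins it. The following is a rough sketch of a lint for requirements.txt content, assuming the PyTorch index is declared in the file with pip's `--index-url` or `--extra-index-url` option; `check_requirements` and its messages are illustrative, not part of any tool.

```python
import re


def check_requirements(text):
    """Return a list of problems found in requirements.txt content."""
    # Drop comments and blank lines before inspecting entries.
    lines = [ln.split("#", 1)[0].strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]
    problems = []
    # Match "torch" itself, not torchvision or torchaudio.
    torch_lines = [ln for ln in lines if re.match(r"torch(?![\w-])", ln)]
    if not torch_lines:
        problems.append("torch is not listed")
    for ln in torch_lines:
        if "==" not in ln:
            problems.append(f"torch is not pinned: {ln}")
        elif "+cpu" in ln:
            problems.append(f"torch is pinned to a CPU build: {ln}")
    has_cuda_index = any(
        ln.startswith(("--index-url", "--extra-index-url"))
        and "download.pytorch.org/whl/cu" in ln
        for ln in lines
    )
    if not has_cuda_index:
        problems.append("no PyTorch CUDA index URL in the file")
    return problems
```

An empty list means the file pins torch to a non-CPU build and names a CUDA index; it does not prove the pinned build matches the driver on a given laptop.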

Step 3 — Force Whisper device + log first segment

OpenAI Whisper Python API:

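import torch
import whisper

# Pick the device once and log it; a silent "cpu" here is the failure this page covers.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("whisper_device", device)  # must appear in batch log
model = whisper.load_model("small", device=device)
print("model_device", model.device)  # where the weights actually live
# fp16 only helps on GPU; on CPU Whisper warns and uses FP32 instead.
result = model.transcribe("audio/clip.wav", fp16=(device == "cuda"))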

CLI smoke (60 s clip):

whisper audio/smoke.wav --model small --device cuda --output_dir transcripts/

Watch Task Manager → GPU during the run—CUDA or 3D utilization should spike.
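Task Manager is hard to capture in a log, so a scripted check can sit next to it. This sketch parses the output of `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` and reports whether any Python process is listed. The helper names are illustrative, and on some Windows driver modes the process list may be incomplete, so treat an empty result as a hint rather than proof.

```python
import csv
import io
import subprocess

QUERY = ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"]


def parse_compute_apps(csv_text):
    """Parse `pid, process_name` rows into (pid, name) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) >= 2 and row[0].strip().isdigit():
            rows.append((int(row[0]), row[1].strip()))
    return rows


def python_on_gpu(csv_text):
    """True when any listed compute process looks like a Python interpreter."""
    return any("python" in name.lower() for _, name in parse_compute_apps(csv_text))


def query_compute_apps():
    """Run nvidia-smi; return '' when it is missing, slow, or fails."""
    try:
        done = subprocess.run(QUERY, capture_output=True, text=True, timeout=10, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return done.stdout if done.returncode == 0 else ""
```

Call `python_on_gpu(query_compute_apps())` a few seconds into the smoke clip and write the boolean to the batch log.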

Step 4 — faster-whisper lane (often clearer device binding)

pip install faster-whisper

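from faster_whisper import WhisperModel

# device="cuda" is explicit: if CUDA cannot be used, loading is expected to raise rather than fall back quietly.
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio/clip.wav")
print("device_used", "cuda", "lang", info.language)
# segments is a generator; transcription only runs while this loop consumes it.
for s in segments:
    print(s.start, s.text)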

See faster-whisper on GitHub.
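Because faster-whisper returns segments lazily, the first segment is a natural place to time the run and record the device. A small sketch, with `first_segment_receipt` and its field names as illustrative assumptions; it works on any iterable, so it can be tried without a GPU.

```python
import time


def first_segment_receipt(segments, device_used, clock=time.perf_counter):
    """Pull the first segment from a lazy iterator and time how long it took."""
    iterator = iter(segments)
    started = clock()
    first = next(iterator, None)  # for a lazy generator, decoding starts here
    elapsed = clock() - started
    receipt = {
        "device_used": device_used,
        "first_segment_seconds": round(elapsed, 3),
        "first_segment_found": first is not None,
    }
    # Hand back the iterator so the caller can keep consuming the rest.
    return receipt, first, iterator
```

A first segment that takes minutes on a run logged as cuda is worth stopping for, long before the overnight queue is wasted.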

Step 5 — Windows hybrid graphics

Settings → System → Display → Graphics
Add python.exe from your venv (\.venv\Scripts\python.exe).
Set High performance (NVIDIA discrete).
Reboot once after driver updates.

Step 6 — Honest CPU fallback + receipt

If GPU still unavailable after Step 2:

Switch model small → base for overnight queue.
Set receipt flag—do not pretend CUDA ran.

{
  "schema": "playtest_vod_triage_receipt_v1",
  "batch_date": "2026-05-24",
  "whisper_model": "base",
  "device_used": "cpu",
  "device_fallback_reason": "cuda_wheel_reinstall_failed",
  "clips_processed": 5,
  "gates": { "T3_transcript": true, "T6_receipt": true }
}

Pass: First 60 s clip on GPU finishes in under 2 minutes with visible GPU use; receipt shows "device_used": "cuda".
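One way to keep the receipt honest is to refuse to build a CPU receipt without a reason. The sketch below mirrors the fields shown in the Step 6 example; the function `build_receipt` is hypothetical, and the `gates` block is left out because its semantics are defined elsewhere in the pipeline.

```python
import json
from datetime import date


def build_receipt(whisper_model, device_used, clips_processed,
                  fallback_reason=None, batch_date=None):
    """Build a playtest_vod_triage_receipt_v1 dict; CPU runs must say why."""
    if device_used not in ("cuda", "cpu"):
        raise ValueError(f"unexpected device_used: {device_used!r}")
    if device_used == "cpu" and not fallback_reason:
        raise ValueError("CPU runs must record device_fallback_reason")
    receipt = {
        "schema": "playtest_vod_triage_receipt_v1",
        "batch_date": batch_date or date.today().isoformat(),
        "whisper_model": whisper_model,
        "device_used": device_used,
        "clips_processed": clips_processed,
    }
    if fallback_reason:
        receipt["device_fallback_reason"] = fallback_reason
    return receipt


if __name__ == "__main__":
    example = build_receipt("base", "cpu", 5, fallback_reason="cuda_wheel_reinstall_failed")
    print(json.dumps(example, indent=2))
```

Raising on a missing reason turns an accidental CPU night into a visible failure at receipt time.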

Verification checklist

[ ] Step 1 GPU name prints in batch venv.
[ ] whisper_device cuda (or faster-whisper cuda) in log file.
[ ] Task Manager shows GPU activity during first segment.
[ ] playtest_vod_triage_receipt_v1.json includes device_used.
[ ] Full nightly batch completes inside your window (estimate: small takes roughly 0.3–0.5× the audio duration on a mid-range laptop GPU).
[ ] where.exe whisper and where.exe python point to the same venv in the scheduled task.
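The batch-window item is simple arithmetic, sketched here under the assumption that the 0.3–0.5× figure means processing time divided by audio duration. The function names and the 8-hour window are illustrative.

```python
def estimate_batch_hours(clip_hours, processing_ratio):
    """Estimated wall-clock hours: total audio hours times processing ratio."""
    if processing_ratio <= 0:
        raise ValueError("processing_ratio must be positive")
    return sum(clip_hours) * processing_ratio


def fits_window(clip_hours, processing_ratio, window_hours):
    """True when the estimated batch time fits inside the overnight window."""
    return estimate_batch_hours(clip_hours, processing_ratio) <= window_hours


if __name__ == "__main__":
    clips = [2, 2, 2, 2, 2]  # five 2-hour clips, 10 hours of audio
    for ratio in (0.3, 0.5):
        hours = estimate_batch_hours(clips, ratio)
        print(f"ratio {ratio}: about {hours:.1f} h, fits 8 h window: {fits_window(clips, ratio, 8)}")
```

For five 2-hour clips the pessimistic end is 10 h × 0.5 = 5 h, which fits an 8-hour night. Measure your own ratio on the smoke clip before trusting the estimate, since model load time and disk speed are not included.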

Prevention

One venv per machine for playtest triage—document activate command in playtest-vod/README.md.
Pin torch + CUDA index URL in requirements.txt; no unpinned pip install torch on QA laptops.
CI smoke: 30 s clip on release runner; fail build if device_used != cuda on GPU agents.
Log device_used on segment 1, not only at batch end.
Prefer faster-whisper when you need explicit compute_type and CTranslate2 speed.
Do not run the Whisper batch on the same machine as 1M FsCheck Editor tests; they compete for RAM.
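The CI smoke rule above could be enforced with a small gate over the receipt. This is a sketch only: `ci_gate` and the `expect_gpu` switch are hypothetical names, and how your runner marks a GPU agent is up to your CI setup.

```python
def ci_gate(receipt, expect_gpu):
    """Return (exit_code, message) for a smoke-run receipt."""
    device = receipt.get("device_used")
    if device is None:
        return 1, "receipt is missing device_used"
    if expect_gpu and device != "cuda":
        reason = receipt.get("device_fallback_reason", "no reason recorded")
        return 1, f"expected cuda, got {device} ({reason})"
    return 0, f"device_used={device}"


# Possible wiring in a CI step (file name is an assumption):
#   import json, sys
#   with open("playtest_vod_triage_receipt_v1.json", encoding="utf-8") as handle:
#       code, message = ci_gate(json.load(handle), expect_gpu=True)
#   print(message)
#   sys.exit(code)
```

On CPU-only agents, pass `expect_gpu=False` so an honest CPU receipt still passes while a missing `device_used` still fails.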

Troubleshooting

Symptom	Fix
`CUDA out of memory`	Use `small` not `large`; close browser; `compute_type="int8_float16"` in faster-whisper
`cudnn` errors after driver update	Reinstall torch wheel matching new driver; reboot
GPU spikes then 0%	Normal between segments; watch sustained load on long files
`True` but `get_device_name` fails	Driver install corrupt: run DDU (Display Driver Uninstaller), then a clean NVIDIA install
Batch uses wrong Python	Scheduled Task must call `\.venv\Scripts\python.exe` full path
AMD GPU laptop	CUDA path N/A—use CPU `base` or ROCm experimental; log `device_used: cpu`
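For the wrong-Python row, a fail-fast guard at the top of the batch script catches the problem before any audio is touched. A minimal sketch, assuming the expected venv directory is known to the script; `interpreter_in_venv`, `require_venv`, and the example path are illustrative.

```python
import sys
from pathlib import PureWindowsPath


def interpreter_in_venv(executable, venv_dir):
    """True when `executable` lives under `venv_dir`, using Windows path rules."""
    exe = PureWindowsPath(executable)
    venv = PureWindowsPath(venv_dir)
    # PureWindowsPath compares case-insensitively, like the Windows filesystem.
    return venv in exe.parents


def require_venv(venv_dir, executable=None):
    """Exit with a clear message when the batch runs under the wrong Python."""
    executable = executable or sys.executable
    if not interpreter_in_venv(executable, venv_dir):
        raise SystemExit(f"wrong interpreter: {executable} is not inside {venv_dir}")


# Example call at the top of the batch script (path is a placeholder):
#   require_venv(r"C:\path\to\your\playtest-vod-project\.venv")
```

The guard only checks the path; pair it with the device log so a correct interpreter with a CPU-only wheel is still caught.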

FAQ

Is cloud Whisper easier?
Not necessarily. Upload caps and 413 errors favor local batching. If legal approves cloud, chunk first per Whisper API 413 fix. Keep local as the default when playtest NDAs restrict uploads.

Does WSL2 share the GPU?
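Yes on recent setups: WSL2 exposes the NVIDIA GPU through the Windows host driver. A native Windows venv is still simpler for fest teams. If using WSL, install the CUDA build of torch inside the WSL environment as well; the Windows-side venv does not carry over, and the GPU driver itself stays on the Windows host.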

Apple Silicon?
Different lane (MPS/Metal). This help is Windows 11 + NVIDIA CUDA.

Can I summarize transcripts with Ollama?
Yes—text only, HTTP API; never block on CLI stdout per Ollama help above.