I cut my Whisper bill 75% by stripping silence first
Construction audio is mostly truck noise and silence. Cut it out before you transcribe and your bill collapses.
The discovery
I was running Whisper-large-v3 over Struvo field audio. Average clip was 8 minutes. Average API cost: about $0.04 per clip. Volume: ~200 clips/week. Math: $32/week, ~$1,700/year.
Then I noticed the transcripts were 90% timestamped silence. Construction guys don't talk continuously — they walk between rooms, drive between sites, hand-shake clients. The audio is mostly silence and truck noise.
So I added a silence-strip pre-pass.
What changed
Canonical script: ~/.claude/scripts/audio/transcribe.sh. Before any audio hits Whisper, it goes through sox with a silence filter and a low-pass denoiser:
sox input.m4a output.wav silence -l 1 0.1 0.5% -1 0.5 0.5%
That collapses silences over 0.5 seconds, keeps a 0.5-second pad on each side. The output drops to about 25% of the original duration.
The dollar math
Per clip:
- Before: 8 min audio, $0.04 per clip
- After: 2 min audio, $0.01 per clip
Bill cut: 75%. Annual saving: about $1,275.
But that wasn't even the biggest win.
The hallucination kill
Whisper hallucinates on long silences. If you feed it 6 minutes of truck idling, it'll occasionally invent dialogue. Real example from my logs:
"Yeah I think we should probably go with the larger unit, like the 4-ton one would handle this whole zone."
That sentence was never spoken. It was Whisper's autocomplete on engine noise.
Silence-strip kills this. With the silence cut, there's no long-noise window for the model to hallucinate over. Hallucination rate dropped from about 1 in 30 clips to roughly 1 in 200.
What you can steal
If you're transcribing real-world audio:
- Add
soxsilence detection before any API call. Two minutes of work. - Tune the threshold against a sample. 0.5% is right for construction audio. Conference calls might want 1%.
- Track API spend before and after. Make the savings legible — that's how the change earns its slot in the pipeline.
The pattern that matters: don't pay to transcribe silence. The model can't add value where there's no signal.