2026-04-19

I cut my Whisper bill 75% by stripping silence first

Construction audio is mostly truck noise and silence. Cut it out before you transcribe and your bill collapses.

#whisper #audio #struvo #cost-optimization

The discovery

I was running Whisper-large-v3 over Struvo field audio. The average clip was 8 minutes and cost about $0.04 to transcribe. Volume: ~200 clips/week. Math: 200 x $0.04 is about $8/week, roughly $416/year.

Then I noticed the transcripts were 90% timestamped silence. Construction guys don't talk continuously: they walk between rooms, drive between sites, shake hands with clients. The audio is mostly silence and truck noise.

So I added a silence-strip pre-pass.

What changed

Canonical script: ~/.claude/scripts/audio/transcribe.sh. Before any audio hits Whisper, it goes through sox with a silence filter and a low-pass denoiser. The silence filter is this:

sox input.m4a output.wav silence -l 1 0.1 0.5% -1 0.5 0.5%

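That trims the leading silence and shortens every pause longer than 0.5 seconds down to 0.5 seconds. The -l flag is what keeps that half second of each pause, so words don't run together. The output drops to about 25% of the original duration.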
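For batch use, the same pre-pass can be wrapped in a few lines of Python. This is a minimal sketch, not the contents of transcribe.sh: the function names and keyword defaults are hypothetical, and only the sox arguments come from the command above. It also assumes the local sox build can decode the input format; if yours can't read .m4a, convert to WAV first.

```python
import subprocess
from pathlib import Path


def build_sox_command(src, dst, threshold="0.5%", min_sound=0.1, max_pause=0.5):
    """Return the sox argv for the silence-strip pass (hypothetical helper)."""
    return [
        "sox", str(src), str(dst),
        "silence", "-l",
        # Trim leading silence until min_sound seconds rise above the threshold.
        "1", str(min_sound), threshold,
        # Shorten every later pause that stays below the threshold past max_pause.
        "-1", str(max_pause), threshold,
    ]


def strip_silence(src, dst, **kwargs):
    """Run sox on one file; raises CalledProcessError if sox exits non-zero."""
    subprocess.run(build_sox_command(src, dst, **kwargs), check=True)
    return Path(dst)
```

Keeping the argv builder separate from the subprocess call makes it easy to log or unit-test the exact command without having sox installed.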

The dollar math

Per clip:

Before: 8 min audio, $0.04 per clip
After: 2 min audio, $0.01 per clip

Bill cut: 75%. At 200 clips/week that is $8/week down to $2/week, an annual saving of about $312.
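The per-clip figures can be turned into a small calculator so the before/after numbers are reproducible. This is a sketch under one assumption: cost scales linearly with audio duration, at the per-minute rate implied by $0.04 for an 8-minute clip. The function and constant names are illustrative.

```python
def weekly_cost(minutes_per_clip, clips_per_week, price_per_minute):
    """Weekly transcription spend, assuming cost is linear in audio minutes."""
    return minutes_per_clip * clips_per_week * price_per_minute


# Rate implied by the figures in this post: $0.04 for an 8-minute clip.
PRICE_PER_MINUTE = 0.04 / 8
CLIPS_PER_WEEK = 200

before = weekly_cost(8, CLIPS_PER_WEEK, PRICE_PER_MINUTE)
after = weekly_cost(2, CLIPS_PER_WEEK, PRICE_PER_MINUTE)
saving_pct = 100 * (before - after) / before

print(f"before: ${before:.2f}/week, ${before * 52:.0f}/year")
print(f"after:  ${after:.2f}/week, ${after * 52:.0f}/year")
print(f"saving: {saving_pct:.0f}%")
```

Swap in your own clip length, volume, and per-minute price to see whether the pre-pass is worth wiring in.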

But that wasn't even the biggest win.

The hallucination kill

Whisper hallucinates on long silences. If you feed it 6 minutes of truck idling, it'll occasionally invent dialogue. Real example from my logs:

"Yeah I think we should probably go with the larger unit, like the 4-ton one would handle this whole zone."

That sentence was never spoken. It was Whisper's autocomplete on engine noise.

Silence-strip kills this. With the silence cut, there's no long-noise window for the model to hallucinate over. Hallucination rate dropped from about 1 in 30 clips to roughly 1 in 200.

What you can steal

If you're transcribing real-world audio:

Add sox silence detection before any API call. Two minutes of work.
Tune the threshold against a sample. 0.5% is right for construction audio. Conference calls might want 1%.
Track API spend before and after. Make the savings legible — that's how the change earns its slot in the pipeline.
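To tune the threshold against a sample, it helps to estimate how much audio each candidate threshold would keep before running the real pipeline. The sketch below is a rough stand-in, not sox's algorithm: it splits a 16-bit PCM WAV into fixed windows and counts the windows whose peak exceeds the threshold, ignoring padding and minimum-duration rules. The function names and the 0.1-second window are assumptions.

```python
import struct
import wave


def kept_fraction(wav_path, threshold_pct, window_s=0.1):
    """Fraction of windows whose peak exceeds threshold_pct of full scale.

    Assumes a 16-bit PCM WAV file; all channels are pooled together.
    """
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames_per_window = max(1, round(wav.getframerate() * window_s))
        limit = 32768 * threshold_pct / 100.0
        kept = total = 0
        while True:
            raw = wav.readframes(frames_per_window)
            if not raw:
                break
            samples = struct.unpack(f"<{len(raw) // 2}h", raw)
            total += 1
            if max(abs(s) for s in samples) > limit:
                kept += 1
    return kept / total if total else 0.0


def sweep(wav_path, thresholds_pct):
    """Map each candidate threshold (in percent) to its kept fraction."""
    return {pct: kept_fraction(wav_path, pct) for pct in thresholds_pct}
```

Calling sweep on a representative clip with thresholds such as 0.25, 0.5, 1.0, and 2.0 shows where the kept fraction stops changing much, which is a reasonable starting point before confirming by ear with the actual sox output.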

The pattern that matters: don't pay to transcribe silence. The model can't add value where there's no signal.