Performance Benchmarks

Latency & Accuracy

Transcription time and word error rate on a 10-second English speech sample.

Model	Download Size	RAM Usage	Latency (10s clip)	WER (English)	Languages
Moonshine Tiny	27 MB	~80 MB	85 ms	8.5%	English
Moonshine Base	65 MB	~150 MB	150 ms	6.1%	English
Whisper Tiny	120 MB	~200 MB	320 ms	12.8%	90+
Whisper Base	200 MB	~350 MB	580 ms	9.4%	90+
Whisper Small	600 MB	~900 MB	1.4 s	6.5%	90+
Whisper Medium	1.5 GB	~2.2 GB	3.8 s	5.0%	90+
Whisper Large v3 Turbo	1.6 GB	~2.5 GB	3.2 s	4.2%	90+
Parakeet TDT 0.6B	150 MB	~400 MB	Real-time*	5.8%	English

* Parakeet streams partial results as you speak — effective latency is near-zero for partial transcripts, with final output arriving within ~200ms of speech end.

Recommended Models by Use Case

Pick the right model for how you work.

Use Case	Recommended Model	Why
Quick voice commands	Moonshine Tiny	Fastest response, minimal resource use. Perfect for short phrases.
Daily English coding	Moonshine Base	Best balance of speed and accuracy for English. Default for a reason.
Live captions / streaming	Parakeet TDT 0.6B	Real-time partial transcripts as you speak. Feels like live captions.
Multilingual projects	Whisper Small	Good accuracy across 90+ languages without heavy resource use.
Long dictation (English)	Moonshine Base	Fast segment processing keeps dictation mode responsive.
Maximum accuracy	Whisper Large v3 Turbo	Lowest word error rate. Worth the wait if accuracy is critical.
Low-spec machines	Moonshine Tiny	27MB download, ~80MB RAM. Runs anywhere.

📐 Methodology

Benchmarks were measured on a mid-range development machine (Apple M2, 16GB RAM) using VoxPilot's built-in ONNX Runtime inference. Results may vary based on your hardware.

Latency: wall-clock time from audio buffer submission to transcript return, averaged over 20 runs on a 10-second speech clip
WER (Word Error Rate): measured on LibriSpeech test-clean subset — lower is better
RAM usage: peak resident memory during inference, measured via process monitoring
Download size: total size of ONNX model files fetched from Hugging Face
All models run via ONNX Runtime (CPU) — no GPU required

💡 Tips

First inference after model load is slower (cold start). Subsequent transcriptions are faster.
Whisper models get significantly faster with GPU acceleration if onnxruntime-gpu is available.
Parakeet's streaming mode means you see words appear as you speak — the "latency" you feel is near-instant.
For most developers coding in English, Moonshine Base is the sweet spot. Start there.
Use the Model Manager panel (sidebar) to download and switch models without restarting.