Latency, accuracy, and model size compared across all 8 supported ASR models. Every benchmark runs entirely on your device.
Transcription time and word error rate on a 10-second English speech sample.
| Model | Download Size | RAM Usage | Latency (10 s clip) | WER (English) | Languages |
|---|---|---|---|---|---|
| Moonshine Tiny | 27 MB | ~80 MB | 85 ms | | English |
| Moonshine Base | 65 MB | ~150 MB | 150 ms | | English |
| Whisper Tiny | 120 MB | ~200 MB | 320 ms | | 90+ |
| Whisper Base | 200 MB | ~350 MB | 580 ms | | 90+ |
| Whisper Small | 600 MB | ~900 MB | 1.4 s | | 90+ |
| Whisper Medium | 1.5 GB | ~2.2 GB | 3.8 s | | 90+ |
| Whisper Large v3 Turbo | 1.6 GB | ~2.5 GB | 3.2 s | | 90+ |
| Parakeet TDT 0.6B | 150 MB | ~400 MB | Real-time* | | English |
* Parakeet streams partial results as you speak, so effective latency is near zero for partial transcripts, with final output arriving within ~200 ms of speech end.
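To compare the latency column across clip lengths, it can help to convert each figure into a real-time factor (processing time divided by audio duration). The sketch below is illustrative only: it copies the latency values from the table above, and the names `LATENCY_MS` and `real_time_factor` are made up for this example rather than part of VoxPilot. Parakeet is left out because it streams and has no single batch latency.

```python
# Latency figures (milliseconds) for a 10-second clip, copied from the table above.
CLIP_SECONDS = 10.0
LATENCY_MS = {
    "Moonshine Tiny": 85,
    "Moonshine Base": 150,
    "Whisper Tiny": 320,
    "Whisper Base": 580,
    "Whisper Small": 1400,
    "Whisper Medium": 3800,
    "Whisper Large v3 Turbo": 3200,
}


def real_time_factor(latency_ms: float, clip_seconds: float = CLIP_SECONDS) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    return (latency_ms / 1000.0) / clip_seconds


if __name__ == "__main__":
    for model, ms in LATENCY_MS.items():
        rtf = real_time_factor(ms)
        print(f"{model:<24} RTF {rtf:.4f}")
```

A real-time factor below 1.0 means the model transcribes faster than the audio plays; every batch model in the table is well under that threshold on the benchmark machine.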
Pick the right model for how you work.
| Use Case | Recommended Model | Why |
|---|---|---|
| Quick voice commands | Moonshine Tiny | Fastest response, minimal resource use. Perfect for short phrases. |
| Daily English coding | Moonshine Base | Best balance of speed and accuracy for English. Default for a reason. |
| Live captions / streaming | Parakeet TDT 0.6B | Real-time partial transcripts as you speak. Feels like live captions. |
| Multilingual projects | Whisper Small | Good accuracy across 90+ languages without heavy resource use. |
| Long dictation (English) | Moonshine Base | Fast segment processing keeps dictation mode responsive. |
| Maximum accuracy | Whisper Large v3 Turbo | Lowest word error rate. Worth the wait if accuracy is critical. |
| Low-spec machines | Moonshine Tiny | 27 MB download, ~80 MB RAM. Runs anywhere. |
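If you want to apply these recommendations in a script, one option is a plain lookup combined with a memory check. This is a minimal sketch, not a VoxPilot API: `RECOMMENDED`, `RAM_MB`, and `recommend` are hypothetical names, the RAM figures are the approximate values from the benchmark table, and 1 GB is treated as 1000 MB.

```python
# Use case -> recommended model, taken from the table above.
RECOMMENDED = {
    "quick voice commands": "Moonshine Tiny",
    "daily english coding": "Moonshine Base",
    "live captions / streaming": "Parakeet TDT 0.6B",
    "multilingual projects": "Whisper Small",
    "long dictation (english)": "Moonshine Base",
    "maximum accuracy": "Whisper Large v3 Turbo",
    "low-spec machines": "Moonshine Tiny",
}

# Approximate RAM usage in MB from the benchmark table (assumes 1 GB = 1000 MB).
RAM_MB = {
    "Moonshine Tiny": 80,
    "Moonshine Base": 150,
    "Parakeet TDT 0.6B": 400,
    "Whisper Small": 900,
    "Whisper Large v3 Turbo": 2500,
}


def recommend(use_case: str, available_ram_mb: float) -> tuple[str, bool]:
    """Return the recommended model and whether it fits in the given RAM."""
    model = RECOMMENDED[use_case.strip().lower()]
    return model, RAM_MB[model] <= available_ram_mb
```

For example, `recommend("Maximum accuracy", 2048)` returns the Whisper Large v3 Turbo recommendation together with `False`, signalling that a smaller model may be the safer choice on that machine.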
Benchmarks were measured on a mid-range development machine (Apple M2, 16 GB RAM) using VoxPilot's built-in ONNX Runtime inference. Results will vary with your hardware.
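To reproduce this kind of measurement on your own hardware, the sketch below shows one common way to compute word error rate (word-level edit distance divided by reference length) and to time a transcription call. It assumes nothing about VoxPilot's internal benchmark harness, which is not described here: `transcribe` is a placeholder for whatever callable runs your model, and the lowercase-and-split normalization is a simplification.

```python
import time
from statistics import median
from typing import Callable


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def measure_latency_ms(
    transcribe: Callable[[bytes], str], audio: bytes, runs: int = 5
) -> float:
    """Return the median wall-clock time of `transcribe(audio)` in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)
```

Taking the median over several runs reduces the effect of one-off stalls such as model warm-up or disk caching, which matter more for the smaller, faster models.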