Whisper is an open-source (MIT-licensed) speech-to-text model and library released by OpenAI. Since voice input is an integral part of many applications, I wanted to explore the different options for running Whisper locally.

For all of these explorations, I will run a rough performance test using a sample audio file from Audible. The sample is from We Are Legion (We Are Bob), the first book of the Bobiverse series by Dennis E. Taylor. The ffmpeg step below converts the MP3 to 16 kHz mono 16-bit PCM, the WAV format whisper.cpp expects.

curl --output bobiverse-sample.mp3 \
'https://samples.audible.com/bk/adbl/027284/bk_adbl_027284_sample.mp3'
ffmpeg -i bobiverse-sample.mp3 -ar 16000 -ac 1 -c:a pcm_s16le bobiverse-sample.wav

This benchmark is not strictly scientific. For each method, I ran the test multiple times until the speed stabilized and then recorded a single representative result. All tests use the base.en model. There may be slight variations in the model weights due to conversions between formats, but they are functionally the same. Finally, I will compare the transcription outputs to identify any meaningful differences.
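To approximate that warm-up-and-measure loop programmatically, a small harness like the one below can run a command repeatedly and report wall-clock times. It is a minimal sketch; the command shown is illustrative, and in practice I simply reran each tool under time as shown in the sections below.

import subprocess
import time

# Illustrative command; substitute whichever backend you are timing.
CMD = ["whisper", "--model", "base.en", "../bobiverse-sample.wav"]

for run in range(5):
    start = time.perf_counter()
    subprocess.run(CMD, check=True, capture_output=True)
    print(f"run {run + 1}: {time.perf_counter() - start:.2f}s")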

Whisper offers several model sizes, from tiny.en to large, each providing a different balance between speed and accuracy. For this comparison, I chose base.en as a reasonable middle ground, but for production use cases, you might select a different model based on your specific needs.

OpenAI Whisper

openai/whisper

Setup

mkdir -p openai-whisper && cd openai-whisper
uv venv --python 3.11.13
source .venv/bin/activate
uv pip install openai-whisper

Transcription performance

time whisper --model base.en --model_dir . ../bobiverse-sample.wav --output_format json > /dev/null 2>&1
55.34s user 18.71s system 197% cpu 37.422 total

cat bobiverse-sample.json | jq ".segments[].text"
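The whisper CLI is a thin wrapper around the Python package, so embedding the same transcription in an application is straightforward. Here is a minimal sketch; the model name is the knob for the speed/accuracy trade-off discussed earlier.

import whisper

# Load the same model used in the CLI run; swap in "tiny.en",
# "small.en", etc. to trade accuracy for speed.
model = whisper.load_model("base.en")

result = model.transcribe("../bobiverse-sample.wav")
print(result["text"])

# Per-segment output, equivalent to the jq query above.
for segment in result["segments"]:
    print(segment["text"])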

Whisper.cpp

ggerganov/whisper.cpp

Whisper.cpp is optimized for Apple Silicon, which makes it an especially compelling way to run Whisper locally.

Setup

Building whisper.cpp

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

uv venv --python 3.11
source .venv/bin/activate
uv pip install ane_transformers openai-whisper coremltools

# WHISPER_COREML enables the Core ML path (Apple Neural Engine);
# WHISPER_SDL2 is required for the whisper-stream example below
cmake -B build -DWHISPER_COREML=1 -DWHISPER_SDL2=ON
cmake --build build -j --config Release

Building the model with Core ML support

# download the ggml weights, then convert them to a Core ML model
sh ./models/download-ggml-model.sh base.en
./models/generate-coreml-model.sh base.en

Note: As of this writing, coremltools does not support Python versions later than 3.11.

Exploring the whisper.cpp CLI tools

# capture a wav using ffmpeg and then transcribe
ffmpeg -f avfoundation -i ":0" -t 10 -ar 16000 -ac 1 -c:a pcm_s16le capture.wav
./build/bin/whisper-cli -m models/ggml-base.en.bin -f capture.wav --print-colors

# stream audio from a microphone and transcribe
./build/bin/whisper-stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000

Transcription performance

time ./build/bin/whisper-cli -m models/ggml-base.en.bin -f ../bobiverse-sample.wav -oj > /dev/null 2>&1
2.55s user 0.25s system 57% cpu 4.899 total

cat bobiverse-sample.json | jq ".transcription[].text"

Faster Whisper

SYSTRAN/faster-whisper

faster-whisper is a reimplementation of Whisper that uses CTranslate2, a fast inference engine for Transformer models. It’s known for its speed and efficiency, especially on NVIDIA GPUs. While it doesn’t have the same level of optimization for Apple Silicon’s Neural Engine, it’s still a popular choice for many users.

Setup

mkdir -p faster-whisper && cd faster-whisper
uv venv --python 3.11.13
source .venv/bin/activate
uv pip install faster-whisper

Transcription performance

transcribe-bobiverse.py

from faster_whisper import WhisperModel

# Run on CPU with INT8
model = WhisperModel("base.en", device="cpu", compute_type="int8")

segments, info = model.transcribe("../bobiverse-sample.wav", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

I used the script above to measure performance. Note that faster-whisper does not use the GPU or the Neural Engine on Apple Silicon, so this test ran entirely on the CPU.

time python transcribe-bobiverse.py > bobiverse-sample.txt
57.81s user 8.97s system 399% cpu 16.714 total

Given that the performance on the CPU was not competitive with the other methods, I did not perform a detailed comparison of the transcription output.
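For completeness, on a machine with an NVIDIA GPU the model would be constructed differently. The following is a sketch based on the faster-whisper documentation, untested here since this benchmark ran on Apple Silicon.

from faster_whisper import WhisperModel

# CUDA path from the faster-whisper README; not tested in this benchmark.
model = WhisperModel("base.en", device="cuda", compute_type="float16")

segments, info = model.transcribe("../bobiverse-sample.wav", beam_size=5)
for segment in segments:
    print(segment.text)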

Overall Comparisons

Comparing OpenAI and Whisper.cpp

diff --ignore-space-change \
<(cd whisper.cpp; cat bobiverse-sample.json | jq -r ".transcription[].text") \
<(cd openai-whisper; cat bobiverse-sample.json | jq -r ".segments[].text")

The differences were minor, relating to sentence segmentation and punctuation.

Punctuation Example

<  Well, what the hell?
---
>  Well, what the hell.

Word Difference Example

<  That appeared there was money in cryonics.
---
>  It appeared there was money in cryonics.

Conclusions

For local transcription, whisper.cpp is my preferred option. It delivers exceptional performance on Apple Silicon with no noticeable degradation in quality. While integrating the C++ implementation into my applications may present more challenges than a pure Python library, the performance benefits are compelling enough to warrant the effort.
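One pragmatic way to bridge that gap is to shell out to whisper-cli and parse its JSON output. The sketch below assumes the whisper.cpp build from earlier in this post; the paths and the -of output base name are illustrative.

import json
import subprocess

# Paths assume the whisper.cpp checkout built earlier in this post.
WHISPER_CLI = "./build/bin/whisper-cli"
MODEL = "models/ggml-base.en.bin"

def transcribe(wav_path: str) -> str:
    # -oj emits JSON; -of sets the output file base name.
    subprocess.run(
        [WHISPER_CLI, "-m", MODEL, "-f", wav_path, "-oj", "-of", "/tmp/transcript"],
        check=True,
        capture_output=True,
    )
    with open("/tmp/transcript.json") as f:
        data = json.load(f)
    # Same JSON shape queried earlier with jq: .transcription[].text
    return "".join(seg["text"] for seg in data["transcription"])

print(transcribe("../bobiverse-sample.wav"))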