Text to Speech
Often it is helpful to synthesize voices locally to make LLM responses more accessible. In this post, I experiment with two open-source tools for local text-to-speech (TTS):
- TTS.cpp, a fast C++ inference engine that supports the Kokoro models and provides both CLI and server interfaces.
- Chatterbox TTS API, a Python-based HTTP server that provides a REST API for TTS and voice cloning.
TTS.cpp runs the Kokoro models, which are impressively small and capable; Chatterbox TTS API serves the Chatterbox model from Resemble AI.
TTS.cpp
Building TTS.cpp from the repository:
git clone git@github.com:mmwillet/TTS.cpp.git
cd TTS.cpp
## GGML patch
git clone -b support-for-tts git@github.com:mmwillet/ggml.git
## Build
cmake -B build
cmake --build build --config Release
Download the Kokoro GGUF models from Hugging Face and place them in ./models (the commands below expect ./models/Kokoro_no_espeak_Q4.gguf).
You can now use the models with a range of English voices:
# Listen to all the American and British English voices
text="He grinned at me, happy to go along with the routine, as long as me and my wallet continued to pay attention. And I listened"
for voice in af_heart af_alloy af_aoede af_bella af_jessica af_kore af_nicole af_nova af_river af_sarah af_sky am_adam am_echo am_eric am_fenrir am_liam am_michael am_onyx am_puck am_santa bf_alice bf_emma bf_isabella bf_lily bm_daniel bm_fable bm_george bm_lewis; do
echo "$voice" && ./build/bin/tts-cli --model-path ./models/Kokoro_no_espeak_Q4.gguf --prompt "$text" --play -v "$voice"
done
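The voice IDs above follow a two-letter prefix convention: the first letter is the accent (a for American English, b for British English) and the second is the speaker's gender (f or m). A minimal sketch of a labeling helper, where `describe_voice` is a hypothetical name of my own and not part of TTS.cpp:

```shell
# Hypothetical helper: turn a Kokoro voice ID into a readable label
# based only on its two-letter prefix.
describe_voice() {
  case "$1" in
    af_*) echo "American English, female" ;;
    am_*) echo "American English, male" ;;
    bf_*) echo "British English, female" ;;
    bm_*) echo "British English, male" ;;
    *)    echo "unknown" ;;
  esac
}
```

Inside the loop, `echo "$voice: $(describe_voice "$voice")"` would print the label next to each ID while it plays.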
Hosting a TTS service:
./build/bin/tts-server --model-path ./models/Kokoro_no_espeak_Q4.gguf -v am_michael &
curl http://127.0.0.1:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "He grinned at me, happy to go along with the routine, as long as me and my wallet continued to pay attention. And I listened",
"temperature": 0.8,
"top_k": 20,
"repetition_penalty": 1.1,
"response_format": "wav"
}' \
| ffplay -f wav -i pipe:0 -autoexit -nodisp
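Because the server is started in the background, the first request can arrive before it is listening, and text containing double quotes would break the hand-written JSON. The helpers below are a rough sketch of one way around both issues. The names `json_escape`, `speech_payload`, `say`, and the `TTS_URL` variable are my own, and I am assuming the server accepts a body with only `input` and `response_format`.

```shell
# Hypothetical helpers; the endpoint and JSON fields are the ones used in the curl call above.

# Escape backslashes and double quotes so the text can sit inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for /v1/audio/speech.
speech_payload() {
  printf '{"input": "%s", "response_format": "wav"}' "$(json_escape "$1")"
}

# POST the text and play the result, retrying while the server is still starting up.
say() {
  curl --silent --show-error --retry 5 --retry-delay 1 --retry-connrefused \
    "${TTS_URL:-http://127.0.0.1:8080}/v1/audio/speech" \
    -H "Content-Type: application/json" \
    -d "$(speech_payload "$1")" \
    | ffplay -f wav -i pipe:0 -autoexit -nodisp
}
```

With these defined, `say 'He said "hello" and left.'` sends a correctly escaped request once the server accepts connections.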
Chatterbox TTS API
Chatterbox TTS API is a FastAPI-powered REST API for Chatterbox TTS. It provides OpenAI-compatible text-to-speech endpoints with voice cloning capabilities.
Install and start the server:
git clone https://github.com/travisvn/chatterbox-tts-api
cd chatterbox-tts-api
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
uvicorn app.main:app --host 0.0.0.0 --port 4123
Using the service:
curl -X POST http://localhost:4123/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Dramatic speech!", "exaggeration": 1.2, "cfg_weight": 0.3, "temperature": 0.9}' \
| ffplay -f wav -i pipe:0 -autoexit -nodisp
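Piping straight into ffplay hides failures: if the server answers with a JSON error, the player just reports bad input. One option is to save the response to a file and check for a RIFF/WAVE header before playing it. This is only a sketch. The names `is_wav`, `save_speech`, and `CHATTERBOX_URL` are hypothetical, and it assumes the endpoint returns WAV by default, as the ffplay invocation above implies.

```shell
# Hypothetical helper: succeed only if the file starts with a RIFF/WAVE header
# (bytes 0-3 are "RIFF", bytes 8-11 are "WAVE").
is_wav() {
  [ "$(head -c 4 "$1")" = "RIFF" ] && [ "$(head -c 12 "$1" | tail -c 4)" = "WAVE" ]
}

# Hypothetical helper: request speech for $1 and save it to $2.
# The text is inserted into the JSON as-is, so it must not contain
# double quotes or backslashes in this sketch.
save_speech() {
  curl --silent --show-error --fail -X POST \
    "${CHATTERBOX_URL:-http://localhost:4123}/v1/audio/speech" \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"$1\"}" \
    -o "$2" && is_wav "$2"
}
```

A call such as `save_speech "Dramatic speech!" out.wav && ffplay -autoexit -nodisp out.wav` then plays the audio only when the download succeeded and looks like a WAV file.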