It is often helpful to synthesize speech locally to make LLM responses more accessible. In this post, I experiment with two open-source tools for local text-to-speech (TTS):

  1. TTS.cpp, a fast C++ inference engine that supports the Kokoro models and provides both CLI and server interfaces.
  2. Chatterbox TTS API, a Python-based HTTP server that provides a REST API for TTS and voice cloning.

TTS.cpp runs the Kokoro models, which are impressively small and capable, while Chatterbox TTS API serves Resemble AI's Chatterbox model.

TTS.cpp

Building TTS.cpp from the repository:

git clone git@github.com:mmwillet/TTS.cpp.git
cd TTS.cpp

## GGML patch
git clone -b support-for-tts git@github.com:mmwillet/ggml.git

## Build
cmake -B build
cmake --build build --config Release

Download the Kokoro GGUF models from Hugging Face.
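
If you prefer to script the download, here is a minimal Python sketch using huggingface_hub; the repo id and filename below are assumptions on my part, so check the TTS.cpp README for the actual location of the GGUF files:

from huggingface_hub import hf_hub_download

# Assumed repo id and filename -- verify against the TTS.cpp README
model_path = hf_hub_download(
    repo_id="mmwillet2/Kokoro_GGUF",      # assumption: the real repo may differ
    filename="Kokoro_no_espeak_Q4.gguf",  # matches the filename used in the commands below
    local_dir="./models",
)
print(model_path)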

You can now use the models with a range of English voices:

# Listen to all the American and British English voices
text="He grinned at me, happy to go along with the routine, as long as me and my wallet continued to pay attention. And I listened"

for voice in af_heart af_alloy af_aoede af_bella af_jessica af_kore af_nicole af_nova af_river af_sarah af_sky am_adam am_echo am_eric am_fenrir am_liam am_michael am_onyx am_puck am_santa bf_alice bf_emma bf_isabella bf_lily bm_daniel bm_fable bm_george bm_lewis; do
  echo $voice
  ./build/bin/tts-cli --model-path ./models/Kokoro_no_espeak_Q4.gguf --prompt "$text" --play -v $voice
done

Hosting a TTS service:

./build/bin/tts-server --model-path ./models/Kokoro_no_espeak_Q4.gguf -v am_michael &

    
curl http://127.0.0.1:8080/v1/audio/speech  \
  -H "Content-Type: application/json" \
  -d '{
    "input": "He grinned at me, happy to go along with the routine, as long as me and my wallet continued to pay attention. And I listened",
    "temperature": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.1,
    "response_format": "wav"
  }' \
  | ffplay -f wav -i pipe:0 -autoexit -nodisp
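
The same endpoint can be called from Python. Here is a minimal sketch using the requests library that mirrors the curl call above, saving the audio to a file instead of piping it to ffplay:

import requests

# Same JSON payload as the curl example above
payload = {
    "input": "He grinned at me, happy to go along with the routine, as long as me and my wallet continued to pay attention. And I listened",
    "temperature": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.1,
    "response_format": "wav",
}

resp = requests.post("http://127.0.0.1:8080/v1/audio/speech", json=payload, timeout=120)
resp.raise_for_status()

with open("speech.wav", "wb") as f:
    f.write(resp.content)  # raw WAV bytes returned by tts-server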

Chatterbox TTS API

Chatterbox TTS API is a FastAPI-powered REST API for Chatterbox TTS, providing OpenAI-compatible text-to-speech endpoints with voice-cloning capabilities.

Install and start the server

git clone https://github.com/travisvn/chatterbox-tts-api
cd chatterbox-tts-api

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

uvicorn app.main:app --host 0.0.0.0 --port 4123
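
Before sending requests, you can check that the server is up. This sketch assumes a /health endpoint, which I believe the project exposes; check its README if the path differs:

import requests

# Assumed health-check path -- confirm against the chatterbox-tts-api README
print(requests.get("http://localhost:4123/health", timeout=10).json())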

Using the service

curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Dramatic speech!", "exaggeration": 1.2, "cfg_weight": 0.3, "temperature": 0.9}' \
| ffplay -f wav -i pipe:0 -autoexit -nodisp
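
Since the endpoint is OpenAI-compatible, you can also point the official openai Python client at it. A sketch under a few assumptions: the model and voice values are placeholders the server may ignore, and the Chatterbox-specific parameters are passed via extra_body:

from openai import OpenAI

# Point the OpenAI client at the local Chatterbox server; no real API key is needed
client = OpenAI(base_url="http://localhost:4123/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="chatterbox",   # placeholder: the server may ignore the model field
    voice="alloy",        # placeholder: the server may ignore or remap the voice field
    input="Dramatic speech!",
    extra_body={"exaggeration": 1.2, "cfg_weight": 0.3, "temperature": 0.9},
) as response:
    response.stream_to_file("dramatic.wav")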