Voice — STT & TTS
DjinnBot supports bidirectional voice: incoming voice messages are transcribed to text (STT), and agent responses can be spoken back as audio (TTS). This works across all messaging channels — Telegram, Discord, WhatsApp, Signal, Slack, and the dashboard.
Architecture Overview
graph LR
Voice["Voice message"] --> STT["STT (faster-whisper)"]
STT --> Agent["Agent"]
Agent --> TTS["TTS (Fish Audio / Voicebox)"]
TTS --> Audio["Audio reply"]
style Voice fill:#3b82f6,color:#fff
style STT fill:#8b5cf6,color:#fff
style Agent fill:#f59e0b,color:#fff
style TTS fill:#8b5cf6,color:#fff
style Audio fill:#10b981,color:#fff
Voice messages received on any channel are transcribed server-side at upload time. The transcript is stored as extracted_text on the attachment record, so agents read it as plain text — no special handling needed. If TTS is enabled for the agent, the response is synthesized back to audio and delivered as a native voice message on the channel.
Speech-to-Text (STT)
Engine: faster-whisper
STT uses faster-whisper, a CTranslate2-based reimplementation of OpenAI’s Whisper. It runs approximately 4x faster than the original Python Whisper on CPU.
Model: base by default (~150 MB). Override with the WHISPER_MODEL_SIZE environment variable.
Performance: ~3-5 seconds per 30-second clip on CPU with INT8 quantization.
Features:
- Automatic language detection
- VAD (Voice Activity Detection) filtering to skip silence
- Lazy model loading — the model is downloaded on first use and cached
Model Caching
The whisper model is persisted on JuiceFS at /jfs/cache/whisper-models so it survives container restarts without re-downloading. If JuiceFS is not mounted, it falls back to the default HuggingFace cache directory (ephemeral — re-downloads on restart).
Override the cache directory with the WHISPER_MODEL_DIR environment variable.
How It Works
- A voice message arrives on any channel (Telegram, Discord, WhatsApp, Signal, Slack)
- The audio is uploaded and stored as a
ChatAttachment - A background task converts the audio to 16kHz mono WAV via ffmpeg
- faster-whisper transcribes the WAV with beam search (beam_size=5) and VAD filtering
- The transcript is saved as
extracted_texton the attachment - The agent receives the transcript as normal text input
No agent-side configuration is needed — STT happens automatically for all audio attachments.
Text-to-Speech (TTS)
TTS converts agent text responses into audio. Two providers are supported:
| Provider | Type | Model | Cost | Latency |
|---|---|---|---|---|
| Fish Audio | Cloud API | S1 | $15 / 1M UTF-8 bytes | Low (cloud) |
| Voicebox | Local (self-hosted) | Qwen3-TTS | Free | Depends on GPU |
Fish Audio
Fish Audio is the default cloud TTS provider. It uses the S1 model and supports voice cloning via reference IDs.
Setup:
- Get an API key from fish.audio
- Add it in the dashboard under Settings > TTS Providers > Fish Audio
- Or set it as an instance-level key via the admin panel
Voice selection: Fish Audio has a library of pre-built voices. Voices are browsable in the dashboard when configuring per-agent TTS settings. You can also use custom voice clones by providing a reference ID.
API key resolution follows the same priority chain as LLM keys:
- User’s own key (from per-user TTS provider settings)
- Admin-shared key (granted by an admin to specific users or all users)
- Instance-level key (global fallback)
Cost calculation: $15.00 per 1M UTF-8 bytes of input text. Costs are logged per-call in the tts_call_logs table and visible in the dashboard.
Voicebox
Voicebox is a local TTS server that runs Qwen3-TTS. It generates speech on your own hardware — no API key needed, no per-call cost.
Setup:
- Install and run Voicebox (see the Voicebox repo)
- In the dashboard under Settings > TTS, set the default provider to Voicebox (Local)
- Configure the Voicebox URL (default:
http://localhost:8000) - Use the health check in the dashboard to verify connectivity
How Voicebox synthesis works:
- DjinnBot sends a
POST /generaterequest with the text and optional voice profile ID - Voicebox returns a generation ID
- DjinnBot fetches the audio via
GET /audio/{generation_id} - The WAV output is converted to the channel’s target format via ffmpeg
Voice profiles: Voicebox supports voice profiles. Select a profile in the per-agent TTS settings or let it use the default voice.
Channel Audio Formats
Each messaging platform has its own preferred audio format. DjinnBot handles conversion automatically via ffmpeg:
| Channel | Format | Notes |
|---|---|---|
| Telegram | OGG/Opus | Required for native voice messages |
| OGG/Opus | Required for push-to-talk voice notes | |
| Discord | OGG/Opus | Standard voice message format |
| Slack | MP3 | Native audio attachment |
| Signal | MP3 | Handles MP3 attachments |
| Dashboard | MP3 | Universal web playback |
For Fish Audio, audio is requested as MP3 and converted to OGG/Opus where needed (container remux via ffmpeg — no transcoding). Voicebox outputs WAV, which is transcoded to the target format.
Per-Agent TTS Configuration
TTS is configured per-agent in the database (not YAML), so settings persist across container restarts. Each agent has:
| Setting | Description |
|---|---|
tts_enabled | Whether this agent generates voice replies |
tts_provider | fish-audio or voicebox (overrides the global default) |
tts_voice_id | Voice/profile ID for this agent |
tts_voice_name | Display name of the selected voice |
Configure these in the dashboard under Agents > [Agent Name] > TTS Settings.
When TTS Triggers
TTS is not generated for every response. It triggers when all of these conditions are met:
- The incoming message was a voice message (voice-in, voice-out)
- The agent has
tts_enabled: truein its DB settings - TTS is globally enabled in admin settings
- The response text is within the character threshold (default: 1000 characters)
This means agents only speak back when spoken to — text messages always get text replies.
Global TTS Settings
Admins control TTS behavior from the dashboard (Settings > TTS):
| Setting | Default | Description |
|---|---|---|
ttsEnabled | true | Global kill switch for all TTS |
defaultTtsProvider | fish-audio | Default provider when no agent/user override exists |
ttsCharacterThreshold | 1000 | Max response length that will be synthesized |
ttsMaxConcurrentRequests | 5 | Concurrency limit (semaphore) for TTS API calls |
voiceboxUrl | http://localhost:8000 | Base URL of the Voicebox instance |
Provider Resolution
When generating TTS, the provider is resolved in priority order:
- Agent DB settings — per-agent
tts_provideroverride - User preference — per-user default stored in global settings
- Admin default — the
defaultTtsProviderglobal setting - Fallback —
fish-audio
Rate Limiting & Cost Tracking
- A global semaphore limits concurrent TTS requests (configurable via
ttsMaxConcurrentRequests) - Every TTS call is logged to the
tts_call_logstable with: provider, model, input size, output size, cost, latency, channel, and key source - Logs are published to Redis (
djinnbot:tts-calls:live) for real-time SSE streaming in the dashboard - Fish Audio costs are calculated at $15.00 / 1M UTF-8 bytes; Voicebox calls are always $0
API Endpoints
Key TTS-related endpoints under /v1:
| Method | Path | Description |
|---|---|---|
GET | /tts/voices | List Fish Audio voices (proxied) |
GET | /tts/voices/{voice_id} | Get Fish Audio voice details |
GET | /tts/voicebox/profiles | List Voicebox voice profiles |
GET | /tts/voicebox/health | Check Voicebox connectivity |
POST | /internal/tts/synthesize | Generate speech (internal, called by engine) |
GET | /tts-calls | Query TTS call logs |
GET | /agents/{agent_id}/tts-settings | Get agent TTS settings |
PUT | /agents/{agent_id}/tts-settings | Update agent TTS settings |
GET | /admin/tts-settings | Get global TTS settings |
PUT | /admin/tts-settings | Update global TTS settings |
Dependencies
- ffmpeg — required for audio format conversion (STT and TTS)
- faster-whisper — Python package for STT (
pip install faster-whisper) - fishaudio — Python SDK for Fish Audio TTS (
pip install fishaudio) - Voicebox — optional, self-hosted TTS server (separate install)