Pipeline: XTTS-Server → xtts-bridge → aria-bridge → RVS → App AudioTrack
XTTS-Bridge (Gaming-PC):
- streamXTTSAsPCM(): liest /tts_to_audio/ Response inkrementell,
parst WAV-Header (samplerate/channels), teilt PCM in 8KB-Chunks
(~170ms bei 24kHz s16 mono) und sendet jeden als audio_pcm.
- Finaler Chunk mit final=true nach letztem Text-Chunk
aria-bridge:
- audio_pcm Handler leitet payload 1:1 weiter, filled messageId aus
requestId → messageId Map falls XTTS-Bridge messageId nicht hatte
- Alter xtts_response Pfad bleibt als Legacy-Fallback (WAV)
RVS: audio_pcm in ALLOWED_TYPES
Android Native:
- PcmStreamPlayerModule (Kotlin): AudioTrack MODE_STREAM mit
Writer-Thread und BlockingQueue. start(rate, ch) / writeChunk(b64)
/ end() / stop()
- 8x MinBufferSize grosszuegig dimensioniert, glatt auch bei
Netz-Aussetzern
- Registered im MainApplication via PcmStreamPlayerPackage
App JS:
- audioService.handlePcmChunk(): erkennt neue Session (messageId-Wechsel),
started nativen Stream, cached PCM-Bytes pro Message. Bei final=true
Stream sauber schliessen + _savePcmBufferAsWav → WAV-File im
tts_cache/<messageId>.wav
- _savePcmBufferAsWav: baut 44-byte WAV-Header (PCM s16le, korrekte
samplerate/channels), haengt alle gesammelten base64-PCM-Chunks an
- stopPlayback beendet auch aktiven PCM-Stream
- ChatScreen routet type=audio_pcm an handlePcmChunk, bei final
setzt audioPath in der Message
Play-Button: falls messageId einen audioPath hat → WAV aus Cache
(Sound-basiert), egal ob Original-TTS Piper oder XTTS war.
Audio-Focus:
- requestDuck() beim Stream-Start, release() bei Stream-Ende
- Andere Apps (Spotify etc.) werden leiser waehrend ARIA spricht
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Back to streaming: render chunk → send immediately → next chunk
- App plays with preloading queue (no waiting for all chunks)
- Comma instead of dot between sentences in chunk (no "Punkt" read aloud)
- Sentence-ending dots already removed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- All chunks rendered and PCM concatenated (consistent voice)
- Split into ~8 second WAV parts (not per-sentence)
- 8s is long enough for preload overlap, small enough for WebSocket
- Parts include part/totalParts metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- All chunks rendered sequentially, PCM data concatenated
- Single WAV with proper header sent back (no queue needed in app)
- If total > 800KB, split into parts (WebSocket limit)
- Eliminates stuttering between sentences
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 2-3 sentences per chunk (more context = stable voice/volume)
- Max 250 chars per chunk (keeps WebSocket packets manageable)
- Dots re-added between sentences within a chunk (natural pauses)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Preload next audio while current plays (eliminates gap between sentences)
- Remove trailing dots from sentences (XTTS reads them aloud)
- stopPlayback cleans up preloaded audio
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- XTTS-Bridge does sentence splitting (not ARIA-Bridge)
- Sequential rendering: correct order guaranteed
- Each sentence sent as separate xtts_response
- Markdown removal before splitting
- App starts playback after first sentence (faster UX)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- XTTS API runs on port 8020 (not 8000)
- Bridge waits up to 5min for model download (30x10s)
- Health check uses / instead of /docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>