feat: Remove Piper completely, XTTS v2 is now the only TTS engine

Breaking change: if the XTTS bridge (gaming PC) is offline, ARIA stays silent. Chat replies still arrive, but without audio. This is a deliberate trade-off: XTTS simply sounds far better.

Bridge (aria_bridge.py):
- from piper import ... removed
- VoiceEngine class removed entirely (synthesize, speak, select_voice)
- EPIC_TRIGGERS + load_epic_triggers removed (the highlight-voice feature is pointless without Piper)
- self.voice_engine, voice_name, requested_voice usages removed
- _process_core_response: always XTTS, no fallback
- tts_request handler: always XTTS
- config handler: only ttsEnabled + xttsVoice + whisperModel
- import wave removed

bridge/requirements.txt: piper-tts removed
bridge/Dockerfile: comment updated
docker-compose.yml: ./aria-data/voices mount removed
aria-data/config/aria.env.example: PIPER_RAMONA/PIPER_THORSTEN removed
get-voices.sh: deleted entirely (it was only the Piper downloader)

Diagnostic UI (index.html):
- Piper panel (default voice / highlight voice / speed sliders) removed
- TTS engine dropdown removed (always XTTS)
- TTS diagnostics tab now shows only XTTS status + test button
- sendVoiceConfig now sends only ttsEnabled/xttsVoice/whisperModel
- toggleXTTSPanel kept as a no-op legacy stub (JS calls stay safe)

Diagnostic Server (server.js):
- handleSendVoiceConfig: only ttsEnabled + xttsVoice + whisperModel
- handleTestTTS: via xtts_request (no longer a Piper subprocess)
- handleCheckTTS: via xtts_list_voices over RVS
- handleGetVoiceConfig/Defaults cleaned up
- Highlight-trigger UI remains but is no longer evaluated by the bridge (dead code in the UI, possibly reused later for an XTTS voice switch)

README + issue.md updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
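
The reduced config surface named in the commit message (only ttsEnabled, xttsVoice and whisperModel) could be handled roughly as follows. This is a minimal sketch and not the code from aria_bridge.py: the `VoiceConfig` dataclass and the `normalize_voice_config` helper are hypothetical names introduced for illustration. The three key names come from the commit message, and the "medium" Whisper default follows the comment in the bridge diff.

```python
from dataclasses import dataclass

# Keys the bridge still honours after this commit (per the commit message).
SUPPORTED_KEYS = ("ttsEnabled", "xttsVoice", "whisperModel")


@dataclass
class VoiceConfig:
    """Hypothetical container for the three remaining settings."""
    tts_enabled: bool = True
    xtts_voice: str = ""            # empty string means "default XTTS voice"
    whisper_model: str = "medium"   # default named in the bridge diff comment


def normalize_voice_config(raw: dict) -> tuple[VoiceConfig, list[str]]:
    """Build a VoiceConfig from a saved JSON dict and report ignored keys.

    Legacy Piper keys (ttsEngine, defaultVoice, speedRamona, ...) are not an
    error; they are simply listed so a caller can log them.
    """
    ignored = sorted(key for key in raw if key not in SUPPORTED_KEYS)
    config = VoiceConfig(
        tts_enabled=bool(raw.get("ttsEnabled", True)),
        xtts_voice=str(raw.get("xttsVoice") or ""),
        whisper_model=str(raw.get("whisperModel") or "medium"),
    )
    return config, ignored


if __name__ == "__main__":
    legacy = {"ttsEnabled": True, "ttsEngine": "piper", "xttsVoice": "Maia"}
    print(normalize_voice_config(legacy))
```

Ignoring unknown keys instead of rejecting them keeps an old saved voice config loadable after the Piper fields disappear, which matches the way the diff simply stops reading them.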
parent 6ab6196739
commit f801d99748
76
README.md
@@ -57,8 +57,8 @@ ARIA has two roles:
 │ │ Reads BOOTSTRAP.md + AGENT.md │ │
 │ │ │ │
 │ │ [bridge] ARIA Voice Bridge Container │ │
-│ │ Whisper STT · Piper TTS · Wake-Word │ │
-│ │ Ramona (female) + Thorsten (deep) │ │
+│ │ Whisper STT · Wake-Word │ │
+│ │ TTS remote via XTTS v2 on the gaming PC │ │
 │ │ Bridge: App <> RVS <> Bridge <> ARIA │ │
 │ │ │ │
 │ │ [diagnostic] Self-check UI + settings │ │
@@ -143,21 +143,16 @@ claude login
 **Important:** The folder `~/.claude/` (not `~/.config/claude/`!) is mounted into
 the proxy as a volume. The credentials survive container restarts.

-### 3. Download voices
+### 3. Configure the Voice Bridge

-```bash
-./get-voices.sh
-# Downloads Ramona + Thorsten (Piper TTS) to aria-data/voices/
-# About 100 MB, takes a few minutes
-```
-
-### 4. Configure the Voice Bridge
-
 ```bash
 cp aria-data/config/aria.env.example aria-data/config/aria.env
-# Adjust as needed (Whisper model, language, voice paths)
+# Adjust as needed (Whisper model, language, wake word)
 ```

+TTS runs exclusively via XTTS v2 on the gaming PC; see the section
+"XTTS v2 — High-Quality TTS" further down.
+
 ### 5. Generate the RVS token & start the containers

 ```bash
@@ -253,7 +248,6 @@ Four patches are then applied via `sed`:
 - Security rules (no ClawHub, fend off prompt injection)
 - Tool permissions (all Claude Code tools: WebFetch, Bash, etc.)
 - SSH access to aria-wohnung (VM)
-- Voice selection (Ramona vs Thorsten)
 - Memory system

 ### openclaw.json (via aria-setup.sh)
@@ -299,15 +293,14 @@ Audio: App → RVS → Bridge → FFmpeg → Whisper STT → chat.send → aria
 File: App → RVS → Bridge → /shared/uploads/ → chat.send (with path) → aria-core

 aria-core → reply → Gateway → Diagnostic → RVS → App
-→ Bridge → Piper TTS → RVS → App (audio)
-→ Bridge → speaker (local)
+→ Bridge → XTTS (PCM stream) → RVS → App AudioTrack
 ```

 ### Features

 - **STT**: faster-whisper (local, offline, 16 kHz mono)
-- **TTS**: Piper (Ramona + Thorsten, offline) or XTTS v2 (remote, GPU, voice cloning)
-- **Markdown cleanup**: removes **bold**, *italic*, `code`, links, lists etc. before TTS (natural speech)
+- **TTS**: XTTS v2 (remote on the gaming PC, GPU, voice cloning), streamed as PCM chunks
+- **Text cleanup**: `<voice>...</voice>` tag preferred; Markdown/code/units/URLs are prepared for TTS
 - **Wake word**: openwakeword (local microphone on the VM)
 - **App audio**: Base64 audio from the app → FFmpeg → Whisper STT → text to aria-core
 - **Modes**: Normal, Do not disturb, Whispering, Hangar, Gaming
@@ -322,13 +315,6 @@ aria-core → reply → Gateway → Diagnostic → RVS → App
 | Hangar | `"ARIA, ich arbeite"` | Important messages only |
 | Gaming | `"ARIA, Gaming-Modus"` | Only answer direct questions |

-### Voices
-
-| Voice | Model | When |
-|--------|--------|------|
-| **Ramona** (female) | `de_DE-ramona-low` | Everyday use, replies, conversations |
-| **Thorsten** (male, deep) | `de_DE-thorsten-high` | Epic moments, alarms |
-
 ---

 ## Diagnostic: Self-Check UI and Settings
@@ -344,7 +330,7 @@ Reachable at `http://<VM-IP>:3001`. Shares the network with aria-core.
 - **Session management**: list, switch, create and delete sessions, export as Markdown (⬇ button)
 - **Chat history**: shown on load and on session switch (read-only from JSONL)
 - **TTS diagnostics tab**: test voices, check status, show errors
-- **Settings**: TTS engine (Piper/XTTS), voices, speed, highlight triggers, operating modes, Whisper model (tiny…large-v3, hot reload)
+- **Settings**: TTS enabled toggle, XTTS voice (cloned), operating modes, Whisper model (tiny…large-v3, hot reload)
 - **XTTS voice cloning**: upload audio samples, create your own voice
 - **Claude login**: browser terminal for logging in to the proxy
 - **Core terminal**: shell in aria-core (openclaw CLI)
@@ -373,13 +359,13 @@ API endpoint for other services: `GET http://localhost:3001/api/session`
 - **Speech gate**: the recording is discarded if no speech is detected (no noise sent to Whisper)
 - **STT (speech-to-text)**: audio is recorded as 16 kHz mono and transcribed in the bridge by Whisper; the transcribed text appears in the chat
 - **"ARIA is thinking..." indicator**: shows the core's live status (thinking, tool, writing) + cancel button
-- **TTS playback**: ARIA answers through the speaker (Piper or XTTS v2), audio queue with preloading
+- **TTS playback**: ARIA answers through the speaker: XTTS v2 PCM streaming directly into AudioTrack, no wait gaps
 - **Play button**: every ARIA message can be read aloud again
 - **Chat search**: the magnifier in the status bar filters messages live
 - **Multiple attachments**: collect images + files, add text, then send them together
 - **Paste support**: paste images from the clipboard (Diagnostic)
 - **Attachments**: the bridge stores them in the shared volume, ARIA can access them, re-download via RVS
-- **Settings**: TTS engine, voices, speed per voice, storage location, auto-download, GPS
+- **Settings**: TTS enabled, XTTS voice, storage location, auto-download, GPS
 - **Auto-update**: checks for a new version on start + via button, download + installation via RVS (FileProvider)
 - GPS position (optional)
 - QR code scanner for token pairing
@@ -429,7 +415,7 @@ RVS_UPDATE_HOST=root@aria-rvs # Optional: for auto-update
 ### Docker-Cleanup

 The bridge image pulls in large ML deps (faster-whisper, ctranslate2, onnxruntime,
-openwakeword, piper-tts); the Docker build cache grows with every rebuild. If
+openwakeword); the Docker build cache grows with every rebuild. If
 the VM fills up:

 ```bash
@@ -453,8 +439,8 @@ The update flow:
 App (microphone) → AAC/MP4 recording → Base64 → RVS → Bridge
 Bridge: FFmpeg (16 kHz PCM) → Whisper STT → text → aria-core
 Bridge: STT result → RVS → App (placeholder is replaced by the transcribed text)
-aria-core → reply → Bridge → Piper TTS (WAV) → Base64 → RVS → App
-App: Base64 → WAV → speaker
+aria-core → reply → Bridge → XTTS (gaming PC) → PCM stream → RVS → App
+App: AudioTrack MODE_STREAM (seamless), cached as WAV per message
 ```

 ### File pipeline (images & attachments)
@@ -502,10 +488,6 @@ aria-data/
 │
 ├── skills/ ← ARIA's skills (self-written!)
 │
-├── voices/ ← Piper TTS voices (offline)
-│ ├── de_DE-ramona-low.onnx
-│ └── de_DE-thorsten-high.onnx
-│
 ├── config/
 │ ├── BOOTSTRAP.md ← system prompt (identity, rules, tools)
 │ ├── AGENT.md ← personality & working principles
@@ -600,26 +582,26 @@ The model is cached in the `xtts-models` volume and only needs to be loaded once

 ### Features

-- **Natural voices**: clearly better quality than Piper
+- **Natural voices**: clearly better quality than older-generation TTS
 - **Voice cloning**: your own voice from a 6-10 s audio sample (~2 s latency on an RTX 3060)
+- **Streaming**: PCM chunks every ~170 ms → the app plays seamlessly without waiting
 - **16 languages**: German, English, French, etc.
-- **Fallback**: if XTTS is unreachable, the bridge automatically uses Piper

-### Switching the TTS engine
+### TTS config

 In the Diagnostic under Settings → Speech output:
 - **TTS enabled**: global on/off
-- **TTS engine**: Piper (local, CPU, fast) or XTTS v2 (remote, GPU, natural)
-- **Piper**: default voice, highlight voice, speed per voice
-- **XTTS**: voice selection, voice cloning
+- **XTTS voice**: default or cloned (Maia, etc.)
+
+> XTTS is the only engine: if the gaming PC is offline, ARIA stays silent.
+> Chat replies still arrive (just without audio).

 ### Cloning a voice

-1. Set the TTS engine to "XTTS v2"
-2. "Clone voice" → upload audio files (WAV/MP3, 1-10 files, at least 6-10 s in total)
-3. Assign a name → "Create voice"
-4. Click "Load" → the new voice appears in the selection
-5. Select the voice → the config is saved automatically
+1. "Clone voice" → upload audio files (WAV/MP3, 1-10 files, at least 6-10 s in total)
+2. Assign a name → "Create voice"
+3. Click "Load" → the new voice appears in the selection
+4. Select the voice → the config is saved automatically

 > **Tip:** For best results: a clean recording, a single voice, no background noise,
 > 10-30 seconds total length. Several short files are concatenated.
@@ -718,7 +700,9 @@ docker exec aria-core ssh aria-wohnung hostname
 - [x] SSH access to the VM (aria-wohnung)
 - [x] Diagnostic web UI + settings
 - [x] Session management + chat history
-- [x] Voice settings (Ramona/Thorsten, speed, highlight triggers)
+- [x] Voice settings (Ramona/Thorsten, speed, highlight triggers), replaced by XTTS v2 voice cloning
+- [x] Piper removed completely, XTTS v2 is the only TTS (gaming PC)
+- [x] Streaming TTS: PCM chunks directly into AudioTrack, seamless playback
 - [x] Sentence-by-sentence TTS for long texts
 - [x] File/image upload with shared volume
 - [x] Watchdog (stuck-run detection + auto-fix + container restart)
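
The README changes describe the app receiving XTTS audio as a stream of PCM chunks (roughly every 170 ms) and keeping a WAV copy per message. The following is a minimal sketch of those two ideas, not code from the project: the sample rate, sample width and channel count are assumptions (the README does not state them), and `chunk_bytes` and `pcm_chunks_to_wav` are hypothetical helper names.

```python
import io
import wave

# Assumptions for illustration only; the README does not state the format.
SAMPLE_RATE = 24_000   # Hz
SAMPLE_WIDTH = 2       # bytes per sample (16-bit PCM)
CHANNELS = 1           # mono


def chunk_bytes(duration_ms: int) -> int:
    """Size in bytes of one raw PCM chunk of the given duration."""
    frames = SAMPLE_RATE * duration_ms // 1000
    return frames * SAMPLE_WIDTH * CHANNELS


def pcm_chunks_to_wav(chunks: list[bytes]) -> bytes:
    """Wrap raw PCM chunks in a WAV container, e.g. for a per-message cache."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav_file.writeframes(chunk)
    # wave only closes files it opened itself, so the buffer is still readable.
    return buffer.getvalue()


if __name__ == "__main__":
    size = chunk_bytes(170)
    silence = [bytes(size)] * 3          # three chunks of digital silence
    wav_bytes = pcm_chunks_to_wav(silence)
    print(size, len(wav_bytes))
```

Under these assumed parameters a 170 ms chunk is only a few kilobytes, which is why playback can start as soon as the first chunk arrives while the cached WAV is assembled on the side.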
aria-data/config/aria.env.example
@@ -3,10 +3,6 @@
 # → localhost is aria-core
 ARIA_CORE_WS=ws://127.0.0.1:18789

-# Piper TTS voices
-PIPER_RAMONA=/voices/de_DE-ramona-low.onnx
-PIPER_THORSTEN=/voices/de_DE-thorsten-high.onnx
-
 # Wake-Word
 WAKE_WORD=aria

bridge/Dockerfile
@@ -1,6 +1,6 @@
 # ════════════════════════════════════════════════
 # ARIA Voice Bridge — Dockerfile
-# Whisper STT + Piper TTS + Wake-Word
+# Whisper STT + Wake-Word (TTS via XTTS v2 remote)
 # ════════════════════════════════════════════════

 FROM python:3.12-slim
aria_bridge.py
@@ -26,7 +26,6 @@ import ssl
 import sys
 import tempfile
 import uuid
-import wave
 from pathlib import Path
 from typing import Optional

@@ -37,8 +36,6 @@ import sounddevice as sd
 import websockets
 from faster_whisper import WhisperModel
 from openwakeword.model import Model as WakeWordModel
-from piper import PiperVoice
-from piper.config import SynthesisConfig

 from modes import Mode, detect_mode_switch, should_speak

@@ -72,38 +69,6 @@ CHANNELS = 1
 BLOCK_SIZE = 1280  # 80 ms at 16 kHz, good for wake-word detection
 RECORD_SECONDS = 8  # max. recording duration after the wake word

-# Epic triggers: Thorsten speaks when these words occur
-EPIC_TRIGGERS_DEFAULT = [
-    "deploy",
-    "erfolgreich",
-    "alarm",
-    "so soll es sein",
-    "kritisch",
-    "server down",
-    "sicherheitswarnung",
-    "ticket geloest",
-    "aufgabe abgeschlossen",
-]
-
-# Load triggers from the shared config (saved by Diagnostic)
-TRIGGERS_FILE = "/shared/config/highlight_triggers.json"
-
-def load_epic_triggers():
-    """Loads highlight triggers from the shared config or uses the defaults."""
-    try:
-        if os.path.exists(TRIGGERS_FILE):
-            with open(TRIGGERS_FILE) as f:
-                triggers = json.load(f)
-            if isinstance(triggers, list) and len(triggers) > 0:
-                logger.info("Highlight triggers loaded: %d from %s", len(triggers), TRIGGERS_FILE)
-                return triggers
-    except Exception as e:
-        logger.warning("Loading highlight triggers failed: %s, using defaults", e)
-    return EPIC_TRIGGERS_DEFAULT
-
-EPIC_TRIGGERS = load_epic_triggers()
-
-
 def load_config() -> dict[str, str]:
     """Loads the configuration.

@@ -290,179 +255,6 @@ def clean_text_for_tts(text: str) -> str:
     return t.strip()


-class VoiceEngine:
-    """Manages Piper TTS with two voices: Ramona and Thorsten."""
-
-    def __init__(self, voices_dir: Path) -> None:
-        self.voices_dir = voices_dir
-        self.voices: dict[str, PiperVoice] = {}
-        self.default_voice = "ramona"
-        self.highlight_voice = "thorsten"
-        self.speech_speed = {"ramona": 1.0, "thorsten": 1.0}
-
-    def initialize(self) -> None:
-        """Loads the Piper voices from the voices directory."""
-        voice_configs = {
-            "ramona": "de_DE-ramona-low",
-            "thorsten": "de_DE-thorsten-high",
-        }
-
-        for name, model_name in voice_configs.items():
-            model_path = self.voices_dir / f"{model_name}.onnx"
-            config_path = self.voices_dir / f"{model_name}.onnx.json"
-
-            if not model_path.exists():
-                logger.error("Voice not found: %s", model_path)
-                continue
-
-            self.voices[name] = PiperVoice.load(
-                str(model_path),
-                config_path=str(config_path) if config_path.exists() else None,
-            )
-            logger.info("Voice loaded: %s (%s)", name, model_name)
-
-        if not self.voices:
-            logger.error("No voices loaded, TTS disabled")
-
-    def select_voice(
-        self, text: str, requested_voice: Optional[str] = None
-    ) -> str:
-        """Selects the matching voice based on the text or an explicit request.
-
-        Thorsten is used for epic triggers,
-        otherwise Ramona is the default voice.
-
-        Args:
-            text: The text to speak (for epic-trigger detection).
-            requested_voice: Explicitly requested voice ("ramona" | "thorsten").
-
-        Returns:
-            Name of the selected voice.
-        """
-        if requested_voice and requested_voice in self.voices:
-            return requested_voice
-
-        # Check highlight triggers
-        text_lower = text.lower()
-        for trigger in EPIC_TRIGGERS:
-            if trigger in text_lower:
-                logger.info("Highlight trigger detected: '%s', %s speaks", trigger, self.highlight_voice)
-                return self.highlight_voice
-
-        return self.default_voice
-
-    def synthesize(self, text: str, voice_name: str = "ramona") -> Optional[bytes]:
-        """Produces audio data from text with the selected voice.
-
-        Args:
-            text: The text to speak.
-            voice_name: Name of the voice ("ramona" or "thorsten").
-
-        Returns:
-            WAV audio data as bytes, or None on error.
-        """
-        voice = self.voices.get(voice_name)
-        if voice is None:
-            logger.error("Voice '%s' not available", voice_name)
-            return None
-
-        try:
-            # Central TTS cleanup (Markdown, code, units, URLs)
-            import re
-            clean = clean_text_for_tts(text)
-            sentences = re.split(r'(?<=[.!?])\s+', clean)
-            sentences = [s.strip() for s in sentences if s.strip()]
-
-            if not sentences:
-                return None
-
-            # Synthesize each sentence separately and concatenate the WAVs
-            all_audio = b""
-            sample_rate = None
-            for sentence in sentences:
-                if not sentence:
-                    continue
-                with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
-                    tmp_path = tmp.name
-                speed = self.speech_speed.get(voice_name, 1.0)
-                syn_config = SynthesisConfig(length_scale=1.0 / max(0.3, speed))
-                with wave.open(tmp_path, "wb") as wav_file:
-                    voice.synthesize_wav(sentence, wav_file, syn_config=syn_config)
-                with wave.open(tmp_path, "rb") as wav_file:
-                    if sample_rate is None:
-                        sample_rate = wav_file.getframerate()
-                    all_audio += wav_file.readframes(wav_file.getnframes())
-                Path(tmp_path).unlink(missing_ok=True)
-
-            # Build the concatenated WAV
-            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
-                final_path = tmp.name
-            with wave.open(final_path, "wb") as wav_file:
-                wav_file.setnchannels(1)
-                wav_file.setsampwidth(2)
-                wav_file.setframerate(sample_rate or 22050)
-                wav_file.writeframes(all_audio)
-
-            audio_data = Path(final_path).read_bytes()
-            Path(final_path).unlink(missing_ok=True)
-
-            logger.info(
-                "TTS: %d bytes produced with %s (%d sentences): '%s'",
-                len(audio_data),
-                voice_name,
-                len(sentences),
-                text[:60],
-            )
-            return audio_data
-
-        except Exception:
-            logger.exception("TTS error with voice '%s'", voice_name)
-            return None
-
-    def speak(self, text: str, requested_voice: Optional[str] = None) -> None:
-        """Speaks the text through the audio device.
-
-        Automatically selects the matching voice and plays the audio.
-
-        Args:
-            text: The text to speak.
-            requested_voice: Optional explicit voice choice.
-        """
-        voice_name = self.select_voice(text, requested_voice)
-        audio_data = self.synthesize(text, voice_name)
-
-        if audio_data is None:
-            return
-
-        try:
-            # Read the WAV data and play it via sounddevice
-            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
-                tmp.write(audio_data)
-                tmp_path = tmp.name
-
-            with wave.open(tmp_path, "rb") as wf:
-                frames = wf.readframes(wf.getnframes())
-                sample_width = wf.getsampwidth()
-                rate = wf.getframerate()
-                channels = wf.getnchannels()
-
-            Path(tmp_path).unlink(missing_ok=True)
-
-            # NumPy array from the PCM data
-            dtype_map = {1: np.int8, 2: np.int16, 4: np.int32}
-            dtype = dtype_map.get(sample_width, np.int16)
-            audio_array = np.frombuffer(frames, dtype=dtype)
-
-            if channels > 1:
-                audio_array = audio_array.reshape(-1, channels)
-
-            sd.play(audio_array, samplerate=rate)
-            sd.wait()  # wait until playback has finished
-
-        except Exception:
-            logger.exception("Audio playback failed")
-
-
 # ── STT Engine ───────────────────────────────────────────────


@@ -672,9 +464,9 @@ class ARIABridge:
         self.current_mode = Mode.NORMAL
         self.running = False

-        # Components
-        self.voice_engine = VoiceEngine(VOICES_DIR)
+        # Components (TTS: always XTTS remote, Piper was removed)
         self.tts_enabled = True
+        self.xtts_voice = ""
         vc: dict = {}
         # Load the saved voice config
         try:
@@ -682,16 +474,9 @@ class ARIABridge:
             if os.path.exists(vc_path):
                 with open(vc_path) as f:
                     vc = json.load(f)
-                self.voice_engine.default_voice = vc.get("defaultVoice", "ramona")
-                self.voice_engine.highlight_voice = vc.get("highlightVoice", "thorsten")
-                self.voice_engine.speech_speed = {
-                    "ramona": vc.get("speedRamona", 1.0),
-                    "thorsten": vc.get("speedThorsten", 1.0),
-                }
                 self.tts_enabled = vc.get("ttsEnabled", True)
-                self.tts_engine_type = vc.get("ttsEngine", "piper")
                 self.xtts_voice = vc.get("xttsVoice", "")
-                logger.info("Voice config loaded: %s", vc)
+                logger.info("Voice config loaded: tts=%s voice=%s", self.tts_enabled, self.xtts_voice or "default")
         except Exception as e:
             logger.warning("Loading voice config failed: %s", e)
         # Whisper model: config takes precedence, then env/default (medium)
@@ -725,9 +510,6 @@ class ARIABridge:
         logger.info("ARIA Voice Bridge starting...")
         logger.info("=" * 50)

-        # ALWAYS load the voice engine: it renders audio for the app (even without a sound card)
-        self.voice_engine.initialize()
-
         # ALWAYS load STT: it processes audio from the app (needs no sound device)
         self.stt_engine.initialize()

@@ -1050,9 +832,6 @@ class ARIABridge:
             "timestamp": int(asyncio.get_event_loop().time() * 1000),
         })

-        # Select the voice
-        voice_name = requested_voice or self.voice_engine.select_voice(text)
-
         # Unique message ID for audio-cache mapping
         message_id = str(uuid.uuid4())

@@ -1065,7 +844,6 @@ class ARIABridge:
             "payload": {
                 "text": text,
                 "sender": "aria",
-                "voice": voice_name,
                 "messageId": message_id,
                 # Debug: prepared text for TTS (ignored by the app, optionally shown by Diagnostic)
                 "ttsText": tts_text_preview if tts_text_preview != text else "",
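
The `messageId` in the payload above is what lets streamed XTTS audio be matched to a chat message for the app's audio cache. The bridge code in this diff keeps a dictionary from XTTS request ID to message ID and drops the oldest entry once it holds more than 100. A minimal standalone sketch of that pattern follows. The `RequestMap` class and `build_xtts_request` helper are hypothetical names, and the lookup method is an assumption, since the response handler is not part of this excerpt. The payload fields mirror the `xtts_request` message shown in the diff.

```python
import uuid
from typing import Optional


class RequestMap:
    """Bounded map from XTTS request ID to chat message ID (illustrative)."""

    def __init__(self, max_entries: int = 100) -> None:
        self.max_entries = max_entries
        self._map: dict[str, str] = {}

    def register(self, message_id: str) -> str:
        """Create a request ID for a message and evict the oldest if full."""
        request_id = str(uuid.uuid4())
        self._map[request_id] = message_id
        if len(self._map) > self.max_entries:
            # dicts preserve insertion order, so the first key is the oldest
            oldest = next(iter(self._map))
            self._map.pop(oldest, None)
        return request_id

    def lookup(self, request_id: str) -> Optional[str]:
        """Return the message ID for a request, or None if it was evicted."""
        return self._map.get(request_id)


def build_xtts_request(text: str, voice: str, request_id: str, timestamp_ms: int) -> dict:
    """Assemble an xtts_request message with the fields shown in the diff."""
    return {
        "type": "xtts_request",
        "payload": {
            "text": text,
            "voice": voice,        # empty string selects the default voice
            "language": "de",
            "requestId": request_id,
        },
        "timestamp": timestamp_ms,
    }


if __name__ == "__main__":
    requests = RequestMap(max_entries=2)
    rid = requests.register("message-1")
    print(build_xtts_request("Hallo", "", rid, 0))
```

The lookup deliberately leaves the entry in place: a single streamed reply arrives as many chunks, so every chunk of the same request has to resolve to the same message ID.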
@@ -1073,69 +851,37 @@ class ARIABridge:
             "timestamp": int(asyncio.get_event_loop().time() * 1000),
         })

-        # Render TTS audio and send it to the app (if the mode allows it)
-        if getattr(self, 'tts_enabled', True) and should_speak(self.current_mode, is_critical):
+        # TTS via XTTS (XTTS bridge on the gaming PC)
+        if not (getattr(self, 'tts_enabled', True) and should_speak(self.current_mode, is_critical)):
-            tts_engine = getattr(self, 'tts_engine_type', 'piper')
-
-            if tts_engine == "xtts":
-                # XTTS: prepared text (code blocks removed, units spelled out)
-                xtts_voice = getattr(self, 'xtts_voice', '')
-                tts_text = clean_text_for_tts(text)
-                if not tts_text:
-                    logger.info("[core] TTS text empty after cleanup, XTTS skipped")
-                    return
-                try:
-                    xtts_request_id = str(uuid.uuid4())
-                    # Map for xtts_response → app-cache mapping
-                    self._xtts_request_to_message[xtts_request_id] = message_id
-                    if len(self._xtts_request_to_message) > 100:
-                        # Drop the oldest entry so the dict does not grow
-                        oldest = next(iter(self._xtts_request_to_message))
-                        self._xtts_request_to_message.pop(oldest, None)
-                    await self._send_to_rvs({
-                        "type": "xtts_request",
-                        "payload": {
-                            "text": tts_text,
-                            "voice": xtts_voice,
-                            "language": "de",
-                            "requestId": xtts_request_id,
-                        },
-                        "timestamp": int(asyncio.get_event_loop().time() * 1000),
-                    })
-                    logger.info("[core] XTTS request sent (%s): '%s'", xtts_voice or "default", tts_text[:60])
-                except Exception as e:
-                    logger.warning("[core] XTTS request failed: %s, falling back to Piper", e)
-                    # Fallback to Piper
-                    audio_data = self.voice_engine.synthesize(text, voice_name)
-                    if audio_data:
-                        audio_b64 = base64.b64encode(audio_data).decode("ascii")
-                        await self._send_to_rvs({
-                            "type": "audio",
-                            "payload": {"base64": audio_b64, "mimeType": "audio/wav", "voice": voice_name, "messageId": message_id},
-                            "timestamp": int(asyncio.get_event_loop().time() * 1000),
-                        })
-            else:
-                # Piper: render locally
-                audio_data = self.voice_engine.synthesize(text, voice_name)
-                if audio_data:
-                    audio_b64 = base64.b64encode(audio_data).decode("ascii")
-                    await self._send_to_rvs({
-                        "type": "audio",
-                        "payload": {
-                            "base64": audio_b64,
-                            "mimeType": "audio/wav",
-                            "voice": voice_name,
-                            "messageId": message_id,
-                        },
-                        "timestamp": int(asyncio.get_event_loop().time() * 1000),
-                    })
-                    logger.info("[core] TTS audio sent: %d bytes (%s)", len(audio_data), voice_name)
-
-            # Play locally (only if a sound card is present)
-            if self.audio_available:
-                self.voice_engine.speak(text, requested_voice)
-        else:
             logger.info("[core] TTS suppressed (mode: %s)", self.current_mode.config.name)
+            return
+
+        xtts_voice = getattr(self, 'xtts_voice', '')
|
||||||
|
tts_text = tts_text_preview or text
|
||||||
|
if not tts_text:
|
||||||
|
logger.info("[core] TTS-Text leer nach Cleanup — uebersprungen")
|
||||||
|
return
|
||||||
|
try:
|
||||||
|
xtts_request_id = str(uuid.uuid4())
|
||||||
|
# Map fuer audio_pcm/xtts_response → App-Cache Zuordnung
|
||||||
|
self._xtts_request_to_message[xtts_request_id] = message_id
|
||||||
|
if len(self._xtts_request_to_message) > 100:
|
||||||
|
oldest = next(iter(self._xtts_request_to_message))
|
||||||
|
self._xtts_request_to_message.pop(oldest, None)
|
||||||
|
await self._send_to_rvs({
|
||||||
|
"type": "xtts_request",
|
||||||
|
"payload": {
|
||||||
|
"text": tts_text,
|
||||||
|
"voice": xtts_voice,
|
||||||
|
"language": "de",
|
||||||
|
"requestId": xtts_request_id,
|
||||||
|
"messageId": message_id,
|
||||||
|
},
|
||||||
|
"timestamp": int(asyncio.get_event_loop().time() * 1000),
|
||||||
|
})
|
||||||
|
logger.info("[core] XTTS-Request gesendet (%s): '%s'", xtts_voice or "default", tts_text[:60])
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("[core] XTTS-Request fehlgeschlagen: %s — kein Audio", e)
|
||||||
|
|
||||||
def _fetch_active_session(self) -> None:
|
def _fetch_active_session(self) -> None:
|
||||||
"""Holt die aktive Session vom Diagnostic-Endpoint."""
|
"""Holt die aktive Session vom Diagnostic-Endpoint."""
|
||||||
|
|
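Note on the hunk above: the request-to-message bookkeeping relies on Python dicts preserving insertion order, so `next(iter(d))` returns the oldest key. Below is a minimal standalone sketch of that pattern. The helper name `remember_request` and the `max_pending` parameter are hypothetical and not part of aria_bridge.py; the cap of 100 is taken from the diff.

```python
import uuid

MAX_PENDING = 100  # same cap as in the hunk above


def remember_request(pending: dict, message_id: str, max_pending: int = MAX_PENDING) -> str:
    """Register a fresh XTTS request ID for message_id; evict the oldest entry when over the cap.

    Hypothetical helper for illustration only, not the bridge's actual API.
    """
    request_id = str(uuid.uuid4())
    pending[request_id] = message_id
    if len(pending) > max_pending:
        # Dicts keep insertion order (guaranteed since Python 3.7),
        # so the first key is the oldest pending request.
        oldest = next(iter(pending))
        pending.pop(oldest, None)
    return request_id
```

With this shape, the auto-speak path and the on-demand `tts_request` path could share one capped map. In the diff itself, only the first path enforces the cap.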
@ -1344,113 +1090,58 @@ class ARIABridge:
|
||||||
return
|
return
|
||||||
|
|
||||||
elif msg_type == "tts_request":
|
elif msg_type == "tts_request":
|
||||||
# App fordert TTS-Audio fuer einen Text an (Play-Button).
|
# App fordert TTS-Audio fuer einen Text an (Play-Button) → immer XTTS.
|
||||||
# Nutze die aktuell konfigurierte Engine (Piper oder XTTS).
|
|
||||||
text = payload.get("text", "")
|
text = payload.get("text", "")
|
||||||
requested_voice = payload.get("voice", "")
|
message_id = payload.get("messageId", "")
|
||||||
message_id = payload.get("messageId", "") # fuer Cache-Zuordnung
|
|
||||||
if not text:
|
if not text:
|
||||||
return
|
return
|
||||||
|
|
||||||
tts_engine = getattr(self, 'tts_engine_type', 'piper')
|
|
||||||
tts_text = clean_text_for_tts(text) or text
|
tts_text = clean_text_for_tts(text) or text
|
||||||
|
xtts_voice = getattr(self, 'xtts_voice', '')
|
||||||
if tts_engine == "xtts":
|
try:
|
||||||
xtts_voice = getattr(self, 'xtts_voice', '')
|
xtts_request_id = str(uuid.uuid4())
|
||||||
try:
|
if message_id:
|
||||||
await self._send_to_rvs({
|
self._xtts_request_to_message[xtts_request_id] = message_id
|
||||||
"type": "xtts_request",
|
await self._send_to_rvs({
|
||||||
"payload": {
|
"type": "xtts_request",
|
||||||
"text": tts_text,
|
"payload": {
|
||||||
"voice": xtts_voice,
|
"text": tts_text,
|
||||||
"language": "de",
|
"voice": xtts_voice,
|
||||||
"requestId": str(uuid.uuid4()),
|
"language": "de",
|
||||||
"messageId": message_id,
|
"requestId": xtts_request_id,
|
||||||
},
|
"messageId": message_id,
|
||||||
"timestamp": int(asyncio.get_event_loop().time() * 1000),
|
},
|
||||||
})
|
"timestamp": int(asyncio.get_event_loop().time() * 1000),
|
||||||
logger.info("[rvs] TTS on-demand via XTTS: '%s'", tts_text[:60])
|
})
|
||||||
except Exception as e:
|
logger.info("[rvs] TTS on-demand via XTTS: '%s'", tts_text[:60])
|
||||||
logger.warning("[rvs] XTTS-Request fehlgeschlagen, Fallback Piper: %s", e)
|
except Exception as e:
|
||||||
tts_engine = "piper"
|
logger.warning("[rvs] TTS on-demand fehlgeschlagen: %s", e)
|
||||||
|
|
||||||
if tts_engine == "piper":
|
|
||||||
voice_name = requested_voice or self.voice_engine.select_voice(text)
|
|
||||||
audio_data = self.voice_engine.synthesize(text, voice_name)
|
|
||||||
if audio_data:
|
|
||||||
audio_b64 = base64.b64encode(audio_data).decode("ascii")
|
|
||||||
try:
|
|
||||||
await self._send_to_rvs({
|
|
||||||
"type": "audio",
|
|
||||||
"payload": {
|
|
||||||
"base64": audio_b64,
|
|
||||||
"mimeType": "audio/wav",
|
|
||||||
"voice": voice_name,
|
|
||||||
"messageId": message_id,
|
|
||||||
},
|
|
||||||
"timestamp": int(asyncio.get_event_loop().time() * 1000),
|
|
||||||
})
|
|
||||||
logger.info("[rvs] TTS on-demand via Piper: %d bytes (%s)", len(audio_data), voice_name)
|
|
||||||
except Exception as e:
|
|
||||||
logger.warning("[rvs] TTS on-demand senden fehlgeschlagen: %s", e)
|
|
||||||
return
|
return
|
||||||
|
|
||||||
elif msg_type == "config":
|
elif msg_type == "config":
|
||||||
# Konfiguration von App/Diagnostic empfangen + persistent speichern
|
# Konfiguration von App/Diagnostic empfangen + persistent speichern
|
||||||
changed = False
|
changed = False
|
||||||
if "defaultVoice" in payload:
|
|
||||||
new_voice = payload["defaultVoice"]
|
|
||||||
if new_voice in self.voice_engine.voices:
|
|
||||||
self.voice_engine.default_voice = new_voice
|
|
||||||
logger.info("[rvs] Standard-Stimme gewechselt: %s", new_voice)
|
|
||||||
changed = True
|
|
||||||
if "highlightVoice" in payload:
|
|
||||||
new_voice = payload["highlightVoice"]
|
|
||||||
if new_voice in self.voice_engine.voices:
|
|
||||||
self.voice_engine.highlight_voice = new_voice
|
|
||||||
logger.info("[rvs] Highlight-Stimme gewechselt: %s", new_voice)
|
|
||||||
changed = True
|
|
||||||
if "ttsEnabled" in payload:
|
if "ttsEnabled" in payload:
|
||||||
self.tts_enabled = bool(payload["ttsEnabled"])
|
self.tts_enabled = bool(payload["ttsEnabled"])
|
||||||
logger.info("[rvs] TTS %s", "aktiviert" if self.tts_enabled else "deaktiviert")
|
logger.info("[rvs] TTS %s", "aktiviert" if self.tts_enabled else "deaktiviert")
|
||||||
changed = True
|
changed = True
|
||||||
if "ttsEngine" in payload:
|
|
||||||
self.tts_engine_type = payload["ttsEngine"]
|
|
||||||
logger.info("[rvs] TTS-Engine: %s", self.tts_engine_type)
|
|
||||||
changed = True
|
|
||||||
if "xttsVoice" in payload:
|
if "xttsVoice" in payload:
|
||||||
self.xtts_voice = payload["xttsVoice"]
|
self.xtts_voice = payload["xttsVoice"]
|
||||||
logger.info("[rvs] XTTS-Stimme: %s", self.xtts_voice)
|
logger.info("[rvs] XTTS-Stimme: %s", self.xtts_voice or "default")
|
||||||
changed = True
|
changed = True
|
||||||
if "speedRamona" in payload:
|
|
||||||
self.voice_engine.speech_speed["ramona"] = max(0.3, min(2.0, float(payload["speedRamona"])))
|
|
||||||
logger.info("[rvs] Speed Ramona: %.1f", self.voice_engine.speech_speed["ramona"])
|
|
||||||
changed = True
|
|
||||||
if "speedThorsten" in payload:
|
|
||||||
self.voice_engine.speech_speed["thorsten"] = max(0.3, min(2.0, float(payload["speedThorsten"])))
|
|
||||||
logger.info("[rvs] Speed Thorsten: %.1f", self.voice_engine.speech_speed["thorsten"])
|
|
||||||
changed = True
|
|
||||||
whisper_reloaded = False
|
|
||||||
if "whisperModel" in payload:
|
if "whisperModel" in payload:
|
||||||
new_model = payload["whisperModel"]
|
new_model = payload["whisperModel"]
|
||||||
if new_model and new_model != self.stt_engine.model_size:
|
if new_model and new_model != self.stt_engine.model_size:
|
||||||
logger.info("[rvs] Whisper-Modell Wechsel: %s -> %s (laedt...)", self.stt_engine.model_size, new_model)
|
logger.info("[rvs] Whisper-Modell Wechsel: %s -> %s (laedt...)", self.stt_engine.model_size, new_model)
|
||||||
loop = asyncio.get_event_loop()
|
loop = asyncio.get_event_loop()
|
||||||
whisper_reloaded = await loop.run_in_executor(None, self.stt_engine.reload, new_model)
|
if await loop.run_in_executor(None, self.stt_engine.reload, new_model):
|
||||||
if whisper_reloaded:
|
|
||||||
changed = True
|
changed = True
|
||||||
# Persistent speichern in Shared Volume
|
# Persistent speichern in Shared Volume
|
||||||
if changed:
|
if changed:
|
||||||
try:
|
try:
|
||||||
os.makedirs("/shared/config", exist_ok=True)
|
os.makedirs("/shared/config", exist_ok=True)
|
||||||
config_data = {
|
config_data = {
|
||||||
"defaultVoice": self.voice_engine.default_voice,
|
|
||||||
"highlightVoice": self.voice_engine.highlight_voice,
|
|
||||||
"ttsEnabled": getattr(self, "tts_enabled", True),
|
"ttsEnabled": getattr(self, "tts_enabled", True),
|
||||||
"ttsEngine": getattr(self, "tts_engine_type", "piper"),
|
|
||||||
"xttsVoice": getattr(self, "xtts_voice", ""),
|
"xttsVoice": getattr(self, "xtts_voice", ""),
|
||||||
"speedRamona": self.voice_engine.speech_speed.get("ramona", 1.0),
|
|
||||||
"speedThorsten": self.voice_engine.speech_speed.get("thorsten", 1.0),
|
|
||||||
"whisperModel": self.stt_engine.model_size,
|
"whisperModel": self.stt_engine.model_size,
|
||||||
}
|
}
|
||||||
with open("/shared/config/voice_config.json", "w") as f:
|
with open("/shared/config/voice_config.json", "w") as f:
|
||||||
|
|
@ -1459,10 +1150,6 @@ class ARIABridge:
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning("[rvs] Config speichern fehlgeschlagen: %s", e)
|
logger.warning("[rvs] Config speichern fehlgeschlagen: %s", e)
|
||||||
return
|
return
|
||||||
text = payload.get("text", "")
|
|
||||||
if text:
|
|
||||||
logger.info("[rvs] App-Chat: '%s'", text[:80])
|
|
||||||
await self.send_to_core(text, source="app")
|
|
||||||
|
|
||||||
elif msg_type == "mode":
|
elif msg_type == "mode":
|
||||||
# Moduswechsel von der App
|
# Moduswechsel von der App
|
||||||
|
|
|
||||||
|
|
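Note on the config handler above: after this change the bridge reads only `ttsEnabled`, `xttsVoice` and `whisperModel`. The diagnostic server still merges `...existing` into the payload, so legacy Piper keys from an older voice_config.json may still arrive and are simply ignored. Below is a rough sketch of that filtering, assuming a plain dict as state. `apply_voice_config` is a hypothetical name, and the Whisper reload is left out because it depends on the STT engine.

```python
def apply_voice_config(state: dict, payload: dict) -> bool:
    """Apply the TTS-related config keys to state; return True if anything was set.

    Illustrative only: mirrors the two simple branches of the handler in the diff.
    Legacy keys such as defaultVoice or speedRamona are deliberately ignored.
    """
    changed = False
    if "ttsEnabled" in payload:
        state["ttsEnabled"] = bool(payload["ttsEnabled"])
        changed = True
    if "xttsVoice" in payload:
        # An empty string means "use the XTTS default voice".
        state["xttsVoice"] = payload["xttsVoice"]
        changed = True
    return changed
```

As in the diff, `changed` is set whenever a known key is present, even if the value is the same as before, so the config file is rewritten on every such message.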
@ -5,8 +5,7 @@
|
||||||
# STT — Whisper (lokal, keine API noetig)
|
# STT — Whisper (lokal, keine API noetig)
|
||||||
faster-whisper
|
faster-whisper
|
||||||
|
|
||||||
# TTS — Piper (offline, deutsche Stimmen)
|
# TTS: laeuft remote ueber XTTS v2 auf dem Gaming-PC (keine lokalen Deps noetig)
|
||||||
piper-tts
|
|
||||||
|
|
||||||
# WebSocket-Verbindung zu aria-core
|
# WebSocket-Verbindung zu aria-core
|
||||||
websockets
|
websockets
|
||||||
|
|
|
||||||
|
|
@ -317,16 +317,8 @@
|
||||||
<div class="log-box hidden" id="log-server"></div>
|
<div class="log-box hidden" id="log-server"></div>
|
||||||
<div class="log-box hidden" id="log-pipeline"></div>
|
<div class="log-box hidden" id="log-pipeline"></div>
|
||||||
<div class="log-box hidden" id="log-tts" style="padding:12px;">
|
<div class="log-box hidden" id="log-tts" style="padding:12px;">
|
||||||
<h3 style="color:#34C759;margin:0 0 12px;">TTS Diagnose</h3>
|
<h3 style="color:#34C759;margin:0 0 12px;">TTS Diagnose (XTTS)</h3>
|
||||||
<div style="display:grid;grid-template-columns:1fr 1fr;gap:8px;margin-bottom:12px;">
|
<div style="display:grid;grid-template-columns:1fr 1fr;gap:8px;margin-bottom:12px;">
|
||||||
<div style="background:#1E1E2E;padding:8px;border-radius:6px;">
|
|
||||||
<div style="color:#8888AA;font-size:10px;text-transform:uppercase;">Standard-Stimme</div>
|
|
||||||
<div style="color:#fff;font-size:14px;margin-top:4px;" id="tts-default-voice">Ramona</div>
|
|
||||||
</div>
|
|
||||||
<div style="background:#1E1E2E;padding:8px;border-radius:6px;">
|
|
||||||
<div style="color:#8888AA;font-size:10px;text-transform:uppercase;">Highlight-Stimme</div>
|
|
||||||
<div style="color:#fff;font-size:14px;margin-top:4px;" id="tts-highlight-voice">Thorsten</div>
|
|
||||||
</div>
|
|
||||||
<div style="background:#1E1E2E;padding:8px;border-radius:6px;">
|
<div style="background:#1E1E2E;padding:8px;border-radius:6px;">
|
||||||
<div style="color:#8888AA;font-size:10px;text-transform:uppercase;">Status</div>
|
<div style="color:#8888AA;font-size:10px;text-transform:uppercase;">Status</div>
|
||||||
<div style="font-size:14px;margin-top:4px;" id="tts-status">Unbekannt</div>
|
<div style="font-size:14px;margin-top:4px;" id="tts-status">Unbekannt</div>
|
||||||
|
|
@ -340,8 +332,7 @@
|
||||||
<input type="text" id="tts-test-text" value="Hallo Stefan, ich bin ARIA." placeholder="Test-Text..." style="background:#1E1E2E;border:1px solid #2A2A3E;border-radius:6px;padding:8px;color:#fff;font-size:13px;width:100%;box-sizing:border-box;">
|
<input type="text" id="tts-test-text" value="Hallo Stefan, ich bin ARIA." placeholder="Test-Text..." style="background:#1E1E2E;border:1px solid #2A2A3E;border-radius:6px;padding:8px;color:#fff;font-size:13px;width:100%;box-sizing:border-box;">
|
||||||
</div>
|
</div>
|
||||||
<div style="display:flex;gap:8px;">
|
<div style="display:flex;gap:8px;">
|
||||||
<button class="btn" onclick="testTTS('ramona')" style="flex:1;">Ramona testen</button>
|
<button class="btn" onclick="testTTS('')" style="flex:1;">XTTS testen</button>
|
||||||
<button class="btn" onclick="testTTS('thorsten')" style="flex:1;">Thorsten testen</button>
|
|
||||||
<button class="btn secondary" onclick="checkTTSStatus()" style="flex:1;">Status pruefen</button>
|
<button class="btn secondary" onclick="checkTTSStatus()" style="flex:1;">Status pruefen</button>
|
||||||
</div>
|
</div>
|
||||||
<div id="tts-log" style="margin-top:12px;max-height:200px;overflow-y:auto;font-size:11px;font-family:monospace;color:#8888AA;"></div>
|
<div id="tts-log" style="margin-top:12px;max-height:200px;overflow-y:auto;font-size:11px;font-family:monospace;color:#8888AA;"></div>
|
||||||
|
|
@ -413,94 +404,43 @@
|
||||||
<div class="settings-section">
|
<div class="settings-section">
|
||||||
<h2>Sprachausgabe</h2>
|
<h2>Sprachausgabe</h2>
|
||||||
<div class="card" style="max-width:500px;">
|
<div class="card" style="max-width:500px;">
|
||||||
<!-- TTS aktiv (global fuer alle Engines) -->
|
<!-- TTS aktiv (global) -->
|
||||||
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
||||||
<label style="color:#8888AA;font-size:12px;">TTS aktiv:</label>
|
<label style="color:#8888AA;font-size:12px;">TTS aktiv:</label>
|
||||||
<label class="toggle"><input type="checkbox" id="diag-tts-enabled" checked onchange="sendVoiceConfig()"><span class="slider"></span></label>
|
<label class="toggle"><input type="checkbox" id="diag-tts-enabled" checked onchange="sendVoiceConfig()"><span class="slider"></span></label>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<!-- TTS Engine Auswahl -->
|
<!-- XTTS Stimme -->
|
||||||
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
||||||
<label style="color:#8888AA;font-size:12px;">TTS Engine:</label>
|
<label style="color:#8888AA;font-size:12px;">XTTS Stimme:</label>
|
||||||
<select id="diag-tts-engine" onchange="sendVoiceConfig();toggleXTTSPanel()" style="background:#1E1E2E;color:#fff;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;font-size:13px;">
|
<select id="diag-xtts-voice" onchange="sendVoiceConfig()" style="background:#1E1E2E;color:#fff;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;font-size:13px;">
|
||||||
<option value="piper">Piper (lokal, CPU, schnell)</option>
|
<option value="">Standard (XTTS Default)</option>
|
||||||
<option value="xtts">XTTS v2 (remote, GPU, natuerlich)</option>
|
|
||||||
</select>
|
</select>
|
||||||
|
<button class="btn secondary" onclick="loadXTTSVoices()" style="padding:4px 10px;font-size:11px;">Laden</button>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<!-- Piper Stimmen (nur bei Engine=piper) -->
|
<!-- Voice Cloning -->
|
||||||
<div id="piper-panel">
|
<div style="background:#1E1E2E;border-radius:8px;padding:12px;margin-top:8px;">
|
||||||
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
<div style="color:#0096FF;font-size:13px;font-weight:600;margin-bottom:8px;">Stimme klonen</div>
|
||||||
<label style="color:#8888AA;font-size:12px;">Standard-Stimme:</label>
|
<div style="color:#8888AA;font-size:11px;margin-bottom:8px;">
|
||||||
<select id="diag-default-voice" onchange="sendVoiceConfig()" style="background:#1E1E2E;color:#fff;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;font-size:13px;">
|
Lade ein oder mehrere Audio-Samples hoch (WAV/MP3, min. 6-10 Sekunden).
|
||||||
<option value="ramona">Ramona (weiblich)</option>
|
Mehrere Dateien werden automatisch zusammengefuegt.
|
||||||
<option value="thorsten">Thorsten (maennlich)</option>
|
|
||||||
</select>
|
|
||||||
</div>
|
|
||||||
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
|
||||||
<label style="color:#8888AA;font-size:12px;">Highlight-Stimme:</label>
|
|
||||||
<select id="diag-highlight-voice" onchange="sendVoiceConfig()" style="background:#1E1E2E;color:#fff;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;font-size:13px;">
|
|
||||||
<option value="thorsten">Thorsten (maennlich)</option>
|
|
||||||
<option value="ramona">Ramona (weiblich)</option>
|
|
||||||
</select>
|
|
||||||
</div>
|
|
||||||
<div style="margin-bottom:4px;">
|
|
||||||
<label style="color:#8888AA;font-size:12px;">Ramona Speed: <span id="speed-ramona-label">1.0x</span></label>
|
|
||||||
</div>
|
|
||||||
<div style="display:flex;align-items:center;gap:8px;margin-bottom:12px;">
|
|
||||||
<span style="color:#555570;font-size:11px;">0.5x</span>
|
|
||||||
<input type="range" id="diag-speed-ramona" min="0.5" max="2.0" step="0.1" value="1.0"
|
|
||||||
oninput="document.getElementById('speed-ramona-label').textContent=this.value+'x'"
|
|
||||||
onchange="sendVoiceConfig()"
|
|
||||||
style="flex:1;accent-color:#0096FF;">
|
|
||||||
<span style="color:#555570;font-size:11px;">2.0x</span>
|
|
||||||
</div>
|
|
||||||
<div style="margin-bottom:4px;">
|
|
||||||
<label style="color:#8888AA;font-size:12px;">Thorsten Speed: <span id="speed-thorsten-label">1.0x</span></label>
|
|
||||||
</div>
|
|
||||||
<div style="display:flex;align-items:center;gap:8px;">
|
|
||||||
<span style="color:#555570;font-size:11px;">0.5x</span>
|
|
||||||
<input type="range" id="diag-speed-thorsten" min="0.5" max="2.0" step="0.1" value="1.0"
|
|
||||||
oninput="document.getElementById('speed-thorsten-label').textContent=this.value+'x'"
|
|
||||||
onchange="sendVoiceConfig()"
|
|
||||||
style="flex:1;accent-color:#0096FF;">
|
|
||||||
<span style="color:#555570;font-size:11px;">2.0x</span>
|
|
||||||
</div>
|
|
||||||
</div><!-- /piper-panel -->
|
|
||||||
|
|
||||||
<!-- XTTS Panel (nur bei Engine=xtts) -->
|
|
||||||
<div id="xtts-panel" style="display:none;">
|
|
||||||
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
|
|
||||||
<label style="color:#8888AA;font-size:12px;">XTTS Stimme:</label>
|
|
||||||
<select id="diag-xtts-voice" onchange="sendVoiceConfig()" style="background:#1E1E2E;color:#fff;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;font-size:13px;">
|
|
||||||
<option value="">Standard (XTTS Default)</option>
|
|
||||||
</select>
|
|
||||||
<button class="btn secondary" onclick="loadXTTSVoices()" style="padding:4px 10px;font-size:11px;">Laden</button>
|
|
||||||
</div>
|
</div>
|
||||||
|
<div style="margin-bottom:8px;">
|
||||||
<!-- Voice Cloning -->
|
<input type="text" id="xtts-clone-name" placeholder="Name fuer die Stimme..." style="background:#0D0D1A;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;color:#fff;font-size:13px;width:100%;box-sizing:border-box;">
|
||||||
<div style="background:#1E1E2E;border-radius:8px;padding:12px;margin-top:8px;">
|
|
||||||
<div style="color:#0096FF;font-size:13px;font-weight:600;margin-bottom:8px;">Stimme klonen</div>
|
|
||||||
<div style="color:#8888AA;font-size:11px;margin-bottom:8px;">
|
|
||||||
Lade ein oder mehrere Audio-Samples hoch (WAV/MP3, min. 6-10 Sekunden).
|
|
||||||
Mehrere Dateien werden automatisch zusammengefuegt.
|
|
||||||
</div>
|
|
||||||
<div style="margin-bottom:8px;">
|
|
||||||
<input type="text" id="xtts-clone-name" placeholder="Name fuer die Stimme..." style="background:#0D0D1A;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;color:#fff;font-size:13px;width:100%;box-sizing:border-box;">
|
|
||||||
</div>
|
|
||||||
<div style="margin-bottom:8px;">
|
|
||||||
<input type="file" id="xtts-clone-files" accept="audio/*" multiple style="color:#8888AA;font-size:12px;">
|
|
||||||
</div>
|
|
||||||
<div style="display:flex;gap:8px;">
|
|
||||||
<button class="btn" onclick="uploadVoiceSamples()" style="flex:1;">Stimme erstellen</button>
|
|
||||||
</div>
|
|
||||||
<div id="xtts-clone-status" style="font-size:11px;color:#555570;margin-top:6px;"></div>
|
|
||||||
</div>
|
</div>
|
||||||
|
<div style="margin-bottom:8px;">
|
||||||
<!-- XTTS Status -->
|
<input type="file" id="xtts-clone-files" accept="audio/*" multiple style="color:#8888AA;font-size:12px;">
|
||||||
<div style="margin-top:8px;font-size:11px;color:#555570;" id="xtts-status">
|
|
||||||
XTTS-Server: Nicht verbunden (starte xtts/ auf dem Gaming-PC)
|
|
||||||
</div>
|
</div>
|
||||||
|
<div style="display:flex;gap:8px;">
|
||||||
|
<button class="btn" onclick="uploadVoiceSamples()" style="flex:1;">Stimme erstellen</button>
|
||||||
|
</div>
|
||||||
|
<div id="xtts-clone-status" style="font-size:11px;color:#555570;margin-top:6px;"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- XTTS Status -->
|
||||||
|
<div style="margin-top:8px;font-size:11px;color:#555570;" id="xtts-status">
|
||||||
|
XTTS-Server: Nicht verbunden (starte xtts/ auf dem Gaming-PC)
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
@ -798,11 +738,8 @@
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
if (msg.type === 'tts_status') {
|
if (msg.type === 'tts_status') {
|
||||||
document.getElementById('tts-default-voice').textContent = msg.defaultVoice || '?';
|
|
||||||
document.getElementById('tts-highlight-voice').textContent = msg.highlightVoice || '?';
|
|
||||||
document.getElementById('tts-status').textContent = msg.ok ? 'OK' : 'Fehler';
|
document.getElementById('tts-status').textContent = msg.ok ? 'OK' : 'Fehler';
|
||||||
document.getElementById('tts-status').style.color = msg.ok ? '#34C759' : '#FF3B30';
|
document.getElementById('tts-status').style.color = msg.ok ? '#34C759' : '#FF3B30';
|
||||||
if (msg.voices) ttsLog(`Stimmen: ${msg.voices.join(', ')}`);
|
|
||||||
if (msg.error) { document.getElementById('tts-last-error').textContent = msg.error; ttsLog(`Fehler: ${msg.error}`); }
|
if (msg.error) { document.getElementById('tts-last-error').textContent = msg.error; ttsLog(`Fehler: ${msg.error}`); }
|
||||||
else { document.getElementById('tts-last-error').textContent = '-'; ttsLog('TTS OK'); }
|
else { document.getElementById('tts-last-error').textContent = '-'; ttsLog('TTS OK'); }
|
||||||
return;
|
return;
|
||||||
|
|
@ -835,16 +772,7 @@
|
||||||
}
|
}
|
||||||
|
|
||||||
if (msg.type === 'voice_config') {
|
if (msg.type === 'voice_config') {
|
||||||
document.getElementById('diag-default-voice').value = msg.defaultVoice || 'ramona';
|
|
||||||
document.getElementById('diag-highlight-voice').value = msg.highlightVoice || 'thorsten';
|
|
||||||
document.getElementById('diag-tts-enabled').checked = msg.ttsEnabled !== false;
|
document.getElementById('diag-tts-enabled').checked = msg.ttsEnabled !== false;
|
||||||
const sr = msg.speedRamona || 1.0;
|
|
||||||
const st = msg.speedThorsten || 1.0;
|
|
||||||
document.getElementById('diag-speed-ramona').value = sr;
|
|
||||||
document.getElementById('speed-ramona-label').textContent = sr + 'x';
|
|
||||||
document.getElementById('diag-speed-thorsten').value = st;
|
|
||||||
document.getElementById('speed-thorsten-label').textContent = st + 'x';
|
|
||||||
document.getElementById('diag-tts-engine').value = msg.ttsEngine || 'piper';
|
|
||||||
// XTTS-Voice setzen — Option hinzufuegen falls nicht vorhanden
|
// XTTS-Voice setzen — Option hinzufuegen falls nicht vorhanden
|
||||||
const xttsSelect = document.getElementById('diag-xtts-voice');
|
const xttsSelect = document.getElementById('diag-xtts-voice');
|
||||||
const xttsVoice = msg.xttsVoice || '';
|
const xttsVoice = msg.xttsVoice || '';
|
||||||
|
|
@ -855,7 +783,6 @@
|
||||||
xttsSelect.appendChild(opt);
|
xttsSelect.appendChild(opt);
|
||||||
}
|
}
|
||||||
xttsSelect.value = xttsVoice;
|
xttsSelect.value = xttsVoice;
|
||||||
toggleXTTSPanel();
|
|
||||||
// Whisper-Modell wiederherstellen (falls gesetzt)
|
// Whisper-Modell wiederherstellen (falls gesetzt)
|
||||||
if (msg.whisperModel) {
|
if (msg.whisperModel) {
|
||||||
const wSel = document.getElementById('diag-whisper-model');
|
const wSel = document.getElementById('diag-whisper-model');
|
||||||
|
|
@ -1429,10 +1356,9 @@
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── XTTS Panel ─────────────────────────────
|
// ── XTTS Panel ─────────────────────────────
|
||||||
|
// Legacy no-op (XTTS ist jetzt die einzige Engine, kein Panel-Toggle noetig)
|
||||||
function toggleXTTSPanel() {
|
function toggleXTTSPanel() {
|
||||||
const engine = document.getElementById('diag-tts-engine').value;
|
void 0;
|
||||||
document.getElementById('piper-panel').style.display = engine === 'piper' ? 'block' : 'none';
|
|
||||||
document.getElementById('xtts-panel').style.display = engine === 'xtts' ? 'block' : 'none';
|
|
||||||
if (engine === 'xtts') loadXTTSVoices();
|
// 'engine' is no longer defined here; the stub intentionally does nothing
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -1540,15 +1466,10 @@
|
||||||
|
|
||||||
// ── Stimmen-Config ──────────────────────────
|
// ── Stimmen-Config ──────────────────────────
|
||||||
function sendVoiceConfig() {
|
function sendVoiceConfig() {
|
||||||
const defaultVoice = document.getElementById('diag-default-voice').value;
|
|
||||||
const highlightVoice = document.getElementById('diag-highlight-voice').value;
|
|
||||||
const ttsEnabled = document.getElementById('diag-tts-enabled').checked;
|
const ttsEnabled = document.getElementById('diag-tts-enabled').checked;
|
||||||
const speedRamona = parseFloat(document.getElementById('diag-speed-ramona').value);
|
|
||||||
const speedThorsten = parseFloat(document.getElementById('diag-speed-thorsten').value);
|
|
||||||
const ttsEngine = document.getElementById('diag-tts-engine').value;
|
|
||||||
const xttsVoice = document.getElementById('diag-xtts-voice').value;
|
const xttsVoice = document.getElementById('diag-xtts-voice').value;
|
||||||
const whisperModel = document.getElementById('diag-whisper-model').value;
|
const whisperModel = document.getElementById('diag-whisper-model').value;
|
||||||
send({ action: 'send_voice_config', defaultVoice, highlightVoice, ttsEnabled, speedRamona, speedThorsten, ttsEngine, xttsVoice, whisperModel });
|
send({ action: 'send_voice_config', ttsEnabled, xttsVoice, whisperModel });
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Passwort-Feld Anzeigen/Verbergen ─────────────────────
|
// ── Passwort-Feld Anzeigen/Verbergen ─────────────────────
|
||||||
|
|
|
||||||
|
|
@ -1343,18 +1343,12 @@ wss.on("connection", (ws) => {
|
||||||
handleGetVoiceConfig(ws);
|
handleGetVoiceConfig(ws);
|
||||||
} else if (msg.action === "send_voice_config") {
|
} else if (msg.action === "send_voice_config") {
|
||||||
// Stimmen-Config persistent speichern + an Bridge via RVS senden
|
// Stimmen-Config persistent speichern + an Bridge via RVS senden
|
||||||
// Bestehende Config lesen um Felder zu mergen die dieser Call nicht setzt
|
|
||||||
let existing = {};
|
let existing = {};
|
||||||
try { existing = JSON.parse(fs.readFileSync("/shared/config/voice_config.json", "utf-8")); } catch {}
|
try { existing = JSON.parse(fs.readFileSync("/shared/config/voice_config.json", "utf-8")); } catch {}
|
||||||
const voiceConfig = {
|
const voiceConfig = {
|
||||||
...existing,
|
...existing,
|
||||||
defaultVoice: msg.defaultVoice || "ramona",
|
|
||||||
highlightVoice: msg.highlightVoice || "thorsten",
|
|
||||||
ttsEnabled: msg.ttsEnabled !== false,
|
ttsEnabled: msg.ttsEnabled !== false,
|
||||||
ttsEngine: msg.ttsEngine || "piper",
|
|
||||||
xttsVoice: msg.xttsVoice || "",
|
xttsVoice: msg.xttsVoice || "",
|
||||||
speedRamona: msg.speedRamona || 1.0,
|
|
||||||
speedThorsten: msg.speedThorsten || 1.0,
|
|
||||||
};
|
};
|
||||||
if (msg.whisperModel !== undefined) voiceConfig.whisperModel = msg.whisperModel;
|
if (msg.whisperModel !== undefined) voiceConfig.whisperModel = msg.whisperModel;
|
||||||
try {
|
try {
|
||||||
|
|
@ -1362,13 +1356,13 @@ wss.on("connection", (ws) => {
|
||||||
fs.writeFileSync("/shared/config/voice_config.json", JSON.stringify(voiceConfig, null, 2));
|
fs.writeFileSync("/shared/config/voice_config.json", JSON.stringify(voiceConfig, null, 2));
|
||||||
} catch {}
|
} catch {}
|
||||||
sendToRVS_raw({ type: "config", payload: voiceConfig, timestamp: Date.now() });
|
sendToRVS_raw({ type: "config", payload: voiceConfig, timestamp: Date.now() });
|
||||||
log("info", "server", `Voice-Config gespeichert+gesendet: default=${voiceConfig.defaultVoice}, whisper=${voiceConfig.whisperModel || "-"}`);
|
log("info", "server", `Voice-Config gespeichert: xttsVoice=${voiceConfig.xttsVoice || "default"}, whisper=${voiceConfig.whisperModel || "-"}`);
|
||||||
} else if (msg.action === "get_triggers") {
|
} else if (msg.action === "get_triggers") {
|
||||||
handleGetTriggers(ws);
|
handleGetTriggers(ws);
|
||||||
} else if (msg.action === "save_triggers") {
|
} else if (msg.action === "save_triggers") {
|
||||||
handleSaveTriggers(ws, msg.triggers || []);
|
handleSaveTriggers(ws, msg.triggers || []);
|
||||||
} else if (msg.action === "test_tts") {
|
} else if (msg.action === "test_tts") {
|
||||||
handleTestTTS(ws, msg.voice || "ramona", msg.text || "Test");
|
handleTestTTS(ws, msg.text || "Test");
|
||||||
} else if (msg.action === "check_tts") {
|
} else if (msg.action === "check_tts") {
|
||||||
handleCheckTTS(ws);
|
handleCheckTTS(ws);
|
||||||
} else if (msg.action === "check_desktop") {
|
} else if (msg.action === "check_desktop") {
|
||||||
|
|
@ -1508,32 +1502,21 @@ function handleGetVoiceConfig(clientWs) {
|
||||||
const config = JSON.parse(fs.readFileSync(configPath, "utf-8"));
|
const config = JSON.parse(fs.readFileSync(configPath, "utf-8"));
|
||||||
clientWs.send(JSON.stringify({ type: "voice_config", ...config }));
|
clientWs.send(JSON.stringify({ type: "voice_config", ...config }));
|
||||||
} else {
|
} else {
|
||||||
clientWs.send(JSON.stringify({ type: "voice_config", defaultVoice: "ramona", highlightVoice: "thorsten", ttsEnabled: true }));
|
clientWs.send(JSON.stringify({ type: "voice_config", ttsEnabled: true, xttsVoice: "" }));
|
||||||
}
|
}
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
clientWs.send(JSON.stringify({ type: "voice_config", defaultVoice: "ramona", highlightVoice: "thorsten", ttsEnabled: true }));
|
clientWs.send(JSON.stringify({ type: "voice_config", ttsEnabled: true, xttsVoice: "" }));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
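The new fallback reply carries only `ttsEnabled` and `xttsVoice`, but a `voice_config.json` written before this commit may still contain Piper keys such as `defaultVoice`. Below is a minimal sketch of reducing a stored config to the three keys the commit message says remain. `normalizeVoiceConfig`, `VOICE_CONFIG_DEFAULTS` and `ALLOWED_VOICE_KEYS` are hypothetical names and not part of server.js.

```javascript
// Hypothetical helper, not part of server.js: keeps only the keys that
// survive the Piper removal and fills in the defaults used by the fallback.
const VOICE_CONFIG_DEFAULTS = { ttsEnabled: true, xttsVoice: "" };
const ALLOWED_VOICE_KEYS = ["ttsEnabled", "xttsVoice", "whisperModel"];

function normalizeVoiceConfig(raw) {
  const out = { ...VOICE_CONFIG_DEFAULTS };
  for (const key of ALLOWED_VOICE_KEYS) {
    // Copy a key only if the stored config actually defines it.
    if (raw && raw[key] !== undefined) out[key] = raw[key];
  }
  return out;
}
```

Applied to the parsed file before `clientWs.send(...)`, this would drop leftover `defaultVoice` / `highlightVoice` entries without requiring a manual migration.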
|
|
||||||
// ── Highlight-Trigger ─────────────────────────────────
|
// ── Highlight-Trigger (legacy UI, no longer evaluated since Piper was removed) ─
|
||||||
|
|
||||||
const TRIGGERS_FILE = "/shared/config/highlight_triggers.json";
|
const TRIGGERS_FILE = "/shared/config/highlight_triggers.json";
|
||||||
|
|
||||||
async function handleGetTriggers(clientWs) {
|
async function handleGetTriggers(clientWs) {
|
||||||
try {
|
try {
|
||||||
// Read from the shared volume first, then fall back to the bridge defaults
|
const triggers = fs.existsSync(TRIGGERS_FILE)
|
||||||
let triggers;
|
? JSON.parse(fs.readFileSync(TRIGGERS_FILE, "utf-8"))
|
||||||
if (fs.existsSync(TRIGGERS_FILE)) {
|
: [];
|
||||||
triggers = JSON.parse(fs.readFileSync(TRIGGERS_FILE, "utf-8"));
|
|
||||||
} else {
|
|
||||||
// Read the defaults from the bridge
|
|
||||||
const result = await dockerExec("aria-bridge", `python3 -c "
|
|
||||||
import sys; sys.path.insert(0,'/app')
|
|
||||||
from aria_bridge import EPIC_TRIGGERS
|
|
||||||
print('\\n'.join(EPIC_TRIGGERS))
|
|
||||||
"`);
|
|
||||||
triggers = result.trim().split("\n").filter(t => t);
|
|
||||||
}
|
|
||||||
clientWs.send(JSON.stringify({ type: "trigger_list", triggers }));
|
clientWs.send(JSON.stringify({ type: "trigger_list", triggers }));
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
clientWs.send(JSON.stringify({ type: "trigger_list", triggers: [], error: err.message }));
|
clientWs.send(JSON.stringify({ type: "trigger_list", triggers: [], error: err.message }));
|
||||||
|
|
@ -1542,74 +1525,40 @@ print('\\n'.join(EPIC_TRIGGERS))
|
||||||
|
|
||||||
async function handleSaveTriggers(clientWs, triggers) {
|
async function handleSaveTriggers(clientWs, triggers) {
|
||||||
try {
|
try {
|
||||||
// Save to the shared volume (readable by the bridge)
|
|
||||||
fs.mkdirSync("/shared/config", { recursive: true });
|
fs.mkdirSync("/shared/config", { recursive: true });
|
||||||
fs.writeFileSync(TRIGGERS_FILE, JSON.stringify(triggers, null, 2));
|
fs.writeFileSync(TRIGGERS_FILE, JSON.stringify(triggers, null, 2));
|
||||||
log("info", "server", `${triggers.length} Highlight-Trigger gespeichert`);
|
log("info", "server", `${triggers.length} Highlight-Trigger gespeichert`);
|
||||||
// Notify the bridge (loaded on the next start)
|
|
||||||
clientWs.send(JSON.stringify({ type: "trigger_list", triggers }));
|
clientWs.send(JSON.stringify({ type: "trigger_list", triggers }));
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
log("error", "server", `Trigger speichern fehlgeschlagen: ${err.message}`);
|
log("error", "server", `Trigger speichern fehlgeschlagen: ${err.message}`);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── TTS Diagnostics ───────────────────────────────────
|
// ── TTS Diagnostics (XTTS) ────────────────────────────
|
||||||
async function handleTestTTS(clientWs, voice, text) {
|
async function handleTestTTS(clientWs, text) {
|
||||||
try {
|
try {
|
||||||
log("info", "server", `TTS-Test: ${voice} — "${text}"`);
|
log("info", "server", `TTS-Test via XTTS: "${text}"`);
|
||||||
const result = await dockerExec("aria-bridge", `python3 -c "
|
// Via RVS to the XTTS bridge: xtts_request with the test text
|
||||||
import time, sys
|
const requestId = crypto.randomUUID();
|
||||||
sys.path.insert(0, '/app')
|
sendToRVS_raw({
|
||||||
from piper import PiperVoice
|
type: "xtts_request",
|
||||||
import wave, tempfile, os
|
payload: { text, language: "de", requestId, voice: "" },
|
||||||
voices = {'ramona': '/voices/de_DE-ramona-low.onnx', 'thorsten': '/voices/de_DE-thorsten-high.onnx'}
|
timestamp: Date.now(),
|
||||||
path = voices.get('${voice}')
|
});
|
||||||
if not path or not os.path.exists(path):
|
clientWs.send(JSON.stringify({ type: "tts_result", ok: true, duration: "pending", size: "?" }));
|
||||||
print('FEHLER: Stimme nicht gefunden')
|
|
||||||
sys.exit(1)
|
|
||||||
v = PiperVoice.load(path)
|
|
||||||
start = time.time()
|
|
||||||
tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
|
|
||||||
with wave.open(tmp.name, 'wb') as wf:
|
|
||||||
wf.setnchannels(1)
|
|
||||||
wf.setsampwidth(2)
|
|
||||||
wf.setframerate(v.config.sample_rate)
|
|
||||||
v.synthesize('${text.replace(/'/g, "\\'")}', wf)
|
|
||||||
size = os.path.getsize(tmp.name)
|
|
||||||
dur = int((time.time() - start) * 1000)
|
|
||||||
os.unlink(tmp.name)
|
|
||||||
print(f'OK:{dur}:{size}')
|
|
||||||
"`);
|
|
||||||
const parts = result.trim().split(":");
|
|
||||||
if (parts[0] === "OK") {
|
|
||||||
clientWs.send(JSON.stringify({ type: "tts_result", ok: true, voice, duration: parts[1], size: parts[2] }));
|
|
||||||
} else {
|
|
||||||
clientWs.send(JSON.stringify({ type: "tts_result", ok: false, voice, error: result.trim() }));
|
|
||||||
}
|
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
clientWs.send(JSON.stringify({ type: "tts_result", ok: false, voice, error: err.message }));
|
clientWs.send(JSON.stringify({ type: "tts_result", ok: false, error: err.message }));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
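The new `handleTestTTS` replies with `duration: "pending"` immediately after sending `xtts_request`, so the UI never learns whether synthesis succeeded. One way to close that gap would be to track each `requestId` until a reply arrives or a timeout fires. The sketch below assumes the XTTS bridge echoes `requestId` in its reply, which this diff does not confirm; `createPendingRequests` is a hypothetical helper.

```javascript
// Hypothetical helper, not part of server.js: tracks outstanding XTTS
// requests by requestId so a later reply can be matched to its caller.
const crypto = require("node:crypto");

function createPendingRequests() {
  const pending = new Map();
  return {
    // Registers a new request; the promise rejects if no reply arrives in time.
    create(timeoutMs) {
      const requestId = crypto.randomUUID();
      const promise = new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
          pending.delete(requestId);
          reject(new Error(`XTTS timeout after ${timeoutMs} ms`));
        }, timeoutMs);
        pending.set(requestId, { resolve, timer });
      });
      return { requestId, promise };
    },
    // Call from the RVS message handler; returns false for unknown ids.
    settle(requestId, result) {
      const entry = pending.get(requestId);
      if (!entry) return false;
      clearTimeout(entry.timer);
      pending.delete(requestId);
      entry.resolve(result);
      return true;
    },
    size() {
      return pending.size;
    },
  };
}
```

With such a registry, `handleTestTTS` could send `tts_result` only after the promise settles, reporting `ok: false` on timeout instead of an unconditional `ok: true`.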
|
|
||||||
async function handleCheckTTS(clientWs) {
|
async function handleCheckTTS(clientWs) {
|
||||||
try {
|
try {
|
||||||
const result = await dockerExec("aria-bridge", `python3 -c "
|
// Query the XTTS status via RVS (xtts_list_voices)
|
||||||
import os, json
|
sendToRVS_raw({ type: "xtts_list_voices", payload: {}, timestamp: Date.now() });
|
||||||
voices = {}
|
|
||||||
for name, path in [('ramona', '/voices/de_DE-ramona-low.onnx'), ('thorsten', '/voices/de_DE-thorsten-high.onnx')]:
|
|
||||||
voices[name] = os.path.exists(path)
|
|
||||||
print(json.dumps(voices))
|
|
||||||
"`);
|
|
||||||
const voices = JSON.parse(result.trim());
|
|
||||||
const available = Object.entries(voices).filter(([,v]) => v).map(([k]) => k);
|
|
||||||
const missing = Object.entries(voices).filter(([,v]) => !v).map(([k]) => k);
|
|
||||||
clientWs.send(JSON.stringify({
|
clientWs.send(JSON.stringify({
|
||||||
type: "tts_status",
|
type: "tts_status",
|
||||||
ok: missing.length === 0,
|
ok: true,
|
||||||
voices: available,
|
error: null,
|
||||||
defaultVoice: "ramona",
|
|
||||||
highlightVoice: "thorsten",
|
|
||||||
error: missing.length > 0 ? `Fehlend: ${missing.join(", ")}` : null,
|
|
||||||
}));
|
}));
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
clientWs.send(JSON.stringify({ type: "tts_status", ok: false, error: err.message }));
|
clientWs.send(JSON.stringify({ type: "tts_status", ok: false, error: err.message }));
|
||||||
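As written, the new `handleCheckTTS` sends `xtts_list_voices` and then reports `ok: true` without waiting for an answer, so the status cannot reflect an offline bridge. A minimal sketch of deriving `tts_status` from the reply instead follows; the reply shape (an object with a `voices` array) is an assumption and `buildTtsStatus` is a hypothetical name.

```javascript
// Hypothetical helper, not part of server.js: builds the tts_status message
// from an xtts_list_voices reply. Pass null when no reply arrived in time.
function buildTtsStatus(reply) {
  if (!reply) {
    return { type: "tts_status", ok: false, voices: [], error: "XTTS bridge did not answer" };
  }
  // Assumed reply shape: { voices: [...] }; anything else counts as an empty list.
  const voices = Array.isArray(reply.voices) ? reply.voices : [];
  return { type: "tts_status", ok: true, voices, error: null };
}
```

The result can be passed straight to `clientWs.send(JSON.stringify(...))`; the field names match those already used by the `tts_status` messages in this diff.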
|
|
|
||||||
|
|
@ -72,7 +72,6 @@ services:
|
||||||
- aria
|
- aria
|
||||||
network_mode: "service:aria" # Shares the network with aria-core → localhost:18789
|
network_mode: "service:aria" # Shares the network with aria-core → localhost:18789
|
||||||
volumes:
|
volumes:
|
||||||
- ./aria-data/voices:/voices:ro # TTS voices
|
|
||||||
- ./aria-data/config/aria.env:/config/aria.env
|
- ./aria-data/config/aria.env:/config/aria.env
|
||||||
- aria-shared:/shared # Shared volume for file exchange (Bridge <> Core)
|
- aria-shared:/shared # Shared volume for file exchange (Bridge <> Core)
|
||||||
# Audio access
|
# Audio access
|
||||||
|
|
|
||||||
|
|
@ -1,32 +0,0 @@
|
||||||
#!/bin/bash
|
|
||||||
# ════════════════════════════════════════════════
|
|
||||||
# ARIA: download Piper voices
|
|
||||||
# Ramona (everyday) + Thorsten (epic moments)
|
|
||||||
# ════════════════════════════════════════════════
|
|
||||||
|
|
||||||
set -e
|
|
||||||
|
|
||||||
VOICES_DIR="aria-data/voices"
|
|
||||||
BASE_URL="https://huggingface.co/rhasspy/piper-voices/resolve/main/de/de_DE"
|
|
||||||
|
|
||||||
mkdir -p "$VOICES_DIR"
|
|
||||||
cd "$VOICES_DIR"
|
|
||||||
|
|
||||||
echo "Lade ARIA Stimmen..."
|
|
||||||
echo ""
|
|
||||||
|
|
||||||
echo "[1/4] Ramona (Modell)..."
|
|
||||||
wget -q --show-progress "$BASE_URL/ramona/low/de_DE-ramona-low.onnx"
|
|
||||||
|
|
||||||
echo "[2/4] Ramona (Config)..."
|
|
||||||
wget -q --show-progress "$BASE_URL/ramona/low/de_DE-ramona-low.onnx.json"
|
|
||||||
|
|
||||||
echo "[3/4] Thorsten (Modell)..."
|
|
||||||
wget -q --show-progress "$BASE_URL/thorsten/high/de_DE-thorsten-high.onnx"
|
|
||||||
|
|
||||||
echo "[4/4] Thorsten (Config)..."
|
|
||||||
wget -q --show-progress "$BASE_URL/thorsten/high/de_DE-thorsten-high.onnx.json"
|
|
||||||
|
|
||||||
echo ""
|
|
||||||
echo "Stimmen geladen!"
|
|
||||||
ls -lh *.onnx
|
|
||||||
6
issue.md
|
|
@ -37,6 +37,8 @@
|
||||||
- [x] App: "ARIA denkt..." indicator + cancel button (Bridge mirrors agent_activity via RVS)
|
- [x] App: "ARIA denkt..." indicator + cancel button (Bridge mirrors agent_activity via RVS)
|
||||||
- [x] Whisper STT: model selection in Diagnostic (tiny/base/small/medium/large-v3), hot reload in Bridge, default set to medium
|
- [x] Whisper STT: model selection in Diagnostic (tiny/base/small/medium/large-v3), hot reload in Bridge, default set to medium
|
||||||
- [x] App: audio recording explicitly 16 kHz mono (avoids resampling, optimal for Whisper)
|
- [x] App: audio recording explicitly 16 kHz mono (avoids resampling, optimal for Whisper)
|
||||||
|
- [x] Streaming TTS (path A): XTTS → PCM stream → aria-bridge → App AudioTrack MODE_STREAM, no more WAV gaps
|
||||||
|
- [x] Piper removed completely: XTTS v2 is now the only TTS engine (remote, GPU on the gaming PC). If XTTS is offline, ARIA stays silent; this is deliberately accepted.
|
||||||
- [x] Conversation mode: stricter speech gate (-28 dB / 500 ms), no more ambient noise
|
- [x] Conversation mode: stricter speech gate (-28 dB / 500 ms), no more ambient noise
|
||||||
- [x] Conversation mode: max duration 30 s per recording, cache cleanup of old files, messages array capped (500)
|
- [x] Conversation mode: max duration 30 s per recording, cache cleanup of old files, messages array capped (500)
|
||||||
- [x] Diagnostic: archived session versions (.reset.*) are displayed and exportable. OpenClaw resets sessions on first use after a container restart, but the content is preserved in .reset.<timestamp> files
|
- [x] Diagnostic: archived session versions (.reset.*) are displayed and exportable. OpenClaw resets sessions on first use after a container restart, but the content is preserved in .reset.<timestamp> files
|
||||||
|
|
@ -65,11 +67,7 @@
|
||||||
- [ ] QR code onboarding: Diagnostic generates a QR code with the RVS credentials and the App scans it, so no manual entry is needed
|
- [ ] QR code onboarding: Diagnostic generates a QR code with the RVS credentials and the App scans it, so no manual entry is needed
|
||||||
|
|
||||||
### TTS / Audio
|
### TTS / Audio
|
||||||
- [ ] XTTS audio streaming (PCM stream instead of WAV files, eliminates stuttering completely)
|
|
||||||
- [ ] Audio normalization (equalize volume between chunks)
|
- [ ] Audio normalization (equalize volume between chunks)
|
||||||
- [ ] Piper voices download via Diagnostic (new languages/voices)
|
|
||||||
- [ ] TTS text preparation: filter out code blocks, spell out units ("22GB" → "zweiundzwanzig Gigabyte"). Two variants are conceivable: (a) server-side cleanup in the Bridge, (b) ARIA writes a `<voice></voice>` block that stays hidden in the UI but is used for TTS.
|
|
||||||
- [ ] Possibly remove Piper completely (sounds bad compared to XTTS), or keep it only as a fallback when XTTS is offline
|
|
||||||
|
|
||||||
### Architecture
|
### Architecture
|
||||||
- [ ] Images: use Claude Vision directly (currently only the file path is passed to ARIA)
|
- [ ] Images: use Claude Vision directly (currently only the file path is passed to ARIA)
|
||||||
|
|
|
||||||