Compare commits

22 Commits

Author SHA1 Message Date
duffyduck ed2f1bb5ee release: bump version to 0.0.5.4 2026-04-24 14:45:17 +02:00
duffyduck 0a04972455 feat: silence tolerance for recording adjustable in app settings
New +/- block in SettingsScreen → Spracheingabe (voice input) → "Stille-Toleranz" (silence tolerance),
1.0-8.0 s, default 2.8 s. The value is stored in AsyncStorage (aria_vad_silence_sec).
audio.ts reads the value when recording starts and uses it as the
VAD auto-stop threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:44:17 +02:00
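
The validation described in this commit (stored value, range 1.0-8.0 s, default 2.8 s) can be sketched as follows. This is a minimal Python sketch, not the app's TypeScript code: the function name `vad_silence_ms` is hypothetical, and it assumes that unparsable or out-of-range values fall back to the default instead of being clamped.

```python
import math

# Values taken from the commit message above.
VAD_SILENCE_DEFAULT_SEC = 2.8
VAD_SILENCE_MIN_SEC = 1.0
VAD_SILENCE_MAX_SEC = 8.0


def vad_silence_ms(raw):
    """Return the VAD auto-stop threshold in milliseconds.

    `raw` is the stored string (or None if nothing was stored yet).
    Hypothetical helper: invalid or out-of-range values yield the default.
    """
    if raw is not None:
        try:
            seconds = float(raw)
        except ValueError:
            seconds = float("nan")
        if math.isfinite(seconds) and VAD_SILENCE_MIN_SEC <= seconds <= VAD_SILENCE_MAX_SEC:
            return round(seconds * 1000)
    return round(VAD_SILENCE_DEFAULT_SEC * 1000)
```

Falling back to the default keeps a corrupted stored value from silently producing an extreme threshold.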
duffyduck 2a4379eb64 release: bump version to 0.0.5.3 2026-04-24 14:41:59 +02:00
duffyduck e64df23bb7 fix: TTS pauses other apps instead of ducking + longer VAD/mic limits
AudioFocus.requestDuck now uses AUDIOFOCUS_GAIN_TRANSIENT (instead of
TRANSIENT_MAY_DUCK): Spotify/YouTube pause completely while ARIA
is speaking and no longer come back up in the middle of a response.

PcmStreamPlayer.end() now resolves only once the native writer thread
has actually finished (all samples from the pre-roll buffer played out).
audio.ts waits accordingly before calling AudioFocus.release(), which
fixes the "music turns up while the response is still playing" problem.

Mic recording: VAD_SILENCE_DURATION_MS 1800 → 2800 ms (more tolerance for
pauses in speech), MAX_RECORDING_MS 30 s → 120 s (longer explanations
are possible, the emergency cutoff stays).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:40:58 +02:00
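
The "resolve end() only after the writer has finished" idea can be illustrated with plain threads. The following is a rough Python sketch under stated assumptions, not the Kotlin module: the class `PcmWriter`, its `sink` callback, and the 50 ms poll interval are hypothetical stand-ins; only the join-with-timeout pattern and the 15 s cap mirror the change.

```python
import queue
import threading


class PcmWriter:
    """Plays queued chunks on a background thread (stand-in for the native writer)."""

    def __init__(self, sink):
        self._sink = sink                      # callable that "plays" one chunk
        self._queue = queue.Queue()
        self._end_requested = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="PcmStreamWriter", daemon=True
        )
        self._thread.start()

    def write_chunk(self, data):
        self._queue.put(data)

    def _run(self):
        while True:
            try:
                data = self._queue.get(timeout=0.05)
            except queue.Empty:
                if self._end_requested.is_set():
                    return                     # queue drained, no more chunks expected
                continue
            self._sink(data)

    def end(self, timeout_s=15.0):
        """Signal end of stream and wait until the writer has drained the queue."""
        self._end_requested.set()
        self._thread.join(timeout_s)           # hard cap in case the writer hangs
        return not self._thread.is_alive()


played = []
writer = PcmWriter(played.append)
writer.write_chunk(b"chunk-1")
writer.write_chunk(b"chunk-2")
finished = writer.end()
# Only at this point would the caller release the audio focus.
```

Because `end()` returns only after the join, a caller that releases audio focus afterwards cannot do so while queued audio is still pending.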
duffyduck 576ae925dd feat(phase2): XTTS replaced by F5-TTS, voice cloning on the Gamebox
New aria-f5tts-bridge container:
  - Python service, loads F5TTS_v1_Base at startup
  - Receives xtts_request via RVS, synthesizes with voice cloning,
    streams PCM chunks (audio_pcm, 16-bit s16le) like the XTTS bridge did before
  - Splits long texts at sentence boundaries, streams sentence by sentence
  - Fade-in on the first chunk, queue to prevent parallel renders

Voice management:
  - Storage location is still /voices/, but now as a pair
    {name}.wav + {name}.txt (F5-TTS needs a reference transcription)
  - voice_upload: save the WAV, internally send an stt_request to whisper-bridge,
    store the transcription as .txt → the user does not have to type anything
  - On-the-fly transcription: if a WAV exists without a .txt, the transcription
    is generated on the first render/preload
  - Existing RVS messages (voice_upload/xtts_list_voices/... etc.)
    remain unchanged → no app/Diagnostic change needed

Gaming PC docker-compose:
  - xtts + xtts-bridge services removed
  - f5tts-bridge + whisper-bridge stay or are added
  - Volume xtts-models → f5tts-models

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:34:11 +02:00
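
Splitting long texts at sentence boundaries, as mentioned for the new bridge, could look roughly like this. It is a simplified sketch and not the bridge's actual code: the function name `split_sentences` and the regular expression are assumptions, and abbreviations such as "z.B." would be split incorrectly by this naive rule.

```python
import re

# A sentence ends at '.', '!' or '?' followed by whitespace (naive assumption).
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(text):
    """Split text into sentences so each one can be rendered and streamed separately."""
    parts = _SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]


sentences = split_sentences("Hallo Welt. Wie geht es dir? Gut!")
```

Rendering sentence by sentence lets the first audio chunk leave the service before the whole text has been synthesized.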
duffyduck e170991222 fix: _send_to_rvs returns a success bool; _stt_remote aborts
immediately on a send error instead of running into the 45 s timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:16:08 +02:00
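
The early-abort behaviour from this fix can be sketched with asyncio. This is an illustrative sketch only: the function `stt_remote`, the `send` callback, and the request payload shape are assumptions rather than the bridge's real signatures; only the idea "check the send result before waiting on the future" comes from the commit.

```python
import asyncio


async def stt_remote(send, pending, request_id, timeout_s=45.0):
    """Send an STT request and wait for the reply, or return None on failure."""
    future = asyncio.get_running_loop().create_future()
    pending[request_id] = future
    try:
        sent = await send({"type": "stt_request", "payload": {"requestId": request_id}})
        if not sent:
            return None                        # send failed: do not wait for the timeout
        return await asyncio.wait_for(future, timeout_s)
    except asyncio.TimeoutError:
        return None
    finally:
        pending.pop(request_id, None)


async def _demo():
    pending = {}

    async def failing_send(message):
        return False                           # simulates a broken connection

    async def working_send(message):
        # Simulates the remote side answering right away.
        pending[message["payload"]["requestId"]].set_result("hallo welt")
        return True

    return (
        await stt_remote(failing_send, pending, "r1"),
        await stt_remote(working_send, pending, "r2"),
    )


aborted, transcribed = asyncio.run(_demo())
```

Returning `None` on a failed send lets the caller start its fallback immediately.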
duffyduck a1343ee18f debug: logs for the stt_request round trip: aria-bridge logs on send,
whisper-bridge logs incoming stt_request (id + audio size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:13:41 +02:00
duffyduck b2d3c935d8 fix(whisper): requests added as an explicit dependency; faster-whisper 1.0.3
does not pull it in itself, and the container otherwise crashes on import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:59:59 +02:00
duffyduck 49089eee4b release: bump version to 0.0.5.2 2026-04-24 13:50:19 +02:00
duffyduck e544992c9f feat(phase1): Whisper STT offloaded to the Gamebox
New container aria-whisper-bridge on the Gamebox: faster-whisper on
CUDA with float16. The container connects to the RVS via WebSocket,
accepts stt_request, runs ffmpeg + Whisper, and replies with
stt_response. It also listens for config broadcasts and hot-swaps the
model when the selection changes in Diagnostic.

aria-bridge now calls the Gamebox first; only if it does not respond within
45 s does it fall back to local Whisper (CPU). The local
model is loaded lazily, which saves RAM on the VM.

RVS: stt_request/stt_response added to the ALLOWED_TYPES list.

Diagnostic voice config (whisperModel field) remains unchanged;
the selection is passed through to the Gamebox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:42:07 +02:00
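
The "remote first, lazy local fallback" flow can be sketched as follows. This is a minimal Python sketch under stated assumptions: `transcribe`, `LazyLocalStt`, and the loader callback are hypothetical names, and the demo remotes merely simulate a fast and a slow Gamebox.

```python
import asyncio


class LazyLocalStt:
    """Local fallback whose model is loaded only on first use."""

    def __init__(self, loader):
        self._loader = loader                  # callable returning a model function
        self._model = None

    async def __call__(self, audio):
        if self._model is None:
            self._model = self._loader()       # pay the load cost only when needed
        return self._model(audio)


async def transcribe(remote, local, audio, timeout_s=45.0):
    """Try the remote service first; fall back to local STT on timeout or None."""
    try:
        text = await asyncio.wait_for(remote(audio), timeout_s)
    except asyncio.TimeoutError:
        text = None
    if text is None:
        text = await local(audio)
    return text


load_count = []


def _loader():
    load_count.append(1)
    return lambda audio: "lokal"


async def _fast_remote(audio):
    return "remote"


async def _slow_remote(audio):
    await asyncio.sleep(1.0)                   # longer than the demo timeout below
    return "remote"


local_stt = LazyLocalStt(_loader)
first = asyncio.run(transcribe(_fast_remote, local_stt, b""))
loads_after_first = len(load_count)
second = asyncio.run(transcribe(_slow_remote, local_stt, b"", timeout_s=0.05))
```

As long as the remote side answers in time, the local model is never loaded.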
duffyduck 97a1a3089a release: bump version to 0.0.5.1 2026-04-23 22:02:17 +02:00
duffyduck 64f18e97a0 release: bump version to 0.0.5.0 2026-04-23 15:31:18 +02:00
duffyduck 9cbea27455 feat: voice_preload/voice_ready, feedback when a new voice has loaded
XTTS bridge:
  - receives the new voice_preload type, silently renders "ja." for the voice
    via the TTS queue (so there is no conflict with real TTS)
  - also listens for config broadcasts: when Diagnostic changes the voice
    globally, it is preloaded automatically
  - broadcasts voice_ready with the duration (loadMs) or error

RVS: voice_preload + voice_ready added to the ALLOWED_TYPES list.

App (SettingsScreen): on a voice change we send voice_preload, show a
spinner in the voice row, and show a toast with "Stimme X bereit (Ns)" (voice X ready).
App (ChatScreen): toast here as well, in case the user is not in Settings at the moment.

Diagnostic (server + UI): voice_ready is passed through to the browser; a
status text below the voice dropdown shows "wird geladen" (loading) → "bereit" (ready).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:24:08 +02:00
duffyduck c8881f9e4d release: bump version to 0.0.4.9 2026-04-22 23:02:28 +02:00
duffyduck 028e3b2240 fix: voice selection finally works + Diagnostic resets all apps
XTTS bridge: in daswer123 local mode the server expects speaker_wav as a
basename (e.g. "Maia"), not as a path. Until now we sent "/voices/Maia.wav",
which the server silently discards, using the default voice instead. Now: send the
bare speaker name + log a warning if the file is missing.

App: ChatScreen + SettingsScreen listen for type "config" from the RVS.
When the global XTTS voice is changed in Diagnostic, all
apps are reset to the new value (as requested by the user).
Otherwise the local app choice stays intact and wins per request.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:32:40 +02:00
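
Reducing a voice path to the bare speaker name, with a warning when the file is missing, might look like this. The sketch is illustrative: the helper `resolve_speaker`, the logger name, and the `voices_dir` parameter are hypothetical; only the path-to-basename rule and the warning come from the commit message.

```python
import logging
from pathlib import Path, PurePosixPath

logger = logging.getLogger("xtts-bridge-sketch")   # hypothetical logger name


def resolve_speaker(voice, voices_dir="/voices"):
    """Return the bare speaker name for a voice given as a name or a path."""
    name = PurePosixPath(voice).name               # drop any directory part
    if name.lower().endswith(".wav"):
        name = name[:-4]                           # drop the extension
    if not (Path(voices_dir) / f"{name}.wav").is_file():
        logger.warning("Voice file for %r not found in %s", name, voices_dir)
    return name


from_path = resolve_speaker("/voices/Maia.wav", voices_dir="/nonexistent-voices")
from_name = resolve_speaker("Maia", voices_dir="/nonexistent-voices")
```

Only the `.wav` suffix is stripped, so a speaker name that itself contains a dot is left intact.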
duffyduck c042f27106 feat: generic spelling-out for unknown acronyms
After the explicit _UNIT_WORDS list, a fallback rule applies:
all remaining uppercase words of 2-5 characters are spelled out
letter by letter. XTTS → X T T S, USB → U S B, DNS → D N S, JSON → J S O N.

Special cases (WLAN, NATO, which are spoken as words) can be
overridden explicitly in _UNIT_WORDS if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:17:04 +02:00
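
The fallback rule can be tried out in isolation. The sketch below uses the same regular expression as the diff further down; the override table `_SPOKEN_AS_WORD` and its replacement spellings are hypothetical and merely stand in for explicit entries in _UNIT_WORDS.

```python
import re

# Hypothetical overrides for acronyms that are spoken as a word.
_SPOKEN_AS_WORD = {"WLAN": "Wlan", "NATO": "Nato"}


def spell_acronyms(text):
    """Spell out remaining 2-5 letter uppercase words letter by letter."""
    # Explicit overrides run first, so the generic rule no longer sees them.
    for acronym, spoken in _SPOKEN_AS_WORD.items():
        text = re.sub(rf'\b{acronym}\b', spoken, text)
    # Generic rule: "USB" -> "U S B".
    return re.sub(r'\b([A-Z]{2,5})\b', lambda m: " ".join(m.group(1)), text)


spelled = spell_acronyms("USB und JSON")
```

Words longer than five letters, or with mixed case, are left untouched by the generic rule.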
duffyduck 4ceadf8be5 release: bump version to 0.0.4.8 2026-04-22 19:08:00 +02:00
duffyduck ddd30b3059 feat: pre-roll buffer for TTS adjustable in app settings
- Kotlin start() now takes prerollSeconds as its third parameter
  (clamped to 1.0-6.0 s, fallback 3.5 s for an invalid value)
- audio.ts reads the value from AsyncStorage before every stream start and
  exports default/min/max/key as constants
- SettingsScreen: +/- buttons directly below the TTS toggle,
  default raised to 3.5 s (from 2.5 s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:06:55 +02:00
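
How a pre-roll duration translates into a byte target for 16-bit PCM can be shown with a small calculation. This is a Python sketch of the arithmetic only, not the Kotlin code: the function name `preroll_bytes` is hypothetical, the limits follow the commit message (1.0-6.0 s, fallback 3.5 s), and the 24 kHz mono stream is an assumed example.

```python
import math


def preroll_bytes(sample_rate, channels, seconds, default=3.5, lo=1.0, hi=6.0):
    """Bytes of 16-bit PCM that must be buffered before playback starts."""
    if not math.isfinite(seconds) or seconds <= 0:
        seconds = default                      # invalid value: use the fallback
    seconds = min(max(seconds, lo), hi)        # clamp into the allowed range
    bytes_per_second = sample_rate * channels * 2   # 16-bit = 2 bytes per sample
    return int(bytes_per_second * seconds)


target = preroll_bytes(24000, 1, 3.5)          # assumed 24 kHz mono stream
```

For the assumed stream this yields 48,000 bytes per second, so 3.5 s of pre-roll corresponds to 168,000 bytes.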
duffyduck 6c8ba5fe2d fix: fade-in on the first PCM chunk, masks XTTS warm-up glitches
XTTS daswer123 produces warm-up artifacts at the start of every render: the
first autoregressively generated tokens have little context and sound
distorted. A 120 ms linear fade-in on the first outgoing PCM chunk
ramps the audio up gently and hides the glitches without making the actual
audio that follows any quieter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:01:36 +02:00
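
A linear fade-in on a 16-bit little-endian PCM chunk can be sketched as follows. This is an illustrative implementation, not the bridge's code: the function name `fade_in_s16le` and its parameters are assumptions; only the 120 ms linear ramp on the first chunk is taken from the commit message.

```python
import struct


def fade_in_s16le(chunk, sample_rate, channels=1, fade_ms=120):
    """Apply a linear fade-in to the start of an s16le PCM chunk."""
    count = len(chunk) // 2                    # number of 16-bit samples
    samples = list(struct.unpack(f"<{count}h", chunk[:count * 2]))
    frames = count // channels
    fade_frames = min(frames, sample_rate * fade_ms // 1000)
    for frame in range(fade_frames):
        gain = frame / fade_frames             # ramps from 0.0 up towards 1.0
        for channel in range(channels):
            index = frame * channels + channel
            samples[index] = int(samples[index] * gain)
    # Re-pack; a trailing odd byte (if any) is passed through unchanged.
    return struct.pack(f"<{count}h", *samples) + chunk[count * 2:]


raw = struct.pack("<4h", 1000, 1000, 1000, 1000)
faded = struct.unpack("<4h", fade_in_s16le(raw, sample_rate=1000, fade_ms=2))
```

Samples after the ramp keep their original amplitude, so only the very beginning of the stream is attenuated.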
duffyduck 32ddac002f fix: stream_chunk_size raised to 250, fewer render artifacts
XTTS daswer123 often produces glitches at chunk boundaries in the words
that span the boundary. 100 → 250 = fewer boundaries per sentence =
cleaner speech output. Time to first audio increases by a few seconds,
which is acceptable now that the app buffers a pre-roll.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:56:00 +02:00
duffyduck bbbe69d928 release: bump version to 0.0.4.7 2026-04-22 18:46:25 +02:00
duffyduck 23c39d5bba feat: spell out decimal numbers for TTS + leading silence in the stream
- aria_bridge clean_text_for_tts: "0.1" / "0,5" / "1,25" is now written out as
  "null komma eins" / "null komma fuenf" / "eins komma zwei fuenf".
  A lookahead is intended to keep the rule from matching IP-like strings.
- PcmStreamPlayer: 200 ms of silence at the start of the stream so that AudioTrack
  starts up cleanly and the first words are not swallowed
  (XTTS warm-up + play() startup latency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:44:38 +02:00
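
The decimal rule can be tested on its own with a simplified number-to-words helper. This is a sketch under assumptions: `_DIGITS` and `decimals_to_words` are hypothetical stand-ins for the bridge's helpers, integer parts above 9 are left as digits here, and the lookbehind `(?<![.,])` is an addition of this sketch so that the tail of an IP-like string is not matched either.

```python
import re

_DIGITS = ["null", "eins", "zwei", "drei", "vier",
           "fuenf", "sechs", "sieben", "acht", "neun"]

# Lookbehind + lookahead: do not match inside dotted sequences like 192.168.1.1.
_DECIMAL = re.compile(r'\b(?<![.,])(\d+)[.,](\d+)(?![.,\d])')


def _decimal_to_words(match):
    int_part, dec_part = match.group(1), match.group(2)
    # Simplification: only single-digit integer parts are written out.
    int_word = _DIGITS[int(int_part)] if int(int_part) <= 9 else int_part
    dec_words = " ".join(_DIGITS[int(digit)] for digit in dec_part)
    return f"{int_word} komma {dec_words}"


def decimals_to_words(text):
    """Write decimal numbers out in German words for TTS."""
    return _DECIMAL.sub(_decimal_to_words, text)


spoken = decimals_to_words("Wert: 0.1 oder 1,25")
```

The fractional part is read digit by digit, which matches how the commit message spells "1,25".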
21 changed files with 1426 additions and 600 deletions
+2 -2
@@ -79,8 +79,8 @@ android {
applicationId "com.ariacockpit"
minSdkVersion rootProject.ext.minSdkVersion
targetSdkVersion rootProject.ext.targetSdkVersion
versionCode 406
versionName "0.0.4.6"
versionCode 504
versionName "0.0.5.4"
// Fallback for libraries with product flavors
missingDimensionStrategy 'react-native-camera', 'general'
}
@@ -53,11 +53,17 @@ class AudioFocusModule(reactContext: ReactApplicationContext) : ReactContextBase
promise.resolve(result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED)
}
/** Other apps are ducked (TTS is speaking). */
/** Other apps are paused (TTS is speaking).
*
* TRANSIENT (instead of TRANSIENT_MAY_DUCK): Spotify/YouTube pause completely
* instead of merely getting quieter. This also avoids the "comes back up"
* problem with MAY_DUCK, where the system lifted the duck effect again
* after a short time even though we still held the focus.
*/
@ReactMethod
fun requestDuck(promise: Promise) {
requestFocus(
AudioManager.AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK,
AudioManager.AUDIOFOCUS_GAIN_TRANSIENT,
AudioAttributes.USAGE_ASSISTANT,
promise,
)
@@ -30,9 +30,13 @@ import java.util.concurrent.LinkedBlockingQueue
class PcmStreamPlayerModule(reactContext: ReactApplicationContext) : ReactContextBaseJavaModule(reactContext) {
companion object {
private const val TAG = "PcmStreamPlayer"
// Seconds of audio that must be buffered BEFORE play() starts.
// 2.5 s of reserve = enough to bridge XTTS render pauses between chunks.
private const val PREROLL_SECONDS = 2.5
// Fallback when JS does not pass a value.
private const val DEFAULT_PREROLL_SECONDS = 3.5
private const val MIN_PREROLL_SECONDS = 0.5
private const val MAX_PREROLL_SECONDS = 10.0
// Silence at the start of the stream so that AudioTrack starts up cleanly and the
// first samples are not cut off (XTTS warm-up + play() latency).
private const val LEADING_SILENCE_SECONDS = 0.2
}
override fun getName() = "PcmStreamPlayer"
@@ -50,17 +54,21 @@ class PcmStreamPlayerModule(reactContext: ReactApplicationContext) : ReactContex
// ── Lifecycle ──
@ReactMethod
fun start(sampleRate: Int, channels: Int, promise: Promise) {
fun start(sampleRate: Int, channels: Int, prerollSeconds: Double, promise: Promise) {
try {
// End the old session if there is one
stopInternal()
val prerollSec = prerollSeconds
.coerceIn(MIN_PREROLL_SECONDS, MAX_PREROLL_SECONDS)
.let { if (it.isFinite() && it > 0) it else DEFAULT_PREROLL_SECONDS }
val channelConfig = if (channels == 2) AudioFormat.CHANNEL_OUT_STEREO else AudioFormat.CHANNEL_OUT_MONO
val encoding = AudioFormat.ENCODING_PCM_16BIT
val minBuf = AudioTrack.getMinBufferSize(sampleRate, channelConfig, encoding)
val bytesPerSecond = sampleRate * channels * 2 // 16-bit = 2 bytes
// The buffer must hold at least PREROLL plus some headroom.
val prerollTarget = (bytesPerSecond * PREROLL_SECONDS).toInt()
val prerollTarget = (bytesPerSecond * prerollSec).toInt()
val bufferSize = (minBuf * 32).coerceAtLeast(prerollTarget * 2)
prerollBytes = prerollTarget
bytesBuffered = 0
@@ -94,6 +102,18 @@ class PcmStreamPlayerModule(reactContext: ReactApplicationContext) : ReactContex
writerThread = Thread({
val t = track ?: return@Thread
try {
// Write leading silence into the buffer; this gives AudioTrack time to start up.
val silenceBytes = ((sampleRate * channels * 2) * LEADING_SILENCE_SECONDS).toInt() and 0x7FFFFFFE
if (silenceBytes > 0) {
val silence = ByteArray(silenceBytes)
var silOff = 0
while (silOff < silence.size && !writerShouldStop) {
val w = t.write(silence, silOff, silence.size - silOff)
if (w <= 0) break
silOff += w
}
bytesBuffered += silence.size
}
while (!writerShouldStop) {
val data = queue.poll(50, java.util.concurrent.TimeUnit.MILLISECONDS) ?: run {
if (endRequested) {
@@ -158,7 +178,7 @@ class PcmStreamPlayerModule(reactContext: ReactApplicationContext) : ReactContex
}
}, "PcmStreamWriter").apply { start() }
Log.i(TAG, "Stream gestartet: ${sampleRate}Hz ch=$channels buf=${bufferSize}B preroll=${prerollBytes}B")
Log.i(TAG, "Stream gestartet: ${sampleRate}Hz ch=$channels buf=${bufferSize}B preroll=${prerollBytes}B (${prerollSec}s)")
promise.resolve(true)
} catch (e: Exception) {
Log.e(TAG, "start fehlgeschlagen", e)
@@ -181,11 +201,27 @@ class PcmStreamPlayerModule(reactContext: ReactApplicationContext) : ReactContex
}
}
/** Signals: no more chunks. The writer waits for the queue to drain, then stops. */
/** Signals: no more chunks. The writer plays out the rest, then stops.
* The promise resolves only once the writer thread has finished, which
* matters so that the caller releases the AudioFocus only AFTER the last
* sample has been played (otherwise Spotify turns up
* while the pre-roll is still playing out).
*/
@ReactMethod
fun end(promise: Promise) {
endRequested = true
promise.resolve(true)
val t = writerThread
if (t == null || !t.isAlive) {
promise.resolve(true)
return
}
// Wait for the writer in the background so the JS bridge thread is not blocked
Thread({
try {
t.join(15_000) // hard cap in case the writer hangs
} catch (_: InterruptedException) {}
promise.resolve(true)
}, "PcmStreamEndWaiter").start()
}
/** Hard stop (cancel): discard the queue. */
+1 -1
@@ -1,6 +1,6 @@
{
"name": "aria-cockpit",
"version": "0.0.4.6",
"version": "0.0.5.4",
"private": true,
"scripts": {
"android": "react-native run-android",
+21
@@ -18,6 +18,7 @@ import {
Image,
ScrollView,
Modal,
ToastAndroid,
} from 'react-native';
import AsyncStorage from '@react-native-async-storage/async-storage';
import RNFS from 'react-native-fs';
@@ -325,6 +326,26 @@ const ChatScreen: React.FC = () => {
const tool = (message.payload.tool as string) || '';
setAgentActivity({ activity, tool });
}
// Voice config from Diagnostic: resets the local app voice to the
// value currently selected in Diagnostic. The user's choice in the app
// is overwritten by this.
if (message.type === ('config' as any)) {
const newVoice = ((message.payload as any).xttsVoice as string) ?? '';
localXttsVoiceRef.current = newVoice;
AsyncStorage.setItem('aria_xtts_voice', newVoice);
}
// XTTS bridge reports that the voice has finished loading (short status toast)
if (message.type === ('voice_ready' as any)) {
const v = ((message.payload as any).voice as string) ?? '';
const err = (message.payload as any).error as string | undefined;
if (err) {
ToastAndroid.show(`Stimme "${v}" Fehler: ${err}`, ToastAndroid.LONG);
} else {
ToastAndroid.show(`Stimme "${v || 'Standard'}" bereit`, ToastAndroid.SHORT);
}
}
});
const unsubState = rvs.onStateChange((state) => {
+168 -1
@@ -15,11 +15,23 @@ import {
StyleSheet,
Alert,
Platform,
ToastAndroid,
ActivityIndicator,
} from 'react-native';
import AsyncStorage from '@react-native-async-storage/async-storage';
import RNFS from 'react-native-fs';
import DocumentPicker from 'react-native-document-picker';
import rvs, { ConnectionState, RVSMessage, ConnectionConfig, ConnectionLogEntry } from '../services/rvs';
import {
TTS_PREROLL_DEFAULT_SEC,
TTS_PREROLL_MIN_SEC,
TTS_PREROLL_MAX_SEC,
TTS_PREROLL_STORAGE_KEY,
VAD_SILENCE_DEFAULT_SEC,
VAD_SILENCE_MIN_SEC,
VAD_SILENCE_MAX_SEC,
VAD_SILENCE_STORAGE_KEY,
} from '../services/audio';
import ModeSelector from '../components/ModeSelector';
import QRScanner from '../components/QRScanner';
import VoiceCloneModal from '../components/VoiceCloneModal';
@@ -73,8 +85,11 @@ const SettingsScreen: React.FC = () => {
const [autoDownload, setAutoDownload] = useState(true);
const [storageSize, setStorageSize] = useState('...');
const [ttsEnabled, setTtsEnabled] = useState(true);
const [ttsPrerollSec, setTtsPrerollSec] = useState<number>(TTS_PREROLL_DEFAULT_SEC);
const [vadSilenceSec, setVadSilenceSec] = useState<number>(VAD_SILENCE_DEFAULT_SEC);
const [editingPath, setEditingPath] = useState(false);
const [xttsVoice, setXttsVoice] = useState('');
const [loadingVoice, setLoadingVoice] = useState<string | null>(null);
const [availableVoices, setAvailableVoices] = useState<Array<{name: string, size: number}>>([]);
const [voiceCloneVisible, setVoiceCloneVisible] = useState(false);
const [tempPath, setTempPath] = useState('');
@@ -99,6 +114,22 @@ const SettingsScreen: React.FC = () => {
AsyncStorage.getItem('aria_tts_enabled').then(saved => {
if (saved !== null) setTtsEnabled(saved === 'true');
});
AsyncStorage.getItem(TTS_PREROLL_STORAGE_KEY).then(saved => {
if (saved != null) {
const n = parseFloat(saved);
if (isFinite(n) && n >= TTS_PREROLL_MIN_SEC && n <= TTS_PREROLL_MAX_SEC) {
setTtsPrerollSec(n);
}
}
});
AsyncStorage.getItem(VAD_SILENCE_STORAGE_KEY).then(saved => {
if (saved != null) {
const n = parseFloat(saved);
if (isFinite(n) && n >= VAD_SILENCE_MIN_SEC && n <= VAD_SILENCE_MAX_SEC) {
setVadSilenceSec(n);
}
}
});
AsyncStorage.getItem('aria_xtts_voice').then(saved => {
if (saved) setXttsVoice(saved);
});
@@ -250,6 +281,31 @@ const SettingsScreen: React.FC = () => {
}
rvs.send('xtts_list_voices' as any, {});
}
// Diagnostic voice change → reset the local app voice to the new default.
// Also trigger the preload so the user knows when the voice has loaded.
if (message.type === ('config' as any)) {
const newVoice = ((message.payload as any).xttsVoice as string) ?? '';
setXttsVoice(newVoice);
AsyncStorage.setItem('aria_xtts_voice', newVoice);
if (newVoice) {
setLoadingVoice(newVoice);
}
}
// XTTS bridge reports: voice has finished loading
if (message.type === ('voice_ready' as any)) {
const v = ((message.payload as any).voice as string) ?? '';
const err = (message.payload as any).error as string | undefined;
const ms = (message.payload as any).loadMs as number | undefined;
setLoadingVoice(null);
if (err) {
ToastAndroid.show(`Stimme "${v}" konnte nicht geladen werden: ${err}`, ToastAndroid.LONG);
} else {
const suffix = ms ? ` (${(ms / 1000).toFixed(1)}s)` : '';
ToastAndroid.show(`Stimme "${v || 'Standard'}" bereit${suffix}`, ToastAndroid.SHORT);
}
}
});
return () => {
@@ -318,6 +374,13 @@ const SettingsScreen: React.FC = () => {
const selectVoice = useCallback((voiceName: string) => {
setXttsVoice(voiceName);
AsyncStorage.setItem('aria_xtts_voice', voiceName);
// Preload only for custom voices; "Standard" needs no loading step
if (voiceName) {
setLoadingVoice(voiceName);
rvs.send('voice_preload' as any, { voice: voiceName, source: 'app' });
} else {
setLoadingVoice(null);
}
}, []);
const deleteVoice = useCallback((name: string) => {
@@ -505,6 +568,43 @@ const SettingsScreen: React.FC = () => {
</View>
</View>
{/* === Voice input (device-local) === */}
<Text style={styles.sectionTitle}>Spracheingabe</Text>
<View style={styles.card}>
<Text style={styles.toggleLabel}>Stille-Toleranz</Text>
<Text style={styles.toggleHint}>
Wie lange du eine Sprechpause machen darfst, bevor die Aufnahme
automatisch beendet und gesendet wird. Hoeher = mehr Zeit zum
Nachdenken; niedriger = schnelleres Senden.
Default: {VAD_SILENCE_DEFAULT_SEC.toFixed(1)}s.
</Text>
<View style={styles.prerollRow}>
<TouchableOpacity
style={styles.prerollButton}
onPress={() => {
const next = Math.max(VAD_SILENCE_MIN_SEC, Math.round((vadSilenceSec - 0.5) * 10) / 10);
setVadSilenceSec(next);
AsyncStorage.setItem(VAD_SILENCE_STORAGE_KEY, String(next));
}}
disabled={vadSilenceSec <= VAD_SILENCE_MIN_SEC}
>
<Text style={styles.prerollButtonText}>-0.5</Text>
</TouchableOpacity>
<Text style={styles.prerollValue}>{vadSilenceSec.toFixed(1)} s</Text>
<TouchableOpacity
style={styles.prerollButton}
onPress={() => {
const next = Math.min(VAD_SILENCE_MAX_SEC, Math.round((vadSilenceSec + 0.5) * 10) / 10);
setVadSilenceSec(next);
AsyncStorage.setItem(VAD_SILENCE_STORAGE_KEY, String(next));
}}
disabled={vadSilenceSec >= VAD_SILENCE_MAX_SEC}
>
<Text style={styles.prerollButtonText}>+0.5</Text>
</TouchableOpacity>
</View>
</View>
{/* === Voice output (device-local) === */}
<Text style={styles.sectionTitle}>Sprachausgabe</Text>
<View style={styles.card}>
@@ -527,6 +627,42 @@ const SettingsScreen: React.FC = () => {
/>
</View>
{ttsEnabled && (
<View style={{marginTop: 20}}>
<Text style={styles.toggleLabel}>Puffer vor Wiedergabestart</Text>
<Text style={styles.toggleHint}>
Wie viel Audio gesammelt wird bevor die Wiedergabe startet.
Hoeher = robuster gegen Render-Pausen, aber mehr Startverzoegerung.
Default: {TTS_PREROLL_DEFAULT_SEC.toFixed(1)}s.
</Text>
<View style={styles.prerollRow}>
<TouchableOpacity
style={styles.prerollButton}
onPress={() => {
const next = Math.max(TTS_PREROLL_MIN_SEC, Math.round((ttsPrerollSec - 0.5) * 10) / 10);
setTtsPrerollSec(next);
AsyncStorage.setItem(TTS_PREROLL_STORAGE_KEY, String(next));
}}
disabled={ttsPrerollSec <= TTS_PREROLL_MIN_SEC}
>
<Text style={styles.prerollButtonText}>-0.5</Text>
</TouchableOpacity>
<Text style={styles.prerollValue}>{ttsPrerollSec.toFixed(1)} s</Text>
<TouchableOpacity
style={styles.prerollButton}
onPress={() => {
const next = Math.min(TTS_PREROLL_MAX_SEC, Math.round((ttsPrerollSec + 0.5) * 10) / 10);
setTtsPrerollSec(next);
AsyncStorage.setItem(TTS_PREROLL_STORAGE_KEY, String(next));
}}
disabled={ttsPrerollSec >= TTS_PREROLL_MAX_SEC}
>
<Text style={styles.prerollButtonText}>+0.5</Text>
</TouchableOpacity>
</View>
</View>
)}
{ttsEnabled && (
<View style={{marginTop: 20}}>
<Text style={styles.toggleLabel}>Stimme (geraetelokal)</Text>
@@ -561,7 +697,10 @@ const SettingsScreen: React.FC = () => {
</Text>
<Text style={styles.voiceRowMeta}>{(v.size / 1024).toFixed(0)} KB</Text>
</TouchableOpacity>
{xttsVoice === v.name && <Text style={styles.voiceRowCheck}>{'\u2713'}</Text>}
{loadingVoice === v.name && (
<ActivityIndicator size="small" color="#0096FF" style={{marginRight: 8}} />
)}
{xttsVoice === v.name && loadingVoice !== v.name && <Text style={styles.voiceRowCheck}>{'\u2713'}</Text>}
<TouchableOpacity onPress={() => deleteVoice(v.name)} style={styles.voiceRowDelete}>
<Text style={styles.voiceRowDeleteIcon}>X</Text>
</TouchableOpacity>
@@ -1118,6 +1257,34 @@ const styles = StyleSheet.create({
bottomSpacer: {
height: 40,
},
prerollRow: {
flexDirection: 'row',
alignItems: 'center',
justifyContent: 'center',
marginTop: 12,
gap: 16,
},
prerollButton: {
backgroundColor: '#2A2A3E',
paddingHorizontal: 18,
paddingVertical: 10,
borderRadius: 8,
minWidth: 72,
alignItems: 'center',
},
prerollButtonText: {
color: '#FFFFFF',
fontSize: 16,
fontWeight: '600',
},
prerollValue: {
color: '#FFFFFF',
fontSize: 20,
fontWeight: '700',
minWidth: 80,
textAlign: 'center',
},
});
export default SettingsScreen;
+55 -7
@@ -9,6 +9,7 @@
import { Platform, PermissionsAndroid, NativeModules } from 'react-native';
import Sound from 'react-native-sound';
import RNFS from 'react-native-fs';
import AsyncStorage from '@react-native-async-storage/async-storage';
import AudioRecorderPlayer, {
AudioEncoderAndroidType,
AudioSourceAndroidType,
@@ -41,7 +42,7 @@ const { AudioFocus, PcmStreamPlayer } = NativeModules as {
release: () => Promise<boolean>;
};
PcmStreamPlayer?: {
start: (sampleRate: number, channels: number) => Promise<boolean>;
start: (sampleRate: number, channels: number, prerollSeconds: number) => Promise<boolean>;
writeChunk: (base64Pcm: string) => Promise<boolean>;
end: () => Promise<boolean>;
stop: () => Promise<boolean>;
@@ -73,12 +74,52 @@ const AUDIO_ENCODING = 'audio/wav';
// VAD (Voice Activity Detection): silence detection
const VAD_SILENCE_THRESHOLD_DB = -45; // dB below which audio counts as "silence"
const VAD_SILENCE_DURATION_MS = 1800; // ms of silence before auto-stop
const VAD_SPEECH_THRESHOLD_DB = -28; // dB above which audio counts as "speech" (speech gate); higher = less ambient noise
const VAD_SPEECH_MIN_MS = 500; // ms of speech before a recording counts; longer = no more coughs/knocks
// Max duration of a recording in conversation mode (emergency cutoff against runaway loops)
const MAX_RECORDING_MS = 30000;
// VAD silence (in seconds): how long a pause in speech is tolerated before
// the recording is ended automatically. Adjustable in the app settings.
export const VAD_SILENCE_DEFAULT_SEC = 2.8;
export const VAD_SILENCE_MIN_SEC = 1.0;
export const VAD_SILENCE_MAX_SEC = 8.0;
export const VAD_SILENCE_STORAGE_KEY = 'aria_vad_silence_sec';
async function loadVadSilenceMs(): Promise<number> {
try {
const raw = await AsyncStorage.getItem(VAD_SILENCE_STORAGE_KEY);
if (raw != null) {
const n = parseFloat(raw);
if (isFinite(n) && n >= VAD_SILENCE_MIN_SEC && n <= VAD_SILENCE_MAX_SEC) {
return Math.round(n * 1000);
}
}
} catch {}
return Math.round(VAD_SILENCE_DEFAULT_SEC * 1000);
}
// Max duration of a recording (emergency cutoff against runaway loops). Raised to
// 2 minutes so that longer explanations also get through.
const MAX_RECORDING_MS = 120000;
// Pre-roll: how much audio sits in the AudioTrack buffer before play() starts.
// Adjustable via Diagnostic/Settings (key: aria_tts_preroll_sec).
export const TTS_PREROLL_DEFAULT_SEC = 3.5;
export const TTS_PREROLL_MIN_SEC = 1.0;
export const TTS_PREROLL_MAX_SEC = 6.0;
export const TTS_PREROLL_STORAGE_KEY = 'aria_tts_preroll_sec';
async function loadPrerollSec(): Promise<number> {
try {
const raw = await AsyncStorage.getItem(TTS_PREROLL_STORAGE_KEY);
if (raw != null) {
const n = parseFloat(raw);
if (isFinite(n) && n >= TTS_PREROLL_MIN_SEC && n <= TTS_PREROLL_MAX_SEC) {
return n;
}
}
} catch {}
return TTS_PREROLL_DEFAULT_SEC;
}
// --- Audio-Service ---
@@ -216,12 +257,14 @@ class AudioService {
// Pause other apps during recording (music, videos, etc.)
AudioFocus?.requestExclusive().catch(() => {});
// Enable VAD
// Enable VAD; the silence duration comes from AsyncStorage (configurable in Settings)
this.vadEnabled = autoStop;
if (autoStop) {
const vadSilenceMs = await loadVadSilenceMs();
console.log('[Audio] VAD-Stille:', vadSilenceMs, 'ms');
this.vadTimer = setInterval(() => {
const silenceDuration = Date.now() - this.lastSpeechTime;
if (silenceDuration >= VAD_SILENCE_DURATION_MS) {
if (silenceDuration >= vadSilenceMs) {
console.log(`[Audio] VAD: ${silenceDuration}ms Stille — Auto-Stop`);
this.silenceListeners.forEach(cb => cb());
}
@@ -373,8 +416,9 @@ class AudioService {
this.pcmBuffer = [];
this.pcmBytesCollected = 0;
if (!silent) {
const prerollSec = await loadPrerollSec();
try {
await PcmStreamPlayer!.start(sampleRate, channels);
await PcmStreamPlayer!.start(sampleRate, channels, prerollSec);
} catch (err) {
console.error('[Audio] PcmStreamPlayer.start fehlgeschlagen:', err);
this.pcmStreamActive = false;
@@ -397,6 +441,10 @@ class AudioService {
if (isFinal) {
if (!silent) {
// end() now resolves only once the native writer thread has finished
// (all samples played out); only then release the AudioFocus,
// so that Spotify/YouTube do not turn back up while the pre-roll
// is still playing out.
try { await PcmStreamPlayer!.end(); } catch {}
AudioFocus?.release().catch(() => {});
}
+160 -47
@@ -150,6 +150,15 @@ def _small_range_to_words(m):
return f"{_num_to_words_de(a)} bis {_num_to_words_de(b)}"
def _decimal_to_words(m):
"""'0.1' / '0,1' → 'null komma eins', '1,25' → 'eins komma zwei fuenf'."""
int_part = int(m.group(1))
dec_part = m.group(2)
int_word = _num_to_words_de(int_part) if 0 <= int_part <= 59 else str(int_part)
dec_words = " ".join(_num_to_words_de(int(d)) for d in dec_part)
return f"{int_word} komma {dec_words}"
_UNIT_WORDS = [
(r'\bTB\b', 'Terabyte'),
(r'\bGB\b', 'Gigabyte'),
@@ -236,6 +245,11 @@ def clean_text_for_tts(text: str) -> str:
# Small number ranges without "Uhr": "5-6" → "fuenf bis sechs"
t = _re_tts.sub(r'\b(\d{1,2})\s*[-]\s*(\d{1,2})\b', _small_range_to_words, t)
# Decimal numbers: "0.1" / "0,5" / "1,25" → "null komma eins" / "null komma fuenf" / ...
# Must run before "number + unit", otherwise the unit rule eats the fractional part.
# Lookbehind and lookahead prevent matches inside IP-like strings such as 192.168.1.1
# (with the lookahead alone, the trailing "1.1" would still match).
t = _re_tts.sub(r'\b(?<![.,])(\d+)[.,](\d+)(?![.,\d])', _decimal_to_words, t)
# Number + unit: "22GB" → "22 Gigabyte" (insert a space)
t = _re_tts.sub(r'(\d+)([A-Za-z]{1,4})\b', r'\1 \2', t)
@@ -243,6 +257,12 @@ def clean_text_for_tts(text: str) -> str:
for pat, repl in _UNIT_WORDS:
t = _re_tts.sub(pat, repl, t)
# Generic spelling-out: all remaining uppercase words of 2-5 characters
# (XTTS, USB, DNS, JSON, HTML, ...) → "X T T S". Runs AFTER the explicit list,
# so that TTS/GPU/... are already resolved. "WLAN"-like acronyms that are spoken
# as a word can be overridden explicitly in _UNIT_WORDS if needed.
t = _re_tts.sub(r'\b([A-Z]{2,5})\b', lambda m: " ".join(m.group(1)), t)
# Quotation marks
t = _re_tts.sub(r'["""„`]', '', t)
@@ -305,8 +325,16 @@ class STTEngine:
Recognized text or an empty string.
"""
if self.model is None:
logger.error("Whisper-Modell nicht initialisiert")
return ""
# Lazy load: STT normally runs remotely on the Gamebox.
# Only when the fallback kicks in here do we load the model locally.
logger.info("Lokales Whisper-Fallback — Modell wird nachgeladen...")
try:
self.initialize()
except Exception:
logger.exception("Lokales Whisper konnte nicht geladen werden")
return ""
if self.model is None:
return ""
try:
# Normalize audio as float32
@@ -503,6 +531,9 @@ class ARIABridge:
# Used for the ARIA response that follows directly, then reset.
# This way each device can get its preferred voice (per request).
self._next_voice_override: Optional[str] = None
# STT requests currently waiting for a reply from the whisper-bridge (Gamebox).
# requestId → Future with the text (or None on error).
self._pending_stt: dict[str, asyncio.Future] = {}
def initialize(self) -> None:
"""Initializes all components.
@@ -515,8 +546,9 @@ class ARIABridge:
logger.info("ARIA Voice Bridge startet...")
logger.info("=" * 50)
# ALWAYS load STT: it processes audio from the app (needs no sound device)
self.stt_engine.initialize()
# STT is handled by the whisper-bridge (Gamebox) by default.
# Local Whisper is only a fallback and is loaded lazily when the remote side
# does not respond. This saves RAM on the VM and startup time.
# Check audio hardware (for local microphone/speaker)
self.audio_available = False
@@ -1175,11 +1207,16 @@ class ARIABridge:
changed = True
if "whisperModel" in payload:
new_model = payload["whisperModel"]
if new_model and new_model != self.stt_engine.model_size:
logger.info("[rvs] Whisper-Modell Wechsel: %s -> %s (laedt...)", self.stt_engine.model_size, new_model)
loop = asyncio.get_event_loop()
if await loop.run_in_executor(None, self.stt_engine.reload, new_model):
changed = True
allowed = {"tiny", "base", "small", "medium", "large-v3"}
if new_model in allowed and new_model != self.stt_engine.model_size:
# Remember it and pass it along to the whisper-bridge (Gamebox).
# The local model is NOT loaded; only the fallback needs it,
# and that happens on demand only when the remote side does not respond.
logger.info("[rvs] Whisper-Modell → %s (nur Config; Modell laedt Gamebox)",
new_model)
self.stt_engine.model_size = new_model
self.stt_engine.model = None
changed = True
# Persist to the shared volume
if changed:
try:
@@ -1339,22 +1376,117 @@ class ARIABridge:
mime_type, duration_ms, len(audio_b64) // 1365)
asyncio.create_task(self._process_app_audio(audio_b64, mime_type))
elif msg_type == "stt_response":
# Reply from the whisper-bridge to our stt_request
request_id = payload.get("requestId", "")
future = self._pending_stt.get(request_id)
if future is None or future.done():
return
error = payload.get("error", "")
if error:
logger.warning("[rvs] stt_response Fehler: %s", error)
future.set_result(None)
else:
text = payload.get("text", "")
stt_ms = payload.get("sttMs", 0)
model = payload.get("model", "?")
logger.info("[rvs] Remote-STT OK (%s, %dms): '%s'", model, stt_ms, (text or "")[:80])
future.set_result(text)
return
else:
logger.debug("[rvs] Unbekannter Typ: %s", msg_type)
# STT orchestration: remote first (Gamebox), local fallback.
# The timeout is generous so that even a first-time model load
# on the Gamebox (up to ~30s for large-v3) goes through.
_STT_REMOTE_TIMEOUT_S = 45.0
async def _process_app_audio(self, audio_b64: str, mime_type: str) -> None:
"""Decodes app audio (Base64 AAC/MP4), converts to 16kHz PCM, runs STT, sends to core."""
"""App audio → STT → aria-core. Primarily via whisper-bridge (RVS), local fallback."""
# Try remote first
text = await self._stt_remote(audio_b64, mime_type)
if text is None:
# Remote did not answer → local Whisper
logger.warning("[rvs] Remote-STT nicht verfuegbar — Fallback auf lokales Whisper")
text = await self._stt_local(audio_b64, mime_type)
if text is None:
return
if text.strip():
logger.info("[rvs] STT Ergebnis: '%s'", text[:80])
# FIRST send to aria-core (most important step)
await self.send_to_core(text, source="app-voice")
# Send the STT text to RVS (for display in the app + Diagnostic)
# sender="stt" so that the bridge ignores it (no loop)
try:
await self._send_to_rvs({
"type": "chat",
"payload": {
"text": text,
"sender": "stt",
},
"timestamp": int(asyncio.get_event_loop().time() * 1000),
})
except Exception as e:
logger.warning("[rvs] STT-Text konnte nicht an RVS gesendet werden: %s", e)
else:
logger.info("[rvs] Keine Sprache erkannt — ignoriert")
async def _stt_remote(self, audio_b64: str, mime_type: str) -> Optional[str]:
"""Sends audio to the whisper-bridge and waits for the stt_response.
Returns:
str: recognized text (may be empty)
None: remote STT unreachable, or error/timeout (→ fallback)
"""
if self.ws_rvs is None:
return None
request_id = str(uuid.uuid4())
loop = asyncio.get_event_loop()
future: asyncio.Future = loop.create_future()
self._pending_stt[request_id] = future
try:
model = getattr(self.stt_engine, "model_size", "small")
logger.info("[rvs] stt_request → whisper-bridge (id=%s, model=%s, %dKB)",
request_id[:8], model, len(audio_b64) // 1365)
ok = await self._send_to_rvs({
"type": "stt_request",
"payload": {
"requestId": request_id,
"audio": audio_b64,
"mimeType": mime_type,
"model": model,
"language": getattr(self.stt_engine, "language", "de"),
},
"timestamp": int(loop.time() * 1000),
})
if not ok:
logger.warning("[rvs] stt_request konnte nicht gesendet werden — skip Remote")
return None
return await asyncio.wait_for(future, timeout=self._STT_REMOTE_TIMEOUT_S)
except asyncio.TimeoutError:
logger.warning("[rvs] Remote-STT Timeout (%.0fs)", self._STT_REMOTE_TIMEOUT_S)
return None
except Exception as e:
logger.warning("[rvs] Remote-STT Fehler: %s", e)
return None
finally:
self._pending_stt.pop(request_id, None)
async def _stt_local(self, audio_b64: str, mime_type: str) -> Optional[str]:
"""Local Whisper fallback: FFmpeg → float32 → stt_engine.transcribe."""
loop = asyncio.get_event_loop()
tmp_in = None
tmp_out = None
try:
# Base64 → temp file
ext = ".mp4" if "mp4" in mime_type else ".wav" if "wav" in mime_type else ".ogg"
tmp_in = tempfile.NamedTemporaryFile(suffix=ext, delete=False)
tmp_in.write(base64.b64decode(audio_b64))
tmp_in.close()
# FFmpeg: any format → 16kHz mono PCM (raw float32)
tmp_out = tempfile.NamedTemporaryFile(suffix=".raw", delete=False)
tmp_out.close()
@@ -1369,55 +1501,34 @@ class ARIABridge:
)
if result.returncode != 0:
logger.error("[rvs] FFmpeg Fehler: %s", result.stderr.decode()[:200])
return
return None
# Read PCM → numpy float32
audio_data = np.fromfile(tmp_out.name, dtype=np.float32)
if len(audio_data) == 0:
logger.warning("[rvs] Leere Audio-Daten nach Konvertierung")
return
return None
duration_s = len(audio_data) / 16000.0
logger.info("[rvs] Audio konvertiert: %.1fs, %d samples", duration_s, len(audio_data))
# STT
text = await loop.run_in_executor(None, self.stt_engine.transcribe, audio_data)
if text.strip():
logger.info("[rvs] STT Ergebnis: '%s'", text[:80])
# FIRST send to aria-core (most important step)
await self.send_to_core(text, source="app-voice")
# Send the STT text to RVS (for display in the app + Diagnostic)
# sender="stt" so that the bridge ignores it (no loop)
try:
await self._send_to_rvs({
"type": "chat",
"payload": {
"text": text,
"sender": "stt",
},
"timestamp": int(asyncio.get_event_loop().time() * 1000),
})
except Exception as e:
logger.warning("[rvs] STT-Text konnte nicht an RVS gesendet werden: %s", e)
else:
logger.info("[rvs] Keine Sprache erkannt — ignoriert")
logger.info("[rvs] Lokal-STT: %.1fs Audio, model=%s", duration_s, self.stt_engine.model_size)
return await loop.run_in_executor(None, self.stt_engine.transcribe, audio_data)
except Exception:
logger.exception("[rvs] Audio-Verarbeitung fehlgeschlagen")
logger.exception("[rvs] Lokales STT fehlgeschlagen")
return None
finally:
# Clean up temp files
for f in [tmp_in, tmp_out]:
for f in (tmp_in, tmp_out):
if f:
try:
os.unlink(f.name)
except OSError:
pass
async def _send_to_rvs(self, message: dict) -> None:
"""Sends a message to the app (via RVS) with a connection check."""
async def _send_to_rvs(self, message: dict) -> bool:
"""Sends a message to the app (via RVS) with a connection check.
Returns: True if sent successfully, False if the connection is dead.
"""
if self.ws_rvs is None:
return
return False
# Ping check: is the connection really alive?
try:
@@ -1431,12 +1542,14 @@ class ARIABridge:
pass
self.ws_rvs = None
# Reconnect is handled by the connect_to_rvs loop
return
return False
try:
await self.ws_rvs.send(json.dumps(message))
return True
except Exception:
logger.warning("[rvs] Sendefehler — RVS nicht erreichbar")
return False
# ── Log streaming to the app ─────────────────────────────
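The remote-first STT flow in the hunks above pairs each `stt_request` with its `stt_response` through a per-request future and falls back to local transcription when the remote side reports an error or times out. A minimal, self-contained sketch of that pattern follows. The names `SttClient`, `send`, and `local_transcribe` are hypothetical and not part of the project; only the message fields (`requestId`, `text`, `error`) and the 45 s default are taken from the diff.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict, Optional


class SttClient:
    """Illustrative only: correlates requests and responses by requestId."""

    def __init__(self, send: Callable[[dict], Awaitable[bool]],
                 timeout_s: float = 45.0) -> None:
        self._send = send                # hypothetical transport coroutine
        self._timeout_s = timeout_s
        self._pending: Dict[str, asyncio.Future] = {}

    def on_response(self, payload: dict) -> None:
        """Resolve the waiting future; ignore unknown or finished ids."""
        future = self._pending.get(payload.get("requestId", ""))
        if future is None or future.done():
            return
        # An error resolves to None so the caller can fall back.
        future.set_result(None if payload.get("error") else payload.get("text", ""))

    async def transcribe_remote(self, audio_b64: str) -> Optional[str]:
        request_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            sent = await self._send({
                "type": "stt_request",
                "payload": {"requestId": request_id, "audio": audio_b64},
            })
            if not sent:
                return None
            return await asyncio.wait_for(future, timeout=self._timeout_s)
        except asyncio.TimeoutError:
            return None
        finally:
            # Always drop the entry so late responses are ignored.
            self._pending.pop(request_id, None)


async def transcribe(client: SttClient, audio_b64: str,
                     local_transcribe: Callable[[str], Awaitable[Optional[str]]]) -> Optional[str]:
    """Remote first; None from the remote path triggers the local fallback."""
    text = await client.transcribe_remote(audio_b64)
    if text is None:
        text = await local_transcribe(audio_b64)
    return text
```

Returning `None` for "remote unusable" and an empty string for "no speech recognized" keeps the two cases apart, so an empty transcript does not trigger a second, slower local pass.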
+26 -1
@@ -438,13 +438,14 @@
</div>
<!-- XTTS voice -->
<div style="display:flex;align-items:center;gap:12px;margin-bottom:12px;">
<div style="display:flex;align-items:center;gap:12px;margin-bottom:6px;">
<label style="color:#8888AA;font-size:12px;">XTTS Stimme:</label>
<select id="diag-xtts-voice" onchange="sendVoiceConfig()" style="background:#1E1E2E;color:#fff;border:1px solid #2A2A3E;border-radius:6px;padding:6px 10px;font-size:13px;">
<option value="">Standard (XTTS Default)</option>
</select>
<button class="btn secondary" onclick="loadXTTSVoices()" style="padding:4px 10px;font-size:11px;">Laden</button>
</div>
<div id="voice-status" style="font-size:11px;min-height:14px;margin-bottom:12px;color:#8888AA;"></div>
<!-- Cloned voices: list with delete buttons -->
<div id="xtts-voice-list" style="margin-bottom:12px;"></div>
@@ -851,6 +852,25 @@
return;
}
if (msg.type === 'voice_ready') {
const v = msg.payload?.voice || '';
const err = msg.payload?.error;
const ms = msg.payload?.loadMs;
const statusEl = document.getElementById('voice-status');
if (statusEl) {
if (err) {
statusEl.textContent = `⚠️ Stimme "${v}" Fehler: ${err}`;
statusEl.style.color = '#FF3B30';
} else {
statusEl.textContent = `✅ Stimme "${v || 'Standard'}" bereit${ms ? ` (${(ms/1000).toFixed(1)}s)` : ''}`;
statusEl.style.color = '#34C759';
}
setTimeout(() => { if (statusEl) statusEl.textContent = ''; }, 5000);
}
addLog('info', 'xtts', err ? `Voice "${v}": ${err}` : `Voice "${v || 'Standard'}" bereit`);
return;
}
if (msg.type === 'watchdog') {
const colors = { warning: '#FFD60A', fixing: '#FF9500', fixed: '#34C759', error: '#FF3B30' };
const color = colors[msg.status] || '#FFD60A';
@@ -1551,6 +1571,11 @@
const xttsVoice = document.getElementById('diag-xtts-voice').value;
const whisperModel = document.getElementById('diag-whisper-model').value;
send({ action: 'send_voice_config', ttsEnabled, xttsVoice, whisperModel });
const statusEl = document.getElementById('voice-status');
if (statusEl && xttsVoice) {
statusEl.textContent = `⏳ Stimme "${xttsVoice}" wird geladen...`;
statusEl.style.color = '#FFD60A';
}
}
// ── Show/hide password field ─────────────────────
+11
@@ -626,6 +626,17 @@ function connectRVS(forcePlain) {
// Mode broadcast from the bridge → forward to the browser clients
log("info", "rvs", `Mode-Broadcast: ${msg.payload?.mode} (${msg.payload?.name})`);
broadcast({ type: "mode", payload: msg.payload });
} else if (msg.type === "voice_ready") {
// XTTS bridge reports that the voice has finished loading → pass through to the browser
const v = msg.payload?.voice || "";
const err = msg.payload?.error;
const ms = msg.payload?.loadMs;
if (err) {
log("warn", "rvs", `Voice-Ready Fehler fuer "${v}": ${err}`);
} else {
log("info", "rvs", `Voice "${v || "default"}" geladen${ms ? ` in ${(ms/1000).toFixed(1)}s` : ""}`);
}
broadcast({ type: "voice_ready", payload: msg.payload });
} else {
log("debug", "rvs", `Nachricht: ${JSON.stringify(msg).slice(0, 150)}`);
}
+2
@@ -19,6 +19,8 @@ const ALLOWED_TYPES = new Set([
"agent_activity", "cancel_request",
"audio_pcm",
"xtts_delete_voice",
"voice_preload", "voice_ready",
"stt_request", "stt_response",
]);
// Token room: token -> { clients: Set<ws> }
-5
@@ -1,5 +0,0 @@
FROM node:22-alpine
WORKDIR /app
COPY bridge.js package.json ./
RUN npm install --production
CMD ["node", "bridge.js"]
-487
@@ -1,487 +0,0 @@
/**
* ARIA XTTS Bridge: connects the XTTS v2 server to the RVS
*
* Receives xtts_request via RVS → renders audio via the XTTS API → sends it back
* Receives voice_upload → stores the voice sample for cloning
* Receives xtts_list_voices → lists the available voices
*/
const WebSocket = require("ws");
const http = require("http");
const https = require("https");
const fs = require("fs");
const path = require("path");
const XTTS_API_URL = process.env.XTTS_API_URL || "http://xtts:8000";
const RVS_HOST = process.env.RVS_HOST || "";
const RVS_PORT = process.env.RVS_PORT || "443";
const RVS_TLS = process.env.RVS_TLS || "true";
const RVS_TLS_FALLBACK = process.env.RVS_TLS_FALLBACK || "true";
const RVS_TOKEN = process.env.RVS_TOKEN || "";
const VOICES_DIR = "/voices";
function log(msg) {
console.log(`[${new Date().toISOString()}] ${msg}`);
}
// ── RVS Verbindung ──────────────────────────────────
let rvsWs = null;
let retryDelay = 2;
function connectRVS(forcePlain) {
if (!RVS_HOST || !RVS_TOKEN) {
log("RVS nicht konfiguriert — beende");
process.exit(1);
}
const useTls = RVS_TLS === "true" && !forcePlain;
const proto = useTls ? "wss" : "ws";
const url = `${proto}://${RVS_HOST}:${RVS_PORT}?token=${RVS_TOKEN}`;
log(`Verbinde zu RVS: ${proto}://${RVS_HOST}:${RVS_PORT}`);
const ws = new WebSocket(url);
ws.on("open", () => {
log("RVS verbunden — warte auf TTS-Requests");
rvsWs = ws;
retryDelay = 2;
// Keepalive
setInterval(() => {
if (ws.readyState === WebSocket.OPEN) {
ws.ping();
ws.send(JSON.stringify({ type: "heartbeat", timestamp: Date.now() }));
}
}, 25000);
});
ws.on("message", async (raw) => {
try {
const msg = JSON.parse(raw.toString());
if (msg.type === "xtts_request") {
await handleTTSRequest(msg.payload);
} else if (msg.type === "voice_upload") {
await handleVoiceUpload(msg.payload);
} else if (msg.type === "xtts_list_voices") {
await handleListVoices();
} else if (msg.type === "xtts_delete_voice") {
await handleDeleteVoice(msg.payload);
}
} catch (err) {
log(`Fehler: ${err.message}`);
}
});
ws.on("close", () => {
log("RVS Verbindung geschlossen");
rvsWs = null;
setTimeout(() => connectRVS(), Math.min(retryDelay * 1000, 30000));
retryDelay = Math.min(retryDelay * 2, 30);
});
ws.on("error", (err) => {
log(`RVS Fehler: ${err.message}`);
if (useTls && RVS_TLS_FALLBACK === "true") {
log("TLS fehlgeschlagen — Fallback auf ws://");
ws.removeAllListeners();
try { ws.close(); } catch (_) {}
connectRVS(true);
}
});
}
// ── TTS Request Handler ─────────────────────────────
// ── TTS-Queue ──────────────────────────────────────
// XTTS processes requests sequentially so that streams do not overlap.
// Without the queue, parallel requests would stream in parallel → the app would
// receive interleaved PCM chunks from two renders → it sounds like chaos.
let ttsQueue = Promise.resolve();
function handleTTSRequest(payload) {
ttsQueue = ttsQueue.then(() => _runTTSRequest(payload)).catch(err => {
log(`TTS-Queue Fehler: ${err.message}`);
});
return ttsQueue;
}
async function _runTTSRequest(payload) {
const { text, voice, requestId, language, messageId } = payload;
if (!text) return;
// Markdown cleanup (the bridge now cleans up as well; this is a safety net)
let cleanText = text
.replace(/\*\*([^*]+)\*\*/g, "$1")
.replace(/\*([^*]+)\*/g, "$1")
.replace(/`([^`]+)`/g, "$1")
.replace(/```[\s\S]*?```/g, "")
.replace(/\[([^\]]+)\]\([^)]+\)/g, "$1")
.replace(/#{1,6}\s*/g, "")
.replace(/>\s*/g, "")
.replace(/[-*]\s+/g, "")
.replace(/\n{2,}/g, ". ")
.replace(/\n/g, ", ")
.replace(/\s{2,}/g, " ")
.replace(/["""„]/g, "")
.replace(/\(\)/g, "")
.trim();
log(`TTS-Request (streaming): "${cleanText.slice(0, 80)}..." (${cleanText.length} chars, voice: ${voice || "default"})`);
try {
const voiceSample = voice ? path.join(VOICES_DIR, `${voice}.wav`) : null;
const hasCustomVoice = voiceSample && fs.existsSync(voiceSample);
let chunkIndex = 0;
let pcmMeta = null;
const onChunk = (pcmBase64, meta) => {
if (!pcmMeta) pcmMeta = meta;
sendToRVS({
type: "audio_pcm",
payload: {
requestId: requestId || "",
messageId: messageId || "",
base64: pcmBase64,
format: "pcm_s16le",
sampleRate: meta.sampleRate,
channels: meta.channels,
voice: voice || "default",
chunk: chunkIndex++,
final: false,
},
timestamp: Date.now(),
});
};
// /tts_stream for real streaming (works in XTTS local mode).
// If the server runs in apiManual/api mode: 400 → fallback to /tts_to_audio/.
try {
await streamXTTSAsPCM(
cleanText,
language || "de",
hasCustomVoice ? voiceSample : null,
onChunk,
);
} catch (streamErr) {
log(`/tts_stream fehlgeschlagen (${streamErr.message.slice(0, 100)}) — Fallback /tts_to_audio/`);
await streamXTTSBatch(
cleanText,
language || "de",
hasCustomVoice ? voiceSample : null,
onChunk,
);
}
// At the end: final flag so the app knows it is "done" and the cache can be written
if (pcmMeta) {
sendToRVS({
type: "audio_pcm",
payload: {
requestId: requestId || "",
messageId: messageId || "",
base64: "",
format: "pcm_s16le",
sampleRate: pcmMeta.sampleRate,
channels: pcmMeta.channels,
voice: voice || "default",
chunk: chunkIndex++,
final: true,
},
timestamp: Date.now(),
});
}
log(`TTS komplett: ${chunkIndex} PCM-Frames gestreamt (${cleanText.length} chars)`);
} catch (err) {
log(`TTS Fehler: ${err.message}`);
sendToRVS({
type: "xtts_response",
payload: { requestId, error: err.message },
timestamp: Date.now(),
});
}
}
/**
* Calls /tts_stream, the real streaming endpoint of the daswer123 server.
* Sends what the server asks for (Allow: GET); a POST with a JSON body
* fails with 405. Some versions want GET + query, others POST + JSON,
* so test which one works.
*/
function streamXTTSAsPCM(text, language, speakerWav, onPcmChunk) {
return new Promise((resolve, reject) => {
// Important: speaker_wav MUST be present as a query key (Pydantic required),
// even for the default voice with an empty value. Otherwise the server returns HTTP 422.
// stream_chunk_size=100: a compromise between first-audio latency and
// gap risk. On an RTX 3060 (RTF 1.48) about 3s until first audio, with chunks large
// enough that the AudioTrack buffer (128KB ≈ 2.7s) does not run dry
// between chunks.
const qs = new URLSearchParams();
qs.set("text", text);
qs.set("language", language || "de");
qs.set("speaker_wav", speakerWav || "");
qs.set("stream_chunk_size", "100");
const url = new URL(XTTS_API_URL);
const fullPath = `/tts_stream?${qs.toString()}`;
const options = {
hostname: url.hostname,
port: url.port || 80,
path: fullPath,
method: "GET",
timeout: 60000,
};
log(`TTS GET /tts_stream?text=${text.slice(0, 30)}... (voice=${speakerWav ? "custom" : "default"})`);
const req = http.request(options, (res) => {
if (res.statusCode !== 200) {
let body = "";
res.on("data", (d) => { body += d.toString(); });
res.on("end", () => {
log(`XTTS /tts_stream ${res.statusCode}: ${body.slice(0, 300)}`);
reject(new Error(`XTTS HTTP ${res.statusCode}: ${body.slice(0, 200)}`));
});
return;
}
log(`TTS stream verbunden, empfange PCM...`);
let headerParsed = false;
let sampleRate = 24000;
let channels = 1;
let leftover = Buffer.alloc(0); // leftover bytes carried over into the next chunk
const HEADER_BYTES = 44;
let headerBuf = Buffer.alloc(0);
const PCM_CHUNK_BYTES = 8192; // ~170ms at 24kHz s16 mono
res.on("data", (chunk) => {
let data = chunk;
// Consume the WAV header (44 bytes)
if (!headerParsed) {
headerBuf = Buffer.concat([headerBuf, data]);
if (headerBuf.length < HEADER_BYTES) return;
// Header lesen
const header = headerBuf.slice(0, HEADER_BYTES);
try {
channels = header.readUInt16LE(22);
sampleRate = header.readUInt32LE(24);
} catch (_) {}
headerParsed = true;
data = headerBuf.slice(HEADER_BYTES);
}
// leftover from the previous chunk + new data
let combined = Buffer.concat([leftover, data]);
// Split into PCM_CHUNK_BYTES pieces (a multiple of 2, so no sample is split)
while (combined.length >= PCM_CHUNK_BYTES) {
const slice = combined.slice(0, PCM_CHUNK_BYTES);
combined = combined.slice(PCM_CHUNK_BYTES);
onPcmChunk(slice.toString("base64"), { sampleRate, channels });
}
leftover = combined;
});
res.on("end", () => {
// Send the remaining data
if (leftover.length > 0) {
onPcmChunk(leftover.toString("base64"), { sampleRate, channels });
}
resolve();
});
res.on("error", reject);
});
req.on("error", reject);
req.on("timeout", () => { req.destroy(); reject(new Error("XTTS API Timeout (60s)")); });
req.end();
});
}
/**
* Fallback: /tts_to_audio/ (POST JSON), which renders completely and then responds.
* No real streaming, but stable as a backup when /tts_stream does not work.
* Same chunking logic as streamXTTSAsPCM: parses the WAV header, slices the PCM.
*/
function streamXTTSBatch(text, language, speakerWav, onPcmChunk) {
return new Promise((resolve, reject) => {
const body = JSON.stringify({
text,
language: language || "de",
speaker_wav: speakerWav || "",
});
const url = new URL(XTTS_API_URL);
const options = {
hostname: url.hostname,
port: url.port || 80,
path: "/tts_to_audio/",
method: "POST",
headers: {
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(body),
},
timeout: 60000,
};
const req = http.request(options, (res) => {
if (res.statusCode !== 200) {
let rb = "";
res.on("data", (d) => { rb += d.toString(); });
res.on("end", () => reject(new Error(`XTTS Batch HTTP ${res.statusCode}: ${rb.slice(0, 200)}`)));
return;
}
let headerParsed = false;
let sampleRate = 24000;
let channels = 1;
let leftover = Buffer.alloc(0);
let headerBuf = Buffer.alloc(0);
const HEADER_BYTES = 44;
const PCM_CHUNK_BYTES = 8192;
res.on("data", (chunk) => {
let data = chunk;
if (!headerParsed) {
headerBuf = Buffer.concat([headerBuf, data]);
if (headerBuf.length < HEADER_BYTES) return;
const header = headerBuf.slice(0, HEADER_BYTES);
try { channels = header.readUInt16LE(22); sampleRate = header.readUInt32LE(24); } catch (_) {}
headerParsed = true;
data = headerBuf.slice(HEADER_BYTES);
}
let combined = Buffer.concat([leftover, data]);
while (combined.length >= PCM_CHUNK_BYTES) {
const slice = combined.slice(0, PCM_CHUNK_BYTES);
combined = combined.slice(PCM_CHUNK_BYTES);
onPcmChunk(slice.toString("base64"), { sampleRate, channels });
}
leftover = combined;
});
res.on("end", () => {
if (leftover.length > 0) onPcmChunk(leftover.toString("base64"), { sampleRate, channels });
resolve();
});
res.on("error", reject);
});
req.on("error", reject);
req.on("timeout", () => { req.destroy(); reject(new Error("XTTS Batch Timeout (60s)")); });
req.write(body);
req.end();
});
}
// ── Voice Upload Handler ────────────────────────────
async function handleVoiceUpload(payload) {
const { name, samples } = payload;
if (!name || !samples || !Array.isArray(samples) || samples.length === 0) {
log("Voice Upload: Ungueltige Daten");
return;
}
log(`Voice Upload: "${name}" (${samples.length} Samples)`);
try {
// Concatenate all samples
const buffers = samples.map(s => Buffer.from(s.base64, "base64"));
const combined = Buffer.concat(buffers);
// Save as WAV
fs.mkdirSync(VOICES_DIR, { recursive: true });
const filePath = path.join(VOICES_DIR, `${name.replace(/[^a-zA-Z0-9_-]/g, "_")}.wav`);
fs.writeFileSync(filePath, combined);
log(`Voice gespeichert: ${filePath} (${(combined.length / 1024).toFixed(0)}KB)`);
sendToRVS({
type: "xtts_voice_saved",
payload: { name, size: combined.length, path: filePath },
timestamp: Date.now(),
});
} catch (err) {
log(`Voice Upload Fehler: ${err.message}`);
}
}
// ── Voice Delete Handler ────────────────────────────
async function handleDeleteVoice(payload) {
const { name } = payload || {};
if (!name || typeof name !== "string") {
log("Voice Delete: ungueltiger Name");
return;
}
const safe = name.replace(/[^a-zA-Z0-9_-]/g, "_");
const filePath = path.join(VOICES_DIR, `${safe}.wav`);
try {
if (fs.existsSync(filePath)) {
fs.unlinkSync(filePath);
log(`Voice geloescht: ${filePath}`);
} else {
log(`Voice Delete: Datei existiert nicht (${filePath})`);
}
// Send the updated list to all clients
await handleListVoices();
} catch (err) {
log(`Voice Delete Fehler: ${err.message}`);
}
}
// ── Voice List Handler ──────────────────────────────
async function handleListVoices() {
try {
const files = fs.existsSync(VOICES_DIR)
? fs.readdirSync(VOICES_DIR).filter(f => f.endsWith(".wav"))
: [];
const voices = files.map(f => ({
name: path.basename(f, ".wav"),
file: f,
size: fs.statSync(path.join(VOICES_DIR, f)).size,
}));
log(`Stimmen: ${voices.length} verfuegbar`);
sendToRVS({
type: "xtts_voices_list",
payload: { voices },
timestamp: Date.now(),
});
} catch (err) {
log(`Stimmen-Liste Fehler: ${err.message}`);
}
}
// ── RVS senden ──────────────────────────────────────
function sendToRVS(msg) {
if (rvsWs && rvsWs.readyState === WebSocket.OPEN) {
rvsWs.send(JSON.stringify(msg));
}
}
// ── Start ───────────────────────────────────────────
log("ARIA XTTS Bridge startet...");
log(`XTTS API: ${XTTS_API_URL}`);
log(`RVS: ${RVS_HOST}:${RVS_PORT}`);
// Wait until the XTTS API is reachable
function waitForXTTS(callback, attempts) {
if (attempts <= 0) { log("XTTS API nicht erreichbar — starte trotzdem"); callback(); return; }
http.get(`${XTTS_API_URL}/docs`, (res) => {
log(`XTTS API erreichbar (HTTP ${res.statusCode})`);
callback();
}).on("error", () => {
log(`XTTS API noch nicht bereit — warte (${attempts} Versuche uebrig)...`);
setTimeout(() => waitForXTTS(callback, attempts - 1), 10000); // 10s instead of 5s (loading the model takes a while)
});
}
waitForXTTS(() => connectRVS(), 30); // wait at most 5min
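Both streaming handlers in the removed bridge skip a 44-byte WAV header, read the channel count and sample rate from it, and re-slice the remaining PCM into fixed 8192-byte frames. A minimal Python sketch of that chunking follows. It assumes a canonical 44-byte header, as the removed code does (real WAV files may carry extra chunks), and the class name `PcmChunker` is hypothetical.

```python
import struct
from typing import List

HEADER_BYTES = 44        # canonical PCM WAV header; an assumption, as in the bridge code
PCM_CHUNK_BYTES = 8192   # 4096 s16 samples, about 171 ms at 24 kHz mono


class PcmChunker:
    """Illustrative only: strips the WAV header and emits fixed-size PCM frames."""

    def __init__(self) -> None:
        self.sample_rate = 24000   # defaults used until the header is parsed
        self.channels = 1
        self._header = b""
        self._header_done = False
        self._leftover = b""

    def feed(self, data: bytes) -> List[bytes]:
        """Consume one network chunk, return zero or more complete frames."""
        if not self._header_done:
            self._header += data
            if len(self._header) < HEADER_BYTES:
                return []
            # Little-endian fields at fixed offsets of the canonical header.
            self.channels = struct.unpack_from("<H", self._header, 22)[0]
            self.sample_rate = struct.unpack_from("<I", self._header, 24)[0]
            data = self._header[HEADER_BYTES:]
            self._header_done = True
        buf = self._leftover + data
        cut = len(buf) - len(buf) % PCM_CHUNK_BYTES
        self._leftover = buf[cut:]
        return [buf[i:i + PCM_CHUNK_BYTES] for i in range(0, cut, PCM_CHUNK_BYTES)]

    def flush(self) -> bytes:
        """Return whatever is left once the stream has ended."""
        rest, self._leftover = self._leftover, b""
        return rest
```

Because the frame size is even, a 16-bit sample is never split across two frames as long as the stream itself starts on a sample boundary.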
+50 -31
@@ -1,7 +1,7 @@
# ════════════════════════════════════════════════
# ARIA XTTS v2 — GPU TTS Server
# ARIA Gamebox Stack — GPU F5-TTS + Whisper STT
# Runs on the gaming PC (RTX 3060)
# Connects to the RVS for TTS requests
# Connects to the RVS for TTS/STT requests
# ════════════════════════════════════════════════
#
# Prerequisites:
@@ -10,15 +10,18 @@
# - .env with the RVS connection details
#
# Start: docker compose up -d
# Test: curl http://localhost:8000/docs
# ════════════════════════════════════════════════
services:
# ─── XTTS v2 API Server (GPU) ─────────────────
xtts:
image: daswer123/xtts-api-server:latest
container_name: aria-xtts
# ─── F5-TTS Bridge (GPU) ──────────────────────
# Replaces the former XTTS stack. Receives xtts_request via RVS,
# renders via F5-TTS with voice cloning, streams PCM to the app.
# Voice upload: stores the WAV and lets the whisper-bridge transcribe the
# reference text, so the user does not have to type anything.
f5tts-bridge:
build: ./f5tts
container_name: aria-f5tts-bridge
deploy:
resources:
reservations:
@@ -26,37 +29,53 @@ services:
- driver: nvidia
count: 1
capabilities: [gpu]
ports:
- "8000:8020"
volumes:
- xtts-models:/app/xtts_models # Model-Cache (~2GB)
- ./voices:/voices # Custom Voice Samples
- ./voices:/voices # WAV + TXT reference
- f5tts-models:/root/.cache/huggingface # persist the model cache
environment:
- COQUI_TOS_AGREED=1
# Local mode instead of the default "apiManual": the model stays in GPU VRAM,
# rendering starts immediately, /tts_stream works.
# The image's default CMD reads this ENV: -ms ${MODEL_SOURCE:-"apiManual"}
- MODEL_SOURCE=local
# Point the speaker folder at our mounted voices
- EXAMPLE_FOLDER=/voices
restart: unless-stopped
# ─── XTTS Bridge (verbindet zu RVS) ───────────
xtts-bridge:
build: .
container_name: aria-xtts-bridge
depends_on:
- xtts
volumes:
- ./voices:/voices # shared with the XTTS server
environment:
- XTTS_API_URL=http://xtts:8020
- RVS_HOST=${RVS_HOST}
- RVS_PORT=${RVS_PORT:-443}
- RVS_TLS=${RVS_TLS:-true}
- RVS_TLS_FALLBACK=${RVS_TLS_FALLBACK:-true}
- RVS_TOKEN=${RVS_TOKEN}
- F5TTS_MODEL=${F5TTS_MODEL:-F5TTS_v1_Base}
- F5TTS_DEVICE=${F5TTS_DEVICE:-cuda}
- VOICES_DIR=/voices
restart: unless-stopped
# ─── Whisper STT (GPU) ────────────────────────
# Faster-Whisper on the Gamebox instead of on the VM (CPU),
# which is considerably faster. Connects to the RVS by itself via
# WebSocket, takes stt_request messages from the aria-bridge there
# and answers with stt_response. In addition, the f5tts-bridge
# uses Whisper internally for the reference transcription of
# voice uploads. Preloads the model at startup; on config
# broadcasts (Diagnostic → whisperModel) the model is hot-swapped
# at runtime.
whisper-bridge:
build: ./whisper
container_name: aria-whisper-bridge
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
environment:
- RVS_HOST=${RVS_HOST}
- RVS_PORT=${RVS_PORT:-443}
- RVS_TLS=${RVS_TLS:-true}
- RVS_TLS_FALLBACK=${RVS_TLS_FALLBACK:-true}
- RVS_TOKEN=${RVS_TOKEN}
- WHISPER_MODEL=${WHISPER_MODEL:-small}
- WHISPER_DEVICE=${WHISPER_DEVICE:-cuda}
- WHISPER_COMPUTE_TYPE=${WHISPER_COMPUTE_TYPE:-float16}
- WHISPER_LANGUAGE=${WHISPER_LANGUAGE:-de}
volumes:
- whisper-models:/root/.cache/huggingface
restart: unless-stopped
volumes:
xtts-models:
f5tts-models:
whisper-models:
+21
@@ -0,0 +1,21 @@
FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 python3-pip ffmpeg git \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# PyTorch CUDA wheels first (otherwise f5-tts pulls in CPU-only Torch)
RUN pip3 install --no-cache-dir torch==2.3.1 torchaudio==2.3.1 \
--index-url https://download.pytorch.org/whl/cu121
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY bridge.py .
CMD ["python3", "bridge.py"]
+580
@@ -0,0 +1,580 @@
#!/usr/bin/env python3
"""
ARIA F5-TTS Bridge: runs on the Gamebox (RTX 3060).
Receives xtts_request via RVS → F5-TTS voice cloning on the GPU → streams
16-bit PCM chunks back to the app as audio_pcm messages.
Voice layout in VOICES_DIR:
{name}.wav: reference audio (6-10s, 24kHz mono recommended)
{name}.txt: reference text (UTF-8, what is spoken in the WAV)
On voice_upload we internally send an stt_request to the whisper-bridge
and store the transcription as .txt, so the user does not have to enter any text.
Env:
RVS_HOST, RVS_PORT, RVS_TLS, RVS_TLS_FALLBACK, RVS_TOKEN
F5TTS_MODEL Default: F5TTS_v1_Base
F5TTS_DEVICE Default: cuda
VOICES_DIR Default: /voices
"""
import asyncio
import base64
import json
import logging
import os
import re
import subprocess
import sys
import tempfile
import time
import uuid
from pathlib import Path
from typing import Optional
import numpy as np
import soundfile as sf
import websockets
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%H:%M:%S",
)
logger = logging.getLogger("f5tts-bridge")
# Quiet the HuggingFace + Torch download logs a little
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
RVS_HOST = os.getenv("RVS_HOST", "").strip()
RVS_PORT = int(os.getenv("RVS_PORT", "443"))
RVS_TLS = os.getenv("RVS_TLS", "true").lower() == "true"
RVS_TLS_FALLBACK = os.getenv("RVS_TLS_FALLBACK", "true").lower() == "true"
RVS_TOKEN = os.getenv("RVS_TOKEN", "").strip()
F5TTS_MODEL = os.getenv("F5TTS_MODEL", "F5TTS_v1_Base")
F5TTS_DEVICE = os.getenv("F5TTS_DEVICE", "cuda")
VOICES_DIR = Path(os.getenv("VOICES_DIR", "/voices"))
PCM_CHUNK_BYTES = 8192 # ~170ms @ 24kHz mono s16
TARGET_SR = 24000 # F5-TTS native
# ── Lazy F5-TTS Loader ──────────────────────────────────────
_F5TTS_cls = None
def _get_f5tts_cls():
"""Lazy import so that Torch warnings do not clutter the startup logs."""
global _F5TTS_cls
if _F5TTS_cls is None:
from f5_tts.api import F5TTS as _cls
_F5TTS_cls = _cls
return _F5TTS_cls
class F5Runner:
"""Holds the F5-TTS model. Synthesis runs in the executor (blocking)."""
def __init__(self) -> None:
self.model = None
self._lock = asyncio.Lock()
def _load_blocking(self) -> None:
cls = _get_f5tts_cls()
logger.info("Lade F5-TTS '%s' (device=%s)...", F5TTS_MODEL, F5TTS_DEVICE)
t0 = time.time()
self.model = cls(model=F5TTS_MODEL, device=F5TTS_DEVICE)
logger.info("F5-TTS geladen in %.1fs", time.time() - t0)
async def ensure_loaded(self) -> None:
async with self._lock:
if self.model is not None:
return
loop = asyncio.get_event_loop()
await loop.run_in_executor(None, self._load_blocking)
def _infer_blocking(self, gen_text: str, ref_wav: str, ref_text: str) -> tuple[np.ndarray, int]:
wav, sr, _ = self.model.infer(
ref_file=ref_wav,
ref_text=ref_text,
gen_text=gen_text,
remove_silence=True,
seed=-1,
)
# F5-TTS returns a 1-D float32 array, by default at a 24kHz sample rate
if not isinstance(wav, np.ndarray):
wav = np.asarray(wav, dtype=np.float32)
if wav.ndim > 1:
wav = wav.squeeze()
return wav.astype(np.float32), int(sr)
async def synthesize(self, gen_text: str, ref_wav: str, ref_text: str) -> tuple[np.ndarray, int]:
await self.ensure_loaded()
loop = asyncio.get_event_loop()
return await loop.run_in_executor(None, self._infer_blocking, gen_text, ref_wav, ref_text)
# ── Helpers ─────────────────────────────────────────────────
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+|\n+")
def split_sentences(text: str, max_len: int = 350) -> list[str]:
"""Splits long text at sentence boundaries. Short texts are returned as-is."""
text = text.strip()
if not text:
return []
if len(text) <= max_len:
return [text]
parts = [p.strip() for p in _SENTENCE_SPLIT.split(text) if p.strip()]
# Merge fragments that are too short so F5-TTS does not restart after every short sentence
merged: list[str] = []
buf = ""
for p in parts:
if len(buf) + len(p) + 1 <= max_len:
buf = f"{buf} {p}".strip()
else:
if buf:
merged.append(buf)
buf = p
if buf:
merged.append(buf)
return merged or [text]
def float_to_pcm16(wav: np.ndarray) -> bytes:
"""Float32 (-1..+1) → int16 little-endian bytes."""
wav = np.clip(wav, -1.0, 1.0)
pcm = (wav * 32767.0).astype(np.int16)
return pcm.tobytes()
def sanitize_voice_name(name: str) -> str:
return re.sub(r"[^a-zA-Z0-9_-]", "_", name)
def voice_paths(name: str) -> tuple[Path, Path]:
safe = sanitize_voice_name(name)
return VOICES_DIR / f"{safe}.wav", VOICES_DIR / f"{safe}.txt"
def ensure_24k_mono_wav(src_wav: Path) -> Path:
"""F5-TTS wants 24kHz mono as the reference; ffmpeg converts in place.
If the file already matches, nothing is changed. Otherwise the converted
audio is written back (the original is overwritten).
"""
try:
info = sf.info(str(src_wav))
if info.samplerate == TARGET_SR and info.channels == 1:
return src_wav
except Exception:
pass
tmp_out = src_wav.with_suffix(".conv.wav")
cmd = ["ffmpeg", "-y", "-i", str(src_wav),
"-ar", str(TARGET_SR), "-ac", "1", "-f", "wav", str(tmp_out)]
r = subprocess.run(cmd, capture_output=True, timeout=30)
if r.returncode != 0:
logger.warning("ffmpeg-Konvertierung von %s fehlgeschlagen: %s",
src_wav, r.stderr.decode(errors="replace")[:200])
try:
tmp_out.unlink()
except OSError:
pass
return src_wav
os.replace(tmp_out, src_wav)
return src_wav
async def _send(ws, mtype: str, payload: dict) -> None:
    try:
        await ws.send(json.dumps({
            "type": mtype,
            "payload": payload,
            "timestamp": int(time.time() * 1000),
        }))
    except Exception as e:
        logger.warning("Send failed (%s): %s", mtype, e)
# ── Internal transcription via whisper-bridge ───────────────
_pending_stt: dict[str, asyncio.Future] = {}
_STT_TIMEOUT_S = 60.0


async def request_transcription(ws, wav_path: Path, language: str = "de") -> Optional[str]:
    """Send an stt_request to the whisper-bridge (via RVS) and wait for the stt_response."""
    try:
        with open(wav_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("ascii")
    except Exception as e:
        logger.error("Reading %s failed: %s", wav_path, e)
        return None
    request_id = str(uuid.uuid4())
    loop = asyncio.get_running_loop()
    fut: asyncio.Future = loop.create_future()
    _pending_stt[request_id] = fut
    try:
        await _send(ws, "stt_request", {
            "requestId": request_id,
            "audio": audio_b64,
            "mimeType": "audio/wav",
            "model": "small",  # small is enough for a voice reference
            "language": language,
        })
        return await asyncio.wait_for(fut, timeout=_STT_TIMEOUT_S)
    except asyncio.TimeoutError:
        logger.warning("Transcription timed out for %s", wav_path.name)
        return None
    except Exception as e:
        logger.warning("Transcription error: %s", e)
        return None
    finally:
        _pending_stt.pop(request_id, None)
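
Review note: `request_transcription` and the `stt_response` branch of the main loop together form a request/response correlation over one shared socket. The sketch below isolates that pattern so it can be exercised without a WebSocket; `request`, `on_response`, `fake_send` and `demo` are illustrative names, not functions of the bridge.

```python
import asyncio
import uuid
from typing import Optional

pending: dict[str, asyncio.Future] = {}  # requestId -> Future awaiting the reply


async def request(send, timeout_s: float = 1.0) -> Optional[str]:
    """Send a request via `send` (any coroutine function) and await the reply."""
    request_id = str(uuid.uuid4())
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    pending[request_id] = fut
    try:
        await send({"requestId": request_id})
        return await asyncio.wait_for(fut, timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
    finally:
        pending.pop(request_id, None)  # never leak entries, even on timeout


def on_response(payload: dict) -> None:
    """Called by the receive loop; resolves the matching Future, if any."""
    fut = pending.get(payload.get("requestId", ""))
    if fut and not fut.done():
        fut.set_result(payload.get("text") or "")


async def demo() -> Optional[str]:
    async def fake_send(msg: dict) -> None:
        # Simulate the peer answering shortly after the request went out.
        asyncio.get_running_loop().call_later(
            0.01, on_response, {"requestId": msg["requestId"], "text": "hallo"}
        )
    return await request(fake_send)
```

The `finally` clause is what keeps the table from growing when the peer never answers; a late reply then simply finds no matching entry.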
# ── TTS request handler ─────────────────────────────────────
# Queue so that parallel requests do not overlap (GPU throughput)
_tts_queue: asyncio.Queue[tuple] = asyncio.Queue()


async def _tts_worker(ws, runner: F5Runner) -> None:
    """Serializes syntheses; otherwise the GPU can run out of memory."""
    while True:
        text, voice, request_id, message_id, language = await _tts_queue.get()
        try:
            await _do_tts(ws, runner, text, voice, request_id, message_id, language)
        except Exception:
            logger.exception("TTS worker error")
        finally:
            _tts_queue.task_done()
async def _do_tts(ws, runner: F5Runner, text: str, voice: str,
                  request_id: str, message_id: str, language: str) -> None:
    t0 = time.time()
    ref_wav_path, ref_txt_path = voice_paths(voice) if voice else (None, None)
    has_custom = bool(voice and ref_wav_path and ref_wav_path.exists() and ref_txt_path.exists())

    if voice and not has_custom:
        # Only the WAV exists but no txt → transcribe on the fly
        if ref_wav_path and ref_wav_path.exists() and (not ref_txt_path or not ref_txt_path.exists()):
            logger.info("Voice '%s' has no txt, transcribing on the fly", voice)
            text_ref = await request_transcription(ws, ref_wav_path, language)
            if text_ref:
                try:
                    ref_txt_path.write_text(text_ref.strip(), encoding="utf-8")
                    has_custom = True
                    logger.info("Reference text added: '%s'", text_ref[:60])
                except Exception as e:
                    logger.warning("Saving reference text failed: %s", e)
        if not has_custom:
            logger.warning("Voice '%s' incomplete (%s, txt=%s), using default",
                           voice, ref_wav_path, (ref_txt_path and ref_txt_path.exists()))

    if has_custom:
        ref_wav_str = str(ref_wav_path)
        ref_text = ref_txt_path.read_text(encoding="utf-8").strip()
    else:
        # Fallback: no custom voice. F5-TTS ALWAYS needs a reference, so we
        # use default_ref.wav/txt if present, otherwise the first voice pair
        # found in the folder.
        default_wav = VOICES_DIR / "default_ref.wav"
        default_txt = VOICES_DIR / "default_ref.txt"
        if default_wav.exists() and default_txt.exists():
            ref_wav_str = str(default_wav)
            ref_text = default_txt.read_text(encoding="utf-8").strip()
        else:
            # Take any existing voice pair
            pair = next(
                ((w, t) for w, t in (
                    (v, v.with_suffix(".txt")) for v in VOICES_DIR.glob("*.wav")
                ) if t.exists()),
                None,
            )
            if not pair:
                logger.error("No reference voice in VOICES_DIR, TTS aborted")
                return
            ref_wav_str, ref_text = str(pair[0]), pair[1].read_text(encoding="utf-8").strip()

    sentences = split_sentences(text)
    logger.info("F5-TTS: %d sentence(s), voice=%s (%s)", len(sentences), voice or "default", ref_wav_str)
    chunk_index = 0
    pcm_sr = TARGET_SR
    for i, sent in enumerate(sentences):
        try:
            wav, sr = await runner.synthesize(sent, ref_wav_str, ref_text)
            pcm_sr = sr
            pcm_bytes = float_to_pcm16(wav)
            # The start of the very first sentence gets a fade-in (masks
            # possible warmup glitches). All later audio stays untouched.
            if i == 0 and chunk_index == 0:
                pcm_bytes = _fade_in_pcm16(pcm_bytes, sr, 120)
            # Slice into chunks of PCM_CHUNK_BYTES
            for off in range(0, len(pcm_bytes), PCM_CHUNK_BYTES):
                slice_ = pcm_bytes[off:off + PCM_CHUNK_BYTES]
                await _send(ws, "audio_pcm", {
                    "requestId": request_id,
                    "messageId": message_id,
                    "base64": base64.b64encode(slice_).decode("ascii"),
                    "format": "pcm_s16le",
                    "sampleRate": sr,
                    "channels": 1,
                    "voice": voice or "default",
                    "chunk": chunk_index,
                    "final": False,
                })
                chunk_index += 1
        except Exception as e:
            logger.exception("F5-TTS synthesis error (sentence %d)", i)
            await _send(ws, "xtts_response", {
                "requestId": request_id,
                "error": str(e)[:200],
            })
            return

    # Final marker: empty payload with final=True
    await _send(ws, "audio_pcm", {
        "requestId": request_id,
        "messageId": message_id,
        "base64": "",
        "format": "pcm_s16le",
        "sampleRate": pcm_sr,
        "channels": 1,
        "voice": voice or "default",
        "chunk": chunk_index,
        "final": True,
    })
    logger.info("TTS complete: %d chunks, %.2fs render (voice=%s, text=%d chars)",
                chunk_index, time.time() - t0, voice or "default", len(text))
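
Review note: `_do_tts` streams numbered `audio_pcm` payloads and closes the stream with an empty payload carrying `final: True`. A receiver could reassemble the audio roughly as sketched below. The field names (`base64`, `chunk`, `final`) are taken from the payload above; `collect_pcm` itself is a hypothetical helper, and buffering everything before playback is a simplification, since a real player would start playing as chunks arrive.

```python
import base64


def collect_pcm(payloads: list[dict]) -> bytes:
    """Concatenate audio_pcm payloads in chunk order, stopping at the final marker."""
    out = bytearray()
    for payload in sorted(payloads, key=lambda p: p["chunk"]):
        if payload.get("final"):
            break  # the final marker carries no audio
        out += base64.b64decode(payload["base64"])
    return bytes(out)
```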
def _fade_in_pcm16(pcm: bytes, sr: int, fade_ms: int) -> bytes:
    """Linear fade-in over the first fade_ms; masks warmup glitches."""
    # "<i2" matches the pcm_s16le wire format regardless of host byte order.
    arr = np.frombuffer(pcm, dtype="<i2").copy()
    fade_samples = min(int((fade_ms / 1000.0) * sr), len(arr))
    if fade_samples <= 0:
        return pcm
    ramp = np.linspace(0.0, 1.0, fade_samples, dtype=np.float32)
    arr[:fade_samples] = (arr[:fade_samples].astype(np.float32) * ramp).astype("<i2")
    return arr.tobytes()
# ── Voice Management Handlers ───────────────────────────────
async def handle_voice_upload(ws, payload: dict) -> None:
    name = (payload.get("name") or "").strip()
    samples = payload.get("samples") or []
    if not name or not samples:
        logger.warning("voice_upload: invalid (name=%r, samples=%d)", name, len(samples))
        return
    logger.info("Voice upload: '%s' (%d samples)", name, len(samples))
    try:
        VOICES_DIR.mkdir(parents=True, exist_ok=True)
        safe = sanitize_voice_name(name)
        wav_path = VOICES_DIR / f"{safe}.wav"
        txt_path = VOICES_DIR / f"{safe}.txt"
        # Join the samples (bytes are written back to back, as received)
        buffers = [base64.b64decode(s.get("base64", "")) for s in samples]
        with open(wav_path, "wb") as f:
            for b in buffers:
                f.write(b)
        size_kb = wav_path.stat().st_size / 1024
        logger.info("Voice WAV saved: %s (%.0fKB)", wav_path, size_kb)
        # Normalize to 24 kHz mono (in case the app delivers another format)
        ensure_24k_mono_wav(wav_path)
        # Request the transcription from the whisper-bridge
        logger.info("Transcribing '%s' via whisper-bridge...", name)
        text = await request_transcription(ws, wav_path, language="de")
        if not text:
            logger.warning("Transcription failed, saving placeholder text")
            # Placeholder stays German on purpose: it stands in for German reference audio.
            text = "Das ist ein Referenz Audio."
        txt_path.write_text(text.strip(), encoding="utf-8")
        logger.info("Voice '%s' complete (txt: %s)", name, text[:80])
        await _send(ws, "xtts_voice_saved", {
            "name": name, "size": int(size_kb * 1024), "refText": text.strip(),
        })
        # Refresh the list
        await handle_list_voices(ws)
    except Exception as e:
        logger.exception("voice_upload error")
        await _send(ws, "xtts_voice_saved", {"name": name, "error": str(e)[:200]})
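
Review note: `handle_voice_upload` writes the decoded samples back to back into one file. If each sample is itself a complete WAV file (an assumption; the app's payload format is not visible in this diff), the result contains embedded headers, and a strict parser will typically see only the first sample. A possible alternative using the standard-library `wave` module is sketched below; `merge_wav_samples` is a hypothetical helper that assumes uncompressed PCM samples with identical channel count, sample width and rate.

```python
import io
import wave


def merge_wav_samples(buffers: list[bytes]) -> bytes:
    """Merge complete PCM WAV files into one WAV (all must share one format)."""
    if not buffers:
        raise ValueError("no samples")
    out = io.BytesIO()
    fmt = None
    with wave.open(out, "wb") as dst:
        for buf in buffers:
            with wave.open(io.BytesIO(buf), "rb") as src:
                cur = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if fmt is None:
                    # The first sample defines the output format.
                    fmt = cur
                    dst.setnchannels(cur[0])
                    dst.setsampwidth(cur[1])
                    dst.setframerate(cur[2])
                elif cur != fmt:
                    raise ValueError("WAV samples differ in format")
                dst.writeframes(src.readframes(src.getnframes()))
    return out.getvalue()
```

With a single sample per upload the current byte-wise write is equivalent, so this only matters when `samples` has more than one entry.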
async def handle_list_voices(ws) -> None:
    try:
        voices = []
        if VOICES_DIR.exists():
            for wav in sorted(VOICES_DIR.glob("*.wav")):
                txt = wav.with_suffix(".txt")
                voices.append({
                    "name": wav.stem,
                    "file": wav.name,
                    "size": wav.stat().st_size,
                    "hasRefText": txt.exists(),
                })
        logger.info("Voice list: %d", len(voices))
        await _send(ws, "xtts_voices_list", {"voices": voices})
    except Exception:
        logger.exception("handle_list_voices error")
async def handle_delete_voice(ws, payload: dict) -> None:
    name = (payload.get("name") or "").strip()
    if not name:
        return
    try:
        wav, txt = voice_paths(name)
        for p in (wav, txt):
            if p.exists():
                p.unlink()
                logger.info("Voice deleted: %s", p)
        await handle_list_voices(ws)
    except Exception:
        logger.exception("handle_delete_voice error")
# Last voice set via a diagnostic config message (prevents an endless preload on every config)
_last_diag_voice = ""


async def handle_voice_preload(ws, payload: dict, runner: F5Runner) -> None:
    voice = (payload.get("voice") or "").strip()
    request_id = payload.get("requestId", "")
    logger.info("Voice preload requested: '%s'", voice or "default")
    try:
        ref_wav, ref_txt = voice_paths(voice) if voice else (None, None)
        if voice and (not ref_wav or not ref_wav.exists()):
            await _send(ws, "voice_ready", {"voice": voice, "requestId": request_id, "error": "voice-file-not-found"})
            return
        # Make sure the reference text exists (in case only the WAV is present)
        if voice and ref_txt and not ref_txt.exists():
            text = await request_transcription(ws, ref_wav, language="de")
            if text:
                ref_txt.write_text(text.strip(), encoding="utf-8")
                logger.info("Reference text added during preload")
        # Dummy render as warmup
        t0 = time.time()
        await _do_tts(ws, runner, "ja.", voice, f"preload-{request_id}", "", "de")
        ms = int((time.time() - t0) * 1000)
        await _send(ws, "voice_ready", {"voice": voice, "requestId": request_id, "loadMs": ms})
    except Exception as e:
        logger.exception("Voice preload error")
        await _send(ws, "voice_ready", {"voice": voice, "requestId": request_id, "error": str(e)[:200]})
# ── Main loop ───────────────────────────────────────────────
async def run_loop(runner: F5Runner) -> None:
    global _last_diag_voice
    # Start the preload in the background so startup does not block.
    # Keep a reference so the task is not garbage-collected mid-flight.
    preload_task = asyncio.create_task(runner.ensure_loaded())
    use_tls = RVS_TLS
    retry_s = 2
    tls_fallback_tried = False
    while True:
        scheme = "wss" if use_tls else "ws"
        url = f"{scheme}://{RVS_HOST}:{RVS_PORT}/ws?token={RVS_TOKEN}"
        masked = url.replace(RVS_TOKEN, "***") if RVS_TOKEN else url
        try:
            logger.info("Connecting to RVS: %s", masked)
            async with websockets.connect(url, ping_interval=20, ping_timeout=10, max_size=50 * 1024 * 1024) as ws:
                logger.info("RVS connected")
                retry_s = 2
                tls_fallback_tried = False
                # Start the TTS worker for this connection
                worker = asyncio.create_task(_tts_worker(ws, runner))
                try:
                    async for raw in ws:
                        try:
                            msg = json.loads(raw)
                        except Exception:
                            continue
                        mtype = msg.get("type", "")
                        payload = msg.get("payload", {}) or {}
                        if mtype == "xtts_request":
                            await _tts_queue.put((
                                payload.get("text", ""),
                                payload.get("voice", "") or "",
                                payload.get("requestId", ""),
                                payload.get("messageId", ""),
                                payload.get("language", "de"),
                            ))
                        elif mtype == "voice_upload":
                            asyncio.create_task(handle_voice_upload(ws, payload))
                        elif mtype == "xtts_list_voices":
                            asyncio.create_task(handle_list_voices(ws))
                        elif mtype == "xtts_delete_voice":
                            asyncio.create_task(handle_delete_voice(ws, payload))
                        elif mtype == "voice_preload":
                            asyncio.create_task(handle_voice_preload(ws, payload, runner))
                        elif mtype == "stt_response":
                            # Reply to our internal transcription request
                            req_id = payload.get("requestId", "")
                            fut = _pending_stt.get(req_id)
                            if fut and not fut.done():
                                if payload.get("error"):
                                    fut.set_result(None)
                                else:
                                    fut.set_result(payload.get("text") or "")
                        elif mtype == "config":
                            v = (payload.get("xttsVoice") or "").strip()
                            if v and v != _last_diag_voice:
                                _last_diag_voice = v
                                asyncio.create_task(handle_voice_preload(
                                    ws, {"voice": v, "source": "diagnostic"}, runner,
                                ))
                            elif not v:
                                _last_diag_voice = ""
                finally:
                    worker.cancel()
                    try:
                        await worker
                    except asyncio.CancelledError:
                        pass
        except Exception as e:
            logger.warning("Connection lost: %s", e)
            if use_tls and RVS_TLS_FALLBACK and not tls_fallback_tried:
                logger.info("TLS failed, falling back to ws://")
                use_tls = False
                tls_fallback_tried = True
                continue
        # Exponential backoff: 2s, 4s, 8s, 16s, then capped at 30s
        await asyncio.sleep(min(retry_s, 30))
        retry_s = min(retry_s * 2, 30)
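
Review note: the reconnect delay doubles after every failed attempt, is capped at 30 seconds, and is reset to 2 seconds once a connection succeeds. The generator below reproduces only that delay schedule, as a way to check the numbers; `backoff_delays` is an illustrative name and not part of the bridge.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(start_s: float = 2, cap_s: float = 30) -> Iterator[float]:
    """Yield the sleep durations used between consecutive reconnect attempts."""
    delay = start_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)


first_six = list(islice(backoff_delays(), 6))  # [2, 4, 8, 16, 30, 30]
```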
async def main() -> None:
    if not RVS_HOST:
        logger.error("RVS_HOST not set, aborting")
        sys.exit(1)
    VOICES_DIR.mkdir(parents=True, exist_ok=True)
    runner = F5Runner()
    await run_loop(runner)


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        sys.exit(0)
+5
@@ -0,0 +1,5 @@
f5-tts>=1.0.0
websockets>=12.0
numpy>=1.24
soundfile>=0.12
requests>=2.31
-8
@@ -1,8 +0,0 @@
{
"name": "aria-xtts-bridge",
"version": "1.0.0",
"private": true,
"dependencies": {
"ws": "^8.16.0"
}
}
+14
@@ -0,0 +1,14 @@
FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 python3-pip ffmpeg \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY bridge.py .
CMD ["python3", "bridge.py"]
+254
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
ARIA Whisper Bridge: runs on the Gamebox (RTX 3060).

Receives stt_request via RVS → FFmpeg conversion → faster-whisper on GPU
→ sends stt_response back to the aria-bridge.

Env:
  RVS_HOST, RVS_PORT, RVS_TLS, RVS_TLS_FALLBACK, RVS_TOKEN
  WHISPER_MODEL          Default: small
  WHISPER_DEVICE         Default: cuda
  WHISPER_COMPUTE_TYPE   Default: float16
  WHISPER_LANGUAGE       Default: de
"""
import asyncio
import base64
import json
import logging
import os
import subprocess
import sys
import tempfile
import time
from typing import Optional
import numpy as np
import websockets
from faster_whisper import WhisperModel
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%H:%M:%S",
)
logger = logging.getLogger("whisper-bridge")
RVS_HOST = os.getenv("RVS_HOST", "").strip()
RVS_PORT = int(os.getenv("RVS_PORT", "443"))
RVS_TLS = os.getenv("RVS_TLS", "true").lower() == "true"
RVS_TLS_FALLBACK = os.getenv("RVS_TLS_FALLBACK", "true").lower() == "true"
RVS_TOKEN = os.getenv("RVS_TOKEN", "").strip()
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "small")
WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "cuda")
WHISPER_COMPUTE_TYPE = os.getenv("WHISPER_COMPUTE_TYPE", "float16")
WHISPER_LANGUAGE = os.getenv("WHISPER_LANGUAGE", "de")
ALLOWED_MODELS = {"tiny", "base", "small", "medium", "large-v3"}
class WhisperRunner:
    """Holds the Whisper model. Hot-swaps on config change via ensure_loaded()."""

    def __init__(self) -> None:
        self.model_size: str = WHISPER_MODEL
        self.model: Optional[WhisperModel] = None
        self._lock = asyncio.Lock()

    def _load_blocking(self, size: str) -> None:
        logger.info(
            "Loading Whisper '%s' (device=%s, compute=%s)",
            size, WHISPER_DEVICE, WHISPER_COMPUTE_TYPE,
        )
        t0 = time.time()
        self.model = WhisperModel(
            size, device=WHISPER_DEVICE, compute_type=WHISPER_COMPUTE_TYPE,
        )
        self.model_size = size
        logger.info("Whisper '%s' loaded in %.1fs", size, time.time() - t0)

    async def ensure_loaded(self, desired_size: str) -> None:
        if desired_size not in ALLOWED_MODELS:
            logger.warning("Invalid Whisper model '%s', using %s", desired_size, WHISPER_MODEL)
            desired_size = WHISPER_MODEL
        async with self._lock:
            if self.model is not None and self.model_size == desired_size:
                return
            loop = asyncio.get_running_loop()
            await loop.run_in_executor(None, self._load_blocking, desired_size)

    async def transcribe(self, audio: np.ndarray, language: str) -> tuple[str, float]:
        if self.model is None:
            return "", 0.0

        def _run():
            # segments is a lazy generator; consume it here, inside the worker thread.
            segments, info = self.model.transcribe(
                audio, language=language, beam_size=5, vad_filter=True,
            )
            text = " ".join(seg.text.strip() for seg in segments)
            return text, info.duration

        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, _run)
def ffmpeg_to_float32(audio_b64: str, mime_type: str) -> np.ndarray:
    """Decode any audio format → 16 kHz mono float32 PCM."""
    if "mp4" in mime_type or "m4a" in mime_type or "aac" in mime_type:
        ext = ".mp4"
    elif "wav" in mime_type:
        ext = ".wav"
    elif "ogg" in mime_type or "opus" in mime_type:
        ext = ".ogg"
    else:
        ext = ".bin"
    in_fh = tempfile.NamedTemporaryFile(suffix=ext, delete=False)
    try:
        in_fh.write(base64.b64decode(audio_b64))
        in_fh.close()
        out_path = in_fh.name + ".raw"
        cmd = ["ffmpeg", "-y", "-i", in_fh.name, "-ar", "16000", "-ac", "1", "-f", "f32le", out_path]
        result = subprocess.run(cmd, capture_output=True, timeout=30)
        if result.returncode != 0:
            logger.error("FFmpeg error: %s", result.stderr.decode(errors="replace")[:300])
            return np.zeros(0, dtype=np.float32)
        try:
            return np.fromfile(out_path, dtype=np.float32)
        finally:
            try:
                os.unlink(out_path)
            except OSError:
                pass
    finally:
        try:
            os.unlink(in_fh.name)
        except OSError:
            pass
async def _send(ws, mtype: str, payload: dict) -> None:
    try:
        await ws.send(json.dumps({
            "type": mtype,
            "payload": payload,
            "timestamp": int(time.time() * 1000),
        }))
    except Exception as e:
        logger.warning("Send failed (%s): %s", mtype, e)
async def handle_stt_request(ws, payload: dict, runner: WhisperRunner) -> None:
    request_id = payload.get("requestId", "")
    audio_b64 = payload.get("audio", "")
    mime_type = payload.get("mimeType", "audio/mp4")
    model = payload.get("model") or WHISPER_MODEL
    language = payload.get("language") or WHISPER_LANGUAGE
    if not audio_b64:
        await _send(ws, "stt_response", {"requestId": request_id, "error": "no-audio"})
        return
    try:
        t_load = time.time()
        await runner.ensure_loaded(model)
        load_ms = int((time.time() - t_load) * 1000)
        audio = ffmpeg_to_float32(audio_b64, mime_type)
        if audio.size == 0:
            await _send(ws, "stt_response", {"requestId": request_id, "error": "ffmpeg-failed"})
            return
        duration_s = len(audio) / 16000.0
        logger.info("STT request: %.1fs audio, model=%s, lang=%s", duration_s, runner.model_size, language)
        t_stt = time.time()
        text, detected_duration = await runner.transcribe(audio, language)
        stt_ms = int((time.time() - t_stt) * 1000)
        logger.info("STT result (%dms): '%s'", stt_ms, text[:100])
        await _send(ws, "stt_response", {
            "requestId": request_id,
            "text": text.strip(),
            "durationS": duration_s,
            "sttMs": stt_ms,
            "loadMs": load_ms,
            "model": runner.model_size,
        })
    except Exception as e:
        logger.exception("STT request failed")
        await _send(ws, "stt_response", {
            "requestId": request_id,
            "error": str(e)[:200],
        })
async def run_loop(runner: WhisperRunner) -> None:
    # Load the model up front so the first request is fast
    try:
        await runner.ensure_loaded(WHISPER_MODEL)
    except Exception as e:
        logger.error("Preload failed: %s; continuing, the model is loaded on the first request", e)
    use_tls = RVS_TLS
    retry_s = 2
    tls_fallback_tried = False
    while True:
        scheme = "wss" if use_tls else "ws"
        url = f"{scheme}://{RVS_HOST}:{RVS_PORT}/ws?token={RVS_TOKEN}"
        masked = url.replace(RVS_TOKEN, "***") if RVS_TOKEN else url
        try:
            logger.info("Connecting to RVS: %s", masked)
            async with websockets.connect(url, ping_interval=20, ping_timeout=10) as ws:
                logger.info("RVS connected")
                retry_s = 2
                tls_fallback_tried = False
                async for raw in ws:
                    try:
                        msg = json.loads(raw)
                    except Exception:
                        continue
                    mtype = msg.get("type", "")
                    payload = msg.get("payload", {}) or {}
                    if mtype == "stt_request":
                        req_id = payload.get("requestId", "?")
                        audio_len = len(payload.get("audio", ""))
                        # ~1365 base64 characters encode 1 KB of audio
                        logger.info("stt_request received (id=%s, %dKB audio)",
                                    req_id[:8] if req_id != "?" else "?", audio_len // 1365)
                        asyncio.create_task(handle_stt_request(ws, payload, runner))
                    elif mtype == "config":
                        new_model = payload.get("whisperModel")
                        if new_model and new_model != runner.model_size:
                            logger.info("Config broadcast: Whisper model → %s", new_model)
                            asyncio.create_task(runner.ensure_loaded(new_model))
                    else:
                        # Debug-log all other messages; helps to diagnose
                        # whether stt_request gets through the RVS at all
                        logger.debug("Ignored type: %s", mtype)
        except Exception as e:
            logger.warning("Connection lost: %s", e)
            if use_tls and RVS_TLS_FALLBACK and not tls_fallback_tried:
                logger.info("TLS connection failed, falling back to ws://")
                use_tls = False
                tls_fallback_tried = True
                continue
        await asyncio.sleep(min(retry_s, 30))
        retry_s = min(retry_s * 2, 30)
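
Review note: the log line for `stt_request` divides the base64 length by 1365 to estimate the audio size in KB. The constant follows from base64 encoding 3 bytes as 4 characters, so 1024 bytes take about 1365 characters. A short sketch comparing that estimate with the exact decoded size; both function names are illustrative and do not exist in the bridge.

```python
import base64


def approx_kib(b64_text: str) -> int:
    """Estimate used for logging: ~1365 base64 chars per KiB of binary data."""
    return len(b64_text) // 1365


def exact_bytes(b64_text: str) -> int:
    """Exact decoded size of padded base64 text, without decoding it."""
    padding = b64_text.count("=", -2) if b64_text else 0  # 0, 1 or 2 pad chars
    return len(b64_text) * 3 // 4 - padding


encoded = base64.b64encode(bytes(10 * 1024)).decode("ascii")
```

For the 10 KiB buffer above, `exact_bytes(encoded)` gives 10240 and `approx_kib(encoded)` gives 10, so the estimate is good enough for a log line.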
async def main() -> None:
    if not RVS_HOST:
        logger.error("RVS_HOST is not set, aborting")
        sys.exit(1)
    runner = WhisperRunner()
    await run_loop(runner)


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        sys.exit(0)
+4
@@ -0,0 +1,4 @@
faster-whisper==1.0.3
websockets>=12.0
numpy>=1.24
requests>=2.31