feat(speaker-id): Phase 1 — SpeechBrain ECAPA-TDNN Backend in whisper-bridge

Speaker-ID module (Phase 1 of 5 of the Hermes-style "real conversation
without a wake word" vision). Recognizes Stefan's voice via a 192-dim
embedding plus a cosine match against a persisted fingerprint.
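
A minimal sketch of the embedding match, kept dependency-free with plain
lists. It is not the code from speaker_id.py: the helper names
(l2_normalize, build_fingerprint), the verify() signature, and the
threshold value are assumptions made for illustration. Only the
L2-normalized mean, the cosine match, and the fail-open result
(True, 0.0) come from this commit message.

```python
import math
from typing import Optional, Sequence

# Hypothetical threshold; the commit message does not state one.
THRESHOLD = 0.5


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_fingerprint(embeddings: Sequence[Sequence[float]]) -> list[float]:
    """Average the enrollment embeddings element-wise, then L2-normalize the mean."""
    if not embeddings:
        raise ValueError("need at least one enrollment embedding")
    n = len(embeddings)
    mean = [sum(column) / n for column in zip(*embeddings)]
    return l2_normalize(mean)


def verify(
    embedding: Sequence[float],
    fingerprint: Optional[Sequence[float]],
    threshold: float = THRESHOLD,
) -> tuple[bool, float]:
    """Cosine match against the fingerprint; fail-open when none is enrolled."""
    if fingerprint is None:
        return True, 0.0
    probe = l2_normalize(embedding)
    # Both vectors have unit length, so the dot product is the cosine similarity.
    score = sum(a * b for a, b in zip(probe, fingerprint))
    return score >= threshold, score
```

Because the fingerprint is stored already normalized, only the probe
embedding has to be normalized at verification time.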

Modules:
- speaker_id.py: lazy-loaded ECAPA-TDNN (HuggingFace) with enroll/verify/
  status/delete. Fingerprint = L2-normalized mean over N enrollment
  samples, stored in /voice-id/fingerprint.json.
  Fail-open: with no fingerprint, verify() returns (True, 0.0).
- bridge.py: 3 message handlers (voice_id_status_request,
  voice_id_enroll_request with samples[] as base64 16 kHz int16 PCM,
  voice_id_delete_request). Enrollment runs in the executor (otherwise
  Torch blocks the event loop).
- Dockerfile: torch 2.3.1 + torchaudio with CUDA 12.1 wheels (otherwise
  speechbrain pulls in CPU-only Torch). Container grows by ~1 GB.
- docker-compose.yml: ./voice-id:/voice-id bind mount for fingerprint
  persistence (survives container restarts).
- rvs/server.js: 6 new message types in ALLOWED_TYPES.
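
The enrollment path in bridge.py can be sketched roughly as follows:
decode each base64 int16 PCM buffer, then hand the blocking work to the
default executor. decode_pcm16, blocking_enroll, and handle_enroll are
hypothetical names; blocking_enroll only stands in for the Torch-backed
speaker_id.enroll_from_samples and reports durations instead of
computing embeddings.

```python
import asyncio
import base64
import struct

SAMPLE_RATE = 16_000  # 16 kHz mono, as stated in the commit message


def decode_pcm16(sample_b64: str) -> list[float]:
    """Decode base64 int16 little-endian PCM into floats in [-1.0, 1.0)."""
    raw = base64.b64decode(sample_b64)
    if len(raw) % 2:
        raise ValueError("PCM buffer has an odd number of bytes")
    count = len(raw) // 2
    ints = struct.unpack(f"<{count}h", raw)
    return [value / 32768.0 for value in ints]


def blocking_enroll(samples_b64: list[str]) -> dict:
    """Hypothetical stand-in for the blocking, Torch-backed enrollment call."""
    durations = [len(decode_pcm16(s)) / SAMPLE_RATE for s in samples_b64]
    return {"sample_count": len(durations), "durations_s": durations}


async def handle_enroll(samples_b64: list[str]) -> dict:
    """Run the blocking work in the default executor so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_enroll, samples_b64)
```

Passing None as the executor uses the loop's default thread pool, which
is also what the handler in the diff does.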

Phase 2 (next): app enrollment flow + Voice-ID section in Diagnostics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-06 20:26:12 +02:00
parent 095a10aaf0
commit 6e19adab87
6 changed files with 270 additions and 2 deletions
@@ -33,6 +33,8 @@ import sys
import tempfile
import time
from dataclasses import dataclass, field
import speaker_id
from typing import Optional
import numpy as np
@@ -729,6 +731,52 @@ async def run_loop(runner: WhisperRunner, sessions: SessionManager) -> None:
                f"received id={req_id[:12]} reason={payload.get('reason', '')}")
    sessions.end_session(req_id)
elif mtype == "voice_id_status_request":
    req_id = payload.get("requestId", "")
    try:
        status = speaker_id.status()
    except Exception as exc:
        await _send(ws, "voice_id_status_response", {
            "requestId": req_id, "ok": False, "error": str(exc)[:200],
        })
        continue
    await _send(ws, "voice_id_status_response", {
        "requestId": req_id, "ok": True, **status,
    })
elif mtype == "voice_id_enroll_request":
    # samples: list of base64-encoded int16 LE PCM buffers,
    # 16 kHz mono, ~3-5 s each. The app records them one after
    # another and sends them together.
    req_id = payload.get("requestId", "")
    samples = payload.get("samples") or []
    logger.info("voice_id_enroll_request: %d Samples (id=%s)",
                len(samples), req_id[:8])
    try:
        result = await asyncio.get_running_loop().run_in_executor(
            None, speaker_id.enroll_from_samples, samples
        )
    except Exception as exc:
        logger.warning("voice_id_enroll failed: %s", exc)
        await _send(ws, "voice_id_enroll_response", {
            "requestId": req_id, "ok": False, "error": str(exc)[:300],
        })
        continue
    await _send(ws, "voice_id_enroll_response", {
        "requestId": req_id, "ok": True,
        "sample_count": result.get("sample_count", 0),
        "rejected": result.get("rejected", []),
        "updated_at": result.get("updated_at"),
        "embedding_dim": result.get("embedding_dim"),
    })
elif mtype == "voice_id_delete_request":
    req_id = payload.get("requestId", "")
    removed = speaker_id.delete_fingerprint()
    await _send(ws, "voice_id_delete_response", {
        "requestId": req_id, "ok": True, "removed": removed,
    })
elif mtype == "config":
    # Debug toggle: aria-bridge now broadcasts whisperDebugLog
    # so that Stefan can, at runtime via the Diagnostic settings