Fix: a monitor was created on nodes that did not have one before. Only update monitors that already exist at the start.

2026-03-05 08:08:01 +01:00
parent 7d8ad91bc4
commit 9babedf6ca
6 changed files with 157 additions and 23 deletions
@@ -380,6 +380,94 @@ rm -f /tmp/monmap
 > **Note:** Adjust node names and IPs to match your own setup. Current versions of the tool update the MON map automatically.
 ### Troubleshooting: Removing a ghost monitor (e.g. an "Unknown" MON on the wrong node)
 If a monitor shows up after the migration on a node that should not have a MON (e.g. `mon.pvetest04` shows "Unknown" in the dashboard), a MON map was accidentally injected on a non-MON node. Current versions of the tool automatically detect which nodes actually run a MON and skip the others.
 **Symptom:** In the Ceph dashboard or in the output of `ceph -s`, an additional monitor with status "Unknown" or "out of quorum" appears on a node that never had a MON.
 **Step 1: Remove the ghost monitor from the cluster**
 ```bash
 # On a node with a working MON:
 ceph mon remove pvetest04    # adjust to the name of the ghost monitor
 # Check that the ghost is gone:
 ceph mon stat
 ceph -s
 ```
 **Step 2: Clean up leftovers on the affected node**
 ```bash
 # On the node that had the ghost monitor (e.g. pvetest04):
 systemctl stop ceph-mon@$(hostname)
 systemctl disable ceph-mon@$(hostname)
 # Remove the MON data directory (if present):
 rm -rf /var/lib/ceph/mon/ceph-$(hostname)
 # Check whether any MON processes are still running:
 ps aux | grep ceph-mon
 # If only the grep process itself is listed, everything is clean:
 #   root 137377  0.0  0.0  6332  2176 pts/0  S+  08:00  0:00 grep ceph-mon
 # -> No ceph-mon is running any more, all OK.
 #
 # If a real ceph-mon process is still running (e.g. /usr/bin/ceph-mon ...):
 kill <PID>
 ```
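The "only the grep process itself" check from step 2 can also be scripted. The following is a minimal sketch, not part of the tool: the function name `real_ceph_mon_processes` is hypothetical, and it assumes the usual 11-column `ps aux` layout with the command in the last column.

```python
def real_ceph_mon_processes(ps_output: str) -> list[str]:
    """Return the `ps aux` lines that belong to an actual ceph-mon daemon.

    Assumption: the standard `ps aux` layout (USER PID %CPU %MEM VSZ RSS
    TTY STAT START TIME COMMAND), so the command starts at field 11.
    """
    hits = []
    for line in ps_output.splitlines():
        if "ceph-mon" not in line:
            continue
        fields = line.split(None, 10)
        command = fields[10] if len(fields) > 10 else line
        # The grep helper also contains "ceph-mon" but is not a daemon.
        if command.startswith("grep "):
            continue
        hits.append(line)
    return hits
```

An empty result corresponds to the "all OK" case above; any returned line carries the PID (second column) of a process that still needs to be stopped.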
 **Step 3: Clean up ceph.conf (if necessary)**
 ```bash
 # Check whether a [mon.pvetest04] section exists:
 grep -A3 '\[mon.pvetest04\]' /etc/pve/ceph.conf
 # If so, remove that section from /etc/pve/ceph.conf:
 nano /etc/pve/ceph.conf
 # -> Delete the entire [mon.pvetest04] section
 # Also remove the IP from the mon_host line if it is listed there:
 grep mon_host /etc/pve/ceph.conf
 ```
 > **Note:** This problem only occurs with older versions of the tool. Current versions identify the actual MON nodes from the `[mon.X]` sections in `ceph.conf`, from the `mon_host` list, or by checking for the `/var/lib/ceph/mon/` directory.
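To see what such a detection might report before running a migration, the three sources can be checked by hand. This is a rough sketch, not the tool's implementation: the helper names are hypothetical, `mon_host` is assumed to be a plain list of IPs without `v2:` prefixes or ports, and the directory check is passed in as a callable instead of running over SSH. The fallback order mirrors `_get_mon_node_names` in this commit.

```python
import re
from typing import Callable


def mon_names_from_conf(conf_text: str) -> set[str]:
    """Collect X from every [mon.X] section header."""
    names = set()
    for line in conf_text.splitlines():
        match = re.match(r"\s*\[mon\.([^\]]+)\]\s*$", line)
        if match:
            names.add(match.group(1))
    return names


def mon_host_ips(conf_text: str) -> list[str]:
    """Return the entries of the first mon_host / "mon host" line."""
    for line in conf_text.splitlines():
        match = re.match(r"\s*mon[_ ]host\s*=\s*(.+)$", line)
        if match:
            return [ip for ip in re.split(r"[,\s]+", match.group(1).strip()) if ip]
    return []


def detect_mon_nodes(
    conf_text: str,
    node_ips: dict[str, str],
    has_mon_dir: Callable[[str], bool] = lambda name: False,
) -> set[str]:
    """Try [mon.X] sections, then mon_host, then the MON data directory."""
    names = mon_names_from_conf(conf_text)
    if not names:
        ips = set(mon_host_ips(conf_text))
        names = {name for name, ip in node_ips.items() if ip in ips}
    if not names:
        names = {name for name in node_ips if has_mon_dir(name)}
    return names
```

Here `node_ips` maps node names to their current IPs, and `has_mon_dir` would typically wrap a test for `/var/lib/ceph/mon/ceph-<name>` on the node in question.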
 ### Troubleshooting: Clearing the "X daemons have recently crashed" warning
 After the migration, the following warning may appear under **Health** in the Ceph dashboard:
 ```
 Status: HEALTH_WARN
  ! clock skew detected on mon.pvetest03
  ! 23 daemons have recently crashed
 ```
 ![Ceph HEALTH_WARN: 23 daemons have recently crashed](docs/ceph_crashed_daemons.png)
 The crash reports stem from the daemon restarts during the migration and are not critical. Ceph stores crash dumps under `/var/lib/ceph/crash/` and keeps reporting them until they have been archived.
 **List crash dumps:**
 ```bash
 ceph crash ls
 ```
 **Mark all crash dumps as read (archive):**
 ```bash
 ceph crash archive-all
 ```
 **Check that the warning is gone:**
 ```bash
 ceph -s
 # -> HEALTH_OK (or only clock skew remaining, if NTP is not in sync)
 ```
 > **Note:** If `clock skew detected` is shown as well, check NTP on the affected nodes: `systemctl status chrony` or `systemctl status ntp`. After a migration with restarts, the clocks can be off for a short time; this usually corrects itself automatically.
 ## Notes
 - The tool must be run as **root**
@@ -463,41 +463,47 @@ class Migrator:
            print("  [Ceph] /etc/pve nicht beschreibbar, schreibe direkt...")
            self._update_ceph_direct(plan, configs)
        # Determine MON nodes (needed for monmap update and service restart)
        mon_node_names = self._get_mon_node_names(plan)
        # Update Ceph MON map with new IPs (MUST happen before restart)
-        self._update_ceph_mon_map(plan)
+        self._update_ceph_mon_map(plan, mon_node_names)
        # Restart Ceph services
        # Note: first MON is already running (started during monmap update)
        print("\n  [Ceph] Services neu starten...")
-        first_started = False
+        first_mon_started = False
        for node in plan.nodes:
            if not node.is_reachable:
                continue
            new_host = node.new_ip if not node.is_local else node.ssh_host
            is_mon_node = not mon_node_names or node.name in mon_node_names
-            if not first_started:
-                # First node's MON was already started during monmap update
-                first_started = True
-                print(f"  [{node.name}] ceph-mon läuft bereits (Primary)")
-            else:
-                # Start MON on remaining nodes
-                rc, _, err = self.ssh.run_on_node(
-                    new_host,
-                    f"systemctl start ceph-mon@{node.name} 2>/dev/null",
-                    node.is_local, timeout=30,
-                )
-                if rc == 0:
-                    print(f"  [{node.name}] ceph-mon gestartet")
-                else:
-                    print(f"  [{node.name}] WARNUNG ceph-mon: {err}")
-            # Restart MGR
-            self.ssh.run_on_node(
-                new_host,
-                f"systemctl restart ceph-mgr@{node.name} 2>/dev/null",
-                node.is_local, timeout=30,
-            )
-            # Restart all OSDs on this node
+            if is_mon_node:
+                if not first_mon_started:
+                    # First MON node was already started during monmap update
+                    first_mon_started = True
+                    print(f"  [{node.name}] ceph-mon läuft bereits (Primary)")
+                else:
+                    # Start MON on remaining MON nodes
+                    rc, _, err = self.ssh.run_on_node(
+                        new_host,
+                        f"systemctl start ceph-mon@{node.name} 2>/dev/null",
+                        node.is_local, timeout=30,
+                    )
+                    if rc == 0:
+                        print(f"  [{node.name}] ceph-mon gestartet")
+                    else:
+                        print(f"  [{node.name}] WARNUNG ceph-mon: {err}")
+                # Restart MGR (only on MON nodes)
+                self.ssh.run_on_node(
+                    new_host,
+                    f"systemctl restart ceph-mgr@{node.name} 2>/dev/null",
+                    node.is_local, timeout=30,
+                )
+            # Restart all OSDs on this node (OSDs can be on any node)
             self.ssh.run_on_node(
                 new_host,
                 "systemctl restart ceph-osd.target 2>/dev/null",
@@ -527,7 +533,45 @@ class Migrator:
            else:
                print(f"  [{node.name}] FEHLER /etc/ceph/ceph.conf: {msg}")
-    def _update_ceph_mon_map(self, plan: MigrationPlan):
+    def _get_mon_node_names(self, plan: MigrationPlan) -> set[str]:
        """Determine which nodes actually run a Ceph MON daemon."""
        mon_node_names = set()
        if plan.ceph_config:
            # From [mon.hostname] sections in ceph.conf
            for section_name in plan.ceph_config.mon_sections:
                # section_name is like "mon.pvetest01"
                name = section_name.replace("mon.", "", 1)
                mon_node_names.add(name)
            # From mon_host IP list — match IPs to nodes
            if not mon_node_names and plan.ceph_config.mon_hosts:
                mon_ips = set(plan.ceph_config.mon_hosts)
                for node in plan.nodes:
                    if node.current_ip in mon_ips:
                        mon_node_names.add(node.name)
        # Fallback: check which nodes have the MON data directory
        if not mon_node_names:
            print("  [Ceph] Prüfe welche Nodes einen MON-Dienst haben...")
            for node in plan.nodes:
                if not node.is_reachable:
                    continue
                new_host = node.new_ip if not node.is_local else node.ssh_host
                rc, _, _ = self.ssh.run_on_node(
                    new_host,
                    f"test -d /var/lib/ceph/mon/ceph-{node.name}",
                    node.is_local, timeout=10,
                )
                if rc == 0:
                    mon_node_names.add(node.name)
        if mon_node_names:
            print(f"  [Ceph] MON-Nodes erkannt: {', '.join(sorted(mon_node_names))}")
        return mon_node_names
    def _update_ceph_mon_map(self, plan: MigrationPlan,
                             mon_node_names: set[str] | None = None):
        """Update Ceph MON map with new addresses.
        When MON IPs change, the internal monmap (stored in MON's RocksDB)
@@ -543,12 +587,14 @@ class Migrator:
            print("  [Ceph] Keine IP-Änderungen für MON-Map")
            return
-        # Build the list of MON nodes with their new IPs
+        # Build the list of MON nodes with their new IPs (only actual MON nodes)
        mon_nodes = []
        reachable_nodes = []
        for node in plan.nodes:
            if not node.is_reachable:
                continue
            if mon_node_names and node.name not in mon_node_names:
                continue
            new_ip = node.new_ip or node.current_ip
            mon_nodes.append((node.name, new_ip))
            reachable_nodes.append(node)
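The effect of the new restart loop is easiest to see as a list of planned commands. The sketch below is an illustration under stated assumptions, not code from this commit: `plan_restarts` is a hypothetical helper that takes the reachable node names in plan order and returns the commands instead of running them over SSH. It keeps the loop's assumption that the first MON node was already started during the monmap update.

```python
def plan_restarts(
    node_names: list[str], mon_node_names: set[str]
) -> list[tuple[str, str]]:
    """Return (node, command) pairs in the order the restart loop issues them."""
    actions = []
    first_mon_started = False
    for name in node_names:
        # An empty detection result means "treat every node as a MON node".
        is_mon_node = not mon_node_names or name in mon_node_names
        if is_mon_node:
            if not first_mon_started:
                # Already running since the monmap update; nothing to start.
                first_mon_started = True
            else:
                actions.append((name, f"systemctl start ceph-mon@{name}"))
            # MGR is restarted only on MON nodes.
            actions.append((name, f"systemctl restart ceph-mgr@{name}"))
        # OSDs can live on any node, so this always runs.
        actions.append((name, "systemctl restart ceph-osd.target"))
    return actions
```

With `mon_node_names={"pvetest01", "pvetest02"}`, a third node such as `pvetest04` only gets the OSD restart. If detection returns an empty set, the `not mon_node_names` branch treats every node as a MON node, which matches the pre-fix behavior, so an empty result is worth checking for.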