Fixed: no longer create monitors that did not exist before; only update monitors that already exist at start
This commit is contained in:
parent
7d8ad91bc4
commit
9babedf6ca
88 README.md
@@ -380,6 +380,94 @@ rm -f /tmp/monmap
> **Note:** Adjust node names and IPs to your own setup. Current versions of the tool update the MON map automatically.
### Troubleshooting: removing a ghost monitor (e.g. an "Unknown" MON on the wrong node)
If, after the migration, a monitor shows up on a node that should not run a MON (e.g. `mon.pvetest04` shows "Unknown" in the dashboard), a MON map was accidentally injected on a non-MON node. Current versions of the tool automatically detect which nodes actually run a MON and skip the others.

**Symptom:** The Ceph dashboard or `ceph -s` shows an additional monitor with status "Unknown" or "out of quorum" on a node that never had a MON.
**Step 1: Remove the ghost monitor from the cluster**

```bash
# On a node with a working MON:
ceph mon remove pvetest04   # adjust the name of the ghost monitor

# Check that the ghost is gone:
ceph mon stat
ceph -s
```
**Step 2: Clean up leftovers on the affected node**

```bash
# On the node that had the ghost monitor (e.g. pvetest04):
systemctl stop ceph-mon@$(hostname)
systemctl disable ceph-mon@$(hostname)

# Remove the MON data directory (if present):
rm -rf /var/lib/ceph/mon/ceph-$(hostname)

# Check whether any MON processes are still running:
ps aux | grep ceph-mon
# If only the grep process itself shows up, everything is clean:
# root      137377  0.0  0.0   6332  2176 pts/0    S+   08:00   0:00 grep ceph-mon
# -> No ceph-mon is running any more, all good.
#
# If a real ceph-mon process is still running (e.g. /usr/bin/ceph-mon ...):
kill <PID>
```
**Step 3: Clean up ceph.conf (if necessary)**

```bash
# Check whether a [mon.pvetest04] section exists:
grep -A3 '\[mon.pvetest04\]' /etc/pve/ceph.conf

# If so, remove that section from /etc/pve/ceph.conf:
nano /etc/pve/ceph.conf
# -> Delete the entire [mon.pvetest04] section

# Also remove the IP from the mon_host line, if it is listed there:
grep mon_host /etc/pve/ceph.conf
```

> **Note:** This problem only occurs with older versions of the tool. Current versions detect the actual MON nodes via the `[mon.X]` sections in `ceph.conf`, the `mon_host` list, or by checking for the `/var/lib/ceph/mon/` directory.
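The section-based detection described above can be sketched in a few lines with Python's `configparser`: every `[mon.<name>]` section header in `ceph.conf` names one MON node. The sample config content and hostnames below are made-up examples, not taken from a real cluster.

```python
# Sketch: derive MON node names from the [mon.X] sections of a ceph.conf.
# The sample config and hostnames are hypothetical.
import configparser

sample_conf = """
[global]
mon_host = 10.0.0.11 10.0.0.12

[mon.pvetest01]
public_addr = 10.0.0.11

[mon.pvetest02]
public_addr = 10.0.0.12
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample_conf)

# Every "[mon.<name>]" section header names one MON node.
mon_nodes = sorted(s.split(".", 1)[1] for s in cfg.sections() if s.startswith("mon."))
print(mon_nodes)  # ['pvetest01', 'pvetest02']
```

A node without a `[mon.X]` section (like the ghost node above) simply never appears in the result, so no MON map is injected there.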
### Troubleshooting: removing the "X daemons have recently crashed" warning

After the migration, the Ceph dashboard may show the following warning under **Health**:

```
Status: HEALTH_WARN
! clock skew detected on mon.pvetest03
! 23 daemons have recently crashed
```

![healtherror]

The crash reports come from the daemon restarts during the migration and are not critical. Ceph stores crash dumps under `/var/lib/ceph/crash/` and keeps reporting them until they have been archived.
**Show crash dumps:**

```bash
ceph crash ls
```

**Mark all crash dumps as read (archive them):**

```bash
ceph crash archive-all
```

**Check that the warning is gone:**

```bash
ceph -s
# -> HEALTH_OK (or only the clock skew remains, if NTP is not in sync)
```
> **Note:** If `clock skew detected` is also shown, check NTP on the affected nodes: `systemctl status chrony` or `systemctl status ntp`. After a migration involving restarts, clocks can drift briefly; this usually corrects itself automatically.
## Notes

- The tool must be run as **root**
Binary file not shown.
After Width: | Height: | Size: 32 KiB

92 migrator.py
```diff
@@ -463,41 +463,47 @@ class Migrator:
             print(" [Ceph] /etc/pve nicht beschreibbar, schreibe direkt...")
             self._update_ceph_direct(plan, configs)
 
+        # Determine MON nodes (needed for monmap update and service restart)
+        mon_node_names = self._get_mon_node_names(plan)
+
         # Update Ceph MON map with new IPs (MUST happen before restart)
-        self._update_ceph_mon_map(plan)
+        self._update_ceph_mon_map(plan, mon_node_names)
 
         # Restart Ceph services
         # Note: first MON is already running (started during monmap update)
         print("\n [Ceph] Services neu starten...")
-        first_started = False
+        first_mon_started = False
         for node in plan.nodes:
             if not node.is_reachable:
                 continue
             new_host = node.new_ip if not node.is_local else node.ssh_host
+            is_mon_node = not mon_node_names or node.name in mon_node_names
 
-            if not first_started:
-                # First node's MON was already started during monmap update
-                first_started = True
-                print(f" [{node.name}] ceph-mon läuft bereits (Primary)")
-            else:
-                # Start MON on remaining nodes
-                rc, _, err = self.ssh.run_on_node(
-                    new_host,
-                    f"systemctl start ceph-mon@{node.name} 2>/dev/null",
-                    node.is_local, timeout=30,
-                )
-                if rc == 0:
-                    print(f" [{node.name}] ceph-mon gestartet")
-                else:
-                    print(f" [{node.name}] WARNUNG ceph-mon: {err}")
-
-            # Restart MGR
-            self.ssh.run_on_node(
-                new_host,
-                f"systemctl restart ceph-mgr@{node.name} 2>/dev/null",
-                node.is_local, timeout=30,
-            )
-            # Restart all OSDs on this node
+            if is_mon_node:
+                if not first_mon_started:
+                    # First MON node was already started during monmap update
+                    first_mon_started = True
+                    print(f" [{node.name}] ceph-mon läuft bereits (Primary)")
+                else:
+                    # Start MON on remaining MON nodes
+                    rc, _, err = self.ssh.run_on_node(
+                        new_host,
+                        f"systemctl start ceph-mon@{node.name} 2>/dev/null",
+                        node.is_local, timeout=30,
+                    )
+                    if rc == 0:
+                        print(f" [{node.name}] ceph-mon gestartet")
+                    else:
+                        print(f" [{node.name}] WARNUNG ceph-mon: {err}")
+
+                # Restart MGR (only on MON nodes)
+                self.ssh.run_on_node(
+                    new_host,
+                    f"systemctl restart ceph-mgr@{node.name} 2>/dev/null",
+                    node.is_local, timeout=30,
+                )
+            # Restart all OSDs on this node (OSDs can be on any node)
             self.ssh.run_on_node(
                 new_host,
                 "systemctl restart ceph-osd.target 2>/dev/null",
@@ -527,7 +533,45 @@ class Migrator:
         else:
             print(f" [{node.name}] FEHLER /etc/ceph/ceph.conf: {msg}")
 
-    def _update_ceph_mon_map(self, plan: MigrationPlan):
+    def _get_mon_node_names(self, plan: MigrationPlan) -> set[str]:
+        """Determine which nodes actually run a Ceph MON daemon."""
+        mon_node_names = set()
+
+        if plan.ceph_config:
+            # From [mon.hostname] sections in ceph.conf
+            for section_name in plan.ceph_config.mon_sections:
+                # section_name is like "mon.pvetest01"
+                name = section_name.replace("mon.", "", 1)
+                mon_node_names.add(name)
+            # From mon_host IP list — match IPs to nodes
+            if not mon_node_names and plan.ceph_config.mon_hosts:
+                mon_ips = set(plan.ceph_config.mon_hosts)
+                for node in plan.nodes:
+                    if node.current_ip in mon_ips:
+                        mon_node_names.add(node.name)
+
+        # Fallback: check which nodes have the MON data directory
+        if not mon_node_names:
+            print(" [Ceph] Prüfe welche Nodes einen MON-Dienst haben...")
+            for node in plan.nodes:
+                if not node.is_reachable:
+                    continue
+                new_host = node.new_ip if not node.is_local else node.ssh_host
+                rc, _, _ = self.ssh.run_on_node(
+                    new_host,
+                    f"test -d /var/lib/ceph/mon/ceph-{node.name}",
+                    node.is_local, timeout=10,
+                )
+                if rc == 0:
+                    mon_node_names.add(node.name)
+
+        if mon_node_names:
+            print(f" [Ceph] MON-Nodes erkannt: {', '.join(sorted(mon_node_names))}")
+
+        return mon_node_names
+
+    def _update_ceph_mon_map(self, plan: MigrationPlan,
+                             mon_node_names: set[str] | None = None):
         """Update Ceph MON map with new addresses.
 
         When MON IPs change, the internal monmap (stored in MON's RocksDB)
@@ -543,12 +587,14 @@ class Migrator:
             print(" [Ceph] Keine IP-Änderungen für MON-Map")
             return
 
-        # Build the list of MON nodes with their new IPs
+        # Build the list of MON nodes with their new IPs (only actual MON nodes)
         mon_nodes = []
         reachable_nodes = []
         for node in plan.nodes:
             if not node.is_reachable:
                 continue
+            if mon_node_names and node.name not in mon_node_names:
+                continue
             new_ip = node.new_ip or node.current_ip
             mon_nodes.append((node.name, new_ip))
             reachable_nodes.append(node)
```
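The `mon_host` fallback in this change boils down to keeping only those nodes whose current IP appears in the configured MON IP list. A minimal standalone sketch, with made-up node names and IPs:

```python
# Sketch of the mon_host fallback: keep only nodes whose current IP
# appears in the mon_host list. All names and IPs here are hypothetical.
nodes = {
    "pvetest01": "10.0.0.11",
    "pvetest02": "10.0.0.12",
    "pvetest04": "10.0.0.14",  # runs OSDs only, no MON
}
mon_hosts = {"10.0.0.11", "10.0.0.12"}  # parsed from the mon_host line

mon_node_names = {name for name, ip in nodes.items() if ip in mon_hosts}
print(sorted(mon_node_names))  # ['pvetest01', 'pvetest02']
```

Because `pvetest04` is absent from the result, the loop in `_update_ceph_mon_map` skips it, which is exactly what prevents the ghost-monitor injection described in the README.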