Ceph split brain

Introduction

If you’ve read through my previous posts, you’ll know that I run proxmox on a single host with four 6 Gbit/s SSDs, which I use to form a 3-node hybrid k8s cluster. I passed the first three disks straight through to the VMs for ceph storage. The fourth SSD is shared between all three VMs and serves as the boot/OS drive for both the VMs and the hypervisor. I’ve had to repair my ceph cluster twice in the last week, so I thought it would be a good idea to talk about what happened and what caused it. Even though the cause is fairly easy to explain, the repair process was far from trivial. So let’s take this step by step.

Shared OS disk (4th SSD)       ──▶  Proxmox host OS
                               ──▶  VM1 OS + etcd + ceph-mon
                               ──▶  VM2 OS + etcd + ceph-mon
                               ──▶  VM3 OS + etcd + ceph-mgr

A) The cause

What happened? The ceph logs are pretty clear:

leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk

That fourth SSD is too slow to host the virtualized control planes, the hypervisor’s kernel, the ceph monitors and the ceph managers all at once. Etcd, the proxmox host and ceph all store state on a single consumer SSD that has no cache to absorb writes, which means that it can only acknowledge a write once it is “actually” stored on flash.

This is of course a newbie flaw in the design, because I have too little experience with hardware and replication. The cascade traces back not only to contention on a single drive but also to a vicious circle of latency. The latency of my consumer SSD is not constant, so you quickly run into a deadly spiral where one late request slows down every request queued behind it.

Proxmox host OS writes ──┐
VM1 etcd fsync           ├──▶ single SSD queue ──▶ 😬 contention
VM2 etcd fsync           │
VM3 etcd fsync           │
Ceph mon writes ─────────┘
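
To make the spiral concrete, here is a minimal sketch of a single first-in-first-out disk queue. The numbers are made up for illustration (one write every 10 ms, 5 ms of service time, and a single 50 ms stall on the third write), and `simulate_queue` is a hypothetical helper, not a measurement from my cluster.

```shell
#!/usr/bin/env bash
# Toy model of one FIFO disk queue. All timings are illustrative assumptions.
# Usage: simulate_queue INTERVAL_MS SERVICE_MS [SERVICE_MS ...]
# Prints one "request N latency=Xms" line per request.
simulate_queue() {
  local interval=$1; shift
  local free_at=0 arrival=0 n=1 service start finish
  for service in "$@"; do
    # A write can only start once it has arrived AND the disk is free.
    start=$(( arrival > free_at ? arrival : free_at ))
    finish=$(( start + service ))
    echo "request $n latency=$(( finish - arrival ))ms"
    free_at=$finish
    arrival=$(( arrival + interval ))
    n=$(( n + 1 ))
  done
}

# One write every 10 ms, 5 ms each, except the third one stalls for 50 ms.
simulate_queue 10 5 5 50 5 5 5
```

In this toy run the three writes after the stall each need only 5 ms of disk time but report 45, 40 and 35 ms, because they waited behind the slow one. An etcd fsync or a mon heartbeat stuck in that tail is the kind of thing that trips a timeout.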

B) Catalyst

If you look at my setup you’ll realize that I have a catalyst that makes matters even worse: my networking. I’m virtualizing networking at 3 different levels as well as doing VXLAN traffic encapsulation:

Pod-to-VM networking via veth:

pod <-> veth <-> VM kernel 

Inter-node overlay networking via Cilium
VM-to-hypervisor networking via tap

VM kernel <-> tap <-> Proxmox bridge (vmbr0)

So at the packet level, when those connections slow down, TCP packets are lost, disconnects and timeouts cause new TCP sessions to be initialized, and the monitors lose quorum because they cannot handle the incoming requests and/or stop responding altogether.

C) The repair

Like I said, the repair is far from trivial because ceph stores monitor IPs in 3 places: the monmap inside RocksDB, the Rook ConfigMap at the k8s level and the ceph.conf in each pod. On my second crash I simply had to remove the stale monitor IPs from the monmap, but on my Saturday crash I managed to end up with inconsistent secrets, which meant the daemons couldn’t even authenticate. Let’s focus on the second crash for now.

1) Split brain

Rook had already removed mons f and h, but the monmap still listed them as active. If you look at the logs you’ll see a spiral of inconclusive leader elections, because the two surviving mons both think there are 4 monitors and can’t establish a single source of truth.
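
The arithmetic behind that spiral: ceph monitors need a strict majority of the mons listed in the monmap to form quorum. A small sketch of that rule (the helper names `quorum_needed` and `has_quorum` are mine, not ceph's):

```shell
#!/usr/bin/env bash
# Majority needed for a monmap with N entries: floor(N/2) + 1.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum MONMAP_SIZE ALIVE_MONS -> exit status 0 if a majority is alive.
has_quorum() {
  [ "$2" -ge "$(quorum_needed "$1")" ]
}

# Stale monmap: 4 entries (f, g, h, i) but only g and i are alive.
has_quorum 4 2 && echo "quorum" || echo "no quorum"   # needs 3 of 4
# Repaired monmap: 2 entries (g, i), both alive.
has_quorum 2 2 && echo "quorum" || echo "no quorum"   # needs 2 of 2
```

With four entries in the map, two live mons can never reach the required three votes. A two-entry map works, but it has no slack: losing either mon loses quorum again.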

The solution is to scale down the running monitors, remove the stale entries for mons f and h from the monmap and allow quorum to form:

        
# Stop both mons before touching their stores; scale back to --replicas=1 after the inject
kubectl scale deploy -n rook-ceph rook-ceph-mon-g --replicas=0
kubectl scale deploy -n rook-ceph rook-ceph-mon-i --replicas=0

I didn’t have the ceph tools installed on the Fedora CoreOS host, so I had to bootstrap a debug pod on the node where mon-g and mon-i were running:

        
      
kubectl debug node/coreos-cp-2 -it --image=quay.io/ceph/ceph:v18 -- bash

Confirm that the monmap had stale IPs

        
ceph-mon --extract-monmap /tmp/monmap --mon-data /host/var/lib/rook/mon-g/data
monmaptool --print /tmp/monmap

Do the surgery

        
      
# Remove dead mons
monmaptool --rm f /tmp/monmap
monmaptool --rm h /tmp/monmap

# Verify: should only show mon.g and mon.i (now ranks 0 and 1)
monmaptool --print /tmp/monmap
epoch 18
fsid 876327be-c95c-448e-a08b-b2fb4c8ecb7e
last_changed 2026-05-19T05:13:41.192911+0000
created 2026-05-16T08:22:35.636587+0000
min_mon_release 18 (reef)
election_strategy: 1
0: [v2:10.233.60.222:3300/0,v1:10.233.60.222:6789/0] mon.g
1: [v2:10.233.36.135:3300/0,v1:10.233.36.135:6789/0] mon.i
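
Eyeballing the output works for two mons. As a sketch, the same check can be written as a filter, assuming the `monmaptool --print` layout shown above where each mon line ends in `mon.<name>` (`monmap_names` and `assert_mons_removed` are hypothetical helpers):

```shell
#!/usr/bin/env bash
# Print the mon names found in `monmaptool --print` output read from stdin.
# Assumes mon lines look like: "0: [v2:...,v1:...] mon.g"
monmap_names() {
  awk '/^[0-9]+: / && $NF ~ /^mon\./ { sub(/^mon\./, "", $NF); print $NF }'
}

# Exit non-zero if any of the given stale mon names is still in the map.
# Usage: monmaptool --print /tmp/monmap | assert_mons_removed f h
assert_mons_removed() {
  local names stale
  names=$(monmap_names)
  for stale in "$@"; do
    if grep -qx "$stale" <<<"$names"; then
      echo "mon.$stale is still in the monmap" >&2
      return 1
    fi
  done
}
```

Running the check before the inject is cheap insurance, since a monmap that still lists a dead mon would put the cluster straight back into the election loop.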

And finally inject the repaired monmap back into the mon’s RocksDB store

        
# Inject into mon-g's store
ceph-mon --inject-monmap /tmp/monmap --mon-data /host/var/lib/rook/mon-g/data

2) Cluster in limbo

I spent half of my Saturday recovering from an even more peculiar situation. Some deployments were running and others were not. I tried to reinstall the helm charts, but the deployments were not going through and the ceph cluster was in limbo:

        
      
Warning  FailedMount  21m (x382 over 13h)  kubelet  MountVolume.MountDevice failed for volume "pvc-07d14ec7-e804-4f2e-876c-266e13fd3903" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-28c7f4ae-63a6-426c-ada0-c022a8f65466 already exists
Warning  FailedMount  86s (x392 over 13h)  kubelet  MountVolume.MountDevice failed for volume "pvc-defc0212-31cc-4206-b364-588ff3c4f3eb" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-32d190d9-509b-49b6-a864-277b9280b5a1 already exists

Back in the rook namespace, 2 mgr and 2 cephfs controllers were in a crash loop, showing 2/3 pods running.

The connectivity between the MONs was working:

        
      
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- bash -c "
  timeout 3 bash -c 'echo > /dev/tcp/10.233.64.38/6789' && echo 'f:6789 OK' || echo 'f:6789 FAIL'
  timeout 3 bash -c 'echo > /dev/tcp/10.233.65.113/6789' && echo 'g:6789 OK' || echo 'g:6789 FAIL'
  timeout 3 bash -c 'echo > /dev/tcp/10.233.64.38/3300' && echo 'f:3300 OK' || echo 'f:3300 FAIL'
  timeout 3 bash -c 'echo > /dev/tcp/10.233.65.113/3300' && echo 'g:3300 OK' || echo 'g:3300 FAIL'
"

but the rook-ceph-tools pod had a stale ceph.conf

        
      
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- cat /etc/ceph/ceph.conf     
[global]
mon_host = 10.233.43.197:6789,10.233.60.222:6789,10.233.3.115:6789
[client.admin]
keyring = /etc/ceph/keyring
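
One way to confirm the staleness mechanically is to diff the `mon_host` line against the endpoints the operator currently tracks. This is a sketch under assumptions: it expects the plain `ip:port` format shown above and an endpoint list shaped like `name=ip:port,name=ip:port`, which is how I understand the `data` key of Rook's `rook-ceph-mon-endpoints` ConfigMap to look. `stale_mon_hosts` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# List mon_host addresses in a ceph.conf (stdin) that are absent from the
# operator's endpoint list, given as "name=ip:port,name=ip:port".
# Usage: stale_mon_hosts "f=10.233.64.38:6789,g=10.233.65.113:6789" < ceph.conf
stale_mon_hosts() {
  local current
  # Strip the "name=" prefixes, one address per line.
  current=$(tr ',' '\n' <<<"$1" | sed 's/^[^=]*=//')
  sed -n 's/^mon_host *= *//p' | tr ',' '\n' | while read -r addr; do
    grep -qxF "$addr" <<<"$current" || echo "$addr"
  done
}

# Against a live cluster (ConfigMap name and key are assumptions, see above):
# endpoints=$(kubectl get cm -n rook-ceph rook-ceph-mon-endpoints -o jsonpath='{.data.data}')
# kubectl exec -n rook-ceph deploy/rook-ceph-tools -- cat /etc/ceph/ceph.conf \
#   | stale_mon_hosts "$endpoints"
```

Any address it prints is one the pod will keep dialing even though no mon lives there anymore.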

I could solve the stale ceph.conf by simply redeploying a new pod

kubectl rollout restart deploy -n rook-ceph rook-ceph-operator rook-ceph-tools

but the MON keyring was missing in the mon-f pod

        
kubectl exec -n rook-ceph rook-ceph-mon-f-55ddd854b-kzs2n -- cat /etc/ceph/keyring 2>/dev/null || \
kubectl exec -n rook-ceph rook-ceph-mon-f-55ddd854b-kzs2n -- find / -name "*.keyring" 2>/dev/null | head -5

Since that secret gets mounted by the operator on pod startup, a rollout solved that issue as well.

The plot twist came when I had the brilliant idea of copying the admin key from the rook-ceph-mon secret into the unhealthy mon, since the failed authentication blocked communication altogether. Comparing the secret in the cluster with the keyring on the host made it apparent that the ceph cluster had been rotating keys and died halfway through.

        
kubectl get secret -n rook-ceph rook-ceph-mon -o jsonpath='{.data.ceph-secret}' | base64 -d
root@coreos-cp-1:/var/home/core# cat /var/lib/rook/rook-ceph/client.admin.keyring
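
Comparing the two by eye means printing secrets to the terminal. A small sketch that extracts the key from a keyring file and compares it without echoing it, assuming the usual keyring layout of a `[client.admin]` section followed by a `key = ...` line (`keyring_key` and `same_key` are hypothetical helpers):

```shell
#!/usr/bin/env bash
# Print the "key = ..." value of one entity from a ceph keyring read on stdin.
# Usage: keyring_key client.admin < client.admin.keyring
keyring_key() {
  awk -v section="[$1]" '
    /^\[/ { in_section = ($0 == section) }
    in_section && $1 == "key" && $2 == "=" { print $3; exit }
  '
}

# Compare two keys without echoing them; an empty key never matches.
same_key() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# On the node, with kubectl access:
# secret=$(kubectl get secret -n rook-ceph rook-ceph-mon -o jsonpath='{.data.ceph-secret}' | base64 -d)
# host=$(keyring_key client.admin < /var/lib/rook/rook-ceph/client.admin.keyring)
# same_key "$secret" "$host" && echo "keys match" || echo "keys differ"
```

In my case the two differed, which is what pointed at an interrupted key rotation.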

D) The solutions

Now I’m faced with a conundrum that can only be solved by either purchasing hardware or ditching ceph altogether in favor of static PVC allocation on my nodes with LVM, btrfs or ZFS. ZFS would have been quite good, since I could have replicated my block devices at the application level with volsync and my data pool at the ZFS level as well, but I chose the hardware upgrade for now.

My hardware choice settled on Samsung PM883 enterprise SSDs. They are only SATA, topping out at 6 Gbit/s, but they have:

1) The advantage of being SATA, which means I can throw them at my ProLiant Gen9 without having to buy an expensive add-on card for NVMe drives

2) PLP: power loss protection

3) A write-back cache that stabilizes my latency. Because the cache is covered by the PLP, the drive can acknowledge a write as soon as it is queued in the cache, so no more variable fsync latencies.

VM1 etcd fsync ──▶ SSD #1 (dedicated) ──▶ stable ~200µs ✅
VM2 etcd fsync ──▶ SSD #2 (dedicated) ──▶ stable ~200µs ✅
VM3 etcd fsync ──▶ SSD #3 (dedicated) ──▶ stable ~200µs ✅

Further Reading

k8s part 7 ceph backups

k8s part 5 persistent storage

k8s part 6 adding a blog to the cluster