Recently I experienced a self-induced failure of one of the nodes in my Proxmox cluster. After a significant amount of troubleshooting that yielded no results, I ended up re-installing the node and rejoining it to the cluster.
The steps I followed to remove the node from the cluster on one of the live devices were:
pvecm delnode failednodename
and then going to /etc/pve/nodes on all the working nodes and removing the folder belonging to the failed node.
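For example, with the failed node named failednodename as above, that directory cleanup would look something like:

rm -r /etc/pve/nodes/failednodename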
Now, because I have 4 nodes, I additionally have a QDevice providing a 5th vote, so that I could have a 3/5 quorum.
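For reference, a QDevice like this is normally added to the cluster with pvecm, pointing it at the external qnetd host (the IP below is just a placeholder):

pvecm qdevice setup 192.0.2.10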
All seemed well enough until I started to receive alerts saying “cluster not ready - no quorum”, and when I checked pvecm status,
two nodes were registered with the status flags NA,NV,NMW and the other two were now registered as NR.
A list of status flags explaining the acronyms:
A (Alive) or NA (Not Alive) - Shows the connectivity status between QDevice and Corosync. If there is a heartbeat between QDevice and Corosync, it is shown as alive (A).
V (Vote) or NV (Non Vote) - Shows whether the quorum device has given a vote (the letter V) to the node. A letter V means that both nodes can communicate with each other. In a split-brain situation, one node would be set to V and the other node would be set to NV.
MW (Master Wins) or NMW (Not Master Wins) - Shows whether the quorum device master_wins flag is set. By default the flag is not set, so you see NMW (not master wins). See the man page votequorum_qdevice_master_wins(3) for more information.
NR (Not Registered) - Shows that the cluster is not using a quorum device.
It turns out, after a lot more troubleshooting spent trying to set up the QDevice again (as I thought the issue was caused by that device to some degree), that the problem actually had to do with how Proxmox nodes communicate with each other via SSH keys.
Usually, for example, when a remote host is added to the known_hosts file, it will be in /home/username/.ssh or /root/.ssh.
Proxmox though, while using this location for user-controlled/initiated SSH sessions, has another location for storing SSH keys: /etc/ssh. If we navigate to this location and ls the directory, we can see a few different key pairs generated by Proxmox, as well as an ssh_known_hosts file.
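To check which key that file currently has on record for a particular node (pve-node2 here is just a placeholder hostname), something like this can be run:

ssh-keygen -F pve-node2 -f /etc/ssh/ssh_known_hosts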
Once I did a bit more digging into the ssh_known_hosts file, I could see that the old ssh_host_rsa_key.pub key from (in this case) my failed node was still in there, and that the new key generated during the re-installation had never replaced it.
I updated the public key on all nodes and voilà!
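A minimal sketch of that fix, assuming the reinstalled node is reachable as failednodename (the real hostname/IP will differ), run on each working node: remove the stale entry and pull in the new host key.

ssh-keygen -R failednodename -f /etc/ssh/ssh_known_hosts
ssh-keyscan failednodename >> /etc/ssh/ssh_known_hosts

Proxmox also ships pvecm updatecerts, which refreshes cluster certificates and known-hosts entries and may take care of this as well.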
Everything was working again (of course after re-initializing the QDevice).
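A rough sketch of that last step, using the same placeholder qnetd address as before: remove the QDevice and set it up again.

pvecm qdevice remove
pvecm qdevice setup 192.0.2.10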