Recently I experienced a self-induced failure of one of the nodes in my Proxmox cluster. After a significant amount of troubleshooting that yielded no results, I ended up re-installing the node and rejoining it to the cluster.
The steps I followed to remove the node from the cluster on one of the live devices were:
pvecm delnode failednodename
and then going to /etc/pve/nodes on all the working nodes and removing the folder belonging to the failed node.
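For example, with the failed node named failednodename as above, that directory cleanup would look something like:

rm -r /etc/pve/nodes/failednodename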
Now, because I have 4 nodes, I additionally have a QDevice providing a 5th vote, so that I could have a 3/5 quorum.
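For reference, a QDevice like this is normally added to the cluster with pvecm, pointing it at the external qnetd host (the IP below is just a placeholder):

pvecm qdevice setup 192.0.2.10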
All seemed well enough until I started to receive alerts saying “cluster not ready - no quorum”, and when I checked pvecm status,
two nodes were registered with the status flags NA,NV,NMW and the other two were now registered as NR.
A list of status flags explaining the acronyms:
A (Alive) or NA (Not Alive) - Shows the connectivity status between QDevice and Corosync. If there is a heartbeat between QDevice and Corosync, it is shown as alive (A).
V (Vote) or NV (Non Vote) - Shows whether the quorum device has given a vote (the letter V) to the node. A letter V means that both nodes can communicate with each other. In a split-brain situation, one node would be set to V and the other node would be set to NV.
MW (Master Wins) or NMW (Not Master Wins) - Shows whether the quorum device master_wins flag is set. By default the flag is not set, so you see NMW (not master wins). See the man page votequorum_qdevice_master_wins(3) for more information.
NR (Not Registered) - Shows that the cluster is not using a quorum device.
It turns out, after a lot more troubleshooting spent trying to set up the QDevice again (as I thought the issue was caused by that device to some degree), that the problem actually had to do with how Proxmox nodes communicate with each other via SSH keys.
Usually, for example, when a remote host is added to the known_hosts file, it will be in /home/username/.ssh or /root/.ssh.
Proxmox though, while using this location for user-controlled/initiated SSH sessions, has another location for storing SSH keys: /etc/ssh. If we navigate to this location and ls the directory, we can see a few different key pairs generated by Proxmox, as well as an ssh_known_hosts file.
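To check which key that file currently has on record for a particular node (pve-node2 here is just a placeholder hostname), something like this can be run:

ssh-keygen -F pve-node2 -f /etc/ssh/ssh_known_hosts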
Once I did a bit more digging into the ssh_known_hosts file, I could see that the old ssh_host_rsa_key.pub key from (in this case) my failed node was still in there, and that the new key generated during the re-installation had never replaced it.
I updated the public key on all nodes and voilà!
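A minimal sketch of that fix, assuming the reinstalled node is reachable as failednodename (the real hostname/IP will differ), run on each working node: remove the stale entry and pull in the new host key.

ssh-keygen -R failednodename -f /etc/ssh/ssh_known_hosts
ssh-keyscan failednodename >> /etc/ssh/ssh_known_hosts

Proxmox also ships pvecm updatecerts, which refreshes cluster certificates and known-hosts entries and may take care of this as well.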
Everything was working again (of course after re-initializing the QDevice).
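A rough sketch of that last step, using the same placeholder qnetd address as before: remove the QDevice and set it up again.

pvecm qdevice remove
pvecm qdevice setup 192.0.2.10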