Homelab Bound

A Quorum Conundrum

Posted on May 9, 2022 by shadow

Recently I experienced a self-induced failure of one of the nodes in my Proxmox cluster. After a significant amount of troubleshooting that yielded no results, I ended up reinstalling the node and rejoining it to the cluster.

To remove the failed node from the cluster, I ran the following on one of the live nodes:
pvecm delnode failednodename
I then went to /etc/pve/nodes on all the working nodes and removed the folder that belonged to the failed node.
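
The removal steps above can be written down as a small dry-run helper. This is only a sketch: the function name print_removal_steps is made up for illustration, and it prints the commands instead of running them, so they can be reviewed before anything is deleted.

```shell
# Dry run: print the node-removal commands instead of executing them.
# print_removal_steps is a hypothetical helper name, not a Proxmox tool.
print_removal_steps() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: print_removal_steps NODENAME" >&2
        return 1
    fi
    # Run once, on one of the live nodes.
    echo "pvecm delnode $node"
    # Remove the leftover directory that belonged to the failed node.
    echo "rm -r /etc/pve/nodes/$node"
    # Check the cluster state afterwards.
    echo "pvecm status"
}

print_removal_steps failednodename
```

Printing first and running by hand afterwards is a deliberate choice here, since removing the wrong directory under /etc/pve/nodes would affect a healthy node.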

Because I have 4 nodes, I also run a QDevice as a 5th vote, so that quorum is 3 out of 5 votes.
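
The arithmetic behind that 3/5 figure can be sketched in a few lines of shell, assuming the usual majority rule (more than half of all expected votes). The function names quorum_needed and failures_tolerated are invented for this example.

```shell
# Majority quorum: more than half of the total votes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# How many votes can be lost before quorum is gone.
failures_tolerated() {
    echo $(( $1 - $(quorum_needed "$1") ))
}

echo "4 votes: quorum $(quorum_needed 4), tolerates $(failures_tolerated 4) lost vote(s)"
echo "5 votes: quorum $(quorum_needed 5), tolerates $(failures_tolerated 5) lost vote(s)"
```

With 4 votes the cluster still needs 3 to be quorate, so the extra vote does not raise the threshold; it only raises the number of votes that can be lost from 1 to 2.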

All seemed well enough until I started to receive “cluster not ready – no quorum” alerts. When I checked the output of pvecm status, two nodes showed the QDevice flags NA,NV,NMW and the other two showed NR.

A list of status flags explaining the acronyms:

A (Alive) or NA (not alive)
Shows the connectivity status between QDevice and Corosync. If there is a heartbeat between QDevice and Corosync, it is shown as alive (A).
V (Vote) or NV (no vote)
Shows if the quorum device has given a vote (letter V) to the node. A letter V means that both nodes can communicate with each other. In a split-brain situation, one node would be set to V and the other node would be set to NV.
MW (Master wins) or NMW (not master wins)
Shows if the quorum device master_wins flag is set. By default, the flag is not set, so you see NMW (not master wins). See the man page votequorum_qdevice_master_wins(3) for more information.
NR (not registered)
Shows that the cluster is not using a quorum device.
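
To make the flag list above easier to apply, here is a small sketch that expands a comma-separated flag string, such as the NA,NV,NMW shown by pvecm status, into the descriptions from that list. The function name decode_qdevice_flags is hypothetical; it is not part of pvecm or Corosync.

```shell
# Expand a comma-separated QDevice flag string into one description per line.
decode_qdevice_flags() {
    printf '%s\n' "$1" | tr ',' '\n' | while read -r flag; do
        case "$flag" in
            A)   echo "A: QDevice is alive" ;;
            NA)  echo "NA: QDevice is not alive" ;;
            V)   echo "V: vote given" ;;
            NV)  echo "NV: no vote given" ;;
            MW)  echo "MW: master_wins set" ;;
            NMW) echo "NMW: master_wins not set" ;;
            NR)  echo "NR: not registered" ;;
            *)   echo "$flag: unknown flag" ;;
        esac
    done
}

decode_qdevice_flags "NA,NV,NMW"
```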

After a lot more troubleshooting, including setting up the QDevice again because I thought it was at least partly to blame, it turned out that the issue actually had to do with how Proxmox nodes communicate with each other via SSH keys.

Usually, when a remote host is added to the known_hosts file, that file lives in /home/username/.ssh or /root/.ssh.

Proxmox, though, while using this location for user-initiated SSH sessions, also keeps SSH keys in /etc/ssh. If we navigate to this location and list the directory, we can see a few different host key pairs generated when the node was installed, as well as an ssh_known_hosts file.

Once I looked a bit more closely at the ssh_known_hosts file, I could see that the old ssh_host_rsa_key.pub key from my failed node was still in there; the new key generated during the reinstallation had not replaced it.
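
A check like the one described above can be sketched as a shell function that tests whether the key from a .pub file appears in a known_hosts file. The name known_hosts_has_key is made up for this example, and the sketch assumes the standard OpenSSH public key layout of "type, base64 key, optional comment".

```shell
# Succeeds (exit 0) if the key from a .pub file is present in a known_hosts file.
# known_hosts_has_key is a hypothetical helper name.
known_hosts_has_key() {
    known_hosts="$1"
    pub_file="$2"
    # A .pub file is "<type> <base64-key> [comment]"; field 2 is the key itself.
    blob=$(awk '{print $2; exit}' "$pub_file")
    [ -n "$blob" ] || return 2
    # Fixed-string match, so the base64 text is never treated as a regex.
    grep -qF -- "$blob" "$known_hosts"
}

# Example, run on the reinstalled node:
#   known_hosts_has_key /etc/ssh/ssh_known_hosts /etc/ssh/ssh_host_rsa_key.pub \
#       || echo "the current host key is missing from ssh_known_hosts"
```

Matching on the key text rather than the host name means the check does not depend on how the host entry is written in the file.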

I updated the public key on all nodes and voilà!

Everything was working again (after re-initializing the QDevice, of course).
