
    Killing VMware HA, with extreme brutality

    There are many blog entries about VMware HA and how great it is.  And they are correct; it is fantastic.  But what happens when you have a very large HA/DRS cluster, and the nodes lose track of which hosts are the 5 primaries?  And what happens when you right-click on the cluster, deselect HA, and then nothing happens?

    SSH to each host where the operation has stalled (possibly at 5%) and issue the following command:

    service vmware-aam stop ; pgrep -f aam | xargs kill

    Breaking down the command:

    service vmware-aam stop: issues a stop command to the vmware-aam service (which provides HA).  It will likely have no effect, since HA is in such bad shape.

    pgrep -f aam | xargs kill: runs a process grep, matching the pattern aam against each process's full command line (-f), and pipes the resulting process IDs to xargs, which passes them to kill.

    The command combo will need to be executed on each server that is having an issue.  Once all have been fixed, reconfigure for HA.
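
    If there are many hosts to fix, the same command can be pushed out in a loop.  What follows is only a rough sketch, not something from the original procedure: the host names are made up, it assumes root can log in over ssh on each host, and it defaults to a dry run that just prints what it would do.

```shell
#!/bin/sh
# Hypothetical host names; replace with the hosts that are actually stalled.
HOSTS="${HOSTS:-esx01 esx02 esx03}"

# Run the stop-and-kill command from this post on every host in $HOSTS.
# With DRY_RUN=1 (the default) it only prints what it would do.
kill_aam_on_hosts() {
    for host in $HOSTS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run on $host: service vmware-aam stop ; pgrep -f aam | xargs kill"
            continue
        fi
        # Assumption: root login over ssh is allowed on the host.
        # The commands are fed on stdin so that the remote shell's own
        # command line does not contain "aam" and is not matched by pgrep -f.
        ssh "root@$host" 'sh -s' <<'EOF'
service vmware-aam stop
pgrep -f aam | xargs kill
EOF
    done
}

kill_aam_on_hosts
```

    Set DRY_RUN=0 once the printed list of hosts looks right.  Passing the whole command as an ssh argument instead would put the string aam on the remote shell's command line, so pgrep -f would match that shell as well.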

    An easy way to watch the status of HA rebuilding itself is to ssh to the first server that has successfully reconfigured, and issue the following commands:

    export FT_DIR=/opt/vmware/aam
    while true ; do $FT_DIR/bin/ftcli -domain vmware -cmd ln ; sleep 30; done

    These commands will set the environment variable, FT_DIR, to the appropriate path, and then will list nodes (-cmd ln) in the HA cluster every 30 seconds until you hit Control-C.
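
    A slightly more structured variation on that loop, as a sketch only: it prints a timestamp before each listing and stops after a fixed number of runs instead of looping until interrupted.  The function name watch_nodes and its arguments are my own invention; the ftcli invocation is the one shown above.

```shell
#!/bin/sh
# Same path as above; can be overridden from the environment.
FT_DIR="${FT_DIR:-/opt/vmware/aam}"
export FT_DIR

# watch_nodes COUNT INTERVAL [COMMAND...]   (hypothetical helper)
# Runs the node listing COUNT times, INTERVAL seconds apart,
# printing a timestamp header before each run.
watch_nodes() {
    count="$1"
    interval="$2"
    shift 2
    # Default to the ftcli call from this post when no command is given.
    if [ "$#" -eq 0 ]; then
        set -- "$FT_DIR/bin/ftcli" -domain vmware -cmd ln
    fi
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "== $(date) =="
        "$@" || echo "listing failed (is HA configured on this host yet?)"
        i=$((i + 1))
        # Do not sleep after the final run.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

    For example, watch_nodes 20 30 lists the nodes every 30 seconds for roughly ten minutes.  The optional trailing command is only there so the loop can be tried out on a machine without ftcli.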

    ** Update 8/3/2010:

    Duncan Epping just posted an entry about updates to the HA CLI, which will now display the machine that is the “primary master” and adds the ability to promote and demote machines as necessary.  Available (at least) in vSphere 4.1.  Great stuff!
