Random Post: Thoughts on RHEV
RSS .92| RSS 2.0| ATOM 0.3
  • Home
  •  

    How available is VMware’s Round Robin Path Selection Plugin?

    VMware wisely introduced Round Robin (RR) as a supported path selection plugin (PSP) as part of their Native Multipathing (NMP) suite.  We originally dealt with Fixed Path (FP) and Most Recently Used (MRU), which led to administrators to directly manage I/O if their servers had multiple host bus adapters (HBA) and/or storage targets.  I’m sure most admins stuck with fixed path, which provided for an active/passive configuration, and went on with their business.  I know I have.

    Along comes Round Robin to many cheers and hoorays, and we merrily go about our business and configure our hosts to use it.  We extolled the virtues to our management saying how available our connection to the SAN fabric and storage has become now that we are using one path per I/O.  I know I did.

    Here is the rub.  Round Robin is not designed to provide High Availability.  As it turns out, VMware has designed and built RR (and the two other PSP for that matter) to only mark a path dead when it receives certain SCSI sense codes.  You can find the full list on the VMware KB.  They have followed the specifications, and I completely understand why they made their choices.

    But what happens when we are having issues with the SAN fabric and do NOT see those specifics sense codes?  If you guessed nothing, you would be very correct.  The implementation of RR will only move to the next path upon a SUCCESSFUL (sense code 0x0 or 0x00) I/O.  If we receive “soft” errors such as command: 0x2 (Host_Bus_Busy) or 0x28 (Task Set Full), RR will not move to the next path, due to the fact it has not received a 0x0 code.  VMware’s definition of SCSI error conditions can be found at VMware KB #1030381.  This means we will be STUCK on a path that is having problems, which results in failed I/Os.  You may notice the ESX(i) server disconnect from vCenter, the ability to run df from the command line diminish, and no way to enumerate the storage.  You will also notice that virtual machines stop responding to pings, and applications failures there-in.  This is a result of failed I/O’s in the virtual machine.  SCSI errors will be clearly visible within the logs of guest OS.

    You can find the error messages on ESX(i) in the system log from the command line:

    • do: cd /var/log
    • do: grep naa messages

    Or look through the old logs:

    • do: zcat messages*.gz |grep naa

    Due to ESXi’s very quick log rotation, you may be out of luck by the time you respond to an event.  You should take the time and export syslog from your ESXi hosts to a central server such as the vMA appliance.  If you need help setting up syslog, see Kanuj Behl’s post blog post on vmwise.com.

    Leave a Reply

    Your email address will not be published. Required fields are marked *