Adventures in Upgrading ESXi

Here at NetApp, we have a variety of labs available to us to tinker with. I work with a few other TMEs in managing a few clustered Data ONTAP clusters, as well as an ESXi server farm. We have 6 ESXi servers that we just moved into a new lab location and are finally ready to be powered back up after a 4-5 month hiatus.

So, I figured, since the lab’s been down for so long anyway, why not upgrade the ESXi servers from 5.1 to 6.0 update 2 while we’re at it?

What could possibly go wrong on my first actual ESXi upgrade on servers that have been migrated from different IP addresses, some of which may still be lingering on the system and are unreachable?

Well, I’ll tell you.

On my first attempt at upgrading a server, all sorts of things broke:

  • vCenter couldn’t connect
  • The web client couldn’t connect – error was “503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x1f06ff18] _serverNamespace = / _isRedirect = false _port = 8309)”
  • esxcli and vim-cmd commands failed with:
[root@esxi1:~] esxcli
Connect to localhost failed: Connection failure.

After spending a few hours poking around to try to fix the issue, I decided it was probably user error. I had used “install” instead of “update” before rebooting, so that probably nuked the server, right?

So I tried again on a new server. This time, I read the manual and did the update the way that was supposedly correct. I even hit an error listed in the release notes and used VMware’s workaround:

~ # esxcli system maintenanceMode set --enable true
~ # esxcli system maintenanceMode get
Enabled
~ # esxcli software vib update -d /vmfs/volumes/vm_storage/ESX6/update-from-esxi6.0-6.0_update02.zip
 [DependencyError]
 VIB VMware_bootbank_esx-base_6.0.0-2.34.3620759 requires vsan >= 6.0.0-2.34, but the requirement cannot be satisfied within the ImageProfile.
 VIB VMware_bootbank_esx-base_6.0.0-2.34.3620759 requires vsan << 6.0.0-2.35, but the requirement cannot be satisfied within the ImageProfile.
 VIB VMware_bootbank_ehci-ehci-hcd_1.0-3vmw.600.2.34.3620759 requires xhci-xhci >= 1.0-3vmw.600.2.34, but the requirement cannot be satisfied within the ImageProfile.
 Please refer to the log file for more details.
~ # esxcli software profile update -d /vmfs/volumes/vm_storage/ESX6/update-from-esxi6.0-6.0_update02.zip -p ESXi-6.0.0-20160302001-standard
Update Result
 Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
 Reboot Required: true

After I rebooted:

[root@esxi1:~] esxcli
Connect to localhost failed: Connection failure.

Son of a…

I started Googling like a madman.

Found the ever-helpful William Lam’s blog on the web client issue. His recommendation was running a vim-cmd command. However…

[root@esxi2:~] vim-cmd hostsvc/advopt/update Config.HostAgent.plugins.solo.enableMob bool true
Failed to login: Invalid response code: 503 Service Unavailable

In the vpxa.log file, a ton of these:

verbose vpxa[FF8E8AC0] [Originator@6876 sub=vpxXml] [VpxXml] Error fetching /sdk/vimService?wsdl: 503 (Service Unavailable)
warning vpxa[FFCC0B70] [Originator@6876 sub=Default] Closing Response processing in unexpected state: 3
warning vpxa[FFCC0B70] [Originator@6876 sub=hostdcnx] [VpxaHalCnxHostagent] Could not resolve version for authenticating to host agent

 

The log suggested there was a connection failure on port 443, but telnet to that port worked fine. It took me a little bit of tinkering, but I finally figured out where that port number is controlled – /etc/vmware/vpxa/vpxa.cfg.

In that config file, I also noticed that the IP address was wrong – it was still using the old IP address the host had before the move. I corrected the IP address and changed the port to 80. Once I did that, the error changed a bit. This time, it was an SSL error:

Error in sending request - SSL Exception

I spent a bit more time poking around and finally decided – time to blow it up. Way easier to re-install a lab box than to try to dig through all the configuration files.

If you find yourself in a similar bind, don’t waste your time – unless it’s production. Then open a case.

I think my issue ended up being a combination of:

  • Stale IP addresses
  • Stale iSCSI HBA settings
  • Stale configs
  • Upgrading to ESXi 6 without addressing the above first

If anyone has any suggestions for fixing this issue, by all means, post in the comments. 🙂

UPDATE:

Both ESXi boxes have been wiped and reinstalled with ESXi 6.0. All is working fine. Funny story, though… after one re-image, I connected via SSH and thought it broke again. Turns out I had a duplicate IP and was still connecting to the old server. Ooops.

10 thoughts on “Adventures in Upgrading ESXi”

  1. Pingback: Behind the Scenes: Episode 34 – OpenStack Summit Austin Preview | Why Is The Internet Broken?

  2. Having this issue on a production box. VMware support is no help. Their solution is to reboot the box and take the VMs down with it. Not an acceptable solution. Has anyone else out there fixed this issue without taking out VMs with the host?

    • I’d push to get that case escalated. In my case, I’m pretty sure it could have been fixed with some config file changes. But I had no idea which ones I needed to change and wiping it was easier. Also, rebooting didn’t fix my particular issue and I have my doubts it would fix yours.

      • Same issue here, host disconnected from vCenter for no reason and won’t reconnect – exact same error messages. Case opened with VMware, will see what they come up with, I don’t really want to manually shut down 40 VMs in order to reboot the server…

  3. Pingback: Is this blog a Top vBlog 2016? | Why Is The Internet Broken?

  4. I know this is old, but we experienced this:

    Connect to localhost failed: Connection failure

    Turns out, our /etc/hosts on the ESXi servers had the ::1 localhost entry above the 127.0.0.1 localhost entry, and so it needed to be moved. Once we did that, everything worked as expected (following a reboot).
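
The ordering described in this comment can be checked with a short script. This is a minimal sketch under stated assumptions: the function name check_localhost_order and its three output labels are hypothetical, and it assumes an awk is available on the host. It only reports the order and does not edit the file.

```shell
#!/bin/sh
# Sketch: report whether the IPv4 or IPv6 localhost entry comes first.

# check_localhost_order [HOSTS_FILE]
# Prints "ipv4-first", "ipv6-first", or "incomplete" (one entry missing).
check_localhost_order() {
    hosts_file="${1:-/etc/hosts}"
    awk '
        /^[[:space:]]*#/ { next }           # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break        # stop at an inline comment
                if ($i == "localhost") {
                    # remember the first line number seen for each address
                    if ($1 == "127.0.0.1" && !v4) v4 = NR
                    if ($1 == "::1" && !v6) v6 = NR
                }
            }
        }
        END {
            if (!v4 || !v6) print "incomplete"
            else if (v4 < v6) print "ipv4-first"
            else print "ipv6-first"
        }
    ' "$hosts_file"
}
```

A result of "ipv6-first" would match the situation described above, where the ::1 entry had to be moved below the 127.0.0.1 entry, followed by a reboot.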

  5. Okay, I just checked /var/log/vmkernel.log and found these errors:

    2017-07-24T02:55:11.500Z cpu1:242009)Vol3: 1023: Couldn’t read volume header from : I/O error
    2017-07-24T02:55:11.503Z cpu1:242009)Vol3: 1023: Couldn’t read volume header from : I/O error
    2017-07-24T02:55:11.515Z cpu1:242009)Vol3: 1023: Couldn’t read volume header from naa.5000cca02d5daa38:1: I/O error
    2017-07-24T02:55:11.517Z cpu1:242009)Vol3: 1023: Couldn’t read volume header from naa.5000cca02d5daa38:1: I/O error
    … …

    2017-07-24T02:55:10.778Z cpu1:242009)WARNING: PLOG: PLOGValidateDisk:2546: Possibly corrupt metadata read from disk naa.5000cca02d5d12b8, checksum mismatch – expected 0xe09d656a84b98472, got 0x0 disk remains unpublished
    2017-07-24T02:55:10.779Z cpu1:242009)PLOG: PLOGProbeDevice:5214: Probed plog device 0x4313b820f770 exists.. continue with old entry
    2017-07-24T02:55:10.779Z cpu1:242009)WARNING: PLOG: PLOGValidateDisk:2546: Possibly corrupt metadata read from disk naa.5446a2e4dcccb002, checksum mismatch – expected 0xbf05aeb2e671663a, got 0x706050403020100 disk remains unpublished
    … …

    So I used the command “partedUtil delete /dev/disks/naa.5000cca02d5daa38 1” to delete the partition on each disk that reported errors, and the problem was resolved.
