Behind the Scenes: Episode 36 – Storage Services Design: Object Storage


Welcome to the Episode 36 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week we talk to the guys from NetApp IT and the Storage Services Design team, as well as Duncan Moore, who heads up the StorageGRID team here at NetApp. We discuss object storage and how the Storage Services Design team is helping NetApp IT implement it.

This episode featured:

  • Duncan Moore, who heads up the StorageGRID team at NetApp
  • The folks from NetApp IT and the Storage Services Design team

The official podcast blog is here:

http://community.netapp.com/t5/Technology/Tech-ONTAP-Podcast-Episode-36-Storage-Services-Design-Object-Storage/ba-p/118849

You can hear the podcast here:

Behind the Scenes: Episode 35 – Mesosphere DC/OS


Welcome to the Episode 35 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week, we were fortunate to get the folks at Mesosphere on Skype during the week of the DC/OS release. Andrew actually drove the scheduling here, and the timing worked out well.

This episode featured:

  • Kiersten Gaffney of Mesosphere (@kierstengaffney)
  • Aaron Williams of Mesosphere (@_arw)
  • Garrett Mueller, Technical Director at NetApp
  • Bryan Naylor, developer at NetApp

The official NetApp blog post on the episode can be found here: http://community.netapp.com/t5/Technology/Tech-ONTAP-Podcast-Episode-35-Mesosphere-DC-OS/ba-p/118579

Recording the podcast

This week, we had a combination of local people from NetApp, as well as Skype callers from Mesosphere.

The recording went pretty smoothly. Fun fact from last week – we forgot to hit record until about 20 minutes into the OpenStack Summit podcast. True professionalism there.

DC/OS is ready to download and try today at https://dcos.io/. You can also find Andrew’s new Docker blog post at http://community.netapp.com/t5/Technology/Docker-Volumes-Using-NetApp-Storage/ba-p/108159 and a link for the new NetApp Docker plugin here: http://netapp.github.io/openstack/2016/04/19/announcing-the-netapp-docker-volume-plugin/index.html

Now, for the podcast!

Behind the Scenes: Episode 34 – OpenStack Summit Austin Preview

Welcome to the Episode 34 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

I didn’t write up an Episode 33 version of this last week, mainly because I was fighting with an ESXi upgrade.

This week, we brought in a number of our OpenStack people (both NetApp and SolidFire) to talk about the upcoming OpenStack Summit in Austin, TX and what NetApp and SolidFire would be doing there.

On this episode we had:

You can also reach Rob @openstacknetapp.

SolidFire posted a blog on the preview as well.

Recording the podcast

We only had Andrew, Glenn and myself in the studio – everyone else dialed in via Skype. So, no pictures or funny stories. 🙂

But, I have been to Austin before and it’s a pretty awesome city. If you want photos, I have some here.


Now, for the episode!

Adventures in Upgrading ESXi

Here at NetApp, we have a variety of labs available to us to tinker with. I work with a few other TMEs managing some clustered Data ONTAP clusters, as well as an ESXi server farm. We have six ESXi servers that we just moved into a new lab location; they're finally ready to be powered back up after a 4-5 month hiatus.

So, I figured, since the lab’s been down for so long anyway, why not upgrade the ESXi servers from 5.1 to 6.0 update 2 while we’re at it?

What could possibly go wrong on my first real ESXi upgrade, on servers that had been migrated from different IP addresses – some of which might still be lingering in the configuration, unreachable?

Well, I’ll tell you.

On my first attempt at upgrading a server, all sorts of things were broken:

  • vCenter couldn’t connect
  • The web client couldn’t connect – error was “503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x1f06ff18] _serverNamespace = / _isRedirect = false _port = 8309)”
  • esxcli and vim-cmd commands failed with:
[root@esxi1:~] esxcli
Connect to localhost failed: Connection failure.
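
For what it's worth, the standard first aid for this kind of generic connection failure is restarting the management agents from the ESXi shell. These are stock ESXi commands and worth a shot before anything drastic:

/etc/init.d/hostd restart   # restart the host agent that esxcli and vim-cmd talk to
/etc/init.d/vpxa restart    # restart the vCenter agent running on the host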

After spending a few hours poking around trying to fix the issue, I decided it was probably user error. I had used "install" instead of "update," so that probably nuked the server when I rebooted, right?
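
Roughly, the difference is this: "vib install" lays down what's in the depot and can remove or downgrade existing VIBs (like third-party drivers) in the process, while "vib update" only moves already-installed VIBs forward – which is why install is the riskier choice on an upgrade bundle. Same depot file either way:

~ # esxcli software vib install -d /vmfs/volumes/vm_storage/ESX6/update-from-esxi6.0-6.0_update02.zip
~ # esxcli software vib update -d /vmfs/volumes/vm_storage/ESX6/update-from-esxi6.0-6.0_update02.zip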

So I tried again on a new server. This time, I read the manual and did the update the way it was supposed to be done. I even hit an error documented in the release notes and used VMware's workaround:

~ # esxcli system maintenanceMode set --enable true
~ # esxcli system maintenanceMode get
Enabled
~ # esxcli software vib update -d /vmfs/volumes/vm_storage/ESX6/update-from-esxi6.0-6.0_update02.zip
 [DependencyError]
 VIB VMware_bootbank_esx-base_6.0.0-2.34.3620759 requires vsan >= 6.0.0-2.34, but the requirement cannot be satisfied within the ImageProfile.
 VIB VMware_bootbank_esx-base_6.0.0-2.34.3620759 requires vsan << 6.0.0-2.35, but the requirement cannot be satisfied within the ImageProfile.
 VIB VMware_bootbank_ehci-ehci-hcd_1.0-3vmw.600.2.34.3620759 requires xhci-xhci >= 1.0-3vmw.600.2.34, but the requirement cannot be satisfied within the ImageProfile.
 Please refer to the log file for more details.
~ # esxcli software profile update -d /vmfs/volumes/vm_storage/ESX6/update-from-esxi6.0-6.0_update02.zip -p ESXi-6.0.0-20160302001-standard
Update Result
 Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
 Reboot Required: true
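
For reference, when a host does come back up healthy, the new build is easy to confirm from the shell:

~ # vmware -vl
~ # esxcli system version get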

After I rebooted:

[root@esxi1:~] esxcli
Connect to localhost failed: Connection failure.

Son of a…

I started Googling like a madman.


I found the ever-helpful William Lam's blog post on the web client issue. His recommendation was running a vim-cmd command. However…

[root@esxi2:~] vim-cmd hostsvc/advopt/update Config.HostAgent.plugins.solo.enableMob bool true
Failed to login: Invalid response code: 503 Service Unavailable

In the vpxa.log file, there were a ton of these:

verbose vpxa[FF8E8AC0] [Originator@6876 sub=vpxXml] [VpxXml] Error fetching /sdk/vimService?wsdl: 503 (Service Unavailable)
warning vpxa[FFCC0B70] [Originator@6876 sub=Default] Closing Response processing in unexpected state: 3
warning vpxa[FFCC0B70] [Originator@6876 sub=hostdcnx] [VpxaHalCnxHostagent] Could not resolve version for authenticating to host agent

The log suggested there was a connection failure on port 443, but telnet to that port worked fine. It took me a little bit of tinkering, but I finally figured out where that port number is controlled – /etc/vmware/vpxa/vpxa.cfg.
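
For illustration, the relevant chunk of vpxa.cfg looks roughly like this – a trimmed sketch with made-up addresses, and the exact elements can vary between ESXi releases:

<config>
  <vpxa>
    <hostIp>10.0.0.50</hostIp>      <!-- the host's own management IP - stale in my case -->
    <serverIp>10.0.0.10</serverIp>  <!-- the vCenter server the agent talks to -->
    <serverPort>902</serverPort>    <!-- the port the agent uses -->
  </vpxa>
</config>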

In that config file, I also noticed that my IP address was wrong – it was using the old IP addresses the hosts had. I changed the IP address and switched the port to 80. Once I did that, my error changed a bit. This time, it was an SSL error:

Error in sending request - SSL Exception

I spent a bit more time poking around and finally decided – time to blow it up. Way easier to re-install a lab box than to try to dig through all the configuration files.

If you find yourself in a similar bind, don’t waste your time – unless it’s production. Then open a case.

I think my issue ended up being a combination of the following (a few sanity-check commands are sketched after the list):

  • Stale IP addresses
  • Stale iSCSI HBA settings
  • Stale configs
  • Upgrading to ESXi 6 without addressing the above first
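
If you'd rather catch this kind of staleness before an upgrade instead of after, a quick read-only pass with standard esxcli commands goes a long way (nothing here changes state):

~ # esxcli network ip interface ipv4 get   # VMkernel IPv4 addresses - look for old lab IPs
~ # esxcli network ip route ipv4 list      # routes and gateways left over from the old subnet
~ # esxcli iscsi adapter list              # iSCSI HBAs that may still point at retired targets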

If anyone has any suggestions for fixing this issue, by all means, post in the comments. 🙂

UPDATE:

Both ESXi boxes have been wiped and reinstalled with ESXi 6.0. All is working fine. Funny story, though… after one re-image, I connected via SSH and thought it had broken again. Turns out I had a duplicate IP address and was still connecting to the old server. Oops.
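
If you ever want to rule out a duplicate IP from a Linux admin box before blaming the host itself, arping's duplicate address detection mode does the trick – the interface name and address below are placeholders for your own:

$ arping -D -c 2 -I eth0 10.0.0.50   # duplicate address detection: any reply means another machine owns the IP
$ ssh-keygen -R 10.0.0.50            # clear the cached SSH host key after a re-image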