Behind the Scenes: Episode 83 – OnCommand Unified Manager 7.2

Welcome to Episode 83, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we chat with Philip Bachman and Yossi Weihs about the new OnCommand Unified Manager update. Find out what’s new and why Unified Manager can truly claim to be a Unified Manager!

What is OnCommand Unified Manager?

OnCommand Unified Manager is NetApp’s monitoring software package. It is the next evolution of Data Fabric Manager (DFM) and allows storage administrators to monitor capacity, performance and other storage events. It also lets you set up notifications and run scripts when thresholds are reached. OCUM 7.2 can be deployed on Linux or Windows, as well as a standard OVA in ESXi.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 82 – DockerCon Preview Featuring nDVP and Trident

Welcome to Episode 82, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, it’s just Justin and Andrew chatting about the upcoming Docker conference in Austin, TX, as well as updates to the NetApp Docker Volume Plugin and Trident. We get a bit rambunctious on this one, and in honor of Easter, we leave plenty of Easter eggs!

If you’ll be in Austin for DockerCon or Boston for Red Hat Summit or OpenStack Summit, be sure to look for Andrew and the rest of the NetApp team!

For information on NetApp Docker Volume Plugin, Trident and more, go to the Pub:


http://netapp.io

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 81 – NetApp Service Level Manager

Welcome to Episode 81, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we invited the NetApp Service Level Manager team to talk to us about how they’re revolutionizing storage as a service and automating day-to-day storage administration tasks. Join Executive Architect Evan Miller (@evancmiller), Product Manager Nagananda Anur (naga@netapp.com) and Technical Director Ameet Deulgaonkar (Ameet.Deulgaonkar@netapp.com, @Amyth18) as they detail the benefits of NetApp Service Level Manager!

For more details on NetApp Service Level Manager, we encourage you to check out the additional blogs below:

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Case study: Using OSI methodology to troubleshoot NAS

Recently, I installed some 10Gb cards into an AFF8040 so I could run some FlexGroup performance tests (stay tuned for that). I was able to install the cards myself, but to get them connected to a network here at NetApp’s internal labs, you have to file a ticket. This should sound familiar to many people, as this is how real-world IT works.

So I filed the ticket and eventually, the cards were connected. However, just like in real-world IT, the network team has no idea what the storage team (me) has configured, and the storage team (me) has no idea how the network team has things configured. So we had to troubleshoot a bit to get the cards to ping correctly. Turns out, they had a VLAN tag on the ports that wasn’t needed. We removed that, fixed the port channel, and cool! We now had two 10Gb LACP interfaces on a two-node cluster!
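
Incidentally, if you want to sanity-check what the storage side has configured before going back and forth with the networking team, clustered ONTAP can show you the ifgrp, VLAN and port configuration directly. A minimal sketch (the node name matches the lab cluster used later in this post; adjust for your own environment):

ontap9-tme-8040::> network port ifgrp show -node ontap9-tme-8040-01
ontap9-tme-8040::> network port vlan show -node ontap9-tme-8040-01
ontap9-tme-8040::> network port show -node ontap9-tme-8040-01 -fields mtu

The first two commands list the port channel membership and any VLAN tags the cluster thinks it has; the last one shows the MTU per port.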

Not so fast…

Turns out, ping is a great test for basic connectivity. But it’s awful for checking if stuff *actually works.* In this case, I could ping via the 10Gb interfaces and even mount via NFSv3 and list directories, etc. But those are lightweight metadata operations.

Whenever I tried a heavier operation like a READ, WRITE or READDIRPLUS (incidentally, tab completion for a path on an NFS mount generates a READDIRPLUS call), the client would hang indefinitely. When I would Ctrl + C out of the command, the process would sometimes also hang, and subsequent operations, including GETATTR, LOOKUP and so on, would hang as well.

So, now I had a robust network that couldn’t even perform tab completions.

Narrowing down the issue

I like to start with a packet trace, as that gives me a hint where to focus my efforts. In this issue, I started a packet capture on both the client (10.63.150.161) and the cluster (10.193.67.218). In the traces, I saw some duplicate ACKs, as well as packets being sent but not replied to:

[Client-side packet trace: duplicate ACKs and READDIRPLUS packets with no reply]

In the corresponding filer trace, I saw the READDIRPLUS call come in and get replied to, and then re-transmitted a bunch of times. But, as the trace above shows, the client never receives it.

[Filer-side packet trace: READDIRPLUS reply sent and retransmitted]

That means the filer is doing what it’s supposed to. The client is doing what it’s supposed to. But the network is blocking or dropping the packet for some reason.
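
For reference, the client-side trace was just a standard tcpdump. Something like the sketch below (the interface name and output file are placeholders) produces a .pcap you can open in Wireshark and filter for retransmissions. The cluster-side trace came from ONTAP’s own packet capture facility (pktt in the nodeshell on the release I was running); the exact syntax varies by version, so check the docs for yours.

# capture all traffic to/from the cluster data LIF on the client's 10Gb interface
tcpdump -i eth1 -s 0 -w /tmp/client-side.pcap host 10.193.67.218
# reproduce the hang in another terminal, then Ctrl + C the capture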

When troubleshooting any issue, you have to start with a few basic steps (even though I like to start with the more complicated packet capture).

For instance…

What changed?

Well, this one was easy – I had added an entire new network into the mix, end to end. My previous ports were 1Gb and worked fine. This was 10Gb infrastructure, with LACP and jumbo frames. And I had no control over that network. Thus, I was left with client and server troubleshooting for now. I didn’t want to file another ticket before I had done my due diligence, in case I had done something stupid (totally within the realm of possibility, naturally).

So where did I go from there?

Start at layers 1, 2 and 3

The OSI model is something I used to dismiss as one of those questions interviewers asked just to stump people. However, over the course of the last 10 years, I’ve come to realize it’s genuinely useful. What I was troubleshooting was NFS, which sits all the way up at layer 7 (the application layer).


So why start at layers 1-3? Why not start where my problem is?

Because with years of experience, you learn that the issue is rarely at the layer you’re seeing the issue manifest. It’s almost always farther down the stack. Where do you think the “Is it plugged in?” joke comes from?


Layer 1 means, essentially, is it plugged in? In this case, yes, it was. But it also means “are we seeing errors on the interfaces that are plugged in?” In ONTAP, you can see that with this command:

ontap9-tme-8040::*> node run * ifstat e2a
Node: ontap9-tme-8040-01

-- interface e2a (8 days, 23 hours, 14 minutes, 30 seconds) --

RECEIVE
 Frames/second: 1 | Bytes/second: 30 | Errors/minute: 0
 Discards/minute: 0 | Total frames: 84295 | Total bytes: 7114k
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 0
 No buffers: 0 | Non-primary u/c: 0 | L2 terminate: 9709
 Tag drop: 0 | Vlan tag drop: 0 | Vlan untag drop: 0
 Vlan forwards: 0 | CRC errors: 0 | Runt frames: 0
 Fragment: 0 | Long frames: 0 | Jabber: 0
 Error symbol: 0 | Illegal symbol: 0 | Bus overruns: 0
 Queue drop: 0 | Xon: 0 | Xoff: 0
 Jumbo: 0 | JMBuf RxFrames: 0 | JMBuf DrvCopy: 0
TRANSMIT
 Frames/second: 82676 | Bytes/second: 33299k | Errors/minute: 0
 Discards/minute: 0 | Total frames: 270m | Total bytes: 1080g
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 4496
 Queue overflows: 0 | No buffers: 0 | Xon: 0
 Xoff: 0 | Jumbo: 13 | TSO non-TCP drop: 0
 Split hdr drop: 0 | Pktlen: 0 | Timeout: 0
 Timeout1: 0 | Stray Cluster Pk: 0
DEVICE
 Rx MBuf Sz: Large (3k)
LINK_INFO
 Current state: up | Up to downs: 22 | Speed: 10000m
 Duplex: full | Flowcontrol: none

In this case, the interface is pretty clean. No errors, no “no buffers,” no CRC errors, etc. I can also see that the ports are “up.” The up to downs are high, but that’s because I’ve been adding/removing this port from the ifgrp multiple times, which leads me to the next step…

Layer 2/3

Layer 2 includes the LACP/port channel, as well as the MTU settings. Layer 3 covers IP: pings, routing and any Layer 3 switches in the path.

Since the port channel was a new change, I made sure that the networking team verified that the port channel was configured properly, with the correct ports added to the channel. I also made sure that the MTU was 9216 on the switch ports, and that jumbo frames (MTU 9000) were set on the client and storage ports. Those all checked out.
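
Checking the MTU on the endpoints you control only takes a minute. A minimal sketch (the Linux interface name is an example):

ontap9-tme-8040::> network port show -node * -fields mtu

# on the Linux client
ip link show eth1 | grep mtu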

However, that doesn’t mean we’re done with layer 2; remember, basic pings worked fine, but those are small packets that fit comfortably inside a 1500 MTU. That means we’re not actually testing jumbo frames here. The issue was that any NFS operation that wasn’t lightweight metadata never made it back to the client; that suggests a network issue somewhere.

I didn’t mention before, but this cluster also has a properly working 1Gb network with a 1500 MTU on the same subnet, so that told me routing was likely not an issue. And because the client was able to send its data just fine and had had working 10Gb connectivity for a while, the issue likely wasn’t on the network segment the client was connected to. The problem resided somewhere between the filer’s 10Gb ports and the new switch those ports were connected to. (Remember… what changed?)

Jumbo frames

From my experience with troubleshooting and general IT knowledge, I knew that for jumbo frames to work properly, they had to be configured up and down the entire stack of the network. I knew the client was configured for jumbo frames properly because it was a known entity that had been chugging along just fine. I also knew that the filer had jumbo frames enabled because I had control over those ports.

What I wasn’t sure of was if the switch had jumbo frames configured for the entire stack. I knew the switch ports were fine, but what about the switch uplinks?

Luckily, ping can tell us. Did you know you could ping using MTU sizes?

Pinging MTU in Windows

To ping using a packet size in Windows, use:

ping -f -l [size] [address]

-f means “don’t fragment the packet.” That means, if I am sending a jumbo-sized packet, don’t break it up into pieces to fit. If you ping using -f and a large packet size, that packet needs to fit within the network MTU. If it can’t, you’ll see this:

C:\>ping -f -l 9000 10.193.67.218

Pinging 10.193.67.218 with 9000 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Ping statistics for 10.193.67.218:
 Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)

Then, try pinging with only -l (which specifies the packet size), so the packet is allowed to fragment. If the -f ping fails but the plain -l ping succeeds, you have a good idea that your issue is MTU size. Note: My Windows client didn’t have jumbo frames enabled, so I didn’t bother using it to troubleshoot.
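
For example, on a client that did have jumbo frames enabled, the pair of commands to compare would look something like this (same cluster LIF as above; 8972 leaves room for the 28 bytes of IP/ICMP overhead covered below):

ping -f -l 8972 10.193.67.218
ping -l 8972 10.193.67.218

If the first command fails (a timeout or a “Packet needs to be fragmented but DF set” message) while the second one gets replies, fragmentation/MTU somewhere in the path is your culprit.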

Pinging MTU in Linux

To ping using a packet size in Linux, use:

ping [-M do] [-s <packet size>] [host]

-s specifies the packet size, and -M sets the Path MTU Discovery strategy; -M do is the Linux equivalent of the Windows -f flag – don’t break the packet up into pieces to fit. From the man page:

-M <hint>: Select Path MTU Discovery strategy.

<hint> may be either “do” (prohibit fragmentation, even local one), “want” (do PMTU discovery, fragment locally when packet size is large), or “dont” (do not set DF flag).

Keep in mind that the packet size you specify won’t be *exactly* 9000; there’s some overhead involved. In the case of Linux, we’re dealing with 28 bytes (a 20-byte IP header plus an 8-byte ICMP header). So a packet size of 9000 will actually come across as 9028 bytes on the wire, and ping will complain about the packet being too long:

# ping -M do -s 9000 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 9000(9028) bytes of data.
ping: local error: Message too long, mtu=9000

Instead, ping jumbo frames using 9000 – 28 = 8972:

# ping -M do -s 8972 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 8972(9000) bytes of data.
^C
--- 10.193.67.218 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1454ms

In this case, I lost 100% of my packets. Now, let’s ping using 1500 – 28 = 1472:

# ping -M do -s 1472 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 1472(1500) bytes of data.
1480 bytes from 10.193.67.218: icmp_seq=1 ttl=249 time=0.778 ms
^C
--- 10.193.67.218 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 590ms
rtt min/avg/max/mdev = 0.778/0.778/0.778/0.000 ms

All good! Just to make sure, I pinged a known working client that has jumbo frames enabled end to end:

# ping -M do -s 8972 10.63.150.168
PING 10.63.150.168 (10.63.150.168) 8972(9000) bytes of data.
8980 bytes from 10.63.150.168: icmp_seq=1 ttl=64 time=1.12 ms
8980 bytes from 10.63.150.168: icmp_seq=2 ttl=64 time=0.158 ms
^C
--- 10.63.150.168 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1182ms
rtt min/avg/max/mdev = 0.158/0.639/1.121/0.482 ms

Looks like I have data pointing to jumbo frame configuration as my issue. And if you’ve ever dealt with a networking team, you’d better bring data. 🙂

 

Resolving the issue

The network team confirmed that the switch uplink was indeed not set to support jumbo frames. The change was going to take a bit of time, so rather than wait, I switched my ports back to a 1500 MTU in the interim and everything was happy again. Once jumbo frames get enabled on the cluster’s network segment, I can re-enable them on the cluster.
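
In ONTAP 9, the easiest place to make that change is the broadcast domain, since modifying it updates all of the member ports at once. A sketch (the broadcast domain name here is a placeholder for whatever your 10Gb ports live in):

ontap9-tme-8040::> network port broadcast-domain modify -ipspace Default -broadcast-domain 10GbE-BD -mtu 1500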

Where else can this issue crop up?

MTU mismatch is a colossal PITA. It’s hard to remember to look for it and hard to diagnose, especially if you don’t have access to all of the infrastructure.

In ONTAP, specifically, I’ve seen MTU mismatch break:

  • CIFS setup/performance
  • NFS operations
  • SnapMirror replication

Pretty much anything you do over a network can be affected, so if you run into a problem all the way up at the application layer, remember the OSI model and start with the following:

  • Check layers 1-3
  • Ask yourself “what changed?”
  • Compare against working configurations, if possible

Behind the Scenes: Episode 80 – NetApp MyAutosupport Dashboard Refresh

Welcome to Episode 80, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we brought in Sudip Hore (@sudiphore) to talk about the new and improved MyAutosupport Dashboard. Join us as we discuss the new predictive and proactive features and what’s coming in the future.

Also, check out the blog I wrote on it here:
https://whyistheinternetbroken.wordpress.com/2017/03/21/new-myautosupport-dashboard/.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 79 – Databases and Cloud with Jeff Steiner

Welcome to Episode 79, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we invite the database guru himself, Jeff Steiner (@TweetofSteiner), to talk about databases and the cloud as they pertain to ONTAP and SolidFire. We go over a wide array of topics as Jeff tells us exactly how he feels about marketing and NVMe, the current shiny object/topic du jour in the storage industry. Jeff pulls no punches.

Jeff Steiner also has a blog at https://words.ofsteiner.com.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 76 – Customer Chat with Yahoo’s Jeff Mohler

Welcome to Episode 76, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we bring in a NetApp customer for a candid chat about how they use NetApp’s portfolio in their environment and what sort of challenges they face in day to day operations. Join us as we talk with Jeff Mohler (https://www.linkedin.com/in/jemohler/), a principal Global Storage Architect at Yahoo and get a feel for how an enterprise customer manages thousands of NetApp systems.

If you’re a NetApp customer and you’re interested in appearing on the podcast to chat about how you’re using NetApp, be sure to shoot us an email at podcast@netapp.com!

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 75 – NetApp 101

Welcome to Episode 75, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week, we brought in a couple of NetApp n00bz from SolidFire – Amy Lewis (@CommsNinja) and Mike Turner (@1andOnlyMikeT) to talk about NetApp basics, from our portfolio offerings to our culture. Mike plays the role of interviewer, while Glenn, Andrew and Justin play the role of podcast guests.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

SMB1 Vulnerabilities: How do they affect NetApp’s Data ONTAP?

Google “SMBv1 vulnerability” and you’ll get a ton of hits. There’s a reason for that.

SMB1 is the devil!


But seriously, there are some major security holes in the protocol.

For a good rundown, check out the new NetApp CIFS/SMB TME Chris Hurley’s blog:

http://averageguyx.blogspot.com/2017/03/smb1-is-baaaaaad.html

This is in addition to the functional limitations of SMB1, such as its lack of resiliency to network loss, lack of durable handles, and overall poor performance and chattiness. There are many good reasons why Microsoft has decided to deprecate SMB1 in favor of newer protocol versions. The SMB owner at Microsoft, Ned Pyle (@NerdPyle), gives a plethora of impassioned reasoning in his TechNet blog “Stop using SMB1!”

So, there we are. SMB1 is bad, mmkay?

How does SMB1’s devil status affect NetApp’s ONTAP operating systems?

This question comes up a bit here at NetApp, since security scanners will throw bells, whistles and alarms whenever SMB1 is detected in an environment. What usually follows are questions like:

  • Does SMB1 in ONTAP have the same vulnerabilities?
  • Can I disable SMB1 in ONTAP?
  • If I can’t disable it, can I block it?

The good news is that the main security vulnerabilities that plague SMB1 in Windows (such as the 0-day exploits) generally don’t affect ONTAP, because ONTAP isn’t a Windows client. It uses a proprietary, custom-built CIFS/SMB stack (akin to Samba). Thus, the vulnerabilities that impact Windows don’t impact ONTAP.

Note: I can’t take all the credit for the information in this blog. That credit goes to John Lantz (CIFS TME at NetApp), as well as various CIFS/SMB engineering resources here.

Can I disable SMB1 in ONTAP?

While the vulnerabilities don’t necessarily affect ONTAP, security scanners still trigger alarms and managers still want the red X’s to go away.


As a result, people want to just turn it off in ONTAP, especially since they aren’t currently using it in their environments (hopefully).

The good news is that ONTAP is in the process of deprecating SMB1. The bad news? It’s still there and there’s no current way to disable it. NetApp is currently working on adding a way to do it. The closest thing we have is the ability to control what SMB version is used with domain controllers for authentication. In systems running ONTAP 7-mode, use the following option to enable SMB2.

cifs.smb2.client.enable
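
For example, on a 7-mode system, setting that option is just the usual options syntax (a sketch; check the release documentation for the default on your version):

filer> options cifs.smb2.client.enable on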

In systems running clustered ONTAP, starting in ONTAP 9.1, you can disable SMB1 connections to the DC, as well as enable SMB2:

[-smb1-enabled-for-dc-connections {false|true|system-default}] - SMB1 Enabled for DC Connections
 This parameter specifies whether SMB1 is enabled for use with connections to domain controllers. If you do not specify this parameter, the default is system-default.

SMB1 Enabled For DC Connections can be one of the following:
o false - SMB1 is not enabled.
o true - SMB1 is enabled.
o system-default - This sets the option to whatever is the default for the release of Data ONTAP that is running. For this release it is: SMB1 is enabled.

[-smb2-enabled-for-dc-connections {false|true|system-default}] - SMB2 Enabled for DC Connections
 This parameter specifies whether SMB2 is enabled for use with connections to domain controllers. If you do not specify this parameter, the default is system-default.

SMB2 Enabled For DC Connections can be one of the following:
o false - SMB2 is not enabled.
o true - SMB2 is enabled.
o system-default - This sets the option to whatever is the default for the release of Data ONTAP that is running. For this release it is: SMB2 is not enabled.

Use the following command to do that:

cifs security modify -vserver DEMO -smb1-enabled-for-dc-connections false -smb2-enabled-for-dc-connections true
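
To confirm what the SVM ends up with afterward, a show command with those two fields (the SVM name matches the example above) should do it:

cifs security show -vserver DEMO -fields smb1-enabled-for-dc-connections,smb2-enabled-for-dc-connections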

Since I can’t disable it in ONTAP, can I block it?

Technically, you *could* block the SMB1 ports. However, if you block ports that SMB2 also needs (such as 445), you’d be in trouble.

The official recommendation from Microsoft is a combination of disabling SMB1 on clients (you could handle this via Group Policy), as well as blocking ports on *external-facing* interfaces. In other words, don’t allow SMB outside of the firewall.

Here’s the official link:

https://technet.microsoft.com/en-us/library/cc766392%28v=ws.10%29.aspx?f=255&MSPPError=-2147217396

To disable SMB1 on the client:

https://support.microsoft.com/en-us/kb/2696547
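
On Windows 8/Server 2012 and later, the gist of that KB boils down to a few commands. A sketch based on my read of the KB (run from an elevated prompt; the PowerShell cmdlet handles the SMB1 server side, while the sc.exe lines disable the SMB1 client side):

# disable the SMB1 server component
Set-SmbServerConfiguration -EnableSMB1Protocol $false

# disable the SMB1 client driver (mrxsmb10) and drop it from the workstation service dependencies
sc.exe config lanmanworkstation depend= bowser/mrxsmb20/nsi
sc.exe config mrxsmb10 start= disabled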

Inside your firewall, you shouldn’t need the following ports, so block away (there’s a sample sketch after this list):

  • UDP/137 (NetBIOS name service)
  • UDP/138 (NetBIOS datagram service)
  • TCP/139 (NetBIOS session service)
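
If the firewall (or an individual host) in question happens to be Linux-based, a minimal iptables sketch for dropping that legacy NetBIOS traffic might look like this (the interface name is a placeholder):

# drop NetBIOS name/datagram/session traffic arriving on eth0
iptables -A INPUT -i eth0 -p udp -m multiport --dports 137,138 -j DROP
iptables -A INPUT -i eth0 -p tcp --dport 139 -j DROP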

In some cases, you won’t be able to rid yourself entirely of SMB1. Remember that $30k printer/copier/scanner you bought 10 years ago that was cool because you could scan directly to an SMB share? Yeah… that’s probably still using SMB1. Check with your scanner/copier vendor to see if they have any software updates. Otherwise, you may need to disable SMB1 on the copier/scanner, or budget for a new one.


For the official NetApp statement on SMB1, check out this TR, starting on page 4:

http://www.netapp.com/us/media/tr-4543.pdf

ONTAP CLI comparison tool


Ever wonder where a command you used all the time went? Or what the new commands in an ONTAP release are? Didn’t want to read every document on the planet to find out?

Well, good news!

NetApp has released a new tool on the support site that compares the ONTAP CLI between releases! And you don’t even need a valid NetApp login to see it.

http://mysupport.netapp.com/NOW/products/support/cli-comparison.shtml

This tool compares commands between major releases and color-codes them to show which have been added, changed or removed.

[Screenshots: release selection menu for the CLI comparison tool]

Once you click on one of the releases, you get a page that has a color-coded legend and a series of drop-down boxes that allow you to navigate different levels of the CLI directory structure. Green means “added.” Yellow is “changed.” Red is “removed.”

In addition, the drop-down menus allow for quick navigation of the CLI directories. For instance, you can click “vserver” and get all of the sub-commands.

[Screenshot: drop-down navigation of the CLI directories]

Once you select one, it takes you to the area of the table that you selected.

[Screenshot: the comparison table for the selected command directory]

That’s it! Pretty simple. If you’re interested in some ONTAP CLI tricks and tips, check out TECH::Become a clustered Data ONTAP CLI Ninja.