ONTAP 9.2RC1 is available!

Like clockwork, the 6 month cadence is upon us again.


ONTAP 9.2RC1 is available for download here:

http://mysupport.netapp.com/NOW/download/software/ontap/9.2RC1/

If you’re interested in a podcast where we cover the ONTAP 9.2 features, check it out here:

Also out: OnCommand (truly) Unified Manager 7.2:

http://mysupport.netapp.com/documentation/productlibrary/index.html?productID=61373

For now, let’s dive in a bit, shall we?

First of all, I made sure to upgrade my own cluster to show off some of the new stuff. It went off without a hitch:

[Screenshot: cluster successfully upgraded to ONTAP 9.2RC1]

Now, let’s start with one of the most eagerly awaited new features…

Aggregate Inline Deduplication

If you’re not familiar with deduplication, it’s a storage feature that stores identical blocks only once and uses pointers to that single copy instead of keeping multiple copies of the same blocks. For example, if I am storing multiple JPEG images on a share (or even inside the same PowerPoint file), deduplication will allow me to save storage space by storing just one copy of the data. The image below is an 8.4MB photo I took in Point Reyes, California:

[Image: file properties showing the 8.4MB photo]

If I store two copies of the file on a share (no deduplication), that means I use up 16MB.

[Image: two copies stored without deduplication]

If I use deduplication, the duplicate blocks aren’t written again; they’re replaced with pointers back to the single stored copy of each 4KB block, so the second file consumes almost no additional space.

[Image: the same two copies with deduplication, sharing one set of blocks]

If I have multiple copies of the same image, they all point back to the same blocks:

[Image: multiple copies of the image all pointing back to the same blocks]

Pretty cool, eh?

Well, there was *one* problem with how ONTAP did deduplication: it operated only within a single FlexVol volume. That meant if you had the same file in multiple volumes, you didn’t get the benefits of deduplication across those volumes.

[Image: duplicate blocks stored separately in each FlexVol volume]

In ONTAP 9.2, that issue is resolved. You can now take advantage of deduplication across volumes that reside in the same physical aggregate.

[Image: blocks deduplicated across FlexVol volumes in the same aggregate]

This is currently done inline only (as data is ingested), and only on All Flash FAS systems. The space savings come in handy in workloads such as ESXi datastores, where you may be applying the same OS patches across multiple VMs in multiple datastores hosted in multiple FlexVol volumes.
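If you want to check or toggle this behavior on a given volume, the volume efficiency CLI is the place to look. This is just a rough sketch – the SVM and volume names are placeholders, and the exact cross-volume option name may vary by release, so check the man pages on your cluster first:

ontap9-tme-8040::> volume efficiency show -vserver SVM1 -volume datavol1 -fields inline-dedupe,cross-volume-inline-dedupe
ontap9-tme-8040::> volume efficiency modify -vserver SVM1 -volume datavol1 -cross-volume-inline-dedupe true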

At a high level, this animation shows how it works:

[Animation: aggregate inline deduplication at a high level]

Another place where aggregate inline deduplication would rock? NetApp FlexGroup volumes, where a single container is composed of multiple member FlexVol volumes on the same physical storage. Speaking of FlexGroup volumes, that leads us to the next feature added in ONTAP 9.2.

Other storage efficiency improvements

In addition to aggregate inline dedupe, ONTAP 9.2 also adds:

  • Advanced Drive Partitioning v2 (ADPv2) support for FAS8xxx and FAS9xxx with spinning drives; previously ADPv2 was only supported on All Flash FAS
  • Increase of the maximum aggregate size to 800TB (was previously 400TB)
  • Automated aggregate provisioning in System Manager for easier aggregate creation

NetApp Volume Encryption on FlexGroup volumes

ONTAP 9.1 introduced NetApp Volume Encryption (NVE), and we did a podcast on it if you’re interested in learning more. In ONTAP 9.2, NVE support was added to NetApp FlexGroup volumes, so you can now apply encryption at the volume level only (as opposed to at the disk level via NSE drives) for your large, unstructured NAS workloads.

To apply it, all you need is a volume encryption license. Then, use the same process you would use for a FlexVol volume.
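As a rough sketch of what that might look like (the license code, volume name, aggregate list and size below are placeholders, and this assumes your key manager is already configured):

ontap9-tme-8040::> system license add -license-code <NVE license key>
ontap9-tme-8040::> volume create -vserver SVM1 -volume fg_encrypted -aggr-list aggr1,aggr2 -aggr-list-multiplier 4 -size 20TB -encrypt true
ontap9-tme-8040::> volume show -vserver SVM1 -volume fg_encrypted -fields encrypt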

Additionally, NVE can now be used on SnapLock compliance volumes!

Quality of Service (QoS) Minimums/Guaranteed QoS

In ONTAP 8.2, NetApp introduced Quality of Service, which allows storage administrators to apply policies to volumes – and even to files such as LUNs or VMs – to prevent bully workloads from affecting other workloads in a cluster.

Last year, NetApp acquired SolidFire, which has a pretty mean QoS implementation of its own – one that approaches QoS from the other end of the spectrum by guaranteeing a performance floor for workloads that require a specific service level.


I’m not 100% sure, but I’m guessing NetApp saw that and said “that’s pretty sweet. Let’s do that.”

So, they have. ONTAP 9.2 now offers both maximum and minimum/guaranteed QoS for storage administrators and service providers. Check out a video on it here:
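At the CLI, a guaranteed floor might look something like this hypothetical sketch – the policy group, SVM and volume names are placeholders, and you should check the qos policy-group man page for the exact throughput syntax in your release:

ontap9-tme-8040::> qos policy-group create -policy-group gold -vserver SVM1 -min-throughput 5000iops -max-throughput 20000iops
ontap9-tme-8040::> volume modify -vserver SVM1 -volume datavol1 -qos-policy-group gold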

ONTAP Select enhancements

ONTAP 9.2 also includes some ONTAP Select enhancements, such as:

  • 2-node HA support
  • FlexGroup volume support
  • Improved performance
  • Easier deployment
  • ESX ROBO license
  • Single-node ONTAP Select vNAS with vSAN and iSCSI LUN support
  • Inline deduplication support

Usability enhancements

ONTAP is also continuing its mission to make deployment and configuration via the System Manager GUI easier and easier. In ONTAP 9.2, we bring:

  • Enhanced upgrade support
  • Application aware data management
  • Simplified cluster expansion
  • Simplified aggregate deployment
  • Guided cluster setup

FabricPools

We covered FabricPools in Episode 63 of the Tech ONTAP podcast. Essentially, FabricPools tier cold blocks from flash to a cloud or on-premises S3 target such as StorageGRID WebScale. It’s not a replacement for backup or disaster recovery; it’s a way to lower your total cost of ownership for storage by moving data that is not actively in use, freeing up space for other workloads. This is all done automatically via policy, and it behaves more like an extension of the aggregate, as the pointers to the tiered blocks remain on the local storage.

[Diagram: FabricPool tiering cold blocks from the local aggregate to an object store]

ONTAP 9.2 introduces version 1 of this feature, which will support the following:

  • Tiering to S3 (StorageGRID) or AWS
  • Snapshot-only tiering on primary storage
  • SnapMirror destination tiering on secondary storage
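To give a feel for the workflow, here’s a rough, illustrative sketch of attaching an aggregate to an object store and tiering snapshots – the object store name, endpoint, bucket and keys are all placeholders, so treat this as an outline rather than exact syntax:

ontap9-tme-8040::> storage aggregate object-store config create -object-store-name my-sgws -provider-type SGWS -server sgws.example.com -container-name fabricpool-bucket -access-key <key> -secret-password <secret>
ontap9-tme-8040::> storage aggregate object-store attach -aggregate aggr1_ssd -object-store-name my-sgws
ontap9-tme-8040::> volume modify -vserver SVM1 -volume datavol1 -tiering-policy snapshot-only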

Future releases will add more functionality, so stay tuned for that! We’ll also be featuring FabricPools in a deep dive for a future podcast episode.

So there you have it! The latest release of ONTAP! Post your thoughts or questions in the comments below!

Behind the Scenes: Episode 84 – StorageGRID WebScale 10.4

Welcome to Episode 84, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


In case you didn’t notice, we’ve been on a bit of a hiatus for the past two weeks. First, I had my appendix removed, and then I went on vacation. Then, we recorded Episode 84 and had it ready to go for Friday, but we were informed we had to wait until Monday because that’s when the release shipped. But we’re taking care of you this week – we have THREE episodes lined up. Stay tuned for episodes to drop on Thursday and Friday of the week of May 8, 2017!

We’re kicking off the week with StorageGRID 10.4, featuring StorageGRID Director Duncan Moore (@NCDunc). Not only do we chat about the 10.4 release (which is available Monday, May 8, 2017), but we also talk a bit about 10.3 (which we forgot to cover) and the overall state of the object storage market today.

What is StorageGRID WebScale?

The StorageGRID WebScale system is a distributed object storage system that uses a grid architecture to distribute copies of object data throughout the system. The result is a system where data is protected from loss and continuously available.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

New doc – Top Best Practices for FlexGroup volumes

When I wrote TR-4571: NetApp FlexGroup Volume Best Practices and Implementation Guide, I wanted to keep the document under 50 pages to be more manageable and digestible. 95 pages later, I realized there was so much good information to pass on to people about NetApp FlexGroup volumes that I wasn’t going to be able to condense a TR down effectively.


The TRs have been out for a while now, and I’m seeing that 95 pages might be a bit much for some people. Everyone’s busy! I was getting asked the same general questions about deploying FlexGroup volumes over and over and decided I needed to create a new, shorter best practices document that focused *only* on the most important, most frequently asked general best practices. It’s part TR, part FAQ. It’s more of an addendum, a sort of companion reader to TR-4571. And the best part?

It’s only FOUR PAGES LONG.

Check it out here:

http://www.netapp.com/us/media/tr-4571-a.pdf

ONTAP 9.1P3 is available!


ONTAP 9.1P3 is available for download today.

You can get it here:

mysupport.netapp.com/NOW/download/software/ontap/9.1P3

P-releases are patch releases and include bug fixes. Fixed in this release (since P2):

  • 1015156 – Storage controller disruption might occur when closing a CIFS session or a CIFS tree with a large number of open files
  • 1028842 – Storage system might experience a disruption under heavy load conditions
  • 1032738 – NDMP might fail to detect a socket close condition, causing backup operations to enter a hung state
  • 1036072 – NA51 firmware release for 960 GB and 3.8 TB SSD series
  • 1050210 – New drive firmware to mitigate higher failure rate on a drive model
  • 1055450 – SnapMirror source system disrupts while pre-processing data to a format that can be understood by the destination system
  • 1058975 – SAN services disrupt in a two-node cluster when a node reboots while its partner is in a sustained taken-over state
  • 1063710 – Controller disruption occurs while processing read requests in certain scenarios
  • 1064449 – ‘total_ops’ and ‘other_ops’ counters for the system object are incorrect
  • 1064560 – Large file deletion or LUN unmap operations can stall on FlexClone volumes
  • 1065209 – Replay cache resources held up in reserves lead to EJUKEBOX errors for NFS clients
  • 1068280 – X1133A-R6 initiator port fails to discover FCP target devices
  • 1069032 – Unnecessary aggregate metadata reads might lead to long consistency points or a WAFL hang
  • 1069555 – Unnecessary aggregate metadata reads might lead to long consistency points or file system disruption
  • 1070116 – Driver code repeatedly generates the ‘Error: Failed to open NVRAM device 0’ error message in messages.log
  • 867605 – NDMP DAR restore using non-ASCII (UTF-8) characters doesn’t seem to work
  • 981677 – Ethernet interface on the UTA2 X1143-R6 adapter and onboard UTA2 ports might become unresponsive

 

What the heck happened to last week’s Tech ONTAP podcast?

Last week, you may or may not have noticed we didn’t do a Tech ONTAP podcast. Generally, we give a little heads up when this happens. However, the saga of the missing podcast episode started around 10PM on Sunday night for me…

I started feeling like I was having mild pain on the right side of my abdomen. It felt like a gas pain, but it was sore to the touch. Because I am a product of the information age, I googled “appendix location” to find out if it was located where I was having issues, and it was. However, it wasn’t severe, so I decided to try to sleep it off.

Around 2:30AM, I woke up and decided it was probably appendicitis. I told my wife I was going to the ER. She offered to take me, but we had a sleeping 3.5 year old, so I drove myself. After a few tests, I got confirmation and I texted my colleagues that the podcast would have to be recorded without me. However, it got postponed instead.


I had my surgery and am recovering nicely, but the podcast had to be moved to another week, as I was *also* going on vacation that week.


Bummer, but there’s the reason. 🙂

This week, don’t expect a podcast on Friday either – instead, we are publishing on Monday, May 8 to correspond with a special announcement for StorageGRID WebScale, so stay tuned… Also, expect the postponed podcast to appear soon as well – we got a very special guest to join us!

Introducing: NetApp Newsroom

In case you hadn’t noticed, the NetApp community blog got a much-needed facelift and modernization this past week and has been re-branded as the “NetApp Newsroom.”


In addition to a much fresher look, the blog also resizes nicely for mobile devices. Some other enhancements include:

  • Disqus commenting integration
  • Author profile social media account integration
  • Responsive design

I’ll be contributing from time to time (such as with Tech Meme Friday), as will many of the NetApp A-Team and NetApp United members.

 

Be on the lookout for a new FlexGroup blog on there soon!

Behind the Scenes: Episode 83 – OnCommand Unified Manager 7.2

Welcome to Episode 83, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we chat with Philip Bachman and Yossi Weihs about the new OnCommand Unified Manager update. Find out what’s new and why Unified Manager can truly claim to be a Unified Manager!

What is OnCommand Unified Manager?

OnCommand Unified Manager is NetApp’s monitoring software package. It is the next evolution of Data Fabric Manager (DFM) and allows storage administrators to monitor capacity, performance and other storage events. It also allows you to set up notifications and run scripts when thresholds have been reached. With OCUM 7.2, it can be deployed on Linux or Windows, as well as via the standard OVA on ESXi.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 82 – DockerCon Preview Featuring nDVP and Trident

Welcome to Episode 82, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, it’s just Justin and Andrew chatting about the upcoming Docker conference in Austin, TX, as well as updates to the NetApp Docker Volume Plugin and Trident. We get a bit rambunctious on this one, and in honor of Easter, we leave plenty of Easter eggs!

If you’ll be in Austin for DockerCon or Boston for Red Hat Summit or OpenStack Summit, be sure to look for Andrew and the rest of the NetApp team!

For information on NetApp Docker Volume Plugin, Trident and more, go to the Pub:


http://netapp.io

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 81 – NetApp Service Level Manager

Welcome to Episode 81, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we invited the NetApp Service Level Manager team to talk to us about how they’re revolutionizing storage as a service and automating day-to-day storage administration tasks. Join Executive Architect Evan Miller (@evancmiller), Product Manager Nagananda Anur (naga@netapp.com) and Technical Director Ameet Deulgaonkar (Ameet.Deulgaonkar@netapp.com, @Amyth18) as they detail the benefits of NetApp Service Level Manager!

For more details on NetApp Service Level Manager, we encourage you to check out the additional blogs below:

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Case study: Using OSI methodology to troubleshoot NAS

Recently, I installed some 10Gb cards into an AFF8040 so I could run some FlexGroup performance tests (stay tuned for that). I was able to install the cards myself, but to get them connected to a network here at NetApp’s internal labs, you have to file a ticket. This should sound familiar to many people, as this is how real-world IT works.

So I filed the ticket and, eventually, the cards were connected. However, just like in real-world IT, the network team has no idea what the storage team (me) has configured, and the storage team (me) has no idea how the network team has things configured. So we had to troubleshoot a bit to get the cards to ping correctly. Turns out, they had a VLAN tag on the ports that wasn’t needed. We removed that, fixed the port channel, and cool! We now had two 10Gb LACP interfaces on a 2-node cluster!

Not so fast…

Turns out, ping is a great test for basic connectivity. But it’s awful for checking if stuff *actually works*. In this case, I could ping via the 10Gb interfaces and even mount via NFSv3, list directories, etc. But those are lightweight metadata operations.

Whenever I tried a heavier operation like a READ, WRITE or READDIRPLUS (incidentally, tab completion for a path when typing a command on an NFS mount? That’s a READDIRPLUS call), the client would hang indefinitely. When I would CTRL + C out of the command, the process would sometimes also hang. And subsequent operations, including GETATTR, LOOKUP and so on, would also hang.

So, now I had a robust network that couldn’t even perform tab completions.

Narrowing down the issue

I like to start with a packet trace, as that gives me a hint about where to focus my efforts. For this issue, I started a packet capture on both the client (10.63.150.161) and the cluster (10.193.67.218). In the client trace, I saw some duplicate ACKs, as well as packets being sent but never replied to:

[Packet trace (client): duplicate ACKs and READDIRPLUS calls with no reply]

In the corresponding filer trace, I saw the READDIRPLUS call come in and get replied to, and then re-transmitted a bunch of times. But, as the trace above shows, the client never receives it.

[Packet trace (filer): READDIRPLUS reply sent and retransmitted repeatedly]

That means the filer is doing what it’s supposed to. The client is doing what it’s supposed to. But the network is blocking or dropping the packet for some reason.
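As an aside, if you want to gather the same kind of evidence yourself, a capture on both ends might look roughly like this – the interface and node names are placeholders, and pktt (the node-shell capture tool) writes trace files you can open in Wireshark:

# On the Linux client:
tcpdump -i eth0 host 10.193.67.218 -w client_side.pcap

# On the cluster, via the node shell:
ontap9-tme-8040::> node run -node ontap9-tme-8040-01 pktt start e2a -d /etc/crash
ontap9-tme-8040::> node run -node ontap9-tme-8040-01 pktt stop e2a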

When troubleshooting any issue, you have to start with a few basic steps (even though I like to start with the more complicated packet capture).

For instance…

What changed?

Well, this one was easy – I had added an entire new network into the mix, end to end. My previous ports were 1Gb and worked fine. This was 10Gb infrastructure, with LACP and jumbo frames. And I had no control over that network. Thus, I was left with client and server troubleshooting for now. I didn’t want to file another ticket before I had done my due diligence, in case I had done something stupid (totally within the realm of possibility, naturally).

So where did I go from there?

Start at layers 1, 2 and 3

The OSI model is something I used to dismiss as the kind of question interviewers asked just to stump people. However, over the course of the last 10 years, I’ve come to realize it’s genuinely useful. What I was troubleshooting was NFS, which is all the way up at layer 7 (the application layer).


So why start at layers 1-3? Why not start where my problem is?

Because with years of experience, you learn that the issue is rarely at the layer you’re seeing the issue manifest. It’s almost always farther down the stack. Where do you think the “Is it plugged in?” joke comes from?


Layer 1 means, essentially, is it plugged in? In this case, yes, it was. But it also means “are we seeing errors on the interfaces that are plugged in?” In ONTAP, you can see that with this command:

ontap9-tme-8040::*> node run * ifstat e2a
Node: ontap9-tme-8040-01

-- interface e2a (8 days, 23 hours, 14 minutes, 30 seconds) --

RECEIVE
 Frames/second: 1 | Bytes/second: 30 | Errors/minute: 0
 Discards/minute: 0 | Total frames: 84295 | Total bytes: 7114k
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 0
 No buffers: 0 | Non-primary u/c: 0 | L2 terminate: 9709
 Tag drop: 0 | Vlan tag drop: 0 | Vlan untag drop: 0
 Vlan forwards: 0 | CRC errors: 0 | Runt frames: 0
 Fragment: 0 | Long frames: 0 | Jabber: 0
 Error symbol: 0 | Illegal symbol: 0 | Bus overruns: 0
 Queue drop: 0 | Xon: 0 | Xoff: 0
 Jumbo: 0 | JMBuf RxFrames: 0 | JMBuf DrvCopy: 0
TRANSMIT
 Frames/second: 82676 | Bytes/second: 33299k | Errors/minute: 0
 Discards/minute: 0 | Total frames: 270m | Total bytes: 1080g
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 4496
 Queue overflows: 0 | No buffers: 0 | Xon: 0
 Xoff: 0 | Jumbo: 13 | TSO non-TCP drop: 0
 Split hdr drop: 0 | Pktlen: 0 | Timeout: 0
 Timeout1: 0 | Stray Cluster Pk: 0
DEVICE
 Rx MBuf Sz: Large (3k)
LINK_INFO
 Current state: up | Up to downs: 22 | Speed: 10000m
 Duplex: full | Flowcontrol: none

In this case, the interface is pretty clean. No errors, no “no buffers,” no CRC errors, etc. I can also see that the ports are “up.” The “up to downs” count is high, but that’s because I’ve been adding/removing this port from the ifgrp multiple times, which leads me to the next step…

Layer 2/3

Layer 2 includes the LACP/port channel, as well as the MTU settings. Layer 3 covers IP – pings, routing and any layer 3 switching in the path.

Since the port channel was a new change, I made sure that the networking team verified that the port channel was configured properly, with the correct ports added to the channel. I also made sure that the MTU was 9216 on the switch ports, as well as the ports on the client and storage. Those all checked out.
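For reference, checking those from the ONTAP and client side is quick – the node, port and interface names below are placeholders for whatever you’re running:

ontap9-tme-8040::> network port show -fields mtu
ontap9-tme-8040::> network port ifgrp show -node ontap9-tme-8040-01

# On the Linux client:
ip link show eth0 | grep mtu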

However, that doesn’t mean we’re done with layer 2; remember, basic pings worked fine, but default pings use small packets that fit easily within a 1500-byte MTU. That means we’re not actually testing jumbo frames here. The issue was that the reply to any non-metadata NFS operation never made it back to the client, and that suggests a network problem somewhere.

I didn’t mention this before, but this cluster also has a properly working 1Gb network at 1500 MTU on the same subnet, so that told me routing was likely not an issue. And because the client was able to send its data just fine and had its 10Gb connectivity established for a while, the issue likely wasn’t on the network segment the client was connected to. The problem resided somewhere between the filer 10Gb ports and the new switch those ports were connected to. (Remember… what changed?)

Jumbo frames

From my experience with troubleshooting and general IT knowledge, I knew that for jumbo frames to work properly, they have to be configured end to end, at every hop in the network path. I knew the client was configured for jumbo frames properly because it was a known entity that had been chugging along just fine. I also knew that the filer had jumbo frames enabled because I had control over those ports.

What I wasn’t sure of was if the switch had jumbo frames configured for the entire stack. I knew the switch ports were fine, but what about the switch uplinks?

Luckily, ping can tell us. Did you know you could ping using MTU sizes?

Pinging MTU in Windows

To ping using a packet size in Windows, use:

ping -f -l [size] [address]

-f means “don’t fragment the packet.” In other words, if I am sending a jumbo frame, don’t break it up into pieces to make it fit. If you ping using -f and a large packet size, that packet needs to fit within the network’s MTU. If it can’t, you’ll see this:

C:\>ping -f -l 9000 10.193.67.218

Pinging 10.193.67.218 with 9000 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Ping statistics for 10.193.67.218:
 Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)

Then, try pinging with only -l (which specifies the packet size but allows fragmentation). If that works while the -f version fails, you have a good idea that your issue is MTU size. Note: My Windows client didn’t have jumbo frames enabled, so I didn’t bother using it to troubleshoot.

Pinging MTU in Linux

To ping using a packet size in Linux, use:

ping [-M do] [-s <packet size>] [host]

On Linux, the equivalent of the Windows -f flag is the -M option, which controls whether the packet can be fragmented. If you ping with -M do and a large packet size, that packet needs to fit within the network’s MTU.

-M <hint>: Select Path MTU Discovery strategy.

<hint> may be either “do” (prohibit fragmentation, even local one), “want” (do PMTU discovery, fragment locally when packet size is large), or “dont” (do not set DF flag).

Keep in mind that the size you specify won’t be *exactly* 9000; there’s some overhead involved. In the case of Linux, we’re dealing with about 28 bytes (20 bytes of IP header plus 8 bytes of ICMP header). So a packet size of 9000 will actually go out as 9028 bytes, and ping will complain about the packet being too long:

# ping -M do -s 9000 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 9000(9028) bytes of data.
ping: local error: Message too long, mtu=9000

Instead, ping jumbo frames using 9000 – 28 = 8972:

# ping -M do -s 8972 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 8972(9000) bytes of data.
^C
--- 10.193.67.218 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1454ms

In this case, I lost 100% of my packets. Now, let’s ping using 1500 – 28 = 1472:

# ping -M do -s 1472 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 1472(1500) bytes of data.
1480 bytes from 10.193.67.218: icmp_seq=1 ttl=249 time=0.778 ms
^C
--- 10.193.67.218 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 590ms
rtt min/avg/max/mdev = 0.778/0.778/0.778/0.000 ms

All good! Just to make sure, I pinged a known working client that has jumbo frames enabled end to end:

# ping -M do -s 8972 10.63.150.168
PING 10.63.150.168 (10.63.150.168) 8972(9000) bytes of data.
8980 bytes from 10.63.150.168: icmp_seq=1 ttl=64 time=1.12 ms
8980 bytes from 10.63.150.168: icmp_seq=2 ttl=64 time=0.158 ms
^C
--- 10.63.150.168 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1182ms
rtt min/avg/max/mdev = 0.158/0.639/1.121/0.482 ms

Looks like I have data pointing to jumbo frame configuration as my issue. And if you’ve ever dealt with a networking team, you’d better bring data. 🙂

 

Resolving the issue

The network team confirmed that the switch uplink was indeed not set to support jumbo frames. The change was going to take a bit of time, so rather than wait, I switched my ports to 1500 MTU in the interim and everything was happy again. Once jumbo frames are enabled on the cluster’s network segment, I can re-enable them on the cluster.
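If you ever need to do the same, the broadcast domain is usually the cleanest place to change MTU in ONTAP 9.x, since it pushes the setting to all member ports – the IPspace and broadcast domain names below are placeholders:

ontap9-tme-8040::> network port broadcast-domain modify -ipspace Default -broadcast-domain 10GbE-Data -mtu 1500

Then, once the switch uplinks support jumbo frames, set it back to 9000 the same way.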

Where else can this issue crop up?

MTU mismatch is a colossal PITA. It’s hard to remember to look for it and hard to diagnose, especially if you don’t have access to all of the infrastructure.

In ONTAP, specifically, I’ve seen MTU mismatch break:

  • CIFS setup/performance
  • NFS operations
  • SnapMirror replication

Pretty much anything you do over a network can be affected, so if you run into a problem all the way up at the application layer, remember the OSI model and start with the following:

  • Check layers 1-3
  • Ask yourself “what changed?”
  • Compare against working configurations, if possible