Introducing: NetApp Newsroom

In case you hadn’t noticed, the NetApp community blog got a much needed face lift/modernizing this past week and has been re-branded as the “NetApp Newsroom.”

newsroom.png

In addition to a much fresher look, the blog also resizes nicely for mobile devices. Some other enhancements include:

  • Disqus commenting integration
  • Author profile social media account integration
  • Responsive design

I’ll be contributing from time to time (such as with Tech Meme Friday), as will many of the NetApp A-Team and NetApp United members.

 

Be on the look out for a new FlexGroup blog on there soon!

Behind the Scenes: Episode 83 – OnCommand Unified Manager 7.2

Welcome to the Episode 83, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week on the podcast, we chat with Philip Bachman and Yossi Weihs about the new OnCommand Unified Manager update. Find out what’s new and why Unified Manager can truly claim to be a Unified Manager!

What is OnCommand Unified Manager?

OnCommand Unified Manager is NetApp’s monitoring software package. It was the next evolution of Data Fabric Manager (DFM) and allows storage administrators to monitor capacity, performance and other storage events. It also allows you to set up notifications and run scripts when thresholds have been reached. It can be deployed on Linux or Windows in OCUM 7.2, as well as a standard OVA in ESXi.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 82 – DockerCon Preview Featuring nDVP and Trident

Welcome to the Episode 82, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week on the podcast, it’s just Justin and Andrew chatting about the upcoming Docker conference in Austin, TX, as well as updates to the NetApp Docker Volume Plugin and Trident. We get a bit rambuncuous on this one and in honor of Easter, we leave plenty of Easter eggs!

If you’ll be in Austin for DockerCon or Boston for Red Hat Summit or OpenStack Summit, be sure to look for Andrew and the rest of the NetApp team!

For information on NetApp Docker Volume Plugin, Trident and more, go to the Pub:

banner-v3

http://netapp.io

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 81 – NetApp Service Level Manager

Welcome to the Episode 81, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week on the podcast, we invited the NetApp Service Level Manager team to talk to us about how they’re revolutionizing storage as a service and automating day to day storage administration tasks. Join Executive Architect Evan Miller (@evancmiller), Product Manager Nagananda.Anur (naga@netapp.com) and Technical Director Ameet.Deulgaonkar (Ameet.Deulgaonkar@netapp.com, @Amyth18) as they detail the benefits of NetApp Service Level Manager!

For more details on NetApp Service Level Manager, we encourage you to check out the additional blogs below:

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Case study: Using OSI methodology to troubleshoot NAS

Recently, I installed some 10GB cards into an AFF8040 so I could run some FlexGroup performance tests (stay tuned for that). I was able to install the cards myself, but to get them connected to a network here at NetApp’s internal labs, you have to file a ticket. This should sound familiar to many people, as this is how real-world IT works.

So I filed the ticket and eventually, the cards were connected. However, just like in real-world IT, the network team has no idea what the storage team (me) has configured, and the storage team (me) has no idea how the network team has things configured. So we had to troubleshoot a bit to get the cards to ping correctly. Turns out, they had a vlan tag on the ports that weren’t needed. Removed those and fixed the port channel and cool! We now had two 10GB LACP interfaces on a 2 node cluster!

Not so fast…

Turns out, ping is a great test for basic connectivity. But it’s awful for checking if stuff *actually works.* In this case, I could ping via the 10GB interfaces and even mount via NFSv3 and list directories, etc. But those are lightweight metadata operations.

Whenever I tried a heavier operation like a READ, WRITE or READDIRPLUS (incidentally, tab completion for a path when typing a command on an NFS mount? READDIRPLUS call), the client would hang indefinitely. When I would CTL + C out of the command, the process would sometimes also hang. And subsequent operations, including the GETATTR, LOOKUP, etc operations would also hang.

So, now I had a robust network that couldn’t even perform tab completions.

Narrowing down the issue

I like to start with a packet trace, as that gives me a hint where to focus my efforts. In this issue, I started a packet capture on both the client (10.63.150.161) and the cluster (10.193.67.218). In the traces, I saw some duplicate ACKs, as well as packets being sent but not replied to:

readdirplus-noreply.png

In the corresponding filer trace, I saw the READDIRPLUS call come in and get replied to, and then re-transmitted a bunch of times. But, as the trace above shows, the client never receives it.

readdirplus-filer

That means the filer is doing what it’s supposed to. The client is doing what it’s supposed to. But the network is blocking or dropping the packet for some reason.

When troubleshooting any issue, you have to start with a few basic steps (even though I like to start with the more complicated packet capture).

For instance…

What changed?

Well, this one was easy – I had added an entire new network into the mix. End to end. My previous ports were 1GB and worked fine. This was 10GB infrastructure, with LACP and jumbo frames. And I had no control over that network. Thus, I was left with client and server troubleshooting for now. I didn’t want to file another ticket before I had done my due diligence, in case I had done something stupid (totally within the realm of possibility, naturally).

So where did I go from there?

Start at layers 1, 2 and 3

The OSI model is something I used to take for granted as something interviewers asked because it seemed like a good question to stump people on. However, over the course of the last 10 years, I’ve come to realize it’s useful. What I was troubleshooting was NFS, which is all the way at layer 7 (the application layer).

osi-network-layer-cats[1]

So why start at layers 1-3? Why not start where my problem is?

Because with years of experience, you learn that the issue is rarely at the layer you’re seeing the issue manifest. It’s almost always farther down the stack. Where do you think the “Is it plugged in?” joke comes from?

media-cache-ak0-pinimg-com_736x_88_ac_c8_88acc8216648b26114507ca04686b357

Layer 1 means, essentially, is it plugged in? In this case, yes, it was. But it also means “are we seeing errors on the interfaces that are plugged in?” In ONTAP, you can see that with this command:

ontap9-tme-8040::*> node run * ifstat e2a
Node: ontap9-tme-8040-01

-- interface e2a (8 days, 23 hours, 14 minutes, 30 seconds) --

RECEIVE
 Frames/second: 1 | Bytes/second: 30 | Errors/minute: 0
 Discards/minute: 0 | Total frames: 84295 | Total bytes: 7114k
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 0
 No buffers: 0 | Non-primary u/c: 0 | L2 terminate: 9709
 Tag drop: 0 | Vlan tag drop: 0 | Vlan untag drop: 0
 Vlan forwards: 0 | CRC errors: 0 | Runt frames: 0
 Fragment: 0 | Long frames: 0 | Jabber: 0
 Error symbol: 0 | Illegal symbol: 0 | Bus overruns: 0
 Queue drop: 0 | Xon: 0 | Xoff: 0
 Jumbo: 0 | JMBuf RxFrames: 0 | JMBuf DrvCopy: 0
TRANSMIT
 Frames/second: 82676 | Bytes/second: 33299k | Errors/minute: 0
 Discards/minute: 0 | Total frames: 270m | Total bytes: 1080g
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 4496
 Queue overflows: 0 | No buffers: 0 | Xon: 0
 Xoff: 0 | Jumbo: 13 | TSO non-TCP drop: 0
 Split hdr drop: 0 | Pktlen: 0 | Timeout: 0
 Timeout1: 0 | Stray Cluster Pk: 0
DEVICE
 Rx MBuf Sz: Large (3k)
LINK_INFO
 Current state: up | Up to downs: 22 | Speed: 10000m
 Duplex: full | Flowcontrol: none

In this case, the interface is pretty clean. No errors, no “no buffers,” no CRC errors, etc. I can also see that the ports are “up.” The up to downs are high, but that’s because I’ve been adding/removing this port from the ifgrp multiple times, which leads me to the next step…

Layer 2/3

Layer 2 includes the LACP/port channel, as well as the MTU settings.  Layer 3 can also include pings and some switches, as well as routing.

Since the port channel was a new change, I made sure that the networking team verified that the port channel was configured properly, with the correct ports added to the channel. I also made sure that the MTU was 9216 on the switch ports, as well as the ports on the client and storage. Those all checked out.

However, that doesn’t mean we’re done with layer 2; remember, basic pings worked fine, but those operate at 1500 MTU. That means we’re not actually testing jumbo frames here. The issue with the client was that any NFS operation that was not metadata was never making it back to the client; that suggests a network issue somewhere.

I didn’t mention before, but this cluster also has a properly working 1GB network on 1500 MTU on  the same subnet, so that told me routing was likely not an issue. And because the client was able to send the information just fine and had the 10GB established for a while, the issue likely wasn’t on the network segment the client was connected to. The problem resided somewhere between the filer 10GB ports and the new switch the ports were connected to. (Remember… what changed?)

Jumbo frames

From my experience with troubleshooting and general IT knowledge, I knew that for jumbo frames to work properly, they had to be configured up and down the entire stack of the network. I knew the client was configured for jumbo frames properly because it was a known entity that had been chugging along just fine. I also knew that the filer had jumbo frames enabled because I had control over those ports.

What I wasn’t sure of was if the switch had jumbo frames configured for the entire stack. I knew the switch ports were fine, but what about the switch uplinks?

Luckily, ping can tell us. Did you know you could ping using MTU sizes?

Pinging MTU in Windows

To ping using a packet size in Windows, use:

ping -f -l [size] [address]

-f means “don’t fragment the packet.” That means, if I am sending a jumbo frame, don’t break it up into pieces to fit. If you ping using -f and a large MTU, the MTU size needs to be able to squeeze into the network MTU size. If it can’t, you’ll see this:

C:\>ping -f -l 9000 10.193.67.218

Pinging 10.193.67.218 with 9000 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Ping statistics for 10.193.67.218:
 Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)

Then, try pinging with only -l (which specifies the packet size). If that fails, you have a good idea that your issue is MTU size. Note: My windows client didn’t have jumbo frames enabled, so I didn’t bother trying to use it to troubleshoot using it.

Pinging MTU in Linux

To ping using a packet size in Linux, use:

ping [-M do] [-s <packet size>] [host]

-f, again, means “don’t fragment the packet.” That means, if I am sending a jumbo frame, don’t break it up into pieces to fit. If you ping using -f and a large MTU, the MTU size needs to be able to squeeze into the network MTU size.

-M <hint>: Select Path MTU Discovery strategy.

<hint> may be either “do” (prohibit fragmentation, even local one), “want” (do PMTU discovery, fragment locally when packet size is large), or “dont” (do not set DF flag).

Keep in mind that the MTU size you specify won’t be *exactly* 9000; there’s some overhead involved. In the case of Linux, we’re dealing with about 28 bytes. So an MTU of 9000 will actually come across as 9028 and complain about the packet being too long:

# ping -M do -s 9000 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 9000(9028) bytes of data.
ping: local error: Message too long, mtu=9000

Instead, ping jumbo frames using 9000 – 28 = 8972:

# ping -M do -s 8972 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 8972(9000) bytes of data.
^C
--- 10.193.67.218 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1454ms

In this case, I lost 100% of my packets. Now, let’s ping using 1500 – 28 = 1472:

# ping -M do -s 1472 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 1472(1500) bytes of data.
1480 bytes from 10.193.67.218: icmp_seq=1 ttl=249 time=0.778 ms
^C
--- 10.193.67.218 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 590ms
rtt min/avg/max/mdev = 0.778/0.778/0.778/0.000 ms

All good! Just to make sure, I pinged a known working client that has jumbo frames enabled end to end:

# ping -M do -s 8972 10.63.150.168
PING 10.63.150.168 (10.63.150.168) 8972(9000) bytes of data.
8980 bytes from 10.63.150.168: icmp_seq=1 ttl=64 time=1.12 ms
8980 bytes from 10.63.150.168: icmp_seq=2 ttl=64 time=0.158 ms
^C
--- 10.63.150.168 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1182ms
rtt min/avg/max/mdev = 0.158/0.639/1.121/0.482 ms

Looks like I have data pointing to jumbo frame configuration as my issue. And if you’ve ever dealt with a networking team, you’d better bring data. 🙂

 

Resolving the issue

The network team confirmed that the switch uplink was indeed not set to support jumbo frames. The change was going to take a bit of time, so rather than wait until then, I switched my ports to 1500 in the interim and everything was happy again. Once the jumbo frames get enabled on the cluster’s network segment, I can re-enable them on the cluster.

Where else can this issue crop up?

MTU mismatch is a colossal PITA. It’s hard to remember to look for it and hard to diagnose, especially if you don’t have access to all of the infrastructure.

In ONTAP, specifically, I’ve seen MTU mismatch break:

  • CIFS setup/performance
  • NFS operations
  • SnapMirror replication

Pretty much anything you do over a network can be affected, so if you run into a problem all the way up at the application layer, remember the OSI model and start with the following:

  • Check layers 1-3
  • Ask yourself “what changed?”
  • Compare against working configurations, if possible

Behind the Scenes: Episode 80 – NetApp MyAutosupport Dashboard Refresh

Welcome to the Episode 80, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week on the podcast, we brought in Sudip Hore (@sudiphore) to talk about the new and improved MyAutosupport Dashboard.  Join us as we discuss the new predictive and proactive features and what’s coming in the future.

Also, check out the blog I wrote on it here:
https://whyistheinternetbroken.wordpress.com/2017/03/21/new-myautosupport-dashboard/.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 79 – Databases and Cloud with Jeff Steiner

Welcome to the Episode 79, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week on the podcast, we invite the database guru himself, Jeff Steiner (@TweetofSteiner) to talk about databases, the cloud as it pertains to ONTAP and SolidFire. We go over a wide array of things as Jeff tells us exactly how he feels about marketing and NVMe, the current shiny object/topic du jour in the storage industry. Jeff pulls no punches.

Jeff Steiner also has a blog at https://words.ofsteiner.com.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Taking a look at the new MyAutosupport Dashboard

NetApp has released a new MyAutosupport dashboard to help enhance the value you get out of the tool. The new features focus on predictive analytics, as well as proactive steps to help maintain your ONTAP cluster in an easier to digest format.

Included in the release are:

  • New customer dashboard with capacity and upgrade recommendations
  • New system dashboard with recommendation details
  • Workload tagging for context sensitive recommendations
  • Upgrade recommendations and plans in a few clicks
  • Dynamic and powerful search for your installed base
  • Modern, responsive user interface

The new interface can be found here:

http://mysupport.netapp.com/myautosupport/dist/index.html

What does it look like?

The new dashboard has a simpler look and feel and offers monitoring and historical tracking of system data, as well as predictive analysis.

The following gives an example of how the new MyAutosupport will look and feel:

myasup1.png

The splash page will deliver a way to track usage and consumption, ensure you’re running the recommended ONTAP versions, as well as viewing recent support cases and system risks to see if there needs to be some proactive maintenance done on a system.

Detailed views

In addition to the overall summary, you can click on one of the sections and drill down into a more detailed analysis of your cluster.

myasup2

View capacity forecasts based on data and workload trends, storage efficiency summaries and details about system risks.

Workload tagging

In addition to the predictive analysis and risk assessment, MyAutosupport also offers a way to tag specific volumes in your cluster and associate them with an application workload. Then, MyAutosupport will provide context sensitive proactive recommendations that will enable you get better efficiency, performance, and availability for your applications. You will also receive recommendations for best practices that are specific to your application. MyAutosupport will come pre-populated with specific applications in a drop down box to simplify the procedure.

myasup3.png

Simplified upgrade planning

From MyAutosupport, you can generate upgrade plans to provide a smoother and simpler way to perform non-disruptive upgrades. Just click on the “Upgrade Recommendation” portion of MyAutosupport, select the systems to upgrade, the target version and click Next.

myasup4

Enhanced search functionality

MyAutosupport also allows for better search functionality across systems in your environment. Simply type in a name, site or cluster and the list will populate in real time.

myasup5.png

Don’t like it?

That’s fine! We still have “Classic My Autosupport” for people who hate change.

oldman-myasup.png

myasup.png

Also check out the latest Tech ONTAP podcast where we talk with Sudip Hore about the dashboard!

 

 

Behind the Scenes: Episode 78 – NetApp Certifications featuring the NetApp A-Team

Welcome to the Episode 78, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week on the podcast, several NetApp A-Team members were in RTP to help create the NetApp NCIE exams that are used to help officially certify storage administrators for ONTAP. We rounded up Ruari McBride (@mcbride_ruairi), Scott Gelb (@scottygelb) and Steven Cortez (@mscproductions) to discuss how a NetApp certification exam gets created and why you’d want to take one.

To find out more about the NCIE, check out the official NetApp page: http://www.netapp.com/us/services-support/university/certification/ncie-san/index.aspx

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

Behind the Scenes: Episode 77 – CTO Predictions 2017 with Dr. Mark Bregman

Welcome to the Episode 77, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

group-4-2016

This week, we welcome NetApp CTO Dr. Mark Bregman (@drmarkbregman) to the podcast to discuss his recent CTO predictions with us and elaborate a bit on where he sees the IT industry shifting. Below is the short list of the predictions. Click here for the slide share. Tune in to the podcast for the details!

  • Prediction 1: Data Is the New Currency
  • Prediction 2: The Cloud as Catalyst and Accelerator
  • Prediction 3: New Technologies Become Standard
  • Prediction 4: A Wider Dynamic Range of Storage and Data Management Technologies Evolves
  • Prediction 5: New Models Take Hold
  • Prediction 6: Consumerization of IT Persists

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here: