June 2016 – ONTAP 9 Month on the Tech ONTAP Podcast!

A couple weeks ago, I hinted at some news that was coming to the Tech ONTAP Podcast in June. Now, that news is finally here – ONTAP 9 has been announced!

We kicked off the month with an interview with Lee Caswell, the VP of Product and Solutions Marketing at NetApp, to talk about why NetApp decided to go with ONTAP 9 rather than another 8.x release, as well as how the ONTAP software release cadence is changing.

You can listen to that here:

The entire month is listed under a playlist here:

What changed?

With the new release, there is a focus on the following:

Simplicity

  • New AFF deployment templates for Oracle, VDI, and SQL to speed up setup (under 10 minutes)
  • Improved and re-vamped management GUIs (Dave Morera does a nice job covering this)
  • Unified data management across on-prem (flash and disk) and cloud datacenters

Maximizing Data Center Agility

Improved Performance

  • 20x faster response times
  • Up to 60% more IOPS than 8.3.1

There’s also a series of new features being offered in ONTAP 9. We discussed some of those with Lee Caswell (see his blog here), but we will also be covering them in upcoming episodes this month.

For a short list of the new features:

  • Support for 15TB SSD
  • Inline data compaction
  • SnapLock® software for data compliance
  • RAID-TEC triple-parity protection
  • Headroom for visibility of performance capacity
  • MetroCluster enhancements
  • Onboard key manager (Included for FREE)
  • FlexGroups (PVR only in 9.0)
  • Workgroup mode for CIFS/SMB

A few of the NetApp A-Team Members wrote up some blogs for the new stuff:

Upcoming schedule

The following is the tentative podcast schedule for the rest of the month, where we talk about all things ONTAP 9!

Week of June 3: Lee Caswell on “Why ONTAP 9”; ONTAP 9 overview with a special SolidFire announcement in a mini-podcast! (2.5 podcasts in one week!)
June 10: ONTAP 9 features: Data Protection
June 17: ONTAP 9 features: Flash Perf
June 24: ONTAP 9 features: ONTAP Select and SAN
June 30: ONTAP 9 features: FlexGroups
July 4: ONTAP 9 features: Manageability Tools

Stay tuned for great things!

Behind the Scenes: Episode 40 – Masters of Puppet!

Welcome to the Episode 40 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week, I was on vacation, so Glenn and Sully took the reins and invited their FlexPod buddies Dave Cain (@thedavecain) and Amit Borulkar (@amit_borulkar) to talk about Puppet automation.

I don’t have any inside info, other than that, while editing, I found out Glenn had discovered how poorly I wrote my outro.

I will add that I started to make some clever graphic replicating the Metallica Masters of Puppet album cover, replacing crosses with Puppet logos, but then I got lazy.

So, the original will have to do. (I’ll likely see a cease and desist from the band for this.)

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

You can find it here:

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

 

Check out the podcast episode here:

Stay tuned for some big news in the month of June!

I went on vacation this past week after attending the NetApp A-Team ETL. The family went to Carolina Beach, and if you have kids (or ever had them), you know that a vacation with a toddler is not really much of a vacation. But the weather was pretty nice overall…

The good news is, I hit my FitBit step goals for the week.

But, it was still a ton of fun – nothing like witnessing things like the ocean, rusty carnival rides and carnies through the eyes of someone who’s seeing them for the first time.

Now that I’m back, it’s time to start pumping up what we’ve got going on in June.

June will be a huge month for announcements – both for SolidFire and for NetApp. As such, the Tech ONTAP Podcast team is reserving the entire month of June for these announcements. There will also be blog posts on new things!

Naturally, I’m not at liberty to go into too much detail, but know that hints have been dropped all over the place, including on the recent NetApp earnings call.

I will, however, leave you with this teaser graphic…

[Teaser graphic: ONTAP 9 Week]

Behind the Scenes: Episode 39 – NetApp A-Team ETL

Welcome to the Episode 39 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week, we took the show on the road to NetApp headquarters to attend the NetApp A-Team ETL, which is similar to Storage Field Day, except it’s all NetApp.

So, Sully and I hopped on a plane to act as ambassadors and to conduct interviews of the guys. When we do that, we use these nifty little recording devices:

[Photo: portable podcast recording devices]

We also use those at trade shows. They’re portable and do pretty good recording work.

We managed to interview about 7 of the A-Team members, but couldn’t get more than that, as the bus was leaving for the final evening event. We’ll definitely try to sync up with the rest over Skype for future episodes. If you want to catch a few other interviews with them, Pete Flecha (@vPedroArrow) did some for his Virtually Speaking podcast.

Essentially, the ETL (Extract, Transform, Load) is a chance for A-Team members to engage senior NetApp leaders and voice their feedback to those leaders in an attempt to make NetApp better. One of those leaders happened to be NetApp founder, Dave Hitz, who’s always a popular guest for his candidness and entertaining delivery.

We also were able to interview the Cloud Czar/new SolidFire CTO, Val Bercovici, as well as Joel Kaufman, who is a director of product management for manageability at NetApp.

All in all, I thought the ETL was a success. Great, smart people. Lots of good food. Lots of fun. And Samantha Moulton, who runs the A-Team (@NetAppATeam), did a fantastic job bringing it all together.

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

You can find it here:

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

The official NetApp blog post for the podcast is here:

http://community.netapp.com/t5/Technology/Tech-ONTAP-Podcast-Episode-39-The-NetApp-A-Team-Attend-ETL/ba-p/119514#M2393

Now, here’s the podcast!

 

Behind the Scenes: Episode 38 – OpenStack Summit Post-cap

Welcome to the Episode 38 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week, we welcome Andrew back from OpenStack Summit and his vacation, and chat about the new NetApp Docker Volume plugin that was recently released.

No guests this week. Stay tuned for some big news coming from the podcast in the month of June.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher. You can find it here:

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

The official NetApp blog post for the podcast is here:

http://community.netapp.com/t5/Technology/Tech-ONTAP-Podcast-Episode-38-Docker-Volume-Plugin-amp-OpenStack-Summit-Recap/ba-p/119266

Here’s this week’s episode!

 

May 2016 update to TR-4073 (the NetApp NFS Kerberos/LDAP manifesto)

It’s time for new technical report updates!

Since clustered Data ONTAP 8.3.2 is now available, we are publishing our 8.3.2 updates to our docs. I finally got the updates added to TR-4073: Secure Unified Authentication.

What is Secure Unified Authentication?

Secure Unified Authentication is a solution-based methodology to provide secure (via Kerberos) unified (via central LDAP servers for identity management) authentication for enterprise IT environments.

Security is more important than ever, so using a ticket-based auth process instead of over-the-wire passwords is one way to ensure you have protected your business assets. With AES-256 encryption, you are using the strongest available enctype for Kerberos.

Ease of management is also critical to an ever changing IT landscape. LDAP for Identity Management makes user account management and NAS permissioning easier.
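
As a rough sketch of what that looks like on the cluster side, the commands below restrict Kerberos to AES-256 and check which LDAP configuration an SVM is using. The SVM name is a placeholder and the exact parameter names vary by ONTAP release, so treat this as illustrative rather than copy/paste syntax:

cluster::> vserver nfs modify -vserver NFS83 -permitted-enc-types aes-256
cluster::> vserver services name-service ldap show -vserver NFS83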

What’s new?

  • PowerShell alternatives to LDIFDE queries
  • Extended GIDs best practices
  • Improved asymmetric name mapping support in LDAP
  • Information on expanding all hosts in a netgroup
  • Vserver security command examples and information (fsecurity command parity)
  • Improved RFC-2307bis information
  • Bind DN examples
  • LDAP terminology

Where can I find it?

Technical reports can be found in a variety of ways. Google search works, as does looking in the NetApp library. I cover how to be better at NetApp documentation in a separate blog post. I also have a yet-unfinished series on LDAP here:

LDAP::What the heck is an LDAP anyway? – Part 1: Intro

To make it super easy, just follow this link:

TR-4073: Secure Unified Authentication

Behind the Scenes: Episode 37 – DevOps on Windows

Welcome to the Episode 37 version of the new series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

DevOps has been a hot topic in IT, but usually when people talk about it, non-Windows applications, operating systems and orchestration suites are involved. This week, we bring a couple of Windows guys in to talk about DevOps on Windows.

This episode featured:

We conducted this one via Skype, as I suspect we smell bad.

To read the official blog post:

http://community.netapp.com/t5/Technology/Tech-ONTAP-Podcast-Episode-37-DevOps-on-Windows/ba-p/119017

To listen to the episode, check it out here:

Re-animating the cDOT Showmount Script

Showmount in clustered Data ONTAP

When NetApp’s clustered Data ONTAP was first introduced, 7-Mode customers using NFS noticed something was missing – they could no longer run “showmount” from their clients to get the exported paths on the NFS server. Instead, it would just show a / and (everyone) for that server. For complete details on showmount in cDOT, see TR-4067.

Example of showmount in 7-Mode:

[root@nfsclient /]# showmount -e 10.61.84.240
Export list for 10.61.84.240:
/vol/unix      (everyone)
/vol/Test      (everyone)
/vol/vol0/home (everyone)
/vol/vol0      (everyone)
/vol/Test2     (everyone)
/vol/mixed     10.61.179.164

Example of showmount in cDOT prior to showmount support (8.2.x and prior):

[root@nfsclient /]# showmount -e 10.61.92.34
Export list for 10.61.92.34:
/ (everyone)

Clustered Data ONTAP 8.3 added support for showmount, but mainly to show exported paths for Oracle VM deployments, which require showmount to work properly.

However, showmount in 8.3 and beyond doesn’t show the exported client list, which some people used when troubleshooting NFS export mount/access issues. There are technical reasons for this decision, but don’t expect to see client access views added any time soon. Instead, use the commands supplied in clustered Data ONTAP to verify exports (such as export-policy check-access), or see my recent blog post, How to trace NFSv3 mount failures in clustered Data ONTAP.
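
For example, checking whether a specific client can mount a volume looks something like this in 8.3 and later (the SVM, volume, and client IP here are placeholders); the output shows which export policy and rule index matched, if any:

cluster::> vserver export-policy check-access -vserver NFS83 -volume unix -client-ip 10.61.69.167 -authentication-method sys -protocol nfs3 -access-type read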

Example of showmount in cDOT 8.3 and later:

# showmount -e 10.63.21.9
Export list for 10.63.21.9:
/unix       (everyone)
/unix/unix1 (everyone)
/unix/unix2 (everyone)
/           (everyone)

Prior to 8.3, NetApp offered a showmount script in the NetApp Toolchest. That script did replicate the overall 7-Mode functionality. Additionally, since it was an external script leveraging the ONTAP SDK, it could be modified and customized to do even more than run showmount. (A project I started but never got around to finishing)

However, after showmount support was added to cDOT 8.3, the showmount script was removed from the Toolchest and is no longer officially supported. In fact, good luck finding it at all. Until now!

I managed to get the guy who wrote the script (Karthik Nagalingam) to give me a copy and allow me to put it up on GitHub. Keep in mind that the script is officially unsupported by NetApp, but it can be considered an open-source contribution to the community, open for all to modify and improve upon. And you might have some luck contacting me or Karthik with issues.

If you’re interested, get it here:

https://github.com/whyistheinternetbroken/showmount-cdot 

 

Is this blog a Top vBlog 2016?

Every year, Eric Siebert (@ericsiebert) puts in a ton of work to collect and rank the top virtualization blogs. Last year, my blog made the (long) list in its first year. This year, I’m hoping to crack the top 50. I don’t do a ton of virtualization posts, but I do sprinkle in a few here and there.

Some of those include:

So, if you read my blog and enjoy it or get any sort of assistance out of it, please consider voting. I’m in the “Top vBlog” and “Top Storage Blog” categories. The NetApp Tech ONTAP podcast is also in the running for “Best Podcast.”

Click here to vote!

http://sgiz.mobi/s3/TopvBlog2016

How to trace NFSv3 mount failures in clustered Data ONTAP

One of the questions I get on a fairly regular basis is:

I used to be able to trace NFSv3 mount failures in 7-Mode with an option called nfs.mountd.trace. How do I do that in clustered Data ONTAP?

Well, the short answer is that there is no single option to trace mount failures currently. The NFS export architecture is vastly different in clustered Data ONTAP than in 7-Mode.

However, there is a way to get pretty close to replicating the functionality, and I’ve managed to script it out a bit to simplify the process. The following is a first draft of a new section in TR-4067: NFS Best Practices and Implementation Guide. This blog provides a sneak preview of that section, which may also become the beginnings of a clustered Data ONTAP NAS troubleshooting guide.

This document contains commands that must be run at diag privilege level. Please exercise caution when running these commands. Contact NetApp Technical Support for guidance.

What is NFSv3 Mount Tracing?

In some instances, NFS mount requests may fail for clients. When this happens, troubleshooting a mount needs to be made as simple as possible. In Data ONTAP operating in 7-Mode, there was an option storage administrators could toggle for NFSv3 mount troubleshooting:

nfs.mountd.trace
Note: NFSv4 mounts do not leverage the MOUNT protocol, so these steps do not apply to NFSv4.

When this option is enabled in 7-Mode, all mount requests are logged, which helps debug denied mount requests. Valid values are on (enabled) or off (disabled); the default is off, to avoid flooding the logs. The logging output is stored in the /etc/messages file and is also piped to the CLI console.
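
For reference, toggling it in 7-Mode was a one-liner:

filer> options nfs.mountd.trace on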

When a successful mount occurred on a NFS server in 7-Mode, something similar to the following would have been seen in the messages file:

Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 1456151530), is sent the NULL reply
Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 4006546062), is sent the NULL reply
Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 4023323278) in mount, has access rights to path

When a failed mount occurred, we’d see something like this:

Thu Feb 25 17:07:13 GMT: Client 10.61.69.167 (xid 945700734), is sent the NULL reply
Thu Feb 25 17:07:13 GMT: Client 10.61.69.167 (xid 962477950) in mount, does not have access rights to path /vol/nfs

Essentially, the information given in the trace included:

  • Timestamp of error logged to console and messages file
  • Client IP address
  • Whether the client received a reply
  • Whether the mount succeeded or not

That was the extent of the information given. From there, a storage administrator could use a command like exportfs -c to check access, which would give a more descriptive error:

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs
exportfs: problem finding mount access for 10.61.69.167 to /vol/nfs: (Export not there)

In the above example, the volume “nfs” was in the exports file, but that file had not been applied to memory yet (hence, “export not there”).

When we review the in-memory exports vs. the exports file, that is confirmed.

fas2020-rtp2*> exportfs
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid

fas2020-rtp2*> rdfile /etc/exports
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid
/vol/nfs        -sec=sys,rw,nosuid

All it took to fix this issue was re-exporting:

fas2020-rtp2*> exportfs -a

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs
exportfs: 10.61.69.167 has mount access to /vol/nfs

This is what a client export access issue would look like:

fas2020-rtp2*> exportfs
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid
/vol/nfs        -sec=sys,rw,nosuid
/vol/nfs2       -sec=sys,ro=10.61.69.166,rw=10.61.69.16,nosuid

Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 1456395643), is sent the NULL reply
Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 2136546296), is sent the NULL reply
Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 2153323512) in mount, does not have access rights to path /vol/nfs2

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs2
exportfs: 10.61.69.167 does not have mount access to /vol/nfs2 (Access denied)

With the option nfs.mountd.trace, the “who, what and when” of the mount issue was tracked, but not the “how or why.” Those questions required additional leg work.

Despite the limitations of nfs.mountd.trace, customers who have moved from Data ONTAP operating in 7-Mode to clustered Data ONTAP miss this option, as it fit into a specific workflow. However, we can replicate the data gathered by this option fairly closely in clustered Data ONTAP. The following sections show how.

Replicating the option in clustered Data ONTAP

nfs.mountd.trace is conspicuously absent in clustered Data ONTAP. However, that doesn’t mean storage administrators can’t still get the same information as before. It just takes a different approach.

Some of this approach is currently covered in the following KB article: https://kb.netapp.com/support/index?page=content&id=2017281

That KB shows a number of statistics that can help indicate there is an issue with exports/NFS access. It discusses going into systemshell, but we can actually gather statistics from the cluster shell.

Using NFS Statistics to troubleshoot NFS mount issues

Using NFS export access statistics, I can check to see if I’ve ever been denied access to a mount and if those counters are incrementing.

To collect stats, we have to start the statistics capture for the object nfs_exports_access_cache. The object is only available at diag privilege:

cm8040-cluster::*> set diag 
cm8040-cluster::*> statistics start -object nfs_exports_access_cache 
Statistics collection is being started for sample-id: sample_17602

Keep in mind that in 8.3 and later, trying to view stats before we have started them will result in this error:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache
Error: show failed: Default sample not found. To collect a sample, 
use the "statistics start" and "statistics stop" commands. 
To view the data sample, use the "statistics show" command with the 
"-sample-id" parameter.

 

After a bit, check the stats. Filter on the following counters, using a pipe symbol (|) to indicate there are multiple entries:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 17:44:09
Elapsed-time: 549s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          0
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                0
    denied_no_rule                                                  0
    denied_wrong_vserver                                            0
    export_check                                                   92
    indeterminate                                                   0

In the above, I can see that I have two negative entries in cache (which means export access was denied twice). If I know the client’s IP address already (for instance, if the owner of the client called in to complain about being denied access), I could look that information up in the cache using the following command at diag privilege:

cm8040-cluster::*> diag exports nblade access-cache show -node cm8040-cluster-01 -vserver nfs -policy denyhost -address 10.61.69.167

                               Node: cm8040-cluster-01
                            Vserver: nfs
                        Policy Name: denyhost
                         IP Address: 10.61.69.167
           Access Cache Entry Flags: has-usable-data
                        Result Code: 0
                  Failure Type Code: 0
        First Unresolved Rule Index: -
             Unresolved Clientmatch: -
             Unresolved Anon String: -
     Number of Matched Policy Rules: 0
List of Matched Policy Rule Indexes: -
                       Age of Entry: 1002s
        Access Cache Entry Polarity: negative

In the above, the stats were not enabled when the initial issue occurred. Once I reproduce the access issue, I can check the stats again:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 17:50:43
Elapsed-time: 943s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          1
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                0
    denied_no_rule                                                  1
    denied_wrong_vserver                                            0
    export_check                                                  160
    indeterminate                                                   0

We can see that the denied_no_rule stat incremented for node1. Now, what if an export policy rule matches the client but explicitly denies access (RO rule set to never)?

cm8040-cluster::*> export-policy rule show -vserver nfs -policyname denyhost
             Policy          Rule    Access   Client                RO
Vserver      Name            Index   Protocol Match                 Rule
------------ --------------- ------  -------- --------------------- ---------
nfs          denyhost        1       any      10.61.69.166          any
nfs          denyhost        2       any      10.61.69.167          never
2 entries were displayed.

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 18:41:47
Elapsed-time: 4007s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          2
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                1
    denied_no_rule                                                  1
    denied_wrong_vserver                                            0
    export_check                                                  714
    indeterminate                                                   0

This time, denied_no_rorule increments when a client attempts access, which means the export policy rule specified the client by name or IP and explicitly denied any access. Now, I have the “what” and “why” questions answered: access was denied either because there was no export policy rule for my client or because a rule explicitly denied it, whether by accident or by design.

While statistics are great, these numbers don’t tell the full story. For instance, with just the numbers, we don’t know which clients are having access issues, what volumes are affected, and thus, what specific export policies/rules are the culprit. We do know what node is being accessed, however, which is useful.
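
One related tip: once the offending export policy or rule has been fixed, the client can still be tripped up by a stale negative entry in the access cache. In 8.3 and later there is a flush command for that; a sketch, with the SVM and node names as placeholders (exact parameters may vary slightly by release):

cluster::*> vserver export-policy cache flush -vserver nfs -node cm8040-cluster-01 -cache access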

Using logs to troubleshoot NFS mount issues

We can use statistics to show us that a problem exists. We use logs to discover where the problem lives.

In clustered Data ONTAP, there are a number of log traces that can be enabled to not only gain the same level of functionality as 7-Mode, but also expand on it. The downside is that instead of a single option, there are multiple switches to toggle.

To get a full look at debug logs for the NFS stack, we would need to configure sktraces to debug levels:

sysvar.sktrace.AccessCacheDebug_enable=-1
sysvar.sktrace.NfsPathResolutionDebug_enable=63
sysvar.sktrace.NfsDebug_enable=63
sysvar.sktrace.MntDebug_enable=-1
sysvar.sktrace.Nfs3ProcDebug_enable=63

This can be done from the cluster shell (the ::> prompt) by passing the commands through to the systemshell, rather than logging in to the systemshell interactively. Since we know from our statistics that the issue lives on node1 of the cluster, we can limit the command to node1:

cm8040-cluster::*> set diag -c off; systemshell -node cm8040-cluster-01 -c 
"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=-1;
sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=63;
sudo sysctl sysvar.sktrace.NfsDebug_enable=63;
sudo sysctl sysvar.sktrace.MntDebug_enable=-1;
sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=63"

Export rules are checked via mgwd, so we’d need to log at a debug level there as well:

cm8040-cluster::*> set diag -c off; logger mgwd log modify -module mgwd::exports 
-level debug -node cm8040-cluster-01

Starting in clustered Data ONTAP 8.3.1, also disable log suppression. If suppression is enabled, we might miss important information in the logs.

NOTE: Not available prior to clustered Data ONTAP 8.3.1.

cm8040-cluster::*> diag exports mgwd journal modify -node cm8040-cluster-01 
-trace-all true -suppress-repeating-errors false

After these are set, reproduce the issue. Then be sure to disable the debug logging by setting the values back to their defaults, to avoid flooding the logs with noise, which could cause us to miss errors as the logs roll over.

The default sktrace values are:

sysvar.sktrace.AccessCacheDebug_enable=0
sysvar.sktrace.NfsPathResolutionDebug_enable=0
sysvar.sktrace.NfsDebug_enable=0
sysvar.sktrace.MntDebug_enable=0
sysvar.sktrace.Nfs3ProcDebug_enable=0

To disable sktraces:

cm8040-cluster::*> set diag -c off; systemshell -node cm8040-cluster-01 
-c "sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=0;
sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=0;
sudo sysctl sysvar.sktrace.NfsDebug_enable=0;
sudo sysctl sysvar.sktrace.MntDebug_enable=0;
sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=0"

To reset the mgwd exports trace to error level:

cm8040-cluster::*> set diag -c off; logger mgwd log modify -module mgwd::exports 
-level err -node cm8040-cluster-01

To reset the mgwd journal log and re-enable suppression:

cm8040-cluster::*> diag exports mgwd journal modify -node cm8040-cluster-01 
-trace-all false -suppress-repeating-errors true

Collecting logs

The logs that capture the necessary information are in /mroot/etc/log/mlog on the specific node that is being debugged. These are accessible via the SPI web interface covered in this KB article:

How to manually collect logs and copy files from a clustered Data ONTAP storage system

(TL;DR: Use http://cluster-mgmt-IP/spi)

Specifically, the logs we want are

sktlogd.log
mgwd.log
secd.log
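
If you prefer to pull the logs listed above from the command line instead of a browser, something like the following may work. The /spi path layout shown here is an assumption on my part, so verify it against the KB article above before relying on it:

% curl -k -u admin https://cluster-mgmt-IP/spi/cm8040-cluster-01/etc/log/mlog/sktlogd.log -o sktlogd.log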

Analyzing the logs

These logs can be a bit difficult to digest if you don’t know what you’re looking for. To make them more consumable, use grep (or your text editor’s search) to find clients that tried mounting during the log capture by searching for the following string:

"MntDebug_3:  MountProcNull: From Client"

For example:

% cat sktlogd.log | grep "MntDebug_3:  MountProcNull: From Client"
000000bb.000635b7 0697ecee Thu Feb 25 2016 18:59:41 +00:00 [kern_sktlogd:info:4513] 22.18.39.700592+1666 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.000635d1 0697ecef Thu Feb 25 2016 18:59:41 +00:00 [kern_sktlogd:info:4513] 22.18.39.701423+0092 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00066e0b 0698181b Thu Feb 25 2016 19:18:07 +00:00 [kern_sktlogd:info:4513] 22.37.07.606654+0674 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00066e25 0698181b Thu Feb 25 2016 19:18:07 +00:00 [kern_sktlogd:info:4513] 22.37.07.612221+0020 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00068db1 06982ad7 Thu Feb 25 2016 19:26:07 +00:00 [kern_sktlogd:info:4513] 22.45.05.424066+0094 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00068dcb 06982ad7 Thu Feb 25 2016 19:26:07 +00:00 [kern_sktlogd:info:4513] 22.45.05.424789+2000 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00069500 06982fd2 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.745115+0720 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.0006951a 06982fd2 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.745837+0236 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167

From the above, we get our “when” (time stamp) and “who” (which client) answers. Then we can see the “why” (why did it fail) with the following log sections:

000000bb.0006957e 06982fd3 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.746272+0446 [7.0] MntDebug_5:  MountProcMntPathResolutionCallback: Export Check Exec=0x0xffffff08b9f4b040,rsid=0,Result=3106
000000bb.0006957f 06982fd3 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.746273+0380 [7.0] MntDebug_5:  MountProcMntPathResolutionCallback: No Access Exec=0x0xffffff08b9f4b040 Ecode=0xd Result=3106

Security Daemon (SecD) Logs

In addition to the logging that can be done at the kernel layer, there are also logs that can be leveraged with the authentication process. These logs are located at /mroot/etc/log/mlog.

In the SecD log, we’re mainly looking for three things with regard to mount access problems:

  • Is SecD running? (if not, no mounts are allowed)
  • What user is the client authenticating to? (i.e., user squashing)
  • Netgroup processing (no netgroup resolution? Access will be denied.)

Outside of that, SecD logs are mainly useful in troubleshooting user access permissions after the mount is successful, which falls under “permissions” or “access” issues, rather than mounts failing.

For more information on SecD and what it is, see TR-4073: Secure Unified Authentication and TR-4067: NFS Best Practice and Implementation Guide.

Packet tracing

With mount tracing in clustered Data ONTAP, there is currently no way to answer the “what” (i.e., which volume is being mounted) from the logs. That would have to happen via a packet trace. Packet traces can be captured from an NFS client using tcpdump. The syntax I normally use to trace from an NFS client:

# tcpdump -w <filename.trc> -s 0 -i <interfacename>

In the above example:

  • If you don’t specify -w, the output pipes to the screen and isn’t captured to a file
  • -s 0 ensures the full packet length is captured and avoids truncation
  • Interface names can be seen with ifconfig. The interface should be the one with the IP address connecting to the NFS server.
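
To keep the capture small, you can also filter at capture time to just the mount-related traffic. A sketch, assuming the usual defaults of portmapper on port 111, mountd on port 635 (the typical ONTAP mountd port) and NFS on port 2049, with eth0 as a placeholder interface:

# tcpdump -w mount.trc -s 0 -i eth0 port 111 or port 635 or port 2049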

Packet traces can also be captured from the cluster. These should be captured simultaneously with the client traces. The goal is to ensure the client and server are communicating properly and consistently. Clustered Data ONTAP packet tracing is covered in the following KB article:

How to capture packet traces on clustered Data ONTAP systems

What to look for in packet traces

When troubleshooting mount access issues via packet traces, it’s useful to filter on NFSv3 specific calls. For example, in Wireshark, use the following filter:

nfs or mount or portmap

That will filter out the main traffic we need to troubleshoot mount issues.
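
If you’d rather stay on the command line, the same display filter works with tshark (the terminal version of Wireshark); newer builds take -Y for display filters, older ones use -R, and the trace filename is just an example:

% tshark -r filename.trc -Y "nfs or mount or portmap"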

To find export paths, simply look for the MOUNT calls. That will tell us what export is being mounted:

MOUNT  150    V3 MNT Call (Reply In 67) /unix

To find which volume in the cluster that path corresponds to, use the following command:

cluster::> volume show -junction-path /unix

If the export path is a subdirectory or qtree, use the first portion of the path. For example, /unix/qtree/dir1 would be queried as /unix to find the corresponding volume.

Other useful information in packet traces that can be used to troubleshoot mount issues include:

  • Source and destination IP addresses
  • Source and destination ports (firewall issues, NFS/mount rootonly option issues)
  • User attempting access
  • Auth method being used (AUTH_SYS, AUTH_NONE, AUTH_GSS, etc)
  • Client hostname
  • Filehandles

Simplifying NFS mount debugging

To make troubleshooting NFS mounts a bit simpler, it makes sense to create a shell script to automate the enabling/disabling of NFS tracing in clustered Data ONTAP.

First, passwordless SSH should be configured for use with an admin host. That process is covered in TR-4073: Secure Unified Authentication, as well as this KB: https://kb.netapp.com/support/index?page=content&id=1012542

Once that’s done, create a shell script and customize it. And feel free to add feedback or your own submissions!
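
To give an idea of what I mean, here’s a rough, unsupported sketch of such a wrapper. It assumes passwordless SSH as the cluster admin user, the cluster and node names are placeholders, and the commands are simply the ones shown earlier in this post:

#!/bin/sh
# Rough sketch: toggle NFS mount tracing on one cluster node over passwordless SSH.
# CLUSTER and NODE are placeholders - adjust them for your environment.
CLUSTER=cluster-mgmt
NODE=cm8040-cluster-01

enable_trace() {
  ssh admin@${CLUSTER} "set diag -c off; systemshell -node ${NODE} -c \"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=-1; sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=63; sudo sysctl sysvar.sktrace.NfsDebug_enable=63; sudo sysctl sysvar.sktrace.MntDebug_enable=-1; sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=63\""
  ssh admin@${CLUSTER} "set diag -c off; logger mgwd log modify -module mgwd::exports -level debug -node ${NODE}"
  # Journal tracing requires clustered Data ONTAP 8.3.1 or later
  ssh admin@${CLUSTER} "set diag -c off; diag exports mgwd journal modify -node ${NODE} -trace-all true -suppress-repeating-errors false"
}

disable_trace() {
  ssh admin@${CLUSTER} "set diag -c off; systemshell -node ${NODE} -c \"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=0; sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=0; sudo sysctl sysvar.sktrace.NfsDebug_enable=0; sudo sysctl sysvar.sktrace.MntDebug_enable=0; sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=0\""
  ssh admin@${CLUSTER} "set diag -c off; logger mgwd log modify -module mgwd::exports -level err -node ${NODE}"
  ssh admin@${CLUSTER} "set diag -c off; diag exports mgwd journal modify -node ${NODE} -trace-all false -suppress-repeating-errors true"
}

case "$1" in
  enable)  enable_trace ;;
  disable) disable_trace ;;
  *)       echo "Usage: $0 {enable|disable}" ;;
esac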

Script templates are located on Github:

https://github.com/whyistheinternetbroken/TR-4067-scripts