Mounting ONTAP CIFS/SMB shares with Linux – Guidelines and tips

Every now and then, the question of mounting CIFS/SMB shares in ONTAP using Linux clients comes up. Many people might consider this configuration to be a crime against nature.


If you’re asking yourself “why would someone want to do this,” there are a few reasons.

  • Lock handling – SMB locks use oplocks. NFS locks use shared locks or delegations, depending on the NFS version and options being used. Generally this is fine, but in some cases, the style of lock might create complications with some applications. For example, Kafka on NFS has some issues with how NFS handles deletions when a file is locked:
    https://blog.usejournal.com/kafka-on-kubernetes-a-good-fit-95251da55837?gi=91867e828261
  • Legacy copiers/scanners – Many companies use copiers/scanners to create image files that redirect to a NAS share in their environment. Many of those copiers use old SMB clients and are no longer under support, so they can’t be upgraded. And many of those scanners/copiers are expensive, so people aren’t interested in buying newer ones just to get newer SMB versions.
  • “How we’ve always done it” – This is a common reason that doesn’t always have technical logic associated with it, but you still have to roll with it.

There are other reasons out there as well, I’m sure. But this post will attempt to collect the considerations, support statements, and issues you might see, in case someone else has that question.

Mounting SMB shares in ONTAP – General Guidelines

In 7-Mode ONTAP, mounting SMB shares with clients like AIX worked without issues, because 7-Mode supported a lot of older SMB technology concepts that newer versions of ONTAP have deprecated. But even though SMB shares on Linux worked in 7-Mode, it was never an officially supported configuration.

NetApp ONTAP has traditionally supported only Apple macOS and Microsoft Windows clients for CIFS/SMB workloads.

You can find the supported clients in the Interoperability Matrix here:

https://mysupport.netapp.com/matrix/

With newer versions of ONTAP, you *can* get SMB/CIFS working properly, provided you take the following into account:

  • CIFS/SMB clients should support UNICODE for CIFS/SMB communication. Clustered ONTAP doesn’t support non-UNICODE in any release and never will. Newer CIFS/SMB clients should support UNICODE; AIX is a good example of an older UNIX client that has challenges with SMB because some of its client installs still use non-UNICODE.
  • The SMB version used to mount should be one that ONTAP supports. ONTAP has deprecated SMB1, so you’d need to mount using SMB2 or SMB3.
  • The CIFS/SMB features enabled in ONTAP should reflect what the Linux client supports. For example, my CentOS 7 Samba client doesn’t support large MTU, SMB3 encryption, multichannel, etc. Unsupported features can cause mount failures (for an example, see “Permission denied” below).
  • If using Kerberos for the CIFS/SMB mounts, be sure the ONTAP CIFS/SMB server has a valid SPN for the CIFS service, and that the name matches the name used in the mount. For example, if your CIFS/SMB server is named “CIFS,” the SPN would be cifs/cifs.domain.com. (A quick client-side check follows this list.)
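
As a quick client-side sanity check, here’s a minimal, hedged example that verifies the SPN and mounts with an explicitly supported SMB version. The names (cifs.domain.com, DOMAIN.COM, the “share” share) are illustrative, and it assumes cifs-utils and the Kerberos client tools are installed:

# kinit user@DOMAIN.COM
# kvno cifs/cifs.domain.com
# mount -t cifs -o vers=3.0,sec=krb5 //cifs.domain.com/share /mnt/share

If kvno can’t obtain a ticket for the SPN, the Kerberos mount has no chance of succeeding.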

Common Issues when Mounting CIFS/SMB

One of the main issues people run into with SMB mounts in Linux (provided they follow the guidelines above for ONTAP) is using the wrong command syntax. For example, when you specify the share path, use either \\\\name\\share (the extra backslashes escape the \ characters so the shell doesn’t interpret them) or //name/share.

The following NetApp KB article does a good job of giving command examples for the right way to mount SMB:

How to access CIFS from Linux machine using SAMBA

Usually the command will tell you if the wrong option is being used, but sometimes the errors are less than helpful. These are a few errors I ran into.

Incorrect Mount Option

This one was pretty self-explanatory: first, I didn’t have the cifs-utils package installed; second, I used the wrong command syntax.
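
On a CentOS/RHEL client, for example, the mount helper comes from the cifs-utils package:

# yum install -y cifs-utils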

Mount point does not exist

Again, self-explanatory: the directory you’re trying to mount to needs to exist.
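
Creating the mount point first avoids the error (path is illustrative):

# mkdir -p /mnt/nas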

No such file or directory (when mounting)

This means you specified the wrong share path on the server. Check the share’s path (and, in ONTAP, the backing volume’s junction-path) and try again.

Host is down

This one was a tricky error, as it suggests that the server was not up. In some cases, that may truly be the issue. But it wasn’t in my case.

# mount -t cifs -o user=cifsuser \\\\DEMO\\nas /mnt/nas
Password for cifsuser@\DEMO\nas: **********
mount error(112): Host is down

But, in reality, the issue was that I didn’t specify the SMB version, so the client tried to use SMB 1.0 by default.

Specifying the SMB version (-o vers=3.0) got past that issue.
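
For reference, the corrected command looks like this (same share as above, version specified explicitly):

# mount -t cifs -o vers=3.0,user=cifsuser \\\\DEMO\\nas /mnt/nas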

Required key not available

This is a Kerberos-specific error. In my case, the cifs/servername.domain.com SPN didn’t exist for the hostname I used in the UNC path. You can see that in a packet capture.
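
If you have access to a domain-joined Windows host, setspn can confirm which SPNs are registered for the CIFS server’s machine account (the account name DEMO is illustrative):

C:\> setspn -L DEMO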

Permission denied

This error is fairly useless in both Windows and Linux in a lot of cases – mainly because it can mean a variety of things. Sometimes, it really is an access issue (such as share or file-level permissions). But in my testing, I also ran into this issue when I had unsupported SMB features enabled on my CIFS server in ONTAP.

# mount -t cifs -o vers=3.0,user=administrator,domain=NTAP.LOCAL //DEMO/nas /mnt/nas
Password for administrator@//DEMO/nas: **********
mount error(13): Permission denied

In a packet trace, I could see the client advertising which SMB capabilities it supported. But the reply didn’t really mention the issue. So I made sure to disable the following CIFS/SMB features:

  • SMB3 encryption (cifs security modify)
  • Large MTU and SMB Multichannel (cifs options modify)

Once I did that, I was able to mount.

[root@centos7 /]# mount -t cifs -o vers=3.0,user=administrator,domain=NTAP.LOCAL //DEMO/nas /mnt/nas
Password for administrator@//DEMO/nas: **********
[root@centos7 /]# touch /mnt/nas/smbfile
[root@centos7 /]# ls -la /mnt/nas
total 1
drwxr-xr-x 2 root root 0 Mar 27 14:36 .
drwxr-xr-x. 9 root root 97 Mar 27 10:17 ..
-rwxr-xr-x 1 root root 0 Mar 27 14:36 smbfile

For Kerberos, you would just specify sec=krb5.

# mount -t cifs -o sec=krb5,vers=3.0,user=administrator,domain=NTAP.LOCAL //DEMO/nas /mnt/nas

Stale file handle errors when crossing junction paths

In some cases, if you mount volumes beneath other volumes (via junction-path), newer Linux clients may report different inode numbers for the junctioned volume when using a command like “ls -i” than when using “ls -id”. This can create access problems on clients, such as “stale file handle” errors.

For example, this CentOS 8.3 client shows different inode numbers for the same volume:

# ls -i
8107029540347314273 mountedvolume
# ls -id /mnt/cifs/mountedvolume/
2744251190362505280 /mnt/cifs/mountedvolume/

The way around this issue is to mount with the option “actimeo=0”. Newer SMB clients cache inode information, which can create conflicts; setting actimeo=0 disables that attribute caching. Because this is a caching option, you might see slower overall performance, but it will fix the stale file handle issue. Alternately, you can mount with the “noserverino” option instead, which tells the client to generate its own inode numbers rather than using the server’s. (Both variants are shown below.)
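
For reference, both workarounds as full mount commands (names and paths match the earlier examples):

# mount -t cifs -o vers=3.0,actimeo=0,user=administrator,domain=NTAP.LOCAL //DEMO/nas /mnt/cifs
# mount -t cifs -o vers=3.0,noserverino,user=administrator,domain=NTAP.LOCAL //DEMO/nas /mnt/cifs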

Inode Numbers

When Unix Extensions are enabled, we use the actual inode number provided by the server in response to the POSIX calls as an inode number.

When Unix Extensions are disabled and "serverino" mount option is enabled there is no way to get the server inode number. The client typically maps the server-assigned "UniqueID" onto an inode number.

Note that the UniqueID is a different value from the server inode number. The UniqueID value is unique over the scope of the entire server and is often greater than 2 to the power of 32. This value often causes programs that are not compiled with LFS (Large File Support) to trigger a glibc EOVERFLOW error, as it won’t fit in the target structure field. It is strongly recommended to compile your programs with LFS support (i.e., with -D_FILE_OFFSET_BITS=64) to prevent this problem. You can also use the “noserverino” mount option to generate inode numbers smaller than 2 to the power of 32 on the client, but you may not be able to detect hardlinks properly.
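
If you control the application’s source, adding LFS support is a one-flag change at compile time (file names are illustrative):

# gcc -D_FILE_OFFSET_BITS=64 -o myapp myapp.c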

Once I set actimeo=0, this is how those inodes looked:

# ls -i
8107029540347314273 mountedvolume
# ls -id /mnt/cifs/mountedvolume/
8107029540347314273 /mnt/cifs/mountedvolume/

There you go! Hopefully this saved someone from going down a Google K-hole. Leave comments or suggestions below.

ONTAP 9.3 NFS sneak preview: Mount and security tracing


ONTAP 9.3 is on its way, and with it comes some long-awaited new functionality for NFS debugging, including a way to map volumes to IP addresses!

Mount trace

In ONTAP 7-Mode, you could trace mount requests with an option “nfs.mountd.trace.” That didn’t make its way into ONTAP operating in cluster mode until ONTAP 9.3. I covered a long and convoluted workaround in How to trace NFSv3 mount failures in clustered Data ONTAP.

Now you can set different levels of debugging for mount traces via the cluster CLI without having to jump through hoops. As a bonus, you can see which client mounted which volume through which data LIF!

To enable it, you would use the following diag level commands:

::*> debug sktrace tracepoint modify -node [node] -module MntTrace -level [0-20] -enabled true

::*> debug sktrace tracepoint modify -node [node] -module MntDebug -level [0-20] -enabled true

When enabled, ONTAP logs the mount trace modules to the sktrace log file, located at /mroot/etc/mlog/sktrace.log. This file can be accessed via the systemshell or via the SPI interface. Here are a few of the logging levels:

4 – Info
5 – Error
8 – Debug
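
For example, to log at debug level (8) for both modules on a single node (node2, as used below):

::*> debug sktrace tracepoint modify -node node2 -module MntTrace -level 8 -enabled true
::*> debug sktrace tracepoint modify -node node2 -module MntDebug -level 8 -enabled true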

When you set the trace level to 8, you can see successful mounts as well as failures. This gives volume info, client IP, and data LIF IP. For example, this mount was done from client 10.63.150.161 through data LIF 10.193.67.218 on vserver ID 10, to the /FGlocal path:

cluster::*> debug log sktrace show -node node2 -module-level MntTrace_8
Time TSC CPU:INT Module_Level
--------------------- ------------------------ ------- -------------------
 LogMountTrace: Mount access granted for Client=10.63.150.161
 VserverID=10 Lif=10.193.67.218 Path=/FGlocal

With that info, we can run the following command on the cluster to find the SVM and volume:

::*> net int show -address 10.193.67.218 -fields lif
 (network interface show)
vserver lif
------- ---------
DEMO    10g_data1

::*> volume show -junction-path /FGlocal -fields volume
vserver volume
------- ---------------
DEMO    flexgroup_local

The mount trace command can also be used to figure out why mount failures occurred on clients. We can also leverage performance information from OnCommand Performance Manager (top clients) and per-client stats to see which volumes are seeing large performance increases, then work backward with mount trace enabled to see which clients are mounting which LIFs, nodes, volumes, and so on.

Security trace (sectrace)

In ONTAP 9.2 and prior, you could use “sectrace” commands to trace CIFS/SMB permission issues only. Starting in ONTAP 9.3, you can use sectrace on SMB and/or NFS. This is useful for troubleshooting why someone might not have access to a file or folder inside a volume.

With the command, you can filter on:

  • Client IP
  • Path
  • Windows or UNIX name

Currently, however, sectrace is not supported on FlexGroup volumes:

cluster::*> sectrace filter create -vserver DEMO -index 1 -protocols nfs -trace-allow yes -enabled enabled -time-enabled 60

Warning: Security tracing for NFS will not be done for the following FlexGroups because Security tracing for NFS is not supported for FlexGroups: TechONTAP,flexgroupDS,flexgroup_16,flexgroup_local.
Do you want to continue? {y|n}: y

Then, I tested a permission issue.

# mkdir testsec
# chown 1301 testsec/
# chmod 700 testsec
# su user
$ cd /mnt/flexvol/testsec
bash: cd: /mnt/flexvol/testsec: Permission denied

And this was the result:

cluster::*> sectrace trace-result show -vserver DEMO

Node            Index Filter Details             Reason
--------------- ----- -------------------------- ------------------------------
node2           1     Security Style: UNIX       Access is allowed because the
                      permissions                user has UNIX root privileges
                                                 while creating the directory.
                                                 Access is granted for:
                                                 "Append"
                      Protocol: nfs
                      Volume: flexvol
                      Share: -
                      Path: /testsec
                      Win-User: -
                      UNIX-User: 0
                      Session-ID: -
node2           1     Security Style: UNIX       Access is allowed because the
                      permissions                user has UNIX root privileges
                                                 while setting attributes.
                      Protocol: nfs
                      Volume: flexvol
                      Share: -
                      Path: /testsec
                      Win-User: -
                      UNIX-User: 0
                      Session-ID: -
node2           1     Security Style: UNIX       Access is allowed because the
                      permissions                user has UNIX root privileges
                                                 while setting attributes.
                      Protocol: nfs
                      Volume: flexvol
                      Share: -
                      Path: /testsec
                      Win-User: -
                      UNIX-User: 0
                      Session-ID: -
node2           1     Security Style: UNIX       Access is not granted for:
                      permissions                "Modify", "Extend", "Delete"
                      Protocol: nfs
                      Volume: flexvol
                      Share: -
                      Path: /
                      Win-User: -
                      UNIX-User: 7041
                      Session-ID: -
node2           1     Security Style: UNIX       Access is not granted for:
                      permissions                "Lookup", "Modify", "Extend",
                                                 "Delete", "Read"
                      Protocol: nfs
                      Volume: flexvol
                      Share: -
                      Path: /testsec
                      Win-User: -
                      UNIX-User: 7041
                      Session-ID: -

As you can see above, the trace output gives a very clear picture of who tried to access the folder, which folder threw the error, and why the permission issue occurred.
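
When you’re finished, the filter can be listed and removed by its index; this is a hedged sketch using the same command family as the create above (note that the -time-enabled 60 value on the create should also age the filter out on its own):

cluster::*> sectrace filter show -vserver DEMO
cluster::*> sectrace filter delete -vserver DEMO -index 1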

Bonus Round: Block Size Histograms!

Now, this isn’t really a “new in ONTAP 9.3” thing; in fact, I found it as far back as 9.1. I just had never noticed it before. But in ONTAP, you can see the block sizes for NFS and CIFS/SMB operations in the CLI with the following command:

cluster::> statistics-v1 protocol-request-size show -node nodename

When you run this, you’ll see the average request size, the total count and a breakdown of what block sizes are being written to the cluster node. This can help you understand your NAS workloads better.

For example, this node runs mostly a VMware datastore workload:

cluster::> statistics-v1 protocol-request-size show -node node2 -stat-type nfs3_read

Node: node2
Stat Type: nfs3_read
                     Value    Delta
--------------       -------- ----------
Average Size:        30073    -
Total Request Count: 92633    -
0-511:                1950    -
512-1023:                0    -
1K-2047:              1786    -
2K-4095:              1253    -
4K-8191:             18126    -
8K-16383:              268    -
16K-32767:            4412    -
32K-65535:             343    -
64K-131071:           1560    -
128K - :             62935    -

When you run the command again, you get a delta from the last time you ran it.

If you’re interested in more ONTAP 9.3 feature information, check out Jeff Baxter’s blog here:

https://blog.netapp.com/announcing-netapp-ontap-9-3-the-next-step-in-modernizing-your-data-management/


I’ll also be doing more detailed blogs on new features as we get closer to the release.

How to trace NFSv3 mount failures in clustered Data ONTAP

One of the questions I get on a fairly regular basis is:

I used to be able to trace NFSv3 mount failures in 7-Mode with an option called nfs.mountd.trace. How do I do that in clustered Data ONTAP?

Well, the short answer is that there is no single option to trace mount failures currently. The NFS export architecture is vastly different in clustered Data ONTAP than in 7-Mode.

However, there is a way to get pretty close to replicating the functionality. And I’ve managed to script it out a bit to simplify the process. The following is the first run of a draft of a new section in TR-4067: NFS Best Practices and Implementation Guide. This blog will provide a sneak preview of that section, which may also become the beginnings of a clustered Data ONTAP NAS troubleshooting guide.


This document contains commands that must be run at diag privilege level. Please exercise caution when running these commands. Contact NetApp Technical Support for guidance.

What is NFSv3 Mount Tracing?

In some instances, NFS mount requests may fail for clients. When this happens, troubleshooting the mount needs to be as simple as possible. In Data ONTAP operating in 7-Mode, there was an option storage administrators could toggle for NFSv3 mount troubleshooting:

nfs.mountd.trace
Note: NFSv4 mounts do not leverage the MOUNT protocol, so these steps do not apply to NFSv4.

When the above option is enabled in 7-Mode, all mount requests are logged. This option is intended to help debug denied mount requests. Valid values for this option are on (enabled) or off (disabled). The default value for this option is off to avoid too many messages. The logging output is stored in the /etc/messages file, as well as being piped to the CLI console.

When a successful mount occurred on a NFS server in 7-Mode, something similar to the following would have been seen in the messages file:

Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 1456151530), is sent the NULL reply
Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 4006546062), is sent the NULL reply
Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 4023323278) in mount, has access rights to path

When a failed mount occurred, we’d see something like this:

Thu Feb 25 17:07:13 GMT: Client 10.61.69.167 (xid 945700734), is sent the NULL reply
Thu Feb 25 17:07:13 GMT: Client 10.61.69.167 (xid 962477950) in mount, does not have access rights to path /vol/nfs

Essentially, the information given in the trace included:

  • Timestamp of error logged to console and messages file
  • Client IP address
  • Whether the client received a reply
  • Whether the mount succeeded or not

That was the extent of the information given. From there, a storage administrator could use a command like exportfs -c to check access, which would give a more descriptive error:

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs
exportfs: problem finding mount access for 10.61.69.167 to /vol/nfs: (Export not there)

In the above example, the volume “nfs” was in the exports file, but that file had not been applied to memory yet (hence, “export not there”).

When we review the in-memory exports vs. the exports file, that is confirmed.

fas2020-rtp2*> exportfs
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid

fas2020-rtp2*> rdfile /etc/exports
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid
/vol/nfs        -sec=sys,rw,nosuid

All it took to fix this issue was re-exporting:

fas2020-rtp2*> exportfs -a

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs
exportfs: 10.61.69.167 has mount access to /vol/nfs

This is what a client export access issue would look like:

fas2020-rtp2*> exportfs
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid
/vol/nfs        -sec=sys,rw,nosuid
/vol/nfs2       -sec=sys,ro=10.61.69.166,rw=10.61.69.16,nosuid

Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 1456395643), is sent the NULL reply
Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 2136546296), is sent the NULL reply
Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 2153323512) in mount, does not have access rights to path /vol/nfs2

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs2
exportfs: 10.61.69.167 does not have mount access to /vol/nfs2 (Access denied)

With the option nfs.mountd.trace, the “who, what and when” of the mount issue was tracked, but not the “how or why.” Those questions required additional leg work.

Despite the limitations of nfs.mountd.trace, customers who have moved from Data ONTAP operating in 7-Mode to clustered Data ONTAP miss this option, as it fit into a specific workflow. However, we can replicate the data gathered by this option fairly closely in clustered Data ONTAP. The following sections show how.

Replicating the option in clustered Data ONTAP

nfs.mountd.trace is conspicuously absent in clustered Data ONTAP. However, that doesn’t mean storage administrators can’t still get the same information as before. It just takes a different approach.

Some of this approach is currently covered in the following KB article: https://kb.netapp.com/support/index?page=content&id=2017281

That KB shows a number of statistics that can help indicate there is an issue with exports/NFS access. It discusses going into systemshell, but we can actually gather statistics from the cluster shell.

Using NFS Statistics to troubleshoot NFS mount issues

Using NFS export access statistics, I can check to see if I’ve ever been denied access to a mount and if those counters are incrementing.

To collect stats, we have to start the statistics capture for the object nfs_exports_access_cache. The object is only available at diag privilege:

cm8040-cluster::*> set diag 
cm8040-cluster::*> statistics start -object nfs_exports_access_cache 
Statistics collection is being started for sample-id: sample_17602

Keep in mind that in 8.3 and later, trying to view stats before we have started them will result in this error:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache
Error: show failed: Default sample not found. To collect a sample, 
use the "statistics start" and "statistics stop" commands. 
To view the data sample, use the "statistics show" command with the 
"-sample-id" parameter.

 

After a bit, check the stats. Filter on the following counters, using a pipe symbol (|) to indicate there are multiple entries:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 17:44:09
Elapsed-time: 549s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          0
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                0
    denied_no_rule                                                  0
    denied_wrong_vserver                                            0
    export_check                                                   92
    indeterminate                                                   0

In the above, I can see that I have two negative entries in the cache (that is, two cached “access denied” export check results). If I already know the client’s IP address (for instance, if the owner of the client called in to complain about being denied access), I can look that entry up in the cache using the following command at diag privilege:

cm8040-cluster::*> diag exports nblade access-cache show -node cm8040-cluster-01 -vserver nfs -policy denyhost -address 10.61.69.167

                               Node: cm8040-cluster-01
                            Vserver: nfs
                        Policy Name: denyhost
                         IP Address: 10.61.69.167
           Access Cache Entry Flags: has-usable-data
                        Result Code: 0
                  Failure Type Code: 0
        First Unresolved Rule Index: -
             Unresolved Clientmatch: -
             Unresolved Anon String: -
     Number of Matched Policy Rules: 0
List of Matched Policy Rule Indexes: -
                       Age of Entry: 1002s
        Access Cache Entry Polarity: negative

In the above, the stats were not enabled when the initial issue occurred. Once I reproduce the access issue, I can check the stats again:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 17:50:43
Elapsed-time: 943s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          1
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                0
    denied_no_rule                                                  1
    denied_wrong_vserver                                            0
    export_check                                                  160
    indeterminate                                                   0

We can see that the denied_no_rule stat incremented on node cm8040-cluster-01, meaning no export policy rule matched the client. Next, consider a policy rule that does match the client but explicitly denies access:

cm8040-cluster::*> export-policy rule show -vserver nfs -policyname denyhost
             Policy          Rule    Access   Client                RO
Vserver      Name            Index   Protocol Match                 Rule
------------ --------------- ------  -------- --------------------- ---------
nfs          denyhost        1       any      10.61.69.166          any
nfs          denyhost        2       any      10.61.69.167          never
2 entries were displayed.

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 18:41:47
Elapsed-time: 4007s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          2
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                1
    denied_no_rule                                                  1
    denied_wrong_vserver                                            0
    export_check                                                  714
    indeterminate                                                   0

This time, denied_no_rorule increments when the client attempts access, which means an export policy rule matched the client by name or IP but explicitly denied access. Now I have the “what” and “why” questions answered: access was denied by the export policy rules applied to my client, whether by accident or by design.
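
One related caveat from the access-cache behavior above: after you fix an export policy rule, a lingering negative cache entry can keep denying the client until it expires. Depending on your release, there is a command to flush the access cache manually; this is a hedged sketch, so check your version’s documentation:

cm8040-cluster::*> vserver export-policy cache flush -vserver nfs -caches access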

While statistics are great, these numbers don’t tell the full story. With just the numbers, we don’t know which clients are having access issues, which volumes are affected, and thus which specific export policies and rules are the culprit. We do know which node is being accessed, however, which is useful.

Using logs to troubleshoot NFS mount issues

We can use statistics to show us that a problem exists. We use logs to discover where the problem lives.

In clustered Data ONTAP, there are a number of log traces that can be enabled to not only replicate the functionality seen in 7-Mode, but also expand on it. The downside is that instead of a single option, there are multiple switches to toggle.

To get a full look at debug logs for the NFS stack, we would need to configure sktraces to debug levels:

sysvar.sktrace.AccessCacheDebug_enable=-1
sysvar.sktrace.NfsPathResolutionDebug_enable=63
sysvar.sktrace.NfsDebug_enable=63
sysvar.sktrace.MntDebug_enable=-1
sysvar.sktrace.Nfs3ProcDebug_enable=63

This can be done from the cluster shell (the ::> prompt) rather than dropping into an interactive systemshell session. Since we know from our statistics that the issue lives on node1 of the cluster, we can limit the command to that node:

cm8040-cluster::*> set diag -c off; systemshell -node cm8040-cluster-01 -c 
"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=-1;
sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=63;
sudo sysctl sysvar.sktrace.NfsDebug_enable=63;
sudo sysctl sysvar.sktrace.MntDebug_enable=-1;
sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=63"

Export rules are checked via mgwd, so we’d need to log at a debug level there as well:

cm8040-cluster::*> set diag -c off; logger mgwd log modify -module mgwd::exports 
-level debug -node cm8040-cluster-01

Starting in clustered Data ONTAP 8.3.1, also disable log suppression (this setting is not available in earlier releases). If suppression is left enabled, we might miss important information in the logs.

cm8040-cluster::*> diag exports mgwd journal modify -node cm8040-cluster-01 
-trace-all true -suppress-repeating-errors false

After these are set, reproduce the issue. After reproducing the issue, be sure to disable the debug logging by setting the values back to their defaults to avoid flooding logs with spam, which could cause us to miss errors as the logs roll off.

The default sktrace values are:

sysvar.sktrace.AccessCacheDebug_enable=0
sysvar.sktrace.NfsPathResolutionDebug_enable=0
sysvar.sktrace.NfsDebug_enable=0
sysvar.sktrace.MntDebug_enable=0
sysvar.sktrace.Nfs3ProcDebug_enable=0

To disable sktraces:

cm8040-cluster::*> set diag -c off; systemshell -node cm8040-cluster-01 
-c "sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=0;
sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=0;
sudo sysctl sysvar.sktrace.NfsDebug_enable=0;
sudo sysctl sysvar.sktrace.MntDebug_enable=0;
sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=0"

To reset the mgwd exports trace to error level:

cm8040-cluster::*> set diag -c off; logger mgwd log modify -module mgwd::exports 
-level err -node cm8040-cluster-01

To reset the mgwd journal settings and re-enable suppression:

cm8040-cluster::*> diag exports mgwd journal modify -node cm8040-cluster-01 
-trace-all false -suppress-repeating-errors true

Collecting logs

The logs that capture the necessary information are in /mroot/etc/log/mlog on the specific node that is being debugged. These are accessible via the SPI web interface covered in this KB article:

How to manually collect logs and copy files from a clustered Data ONTAP storage system

(TL;DR: Use http://cluster-mgmt-IP/spi)

Specifically, the logs we want are:

sktlogd.log
mgwd.log
secd.log

Analyzing the logs

These logs can be a bit difficult to digest if you don’t know what you’re looking for. To make them more consumable, find clients that tried mounting during the log capture by using grep (or searching in a text editor) for the following string:

"MntDebug_3:  MountProcNull: From Client"

For example:

% cat sktlogd.log | grep "MntDebug_3:  MountProcNull: From Client"
000000bb.000635b7 0697ecee Thu Feb 25 2016 18:59:41 +00:00 [kern_sktlogd:info:4513] 22.18.39.700592+1666 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.000635d1 0697ecef Thu Feb 25 2016 18:59:41 +00:00 [kern_sktlogd:info:4513] 22.18.39.701423+0092 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00066e0b 0698181b Thu Feb 25 2016 19:18:07 +00:00 [kern_sktlogd:info:4513] 22.37.07.606654+0674 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00066e25 0698181b Thu Feb 25 2016 19:18:07 +00:00 [kern_sktlogd:info:4513] 22.37.07.612221+0020 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00068db1 06982ad7 Thu Feb 25 2016 19:26:07 +00:00 [kern_sktlogd:info:4513] 22.45.05.424066+0094 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00068dcb 06982ad7 Thu Feb 25 2016 19:26:07 +00:00 [kern_sktlogd:info:4513] 22.45.05.424789+2000 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00069500 06982fd2 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.745115+0720 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.0006951a 06982fd2 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.745837+0236 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167

From the above, we get our “when” (time stamp) and “who” (which client) answers. Then we can see the “why” (why did it fail) with the following log sections:

000000bb.0006957e 06982fd3 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.746272+0446 [7.0] MntDebug_5:  MountProcMntPathResolutionCallback: Export Check Exec=0x0xffffff08b9f4b040,rsid=0,Result=3106
000000bb.0006957f 06982fd3 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.746273+0380 [7.0] MntDebug_5:  MountProcMntPathResolutionCallback: No Access Exec=0x0xffffff08b9f4b040 Ecode=0xd Result=3106

Security Daemon (SecD) Logs

In addition to the logging that can be done at the kernel layer, there are also logs that can be leveraged with the authentication process. These logs are located at /mroot/etc/log/mlog.

In the SecD log, we’re mainly looking for three things with regard to mount access problems:

  • Is SecD running? (If not, no mounts are allowed.)
  • What user is the client authenticating as? (i.e., user squashing)
  • Netgroup processing (no netgroup resolution? Access will be denied.)

Outside of that, SecD logs are mainly useful in troubleshooting user access permissions after the mount is successful, which falls under “permissions” or “access” issues, rather than mounts failing.

For more information on SecD and what it is, see TR-4073: Secure Unified Authentication and TR-4067: NFS Best Practices and Implementation Guide.

Packet tracing

With mount tracing in clustered Data ONTAP, there is currently no way to answer the “what” (i.e., which volume is being mounted) from the logs. That has to happen via a packet trace. Packet traces can be captured from an NFS client using tcpdump. The syntax I normally use to trace from an NFS client:

# tcpdump -w <filename.trc> -s 0 -i <interfacename>

In the above example:

  • If you don’t specify -w, the output pipes to the screen and isn’t captured to a file.
  • -s 0 ensures the full packet length is captured, avoiding truncation.
  • Interface names can be seen with ifconfig. Use the interface that holds the IP address connecting to the NFS server. (A concrete invocation follows this list.)
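
Putting those options together, a typical capture during a mount attempt might look like this (the interface name is illustrative):

# tcpdump -w /tmp/mount_fail.trc -s 0 -i eth0 host <NFS server IP>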

Packet traces can also be captured from the cluster. These should be captured simultaneously with the client traces. The goal is to ensure the client and server are communicating properly and consistently. Clustered Data ONTAP packet tracing is covered in the following KB article:

How to capture packet traces on clustered Data ONTAP systems

What to look for in packet traces

When troubleshooting mount access issues via packet traces, it’s useful to filter on NFSv3 specific calls. For example, in Wireshark, use the following filter:

nfs or mount or portmap

That isolates the traffic we need to troubleshoot mount issues.

To find export paths, simply look for the MOUNT calls. That will tell us what export is being mounted:

MOUNT  150    V3 MNT Call (Reply In 67) /unix

To find which volume in the cluster that path corresponds to, use the following command:

cluster::> volume show -junction-path /unix

If the export path is a subdirectory or qtree, use the first portion of the path. For example, /unix/qtree/dir1 would be queried as /unix to find the corresponding volume.

Other useful information in packet traces that can be used to troubleshoot mount issues include:

  • Source and destination IP addresses
  • Source and destination ports (firewall issues, NFS/mount rootonly option issues)
  • User attempting access
  • Auth method being used (AUTH_SYS, AUTH_NONE, AUTH_GSS, etc)
  • Client hostname
  • Filehandles

Simplifying NFS mount debugging

To make troubleshooting NFS mounts a bit simpler, it makes sense to create a shell script to automate the enabling/disabling of NFS tracing in clustered Data ONTAP.

First, passwordless SSH should be configured for use with an admin host. That process is covered in TR-4073: Secure Unified Authentication, as well as this KB: https://kb.netapp.com/support/index?page=content&id=1012542

Once that’s done, create a shell script and customize it. And feel free to add feedback or your own submissions!
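
As a starting point, here’s a minimal, hedged sketch of such a wrapper, using only the commands shown earlier in this section (the cluster management LIF name, node name, and admin user are illustrative):

#!/bin/bash
# Toggle NFSv3 mount tracing on one cluster node over passwordless SSH.
CLUSTER=cluster-mgmt        # cluster management LIF (illustrative)
NODE=cm8040-cluster-01      # node being debugged (illustrative)

case "$1" in
enable)
  # Raise kernel-side mount tracing to debug levels
  ssh admin@$CLUSTER "set diag -c off; systemshell -node $NODE -c \"sudo sysctl sysvar.sktrace.MntDebug_enable=-1; sudo sysctl sysvar.sktrace.NfsDebug_enable=63\""
  # Raise mgwd export logging to debug
  ssh admin@$CLUSTER "set diag -c off; logger mgwd log modify -module mgwd::exports -level debug -node $NODE"
  ;;
disable)
  # Restore defaults so the logs don't flood
  ssh admin@$CLUSTER "set diag -c off; systemshell -node $NODE -c \"sudo sysctl sysvar.sktrace.MntDebug_enable=0; sudo sysctl sysvar.sktrace.NfsDebug_enable=0\""
  ssh admin@$CLUSTER "set diag -c off; logger mgwd log modify -module mgwd::exports -level err -node $NODE"
  ;;
*)
  echo "Usage: $0 {enable|disable}"
  exit 1
  ;;
esac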

Script templates are located on Github:

https://github.com/whyistheinternetbroken/TR-4067-scripts