Behind the Scenes: Episode 42 – ONTAP 9 Overview


Welcome to Episode 42 of the new series, “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This is yet another in the series of episodes for ONTAP 9 month on the podcast.


This week, we invited the director of product management, Quinn Summers, to give a technical overview of the new features in ONTAP 9. We also interviewed SolidFire CTO Val Bercovici (@valb00) about the SolidFire Analyst Day announcements from June 2. Earlier in the week, we released a full episode on the decision process behind branding ONTAP 9, as well as a mini-podcast on the SolidFire announcement (for those of you who didn’t want to sit through the ONTAP 9 episode).


Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

You can find it here:

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Check out the podcast episode here:

How to trace NFSv3 mount failures in clustered Data ONTAP

One of the questions I get on a fairly regular basis is:

I used to be able to trace NFSv3 mount failures in 7-Mode with an option called nfs.mountd.trace. How do I do that in clustered Data ONTAP?

Well, the short answer is that there is no single option to trace mount failures currently. The NFS export architecture is vastly different in clustered Data ONTAP than in 7-Mode.

However, there is a way to get pretty close to replicating the functionality, and I’ve managed to script it out a bit to simplify the process. What follows is a first draft of a new section in TR-4067: NFS Best Practices and Implementation Guide. This blog provides a sneak preview of that section, which may also become the beginnings of a clustered Data ONTAP NAS troubleshooting guide.


This document contains commands that must be run at diag privilege level. Please exercise caution when running these commands. Contact NetApp Technical Support for guidance.

What is NFSv3 Mount Tracing?

In some instances, NFS mount requests may fail for clients. When that happens, troubleshooting the mount needs to be as simple as possible. Data ONTAP operating in 7-Mode had an option that storage administrators could toggle for NFSv3 mount troubleshooting:

nfs.mountd.trace

Note: NFSv4 mounts do not leverage the MOUNT protocol, so these steps do not apply to NFSv4.

When the above option is enabled in 7-Mode, all mount requests are logged. This option is intended to help debug denied mount requests. Valid values for this option are on (enabled) or off (disabled). The default value for this option is off to avoid too many messages. The logging output is stored in the /etc/messages file, as well as being piped to the CLI console.
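For example, enabling the trace on a 7-Mode system looks something like this (turning it back off when you’re done is the same command with “off”):

fas2020-rtp2> options nfs.mountd.trace on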

When a successful mount occurred on an NFS server in 7-Mode, something similar to the following would have been seen in the messages file:

Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 1456151530), is sent the NULL reply
Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 4006546062), is sent the NULL reply
Thu Feb 25 17:03:28 GMT: Client 10.61.69.167 (xid 4023323278) in mount, has access rights to path

When a failed mount occurred, we’d see something like this:

Thu Feb 25 17:07:13 GMT: Client 10.61.69.167 (xid 945700734), is sent the NULL reply
Thu Feb 25 17:07:13 GMT: Client 10.61.69.167 (xid 962477950) in mount, does not have access rights to path /vol/nfs

Essentially, the information given in the trace included:

  • Timestamp of error logged to console and messages file
  • Client IP address
  • Whether the client received a reply
  • Whether the mount succeeded or not

That was the extent of the information given. From there, a storage administrator could use a command like exportfs -c to check access, which would give a more descriptive error:

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs
exportfs: problem finding mount access for 10.61.69.167 to /vol/nfs: (Export not there)

In the above example, the volume “nfs” was in the exports file, but that file had not been applied to memory yet (hence, “export not there”).

When we review the in-memory exports vs. the exports file, that is confirmed.

fas2020-rtp2*> exportfs
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid

fas2020-rtp2*> rdfile /etc/exports
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid
/vol/nfs        -sec=sys,rw,nosuid

All it took to fix this issue was re-exporting:

fas2020-rtp2*> exportfs -a

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs
exportfs: 10.61.69.167 has mount access to /vol/nfs

This is what a client export access issue would look like:

fas2020-rtp2*> exportfs
/vol/vol0       -sec=sys,rw,root=10.61.69.161,nosuid
/vol/nfs        -sec=sys,rw,nosuid
/vol/nfs2       -sec=sys,ro=10.61.69.166,rw=10.61.69.16,nosuid

Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 1456395643), is sent the NULL reply
Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 2136546296), is sent the NULL reply
Thu Feb 25 17:15:34 GMT: Client 10.61.69.167 (xid 2153323512) in mount, does not have access rights to path /vol/nfs2

fas2020-rtp2*> exportfs -c 10.61.69.167 /vol/nfs2
exportfs: 10.61.69.167 does not have mount access to /vol/nfs2 (Access denied)

With the option nfs.mountd.trace, the “who, what and when” of the mount issue was tracked, but not the “how or why.” Those questions required additional leg work.

Despite the limitations of nfs.mountd.trace, customers who have moved from Data ONTAP operating in 7-Mode to clustered Data ONTAP miss this option, as it fit into a specific workflow. However, we can replicate the data gathered by this option fairly closely in clustered Data ONTAP. The following sections show how.

Replicating the option in clustered Data ONTAP

nfs.mountd.trace is conspicuously absent in clustered Data ONTAP. However, that doesn’t mean storage administrators can’t still get the same information as before. It just takes a different approach.

Some of this approach is currently covered in the following KB article: https://kb.netapp.com/support/index?page=content&id=2017281

That KB shows a number of statistics that can help indicate there is an issue with exports/NFS access. It discusses going into systemshell, but we can actually gather statistics from the cluster shell.

Using NFS Statistics to troubleshoot NFS mount issues

Using NFS export access statistics, I can check to see if I’ve ever been denied access to a mount and if those counters are incrementing.

To collect stats, we have to start the statistics capture for the object nfs_exports_access_cache. The object is only available at diag privilege:

cm8040-cluster::*> set diag 
cm8040-cluster::*> statistics start -object nfs_exports_access_cache 
Statistics collection is being started for sample-id: sample_17602

Keep in mind that in 8.3 and later, trying to view stats before we have started them will result in this error:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache
Error: show failed: Default sample not found. To collect a sample, 
use the "statistics start" and "statistics stop" commands. 
To view the data sample, use the "statistics show" command with the 
"-sample-id" parameter.


After a bit, check the stats. Filter on the following counters, using a pipe symbol (|) to indicate there are multiple entries:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 17:44:09
Elapsed-time: 549s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          0
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                0
    denied_no_rule                                                  0
    denied_wrong_vserver                                            0
    export_check                                                   92
    indeterminate                                                   0

In the above, I can see that I have two negative entries in cache (which means export access was denied twice). If I know the client’s IP address already (for instance, if the owner of the client called in to complain about being denied access), I could look that information up in the cache using the following command at diag privilege:

cm8040-cluster::*> diag exports nblade access-cache show -node cm8040-cluster-01 -vserver nfs -policy denyhost -address 10.61.69.167

                               Node: cm8040-cluster-01
                            Vserver: nfs
                        Policy Name: denyhost
                         IP Address: 10.61.69.167
           Access Cache Entry Flags: has-usable-data
                        Result Code: 0
                  Failure Type Code: 0
        First Unresolved Rule Index: -
             Unresolved Clientmatch: -
             Unresolved Anon String: -
     Number of Matched Policy Rules: 0
List of Matched Policy Rule Indexes: -
                       Age of Entry: 1002s
        Access Cache Entry Polarity: negative

In the above, the stats were not enabled when the initial issue occurred. Once I reproduce the access issue, I can check the stats again:

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 17:50:43
Elapsed-time: 943s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          1
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                0
    denied_no_rule                                                  1
    denied_wrong_vserver                                            0
    export_check                                                  160
    indeterminate                                                   0

We can see that the denied_no_rule stat incremented on node cm8040-cluster-01. Now, what happens if a policy rule matches the client but explicitly denies access?

cm8040-cluster::*> export-policy rule show -vserver nfs -policyname denyhost
             Policy          Rule    Access   Client                RO
Vserver      Name            Index   Protocol Match                 Rule
------------ --------------- ------  -------- --------------------- ---------
nfs          denyhost        1       any      10.61.69.166          any
nfs          denyhost        2       any      10.61.69.167          never
2 entries were displayed.

cm8040-cluster::*> statistics show -object nfs_exports_access_cache -counter cache_entries_negative|cache_entries_positive|denied|denied_no_eff_rorule|denied_no_rule|denied_no_rorule|denied_wrong_vserver|export_check|indeterminate

Object: nfs_exports_access_cache
Instance: cm8040-cluster-01
Start-time: 2/25/2016 17:35:00
End-time: 2/25/2016 18:41:47
Elapsed-time: 4007s

Node: cm8040-cluster-01

    Counter                                                     Value
    -------------------------------- --------------------------------
    cache_entries_negative                                          2
    cache_entries_positive                                          2
    denied                                                          2
    denied_no_eff_rorule                                            0
    denied_no_rorule                                                1
    denied_no_rule                                                  1
    denied_wrong_vserver                                            0
    export_check                                                  714
    indeterminate                                                   0

This time, denied_no_rorule increments when the client attempts access, which means an export policy rule matched the client by name or IP but explicitly denied access (the RO rule was set to “never”). Now I have the “what happened” and “why” questions answered: access was denied because no export policy rule allowed my client, whether by accident or by design.

While statistics are great, these numbers don’t tell the full story. For instance, with just the numbers, we don’t know which clients are having access issues, what volumes are affected, and thus, what specific export policies/rules are the culprit. We do know what node is being accessed, however, which is useful.

Using logs to troubleshoot NFS mount issues

We can use statistics to show us that a problem exists. We use logs to discover where the problem lives.

In clustered Data ONTAP, there are a number of log traces that can be enabled to not only provide the same level of functionality as 7-Mode, but also expand on it. The downside is that instead of a single option, there are multiple switches to toggle.

To get a full look at debug logs for the NFS stack, we would need to configure sktraces to debug levels:

sysvar.sktrace.AccessCacheDebug_enable=-1
sysvar.sktrace.NfsPathResolutionDebug_enable=63
sysvar.sktrace.NfsDebug_enable=63
sysvar.sktrace.MntDebug_enable=-1
sysvar.sktrace.Nfs3ProcDebug_enable=63

This can be done via cluster shell (the ::> prompt) rather than dropping into the systemshell. Since we know from our statistics that the issue lives on node1 of the cluster, we can limit the command to node1 using the following command:

cm8040-cluster::*> set diag -c off; systemshell -node cm8040-cluster-01 -c 
"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=-1;
sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=63;
sudo sysctl sysvar.sktrace.NfsDebug_enable=63;
sudo sysctl sysvar.sktrace.MntDebug_enable=-1;
sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=63"

Export rules are checked via mgwd, so we’d need to log at a debug level there as well:

cm8040-cluster::*> set diag -c off; logger mgwd log modify -module mgwd::exports 
-level debug -node cm8040-cluster-01

In clustered Data ONTAP 8.3.1 and later, also disable log suppression. If suppression is enabled, we might miss important information in the logs.

NOTE: This command is not available prior to clustered Data ONTAP 8.3.1.

cm8040-cluster::*> diag exports mgwd journal modify -node cm8040-cluster-01 
-trace-all true -suppress-repeating-errors false

After these are set, reproduce the issue. After reproducing the issue, be sure to disable the debug logging by setting the values back to their defaults to avoid flooding logs with spam, which could cause us to miss errors as the logs roll off.

The default sktrace values are:

sysvar.sktrace.AccessCacheDebug_enable=0
sysvar.sktrace.NfsPathResolutionDebug_enable=0
sysvar.sktrace.NfsDebug_enable=0
sysvar.sktrace.MntDebug_enable=0
sysvar.sktrace.Nfs3ProcDebug_enable=0

To disable sktraces:

cm8040-cluster::*> set diag -c off; systemshell -node cm8040-cluster-01 
-c "sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=0;
sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=0;
sudo sysctl sysvar.sktrace.NfsDebug_enable=0;
sudo sysctl sysvar.sktrace.MntDebug_enable=0;
sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=0"

To reset the mgwd exports trace to error level:

cm8040-cluster::*> set diag -c off; logger mgwd log modify -module mgwd::exports 
-level err -node cm8040-cluster-01

To reset the mgwd journal log and re-enable suppression:

cm8040-cluster::*> diag exports mgwd journal modify -node cm8040-cluster-01 
-trace-all false -suppress-repeating-errors true

Collecting logs

The logs that capture the necessary information are in /mroot/etc/log/mlog on the specific node that is being debugged. These are accessible via the SPI web interface covered in this KB article:

How to manually collect logs and copy files from a clustered Data ONTAP storage system

(TL;DR: Use http://cluster-mgmt-IP/spi)

Specifically, the logs we want are:

sktlogd.log
mgwd.log
secd.log
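If you’d rather script the collection than click through the SPI page, something along these lines can pull a log over HTTPS. This is just a sketch: the exact /spi path layout and credential prompts vary by release and node name, so verify the URL in a browser first.

% curl -k -u admin https://cluster-mgmt-ip/spi/cm8040-cluster-01/etc/log/mlog/sktlogd.log -o sktlogd.log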

Analyzing the logs

These logs can be a bit difficult to digest if you don’t know what you’re looking for. To make them more consumable, find clients that tried to mount during the log capture by using grep (or your text editor’s search) to look for the following string:

"MntDebug_3:  MountProcNull: From Client"

For example:

% cat sktlogd.log | grep "MntDebug_3:  MountProcNull: From Client"
000000bb.000635b7 0697ecee Thu Feb 25 2016 18:59:41 +00:00 [kern_sktlogd:info:4513] 22.18.39.700592+1666 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.000635d1 0697ecef Thu Feb 25 2016 18:59:41 +00:00 [kern_sktlogd:info:4513] 22.18.39.701423+0092 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00066e0b 0698181b Thu Feb 25 2016 19:18:07 +00:00 [kern_sktlogd:info:4513] 22.37.07.606654+0674 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00066e25 0698181b Thu Feb 25 2016 19:18:07 +00:00 [kern_sktlogd:info:4513] 22.37.07.612221+0020 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00068db1 06982ad7 Thu Feb 25 2016 19:26:07 +00:00 [kern_sktlogd:info:4513] 22.45.05.424066+0094 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00068dcb 06982ad7 Thu Feb 25 2016 19:26:07 +00:00 [kern_sktlogd:info:4513] 22.45.05.424789+2000 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.00069500 06982fd2 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.745115+0720 [7.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167
000000bb.0006951a 06982fd2 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.745837+0236 [2.0] MntDebug_3:  MountProcNull: From Client = 10.61.69.167

From the above, we get our “when” (time stamp) and “who” (which client) answers. Then we can see the “why” (why did it fail) with the following log sections:

000000bb.0006957e 06982fd3 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.746272+0446 [7.0] MntDebug_5:  MountProcMntPathResolutionCallback: Export Check Exec=0x0xffffff08b9f4b040,rsid=0,Result=3106
000000bb.0006957f 06982fd3 Thu Feb 25 2016 19:28:14 +00:00 [kern_sktlogd:info:4513] 22.47.12.746273+0380 [7.0] MntDebug_5:  MountProcMntPathResolutionCallback: No Access Exec=0x0xffffff08b9f4b040 Ecode=0xd Result=3106
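To jump straight to the failures instead of reading every MntDebug line, grep for the “No Access” marker (assuming the same debug wording shown above):

% grep "MntDebug_5" sktlogd.log | grep "No Access"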

Security Daemon (SecD) Logs

In addition to the logging that can be done at the kernel layer, there are also logs that can be leveraged with the authentication process. These logs are located at /mroot/etc/log/mlog.

In the SecD log, we’re mainly looking for a few things with regard to mount access problems:

  • Is SecD running? (if not, no mounts are allowed)
  • What user is the client authenticating to? (i.e., user squashing)
  • Netgroup processing (no netgroup resolution? Access will be denied.)

Outside of that, SecD logs are mainly useful in troubleshooting user access permissions after the mount is successful, which falls under “permissions” or “access” issues, rather than mounts failing.

For more information on SecD and what it is, see TR-4073: Secure Unified Authentication and TR-4067: NFS Best Practice and Implementation Guide.

Packet tracing

With mount tracing in clustered Data ONTAP, there is currently no way to answer the “what” (i.e., which volume is being mounted) from the logs. That has to come from a packet trace. Packet traces can be captured from an NFS client using tcpdump. The syntax I normally use to trace from an NFS client:

# tcpdump -w <filename.trc> -s 0 -i <interfacename>

In the above example:

  • If you don’t specify -w, the output will pipe to the screen and won’t get captured to a file
  • -s 0 ensures the full packet length is captured and avoids truncation
  • Interface names can be seen with ifconfig. The interface should be the one with the IP address connecting to the NFS server.
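For example, to capture a failing mount attempt from a client whose traffic leaves via eth0 toward a data LIF at 10.63.3.68 (both the interface name and the IP here are placeholders for your environment), something like this does the trick:

# tcpdump -w /tmp/mount_fail.trc -s 0 -i eth0 host 10.63.3.68

The host filter keeps the trace small and focused on the NFS server; drop it if you’d rather capture everything on the interface.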

Packet traces can also be captured from the cluster. These should be captured simultaneously with the client traces. The goal is to ensure the client and server are communicating properly and consistently. Clustered Data ONTAP packet tracing is covered in the following KB article:

How to capture packet traces on clustered Data ONTAP systems

What to look for in packet traces

When troubleshooting mount access issues via packet traces, it’s useful to filter on NFSv3 specific calls. For example, in Wireshark, use the following filter:

nfs or mount or portmap

That narrows the trace down to the traffic we need to troubleshoot mount issues.

To find export paths, simply look for the MOUNT calls. That will tell us what export is being mounted:

MOUNT  150    V3 MNT Call (Reply In 67) /unix

To find which volume in the cluster that path corresponds to, use the following command:

cluster::> volume show -junction-path /unix

If the export path is a subdirectory or qtree, use the first portion of the path. For example, /unix/qtree/dir1 would be queried as /unix to find the corresponding volume.
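From there, it’s one more step to see which export policy (and therefore which rules) applies to that volume. A quick sketch using standard fields and the commands shown earlier:

cluster::> volume show -junction-path /unix -fields vserver,volume,policy
cluster::> export-policy rule show -vserver nfs -policyname <policy name from above>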

Other useful information in packet traces that can be used to troubleshoot mount issues include:

  • Source and destination IP addresses
  • Source and destination ports (firewall issues, NFS/mount rootonly option issues)
  • User attempting access
  • Auth method being used (AUTH_SYS, AUTH_NONE, AUTH_GSS, etc.)
  • Client hostname
  • Filehandles

Simplifying NFS mount debugging

To make troubleshooting NFS mounts a bit simpler, it makes sense to create a shell script to automate the enabling/disabling of NFS tracing in clustered Data ONTAP.

First, passwordless SSH should be configured for use with an admin host. That process is covered in TR-4073: Secure Unified Authentication, as well as this KB: https://kb.netapp.com/support/index?page=content&id=1012542

Once that’s done, create a shell script and customize it. And feel free to add feedback or your own submissions!
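As a starting point, here’s a minimal sketch of what such a wrapper might look like. It assumes passwordless SSH to the cluster as admin and reuses the cluster and node names from the examples above; adjust those (and which sktrace flags you enable) for your environment.

#!/bin/sh
# Minimal sketch: toggle NFS mount tracing on a single node via SSH.
# CLUSTER and NODE are assumptions; replace with your cluster mgmt LIF and node name.
CLUSTER="cm8040-cluster"
NODE="cm8040-cluster-01"

case "$1" in
  enable)
    # Crank the kernel sktrace flags up to debug levels
    ssh admin@$CLUSTER "set diag -c off; systemshell -node $NODE -c \"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=-1; sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=63; sudo sysctl sysvar.sktrace.NfsDebug_enable=63; sudo sysctl sysvar.sktrace.MntDebug_enable=-1; sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=63\""
    # Turn up mgwd exports logging
    ssh admin@$CLUSTER "set diag -c off; logger mgwd log modify -module mgwd::exports -level debug -node $NODE"
    ;;
  disable)
    # Put everything back to defaults once the issue has been reproduced
    ssh admin@$CLUSTER "set diag -c off; systemshell -node $NODE -c \"sudo sysctl sysvar.sktrace.AccessCacheDebug_enable=0; sudo sysctl sysvar.sktrace.NfsPathResolutionDebug_enable=0; sudo sysctl sysvar.sktrace.NfsDebug_enable=0; sudo sysctl sysvar.sktrace.MntDebug_enable=0; sudo sysctl sysvar.sktrace.Nfs3ProcDebug_enable=0\""
    ssh admin@$CLUSTER "set diag -c off; logger mgwd log modify -module mgwd::exports -level err -node $NODE"
    ;;
  *)
    echo "Usage: $0 {enable|disable}"
    ;;
esac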

Script templates are located on Github:

https://github.com/whyistheinternetbroken/TR-4067-scripts

One of Clustered Data ONTAP’s Best Features That No One Knows About

Some questions I’ve gotten a few times go like this:

OMG, I deleted my volume. How do I get it back?

Or:

I deleted my volume and I’m not seeing the space given back to my aggregate. How do I fix that?

These questions started around clustered Data ONTAP 8.3. This is not a coincidence.

A little backstory

Back in my support days, we’d occasionally get an unfortunate call from a customer who had accidentally deleted a volume (or the wrong volume) and was frantically trying to get it back. Luckily, if it was caught in time, you could power down the filers and have one of our engineering wizards work their magic and recover the volume, since deletes take time as blocks are freed.

This issue came to a head when we had a System Manager design flaw that made deleting a volume *way* too easy and did not prompt the user for confirmation. Something had to be done.

Enter the Volume Recovery Queue

As a way to prevent catastrophe, clustered Data ONTAP 8.3 introduced a safety mechanism called the “volume recovery queue.” This feature is not entirely well known, as it’s buried in diag level, which means it doesn’t get documented in official product docs. However, I feel like it’s a cool feature that people need to know about, and one that should help answer questions like the ones I listed above.

Essentially, the recovery queue will take a deleted volume and keep it in the active file system (renamed and hidden from normal viewing) for a default of 12 hours. That means you have 12 hours to recover the deleted volume. It also means you have 12 hours until that space is reclaimed by the OS.

From the CLI man pages:

cluster::*> man volume recovery-queue
volume recovery-queue          Data ONTAP 8.3          volume recovery-queue

NAME
  volume recovery-queue -- Manage volume recovery queue

DESCRIPTION
  The recovery-queue commands enable you to manage volumes that are deleted and kept in the recovery queue.

COMMANDS
  modify      - Modify attributes of volumes in the recovery queue
  purge-all   - Purge all volumes from the recovery queue belonging to a Vserver
  purge       - Purge volumes from the recovery queue belonging to a Vserver
  recover-all - Recover all volumes from the recovery queue belonging to a Vserver
  recover     - Recover volumes from the recovery queue belonging to a Vserver
  show        - Show volumes in the recovery queue

The above commands, naturally, should be used with caution, especially the purge commands. And the modify command should not be used to set the retention hours so low that volumes are purged too aggressively. Definitely don’t set it to zero (which disables the queue).

How it works

When a volume is deleted, it is renamed with its unique data set ID (DSID) appended and removed from the replicated database (RDB) volume table. Instead, it’s viewable via the recovery queue for the 12-hour default retention period. During that time, space is not reclaimed, but the volume is still available to be recovered.

For example, my volume called “testdel” has a DSID of 1037:

cluster::*> vol show testdel -fields dsid
vserver volume  dsid
------- ------- ----
nfs     testdel 1037

When I delete the volume, we can’t see it in the volume table, but we can see it in the recovery queue, renamed to testdel_1037 (recall 1037 is the volume DSID):

cluster::*> vol offline testdel
Volume "nfs:testdel" is now offline.
cluster::*> vol delete testdel
Warning: Are you sure you want to delete volume "testdel" in Vserver "nfs" ? {y|n}: y
[Job 490] Job succeeded: Successful
cluster::*> vol show testdel -fields dsid
There are no entries matching your query.
cluster::*> volume recovery-queue show
Vserver   Volume       Deletion Request Time     Retention Hours
--------- ------------ ------------------------- ---------------
nfs       testdel_1037 Fri Mar 11 19:02:40 2016  12

That volume will be in the system for 12 hours unless I purge it out of the queue. That will free space up immediately, but will also remove the chance of being able to recover the volume. Run this command only if you’re sure the volume should be deleted.

cluster::*> volume recovery-queue purge -volume testdel_1037
Initializing
cluster::*> volume recovery-queue show
This table is currently empty.
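The flip side is recovering a volume that was deleted by mistake. Before the retention window expires, a recover looks something like this (using the same volume from above); after it comes back, it may need to be renamed, brought online, and remounted to its junction path:

cluster::*> volume recovery-queue recover -vserver nfs -volume testdel_1037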

Pretty straightforward, eh?

Pretty cool, too. I am a big fan of this feature, even if it means an extra step to delete a volume quickly. Better safe than sorry and all.

There is also a KB article on this, with a link to a video. It requires a valid NetApp support login to view:

https://kb.netapp.com/support/index?page=content&id=1014958

This KB shows how to enable it (if it’s somehow disabled):

https://kb.netapp.com/support/index?page=content&id=1015626


Clustered Data ONTAP 8.3.2 is now GA!

Several months ago, I wrote a post describing the new 8.3.2RC1 release and what features it includes. You can read that here.

Now, clustered Data ONTAP 8.3.2 is generally available (GA)! If you’re curious what GA means, check that out here.

You can get the new release here:

http://mysupport.netapp.com/NOW/download/software/ontap/8.3.2/


Usually, releases don’t add new features between the RC and GA releases, but due to popular demand, 8.3.2 has a few nuggets to add in GA, in addition to the things that were added in 8.3.2RC1.

These include:

  • Simplified System Manager workflows for AFF (basically templates for specific NAS workloads)
  • LDAP signing and sealing
  • FLI support for AFF

In addition, a number of bug fixes are included in this release, so it’s a good idea to schedule a window to upgrade your cluster. Remember, upgrades are non-disruptive!

Updated NetApp NFS Technical Reports Available for Clustered Data ONTAP 8.3.2!

Clustered Data ONTAP 8.3.2 GA is here!

Because of that, the latest updates to the following TRs are now publicly available!

Available now:

TR-4067: Clustered Data ONTAP NFS Implementation and Best Practice Guide

This is essentially the NFS Bible for clustered Data ONTAP. Read it if you ever plan on using NFS with clustered Data ONTAP.

TR-3580: NFSv4 Enhancements and Best Practices Guide for Data ONTAP Implementation

This TR covers NFSv4 in both 7-Mode and cDOT. Think of it as a companion piece to TR-4067.

TR-4379: Name Services Best Practice Guide

This TR covers best practices for DNS, netgroups, LDAP, NIS and other related items to name services in clustered Data ONTAP.

Coming soon:


TR-4073: Secure Unified Authentication

This one is being updated and does not currently have a timetable for release. Keep checking back here for more information.

Other Updated TRs of interest:

TR-4052: Successfully Transitioning to Clustered Data ONTAP 

This includes a new section on Copy-Free Transition.


Introducing: Copy-Free Transition

Clustered Data ONTAP 8.3.2RC1 was announced last week and included many enhancements to ONTAP, including a feature called Copy-Free Transition.

A number of people knew about this feature prior to the cDOT release because they attended Insight 2015 and witnessed either a live demo of the feature or the session presented by Jay White (CR-2845-2: Using the 7-Mode Transition Tool).

We talked a bit about CFT in Episode 15 of the Tech ONTAP Podcast.

There’s also a video demo of Copy-Free Transition available here:

If you’re not familiar with Copy-Free Transition (CFT), then here’s a brief rundown…


What is Copy-Free Transition?

Prior to cDOT 8.3.2, transitioning to clustered Data ONTAP involved copying your data from 7-Mode to clustered Data ONTAP using one of the many tools available. Architecturally and structurally, clustered Data ONTAP is very different from 7-Mode, which precluded the ability to upgrade in place.

Essentially, you would use one of the following migration options:

  • Use the 7-Mode Transition Tool (7MTT) which leverages SnapMirror to replicate data from 7-Mode to clustered Data ONTAP
  • An application-based migration option (such as Storage vMotion from VMware)
  • File copy options such as ndmpcopy, RoboCopy, rsync, etc.
  • Using Foreign LUN Import

As the above migration options are all methods that copy data, the general term used to describe them is Copy-Based Transition (CBT).

With CFT in 8.3.2 and later, the 7MTT can be used to migrate to clustered Data ONTAP by simply halting your 7-Mode systems, recabling your disk shelves to a cDOT system, then importing the data and configuration into the cluster.

Voilà! Transition simplified!

Why do we want to use CFT?

For starters, you’d use CFT because it allows you to move a large amount of data in a fraction of the time it would take you to copy it. This “big bang” type of transition does require a little extra planning to make sure the clustered Data ONTAP environment is functional post-CFT, but the 7MTT contains extensive pre-checks and assessment capabilities to assist you with your transition planning.

Our live demo at Insight involved a 2-node HA pair with 2 data aggregates and 4 volumes. These volumes served NFS, CIFS and iSCSI data. We were able to complete a live migration in less than 30 minutes, start to finish.

I wasn’t just wearing a Flash costume for giggles – I wanted to emphasize how fast CFT can be.


The guidance from engineering I’ve heard is 3-8 hours, but they’ve been *very* generous in the amount of time built in for cabling the shelves. The time to completion is also dictated by the overall number of objects in the system (i.e., number of volumes, qtrees, quotas, exports, etc.) and not the size of the dataset. That’s because the 7MTT has to build the configuration on the cDOT system, and that takes a number of ZAPI calls. Fundamentally, the message here is that you can do CFT, and roll back if necessary, within a single maintenance window. The main variable in the timing will be how long it takes to re-cable or move disk shelves and reconnect clients.

The actual conversion of the 7-Mode volumes is relatively quick.

Anecdotally, I heard about a customer that did an early preview of CFT with multiple terabytes of data. The cutover after the shelves were moved took 30 minutes. That is… impressive.

That timing is not guaranteed, however – it’s a good idea to plan the 3-8 hours into your window.

Aside from the time it takes to transition, using CFT is also a bonus for people who did not want to purchase/rent swing gear to move data (aside from the minimal amount of equipment needed to bring the cDOT cluster up), or people that simply wanted to keep their existing shelves that they already had support on.

Rather than having to copy the data from 7-Mode to a swing system and then to a cDOT system, you can now simply use the existing gear you have.

The sweet spot for CFT is really unstructured NAS data, such as home directories. These datasets can potentially have thousands or millions of objects with corresponding ACLs. CFT allows for a massively simplified transition of this type of data.


What do I need for CFT?

This is a short list of what you currently need for CFT. Keep in mind that the product documentation for the cDOT release is the final word, so always check there.

Currently, you need:

  • 7-Mode 8.1.4P4/8.1.4P9 (source system)
  • Clustered Data ONTAP 8.3.2RC1 or later (destination)
  • 7MTT 2.2 or later
  • 64-bit aggregates
  • A minimally pre-configured* storage virtual machine on the destination cluster – one per vFiler/node
  • If using CIFS, a CIFS server on the destination
  • An HA pair with no data on it other than the cluster config/SVM placeholders
  • Functioning SP modules on the 7-Mode systems

*Minimally pre-configured here means you need a vsroot volume. If CIFS is involved, you need a data LIF, DNS configuration and a CIFS server pre-created in the same domain as the source CIFS server.

If you have a cluster with existing data on it, you can still use CFT, but you have to have at least a 4-node cluster with one HA pair evacuated of all data. Otherwise, 7MTT won’t allow the CFT to continue.

For platform support, please check the documentation, as those are subject to change.

Also keep in mind that this is a version 1.0 of the feature, so there will be more support for things as the feature matures.

What isn’t currently supported by CFT?

  • SnapMirror sources and destinations are supported, but SnapVault currently is not.
  • MetroCluster is currently not supported.
  • 32-bit aggregates are not supported, but can be upgraded to 64-bit prior to running CFT.
  • Systems containing traditional volumes (TradVols), but let’s be real – who uses those still? 🙂
  • Currently, clusters with existing datasets are not supported (must have an evacuated HA pair)

What happens during the CFT process?

In our demo, we had the following graphic:

[Image: CFT process flowchart]

In that graphic, gear icons represent automated processes and an “M” represents manual processes. The good thing about CFT is that it’s super easy because it’s mostly automated. The 7MTT handles most of it for you – even the halting of the 7-Mode systems.

Here’s a rundown of each part of that flowchart. For more details, check the product documentation and TR-4052 (not updated yet, but it should be in time for 8.3.2 GA).

Keep in mind that during the 7MTT run, each section will have a window that shows exactly what is happening at each phase.

Start CFT Migration

This covers starting the 7MTT and adding the 7-Mode HA pair and the cluster management LIF to the tool. It does not cover the initial up-front planning prior to the migration, so keep in mind that all of that has to take place before this part.

During the “Start CFT” portion, you will also select the data LIFs you want to migrate, choose the volumes, and define the volume paths. You will also map the vFilers you are migrating to the SVM placeholders on the cluster.

Planning and Pre-checks

This portion of CFT is an automated task that runs a series of pre-canned checks against 7-Mode and cDOT to ensure the source and destination are ready and compatible, and it looks for anything 7-Mode is doing that is not currently supported in cDOT. If anything fails, the tool makes you correct the issues before you continue, so you can’t shoot yourself in the foot.

Apply SVM Configuration

This automated process takes the information gathered from 7-Mode and applies it to cDOT. This includes the data LIFs – they get created on the SVM and then placed into a “down” state to avoid IP conflicts.

Test SVM Configuration

Here, you would manually ensure that the SVM configuration has been applied correctly. Check the data LIFs, etc.

Verify Cutover Readiness

This is another pre-check that is essentially in place in case you did the pre-check a week ago and need to verify nothing has changed since then.

Disconnect clients

This is a manual process and the start of the “downtime” portion of CFT – we don’t want clients attached to the 7-Mode system during the export/halt phase.

Export & Halt 7-Mode Systems

This is an automated process that is done by the 7MTT. It leverages the SP interfaces on the 7-Mode systems to do a series of halts and reboots, as well as booting into maintenance mode to remove disk ownership. We’re almost there!

Cable Disk Shelves

Another manual process – you essentially move the cables from the 7-Mode system to the cDOT system. You might even have to physically move shelves or heads, depending on the datacenter layout.

Verify Cabling

This is an automated 7MTT task. It simply looks for the disks and ensures they can be seen. However, it’s a good idea to do some visual checks, as well as potentially make use of Config Advisor or the 7MTT Cabling Guide.

Import Data & Configuration

This automated phase will assign the disks to the cDOT systems, as well as import the remaining configuration that could not be added previously (we need volumes to attach to quotas, etc… volumes had to come over with the shelves). This is also where the actual conversion of the volumes from 7-Mode style to cDOT style takes place.

Pre-prod verification

This is where you need to check the cDOT cluster to ensure your transitioned data is in place and able to be accessed as expected.

Reconnect clients

This is the “all clear” signal to your clients to start using the cluster. Keep in mind that if you intend to roll back to 7-Mode at any point, data written to the cluster from here on could be lost, as the rollback entails reverting to an aggregate-level snapshot.

Commit

This is the point of difficult return – once you do this, the aggregate level snapshots you could use to roll back will be deleted. That means, if you plan on going back to 7-Mode, you will be using a copy-based method. Be sure to make your decision quickly!

Rolling back to 7-Mode

If, for some strange reason, you have to roll back to 7-Mode, be sure you decide on it prior to committing CFT. In our demo, roll back was simple, but not automated by the 7MTT. To make the process easy and repeatable, I actually scripted it out using a simple shell script. Worked pretty well every time, provided people followed the directions. 🙂

But, it is possible, and if you don’t commit, it’s pretty fast.

If you have any questions about CFT that I didn’t cover here, feel free to comment.

Also, check out this excellent summary blog on transition by Dimitris Krekoukias (@dkrek):

http://recoverymonkey.org/2016/02/05/7-mode-to-clustered-ontap-transition/

Clustered Data ONTAP 8.3.2 is here! (plus, Insight EMEA 2015 recap)

It’s finally here!

Clustered Data ONTAP 8.3.2RC1 is up and available here:

http://support.netapp.com/NOW/download/software/ontap/8.3.2

I can also finally talk about the live demo I was working on for NetApp Insight. I even made some videos for it:


As for what’s in the release, I’ll keep it high level here and start with what I was showing off…

Copy-Free Transition!

One of the biggest pieces of FUD being thrown around regarding NetApp was the fact that the transition from Data ONTAP operating in 7-Mode to clustered Data ONTAP was a “forklift upgrade.”

Given the fact that the operating systems were vastly different, it wasn’t 100% inaccurate, but it was definitely an exaggeration in my opinion.

However, one of the main things people kept asking for was a way to simply move the shelves from the old 7-Mode heads to the new cDOT heads.

“Shouldn’t it be easy?” they (rightfully) asked.

Well, now it is. The live demo we did showed a Copy-Free Transition (CFT) in real-time from 7-Mode to cDOT and it took less than 20 minutes.

Some other notable features of this release:

  • SAN Break-Resync-Break for 7-Mode to cDOT SnapMirror relationships
  • MCC Improvements – Cisco 9250i switch support, sharing 8G card FC-VI/FC-Initiator
  • In-line Deduplication
  • Volume Re-host
  • QoS node limit increase (24 nodes!)
  • SVM DR MSID replication (filehandles!)

I wrote up a more in-depth post on CFT after Insight. Also, be sure to check out the TechONTAP Podcast episode where we cover all the new goodness of the 8.3.2 cDOT release!

Episode 15 – Clustered Data ONTAP 8.3.2RC1

NetApp Insight EMEA Recap


NetApp Insight EMEA is all over, and if you noticed I didn’t have any recaps posted, it was because I was INSANELY busy.

In addition to my sessions and the CFT demo, I also joined the TechONTAP podcast as a co-host, which is awesome – but when you have daily recaps, that means you are doing daily edits. So, not a lot of sleep involved.

If you’re interested, here are the daily recaps of the conference:

Day 1 Recap

Day 2 Recap

Day 3 Recap

Day 4 Recap

The Flash did make another appearance, and he even did a video with our All Flash FAS TME, Dan Isaacs, called “Flash on Flash”:

I also had the opportunity to talk Data Fabric with NetApp A-Team member Paul Stringfellow:

Stay tuned for more information about 8.3.2 on this blog and by following me on Twitter @NFSDudeAbides, as well as listening to the TechONTAP podcast.

NetAppInsight::Come to booth 303 for a special live demo!

NetApp Insight 2015 in Las Vegas is finally upon us and we’re past day 1 of the show. I’ve already delivered one session and did the very first live demo of the stuff we’ve been working so hard on the past few weeks.

I’d like to say we did it without a hitch, but it’s a live demo – Murphy’s law dictates that whatever can happen, will happen. And with live demos, this is especially true. 🙂

But, even with a couple of snafus, we were able to complete the demo and I think it went pretty well. People seemed genuinely excited about what we were showing and had plenty of good questions.

Am I being vague?

Why yes, yes I am. 🙂

You see, we can’t really say *what* exactly the demo is about right now on social media. But I can tell you there is a demo going on at Booth 303 at NetApp Insight 2015 in Las Vegas in Insight Central at 12:15PM and 2:15PM PST.

If you’re not familiar with the layout of the exhibit hall, just enter the doors, go through a couple of giant “N”s, pass the Lab on Demand area and bear left. We’re across from the NetApp Social Media Booth.

The demos run Tuesday through Thursday. I’ll be at the 2:15PM slots on Tuesday and Wednesday, and both slots on Thursday.

So come on out and see a live demo! You might even get to see how we handle stuff breaking in real time. 🙂

Other stuff

Also, check out my sessions going on this week.

1884-2: Unlocking the Mysteries of Multiprotocol NAS 

This is a level 2 session where I will attempt to demystify multiprotocol NAS and discuss some best practices with regards to clustered Data ONTAP.

  • Tuesday, 10/13, 10:30AM PST (Jasmine A)
  • Wednesday, 10/14, 1PM PST (Breakers B)
  • Thursday, 10/15, 9AM PST (Jasmine C)

1881-3-TT: SecD Deep Dive

This is a level 3 session where I go pretty deep into how SecD works and how to use it to troubleshoot.

  • Wednesday, 10/14, 10:30AM PST (Palm D)

#NetAppInsight:: What are we up to @NetAppInsightUS 2015?

NetApp Insight 2015 – Las Vegas is getting closer. Have you built your schedule yet? Did you download the mobile app (available for free in the Android and Apple app stores)?

I fly out on Saturday, but I have been working night and day with my colleagues on a special project that will make its debut at the conference.

We will have our own booth in Insight Central (across from the ONTAP booth) and will be doing demos twice a day. I can’t tell you much about *what* we’re doing until the conference itself, but know that it involved a travel rack and a DIY FlexPod that I have affectionately dubbed “FrankenPod.” (Just in time for Halloween!)

I’ve made some promotional videos leading up to the conference with footage I shot during the process. (Mostly because it was fun to do)

Check them out on my NetApp Insight 2015 YouTube playlist or click on the embedded videos below:

Also check out “Things You Should Know About NetApp Insight 2015.”

If you use Twitter, follow me @NFSDudeAbides and @NetAppInsightUS. Use the #NetAppInsight hashtag for updates and more info.

See you all there!

TECH::The Foreign LUN Import (FLI) Technical Report is Available!

One of the challenges seen in clustered Data ONTAP (cDOT) was the migration of block-based storage from ONTAP operating in 7-Mode. Even with a supported tool set, the transitions were clunky, tedious, and had to be done offline.

Starting in cDOT 8.3.1, support for Foreign LUN Import from 7-Mode systems was added to make that transition easier, as well as to minimize (or eliminate) downtime for LUN transitions.


With this new feature addition, there is also an excellent how-to TR on the process available as of July 2015. Be sure to check it out here:

SAN Migration Using Foreign LUN Import