NFS Kerberos in ONTAP Primer

Fun fact!

Kerberos was named after Cerberus, the hound of Hades, which protected the gates of the underworld with its three heads of gnashing teeth.


Kerberos in IT security isn’t a whole lot different; it’s pretty effective at stopping intruders and is literally a three-headed monster.

In my day-to-day role as a Technical Marketing Engineer for NFS, I find that some of the most challenging questions I get are about NFS mounts using Kerberos. This is especially true now, as IT organizations focus more and more on securing their data, and Kerberos is one way to do that. CIFS/SMB already does a nice job of this, and it’s pretty easily integrated without having to do a ton on the client or storage side.

With NFS Kerberos, however, there are a ton of moving parts and not a ton of expertise that spans those moving parts. Think for a moment what all is involved here when dealing with ONTAP:

  • DNS
  • KDC server (Key Distribution Center)
  • Client/principal
  • NFS server/principal
  • ONTAP
  • NFS
  • LDAP/name services

This blog post isn’t designed to walk you through all those moving parts; that’s what TR-4073 was written for. Instead, this blog is going to simply walk through the workflow of what happens during an NFS mount using Kerberos and where things can fail/common failure scenarios. This post will focus on Active Directory KDCs, since that’s what I see most and get the most questions on. Other UNIX-based KDCs are either not as widely used, or the admins running them are ninjas that never need any help. 🙂

Common terms

First, let’s cover a few common terms used in NFS Kerberos.

Storage Virtual Machine (SVM)

This is what clustered ONTAP uses to present NAS and SAN storage to clients. SVMs act as tenants within a cluster. Think of them as “virtualized storage blades.”

Key Distribution Center (KDC)

The Kerberos ticket headquarters. This stores all the passwords, objects, etc. for running Kerberos in an environment. In Active Directory, domain controllers are KDCs and replicate to other DCs in the environment, which makes Active Directory an ideal platform to run Kerberos on due to ease of use and familiarity. As a bonus, Active Directory is already primed with UNIX attributes for Identity Management with LDAP. (Note: Windows 2012 has UNIX attributes by default; prior to 2012, you had to manually extend the schema.)

Kerberos principals

Kerberos principals are objects within a KDC that can have tickets assigned. Users can own principals. Machine accounts can own principals. However, simply creating a user or machine account doesn’t mean you have created a principal. Those are stored within the object’s LDAP schema attributes in Active Directory. Generally speaking, it’s one of either:

  • servicePrincipalName (SPN)
  • userPrincipalName (UPN)

These get set when adding computers to a domain (including joining Linux clients), as well as when creating new users (every user gets a UPN). Principals include three components, broken down below (with an example after the list).

  1. Primary – this defines the type of principal (usually a service such as ldap, nfs, or host) and is followed by a “/”. Not all principals have primary components; for example, most users are simply user@REALM.COM.
  2. Secondary – this defines the name of the principal (such as jimbob)
  3. Realm – This is the Kerberos realm and is usually defined in ALL CAPS and is the name of the domain your principal was added into (such as CONTOSO.COM)
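
Putting that together, here’s how a couple of the principals used later in this post break down (the names are from my lab):

nfs/demo.ntap.local@NTAP.LOCAL – primary “nfs”, secondary “demo.ntap.local”, realm “NTAP.LOCAL”
student1@NTAP.LOCAL – no primary, secondary “student1”, realm “NTAP.LOCAL”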

Keytabs

A keytab file allows a client or server participating in an NFS mount to send AS (Authentication Service) requests to the KDC without typing a password. Think of this as the principal “logging in” to the KDC, similar to what you’d do with a username and password. Keytab files can make their way to clients one of two ways.

  1. Manually creating and copying the keytab file to the client (old school)
  2. Using the domain join tool of your choice (realmd, net ads/samba, adcli, etc.) on the client to automatically negotiate the keytab and machine principals on the KDC (recommended; see the example below)
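
For example, on a CentOS/RHEL 7 client, a realmd-based join that negotiates the keytab for you looks roughly like this (the package list and domain name are from my lab setup; adjust for your environment):

# yum install -y realmd sssd adcli krb5-workstation
# realm join --user=Administrator NTAP.LOCAL

The resulting keytab is what you see in the next example.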

Keytab files created with the domain join tools will contain multiple entries for Kerberos principals. Generally, this includes service principal names (SPNs) for host/shortname@REALM.COM and host/fully.qualified.name@REALM.COM, plus a UPN for the machine account, such as MACHINE$@REALM.COM. The auto-generated keytabs will also include multiple entries for each principal with different encryption types (enctypes). The following is an example of the keytab on a CentOS 7 box joined to an AD domain using realm join:

# klist -kte
Keytab name: FILE:/etc/krb5.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
 3 05/15/2017 18:01:39 host/centos7.ntap.local@NTAP.LOCAL (des-cbc-crc)
 3 05/15/2017 18:01:39 host/centos7.ntap.local@NTAP.LOCAL (des-cbc-md5)
 3 05/15/2017 18:01:39 host/centos7.ntap.local@NTAP.LOCAL (aes128-cts-hmac-sha1-96)
 3 05/15/2017 18:01:39 host/centos7.ntap.local@NTAP.LOCAL (aes256-cts-hmac-sha1-96)
 3 05/15/2017 18:01:39 host/centos7.ntap.local@NTAP.LOCAL (arcfour-hmac)
 3 05/15/2017 18:01:39 host/CENTOS7@NTAP.LOCAL (des-cbc-crc)
 3 05/15/2017 18:01:39 host/CENTOS7@NTAP.LOCAL (des-cbc-md5)
 3 05/15/2017 18:01:39 host/CENTOS7@NTAP.LOCAL (aes128-cts-hmac-sha1-96)
 3 05/15/2017 18:01:39 host/CENTOS7@NTAP.LOCAL (aes256-cts-hmac-sha1-96)
 3 05/15/2017 18:01:39 host/CENTOS7@NTAP.LOCAL (arcfour-hmac)
 3 05/15/2017 18:01:39 CENTOS7$@NTAP.LOCAL (des-cbc-crc)
 3 05/15/2017 18:01:39 CENTOS7$@NTAP.LOCAL (des-cbc-md5)
 3 05/15/2017 18:01:39 CENTOS7$@NTAP.LOCAL (aes128-cts-hmac-sha1-96)
 3 05/15/2017 18:01:39 CENTOS7$@NTAP.LOCAL (aes256-cts-hmac-sha1-96)
 3 05/15/2017 18:01:39 CENTOS7$@NTAP.LOCAL (arcfour-hmac)

Encryption types (enctypes)

Encryption types (or enctypes) are the level of encryption used for the Kerberos conversation. The client and KDC will negotiate the level of enctype used. The client will tell the KDC “hey, I want to use this list of enctypes. Which do you support?” and the KDC will respond “I support these, in order of strongest to weakest. Try using the strongest first.” In the example above, this is the order of enctype strength, from strongest to weakest:

  • AES-256
  • AES-128
  • ARCFOUR-HMAC
  • DES-CBC-MD5
  • DES-CBC-CRC

The reason a keytab file includes weaker enctypes like DES or ARCFOUR is backward compatibility; older KDCs (for example, Windows 2003 DCs) don’t support AES enctypes. In some cases, the enctypes themselves can cause Kerberos issues due to lack of support: Windows 2008 and later don’t support DES unless you explicitly enable it, and ARCFOUR isn’t supported in clustered ONTAP for NFS Kerberos. In those cases, it’s good to modify the machine accounts to strictly define which enctypes to use for Kerberos.
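
As a sketch of how you might restrict a machine account to AES only (this uses the ActiveDirectory PowerShell module’s -Replace parameter; the value 24 is the msDS-SupportedEncryptionTypes bitmask for AES-128 + AES-256, and the account name is from my lab):

PS C:\> Set-ADComputer CENTOS7$ -Replace @{"msDS-SupportedEncryptionTypes"=24}
PS C:\> Get-ADComputer CENTOS7$ -Properties msDS-SupportedEncryptionTypes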

What you need before you try mounting

This is a quick list of things that have to be in place before you can expect Kerberos with NFS to work properly. If I left something out, feel free to remind me in the comments. There’s so much info involved that I occasionally forget some things. 🙂

KDC and client – The KDC is a given – in this case, Active Directory. The client would need to have some things installed/configured before you try to join it, including a valid DNS server configuration, Kerberos utilities, etc. This varies depending on client and would be too involved to get into here. Again, TR-4073 would be a good place to start.

DNS entries for all clients and servers participating in the NFS Kerberos operation – this includes forward and reverse (PTR) records for the clients and servers. The DNS names *must* match the SPN names. If they don’t, then when you try to mount, the DNS lookup will return the name hostname1 and the client will use that to look up the SPN host/hostname1. If the SPN was actually created as nfs/hostname2, the Kerberos attempt fails with “PRINCIPAL_UNKNOWN.” This is also true for Kerberos in CIFS/SMB environments. In ONTAP, a common mistake is naming the CIFS server or NFS Kerberos SPN after the SVM (such as SVM1) while the DNS names are something totally different (such as cifs.domain.com).
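
A quick sanity check from the client that the forward and reverse records line up (the hostnames and IPs here are from my lab):

# nslookup demo.ntap.local
# nslookup 10.193.67.219
# nslookup centos7.ntap.local
# nslookup 10.193.67.225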

Valid Kerberos SPNs and UPNs – When you join a Linux client to a domain, the machine account and SPNs are created automatically. However, the UPN is not. A missing UPN on a machine account can create issues with some Linux services that use Kerberos keytab files to authenticate. For example, Red Hat’s SSSD can fail to bind to LDAP when a Kerberos service principal is specified via the ldap_sasl_authid option. The error you’d see would be “PRINCIPAL_UNKNOWN,” which would drive you batty because it’s a principal you *know* exists in your environment. That’s because it’s trying to find the UPN, not the SPN. You can manage the SPN and UPN via the Active Directory attributes tab in the Advanced Features view. You can query whether SPNs exist via the setspn command (use /q to query by SPN name) in the CLI or PowerShell.

PS C:\> setspn /q host/centos7.ntap.local
Checking domain DC=NTAP,DC=local
CN=CENTOS7,CN=Computers,DC=NTAP,DC=local
 HOST/centos7.ntap.local
 HOST/CENTOS7

Existing SPN found!
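
If a machine account is missing its UPN, one way to add it is with the -Replace parameter in PowerShell (a sketch; the UPN value here simply mirrors the keytab format shown earlier):

PS C:\> Set-ADComputer CENTOS7$ -Replace @{userPrincipalName="host/centos7.ntap.local@NTAP.LOCAL"}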

You can view a user’s UPN and SPN with the following PowerShell command:

PS C:\> Get-ADUser student1 -Properties UserPrincipalName,ServicePrincipalName

DistinguishedName : CN=student1,CN=Users,DC=NTAP,DC=local
Enabled : True
GivenName : student1
Name : student1
ObjectClass : user
ObjectGUID : d5d5b526-bef8-46fa-967b-00ebc77e468d
SamAccountName : student1
SID : S-1-5-21-3552729481-4032800560-2279794651-1108
Surname :
UserPrincipalName : student1@NTAP.local

And a machine account’s with:

PS C:\> Get-ADComputer CENTOS7$ -Properties UserPrincipalName,ServicePrincipalName

DistinguishedName : CN=CENTOS7,CN=Computers,DC=NTAP,DC=local
DNSHostName : centos7.ntap.local
Enabled : True
Name : CENTOS7
ObjectClass : computer
ObjectGUID : 3a50009f-2b40-46ea-9014-3418b8d70bdb
SamAccountName : CENTOS7$
ServicePrincipalName : {HOST/centos7.ntap.local, HOST/CENTOS7}
SID : S-1-5-21-3552729481-4032800560-2279794651-1140
UserPrincipalName : HOST/centos7.ntap.local@NTAP.LOCAL

Network Time Protocol (NTP) – With Kerberos, there is a 5 minute default time skew window. If a client and server/KDC’s time is outside of that window, Kerberos requests will fail with “Access denied” and you’d see time skew errors in the cluster logs. This KB covers it nicely:

https://kb.netapp.com/support/s/article/ka11A0000001V1YQAU/Troubleshooting-Workflow-CIFS-Authentication-failures?language=en_US

A common issue I’ve seen with this is time zone differences or daylight saving time issues. I’ve often seen the wall clock time look identical on server and client while the time zone or month/date differs, causing the skew.

The NTP requirement is actually a “make sure your time is up to date and in sync on everything” requirement, but NTP makes that easier.
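
For reference, pointing the cluster at an NTP server and checking the time takes just a couple of commands (the server name is a placeholder; the DCs themselves are a common NTP source):

cluster::> cluster time-service ntp server create -server dc1.ntap.local
cluster::> cluster date show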

Kerberos to UNIX name mappings – In ONTAP, we authenticate via name mappings not only for CIFS/SMB, but also for Kerberos. When a client presents its Kerberos principal to the cluster during the AS/ST (service ticket) exchange, that principal has to map to a valid UNIX user. The UNIX user mapping depends on what type of principal is coming in. If you don’t have a valid name mapping rule, you’d see something like this in the event log:

5/16/2017 10:24:23 ontap9-tme-8040-01
 ERROR secd.nfsAuth.problem: vserver (DEMO) General NFS authorization problem. Error: RPC accept GSS token procedure failed
 [ 8 ms] Acquired NFS service credential for logical interface 1034 (SPN='nfs/demo.ntap.local@NTAP.LOCAL').
 [ 11] GSS_S_COMPLETE: client = 'CENTOS7$@NTAP.LOCAL'
 [ 11] Trying to map SPN 'CENTOS7$@NTAP.LOCAL' to UNIX user 'CENTOS7$' using implicit mapping
 [ 12] Using a cached connection to oneway.ntap.local
**[ 14] FAILURE: User 'CENTOS7$' not found in UNIX authorization source LDAP.
 [ 15] Entry for user-name: CENTOS7$ not found in the current source: LDAP. Ignoring and trying next available source
 [ 15] Entry for user-name: CENTOS7$ not found in the current source: FILES. Entry for user-name: CENTOS7$ not found in any of the available sources
 [ 15] Unable to map SPN 'CENTOS7$@NTAP.LOCAL'
 [ 15] Unable to map Kerberos NFS user 'CENTOS7$@NTAP.LOCAL' to appropriate UNIX user

For service principals (SPNs) such as host/name or nfs/name, the mapping defaults to the primary component, so you’d need a UNIX user named host or nfs on the local SVM or in a name service like LDAP. Otherwise, you can create static krb-unix name mappings in the SVM to map to whatever user you like. If you want to use wildcards, regex, etc., you can do that. For example, this name mapping rule maps all SPNs coming in as {MACHINE}$@REALM.COM to root.

cluster::*> vserver name-mapping show -vserver DEMO -direction krb-unix -position 1

Vserver: DEMO
 Direction: krb-unix
 Position: 1
 Pattern: (.+)\$@NTAP.LOCAL
 Replacement: root
IP Address with Subnet Mask: -
 Hostname: -
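
If you need to create a rule like that from scratch, it’s a single command (same pattern and replacement as above):

cluster::*> vserver name-mapping create -vserver DEMO -direction krb-unix -position 1 -pattern "(.+)\$@NTAP.LOCAL" -replacement root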

To test the mapping, use diag priv:

cluster::*> diag secd name-mapping show -node node1 -vserver DEMO -direction krb-unix -name CENTOS7$@NTAP.LOCAL

'CENTOS7$@NTAP.LOCAL' maps to 'root'

You can map the SPN to root, pcuser, etc. – as long as the UNIX user exists locally on the SVM or in the name service.
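
If you’d rather rely on the implicit mapping, a sketch of creating that local UNIX user on the SVM looks like this (the UID and GID values are arbitrary placeholders):

cluster::> vserver services name-service unix-user create -vserver DEMO -user nfs -id 500 -primary-gid 500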

The workflow

Now that I’ve gotten some basics out of the way (and if you find that I’ve missed some, add to the comments), let’s look at how the workflow for an NFS mount using Kerberos would work, end to end. This is assuming we’ve configured everything correctly and are ready to mount, and that all the export policy rules allow the client to mount NFSv4 and Kerberos. If a mount fails, always check your export policy rules first.

Some common export policy issues include:

  • The export policy doesn’t have any rules configured
  • The vserver/SVM root volume doesn’t allow read access in the export policy rule for traversal of the / mount point in the namespace
  • The export policy has rules, but they are either misconfigured (clientmatch is wrong, read access disallowed, NFS protocol or auth method is disallowed) or they aren’t allowing the client to access the mount (Run export-policy rule show -instance)
  • The wrong/unexpected export policy has been applied to the volume (Run volume show -fields policy)

What’s unfortunate about trying to troubleshoot mounts with NFS Kerberos involved is that, regardless of the failures happening, the client will report:

mount.nfs: access denied by server while mounting

It’s a generic error and isn’t really helpful in diagnosing the issue.

In ONTAP, there is a command in admin privilege to check the export policy access for the client for troubleshooting purposes. Be sure to use it to rule out export issues.

cluster::> export-policy check-access -vserver DEMO -volume flexvol -client-ip 10.193.67.225 -authentication-method krb5 -protocol nfs4 -access-type read-write
 Policy Policy Rule
Path Policy Owner Owner Type Index Access
----------------------------- ---------- --------- ---------- ------ ----------
/ root vsroot volume 1 read
/flexvol default flexvol volume 1 read-write
2 entries were displayed.

The mount command is issued.

In my case, I use NFSv4.x, as that’s the NFS version designed with security in mind. Mounting without specifying a version defaults to the highest NFS version the client and server can negotiate. If NFSv4.x is disabled on the server, the client falls back to NFSv3.

# mount -o sec=krb5 demo:/flexvol /mnt
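
If you’d rather not rely on that negotiation, you can pin the NFS version explicitly with the vers option (a hypothetical variant of the same mount):

# mount -t nfs -o sec=krb5,vers=4.0 demo:/flexvol /mnt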

Once the mount command gets issued and Kerberos is specified, a few (ok, a lot of) things happen in the background.

While this stuff happens, the mount command will appear to “hang” as the client, KDC and server suss out if you’re going to be allowed access.

  • DNS lookups are done for the client hostname and server hostname (or reverse lookup of the IP address) to help determine what names are going to be used. Additionally, SRV lookups are done for the LDAP service and Kerberos services in the domain. DNS lookups are happening constantly through this process.
  • The client uses its keytab file to send an authentication service request (AS-REQ) to the KDC, along with what enctypes it has available. The KDC then verifies if the requested principal actually exists in the KDC and if the enctypes are supported.
  • If the enctypes are not supported, if the principal doesn’t exist in the KDC, or if there are DUPLICATE principals, the AS-REQ fails. If the principal exists and the enctypes check out, the KDC sends a successful reply (AS-REP).
  • Then the client will send a Ticket Granting Service request (TGS-REQ) to the KDC. This request looks up the NFS service principal named nfs/name. The name portion of the ticket is generated either from what was typed into the mount command (i.e., demo) or via reverse lookup (if we typed in an IP address to mount). The TGS exchange is what allows us to obtain a service ticket (ST), and it also negotiates supported enctypes for later. If the TGS-REQ between the KDC and client negotiates an enctype that ONTAP doesn’t support (for example, ARCFOUR), then the mount will fail later in the process.
  • If the TGS-REQ succeeds, a TGS-REP is sent. If the KDC doesn’t support the requested enctypes from the client, we fail here. If the NFS principal doesn’t exist (remember, it has to be in DNS and match exactly), then we fail.
  • Once the service ticket is acquired, the NFS client presents it to the NFS server in ONTAP via an NFS NULL call. The ticket information includes the NFS service SPN and the enctype used. If the NFS SPN doesn’t match what’s in “kerberos interface show,” the mount fails. If the enctype presented by the client isn’t supported or is disallowed in “permitted enctypes” on the NFS server, the request fails. The client would show “access denied.” (See the verification commands after this workflow.)
  • The NFS service SPN sent by the client is presented to ONTAP. This is where the krb-unix mapping takes place. ONTAP will first see if a user named “nfs” exists in local files or name services (such as LDAP, where a bind to the LDAP server and lookup takes place). If the user doesn’t exist, it will then check to see if any krb-unix name mapping rules were set explicitly. If no rules exist and mapping fails, ONTAP logs an error on the cluster and the mount fails with “Access denied.” If the mapping works, the mount procedure moves on to the next step.
  • After the NFS service ticket is verified, the client will send SETCLIENTID calls and then the NFSv4.x mount compound call (PUTROOTFH | GETATTR). The client and server are also negotiating the name@domainID string to make sure they match on both sides as part of NFSv4.x security.
  • Then, the client will try to run a series of GETATTR calls to “/” in the path. If we didn’t allow “read” access in the policy rule for “/” (the vsroot volume), we fail. If the ACLs/mode bits on the vsroot volume don’t allow at least traverse permissions, we fail. In a packet trace, we can see that the vsroot volume has only traverse permissions:
    V4 Reply (Call In 268) ACCESS, [Access Denied: RD MD XT], [Allowed: LU DL]

    We can also see that from the cluster CLI (“Everyone” only has “Execute” permissions in this NTFS security style volume):

    cluster::> vserver security file-directory show -vserver DEMO -path / -expand-mask true
    
    Vserver: DEMO
     File Path: /
     File Inode Number: 64
     Security Style: ntfs
     Effective Style: ntfs
     DOS Attributes: 10
     DOS Attributes in Text: ----D---
    Expanded Dos Attributes: 0x10
     ...0 .... .... .... = Offline
     .... ..0. .... .... = Sparse
     .... .... 0... .... = Normal
     .... .... ..0. .... = Archive
     .... .... ...1 .... = Directory
     .... .... .... .0.. = System
     .... .... .... ..0. = Hidden
     .... .... .... ...0 = Read Only
     UNIX User Id: 0
     UNIX Group Id: 0
     UNIX Mode Bits: 777
     UNIX Mode Bits in Text: rwxrwxrwx
     ACLs: NTFS Security Descriptor
     Control:0x9504
    
    1... .... .... .... = Self Relative
     .0.. .... .... .... = RM Control Valid
     ..0. .... .... .... = SACL Protected
     ...1 .... .... .... = DACL Protected
     .... 0... .... .... = SACL Inherited
     .... .1.. .... .... = DACL Inherited
     .... ..0. .... .... = SACL Inherit Required
     .... ...1 .... .... = DACL Inherit Required
     .... .... ..0. .... = SACL Defaulted
     .... .... ...0 .... = SACL Present
     .... .... .... 0... = DACL Defaulted
     .... .... .... .1.. = DACL Present
     .... .... .... ..0. = Group Defaulted
     .... .... .... ...0 = Owner Defaulted
    
    Owner:BUILTIN\Administrators
     Group:BUILTIN\Administrators
     DACL - ACEs
     ALLOW-NTAP\Domain Admins-0x1f01ff-OI|CI
     0... .... .... .... .... .... .... .... = Generic Read
     .0.. .... .... .... .... .... .... .... = Generic Write
     ..0. .... .... .... .... .... .... .... = Generic Execute
     ...0 .... .... .... .... .... .... .... = Generic All
     .... ...0 .... .... .... .... .... .... = System Security
     .... .... ...1 .... .... .... .... .... = Synchronize
     .... .... .... 1... .... .... .... .... = Write Owner
     .... .... .... .1.. .... .... .... .... = Write DAC
     .... .... .... ..1. .... .... .... .... = Read Control
     .... .... .... ...1 .... .... .... .... = Delete
     .... .... .... .... .... ...1 .... .... = Write Attributes
     .... .... .... .... .... .... 1... .... = Read Attributes
     .... .... .... .... .... .... .1.. .... = Delete Child
     .... .... .... .... .... .... ..1. .... = Execute
     .... .... .... .... .... .... ...1 .... = Write EA
     .... .... .... .... .... .... .... 1... = Read EA
     .... .... .... .... .... .... .... .1.. = Append
     .... .... .... .... .... .... .... ..1. = Write
     .... .... .... .... .... .... .... ...1 = Read
    
    ALLOW-Everyone-0x100020-OI|CI
     0... .... .... .... .... .... .... .... = Generic Read
     .0.. .... .... .... .... .... .... .... = Generic Write
     ..0. .... .... .... .... .... .... .... = Generic Execute
     ...0 .... .... .... .... .... .... .... = Generic All
     .... ...0 .... .... .... .... .... .... = System Security
     .... .... ...1 .... .... .... .... .... = Synchronize
     .... .... .... 0... .... .... .... .... = Write Owner
     .... .... .... .0.. .... .... .... .... = Write DAC
     .... .... .... ..0. .... .... .... .... = Read Control
     .... .... .... ...0 .... .... .... .... = Delete
     .... .... .... .... .... ...0 .... .... = Write Attributes
     .... .... .... .... .... .... 0... .... = Read Attributes
     .... .... .... .... .... .... .0.. .... = Delete Child
     .... .... .... .... .... .... ..1. .... = Execute
     .... .... .... .... .... .... ...0 .... = Write EA
     .... .... .... .... .... .... .... 0... = Read EA
     .... .... .... .... .... .... .... .0.. = Append
     .... .... .... .... .... .... .... ..0. = Write
     .... .... .... .... .... .... .... ...0 = Read
  • If we have the appropriate permissions to traverse “/” then the NFS client attempts to find the file handle for the mount point via a LOOKUP call, using the file handle of vsroot in the path. It would look something like this:
    V4 Call (Reply In 271) LOOKUP DH: 0x92605bb8/flexvol
  • If the file handle exists, it gets returned to the client in the LOOKUP reply.
  • Then the client uses that file handle to run GETATTRs to see if it can access the mount:
    V4 Call (Reply In 275) GETATTR FH: 0x1f57355e

If all is clear, our mount succeeds!
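
If the mount fails at the SPN or enctype checks described in the workflow above, it’s worth verifying the SVM-side Kerberos configuration from the cluster CLI (a sketch; I’m showing only the commands, since the output varies by environment):

cluster::> vserver nfs kerberos interface show -vserver DEMO
cluster::> vserver nfs kerberos realm show -vserver DEMO
cluster::> vserver nfs show -vserver DEMO -fields permitted-enc-types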

If you’re interested, the successful Kerberos mount traces can be found here:

https://github.com/whyistheinternetbroken/TR-4073-setup/blob/master/nfs-krb5-trace-from-client.pcap

https://github.com/whyistheinternetbroken/TR-4073-setup/blob/master/nfs-krb5-trace-from-kdc.pcapng

But we’re not done… now the user that wants to access the mount has to go through another ticket process. In my case, I used a user named “student1.” This is because a lot of the Kerberos/NFSv4.x requests I get are generated by universities interested in setting up multiprotocol-ready home directories.

When a user like student1 wants to get into a Kerberized NFS mount, they can’t just cd into it. That would look like this:

# su student1
sh-4.2$ cd /mnt
sh: cd: /mnt: Not a directory

Oh look… another useless error! If I were to take that error literally, I would think “that mount doesn’t even exist!” But, it does:

sh-4.2$ mount | grep mnt
demo:/flexvol on /mnt type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=krb5,clientaddr=10.193.67.225,local_lock=none,addr=10.193.67.219)

What that error actually means is that the user requesting access has no valid Kerberos credentials: there’s been no AS exchange (the “login”), so there’s no TGT (ticket-granting ticket) with which to request a service ticket for NFS (nfs/server-hostname). We can see that via the klist -e command.

sh-4.2$ klist -e
klist: Credentials cache keyring 'persistent:1301:1301' not found

Before you can get into a mount that is only allowing Kerberos access, you have to get a Kerberos ticket. On Linux, you can do that via the kinit command, which is akin to a Windows login.

sh-4.2$ kinit
Password for student1@NTAP.LOCAL:
sh-4.2$ klist -e
Ticket cache: KEYRING:persistent:1301:1301
Default principal: student1@NTAP.LOCAL

Valid starting Expires Service principal
05/16/2017 15:54:01 05/17/2017 01:54:01 krbtgt/NTAP.LOCAL@NTAP.LOCAL
 renew until 05/23/2017 15:53:58, Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96

Now that I have my ticket, I can cd into the mount. When I cd into a Kerberized NFS mount, the client makes TGS requests to the KDC (seen in the trace in packet 101) for the NFS service ticket. If that process is successful, we get access:

sh-4.2$ cd /mnt
sh-4.2$ pwd
/mnt
sh-4.2$ ls
c0 c1 c2 c3 c4 c5 c6 c7 newfile2 newfile-nfs4
sh-4.2$ klist -e
Ticket cache: KEYRING:persistent:1301:1301
Default principal: student1@NTAP.LOCAL

Valid starting Expires Service principal
05/16/2017 15:55:32 05/17/2017 01:54:01 nfs/demo.ntap.local@NTAP.LOCAL
 renew until 05/23/2017 15:53:58, Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96
05/16/2017 15:54:01 05/17/2017 01:54:01 krbtgt/NTAP.LOCAL@NTAP.LOCAL
 renew until 05/23/2017 15:53:58, Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96

Now we’re done. (at least until our tickets expire…)

 

Behind the Scenes: Episode 86 – Veeam 9.5 Update 2

Welcome to Episode 86, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


We wrap up “Release Week” with an episode on Veeam’s new release with Veeam Technical Evangelist/NetApp A-Team member Michael Cade! (@michaelcade1)


Find out what new goodness is in Veeam’s latest release and get a rundown of what Veeam actually is. For the official blog:

https://newsroom.netapp.com/blogs/tech-ontap-podcast-episode-86-veeam-9-5-update-2/

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss


ONTAP 9.2RC1 is available!

Like clockwork, the 6 month cadence is upon us again.


ONTAP 9.2RC1 is available for download here:

http://mysupport.netapp.com/NOW/download/software/ontap/9.2RC1/

If you’re interested in a podcast where we cover the ONTAP 9.2 features, check it out here:

Also out: OnCommand (truly) Unified Manager 7.2:

http://mysupport.netapp.com/documentation/productlibrary/index.html?productID=61373

For now, let’s dive in a bit, shall we?

First of all, I made sure to upgrade my own cluster to show some of the new stuff off. It went off without a hitch.


Now, let’s start with one of the most eagerly awaited new features…

Aggregate Inline Deduplication

If you’re not familiar with deduplication, it’s a storage efficiency feature that replaces identical blocks with pointers to a single shared block instead of storing multiple copies of the same data. For example, if I am storing multiple JPEG images on a share (or even inside the same PowerPoint file), deduplication allows me to save storage space by storing just one copy of the data. Take an 8.4MB photo I took in Point Reyes, California as an example.


If I store two copies of the file on a share (no deduplication), that means I use up 16MB.


If I use deduplication, the duplicate file’s blocks are simply pointers back to a single stored copy of the blocks, so the second copy consumes essentially no additional space.


If I have multiple copies of the same image, they all point back to the same blocks.


Pretty cool, eh?

Well, there was *one* problem with how ONTAP did deduplication: the deduplication scope was a single FlexVol volume. That meant if you had the same file in multiple volumes, you didn’t get the benefits of deduplication across those volumes.


In ONTAP 9.2, that issue is resolved. You can now take advantage of deduplication when multiple volumes reside in the same physical aggregate.


This is currently done inline only (as data is ingested), and only on All Flash FAS systems. The space savings come in handy in workloads such as ESXi datastores, where you may be applying the same OS patches across multiple VMs in multiple datastores hosted in multiple FlexVol volumes.


Another place where aggregate inline deduplication would rock? NetApp FlexGroup volumes, where a single container is composed of multiple member FlexVol volumes on the same physical storage. Speaking of FlexGroup volumes, that leads us to the next features added in ONTAP 9.2.

Other storage efficiency improvements

In addition to aggregate inline dedupe, ONTAP 9.2 also adds:

  • Advanced Drive Partitioning v2 (ADPv2) support for FAS8xxx and FAS9xxx with spinning drives; previously ADPv2 was only supported on All Flash FAS
  • Increase of the maximum aggregate size to 800TB (was previously 400TB)
  • Automated aggregate provisioning in System Manager for easier aggregate creation

NetApp Volume Encryption on FlexGroup volumes

ONTAP 9.1 introduced volume-level encryption (NVE). We did a podcast on it if you’re interested in learning more about it, but in ONTAP 9.2, support for NVE was added to NetApp FlexGroup volumes. Now you can apply encryption only at the volume level (as opposed to the disks via NSE drives) for your large, unstructured NAS workloads.

To apply it, all you need is a volume encryption license. Then, use the same process you would use for a FlexVol volume.
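
As a sketch (the volume name, aggregate list, and size are placeholders), creating an encrypted FlexGroup volume looks just like creating an encrypted FlexVol volume – you add -encrypt true:

cluster::> volume create -vserver DEMO -volume fg_encrypted -aggr-list aggr1_node1,aggr1_node2 -size 10TB -encrypt true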

Additionally, NVE can now be used on SnapLock compliance volumes!

Quality of Service (QoS) Minimums/Guaranteed QoS

In ONTAP 8.2, NetApp introduced Quality of Service to allow storage administrators to apply policies to volumes – and even to files such as LUNs or VM disks – to prevent bully workloads from affecting other workloads in a cluster.

Last year, NetApp acquired SolidFire, which has a pretty mean QoS of its own where it actually approaches QoS from the other end of the spectrum – guaranteeing a performance floor for workloads that require a specific service level.

qos

I’m not 100% sure, but I’m guessing NetApp saw that and said “that’s pretty sweet. Let’s do that.”

So, they have. ONTAP 9.2 now offers both maximum and minimum/guaranteed QoS for storage administrators and service providers. Check out a video on it here:
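
As a rough sketch of what that looks like from the cluster CLI (the policy-group name and numbers are invented, and minimum throughput requires an AFF platform; exact syntax may vary by release):

cluster::> qos policy-group create -policy-group gold -vserver DEMO -min-throughput 1000iops -max-throughput 5000iops
cluster::> volume modify -vserver DEMO -volume flexvol -qos-policy-group gold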

ONTAP Select enhancements

ONTAP 9.2 also includes some ONTAP Select enhancements, such as:

  • 2-node HA support
  • FlexGroup volume support
  • Improved performance
  • Easier deployment
  • ESXi ROBO license
  • Single node ONTAP Select vNAS with VSAN and iSCSI LUN support
  • Inline deduplication support

Usability enhancements

ONTAP is also continuing its mission to make the deployment and configuration via the System Manager GUI easier and easier. In ONTAP 9.2, we bring:

  • Enhanced upgrade support
  • Application aware data management
  • Simplified cluster expansion
  • Simplified aggregate deployment
  • Guided cluster setup

FabricPools

We covered FabricPools in Episode 63 of the Tech ONTAP podcast. Essentially, FabricPools tier cold blocks from flash disk to cloud or an on-premises S3 target like StorageGRID WebScale. It’s not a replacement for backup or disaster recovery; it’s more of a way to lower your total cost of ownership for storage by moving data that is not actively in use to free up space for other workloads. This is all done automatically via a policy. It behaves more like an extension of the aggregate, as the pointers to the blocks that moved remain on the local storage device.


ONTAP 9.2 introduces version 1 of this feature, which will support the following:

  • Tiering to S3 (StorageGRID) or AWS
  • Snapshot-only tiering on primary storage
  • SnapMirror destination tiering on secondary storage
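
As a minimal sketch of the volume-side piece (this assumes an object store has already been configured and attached to the aggregate via the storage aggregate object-store commands; the volume and SVM names are reused from earlier examples):

cluster::> volume modify -vserver DEMO -volume flexvol -tiering-policy snapshot-only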

Future releases will add more functionality, so stay tuned for that! We’ll also be featuring FabricPools in a deep dive for a future podcast episode.

So there you have it! The latest release of ONTAP! Post your thoughts or questions in the comments below!

New doc – Top Best Practices for FlexGroup volumes

When I wrote TR-4571: NetApp FlexGroup Volume Best Practices and Implementation Guide, I wanted to keep the document under 50 pages to be more manageable and digestible. 95 pages later, I realized there was so much good information to pass on to people about NetApp FlexGroup volumes that I wasn’t going to be able to condense a TR down effectively.


The TRs have been out for a while now, and I’m seeing that 95 pages might be a bit much for some people. Everyone’s busy! I was getting asked the same general questions about deploying FlexGroup volumes over and over and decided I needed to create a new, shorter best practices document that focused *only* on the most important, most frequently asked general best practices. It’s part TR, part FAQ. It’s more of an addendum, a sort of companion reader to TR-4571. And the best part?

It’s only FOUR PAGES LONG.

Check it out here:

http://www.netapp.com/us/media/tr-4571-a.pdf

ONTAP 9.1P3 is available!


ONTAP 9.1P3 is available for download today.

You can get it here:

mysupport.netapp.com/NOW/download/software/ontap/9.1P3

P-releases are patch releases and include bug fixes. Fixed in this release (since P2):

  • 1015156 – Storage controller disruption might occur when closing a CIFS session or a CIFS tree with a large number of open files
  • 1028842 – Storage system might experience a disruption under heavy load conditions
  • 1032738 – NDMP might fail to detect a socket close condition, causing backup operations to enter a hung state
  • 1036072 – NA51 firmware release for 960 GB and 3.8 TB SSD series
  • 1050210 – New drive firmware to mitigate a higher failure rate on a drive model
  • 1055450 – SnapMirror source system disrupts while pre-processing data to a format that can be understood by the destination system
  • 1058975 – SAN services disrupt in a two-node cluster when a node reboots while its partner is in a sustained taken-over state
  • 1063710 – Controller disruption occurs while processing read requests in certain scenarios
  • 1064449 – ‘total_ops’ and ‘other_ops’ counters for the system object are incorrect
  • 1064560 – Large file deletion or LUN unmap operations can stall on FlexClone volumes
  • 1065209 – Replay cache resources held up in reserves lead to EJUKEBOX errors for NFS clients
  • 1068280 – X1133A-R6 initiator port fails to discover FCP target devices
  • 1069032 – Unnecessary aggregate metadata reads might lead to long consistency points or a WAFL hang
  • 1069555 – Unnecessary aggregate metadata reads might lead to long consistency points or a file system disruption
  • 1070116 – Driver code repeatedly generates the ‘Error: Failed to open NVRAM device 0’ error message in messages.log
  • 867605 – NDMP DAR restore using non-ASCII (UTF-8) doesn’t seem to work
  • 981677 – Ethernet interface on the UTA2 X1143-R6 adapter and onboard UTA2 ports might become unresponsive

 

Behind the Scenes: Episode 83 – OnCommand Unified Manager 7.2

Welcome to Episode 83, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we chat with Philip Bachman and Yossi Weihs about the new OnCommand Unified Manager update. Find out what’s new and why Unified Manager can truly claim to be a Unified Manager!

What is OnCommand Unified Manager?

OnCommand Unified Manager is NetApp’s monitoring software package. It was the next evolution of Data Fabric Manager (DFM) and allows storage administrators to monitor capacity, performance and other storage events. It also allows you to set up notifications and run scripts when thresholds have been reached. OCUM 7.2 can be deployed on Linux or Windows, as well as deployed as a standard OVA on ESXi.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss


Behind the Scenes: Episode 82 – DockerCon Preview Featuring nDVP and Trident

Welcome to Episode 82, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, it’s just Justin and Andrew chatting about the upcoming Docker conference in Austin, TX, as well as updates to the NetApp Docker Volume Plugin and Trident. We get a bit rambunctious on this one, and in honor of Easter, we leave plenty of Easter eggs!

If you’ll be in Austin for DockerCon or Boston for Red Hat Summit or OpenStack Summit, be sure to look for Andrew and the rest of the NetApp team!

For information on NetApp Docker Volume Plugin, Trident and more, go to the Pub:


http://netapp.io

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss


Behind the Scenes: Episode 81 – NetApp Service Level Manager

Welcome to Episode 81, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we invited the NetApp Service Level Manager team to talk to us about how they’re revolutionizing storage as a service and automating day-to-day storage administration tasks. Join Executive Architect Evan Miller (@evancmiller), Product Manager Nagananda Anur (naga@netapp.com) and Technical Director Ameet Deulgaonkar (Ameet.Deulgaonkar@netapp.com, @Amyth18) as they detail the benefits of NetApp Service Level Manager!

For more details on NetApp Service Level Manager, we encourage you to check out the additional blogs below:

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss


Case study: Using OSI methodology to troubleshoot NAS

Recently, I installed some 10Gb cards into an AFF8040 so I could run some FlexGroup performance tests (stay tuned for that). I was able to install the cards myself, but to get them connected to a network here at NetApp’s internal labs, you have to file a ticket. This should sound familiar to many people, as this is how real-world IT works.

So I filed the ticket and, eventually, the cards were connected. However, just like in real-world IT, the network team has no idea what the storage team (me) has configured, and the storage team (me) has no idea how the network team has things configured. So we had to troubleshoot a bit to get the cards to ping correctly. Turns out, they had a VLAN tag on the ports that wasn’t needed. Removed that, fixed the port channel, and cool! We now had two 10Gb LACP interfaces on a 2-node cluster!

Not so fast…

Turns out, ping is a great test for basic connectivity. But it’s awful for checking if stuff *actually works.* In this case, I could ping via the 10Gb interfaces and even mount via NFSv3 and list directories, etc. But those are lightweight metadata operations.

Whenever I tried a heavier operation like a READ, WRITE or READDIRPLUS (incidentally, tab completion for a path on an NFS mount generates a READDIRPLUS call), the client would hang indefinitely. When I would CTRL+C out of the command, the process would sometimes also hang. And subsequent operations, including the GETATTR, LOOKUP, etc. operations, would also hang.

So, now I had a robust network that couldn’t even perform tab completions.

Narrowing down the issue

I like to start with a packet trace, as that gives me a hint where to focus my efforts. In this issue, I started a packet capture on both the client (10.63.150.161) and the cluster (10.193.67.218). In the traces, I saw some duplicate ACKs, as well as packets being sent but not replied to.


In the corresponding filer trace, I saw the READDIRPLUS call come in, get replied to, and then get re-transmitted a bunch of times. But, as the client-side trace shows, the client never receives the reply.


That means the filer is doing what it’s supposed to. The client is doing what it’s supposed to. But the network is blocking or dropping the packet for some reason.

When troubleshooting any issue, you have to start with a few basic steps (even though I like to start with the more complicated packet capture).

For instance…

What changed?

Well, this one was easy – I had added an entire new network into the mix, end to end. My previous ports were 1Gb and worked fine. This was 10Gb infrastructure, with LACP and jumbo frames. And I had no control over that network. Thus, I was left with client and server troubleshooting for now. I didn’t want to file another ticket before I had done my due diligence, in case I had done something stupid (totally within the realm of possibility, naturally).

So where did I go from there?

Start at layers 1, 2 and 3

The OSI model is something I used to dismiss as the kind of question interviewers ask just to stump people. However, over the course of the last 10 years, I’ve come to realize it’s genuinely useful. What I was troubleshooting was NFS, which is all the way up at layer 7 (the application layer).


So why start at layers 1-3? Why not start where my problem is?

Because with years of experience, you learn that the issue is rarely at the layer you’re seeing the issue manifest. It’s almost always farther down the stack. Where do you think the “Is it plugged in?” joke comes from?


Layer 1 means, essentially, is it plugged in? In this case, yes, it was. But it also means “are we seeing errors on the interfaces that are plugged in?” In ONTAP, you can see that with this command:

ontap9-tme-8040::*> node run * ifstat e2a
Node: ontap9-tme-8040-01

-- interface e2a (8 days, 23 hours, 14 minutes, 30 seconds) --

RECEIVE
 Frames/second: 1 | Bytes/second: 30 | Errors/minute: 0
 Discards/minute: 0 | Total frames: 84295 | Total bytes: 7114k
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 0
 No buffers: 0 | Non-primary u/c: 0 | L2 terminate: 9709
 Tag drop: 0 | Vlan tag drop: 0 | Vlan untag drop: 0
 Vlan forwards: 0 | CRC errors: 0 | Runt frames: 0
 Fragment: 0 | Long frames: 0 | Jabber: 0
 Error symbol: 0 | Illegal symbol: 0 | Bus overruns: 0
 Queue drop: 0 | Xon: 0 | Xoff: 0
 Jumbo: 0 | JMBuf RxFrames: 0 | JMBuf DrvCopy: 0
TRANSMIT
 Frames/second: 82676 | Bytes/second: 33299k | Errors/minute: 0
 Discards/minute: 0 | Total frames: 270m | Total bytes: 1080g
 Total errors: 0 | Total discards: 0 | Multi/broadcast: 4496
 Queue overflows: 0 | No buffers: 0 | Xon: 0
 Xoff: 0 | Jumbo: 13 | TSO non-TCP drop: 0
 Split hdr drop: 0 | Pktlen: 0 | Timeout: 0
 Timeout1: 0 | Stray Cluster Pk: 0
DEVICE
 Rx MBuf Sz: Large (3k)
LINK_INFO
 Current state: up | Up to downs: 22 | Speed: 10000m
 Duplex: full | Flowcontrol: none

In this case, the interface is pretty clean. No errors, no “no buffers,” no CRC errors, etc. I can also see that the ports are “up.” The up to downs are high, but that’s because I’ve been adding/removing this port from the ifgrp multiple times, which leads me to the next step…

Layer 2/3

Layer 2 includes the LACP/port channel and the MTU settings on the switches and endpoints. Layer 3 includes IP addressing, routing, and tests like ping.

Since the port channel was a new change, I made sure that the networking team verified that the port channel was configured properly, with the correct ports added to the channel. I also made sure that the MTU was 9216 on the switch ports, as well as the ports on the client and storage. Those all checked out.
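
For reference, verifying the MTU on both sides is quick (the node, port, and interface names here are from my lab; yours will differ):

cluster::> network port show -node ontap9-tme-8040-01 -port e2a -fields mtu
# ip link show eth0 | grep mtu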

However, that doesn’t mean we’re done with layer 2; remember, basic pings worked fine, but a default ping uses a tiny payload that easily fits within a 1500 MTU. That means we weren’t actually testing jumbo frames. The issue with the client was that any NFS operation that was not metadata never made it back to the client; that suggests a network issue somewhere.

I didn’t mention before, but this cluster also has a properly working 1Gb network on 1500 MTU on the same subnet, so that told me routing was likely not an issue. And because the client was able to send the information just fine and had its 10Gb link established for a while, the issue likely wasn’t on the network segment the client was connected to. The problem resided somewhere between the filer 10Gb ports and the new switch the ports were connected to. (Remember… what changed?)

Jumbo frames

From my experience with troubleshooting and general IT knowledge, I knew that for jumbo frames to work properly, they had to be configured up and down the entire stack of the network. I knew the client was configured for jumbo frames properly because it was a known entity that had been chugging along just fine. I also knew that the filer had jumbo frames enabled because I had control over those ports.

What I wasn’t sure of was if the switch had jumbo frames configured for the entire stack. I knew the switch ports were fine, but what about the switch uplinks?

Luckily, ping can tell us. Did you know you could ping using MTU sizes?

Pinging MTU in Windows

To ping using a packet size in Windows, use:

ping -f -l [size] [address]

-f means “don’t fragment the packet.” That means if I am sending a jumbo frame, don’t break it up into pieces to fit. If you ping with -f and a large packet size, the packet needs to fit within the network’s MTU. If it can’t, you’ll see this:

C:\>ping -f -l 9000 10.193.67.218

Pinging 10.193.67.218 with 9000 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Ping statistics for 10.193.67.218:
 Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)

Then, try pinging with only -l (which specifies the packet size). If that works while the -f version fails, you have a good idea that your issue is MTU size. Note: My Windows client didn’t have jumbo frames enabled, so I didn’t bother using it to troubleshoot.

Pinging MTU in Linux

To ping using a packet size in Linux, use:

ping [-M do] [-s <packet size>] [host]

-M do is the Linux equivalent of “don’t fragment the packet.” That means if I am sending a jumbo frame, don’t break it up into pieces to fit. If you ping with -M do and a large packet size, the packet needs to fit within the network’s MTU. From the ping man page:

-M <hint>: Select Path MTU Discovery strategy.

<hint> may be either “do” (prohibit fragmentation, even local one), “want” (do PMTU discovery, fragment locally when packet size is large), or “dont” (do not set DF flag).

Keep in mind that the MTU size you specify won’t be *exactly* 9000; there’s some overhead involved. In the case of Linux, we’re dealing with about 28 bytes. So an MTU of 9000 will actually come across as 9028 and complain about the packet being too long:

# ping -M do -s 9000 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 9000(9028) bytes of data.
ping: local error: Message too long, mtu=9000

Instead, ping jumbo frames using 9000 – 28 = 8972:

# ping -M do -s 8972 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 8972(9000) bytes of data.
^C
--- 10.193.67.218 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1454ms

In this case, I lost 100% of my packets. Now, let’s ping using 1500 – 28 = 1472:

# ping -M do -s 1472 10.193.67.218
PING 10.193.67.218 (10.193.67.218) 1472(1500) bytes of data.
1480 bytes from 10.193.67.218: icmp_seq=1 ttl=249 time=0.778 ms
^C
--- 10.193.67.218 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 590ms
rtt min/avg/max/mdev = 0.778/0.778/0.778/0.000 ms

All good! Just to make sure, I pinged a known working client that has jumbo frames enabled end to end:

# ping -M do -s 8972 10.63.150.168
PING 10.63.150.168 (10.63.150.168) 8972(9000) bytes of data.
8980 bytes from 10.63.150.168: icmp_seq=1 ttl=64 time=1.12 ms
8980 bytes from 10.63.150.168: icmp_seq=2 ttl=64 time=0.158 ms
^C
--- 10.63.150.168 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1182ms
rtt min/avg/max/mdev = 0.158/0.639/1.121/0.482 ms

Looks like I have data pointing to jumbo frame configuration as my issue. And if you’ve ever dealt with a networking team, you’d better bring data. 🙂

 

Resolving the issue

The network team confirmed that the switch uplink was indeed not set to support jumbo frames. The change was going to take a bit of time, so rather than wait until then, I switched my ports to 1500 in the interim and everything was happy again. Once the jumbo frames get enabled on the cluster’s network segment, I can re-enable them on the cluster.
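
For reference, the interim MTU change on the cluster side is a broadcast domain modification (the broadcast domain name here is hypothetical):

cluster::> network port broadcast-domain modify -ipspace Default -broadcast-domain 10Gb-Public -mtu 1500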

Where else can this issue crop up?

MTU mismatch is a colossal PITA. It’s hard to remember to look for it and hard to diagnose, especially if you don’t have access to all of the infrastructure.

In ONTAP, specifically, I’ve seen MTU mismatch break:

  • CIFS setup/performance
  • NFS operations
  • SnapMirror replication

Pretty much anything you do over a network can be affected, so if you run into a problem all the way up at the application layer, remember the OSI model and start with the following:

  • Check layers 1-3
  • Ask yourself “what changed?”
  • Compare against working configurations, if possible

Behind the Scenes: Episode 80 – NetApp MyAutosupport Dashboard Refresh

Welcome to Episode 80, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we brought in Sudip Hore (@sudiphore) to talk about the new and improved MyAutosupport Dashboard.  Join us as we discuss the new predictive and proactive features and what’s coming in the future.

Also, check out the blog I wrote on it here:
https://whyistheinternetbroken.wordpress.com/2017/03/21/new-myautosupport-dashboard/.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss
