How pNFS could benefit cloud architecture

** Edited on April 2, 2021 **
Funny story about this post. Someone pointed out I had some broken links, so I went in and edited the links. When I clicked “publish” it re-posted the article, which was actually a pointer back to an old DatacenterDude article I wrote from 2015 – which no longer exists. So I started getting *more* pings about broken links and plenty of people seemed to be interested in the content. Thanks to the power of the Wayback Machine, I was able to resurrect the post and decided to do some modernization while I was at it.

Yesterday, I was speaking with a customer who is a cloud provider. They were discussing how to use NFSv4 with Data ONTAP for one of their customers. As we were talking, I brought up pNFS and its capabilities. They were genuinely excited about what pNFS could do for their particular use case. In the cloud, the idea is to remove the overhead of managing infrastructure, so most cloud architectures are geared towards automation, limiting management, etc. In most cases, that’s great, but for data locality in NAS environments, we need a way to make those operations seamless, as well as providing the best possible security available. That’s where pNFS comes in.

So, let’s talk about what pNFS is and in what use cases you may want to use it.

What is pNFS?

pNFS is “parallel NFS,” which is a bit of a misnomer in ONTAP, as it doesn’t do parallel reads and writes across single files (that is, striping). In the case of pNFS on Data ONTAP, NetApp currently supports file-level pNFS, so the storage object behind a pNFS layout is a flexible volume on an aggregate of physical disks.

pNFS in ONTAP establishes a metadata path to the NFS server and then splits the data path off onto its own dedicated path. The client works with the NFS server to determine which path is local to the physical location of the files in the NFS filesystem via GETDEVICEINFO and LAYOUTGET calls (specific to NFSv4.1 and later) and then dynamically redirects the path to be local. Think of it as ALUA for NFS.
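
If you want to try this yourself, pNFS just needs to be enabled on the NFS server and the client needs to mount with NFSv4.1 or later. Here’s a minimal sketch, assuming an SVM named “SVM” exporting /unix through a data LIF at 10.63.3.68 (names and addresses come from the examples later in this post; adjust for your environment):

cluster::> vserver nfs modify -vserver SVM -v4.1 enabled -v4.1-pnfs enabled

# Newer Linux clients can simply specify vers=4.1; older clients use -o minorversion=1
nfs-client# mount -t nfs -o vers=4.1 10.63.3.68:/unix /unix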

The following graphic shows how that all takes place.

[Figure: pNFS metadata path to the mounted LIF, with the data path redirected to the node that owns the volume]

pNFS defines the notion of a device that is generated by the server (that is, an NFS server running on Data ONTAP) and sent to the client. This process helps the client locate the data and send requests directly over the path local to that data. Data ONTAP generates one pNFS device per flexible volume. The metadata path does not change, so metadata requests might still be remote. In a Data ONTAP pNFS implementation, every data LIF is considered an NFS server, so pNFS only works as intended if each node owns at least one data LIF per NFS SVM. Anything less negates the main benefit of pNFS: data locality, regardless of which IP address a client connects to.

The pNFS device contains information about the following:

  • Volume constituents
  • Network location of the constituents

The device information is cached to the local node for improved performance.

To see pNFS devices in the cluster, use the following commands in diag privilege:

cluster::> set diag
cluster::*> vserver nfs pnfs devices cache show

pNFS Components

There are three main components of pNFS:

  • Metadata server
    • Handles all nondata traffic such as GETATTR, SETATTR, and so on
    • Responsible for maintaining metadata that informs the clients of the file locations
    • Located on the NetApp NFS server and established via the mount point
  • Data server
    • Stores file data and responds to READ and WRITE requests
    • Located on the NetApp NFS server
    • Inode information also resides here
  • Clients

pNFS is covered in further detail in NetApp TRs 4067 (NFS) and 4571 (FlexGroup volumes).

How Can I Tell pNFS is Being Used?

To check whether pNFS is in use, you can look at the “pnfs_layout_conversions” counter in the statistics output. If pnfs_layout_conversions is incrementing, then pNFS is in use. Keep in mind that if you try to use pNFS with only a single data interface, the layout conversions won’t take place and pNFS won’t be used, even if it’s enabled.

cluster::*> statistics start -object nfsv4_1_diag
cluster::*> statistics show -object nfsv4_1_diag -counter pnfs_layout_conversions


Object: nfsv4_1_diag
Instance: nfs4_1_diag
Start-time: 4/9/2020 16:29:50
End-time: 4/9/2020 16:31:03
Elapsed-time: 73s
Scope: node1

    Counter                                                     Value
   -------------------------------- --------------------------------
   pnfs_layout_conversions                                      4053
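
You can also sanity-check things from the client side. Here’s a rough sketch for a Linux client (exact paths and op names vary a bit by kernel version):

# Confirm the mount actually negotiated NFSv4.1
nfs-client# nfsstat -m | grep vers

# Confirm layout operations are being issued against the mount
nfs-client# grep -E 'LAYOUTGET|GETDEVICEINFO' /proc/self/mountstats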

Gotta keep ’em separated!

One thing that is beneficial about the design of pNFS is that the metadata paths are separated from the read/write paths. Once a mount is established, the metadata path is set on the IP address used for mount and does not move without manual intervention. In Data ONTAP, that path could live anywhere in the cluster. (Up to 24 physical nodes with multiple ports on each node!)

That buys you resiliency, as well as flexibility to control where the metadata will be served.

The data path, however, is only established on reads and writes. That path is determined in a conversation between the client and server and is dynamic. Any time the physical location of a volume changes, the data path changes automatically, with no need for intervention by the clients or the storage administrator. So, unlike NFSv3 or even NFSv4.0, you would no longer need to break the TCP connection (via unmount or LIF migrations) to make the read/write path local. And with NFSv4.x, the statefulness of the connection can be preserved.

That means more time for everyone. Data can be migrated in real time, non-disruptively, based on the storage needs of the client.

For example, I have a volume that lives on node cluster01 of my cDOT cluster:

cluster::> vol show -vserver SVM -volume unix -fields node
 (volume show)

vserver volume node
------- ------ --------------
SVM     unix   cluster01

I have data LIFs on each node in my cluster:

 cluster::> net int show -vserver SVM
(network interface show)

Logical     Status     Network                       Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM
             data1      up/up     10.63.57.237/18    cluster01     e0c     true
             data2      up/up     10.63.3.68/18      cluster02     e0c     true
2 entries were displayed.

In the above list:

  • 10.63.3.68 will be my metadata path, since that’s where I mounted.
  • 10.63.57.237 will be my data path, as it is local to the physical node (cluster01) where the volume lives.

When I mount, the TCP connection is established to the node where the data LIF lives:

nfs-client# mount -o minorversion=1 10.63.3.68:/unix /unix

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local             Port Host:Port              Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                                TCP/nfs

My metadata path is established to cluster02, but my data volume lives on cluster01.

With a basic cd and ls into the mount, all the traffic is seen on the metadata path (stuff like GETATTR, ACCESS, etc.):

83     6.643253      10.228.225.140       10.63.3.68    NFS    270    V4 Call (Reply In 85) GETATTR
85     6.648161      10.63.3.68    10.228.225.140       NFS    354    V4 Reply (Call In 83) GETATTR
87     6.652024      10.228.225.140       10.63.3.68    NFS    278    V4 Call (Reply In 88) ACCESS 
88     6.654977      10.63.3.68    10.228.225.140       NFS    370    V4 Reply (Call In 87) ACCESS

When I start I/O to that volume, the path gets updated to the local path by way of pNFS layout calls (defined in the NFSv4.1 standard, RFC 5661):

28     2.096043      10.228.225.140       10.63.3.68    NFS    314    V4 Call (Reply In 29) LAYOUTGET
29     2.096363      10.63.3.68    10.228.225.140       NFS    306    V4 Reply (Call In 28) LAYOUTGET
30     2.096449      10.228.225.140       10.63.3.68    NFS    246    V4 Call (Reply In 31) GETDEVINFO
31     2.096676      10.63.3.68    10.228.225.140       NFS    214    V4 Reply (Call In 30) GETDEVINFO
  1. In LAYOUTGET, the client asks the server “where does this filehandle live?”
  2. The server responds with the device ID and physical location of the filehandle.
  3. Then, the client asks “which devices are available to me to access that physical data?” via GETDEVINFO.
  4. The server responds with the list of available devices/IP addresses.

Once that communication takes place (and note that the conversation occurs in sub-millisecond times), the client then establishes the new TCP connection for reads and writes:

32     2.098771      10.228.225.140       10.63.57.237  TCP    74     917 > nfs [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=937300318 TSecr=0 WS=128
33     2.098996      10.63.57.237  10.228.225.140       TCP    78     nfs > 917 [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=2452178641 TSecr=937300318
34     2.099042      10.228.225.140       10.63.57.237  TCP    66     917 > nfs [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=937300318 TSecr=2452178641

And we can see the connection established on the cluster to both the metadata and data locations:

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local             Port Host:Port              Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster01
SVM         data1:2049             nfs-client.domain.netapp.com:917
                                                               TCP/nfs
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                               TCP/nfs

Then we start our data transfer on the new path (data path 10.63.57.237):

38     2.099798      10.228.225.140       10.63.57.237  NFS    250    V4 Call (Reply In 39) EXCHANGE_ID
39     2.100137      10.63.57.237  10.228.225.140       NFS    278    V4 Reply (Call In 38) EXCHANGE_ID
40     2.100194      10.228.225.140       10.63.57.237  NFS    298    V4 Call (Reply In 42) CREATE_SESSION
42     2.100537      10.63.57.237  10.228.225.140       NFS    194    V4 Reply (Call In 40) CREATE_SESSION

157    2.106388      10.228.225.140       10.63.57.237  NFS    15994  V4 Call (Reply In 178) WRITE StateID: 0x0d20 Offset: 196608 Len: 65536
163    2.106421      10.63.57.237  10.228.225.140       NFS    182    V4 Reply (Call In 127) WRITE

If I do a chmod later, the metadata path is used (10.63.3.68):

341    27.268975     10.228.225.140       10.63.3.68    NFS    310    V4 Call (Reply In 342) SETATTR FH: 0x098eaec9
342    27.273087     10.63.3.68    10.228.225.140       NFS    374    V4 Reply (Call In 341) SETATTR | ACCESS

How do I make sure metadata connections don’t pile up?

When you have many clients mounting to an NFS server, you generally want to try to control which nodes those clients are mounting to. In the cloud, this becomes trickier to do, as clients and storage system management may be handled by the cloud providers. So, we’d want to have a noninteractive way to do this.

With ONTAP, you have two options to load balance TCP connections for metadata. You can use the tried-and-true DNS round-robin method, but the NFS server doesn’t have any idea which IP addresses have been issued by the DNS server, so there are no guarantees the connections won’t pile up.

Another way to deal with connections is to leverage the ONTAP feature for on-box DNS load balancing. This feature allows storage administrators to set up a DNS forwarding zone on a DNS server (BIND, Active Directory or otherwise) to forward requests to the clustered Data ONTAP data LIFs, which can act as DNS servers complete with SOA records! The cluster will determine which IP address to issue to a client based on the following factors:

  • CPU load
  • overall node throughput

This helps ensure that any TCP connection that is established is done so in a logical manner, based on the performance of the physical hardware.
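
If you want a feel for what that setup looks like, here’s a minimal sketch. The zone name is made up for illustration; the LIF names and addresses come from the example earlier in this post. On the ONTAP side, you assign a DNS zone to the data LIFs and let them answer DNS queries; on the site DNS server (BIND, in this sketch), you simply forward that zone to those LIFs:

cluster::> network interface modify -vserver SVM -lif data1 -dns-zone nfs.svm.domain.netapp.com -listen-for-dns-query true
cluster::> network interface modify -vserver SVM -lif data2 -dns-zone nfs.svm.domain.netapp.com -listen-for-dns-query true

# named.conf on the site DNS server - forward the delegated zone to the data LIFs
zone "nfs.svm.domain.netapp.com" {
    type forward;
    forward only;
    forwarders { 10.63.57.237; 10.63.3.68; };
};

Clients then mount nfs.svm.domain.netapp.com:/unix, and the cluster hands out whichever LIF it considers least loaded.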

I cover both types of DNS load balancing in TR-4523: DNS Load Balancing in ONTAP.

What about that data agility?

What’s great about pNFS is that it is a perfect fit for storage operating systems like ONTAP. NetApp and Red Hat worked together closely on the protocol enhancement, and it shows in its overall implementation.

In ONTAP, there is the concept of non-disruptive volume moves. This feature gives storage administrators agility and flexibility in their clusters, as well as enabling service and cloud providers a way to charge based on tiers (pay as you grow!).

For example, if I am a cloud provider, I could have a 24-node cluster as a backend. Some HA pairs could be All-Flash FAS (AFF) nodes for high-performance/low latency workloads. Some HA pairs could be SATA or SAS drives for low performance/high capacity/archive storage. If I am providing storage to a customer that wants to implement high performance computing applications, I could sell them the performance tier. If those applications are only going to run during the summer months, we can use the performance tier, and after the jobs are complete, we can move them back to SATA/SAS drives for storage and even SnapMirror or SnapVault them off to a DR site for safekeeping. Once the job cycle comes back around, I can nondisruptively move the volumes back to flash. That saves the customer money, as they only pay for the performance they’re using, and that saves the cloud provider money since they can free up valuable flash real estate for other customers that need performance tier storage.
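
The mechanics of that are a single command per move. A sketch using the volume from the earlier example and a made-up destination aggregate name:

# Nondisruptively move the volume down to a SATA aggregate after the job cycle completes
cluster::> volume move start -vserver SVM -volume unix -destination-aggregate aggr1_sata_node04
cluster::> volume move show -vserver SVM -volume unix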

What happens when a volume moves in pNFS?

When a volume move occurs, the client is notified of the change via the pNFS calls I mentioned earlier. When the client attempts to OPEN the file for writing, the server responds, “that file is somewhere else now.”

220    24.971992     10.228.225.140       10.63.3.68    NFS    386    V4 Call (Reply In 221) OPEN DH: 0x76306a29/testfile3
221    24.981737     10.63.3.68    10.228.225.140       NFS    482    V4 Reply (Call In 220) OPEN StateID: 0x1077

The client says, “cool, where is it now?”

222    24.992860     10.228.225.140       10.63.3.68    NFS    314    V4 Call (Reply In 223) LAYOUTGET
223    25.005083     10.63.3.68    10.228.225.140       NFS    306    V4 Reply (Call In 222) LAYOUTGET
224    25.005268     10.228.225.140       10.63.3.68    NFS    246    V4 Call (Reply In 225) GETDEVINFO
225    25.005550     10.63.3.68    10.228.225.140       NFS    214    V4 Reply (Call In 224) GETDEVINFO

Then the client uses the new path to start writing, with no interaction needed.

251    25.007448     10.228.225.140       10.63.57.237  NFS    7306   V4 Call WRITE StateID: 0x15da Offset: 0 Len: 65536
275    25.007987     10.228.225.140       10.63.57.237  NFS    7306   V4 Call WRITE StateID: 0x15da Offset: 65536 Len: 65536

Automatic Data Tiering

If you have an on-premises storage system and want to save storage infrastructure costs by automatically tiering cold data to the cloud or to an on-premises object storage system, you could leverage NetApp FabricPool, which allows you to set tiering policies to chunk off cold blocks of data to more cost effective storage and then retrieve those blocks whenever they are requested by the end user. Again, we’re taking the guesswork and labor out of data management, which is becoming critical in a world driven towards managed services.
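
Applying FabricPool to a volume is just a per-volume tiering policy, once the aggregate has a cloud tier attached. A minimal sketch (the volume name is from the earlier example, and the available policy names depend on your ONTAP release):

cluster::> volume modify -vserver SVM -volume unix -tiering-policy auto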

For more information on FabricPool:

TR-4598: FabricPool Best Practices

Tech ONTAP Podcast Episode 268 – NetApp FabricPool and S3 in ONTAP 9.8

What about FlexGroup volumes?

As of ONTAP 9.7, NFSv4.1 and pNFS are supported with FlexGroup volumes, which is an intriguing solution.

Part of the challenge of a FlexGroup volume is that you’re guaranteed to have remote I/O across a cluster network when you span multiple nodes. But since pNFS automatically redirects traffic to local paths, you can greatly reduce the amount of intracluster traffic.

A FlexGroup volume operates as a single entity, but is constructed of multiple FlexVol member volumes. Each member volume contains unique files that are not striped across volumes. When NFS operations connect to FlexGroup volumes, ONTAP handles the redirection of operations over a cluster network.

With pNFS, these remote operations are reduced, because the data layout mappings track the member volume locations and local network interfaces; they also redirect reads/writes to the local member volume inside a FlexGroup volume, even though the client only sees a single namespace. This approach enables a scale-out NFS solution that is more seamless and easier to manage, and it also reduces cluster network traffic and balances data network traffic more evenly across nodes.

FlexGroup pNFS differs a bit from FlexVol pNFS. Even though a FlexGroup volume load-balances between metadata servers for file opens, pNFS uses a different algorithm: it tries to direct traffic to the node on which the target file is located. If a node has multiple data interfaces, connections can be made to each of those LIFs, but only one LIF of the set is used to direct traffic to the volumes on that node.

What workloads should I use with pNFS?

pNFS is leveraging NFSv4.1 and later as its protocol, which means you get all the benefits of NFSv4.1 (security, Kerberos and lock integration, lease-based locks, delegations, ACLs, etc.). But you also get the potential negatives of NFSv4.x, such as higher overhead for operations due to the compound calls, state ID handling, locking, etc. and disruptions during storage failovers that you wouldn’t see with NFSv3 due to the stateful nature of NFSv4.x.

Performance can be severely impacted with some workloads, such as high file count workloads/high metadata workloads (think EDA, software development, etc). Why? Well, recall that pNFS is parallel for reads and writes – but the metadata operations still use a single interface for communication. So if your NFS workload is 80% GETATTR, then 80% of your workload won’t benefit from the localization and load balancing that pNFS provides. Instead, you’ll be using NFSv4.1 as if pNFS were disabled.
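
If you’re not sure how metadata-heavy your workload actually is, a quick client-side check of the NFS op mix can tell you. A rough sketch (output format varies by distribution):

# Per-operation counts for the NFS client; a large share of GETATTR/LOOKUP/ACCESS means
# most of the workload won't benefit from pNFS localization
nfs-client# nfsstat -c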

Plus, with millions of files, even if you’re doing heavy reads and writes, you’re redirecting paths constantly with pNFS (creating millions of GETDEVICEINFO and LAYOUTGET calls), which may prove more inefficient than simply using NFSv4.1 without pNFS.

pNFS also would need to be supported by the clients you’re using, so if you want to use it for something like VMware datastores, you’re out of luck (for now). VMware currently supports NFSv4.1, but not pNFS (they went with session trunking, which ONTAP does not currently support).

File-based pNFS works best with workloads that do a lot of sequential IO, such as databases, Hadoop/Apache Spark, AI training workloads, or other large file workloads, where reads and writes dominate the IO.

What about the performance?

In TR-4067, I did some basic performance testing on NFSv3 vs. NFSv4.1 for those types of workloads and the results were that pNFS stacked up nicely with NFSv3.

These tests were done using dd in parallel to simulate a sequential I/O workload. They aren’t intended to show the upper limits of the system (I used an AFF8040 and some VM clients with low RAM and 1Gb networks), but instead to show an apples-to-apples comparison of NFSv3 and NFSv4.1, with and without pNFS, using different wsize/rsize values. Be sure to do your own tests before implementing in production.
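
For reference, the workload was along these lines – several parallel dd streams against the mount. This is a sketch rather than the exact script; the stream count, file names, and sizes are illustrative:

nfs-client# mount -t nfs -o vers=4.1,rsize=262144,wsize=262144 10.63.3.68:/unix /unix
# Kick off 8 sequential writers in parallel and wait for them all to finish
nfs-client# for i in $(seq 1 8); do dd if=/dev/zero of=/unix/file$i bs=1M count=2048 & done; wait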

Note that the completion time for this workload using pNFS was about four and a half minutes faster than NFSv3 with a 1MB wsize/rsize value.

Test (wsize/rsize setting)           Completion Time
-----------------------------------  ---------------
NFSv3 (1MB)                          15m23s
NFSv3 (256K)                         14m17s
NFSv3 (64K)                          14m48s
NFSv4.1 (1MB)                        15m6s
NFSv4.1 (256K)                       12m10s
NFSv4.1 (64K)                        15m8s
NFSv4.1 (1MB; pNFS)                  10m54s
NFSv4.1 (256K; pNFS)                 12m16s
NFSv4.1 (64K; pNFS)                  13m57s
NFSv4.1 (1MB; delegations)           13m6s
NFSv4.1 (256K; delegations)          15m25s
NFSv4.1 (64K; delegations)           13m48s
NFSv4.1 (1MB; pNFS + delegations)    11m7s
NFSv4.1 (256K; pNFS + delegations)   13m26s
NFSv4.1 (64K; pNFS + delegations)    10m6s

The IOPS were lower overall for NFSv4.1 than for NFSv3 because NFSv4.1 combines operations into single compound packets, so NFSv4.1 is less chatty over the network than NFSv3. On the downside, the payloads are larger, so the NFS server has more processing to do for each packet, which can impact CPU; at higher op counts, that overhead can show up as a drop in performance.

Where NFSv4.1 with pNFS beat out NFSv3 was latency and throughput – because data locality is guaranteed, reads and writes get fastpathed to the files rather than incurring the extra processing needed to traverse the cluster network.

Test (wsize/rsize setting)           Avg Read      Avg Read           Avg Write     Avg Write          Average
                                     Latency (ms)  Throughput (MB/s)  Latency (ms)  Throughput (MB/s)  Ops
-----------------------------------  ------------  -----------------  ------------  -----------------  -------
NFSv3 (1MB)                          6             654                27.9          1160               530
NFSv3 (256K)                         1.4           766                2.9           1109               2108
NFSv3 (64K)                          0.2           695                2.2           1110               8791
NFSv4.1 (1MB)                        6.5           627                36.8          1400               582
NFSv4.1 (256K)                       1.4           712                3.2           1160               2352
NFSv4.1 (64K)                        0.1           606                1.2           1310               7809
NFSv4.1 (1MB; pNFS)                  3.6           840                26.8          1370               818
NFSv4.1 (256K; pNFS)                 1.1           807                5.2           1560               2410
NFSv4.1 (64K; pNFS)                  0.1           835                1.9           1490               8526
NFSv4.1 (1MB; delegations)           5.1           684                32.9          1290               601
NFSv4.1 (256K; delegations)          1.3           648                3.3           1140               1995
NFSv4.1 (64K; delegations)           0.1           663                1.3           1000               7822
NFSv4.1 (1MB; pNFS + delegations)    3.8           941                22.4          1110               696
NFSv4.1 (256K; pNFS + delegations)   1.1           795                3.3           1140               2280
NFSv4.1 (64K; pNFS + delegations)    0.1           815                1             1170               11130

For high file count workloads, NFSv3 did much better. This test created 800,000 small files (512K) in parallel. For this high metadata workload, NFSv3 completed 2x as fast as NFSv4.1. pNFS added some time savings versus NFSv4.1 without pNFS, but overall, we can see where we may run into problems with this type of workload. Future releases of ONTAP will get better with this type of workload using NFSv4.1 (these tests were on 9.7).

Test (wsize/rsize setting)   Completion Time  CPU %   Avg Throughput (MB/s)  Avg Total IOPS
---------------------------  ---------------  ------  ---------------------  --------------
NFSv3 (1MB)                  17m29s           32%     351                    7696
NFSv3 (256K)                 16m34s           34.5%   372                    8906
NFSv3 (64K)                  16m11s           39%     394                    13566
NFSv4.1 (1MB)                38m20s           26%     167                    7746
NFSv4.1 (256K)               38m15s           27.5%   167                    7957
NFSv4.1 (64K)                38m13s           31%     172                    10221
NFSv4.1 pNFS (1MB)           35m44s           27%     171                    8330
NFSv4.1 pNFS (256K)          35m9s            28.5%   175                    8894
NFSv4.1 pNFS (64K)           36m41s           33%     171                    10751

Enter nconnect

One of the keys to pNFS performance is parallelization of operations across volumes, nodes, etc. But it doesn’t necessarily parallelize network connections across these workloads. That’s where the new NFS mount option nconnect comes in.

The purpose of nconnect is to provide multiple TCP connections per mount point on a client. This helps increase parallelism and performance for NFS mounts – particularly for single-client workloads. Details about nconnect and how it can increase performance for NFS in Cloud Volumes ONTAP can be found in the blog post The Real Baseline Performance Story: NetApp Cloud Volumes Service for AWS. ONTAP 9.8 offers official support for the use of nconnect with NFS mounts, provided the NFS client also supports it; no option is needed on the ONTAP side. If you would like to use nconnect, check to see whether your client version provides it and use ONTAP 9.8 or later.

Client support for nconnect varies, but the latest RHEL 8.3 release supports it, as do the latest Ubuntu and SLES releases. Be sure to verify if your OS vendor supports it.
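
Using it is just another mount option. A minimal sketch, assuming a client kernel that supports nconnect and ONTAP 9.8 or later on the cluster:

# 16 TCP connections for this single mount point; the client spreads RPCs across them
nfs-client# mount -t nfs -o vers=4.1,nconnect=16 10.63.3.68:/unix /unix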

Our Customer Proof of Concept lab (CPOC) did some benchmarking of nconnect with NFSv3 and pNFS using a sequential I/O workload on ONTAP 9.8 and saw some really promising results.

  • Single NVIDIA DGX-2 client
  • Ubuntu 20.04.2
  • NFSv4.1 with pNFS and nconnect
  • AFF A400 cluster
  • NetApp FlexGroup volume
  • 256K wsize/rsize
  • 100GbE connections
  • 32 x 1GB files

In these tests, the following throughput results were seen. Latency for both was sub-1ms.

Test            Bandwidth
--------------  ---------
NFSv3           10.2 GB/s
NFSv4.1/pNFS    21.9 GB/s

Both NFSv3 and NFSv4.1 used nconnect=16.

In these tests, NFSv4.1 with pNFS doubled the throughput for the sequential read workload at 250us latency. Since the files were 1GB in size, the reads were served almost entirely from controller RAM, but that isn’t an unreasonable reflection of reality for many workloads, as most systems have enough RAM to see similar results.

David Arnette and I discuss it a bit in this podcast:

Episode 283 – NetApp ONTAP AI Reference Architectures

Note: Benchmark tests such as SAS iotest will purposely recommend setting file sizes larger than the system RAM to avoid any caching benefits and instead will measure the network bandwidth of the transfer. In real world application scenarios, RAM, network, storage and CPU are all working together to create the best possible performance scenarios.

pNFS Best Practices with ONTAP

pNFS best practices in ONTAP don’t differ much from normal NAS best practices, but here are a few to keep in mind. In general:

  • Use the latest supported client OS version.
  • Use the latest supported ONTAP patch release.
  • Create a data LIF per node, per SVM to ensure data locality for all nodes (see the sketch after this list).
  • Avoid using LIF migration on the metadata server data LIF, because NFSv4.1 is a stateful protocol and LIF migrations can cause brief outages as the NFS states are reestablished.
  • In environments with multiple NFSv4.1 clients mounting, balance the metadata server connections across multiple nodes to avoid piling up metadata operations on a single node or network interface.
  • If possible, avoid using multiple data LIFs on the same node in an SVM.
  • In general, avoid mounting NFSv3 and NFSv4.x on the same datasets. If you can’t avoid this, check with the application vendor to ensure that locking can be managed properly.
  • If you’re using NFS referrals with pNFS, keep in mind that referrals establish a local metadata server, but data I/O is still redirected. With FlexGroup volumes, the member volumes might live on multiple nodes, so NFS referrals aren’t of much use. Instead, use DNS load balancing to spread out connections.
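
For the “data LIF per node, per SVM” item above, here’s a sketch of what that looks like on a two-node cluster. The addresses, ports, and LIF names mirror the example earlier in this post; newer ONTAP releases use service policies rather than -role/-data-protocol:

cluster::> network interface create -vserver SVM -lif data1 -role data -data-protocol nfs -home-node cluster01 -home-port e0c -address 10.63.57.237 -netmask 255.255.192.0
cluster::> network interface create -vserver SVM -lif data2 -role data -data-protocol nfs -home-node cluster02 -home-port e0c -address 10.63.3.68 -netmask 255.255.192.0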

Drop any questions into the comments below!

Introducing: ONTAP Recipes!

Official NetApp ONTAP recipes blog here:

https://community.netapp.com/t5/ONTAP-Recipes/bd-p/ONTAPRecipes

One of the key initiatives NetApp has had over the course of the past few years is driving the simplicity and ease of use of ONTAP, its flagship storage software. Some of that work is going into the GUIs that run ONTAP, such as:

  • OnCommand System Manager being moved on-box to remove the need to manage it on external systems, starting in ONTAP 8.3
  • Application provisioning templates for NAS and SAN applications starting in ONTAP 8.3.2 (including Oracle, VMware, Hyper-V, SQL, SAP HANA and others)
  • Performance headroom/capacity in System Manager in ONTAP 9.0
  • Top client/performance visibility in OnCommand System Manager via ONTAP 9.0
  • Intelligent, automatic balanced placement of storage objects when provisioning volumes and LUNs in ONTAP 9.2
  • Simplified cluster setup, ease of management when adding new nodes, automated non-disruptive upgrades starting in ONTAP 8.3.2 and later
  • Unification of OnCommand Performance Manager and Unified Manager into a single OVA in OnCommand 7.2
  • Better overall look and feel of the GUIs

There’s plenty more to tout, but this is a blog about NetApp’s newest way to help storage administrators (and reluctant, de facto storage administrators) manage ONTAP via…

ONTAP Recipes!

If you’ve ever watched a cooking show, the chef will show you the ingredients and how to assemble/mix/prep. Then, into the oven. Within seconds, through the magic of television, the steaming, hot, fully cooked dish is ready to eat. Super easy, right?

What they don’t show you is the slicing, chopping, cutting and dicing of the ingredients. That’s done ahead of time and measured out into little dishes. They also don’t show you the various times you inevitably forget to add an ingredient, or you add too much, or you have to run to the store to pick up something you forgot.

Then, the ultimate lie – they don’t let on that the perfectly cooked meal was prepared well before the show was filmed, waiting in the oven in all its perfection.

And that’s ok! We don’t want to see “how the sausage is made.”

We just want to consume it. And our storage is not that much different.

That’s the idea behind ONTAP recipes – they are intended to be written in an easy to follow order. Easy to read. Easy to consume. The goal is to deliver a new recipe each week. If you have a specific recipe you’d like to see, comment here or on the official NetApp ONTAP recipe page. Happy eating!

Here’s the latest one. The goal was to correspond with MongoDB World in Chicago on June 20-21:

https://community.netapp.com/t5/Data-ONTAP-Discussions/ONTAP-Recipes-Deploy-a-MongoDB-test-dev-environment-on-DP-Secondary/m-p/131941

For all the others, go here:

https://community.netapp.com/t5/user/viewprofilepage/user-id/60363

New doc – Top Best Practices for FlexGroup volumes

When I wrote TR-4571: NetApp FlexGroup Volume Best Practices and Implementation Guide, I wanted to keep the document under 50 pages to be more manageable and digestible. 95 pages later, I realized there was so much good information to pass on to people about NetApp FlexGroup volumes that I wasn’t going to be able to condense a TR down effectively.

The TRs have been out for a while now, and I’m seeing that 95 pages might be a bit much for some people. Everyone’s busy! I was getting asked the same general questions about deploying FlexGroup volumes over and over and decided I needed to create a new, shorter best practices document that focused *only* on the most important, most frequently asked general best practices. It’s part TR, part FAQ. It’s more of an addendum, a sort of companion reader to TR-4571. And the best part?

It’s only FOUR PAGES LONG.

Check it out here:

http://www.netapp.com/us/media/tr-4571-a.pdf

Ransomware, NetApp and You

The world is a nefarious place. All you have to do is read the latest headlines to see why.

As the use of the Internet has expanded to include things like cloud and the Internet of Things, the number of threats has also expanded. Computer viruses, root kits, spoofing, phishing, spear phishing, denial of service attacks, hacking, Nigerian princes promising a million dollars to your 75-year-old mother-in-law… all of these things are challenges that IT professionals face every day.

One of the nastier security issues out there is something called “ransomware.” It’s exactly what it sounds like – someone gets control of your data via one of the aforementioned ways, encrypts it, and holds it for ransom, usually for payment in dollars or bitcoin. It’s the Internet version of “Taken,” and it often requires someone with a very particular set of skills to combat.

How do you combat ransomware?

There are essentially two ways to combat ransomware:

  1. Threat prevention via securing your networks, authentication and user education.
  2. Restoring from backup (wait, did we back up???)

NetApp has long been known for its superior Snapshot technology, but with ransomware, it now has a new use case.

If you store your data on NetApp storage and keep a regular cadence of snapshots, you can recover nearly instantaneously from ransomware attacks and be back in business in minutes. Snapshots are read-only, so they can’t be modified by attackers. If someone locks your data up, unlock it by rolling back to happier times, such as when your data was not being held hostage by ransomware.
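
In ONTAP terms, that’s just a snapshot policy with a reasonable cadence, plus a restore when you need it. A sketch with made-up policy, volume, and snapshot names (schedules, counts, and naming will differ in your environment):

# Keep a regular cadence of read-only snapshots on the volume's SVM
cluster::> volume snapshot policy create -vserver SVM -policy hourly-keep-24 -enabled true -schedule1 hourly -count1 24

# After an attack, roll the volume back to a point in time before the encryption started
cluster::> volume snapshot restore -vserver SVM -volume vol1 -snapshot hourly.2017-06-20_0105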

Matt Watts (@mtjwatts) recently did an excellent job coming up with “10 Good Reasons” for NetApp with regard to ransomware protection. Here is the infographic:

[Infographic: “10 Good Reasons”]

NetApp won’t necessarily prevent ransomware, but they can help get you out of a sticky situation.

In addition to the above, NetApp Security Technical Marketing Engineer Andrae Middleton (Andrae.Middleton@netapp.com) wrote up a Technical Report on Ransomware and NetApp that will be out very soon. You can find that here:

http://www.netapp.com/us/media/tr-4572.pdf

Andrae also has some other useful NetApp security related documentation here:

DS-3846: Security Features in ONTAP 9

TR-4569 Security Hardening Guide for ONTAP 9

We also had Andrae on the Tech ONTAP podcast, along with NetApp A-Team member Jarett Kulm (@jk47theweapon):

 

9.1RC2 is now available!

That’s right – the release candidate is now available. If you have concerns over the “RC” designation, allow me to recap what I mentioned in a previous blog post:

RC versions have completed a rigorous set of internal NetApp tests and are deemed ready for public consumption. Each release candidate provides bug fixes that eventually lead up to the GA edition. Keep in mind that all release candidates are fully supported by NetApp, even if there is a GA version available. However, while RC is perfectly fine to run in production environments, GA is the recommended version of any ONTAP software release.

For a more official take on it, see the NetApp link:

http://mysupport.netapp.com/NOW/products/ontap_releasemodel/post70.shtml

What’s new in ONTAP 9.1?

At a high level, ONTAP 9.1 brings:

9.1RC2 specifically brings (outside of bug fixes):

  • Support for the DS460C shelves
  • Official support for backup of NAS to cloud via AltaVault (SnapMirror)
  • SMB support for NetApp FlexGroup volumes

Happy upgrading!

For info about ONTAP 9.0, see:

ONTAP 9 RC1 is now available!

ONTAP 9.0 is now generally available (GA)!

 

Behind the Scenes: Episode 61 – Security and Storage

Welcome to Episode 61, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week on the podcast, we discuss security in storage systems with the new security TME Andrae Middleton and NetApp A-Team member Jarett Kulm (@JK47theweapon) of High Availability, Inc. We cover security at rest, in-flight, methodologies, ransomware and much more!

Also be sure to check out our podcast on NetApp Volume Encryption.

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

ONTAP 9.1 RC1 is now available!

For info about ONTAP 9.0, see:

ONTAP 9 RC1 is now available!

ONTAP 9.0 is now generally available (GA)!

While many of the features of ONTAP 9.1 were announced at Insight 2016 in Las Vegas, the official release of the software wasn’t scheduled until the first week of October, which was the week after the conference.

For Insight Las Vegas highlights, see http://www.netapp-insight.com/las-vegas-highlights.

Get used to more features being released for ONTAP in the coming years. We’ve sped up the release cycle to get more cool stuff out faster!

But now, ONTAP 9.1 RC1 is available!

That’s right – the next major release of ONTAP is now available. If you have concerns over the “RC” designation, allow me to recap what I mentioned in a previous blog post:

RC versions have completed a rigorous set of internal NetApp tests and are deemed ready for public consumption. Each release candidate provides bug fixes that eventually lead up to the GA edition. Keep in mind that all release candidates are fully supported by NetApp, even if there is a GA version available. However, while RC is perfectly fine to run in production environments, GA is the recommended version of any ONTAP software release.

For a more official take on it, see the NetApp link:

http://mysupport.netapp.com/NOW/products/ontap_releasemodel/post70.shtml

What’s new in ONTAP 9.1?

At a high level, ONTAP 9.1 brings:

If you have questions about any of the above, leave a comment and I’ll address them in a future blog post.

Happy upgrading!

NetApp FlexGroup: An evolution of NAS

Check out the official NetApp version of this blog on the NetApp Newsroom!

I’ve been the NFS TME at NetApp for 3 years now.

I also cover name services (LDAP, NIS, DNS, etc.) and occasionally answer the stray CIFS/SMB question. I look at NAS as a data utility, not unlike water or electricity in your home. You need it, you love it, but you don’t really think about it too much and it doesn’t really excite you.

However, once I heard that NetApp was creating a brand new distributed file system that could evolve how NAS works, I jumped at the opportunity to be a TME for it. So, now, I am the Technical Marketing Engineer for NFS, Name Services and NetApp FlexGroup (and sometimes CIFS/SMB). How’s that for a job title?

We covered NetApp FlexGroup in the NetApp Tech ONTAP Podcast the week of June 30, but I wanted to write up a blog post to expand upon the topic a little more.

Now that ONTAP 9.1 is available, it was time to update the blog here.

For the official Technical Report, check out TR-4557 – NetApp FlexGroup Technical Overview.

For the best practice guide, see TR-4571 – NetApp FlexGroup Best Practices and Implementation Guide.

Here are a couple videos I did at Insight:

I also had a chance to chat with Enrico Signoretti at Insight:

Data is growing.

It’s no secret… we’re leaving behind – some may say, have already left behind – the days where 100TB in a single volume is enough space to accommodate a single file system. Files are getting larger and datasets are increasing. For instance, think about the sheer amount of data that’s needed to keep something like a photo or video repository running. Or a global GPS data structure. Or Electronic Design Automation environments designing the latest computer chipset. Or seismic data analyzing oil and gas locations.

Environments like these require massive amounts of capacity, with billions of files in some cases. Scale-out NAS storage devices are the best way to approach these use cases because of the flexibility, but it’s important to be able to scale the existing architecture in a simple and efficient manner.

For a while, storage systems like ONTAP had a single construct to handle these workloads – the Flexible Volume (or, FlexVol).

FlexVols are great, but…

For most use cases, FlexVols are perfect. They are large enough (up to 100TB) and can handle enough files (up to 2 billion). For NAS workloads, they can do just about anything. But where you start to see issues with the FlexVol is when you start to increase the number of metadata operations in a file system. The FlexVol volume will serialize these operations and won’t use all possible CPU threads for the operations. I think of it like a traffic jam due to lane closures; when a lane is closed, everyone has to merge, causing slowdowns.

When all lanes are open, traffic is free to move normally and concurrently.

Additionally, because a FlexVol volume is tied directly to a physical aggregate and node, your NAS operations are also tied to that single aggregate or node. If you have a 10-node cluster, each with multiple aggregates, you might not be getting the most bang for your buck.

That’s where NetApp FlexGroup comes in.

FlexGroup has been designed to solve multiple issues in large-scale NAS workloads.

  • Capacity – Scales to multiple petabytes
  • High file counts – Hundreds of billions of files
  • Performance – parallelized operations in NAS workloads, across CPUs, nodes, aggregates and constituent member FlexVol volumes
  • Simplicity of deployment – Simple-to-use GUI in System Manager allows fast provisioning of massive capacity
  • Load balancing – Use all your cluster resources for a single namespace

With FlexGroup volumes, NAS workloads can now take advantage of every resource available in a cluster. Even with a single node cluster, a FlexGroup can balance workloads across multiple FlexVol constituents and aggregates.

How does a FlexGroup volume work at a high level?

FlexGroup volumes essentially take the already awesome concept of a FlexVol volume and enhance it by stitching together multiple FlexVol member constituents into a single namespace that acts like a single FlexVol to clients and storage administrators.

A FlexGroup volume would roughly look like this from an ONTAP perspective:

[Figure: a FlexGroup volume made up of multiple FlexVol member constituents across cluster nodes]

Files are not striped, but instead are placed systematically into individual FlexVol member volumes that work together under a single access point. This concept is very similar in function to a multiple FlexVol volume configuration, where volumes are junctioned together to simulate a large bucket.

However, multiple FlexVol volume configurations add complexity via junctions, export policies and manual decisions for volume placement across cluster nodes, as well as needing to re-design applications to point to a filesystem structure that is being defined by the storage rather than by the application.

To a NAS client, a FlexGroup volume would look like a single bucket of storage:

[Figure: the FlexGroup volume as seen by a NAS client – a single bucket of storage]

When a client creates a file in a FlexGroup, ONTAP decides which member FlexVol volume is the best possible container for that write based on a number of factors, such as capacity across members, throughput, and last-accessed times – basically, doing all the hard work for you. The idea is to keep the members as balanced as possible without hurting performance predictability at all and, in fact, increasing performance for some workloads.

The creates can arrive on any node in the cluster. Once the request arrives to the cluster, if ONTAP chooses a member volume that’s different than where the request arrived, a hardlink is created within ONTAP (remote or local, depending on the request) and the create is then passed on to the designated member volume. All of this is transparent to clients.

Reads and writes after a file is created will operate much like they already do in ONTAP FlexVols now; the system will tell the client where the file location is and point that client to that particular member volume. As such, you would see much better gains with initial file ingest versus reads/writes after the files have already been placed.

 

Why is this better?

 

When NAS operations can be allocated across multiple FlexVol volumes, we don’t run into the issue of serialization in the system. Instead, we start spreading the workload across multiple file systems (FlexVol volumes) joined together (the FlexGroup volume). And unlike Infinite Volumes, there is no concept of a single FlexVol volume to handle metadata operations – every member volume in a FlexGroup volume is eligible to process metadata operations. As a result, FlexGroup volumes perform better than Infinite Volumes in most cases.

What kind of performance boost are we potentially seeing?

In preliminary testing of a FlexGroup against a single FlexVol, we’ve seen up to 6x the performance. And that was with simple spinning SAS disk. This was the set up used:

  • Single FAS8080 node
  • SAS drives
  • 16 FlexVol member constituents
  • 2 aggregates
  • 8 members per aggregate

The workload used to test the FlexGroup was a software build using Git. In the graph below, we can see that operations such as checkout and clone show the biggest performance boosts, as they take far less time to run to completion on a FlexGroup than on a single FlexVol.

[Graph: Git operation completion times – FlexGroup vs. single FlexVol]

Adding more nodes and members can improve performance, and adding AFF into the mix can help latency. Here’s a similar test comparison with an AFF system. This test also used Git, but compiled gcc instead of the Linux source code to give us more files.

[Graph: Git workload completion times on AFF – single FlexVol vs. junctioned FlexVols vs. FlexGroup]

In this case, we see similar performance between a single FlexVol and FlexGroup. We do see slightly better performance with multiple FlexVols (junctioned), but doing that creates complexity and doesn’t offer a true single namespace of >100TB.

We also did some recent AFF testing with a Git workload. This time, the compile was the gcc library rather than a Linux compile, which gave us more files and folders to work with. The systems used were an AFF8080 (4 nodes) and an A700 (2 nodes).

[Graph: Git workload completion times – AFF8080 (4 nodes) vs. A700 (2 nodes)]

Simple management

FlexGroup volumes allow storage administrators to deploy multiple petabytes of storage to clients in a single container within a matter of seconds. This provides capacity, as well as similar performance gains you’d see with multiple junctioned FlexVol volumes. (FYI, a junction is essentially just mounting a FlexVol to a FlexVol)
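
To give a sense of how simple that provisioning is from the CLI, here’s a rough sketch – one command lays out member constituents across the aggregates you list. The names and size are made up, and the exact syntax varies a bit by ONTAP release:

cluster::> volume create -vserver SVM -volume fg01 -aggr-list aggr1_node1,aggr1_node2 -aggr-list-multiplier 8 -size 400TB -junction-path /fg01 -security-style unix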

In addition to that, there is compatibility out of the gate with OnCommand products. The OnCommand TME Yuvaraju B has created a video showing this, which you can see here:

Snapshots

This section was added after the blog post was originally published, per one of the blog comments. I simply forgot to mention it. 🙂

In the first release of NetApp FlexGroup, we’ll have access to snapshot functionality. Essentially, this works the same as regular snapshots in ONTAP – it’s done at the FlexVol level and will capture a point in time of the filesystem and lock blocks into place with pointers. I cover general snapshot technology in the blog post Snapshots and Polaroids: Neither Last Forever.

Because a FlexGroup is a collection of member FlexVols, we want to be sure snapshots are captured at the exact same time for filesystem consistency. As such, FlexGroup snapshots are coordinated by ONTAP to be taken at the same time. If a member FlexVol cannot take a snapshot for any reason, the FlexGroup snapshot fails and ONTAP cleans things up.

SnapMirror

FlexGroup supports SnapMirror for disaster recovery. This currently replicates up to 32 member volumes per FlexGroup (100 total per cluster) to a DR site. SnapMirror will take a snapshot of all member volumes at once and then do a concurrent transfer of the members to the DR site.
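
Setting that up looks just like FlexVol SnapMirror from the administrator’s point of view. A sketch with hypothetical source and destination names (cluster/SVM peering and a type DP destination volume are assumed to exist already; check TR-4557 for the supported policies and limits on your release):

dr_cluster::> snapmirror create -source-path SVM:fg01 -destination-path SVM_dr:fg01_dst
dr_cluster::> snapmirror initialize -destination-path SVM_dr:fg01_dst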

Automatic Incremental Resiliency

Also included in the FlexGroup feature is a new mechanism that seeks out metadata inconsistencies and fixes them when a client requests access, in real time. No outages. No interruptions. The entire FlexGroup remains online while this happens and the clients don’t even notice when a repair takes place. In fact, no one would know if we didn’t trigger a pesky EMS message to ONTAP to ensure a storage administrator knows we fixed something. Pretty underrated new aspect of FlexGroup, if you ask me.

How do you get NetApp FlexGroup?

NetApp FlexGroup is currently available in ONTAP 9.1 for general availability. It can be used by anyone, but should only be used for the specific use cases covered in the FlexGroup TR-4557. I also cover best practices in TR-4571.

In ONTAP 9.1, FlexGroup supports:

  • NFSv3 and SMB 2.x/3.x (RC2 for SMB support; see TR-4571 for feature support)
  • Snapshots
  • SnapMirror
  • Thin Provisioning
  • User and group quota reporting
  • Storage efficiencies (inline deduplication, compression, compaction; post-process deduplication)
  • OnCommand Performance Manager and System Manager support
  • All-flash FAS (incidentally, the *only* all-flash array that currently supports this scale)
  • Sharing SVMs with FlexVols
  • Constituent volume moves

To get more information, please email flexgroups-info@netapp.com.

What other ONTAP 9 features enhance NetApp FlexGroup volumes?

While FlexGroup as a feature is awesome on its own, there are also a number of ONTAP 9 features added that make a FlexGroup even more attractive, in my opinion.

I cover ONTAP 9 in ONTAP 9 RC1 is now available! but the features I think benefit FlexGroup right out of the gate include:

  • 15 TB SSDs – once we support flash, these will be a perfect fit for FlexGroup
  • Per-aggregate CPs – never bottleneck a node on an over-used aggregate again
  • RAID Triple Erasure Coding (RAID-TEC) – triple parity to add extra protection to your large data sets

Be sure to keep an eye out for more news and information regarding FlexGroup. If you have specific questions, I’ll answer them in the comments section (provided they’re not questions I’m not allowed to answer). 🙂

If you missed the NetApp Insight session I did on FlexGroup volumes, you can find session 60411-2 here:

https://www.brainshark.com/go/netapp-sell/insight-library.html?cf=12089#bsk-lightbox

(Requires a login)

Also, check out my blog on XCP, which I think would be a pretty natural fit for migration off existing NAS systems onto FlexGroup.

Behind the Scenes: Episode 57 – Scale Out Networking in ONTAP

Welcome to Episode 57, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

This week on the podcast, we invite Juan Mojica (@juan_m_mojica), Product Manager at NetApp, for a technical discussion about scale out networking in ONTAP. We cover IP Spaces, broadcast domains and subnets, as well as some other tidbits to help you understand how the network stack works in your cluster.

We originally had plans for another podcast on a new feature in ONTAP 9.1, but then we found out we couldn’t publish it until the week of Insight. So…. stay tuned! 😉

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

You can listen here:

What’s the deal with remote I/O in ONTAP?

I’m sure most of you have seen Seinfeld, so be sure to read the title in your head as if Seinfeld is delivering it.

I used a comedian as a starter because this post is about a question that I get asked – a lot – that is kind of a running joke by now.

The set up…

When Clustered Data ONTAP first came out, there was a pretty big kerfuffle (love that word) about the architecture of the OS. After all, wasn’t it just a bunch of 7-Mode systems stitched together with duct tape?

Actually, no.

It’s a complete re-write of the ONTAP operating system, for one. The NAS stack from 7-Mode was gutted and became a new architecture built for clustering.

Then, in 8.1, the SAN concepts in 7-Mode were re-done for clustering.

So, while a clustered Data ONTAP cluster is, at the hardware level, a series of HA pairs stitched together with a 10Gb network, the operating system has been turned into what I like to call a storage blade center. Your storage system spans a cluster of up to 24 physical hardware nodes, effectively abstracting away the hardware and allowing a single management plane for the entire subsystem.

Every node in a cluster is aware of every other node, as well as every other storage object. If a volume lives on node 1, then node 20 knows about it and where it lives via the concept of a replicated database (RDB).

Additionally, the cluster also has a clustered networking stack, where an IP address or WWPN is presented via a logical interface (a LIF). While SAN LIFs have to stay put and leverage host-side pathing for data locality, NAS LIFs have the ability to migrate across any node and any port in the cluster.

However, volumes are still located on physical disks and owned by physical nodes, even though you can move them around via volume move or vol rehost. LIFs are still located on physical ports and nodes, even though you can move them around and load balance connections on them. This raises the question…

What is the deal with remote I/O in ONTAP?

Since you can have multiple nodes in a cluster and a volume can only exist on one node (well, unless you want to check out FlexGroups), and since data LIFs live on single or aggregated ports on a single node, you are bound to run into scenarios where you end up traversing the backend cluster network for data operations. The alternatives are taking on the headache of ensuring every client mounts a specific IP address for data locality, or leveraging one of the data locality features in NAS, such as pNFS or node referrals on initial connection (available for NFSv4.x and CIFS/SMB). I cover some of the NFS-related data locality features in TR-4067, and CIFS autolocation is covered in TR-4191.
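
A quick way to see whether a given client is doing remote I/O is to compare which node owns the volume with where the client’s TCP connection landed. A sketch (the names and client IP are illustrative):

# Which node owns the volume?
cluster::> volume show -vserver SVM -volume unix -fields node

# Which node did the client's connection land on?
cluster::> network connections active show -remote-ip 10.228.225.140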

In SAN, we have ALUA to manage that locality (or optimized paths), but even adding an extra layer of protection in the form of protocol locality can’t avoid scenarios where interfaces go down or volumes move around after a TCP connection has been established.

That backend network? Why, it’s a 10Gb dedicated network with 2-4 dedicated ports per node. No other data is allowed on the network other than cluster operations. Data I/O traverses the network in a proprietary protocol known as SpinNP, which leverages TCP to guarantee the arrival of packets. And, with the advent of 40Gb Ethernet and other speedier methods of data transfer, I’d be shocked if we didn’t see that backend network improve over the next 5-10 years. The types of operations that traverse the cluster network include:

  • SpinNP for data/local snapmirror
  • ZAPI calls

That’s pretty much it. It’s a beefy, robust backend network that is *extremely* hard to saturate. You’re more likely to bottleneck somewhere else (like your client) before you overload a cluster network.

So now that we’ve established that remote I/O will likely happen, let’s talk about if that matters…

The punchline

Remote I/O absolutely adds overhead to operations. There’s no technical way around saying it. Suggesting there is no penalty would be dishonest. The amount of penalty, however, varies, depending on protocol. This is especially true when  you consider that NAS operations will leverage a fast path when you localize data.

But the question wasn’t “is there a penalty?” The question is “does it matter?”

I’ll answer with some anecdotal evidence – I spent 5 years in support, working on escalations for clustered Data ONTAP for 3 of those years. I closed thousands of cases over that time period. In that time, I *never* fixed a performance issue by making sure a customer used a local data path.  And believe me, it wasn’t for lack of effort. I *wanted* remote traffic to be the root cause, because that was the easy answer.

Sure, it could help when dealing with really low latency applications, such as Oracle. But in those cases, you architect the solution with data locality in mind. In the other vast majority of scenarios, the “remote I/O” penalty is pretty much irrelevant and causes more hand wringing than necessary.

The design of clustered Data ONTAP was intended to help storage administrators stop worrying about the layout of the data. Let’s start allowing it to do its job!