** Edited on April 2, 2021 **
Funny story about this post. Someone pointed out I had some broken links, so I went in and edited the links. When I clicked “publish” it re-posted the article, which was actually a pointer back to an old DatacenterDude article I wrote from 2015 – which no longer exists. So I started getting *more* pings about broken links and plenty of people seemed to be interested in the content. Thanks to the power of the Wayback Machine, I was able to resurrect the post and decided to do some modernization while I was at it.
Yesterday, I was speaking with a customer who is a cloud provider. They were discussing how to use NFSv4 with Data ONTAP for one of their customers. As we were talking, I brought up pNFS and its capabilities. They were genuinely excited about what pNFS could do for their particular use case. In the cloud, the idea is to remove the overhead of managing infrastructure, so most cloud architectures are geared towards automation, limiting management, etc. In most cases, that’s great, but for data locality in NAS environments, we need a way to make those operations seamless, as well as providing the best possible security available. That’s where pNFS comes in.
So, let’s talk about what pNFS is and in what use cases you may want to use it.
What is pNFS?
pNFS is “parallel NFS,” which is a little bit of a misnomer in ONTAP, as it doesn’t do parallel reads and writes across single files (i.e., striping). In the case of pNFS on Data ONTAP, NetApp currently supports file-level pNFS, so the object store would be a flexible volume on an aggregate of physical disks.
pNFS in ONTAP establishes a metadata path to the NFS server and then splits off the data path to its own dedicated path. The client works with the NFS server to determine which path is local to the physical location of the files in the NFS filesystem via GETDEVICEINFO and LAYOUTGET metadata calls (specific to NFSv4.1 and later) and then dynamically redirects the path to be local. Think of it as ALUA for NFS.
The following graphic shows how that all takes place.
pNFS defines the notion of a device that is generated by the server (that is, an NFS server running on Data ONTAP) and sent to the client. This process helps the client locate the data and send requests directly over the path local to that data. Data ONTAP generates one pNFS device per flexible volume. The metadata path does not change, so metadata requests might still be remote. In a Data ONTAP pNFS implementation, every data LIF is considered an NFS server, so pNFS only works if each node owns at least one data LIF per NFS SVM. Doing otherwise negates the benefits of pNFS, which is data locality regardless of which IP address a client connects to.
The pNFS device contains information about the following:
- Volume constituents
- Network location of the constituents
The device information is cached to the local node for improved performance.
To see pNFS devices in the cluster, use the following command at diag privilege level:
cluster::> set diag
cluster::*> vserver nfs pnfs devices cache show
There are three main components of pNFS:
- Metadata server
  - Handles all nondata traffic such as GETATTR, SETATTR, and so on
  - Responsible for maintaining metadata that informs the clients of the file locations
  - Located on the NetApp NFS server and established via the mount point
- Data server
  - Stores file data and responds to READ and WRITE requests
  - Located on the NetApp NFS server
  - Inode information also resides here
- Client
  - Sends metadata requests to the metadata server and READ and WRITE requests directly to the data server
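As a rough illustration of that split, here's a minimal Python sketch of which server role handles a given operation. The operation names come from the traces later in this post; the function name and the rule "only READ and WRITE go to the data server" are my own simplification, not ONTAP internals.

```python
# Hypothetical sketch: which pNFS server role handles an NFSv4.1 operation.
# Simplification: only READ and WRITE use the data path.
DATA_OPS = {"READ", "WRITE"}


def server_role(operation: str) -> str:
    """Return 'data server' for READ/WRITE and 'metadata server' otherwise."""
    return "data server" if operation.upper() in DATA_OPS else "metadata server"


for op in ("GETATTR", "SETATTR", "LAYOUTGET", "READ", "WRITE"):
    print(f"{op:10} -> {server_role(op)}")
```

Note that the layout calls themselves (LAYOUTGET and friends) count as metadata traffic, which matters later when we talk about workloads.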
How Can I Tell pNFS is Being Used?
To check if pNFS is in use, you can run statistics counters and check the “pnfs_layout_conversions” counter. If the value of pnfs_layout_conversions is incrementing, then pNFS is in use. Keep in mind that if you try to use pNFS with a single network interface, the data layout conversations won’t take place and pNFS won’t be used, even if it’s enabled.
cluster::*> statistics start -object nfsv4_1_diag
cluster::*> statistics show -object nfsv4_1_diag -counter pnfs_layout_conversions
Start-time: 4/9/2020 16:29:50
End-time: 4/9/2020 16:31:03
Gotta keep ’em separated!
One thing that is beneficial about the design of pNFS is that the metadata paths are separated from the read/write paths. Once a mount is established, the metadata path is set on the IP address used for mount and does not move without manual intervention. In Data ONTAP, that path could live anywhere in the cluster. (Up to 24 physical nodes with multiple ports on each node!)
That buys you resiliency, as well as flexibility to control where the metadata will be served.
The data path, however, will only be established on reads and writes. That path is determined in conversations between the client and server and is dynamic. Any time the physical location of a volume changes, the data path changes automatically, with no intervention needed from the clients or the storage administrator. So, unlike NFSv3 or even NFSv4, you no longer would need to break the TCP connection to move the path for reads and writes to be local (via unmount or LIF migrations). And with NFSv4.x, the statefulness of the connection can be preserved.
That means more time for everyone. Data can be migrated in real time, non-disruptively, based on the storage needs of the client.
For example, I have a volume that lives on node cluster01 of my cDOT cluster:
cluster::> vol show -vserver SVM -volume unix -fields node
  (volume show)
vserver volume node
------- ------ --------------
SVM     unix   cluster01
I have data LIFs on each node in my cluster:
cluster::> net int show -vserver SVM
  (network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM
            data1      up/up      10.63.57.237/18    cluster01     e0c     true
            data2      up/up      10.63.3.68/18      cluster02     e0c     true
2 entries were displayed.
In the above list:
- 10.63.3.68 will be my metadata path, since that’s where I mounted.
- 10.63.57.237 will be my data path, as it is local to the physical node cluster01, where the volume lives.
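To make the selection logic concrete, here's a small Python sketch of how the two paths fall out of the volume location and the LIF table. The volume, LIF names, and addresses are the ones from my lab output; the function name and the fallback behavior when a node has no local LIF are assumptions for illustration only.

```python
# Toy model of the example cluster. Volume, LIF names, and addresses are taken
# from the CLI output in this post; the function itself is hypothetical.
VOLUME_NODE = {"unix": "cluster01"}
DATA_LIFS = {
    "data1": {"ip": "10.63.57.237", "node": "cluster01"},
    "data2": {"ip": "10.63.3.68", "node": "cluster02"},
}


def paths_for_mount(volume, mount_ip):
    """Return (metadata_ip, data_ip) for a client that mounted via mount_ip."""
    owner = VOLUME_NODE[volume]
    local_ips = [lif["ip"] for lif in DATA_LIFS.values() if lif["node"] == owner]
    # Assumption: with no data LIF on the owning node there is nothing to
    # redirect to, so reads and writes stay on the mount address.
    data_ip = local_ips[0] if local_ips else mount_ip
    return mount_ip, data_ip


metadata_ip, data_ip = paths_for_mount("unix", "10.63.3.68")
print("metadata path:", metadata_ip)
print("data path:    ", data_ip)
```

The metadata path is simply whatever address was used for the mount; only the data path depends on where the volume lives.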
When I mount, the TCP connection is established to the node where the data LIF lives:
nfs-client# mount -o minorversion=1 10.63.3.68:/unix /unix

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver    Interface              Remote
Name       Name:Local Port        Host:Port                    Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster02
SVM        data2:2049             nfs-client.domain.netapp.com:912 TCP/nfs
My metadata path is established to cluster02, but my data volume lives on cluster01.
On a basic cd and ls into the mount, all the traffic is seen on the metadata path. (stuff like GETATTR, ACCESS, etc):
83 6.643253 10.228.225.140 10.63.3.68 NFS 270 V4 Call (Reply In 85) GETATTR
85 6.648161 10.63.3.68 10.228.225.140 NFS 354 V4 Reply (Call In 83) GETATTR
87 6.652024 10.228.225.140 10.63.3.68 NFS 278 V4 Call (Reply In 88) ACCESS
88 6.654977 10.63.3.68 10.228.225.140 NFS 370 V4 Reply (Call In 87) ACCESS
When I start I/O to that volume, the path gets updated to the local path by way of new pNFS calls (specified in RFC 5661):
28 2.096043 10.228.225.140 10.63.3.68 NFS 314 V4 Call (Reply In 29) LAYOUTGET
29 2.096363 10.63.3.68 10.228.225.140 NFS 306 V4 Reply (Call In 28) LAYOUTGET
30 2.096449 10.228.225.140 10.63.3.68 NFS 246 V4 Call (Reply In 31) GETDEVINFO
31 2.096676 10.63.3.68 10.228.225.140 NFS 214 V4 Reply (Call In 30) GETDEVINFO
- In LAYOUTGET, the client asks the server, “where does this filehandle live?”
- The server responds with the device ID and physical location of the filehandle.
- Then, the client asks, “what devices to access that physical data are available to me?” via GETDEVINFO.
- The server responds with the list of available devices/IP addresses.
Once that communication takes place (and note that the conversation occurs in sub-millisecond times), the client then establishes the new TCP connection for reads and writes:
32 2.098771 10.228.225.140 10.63.57.237 TCP 74 917 > nfs [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=937300318 TSecr=0 WS=128
33 2.098996 10.63.57.237 10.228.225.140 TCP 78 nfs > 917 [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=2452178641 TSecr=937300318
34 2.099042 10.228.225.140 10.63.57.237 TCP 66 917 > nfs [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=937300318 TSecr=2452178641
And we can see the connection established on the cluster to both the metadata and data locations:
cluster::> network connections active show -remote-ip 10.228.225.140
Vserver    Interface              Remote
Name       Name:Local Port        Host:Port                    Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster01
SVM        data1:2049             nfs-client.domain.netapp.com:917 TCP/nfs
Node: cluster02
SVM        data2:2049             nfs-client.domain.netapp.com:912 TCP/nfs
Then we start our data transfer on the new path (data path 10.63.57.237):
38 2.099798 10.228.225.140 10.63.57.237 NFS 250 V4 Call (Reply In 39) EXCHANGE_ID
39 2.100137 10.63.57.237 10.228.225.140 NFS 278 V4 Reply (Call In 38) EXCHANGE_ID
40 2.100194 10.228.225.140 10.63.57.237 NFS 298 V4 Call (Reply In 42) CREATE_SESSION
42 2.100537 10.63.57.237 10.228.225.140 NFS 194 V4 Reply (Call In 40) CREATE_SESSION
157 2.106388 10.228.225.140 10.63.57.237 NFS 15994 V4 Call (Reply In 178) WRITE StateID: 0x0d20 Offset: 196608 Len: 65536
163 2.106421 10.63.57.237 10.228.225.140 NFS 182 V4 Reply (Call In 127) WRITE
If I do a chmod later, the metadata path is used (10.63.3.68):
341 27.268975 10.228.225.140 10.63.3.68 NFS 310 V4 Call (Reply In 342) SETATTR FH: 0x098eaec9
342 27.273087 10.63.3.68 10.228.225.140 NFS 374 V4 Reply (Call In 341) SETATTR | ACCESS
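If you want to check this split in your own captures, a short script can group the NFS calls by the address they were sent to. This is only a sketch: it assumes the capture was exported as text with one packet per line in the same column order as the traces in this post, and the regular expression and function name are mine. The two IP addresses are the LIFs from my lab.

```python
import re
from collections import defaultdict

METADATA_IP = "10.63.3.68"   # LIF used for the mount
DATA_IP = "10.63.57.237"     # LIF local to the volume

# A few call lines copied from the traces in this post.
TRACE = """\
83 6.643253 10.228.225.140 10.63.3.68 NFS 270 V4 Call (Reply In 85) GETATTR
28 2.096043 10.228.225.140 10.63.3.68 NFS 314 V4 Call (Reply In 29) LAYOUTGET
30 2.096449 10.228.225.140 10.63.3.68 NFS 246 V4 Call (Reply In 31) GETDEVINFO
157 2.106388 10.228.225.140 10.63.57.237 NFS 15994 V4 Call (Reply In 178) WRITE StateID: 0x0d20 Offset: 196608 Len: 65536
341 27.268975 10.228.225.140 10.63.3.68 NFS 310 V4 Call (Reply In 342) SETATTR FH: 0x098eaec9
"""

# Columns: packet number, time, source, destination, protocol, length, info.
CALL = re.compile(
    r"^\d+\s+[\d.]+\s+(\S+)\s+(\S+)\s+NFS\s+\d+\s+V4 Call(?: \(Reply In \d+\))?\s+(\w+)"
)


def ops_by_path(trace):
    """Group NFSv4 call names by whether they went to the metadata or data LIF."""
    paths = defaultdict(list)
    for line in trace.splitlines():
        match = CALL.match(line)
        if not match:
            continue  # skip replies, TCP handshakes, and anything unparseable
        _source, destination, operation = match.groups()
        label = {METADATA_IP: "metadata", DATA_IP: "data"}.get(destination, "other")
        paths[label].append(operation)
    return dict(paths)


print(ops_by_path(TRACE))
```

With the sample lines above, only the WRITE call lands on the data path; everything else, including the layout calls, stays on the mount address.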
How do I make sure metadata connections don’t pile up?
When you have many clients mounting to an NFS server, you generally want to try to control which nodes those clients are mounting to. In the cloud, this becomes trickier to do, as clients and storage system management may be handled by the cloud providers. So, we’d want to have a noninteractive way to do this.
With ONTAP, you have two options to load balance TCP connections for metadata. You can use the tried and true DNS round-robin method, but the NFS server doesn’t have any idea what IP addresses have been issued by the DNS server, so as a result, there are no guarantees the connections won’t pile up.
Another way to deal with connections is to leverage the ONTAP feature for on-box DNS load balancing. This feature allows storage administrators to set up a DNS forwarding zone on a DNS server (BIND, Active Directory or otherwise) to forward requests to the clustered Data ONTAP data LIFs, which can act as DNS servers complete with SOA records! The cluster will determine which IP address to issue to a client based on the following factors:
- CPU load
- overall node throughput
This helps ensure that any TCP connection that is established is done so in a logical manner based on performance of the physical hardware.
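To show the difference between the two approaches, here's a deliberately simple Python sketch. It is not how ONTAP computes its answer: the load figures are made up, and weighting CPU and throughput equally is purely my assumption for illustration.

```python
from itertools import cycle

# Hypothetical per-LIF load figures (0.0 = idle, 1.0 = saturated).
# These are invented numbers, not real ONTAP metrics.
LIF_LOAD = {
    "10.63.57.237": {"cpu": 0.85, "throughput": 0.70},
    "10.63.3.68": {"cpu": 0.20, "throughput": 0.30},
}


def round_robin(addresses):
    """Plain DNS round robin: hand out addresses in turn, ignoring load."""
    return cycle(addresses)


def least_loaded(load):
    """Load-aware choice: return the address with the lowest combined score.

    Equal weighting of CPU and throughput is an assumption for illustration.
    """
    return min(load, key=lambda ip: load[ip]["cpu"] + load[ip]["throughput"])


rr = round_robin(sorted(LIF_LOAD))
print("round robin:", [next(rr) for _ in range(4)])
print("load aware: ", least_loaded(LIF_LOAD))
```

Round robin keeps handing out the busy address every other time, while the load-aware picker steers new mounts to the quieter node.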
I cover both types of DNS load balancing in TR-4523: DNS Load Balancing in ONTAP.
What about that data agility?
What’s great about pNFS is that it is a perfect fit for storage operating systems like ONTAP. NetApp and Red Hat worked together closely on the protocol enhancement, and it shows in its overall implementation.
In ONTAP, there is the concept of non-disruptive volume moves. This feature gives storage administrators agility and flexibility in their clusters, as well as enabling service and cloud providers a way to charge based on tiers (pay as you grow!).
For example, if I am a cloud provider, I could have a 24-node cluster as a backend. Some HA pairs could be All-Flash FAS (AFF) nodes for high-performance/low latency workloads. Some HA pairs could be SATA or SAS drives for low performance/high capacity/archive storage. If I am providing storage to a customer that wants to implement high performance computing applications, I could sell them the performance tier. If those applications are only going to run during the summer months, we can use the performance tier, and after the jobs are complete, we can move them back to SATA/SAS drives for storage and even SnapMirror or SnapVault them off to a DR site for safekeeping. Once the job cycle comes back around, I can nondisruptively move the volumes back to flash. That saves the customer money, as they only pay for the performance they’re using, and that saves the cloud provider money since they can free up valuable flash real estate for other customers that need performance tier storage.
What happens when a volume moves in pNFS?
When a volume move occurs, the client is notified of the change via the pNFS calls I mentioned earlier. When the client attempts to OPEN the file for writing, the server responds, “that file is somewhere else now.”
220 24.971992 10.228.225.140 10.63.3.68 NFS 386 V4 Call (Reply In 221) OPEN DH: 0x76306a29/testfile3
221 24.981737 10.63.3.68 10.228.225.140 NFS 482 V4 Reply (Call In 220) OPEN StateID: 0x1077
The client says, “cool, where is it now?”
222 24.992860 10.228.225.140 10.63.3.68 NFS 314 V4 Call (Reply In 223) LAYOUTGET
223 25.005083 10.63.3.68 10.228.225.140 NFS 306 V4 Reply (Call In 222) LAYOUTGET
224 25.005268 10.228.225.140 10.63.3.68 NFS 246 V4 Call (Reply In 225) GETDEVINFO
225 25.005550 10.63.3.68 10.228.225.140 NFS 214 V4 Reply (Call In 224) GETDEVINFO
Then the client uses the new path to start writing, with no interaction needed.
251 25.007448 10.228.225.140 10.63.57.237 NFS 7306 V4 Call WRITE StateID: 0x15da Offset: 0 Len: 65536
275 25.007987 10.228.225.140 10.63.57.237 NFS 7306 V4 Call WRITE StateID: 0x15da Offset: 65536 Len: 65536
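Here's a toy Python model of that sequence, just to show the moving parts. The class and method names are hypothetical, the starting location of the volume (cluster02) is an assumption so that the move ends on 10.63.57.237 like the trace above, and the client asks for a layout before every write, which is simpler than what a real client does.

```python
class ToyCluster:
    """Tracks which node owns each volume and which data LIF lives on each node."""

    def __init__(self):
        self.node_lif = {"cluster01": "10.63.57.237", "cluster02": "10.63.3.68"}
        self.volume_node = {"unix": "cluster02"}  # assumed starting location

    def layoutget(self, volume):
        """Stand-in for LAYOUTGET + GETDEVINFO: return the LIF local to the volume."""
        return self.node_lif[self.volume_node[volume]]

    def volume_move(self, volume, node):
        """Relocate the volume; clients are not touched."""
        self.volume_node[volume] = node


class ToyClient:
    def __init__(self, cluster, mount_ip):
        self.cluster = cluster
        self.metadata_ip = mount_ip  # set at mount time and never changed here

    def write(self, volume):
        """Fetch the layout, then report which address the write would use."""
        return self.cluster.layoutget(volume)


cluster = ToyCluster()
client = ToyClient(cluster, mount_ip="10.63.3.68")

before = client.write("unix")
cluster.volume_move("unix", "cluster01")
after = client.write("unix")

print("metadata path:", client.metadata_ip)
print("data path before move:", before)
print("data path after move: ", after)
```

The only thing that changes across the move is the address returned by the layout lookup; the mount, and with it the metadata path, stays put.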
Automatic Data Tiering
If you have an on-premises storage system and want to save storage infrastructure costs by automatically tiering cold data to the cloud or to an on-premises object storage system, you could leverage NetApp FabricPool, which allows you to set tiering policies to chunk off cold blocks of data to more cost effective storage and then retrieve those blocks whenever they are requested by the end user. Again, we’re taking the guesswork and labor out of data management, which is becoming critical in a world driven towards managed services.
For more information on FabricPool:
What about FlexGroup volumes?
As of ONTAP 9.7, NFSv4.1 and pNFS are supported with FlexGroup volumes, which is an intriguing solution.
Part of the challenge of a FlexGroup volume is that you’re guaranteed to have remote I/O across a cluster network when you span multiple nodes. But since pNFS automatically redirects traffic to local paths, you can greatly reduce the amount of intracluster traffic.
A FlexGroup volume operates as a single entity, but is constructed of multiple FlexVol member volumes. Each member volume contains unique files that are not striped across volumes. When NFS operations connect to FlexGroup volumes, ONTAP handles the redirection of operations over a cluster network.
With pNFS, these remote operations are reduced, because the data layout mappings track the member volume locations and local network interfaces; they also redirect reads/writes to the local member volume inside a FlexGroup volume, even though the client only sees a single namespace. This approach enables a scale-out NFS solution that is more seamless and easier to manage, and it also reduces cluster network traffic and balances data network traffic more evenly across nodes.
FlexGroup pNFS differs a bit from FlexVol pNFS. Even though FlexGroup load-balances between metadata servers for file opens, pNFS uses a different algorithm. pNFS tries to direct traffic to the node on which the target file is located. If multiple data interfaces per node are given, connections can be made to each of the LIFs, but only one of the LIFs of the set is used to direct traffic to volumes per network interface.
What workloads should I use with pNFS?
pNFS is leveraging NFSv4.1 and later as its protocol, which means you get all the benefits of NFSv4.1 (security, Kerberos and lock integration, lease-based locks, delegations, ACLs, etc.). But you also get the potential negatives of NFSv4.x, such as higher overhead for operations due to the compound calls, state ID handling, locking, etc. and disruptions during storage failovers that you wouldn’t see with NFSv3 due to the stateful nature of NFSv4.x.
Performance can be severely impacted with some workloads, such as high file count workloads/high metadata workloads (think EDA, software development, etc). Why? Well, recall that pNFS is parallel for reads and writes – but the metadata operations still use a single interface for communication. So if your NFS workload is 80% GETATTR, then 80% of your workload won’t benefit from the localization and load balancing that pNFS provides. Instead, you’ll be using NFSv4.1 as if pNFS were disabled.
Plus, with millions of files, even if you’re doing heavy reads and writes, you’re redirecting paths constantly with pNFS (creating millions of GETDEVICEINFO and LAYOUTGET calls), which may prove less efficient than simply using NFSv4.1 without pNFS.
pNFS also would need to be supported by the clients you’re using, so if you want to use it for something like VMware datastores, you’re out of luck (for now). VMware currently supports NFSv4.1, but not pNFS (they went with session trunking, which ONTAP does not currently support).
File-based pNFS works best with workloads that do a lot of sequential IO, such as databases, Hadoop/Apache Spark, AI training workloads, or other large file workloads, where reads and writes dominate the IO.
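A bit of back-of-the-envelope math shows why file size matters so much here. The sketch below compares the number of layout calls to the number of READ/WRITE calls for a single file. It assumes one LAYOUTGET plus one GETDEVINFO per file and ignores any caching the client or server might do, so treat the numbers as a rough upper bound; the function name is mine.

```python
KIB = 1024


def layout_overhead(file_size, io_size, layout_calls_per_file=2):
    """Ratio of pNFS layout calls to READ/WRITE calls for one file.

    Assumes one LAYOUTGET and one GETDEVINFO per file and no caching.
    """
    if file_size <= 0 or io_size <= 0:
        raise ValueError("file_size and io_size must be positive")
    data_calls = -(-file_size // io_size)  # ceiling division
    return layout_calls_per_file / data_calls


# A 512K file written in 64K chunks: 8 writes for 2 layout calls.
small = layout_overhead(512 * KIB, 64 * KIB)
# A 1GB file written in 64K chunks: 16384 writes for 2 layout calls.
large = layout_overhead(1024 * 1024 * KIB, 64 * KIB)

print(f"512K file: {small:.2%} extra calls")
print(f"1GB file:  {large:.4%} extra calls")
```

For small files the layout conversation is a meaningful slice of the total work; for large sequential files it all but disappears.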
What about the performance?
In TR-4067, I did some basic performance testing on NFSv3 vs. NFSv4.1 for those types of workloads and the results were that pNFS stacked up nicely with NFSv3.
These tests were done using dd in parallel to simulate a sequential I/O workload. They aren’t intended to show the upper limits of the system (I used an AFF 8040 and some VM clients with low RAM and 1GB networks), but rather to show an apples-to-apples comparison of NFSv3 and NFSv4.1 with and without pNFS, using different wsize/rsize values. Be sure to do your own tests before implementing in production.
Note that our completion time for this workload using pNFS was a full 5 minutes faster than NFSv3 using a 1MB wsize/rsize value.
|Test (wsize/rsize setting)|Completion Time|
|---|---|
|NFSv4.1 (1MB; pNFS)|10m54s|
|NFSv4.1 (256K; pNFS)|12m16s|
|NFSv4.1 (64K; pNFS)|13m57s|
|NFSv4.1 (1MB; delegations)|13m6s|
|NFSv4.1 (256K; delegations)|15m25s|
|NFSv4.1 (64K; delegations)|13m48s|
|NFSv4.1 (1MB; pNFS + delegations)|11m7s|
|NFSv4.1 (256K; pNFS + delegations)|13m26s|
|NFSv4.1 (64K; pNFS + delegations)|10m6s|
The IOPS were lower overall for NFSv4.1 than NFSv3; that’s because NFSv4.1 combines operations into single packets. Thus, NFSv4.1 will be less chatty over the network than NFSv3. On the downside, the payloads are larger, so the NFS server has more processing to do for each packet, which can impact CPU, and with more IOPS, you can see a drop in performance due to that overhead.
Where NFSv4.1 beat out NFSv3 was with the latency and throughput – since we can guarantee data locality, we get benefits of fastpathing the reads/writes to the files, rather than the extra processing needed to traverse the cluster network.
|Test (wsize/rsize setting)|Average Read Latency (ms)|Average Read Throughput (MB/s)|Average Write Latency (ms)|Average Write Throughput (MB/s)|Average Ops|
|---|---|---|---|---|---|
|NFSv4.1 (1MB; pNFS)|3.6|840|26.8|1370|818|
|NFSv4.1 (256K; pNFS)|1.1|807|5.2|1560|2410|
|NFSv4.1 (64K; pNFS)|0.1|835|1.9|1490|8526|
(1MB; pNFS + delegations)
(256K; pNFS + delegations)
(64K; pNFS + delegations)
For high file count workloads, NFSv3 did much better. This test created 800,000 small files (512K) in parallel. For this high metadata workload, NFSv3 completed 2x as fast as NFSv4.1. pNFS added some time savings versus NFSv4.1 without pNFS, but overall, we can see where we may run into problems with this type of workload. Future releases of ONTAP will get better with this type of workload using NFSv4.1 (these tests were on 9.7).
|Test (wsize/rsize setting)|Completion Time|CPU %|Average throughput (MB/s)|Average total IOPS|
|---|---|---|---|---|
|NFSv4.1 pNFS (1MB)|35m44s|27%|171|8330|
|NFSv4.1 pNFS (256K)|35m9s|28.5%|175|8894|
|NFSv4.1 pNFS (64K)|36m41s|33%|171|10751|
One of the keys to pNFS performance is parallelization of operations across volumes, nodes, etc. But it doesn’t necessarily parallelize network connections across these workloads. That’s where the new NFS mount option nconnect comes in.
The purpose of nconnect is to provide multiple TCP connections per mount point on a client. This helps increase parallelism and performance for NFS mounts – particularly for single client workloads. Details about nconnect and how it can increase performance for NFS in Cloud Volumes ONTAP can be found in the blog post The Real Baseline Performance Story: NetApp Cloud Volumes Service for AWS. ONTAP 9.8 offers official support for the use of nconnect with NFS mounts, provided the NFS client also supports it. If you would like to use nconnect, check to see if your client version provides it and use ONTAP 9.8 or later. ONTAP 9.8 and later supports nconnect by default with no option needed.
Client support for nconnect varies, but the latest RHEL 8.3 release supports it, as do the latest Ubuntu and SLES releases. Be sure to verify if your OS vendor supports it.
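For reference, here's a small Python helper that assembles a Linux mount command with NFSv4.1 and nconnect. It's a sketch: the helper name is mine, the server address and paths are the ones from my lab earlier in this post, and the value 8 is an arbitrary example rather than a recommendation. The 1 to 16 range reflects the limit in the Linux NFS client; the script only builds the command and does not run it.

```python
import shlex


def nfs_mount_command(server, export, mountpoint, nconnect=1):
    """Build (but do not run) a Linux mount command for NFSv4.1.

    nconnect must be between 1 and 16, the range the Linux client accepts.
    """
    if not 1 <= nconnect <= 16:
        raise ValueError("nconnect must be between 1 and 16")
    options = ["vers=4.1"]
    if nconnect > 1:
        options.append(f"nconnect={nconnect}")
    return ["mount", "-t", "nfs", "-o", ",".join(options), f"{server}:{export}", mountpoint]


command = nfs_mount_command("10.63.3.68", "/unix", "/unix", nconnect=8)
print(shlex.join(command))
```

No extra option is needed to request pNFS on the client side of this command; whether it actually gets used depends on the server offering it and the client supporting it.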
Our Customer Proof of Concept lab (CPOC) did some benchmarking of nconnect with NFSv3 and pNFS using a sequential I/O workload on ONTAP 9.8 and saw some really promising results.
- Single NVIDIA DGX-2 client
- Ubuntu 20.04.2
- NFSv4.1 with pNFS and nconnect
- AFF A400 cluster
- NetApp FlexGroup volume
- 256K wsize/rsize
- 100GbE connections
- 32 x 1GB files
In these tests, the following throughput results were seen. Latency for both was sub-1ms.
In these tests, NFSv4.1 with pNFS doubled the performance for the sequential read workload at 250us latency. Since the files were 1GB in size, the reads were almost entirely from the controller RAM, but it’s not unreasonable to see that as the reality for a majority of workloads, as most systems have enough RAM to see similar results.
David Arnette and I discuss it a bit in this podcast:
Note: Benchmark tests such as SAS iotest will purposely recommend setting file sizes larger than the system RAM to avoid any caching benefits and instead will measure the network bandwidth of the transfer. In real world application scenarios, RAM, network, storage and CPU are all working together to create the best possible performance scenarios.
pNFS Best Practices with ONTAP
pNFS best practices in ONTAP don’t differ much from normal NAS best practices, but here are a few to keep in mind. In general:
- Use the latest supported client OS version.
- Use the latest supported ONTAP patch release.
- Create a data LIF per node, per SVM to ensure data locality for all nodes.
- Avoid using LIF migration on the metadata server data LIF, because NFSv4.1 is a stateful protocol and LIF migrations can cause brief outages as the NFS states are reestablished.
- In environments with multiple NFSv4.1 clients mounting, balance the metadata server connections across multiple nodes to avoid piling up metadata operations on a single node or network interface.
- If possible, avoid using multiple data LIFs on the same node in an SVM.
- In general, avoid mounting NFSv3 and NFSv4.x on the same datasets. If you can’t avoid this, check with the application vendor to ensure that locking can be managed properly.
- If you’re using NFS referrals with pNFS, keep in mind that referrals establish a local metadata server, but data I/O is still redirected. With FlexGroup volumes, the member volumes might live on multiple nodes, so NFS referrals aren’t of much use. Instead, use DNS load balancing to spread out connections.
Drop any questions into the comments below!