How pNFS could benefit cloud architecture

** Edited on April 2, 2021 **
Funny story about this post. Someone pointed out I had some broken links, so I went in and edited the links. When I clicked “publish” it re-posted the article, which was actually a pointer back to an old DatacenterDude article I wrote from 2015 – which no longer exists. So I started getting *more* pings about broken links and plenty of people seemed to be interested in the content. Thanks to the power of the Wayback Machine, I was able to resurrect the post and decided to do some modernization while I was at it.

Yesterday, I was speaking with a customer who is a cloud provider. They were discussing how to use NFSv4 with Data ONTAP for one of their customers. As we were talking, I brought up pNFS and its capabilities. They were genuinely excited about what pNFS could do for their particular use case. In the cloud, the idea is to remove the overhead of managing infrastructure, so most cloud architectures are geared towards automation, limiting management, etc. In most cases, that’s great, but for data locality in NAS environments, we need a way to make those operations seamless, as well as provide the best possible security. That’s where pNFS comes in.


So, let’s talk about what pNFS is and in what use cases you may want to use it.

What is pNFS?

pNFS is “parallel NFS,” which is a little bit of a misnomer in ONTAP, as it doesn’t do parallel reads and writes across single files (i.e., striping). In the case of pNFS on Data ONTAP, NetApp currently supports file-level pNFS, so the pNFS data target is a flexible volume on an aggregate of physical disks.

pNFS in ONTAP establishes a metadata path to the NFS server and then splits off the data path to its own dedicated path. The client works with the NFS server to determine which path is local to the physical location of the files in the NFS filesystem via GETDEVICEINFO and LAYOUTGET metadata calls (specific to NFSv4.1 and later) and then dynamically redirects the data path to be local. Think of it as ALUA for NFS.
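
If you want to try this on a test system, pNFS is enabled per SVM along with NFSv4.1, and on most modern Linux clients no pNFS-specific mount option is needed; the client negotiates layouts automatically. A minimal sketch (SVM name, export path, and address mirror the examples later in this post):

cluster::> vserver nfs modify -vserver SVM -v4.1 enabled -v4.1-pnfs enabled

nfs-client# mount -t nfs -o vers=4.1 10.63.3.68:/unix /unix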

The following graphic shows how that all takes place.

[Diagram: pNFS metadata path and data path]

pNFS defines the notion of a device that is generated by the server (that is, an NFS server running on Data ONTAP) and sent to the client. This process helps the client locate the data and send requests directly over the path local to that data. Data ONTAP generates one pNFS device per flexible volume. The metadata path does not change, so metadata requests might still be remote. In a Data ONTAP pNFS implementation, every data LIF is considered an NFS server, so pNFS only delivers its benefits if each node owns at least one data LIF per NFS SVM. Doing otherwise negates the main benefit of pNFS, which is data locality regardless of which IP address a client connects to.

The pNFS device contains information about the following:

  • Volume constituents
  • Network location of the constituents

The device information is cached to the local node for improved performance.

To see pNFS devices in the cluster, use the following command in advanced privilege:

cluster::> set diag
cluster::*> vserver nfs pnfs devices cache show
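
Since the locality benefit depends on each node owning a data LIF in the SVM, it’s worth confirming (and fixing) that before leaning on pNFS. A hedged sketch with made-up node, port, and address values (newer releases use -service-policy default-data-files instead of -role/-data-protocol):

cluster::> network interface create -vserver SVM -lif data3 -role data -data-protocol nfs -home-node cluster03 -home-port e0c -address 10.63.4.10 -netmask 255.255.192.0
cluster::> network interface show -vserver SVM -fields home-node,address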

pNFS Components

There are three main components of pNFS:

  • Metadata server
    • Handles all nondata traffic such as GETATTR, SETATTR, and so on
    • Responsible for maintaining metadata that informs the clients of the file locations
    • Located on the NetApp NFS server and established via the mount point
  • Data server
    • Stores file data and responds to READ and WRITE requests
    • Located on the NetApp NFS server
    • Inode information also resides here
  • Clients

pNFS is covered in further detail in NetApp TRs 4067 (NFS) and 4571 (FlexGroup volumes).

How Can I Tell pNFS is Being Used?

To check whether pNFS is in use, you can start the statistics counters and watch the pnfs_layout_conversions counter. If pnfs_layout_conversions is incrementing, pNFS is in use. Keep in mind that if you try to use pNFS with a single network interface, the data layout conversions won’t take place and pNFS won’t be used, even if it’s enabled.

cluster::*> statistics start -object nfsv4_1_diag
cluster::*> statistics show -object nfsv4_1_diag -counter pnfs_layout_conversions


Object: nfsv4_1_diag
Instance: nfs4_1_diag
Start-time: 4/9/2020 16:29:50
End-time: 4/9/2020 16:31:03
Elapsed-time: 73s
Scope: node1

    Counter                                                     Value
   -------------------------------- --------------------------------
   pnfs_layout_conversions                                      4053
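
You can also sanity-check from the client side. On many Linux clients, the mount options and the per-operation counters will show whether layouts are actually being handed out (exact paths and field names vary by distribution and kernel):

nfs-client# nfsstat -m
nfs-client# egrep "LAYOUTGET|GETDEVICEINFO" /proc/self/mountstats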

Gotta keep ’em separated!

One thing that is beneficial about the design of pNFS is that the metadata paths are separated from the read/write paths. Once a mount is established, the metadata path is set on the IP address used for mount and does not move without manual intervention. In Data ONTAP, that path could live anywhere in the cluster. (Up to 24 physical nodes with multiple ports on each node!)

That buys you resiliency, as well as flexibility to control where the metadata will be served.

The data path, however, is only established on reads and writes. That path is determined in a conversation between the client and server and is dynamic. Any time the physical location of a volume changes, the data path changes automatically, without the need for intervention by the clients or the storage administrator. So, unlike NFSv3 or even NFSv4.0, you no longer need to break the TCP connection to move the path for reads and writes to be local (via unmount or LIF migrations). And with NFSv4.x, the statefulness of the connection can be preserved.

That means more time for everyone. Data can be migrated in real time, non-disruptively, based on the storage needs of the client.
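
For example, a nondisruptive volume move is a single command (the destination aggregate name here is hypothetical), and pNFS clients simply pick up the new local path on their next LAYOUTGET:

cluster::> volume move start -vserver SVM -volume unix -destination-aggregate aggr1_cluster02
cluster::> volume move show -vserver SVM -volume unix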

For example, I have a volume that lives on node cluster01 of my cDOT cluster:

cluster::> vol show -vserver SVM -volume unix -fields node
 (volume show)

vserver volume node
------- ------ --------------
SVM     unix   cluster01

I have data LIFs on each node in my cluster:

 cluster::> net int show -vserver SVM
(network interface show)

Logical     Status     Network                       Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM
             data1      up/up     10.63.57.237/18    cluster01     e0c     true
             data2      up/up     10.63.3.68/18      cluster02     e0c     true
2 entries were displayed.

In the above list:

  • 10.63.3.68 will be my metadata path, since that’s where I mounted.
  • 10.63.57.237 will be my data path, as it is local to the physical node cluster01, where the volume lives.

When I mount, the TCP connection is established to the node where the data LIF lives:

nfs-client# mount -o minorversion=1 10.63.3.68:/unix /unix

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local             Port Host:Port              Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                                TCP/nfs

My metadata path is established to cluster02, but my data volume lives on cluster01.

On a basic cd and ls into the mount, all the traffic is seen on the metadata path. (stuff like GETATTR, ACCESS, etc):

83     6.643253      10.228.225.140       10.63.3.68    NFS    270    V4 Call (Reply In 85) GETATTR
85     6.648161      10.63.3.68    10.228.225.140       NFS    354    V4 Reply (Call In 83) GETATTR
87     6.652024      10.228.225.140       10.63.3.68    NFS    278    V4 Call (Reply In 88) ACCESS 
88     6.654977      10.63.3.68    10.228.225.140       NFS    370    V4 Reply (Call In 87) ACCESS

When I start I/O to that volume, the path gets updated to the local path by way of new pNFS calls (specified in RFC 5661, which defines NFSv4.1 and its pNFS file layout):

28     2.096043      10.228.225.140       10.63.3.68    NFS    314    V4 Call (Reply In 29) LAYOUTGET
29     2.096363      10.63.3.68    10.228.225.140       NFS    306    V4 Reply (Call In 28) LAYOUTGET
30     2.096449      10.228.225.140       10.63.3.68    NFS    246    V4 Call (Reply In 31) GETDEVINFO
31     2.096676      10.63.3.68    10.228.225.140       NFS    214    V4 Reply (Call In 30) GETDEVINFO
  1. In LAYOUTGET, the client asks the server “where does this filehandle live?”
  2. The server responds with the device ID and physical location of the filehandle.
  3. Then, the client asks, “what devices are available to me to access that physical data?” via GETDEVINFO.
  4. The server responds with the list of available devices/IP addresses.
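
If you want to watch this exchange yourself, a simple packet capture on the client will show the LAYOUTGET/GETDEVINFO calls and the new TCP connection that follows (the interface name is just an example; open the capture in Wireshark afterward):

nfs-client# tcpdump -i eth0 -s 0 -w pnfs.pcap port 2049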


Once that communication takes place (and note that the conversation occurs in sub-millisecond times), the client then establishes the new TCP connection for reads and writes:

32     2.098771      10.228.225.140       10.63.57.237  TCP    74     917 > nfs [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=937300318 TSecr=0 WS=128
33     2.098996      10.63.57.237  10.228.225.140       TCP    78     nfs > 917 [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=2452178641 TSecr=937300318
34     2.099042      10.228.225.140       10.63.57.237  TCP    66     917 > nfs [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=937300318 TSecr=2452178641

And we can see the connection established on the cluster to both the metadata and data locations:

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local             Port Host:Port              Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster01
SVM         data1:2049             nfs-client.domain.netapp.com:917
                                                               TCP/nfs
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                               TCP/nfs

Then we start our data transfer on the new path (data path 10.63.57.237):

38     2.099798      10.228.225.140       10.63.57.237  NFS    250    V4 Call (Reply In 39) EXCHANGE_ID
39     2.100137      10.63.57.237  10.228.225.140       NFS    278    V4 Reply (Call In 38) EXCHANGE_ID
40     2.100194      10.228.225.140       10.63.57.237  NFS    298    V4 Call (Reply In 42) CREATE_SESSION
42     2.100537      10.63.57.237  10.228.225.140       NFS    194    V4 Reply (Call In 40) CREATE_SESSION

157    2.106388      10.228.225.140       10.63.57.237  NFS    15994  V4 Call (Reply In 178) WRITE StateID: 0x0d20 Offset: 196608 Len: 65536
163    2.106421      10.63.57.237  10.228.225.140       NFS    182    V4 Reply (Call In 127) WRITE

If I do a chmod later, the metadata path is used (10.63.3.68):

341    27.268975     10.228.225.140       10.63.3.68    NFS    310    V4 Call (Reply In 342) SETATTR FH: 0x098eaec9
342    27.273087     10.63.3.68    10.228.225.140       NFS    374    V4 Reply (Call In 341) SETATTR | ACCESS

How do I make sure metadata connections don’t pile up?

When you have many clients mounting to an NFS server, you generally want to try to control which nodes those clients are mounting to. In the cloud, this becomes trickier to do, as clients and storage system management may be handled by the cloud providers. So, we’d want to have a noninteractive way to do this.

With ONTAP, you have two options to load balance TCP connections for metadata. You can use the tried-and-true DNS round-robin method, but the NFS server doesn’t have any idea which IP addresses have been issued by the DNS server, so there are no guarantees that the connections won’t pile up on a single node.

Another way to deal with connections is to leverage the ONTAP feature for on-box DNS load balancing. This feature allows storage administrators to set up a DNS forwarding zone on a DNS server (BIND, Active Directory or otherwise) to forward requests to the clustered Data ONTAP data LIFs, which can act as DNS servers complete with SOA records! The cluster will determine which IP address to issue to a client based on the following factors:

  • CPU load
  • overall node throughput

This helps ensure that any TCP connection that is established is placed in a logical manner, based on the performance of the physical hardware.
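
On the ONTAP side, setting this up mostly amounts to assigning a DNS zone to the data LIFs and letting them answer queries; the zone name below is hypothetical, and you still need the delegation/forwarding record on your actual DNS server:

cluster::> network interface modify -vserver SVM -lif data1 -dns-zone nfs.svm.domain.com -listen-for-dns-query true
cluster::> network interface modify -vserver SVM -lif data2 -dns-zone nfs.svm.domain.com -listen-for-dns-query true

Clients then mount using the zone name (for example, nfs.svm.domain.com:/unix) instead of a specific IP address.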

I cover both types of DNS load balancing in TR-4523: DNS Load Balancing in ONTAP.

What about that data agility?

What’s great about pNFS is that it is a perfect fit for storage operating systems like ONTAP. NetApp and Red Hat worked together closely on the protocol enhancement, and it shows in its overall implementation.

In ONTAP, there is the concept of non-disruptive volume moves. This feature gives storage administrators agility and flexibility in their clusters, as well as enabling service and cloud providers a way to charge based on tiers (pay as you grow!).

For example, if I am a cloud provider, I could have a 24-node cluster as a backend. Some HA pairs could be All-Flash FAS (AFF) nodes for high-performance/low latency workloads. Some HA pairs could be SATA or SAS drives for low performance/high capacity/archive storage. If I am providing storage to a customer that wants to implement high performance computing applications, I could sell them the performance tier. If those applications are only going to run during the summer months, we can use the performance tier, and after the jobs are complete, we can move them back to SATA/SAS drives for storage and even SnapMirror or SnapVault them off to a DR site for safekeeping. Once the job cycle comes back around, I can nondisruptively move the volumes back to flash. That saves the customer money, as they only pay for the performance they’re using, and that saves the cloud provider money since they can free up valuable flash real estate for other customers that need performance tier storage.

What happens when a volume moves in pNFS?

When a volume move occurs, the client is notified of the change via the pNFS calls I mentioned earlier. When the client attempts to OPEN the file for writing, the server responds, “that file is somewhere else now.”

220    24.971992     10.228.225.140       10.63.3.68    NFS    386    V4 Call (Reply In 221) OPEN DH: 0x76306a29/testfile3
221    24.981737     10.63.3.68    10.228.225.140       NFS    482    V4 Reply (Call In 220) OPEN StateID: 0x1077

The client says, “cool, where is it now?”

222    24.992860     10.228.225.140       10.63.3.68    NFS    314    V4 Call (Reply In 223) LAYOUTGET
223    25.005083     10.63.3.68    10.228.225.140       NFS    306    V4 Reply (Call In 222) LAYOUTGET
224    25.005268     10.228.225.140       10.63.3.68    NFS    246    V4 Call (Reply In 225) GETDEVINFO
225    25.005550     10.63.3.68    10.228.225.140       NFS    214    V4 Reply (Call In 224) GETDEVINFO

Then the client uses the new path to start writing, with no interaction needed.

251    25.007448     10.228.225.140       10.63.57.237  NFS    7306   V4 Call WRITE StateID: 0x15da Offset: 0 Len: 65536
275    25.007987     10.228.225.140       10.63.57.237  NFS    7306   V4 Call WRITE StateID: 0x15da Offset: 65536 Len: 65536

Automatic Data Tiering

If you have an on-premises storage system and want to save storage infrastructure costs by automatically tiering cold data to the cloud or to an on-premises object storage system, you could leverage NetApp FabricPool, which allows you to set tiering policies to chunk off cold blocks of data to more cost effective storage and then retrieve those blocks whenever they are requested by the end user. Again, we’re taking the guesswork and labor out of data management, which is becoming critical in a world driven towards managed services.

For more information on FabricPool:

TR-4598: FabricPool Best Practices

Tech ONTAP Podcast Episode 268 – NetApp FabricPool and S3 in ONTAP 9.8

What about FlexGroup volumes?

As of ONTAP 9.7, NFSv4.1 and pNFS are supported with FlexGroup volumes, which is an intriguing solution.

Part of the challenge of a FlexGroup volume is that you’re guaranteed to have remote I/O across a cluster network when you span multiple nodes. But since pNFS automatically redirects traffic to local paths, you can greatly reduce the amount of intracluster traffic.

A FlexGroup volume operates as a single entity, but is constructed of multiple FlexVol member volumes. Each member volume contains unique files that are not striped across volumes. When NFS operations connect to FlexGroup volumes, ONTAP handles the redirection of operations over a cluster network.

With pNFS, these remote operations are reduced, because the data layout mappings track the member volume locations and local network interfaces; they also redirect reads/writes to the local member volume inside a FlexGroup volume, even though the client only sees a single namespace. This approach enables a scale-out NFS solution that is more seamless and easier to manage, and it also reduces cluster network traffic and balances data network traffic more evenly across nodes.

FlexGroup pNFS differs a bit from FlexVol pNFS. Even though FlexGroup load-balances between metadata servers for file opens, pNFS uses a different algorithm: it tries to direct traffic to the node where the target file is located. If a node has multiple data LIFs in the SVM, connections can be made to each of those LIFs, but only one LIF from that set is used to direct pNFS traffic to the volumes on that node.
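
To see how a FlexGroup volume’s member volumes are spread across nodes (and therefore which paths pNFS can localize), you can list the constituents; the FlexGroup name here is made up:

cluster::> volume show -vserver SVM -volume fg1* -is-constituent true -fields node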

What workloads should I use with pNFS?

pNFS is leveraging NFSv4.1 and later as its protocol, which means you get all the benefits of NFSv4.1 (security, Kerberos and lock integration, lease-based locks, delegations, ACLs, etc.). But you also get the potential negatives of NFSv4.x, such as higher overhead for operations due to the compound calls, state ID handling, locking, etc. and disruptions during storage failovers that you wouldn’t see with NFSv3 due to the stateful nature of NFSv4.x.

Performance can be severely impacted with some workloads, such as high file count workloads/high metadata workloads (think EDA, software development, etc). Why? Well, recall that pNFS is parallel for reads and writes – but the metadata operations still use a single interface for communication. So if your NFS workload is 80% GETATTR, then 80% of your workload won’t benefit from the localization and load balancing that pNFS provides. Instead, you’ll be using NFSv4.1 as if pNFS were disabled.

Plus, with millions of files, even if you’re doing heavy reads and writes, you’re redirecting paths constantly with pNFS (creating millions of GETDEVICEINFO and LAYOUTGET calls), which may prove less efficient than simply using NFSv4.1 without pNFS.

pNFS also would need to be supported by the clients you’re using, so if you want to use it for something like VMware datastores, you’re out of luck (for now). VMware currently supports NFSv4.1, but not pNFS (they went with session trunking, which ONTAP does not currently support).

File-based pNFS works best with workloads that do a lot of sequential IO, such as databases, Hadoop/Apache Spark, AI training workloads, or other large file workloads, where reads and writes dominate the IO.

What about the performance?

In TR-4067, I did some basic performance testing on NFSv3 vs. NFSv4.1 for those types of workloads and the results were that pNFS stacked up nicely with NFSv3.

These tests were done using dd in parallel to simulate a sequential I/O workload. They aren’t intended to show the upper limits of the system (I used an AFF 8040 and some VM clients with low RAM and 1GB networks), but rather an apples-to-apples comparison of NFSv3 and NFSv4.1, with and without pNFS, using different wsize/rsize values. Be sure to do your own tests before implementing in production.
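
The load generator was nothing fancy. As a rough sketch of the kind of parallel dd run used (the file count, file size, and mount path are illustrative, not the exact test parameters):

# 8 parallel sequential writers; wait for all of them to finish
nfs-client# for i in $(seq 1 8); do dd if=/dev/zero of=/unix/file$i bs=1M count=4096 oflag=direct & done; wait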

Note that our completion time for this workload using pNFS was nearly four and a half minutes faster than NFSv3 using a 1MB wsize/rsize value (10m54s vs. 15m23s).

Test (wsize/rsize setting)             Completion Time
NFSv3 (1MB)                            15m23s
NFSv3 (256K)                           14m17s
NFSv3 (64K)                            14m48s
NFSv4.1 (1MB)                          15m6s
NFSv4.1 (256K)                         12m10s
NFSv4.1 (64K)                          15m8s
NFSv4.1 (1MB; pNFS)                    10m54s
NFSv4.1 (256K; pNFS)                   12m16s
NFSv4.1 (64K; pNFS)                    13m57s
NFSv4.1 (1MB; delegations)             13m6s
NFSv4.1 (256K; delegations)            15m25s
NFSv4.1 (64K; delegations)             13m48s
NFSv4.1 (1MB; pNFS + delegations)      11m7s
NFSv4.1 (256K; pNFS + delegations)     13m26s
NFSv4.1 (64K; pNFS + delegations)      10m6s

The IOPS were lower overall for NFSv4.1 than NFSv3; that’s because NFSv4.1 combines operations into single packets. Thus, NFSv4.1 will be less chatty over the network than NFSv3. On the downside, the payloads are larger, so the NFS server has more processing to do for each packet, which can impact CPU, and with more IOPS, you can see a drop in performance due to that overhead.

Where NFSv4.1 beat out NFSv3 was with the latency and throughput – since we can guarantee data locality, we get benefits of fastpathing the reads/writes to the files, rather than the extra processing needed to traverse the cluster network.

Test (wsize/rsize setting)             Avg Read       Avg Read            Avg Write      Avg Write           Avg Ops
                                       Latency (ms)   Throughput (MB/s)   Latency (ms)   Throughput (MB/s)
NFSv3 (1MB)                            6              654                 27.9           1160                530
NFSv3 (256K)                           1.4            766                 2.9            1109                2108
NFSv3 (64K)                            .2             695                 2.2            1110                8791
NFSv4.1 (1MB)                          6.5            627                 36.8           1400                582
NFSv4.1 (256K)                         1.4            712                 3.2            1160                2352
NFSv4.1 (64K)                          .1             606                 1.2            1310                7809
NFSv4.1 (1MB; pNFS)                    3.6            840                 26.8           1370                818
NFSv4.1 (256K; pNFS)                   1.1            807                 5.2            1560                2410
NFSv4.1 (64K; pNFS)                    .1             835                 1.9            1490                8526
NFSv4.1 (1MB; delegations)             5.1            684                 32.9           1290                601
NFSv4.1 (256K; delegations)            1.3            648                 3.3            1140                1995
NFSv4.1 (64K; delegations)             .1             663                 1.3            1000                7822
NFSv4.1 (1MB; pNFS + delegations)      3.8            941                 22.4           1110                696
NFSv4.1 (256K; pNFS + delegations)     1.1            795                 3.3            1140                2280
NFSv4.1 (64K; pNFS + delegations)      .1             815                 1              1170                11130

For high file count workloads, NFSv3 did much better. This test created 800,000 small files (512K) in parallel. For this high metadata workload, NFSv3 completed 2x as fast as NFSv4.1. pNFS added some time savings versus NFSv4.1 without pNFS, but overall, we can see where we may run into problems with this type of workload. Future releases of ONTAP will get better with this type of workload using NFSv4.1 (these tests were on 9.7).

Test (wsize/rsize setting)   Completion Time   CPU %    Average throughput (MB/s)   Average total IOPS
NFSv3 (1MB)                  17m29s            32%      351                         7696
NFSv3 (256K)                 16m34s            34.5%    372                         8906
NFSv3 (64K)                  16m11s            39%      394                         13566
NFSv4.1 (1MB)                38m20s            26%      167                         7746
NFSv4.1 (256K)               38m15s            27.5%    167                         7957
NFSv4.1 (64K)                38m13s            31%      172                         10221
NFSv4.1 pNFS (1MB)           35m44s            27%      171                         8330
NFSv4.1 pNFS (256K)          35m9s             28.5%    175                         8894
NFSv4.1 pNFS (64K)           36m41s            33%      171                         10751

Enter nconnect

One of the keys to pNFS performance is parallelization of operations across volumes, nodes, etc. But it doesn’t necessarily parallelize network connections across these workloads. That’s where the new NFS mount option nconnect comes in.

The purpose of nconnect is to provide multiple TCP connections per mount point on a client. This helps increase parallelism and performance for NFS mounts – particularly for single-client workloads. Details about nconnect and how it can increase performance for NFS in Cloud Volumes ONTAP can be found in the blog post The Real Baseline Performance Story: NetApp Cloud Volumes Service for AWS. ONTAP 9.8 offers official support for the use of nconnect with NFS mounts, provided the NFS client also supports it. If you would like to use nconnect, check whether your client version provides it and use ONTAP 9.8 or later; on the ONTAP side, no option is needed.
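
A hedged example of an NFSv4.1/pNFS mount using nconnect (requires a client kernel that supports the option; the IP address and paths are examples):

nfs-client# mount -t nfs -o vers=4.1,nconnect=16 10.63.3.68:/flexgroup /mnt/flexgroup
nfs-client# mount | grep /mnt/flexgroup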

Client support for nconnect varies, but the latest RHEL 8.3 release supports it, as do the latest Ubuntu and SLES releases. Be sure to verify if your OS vendor supports it.

Our Customer Proof of Concept lab (CPOC) did some benchmarking of nconnect with NFSv3 and pNFS using a sequential I/O workload on ONTAP 9.8 and saw some really promising results.

  • Single NVIDIA DGX-2 client
  • Ubuntu 20.04.2
  • NFSv4.1 with pNFS and nconnect
  • AFF A400 cluster
  • NetApp FlexGroup volume
  • 256K wsize/rsize
  • 100GbE connections
  • 32 x 1GB files

In these tests, the following throughput results were seen. Latency for both was sub-1ms.

Test            Bandwidth
NFSv3           10.2 GB/s
NFSv4.1/pNFS    21.9 GB/s

Both NFSv3 and NFSv4.1 used nconnect=16.

In these tests, NFSv4.1 with pNFS doubled the performance for the sequential read workload at 250us latency. Since the files were 1GB in size, the reads were almost entirely from the controller RAM, but it’s not unreasonable to see that as the reality for a majority of workloads, as most systems have enough RAM to see similar results.

David Arnette and I discuss it a bit in this podcast:

Episode 283 – NetApp ONTAP AI Reference Architectures

Note: Benchmark tests such as SAS iotest will purposely recommend setting file sizes larger than the system RAM to avoid any caching benefits and instead will measure the network bandwidth of the transfer. In real world application scenarios, RAM, network, storage and CPU are all working together to create the best possible performance scenarios.

pNFS Best Practices with ONTAP

pNFS best practices in ONTAP don’t differ much from normal NAS best practices, but here are a few to keep in mind. In general:

  • Use the latest supported client OS version.
  • Use the latest supported ONTAP patch release.
  • Create a data LIF per node, per SVM to ensure data locality for all nodes.
  • Avoid using LIF migration on the metadata server data LIF, because NFSv4.1 is a stateful protocol and LIF migrations can cause brief outages as the NFS states are reestablished.
  • In environments with multiple NFSv4.1 clients mounting, balance the metadata server connections across multiple nodes to avoid piling up metadata operations on a single node or network interface.
  • If possible, avoid using multiple data LIFs on the same node in an SVM.
  • In general, avoid mounting NFSv3 and NFSv4.x on the same datasets. If you can’t avoid this, check with the application vendor to ensure that locking can be managed properly.
  • If you’re using NFS referrals with pNFS, keep in mind that referrals establish a local metadata server, but data I/O is still redirected. With FlexGroup volumes, the member volumes might live on multiple nodes, so NFS referrals aren’t of much use. Instead, use DNS load balancing to spread out connections.

Drop any questions into the comments below!

Behind the Scenes: Episode 248 – NABox and NetApp ONTAP Performance Monitoring with NetApp Harvest

Welcome to Episode 248, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


Recently, I tried to set up NetApp Harvest from scratch, using Grafana and Graphite. By the end of the day, with no working instance, I realized that it just wasn’t going to happen. Then, I remembered NABox and within minutes, I was able to start monitoring NetApp ONTAP performance.

This week, we invite NetApp Technical Account Manager Yann Bizeul (@ybontap), NetApp Performance TME Dan Isaacs (@danisaacs) and Client Solutions Architect at AHEAD, Dan Burkland (@dburkland) to discuss NABox – a self-contained, pre-configured VM instance of NetApp Harvest that lets you start monitoring NetApp ONTAP performance in a matter of minutes!

NABox can be found at: https://nabox.org/

Dan Burkland’s blog posts:

My blog post on converting the NABox OVA for use with Hyper-V:

Podcast Transcriptions

We also are piloting a new transcription service, so if you want a written copy of the episode, check it out here (just set expectations accordingly):

Episode 248: NABox and NetApp ONTAP Performance Monitoring with NetApp Harvest – Transcript

Just use the search field to look for words you want to read more about. (For example, search for “storage”)


Or, click the “view transcript” button:


Be sure to give us feedback on the transcription in the comments here or via podcast@netapp.com! If you have requests for other previous episode transcriptions, let me know!

Finding the Podcast

You can find this week’s episode here:

Tech ONTAP Podcast · Episode 248: NABox and NetApp ONTAP Performance Monitoring with Harvest

Find the Tech ONTAP Podcast on:

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

Using NABox for NetApp Performance Monitoring on Microsoft Hyper-V

If you’ve ever tried to install and configure Grafana, you’ll find that it’s not the easiest thing to use. Simply installing it and getting it to work right can be challenging, and when you factor in adding NetApp monitoring in using Harvest, it gets a bit more complicated.

There are some fairly good step-by-step configuration guides out there, such as this one on the NetApp communities, as well as this blog by Dan Burkland (@dburkland) that uses Docker to containerize NetApp Harvest.

There’s also a regularly updated .ova file called “NABox” that uses an “all-in-one” approach to deploying a monitoring VM. This was created/is managed by current NetApp employee Yann Bizeul (@ybontap).

This was the approach I used recently.

NABox uses .ova files, as mentioned, which are generally associated with VMware deployments. The .ova file is essentially a compressed archive containing:

  • .mf file
  • .ovf file
  • two .vmdk disks

The .ovf is an XML file that contains the configuration of the VM – stuff like processors, RAM, network, etc.

The .mf file is basically a checksum of the files you get in the .ova.

The VMDKs are the disks you attach to the VM.
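
Because an .ova is really just a tar archive of those pieces, you can also list or extract it from a command line if you prefer that to a GUI zip tool (the file name follows the NAbox 2.6.1 example used below):

tar -tvf NAbox-2.6.1.ova
tar -xvf NAbox-2.6.1.ova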

Deploy NABox in Hyper-V

Deploying the .ova is easy when you use VMware and is covered on the NABox documentation page. But currently, there are no steps for using the image with Hyper-V. That’s where this blog comes in. (You likely can port these steps over to other virtualization technologies.)

This section basically replaces “Deploying the OVA” on the NABox site.

  1. Download the latest NABox .ova, the NetApp SDK and NetApp Harvest
  2. Use your favorite zip tool to extract the files from the .ova (I use 7zip)
  3. Convert the .vmdk files to Hyper-V compatible .vhd using VirtualBox (as described in this blog). These are the commands I used and the results.
    PS C:\Program Files\Oracle\VirtualBox> .\VBoxManage.exe clonemedium --format vhd C:\Users\parisi\Downloads\NAbox-2.6.1\NAbox-2.6.1-disk1.vmdk C:\Users\parisi\Downloads\NAbox-2.6.1\NAbox-2.6.1-disk1.vhd
    0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
    Clone medium created in format 'vhd'. 
    UUID: b6608a15-c334-49a7-9416-80ce516efea4
    
    PS C:\Program Files\Oracle\VirtualBox> .\VBoxManage.exe clonemedium --format vhd C:\Users\parisi\Downloads\NAbox-2.6.1\NAbox-2.6.1-disk2.vmdk C:\Users\parisi\Downloads\NAbox-2.6.1\NAbox-2.6.1-disk2.vhd
    0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
    Clone medium created in format 'vhd'. UUID: def48aad-4e6c-4475-80d5-f145c5153e97
  4. Open the .ovf file in your XML editor of choice. I used Notepad++.
  5. Find the RAM requirements and processor count.
    <Item>
       <rasd:AllocationUnits>hertz * 10^6</rasd:AllocationUnits>
       <rasd:Description>Number of Virtual CPUs</rasd:Description>
       <rasd:ElementName>2 virtual CPU(s)</rasd:ElementName>
       <rasd:InstanceID>1</rasd:InstanceID>
       <rasd:ResourceType>3</rasd:ResourceType>
       <rasd:VirtualQuantity>2</rasd:VirtualQuantity>
    </Item>
    <Item>
       <rasd:AllocationUnits>byte * 2^20</rasd:AllocationUnits>
       <rasd:Description>Memory Size</rasd:Description>
       <rasd:ElementName>2048MB of memory</rasd:ElementName>
       <rasd:InstanceID>2</rasd:InstanceID>
       <rasd:ResourceType>4</rasd:ResourceType>
       <rasd:VirtualQuantity>2048</rasd:VirtualQuantity>
    </Item>
  6. Create a new Hyper-V VM with the same number of processors and the same amount of RAM as specified in the .ovf (a PowerShell sketch for steps 6-8 follows this list).
  7. Attach the disk 1 and disk2 .vhd files to the Hyper-V VM.
  8. Power on the VM and configure as described in NABox Basic Configuration.
  9. Finish the rest of the steps listed on the NABox page.
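
If you’d rather script steps 6 through 8, here’s a hedged PowerShell sketch (the VM name and virtual switch name are examples; Generation 1 is used because the converted disks are .vhd rather than .vhdx):

PS C:\> New-VM -Name "NABox" -Generation 1 -MemoryStartupBytes 2GB -VHDPath "C:\Users\parisi\Downloads\NAbox-2.6.1\NAbox-2.6.1-disk1.vhd" -SwitchName "ExternalSwitch"
PS C:\> Set-VMProcessor -VMName "NABox" -Count 2
PS C:\> Add-VMHardDiskDrive -VMName "NABox" -Path "C:\Users\parisi\Downloads\NAbox-2.6.1\NAbox-2.6.1-disk2.vhd"
PS C:\> Start-VM -Name "NABox"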

That’s it! Now you, too, can have an instance of NetApp Harvest running on Microsoft Hyper-V in less than 30 minutes!


Behind the Scenes: Episode 157 – Performance Analysis Using OnCommand Unified Manager

Welcome to Episode 157, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we welcome Mr. Performance himself, Tony Gaddis (gaddis@netapp.com) to give us a tutorial on easily finding performance issues using OnCommand Unified Manager, as well as some common “rules of thumb” when it comes to how much latency and node utilization is too much.

Also, check out Tony’s NetApp Insight 2018 session in Las Vegas and Barcelona:

1181-1 – ONTAP Storage Performance Design Considerations for Emerging Technologies

Podcast listener Mick Landry was kind enough to document the “rules of thumb” that I forgot to add to the blog in the comments. Here they are:

  1. Performance utilization on a node > 85% points to latency issue on the node (broad latency for volumes on the node)
  2. Performance capacity used on a node > 100% points to one or more volumes on the node that have latency due to CPU resources running out.
    • This is not an indicator of CPU headroom.
    • 100% is “optimal” – below that is wiggle room.
  3. Spinning disk
    • This is about aggregate performance utilization – not capacity.
    • Above 50%, the disk latency impact will increase.
    • When queueing starts, latency can double or triple on slow platters.
    • This is the performance utilization of the disk drive.
  4. Fragmented free space on spinning disk
    • Increases CP processing time.
    • At 85% capacity utilization of the aggregate, this becomes a problem.
    • Above 90%, it will impact heavy workloads.
  5. Node utilization from an HA point of view
    • Keep the sum of the node utilizations less than 100% and you will be okay.
    • This applies to “user hours” on “revenue generating systems.”
  6. Disk
    • Spinning disk utilization < 50%
  7. Aggregate latency expectations
    • SATA latency < 12ms
    • SAS latency < 8ms
    • SSD latency < 2ms

Finding the Podcast

You can find this week’s episode here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

Behind the Scenes: Episode 149 – Cloud Volume Services Performance with Oracle Databases

Welcome to Episode 149, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, TME Chad Morgenstern (@sockpupets) joins us to discuss how performance looks in Cloud Volume Services for Oracle Database workloads.

Interested in Cloud Volume Services? You can investigate on your own here:

https://cloud.netapp.com/cloud-volumes

You can also check out Eiki Hrafnsson’s Cloud Field Day presentation on Cloud Volume Services here:

http://techfieldday.com/appearance/netapp-presents-at-cloud-field-day-3/

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

This week’s episode is here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

Behind the Scenes: Episode 147 – SPC-1v3 Results – NetApp AFF A800

Welcome to Episode 147, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we find out how the new NetApp A800 system fared in the rigorous SPC-1 v3 storage benchmarks. Can the NVMe-attached SSDs truly help reduce latency while maintaining a high number of IOPS? Performance TME Dan Isaacs (@danisaacs) and the workload engineering team of Scott Lane, Jim Laing and Joe Scott join us to discuss!

Check out the published results here: 

http://spcresults.org/benchmarks/results/spc1-spc1e#A32007

And the official NetApp blog:

https://blog.netapp.com/nvme-benchmark-spc-1-testing-validates-breakthrough-performance-aff/

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

This week’s episode is here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

ONTAP 9.4RC1 is now available!

Hear ye! Hear ye! All ye storage admins! ONTAP 9.4RC1 is announced today!


That’s right! Every 6 months, without fail, a new ONTAP version with a payload of new features is released.

You can find ONTAP 9.4RC1 here:

http://mysupport.netapp.com/NOW/download/software/ontap/9.4RC1

For info on what a release candidate is, see:

http://mysupport.netapp.com/NOW/products/ontap_releasemodel/

Also, check out the documentation center:

docs.netapp.com/ontap-9/index.jsp

NetApp published a general overview blog on NVMe with Joel Reich here:

https://blog.netapp.com/the-future-is-here-ai-ready-cloud-connected-all-flash-storage-with-nvme/

Stay tuned for a more general ONTAP 9.4 overview blog on the official site. Also, I recorded a brief 5-minute teaser/trailer for ONTAP 9.4 features and podcasts coming soon. Find that here:

Also a new lightboard video! Watch me write… BACKWARDS???

This blog is intended to go a little deeper into the main features available in ONTAP 9.4. We’ll break them down as follows:

  • Cloud
  • Performance
  • Efficiency
  • Security
  • General ONTAP Goodness

Without further ado…

Cloud!

FabricPools were introduced in ONTAP 9.2 as a way to tier blocks from your performance tier solution to a capacity tier, such as cloud or StorageGrid.

We covered FabricPools in detail in episode 92 of the Tech ONTAP Podcast, which you can find here:

In ONTAP 9.4, the first major updates to the feature have been released! FabricPools in ONTAP 9.4 bring the following…

Tiering cold data from the active file system

Prior to ONTAP 9.4, FabricPools only tiered cold data from snapshots on primary systems and data protection volumes on secondary systems. This allowed ONTAP to free up valuable real estate on flash systems for data actively being used. In ONTAP 9.4, inactive blocks can now be tiered off to cloud or StorageGrid from the active file system. ONTAP does this automatically by way of a new “auto” tiering policy, which has a configurable cooling period of 2-63 days (-tiering-minimum-cooling-days option in CLI). This cooling period determines how long ONTAP will wait before tiering off data considered “cool” by the policy to the FabricPool tiering destination. The tiering destination choices used to be only Amazon S3 and StorageGrid, but ONTAP 9.4 brings us…
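
As a hedged example of what that looks like on a volume (the SVM and volume names are made up; the cooling period value is just one example within the 2-63 day range):

cluster::> volume modify -vserver SVM -volume vol1 -tiering-policy auto -tiering-minimum-cooling-days 14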

Tiering to Azure Blob Storage

Support for Azure Blob storage was added to ONTAP 9.4 for FabricPools, which gives storage administrators more options for cloud providers. In addition, other cloud providers (such as Google Cloud, IBM Cloud Object Storage, etc) can be added via product variance requests (PVR) to your NetApp Sales reps. Keep in mind that only one cloud provider per FabricPool aggregate can be used.


But how do you know if FabricPools will be of any value to you?

Inactive Data Reporting

Inactive Data Reporting is new in ONTAP 9.4 and can offer insight from OnCommand System Manager into whether there’s enough inactive data in your system for FabricPools to make a difference.


By default, this feature is enabled for aggregates participating in FabricPools, but you can also enable it via the CLI for non-FabricPool aggregates to predict space savings with the following command:

storage aggregate modify -aggregate <name> -is-inactive-data-reporting-enabled true

You can also test the performance of your FabricPool target with…

Object Store Profiler

Also new in ONTAP 9.4, the Object Store Profiler provides a way to evaluate the performance (via throughput and latency) to your desired FabricPool target. From the CLI, start the profiler using:

storage aggregate object-store profiler start -object-store-name <name> -node <name>

Then show the results with:

storage aggregate object-store profiler show

This gives a general idea of how FabricPools will work for you before you implement them.


But that’s not the only object store enhancements. FabricPools in ONTAP 9.4 also offers….

Better efficiency for object storage

Prior to ONTAP 9.4, there was really no concept of freeing up space on the object store once the data blocks that had been tiered off were deleted on the source. ONTAP would see the free space, but the capacity tier would not. ONTAP 9.4 offers object defragmentation for the FabricPool destination to free up deleted blocks on the destination. This happens without any admin interaction, kicking in at a default unreclaimed-space threshold that varies by provider. The default settings are:

  • 15% – Microsoft Azure Blob Storage
  • 20% – Amazon S3
  • 40% – StorageGRID Webscale

These percentages are adjustable via the CLI with the following command in advanced privilege:

storage aggregate object-store modify -aggregate <name> -object-store-name <name> -unreclaimed-space-threshold <%> (0%-99%)

ONTAP 9.4 also brings support for the data compaction functionality to FabricPool aggregates to provide even more storage efficiency. For more information on data compaction, see TR-4476.

What’s great about ONTAP 9.4 is that FabricPool can now be used on any ONTAP deployment (other than MCC) with…

Support for ONTAP Select and ONTAP Cloud

FabricPools can now tier from a cloud instance to a cloud tier. This is especially useful now that we have NetApp Cloud Volumes, which run on a performance tier.

Additionally, you can use FabricPools on all versions of ONTAP Select, whether standard or Premium. This means you can tier from ONTAP Select, even if it has spinning media running under the covers. This support for spinning media does not extend into FAS systems, however – just ONTAP Select. The concern there is performance; FabricPools won’t perform well on FAS systems with spinning media.

So that’s all for the FabricPool section. Now let’s talk…

Performance!

ONTAP 9.4’s biggest news is the introduction of support for NVMe over fibre channel, as well as the NVMe attached SSDs in the new AFF A800 platform. This gives NetApp the industry’s first end-to-end NVMe platform. If you’re interested in a deep dive into what NVMe is, this podcast covered it:

Early testing numbers on the new platform show sub-200 micro-second latencies, with 1.3 million IOPS per HA pair at sub-500 micro-second latencies and 34GB/s throughput. It’s a pretty beastly system.

NVMe is integral to the implementation of workloads such as machine learning and AI, which power technologies like self-driving cars, IoT devices and other budding tech.


If you’re a NetApp employee or partner, check out the recording of the Solutions Insight Webcast from May 9 that covers NVMe in more detail.

Another performance enhancement in ONTAP 9.4 is SMB multichannel, which provides a way for SMB3 connections to leverage more TCP streams and CPU cores on the ONTAP system to increase throughput. This especially benefits SQL server workloads.


The new platform and ONTAP 9.4 update doesn’t just add performance, however. It also adds…

More efficiency!

The new AFF A800 platform chassis offers efficiency in the form of both power/cooling and rack space savings with >2.5PB of storage (based on a 4:72 storage efficiency ratio) in a 4U footprint. Later, when the platform supports larger NVMe attached drives, we’ll see even more density. ONTAP 9.4 also brings support for 30TB SAS attached SSDs.

But ONTAP 9.4 also brings some additional efficiencies, such as…

Snapshot block sharing


Prior to ONTAP 9.4, deduplication did not take blocks locked in a snapshot into consideration for storage efficiencies. In ONTAP 9.4, if a file is locked in a snapshot *and* it exists in the active file system, deduplication will reduce the blocks needed for the file in the active file system to save even more space. ONTAP 9.4 also adds support for up to 1,023 snapshots per FlexVol.

Background Aggregate Level Deduplication


Deduplication at the aggregate level was added in ONTAP 9.2 and provides storage efficiencies when identical blocks exist across volumes in the same aggregate. This was all done inline. In ONTAP 9.4, you can now deduplicate at the aggregate level on data that’s already been placed.

Automatic Efficiency Enablement on Data Protection Volumes


ONTAP 9.4 also automatically enables all storage efficiencies on data protection volumes to help simplify the role of storage administrators and save space on secondary systems.

Decreased Node Root Aggregate Sizes

Every node in an ONTAP cluster has a node root aggregate, which hosts a node root volume. The node root volume holds logs, system critical files and any core files that might get generated in the event of a crash. The core file size is based on the size of system memory. As platforms add memory to systems, these core files get larger, which was causing the core files to increase, which made root volume sizes increase… wait. This is getting confusing. Here’s a diagram:

[Diagram: root volume size calculation]

Advanced Disk Partitioning (or root-data partitioning) helped save some space by spreading the volume across disk partitions, but we took steps to save even more space. For example, the 1TB root aggregate that would have been needed on the A800 node gets reduced down to just 150GB!

Long story short – ONTAP 9.4 with newer systems moved the ever-increasing core files from disk media to the local flash boot storage. This applies only to newer systems (such as the A800, FAS2700 and beyond) that have large enough boot devices to hold 2 core files and cannot be retroactively applied to older systems.

ONTAP 9.4 is also bringing…

More Security!

One of the areas of ONTAP that I feel has seen some of the most significant enhancements over the past several years  has been security (credit to Juan Mojica for making it happen).

Starting with the onboard key manager, which grew into NetApp Volume Encryption and evolved into off-box key manager support and multi-factor authentication, security has grown leaps and bounds in ONTAP. This is necessary in today’s hyper-focused security minded IT organizations, as hacks, breaches and ransomware attacks are all very fresh in their minds.

ONTAP 9.4 is bringing several more security features that don’t just help guard against external threats, but also help cover internal threats (or user mistakes) from hurting a business’s bottom line.

First of all, admins can upgrade to…

Validated ONTAP Images!

ONTAP is now a validated image, which gives administrators peace of mind that they’re not accidentally installing some hacked version of ONTAP that can compromise their systems. In addition, it prevents engineering builds of ONTAP (which can expose clusters to undiscovered bugs or disruptions) from being used to upgrade on clusters in the field. This helps minimize the risk and exposure of running unverified builds of ONTAP.

But we’re not just protecting against upgrading to unverified installations. ONTAP 9.4 also provides…

Key-based boot technology


Onboard Key Manager can be leveraged to prevent reboots without a passphrase. This protects against nefarious attempts to change the admin password on a system (which can be done with console/service processor access to the boot menu of a node), as well as against physical theft of systems. In addition to the onboard key manager, you can also enable protected boot with a USB key – but you’d need a product variance request (PVR). Check with your NetApp sales rep for details. Next-generation platforms (yet to be released) will also provide the ability to use UEFI Secure Boot, which works in conjunction with validated ONTAP images to not only prevent upgrades to unverified ONTAP images, but prevent running them at all.

These provide security against external and internal threats alike, but what do you do when someone accidentally writes a classified document to a public, unclassified share?

Securely purge it!


ONTAP 9.4 provides the ability to cryptographically shred individual files from the drive while the system remains online, and the rest of the files remain intact. This can be helpful for data spillage – e.g. when a classified document ends up in an unclassified location. This is also particularly timely and useful for the upcoming GDPR regulations’ “Right to Erasure” rules.

Security is playing a big part in the new release of ONTAP. In addition, here’s some more…

General ONTAP goodness

ONTAP 9.4 also brings several other valuable features, such as:

  • Rapid disk zeroing technology – initialize disks near-instantaneously in newer platforms!
  • 3-step, 1-click ONTAP upgrades – even easier to update your cluster non-disruptively
  • Install ONTAP without needing a separate web or FTP server
  • SQL Server support for Application Data Management in System Manager

So, there you are! A thorough rundown of the new features in ONTAP 9.4. If you feel I missed something, feel free to reach out in the comments with input!

Check out these brief videos for some lightboard action on new ONTAP 9.4 stuff:

Some other information on the launch can be found as follows:

GCP Cloud Volumes for NFS with native access to the GCP tool suite (Google Cloud)
https://blog.netapp.com/sweet-new-storage-service-from-netapp-for-google-cloud-platform/ 

Storage Grid Update 11.1
https://blog.netapp.com/storagegrid-11-1-and-netapp-hci-the-perfect-one-two-punch-for-scaling-your-environment/ 

A800 and the A220
https://blog.netapp.com/the-future-is-here-ai-ready-cloud-connected-all-flash-storage-with-nvme/ 

ONTAP 9.4 with first to market NVMe/FC support
http://www.demartek.com/Demartek_NetApp_Broadcom_NVMe_over_Fibre_Channel_Evaluation_2018-05.html

ONTAP 9.3 is now GA!

ONTAP 9 is on a new cadence model, which brings a new release every 6 months.

Today, ONTAP 9.3GA is available here!

http://mysupport.netapp.com/NOW/download/software/ontap/9.3

ONTAP 9.3 was announced at NetApp Insight 2017 in Las Vegas and was covered at a high level by Jeff Baxter in the following blog:

Announcing NetApp ONTAP 9.3: The Next Step in Modernizing Your Data Management

Jeff has a follow-up infographic here:

https://blog.netapp.com/10-good-reasons-to-upgrade-to-ontap-9-3-infographic/

I also did a brief video summary here:

We also did a podcast with ONTAP Chief Evangelist Jeff Baxter (@baxontap) and ONTAP SVP Octavian Tanase (@octav) here:

For info on what GA means, see:

http://mysupport.netapp.com/NOW/products/ontap_releasemodel/

Also, check out the documentation center:

docs.netapp.com/ontap-9/index.jsp

The general theme around ONTAP 9.3 is modernization of the data center. Here’s a high level list of features, with more detail on some of them later in this blog.

Security enhancements

Simplicity innovations

  • MongoDB support added to application provisioning
  • Simplified data protection flows in System Manager
  • Guided cluster setup and expansion
  • Adaptive QoS

Performance and efficiency improvements

  • Up to 30% performance improvement for specific workloads via WAFL improvements, parallelization and flash optimizations
  • Automatic schedules for deduplication
  • Background inline aggregate deduplication (AFF only; automatic schedule only)

NetApp FlexGroup volume features

This is covered in more detail in What’s New for NetApp FlexGroup Volumes in ONTAP 9.3?

  • Qtrees
  • Antivirus
  • Volume autogrow
  • SnapVault/Unified SnapMirror
  • SMB Change/notify
  • QoS Maximums
  • Improved automated load balancing logic

Data Fabric additions

  • SolidFire to ONTAP SnapMirror
  • MetroCluster over IP

Now, let’s look at a few of the features in a bit more detail. If you have things you want covered more, leave a comment.

Multifactor Authentication (MFA)

Traditionally, to log in to an ONTAP system as an admin, all you needed was a username and password and you’d get root-level access to all storage virtual machines in a cluster. If you’re the benevolent storage admin, that’s great! If you’re a hostile actor, great!* (*unless you’re the benevolent storage admin… then, not so great)

ONTAP 9.3 introduces the ability to configure an external Identity Provider (IdP) server to interact with OnCommand System Manager and Unified Manager to require a key to be passed in addition to a username and password. Initial support for IdP will include Microsoft Active Directory Federation Services and Shibboleth.


For the command line, the multifactor portion would be passed by way of SSH keys currently. We cover MFA in the following Tech ONTAP podcast:

SnapLock Enhancements

SnapLock is a NetApp ONTAP feature that provides data compliance for businesses that need to preserve data for regulatory reasons, such as HIPAA standards (SnapLock compliance) or for internal requirements, such as needing to preserve records (SnapLock enterprise).

ONTAP 9.3 provides a few enhancements to SnapLock, including one that isn’t available from any storage provider currently.


Legal hold is useful in the event that a court has ordered specific documents to be preserved for an ongoing case or investigation. This can be applied to multiple files and remains in effect until you choose to remove it.


Event-based retention allows storage administrators to set protections on data based on defined events, such as an employee leaving the company (to avoid disgruntled deletions), or for insurance use cases (such as death of a policy holder).


Volume append mode is the SnapLock feature I alluded to, where no one else can currently accomplish this. Essentially, it’s for media workloads (audio and video) and will write-protect the portion of the files that have already been streamed and allow appending to those files after they’ve been protected. It’s kind of like having a CD-R on  your storage system.

Performance improvements


Every release of ONTAP strives to improve performance in some way. ONTAP 9.3 introduces performance enhancements (mostly for SAN/block) via the following changes:

  • Read latency reductions via WAFL optimizations for All Flash FAS SAN (block) systems
  • Better parallelization for all workloads on mid-range and high-end systems (FAS and AFF) to deliver more throughput/IOPS at lower latencies
  • Parallelization of the iSCSI layer to allow iSCSI to use more cores (best results on 20 core or higher systems)

The following graphs show some examples of that performance improvement versus ONTAP 9.2.

[Graphs: AFF A700 FC and iSCSI performance, ONTAP 9.3 vs. 9.2]

Adaptive Quality of Service (QoS)

Adaptive QoS is a way for storage administrators to allow ONTAP to manage the number of IOPS per TB of volume space without the need to intervene. You simply set a service level class and let ONTAP control the rest.

The graphic below shows how it works.

[Diagram: Adaptive QoS]
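
As a hedged sketch of the CLI side (the policy group, SVM, and volume names are made up, and the IOPS/TB values are only examples), you create an adaptive policy group and then assign it to a volume:

cluster::> qos adaptive-policy-group create -policy-group aqos_value -vserver SVM -expected-iops 512IOPS/TB -peak-iops 1024IOPS/TB
cluster::> volume modify -vserver SVM -volume vol1 -qos-adaptive-policy-group aqos_value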

We cover QoS minimums and performance enhancements in the following Tech ONTAP podcast:

MetroCluster over IP

MetroCluster is a way for clusters to operate in a high-availability manner over long distances (hundreds of kilometers). Traditionally, MetroCluster has been done over fibre channel networks due to the low latency requirements needed to guarantee that writes can be committed to both sites.

However, now that IP networks are getting more robust, ONTAP is able to support MetroCluster over IP, which provides the following benefits:

  • Reduced CapEx and OpEx (no more dedicated Fibre Channel networks, cards, or bridges)
  • Simplicity of management (use existing IP networks)

mcc-ip.png

The ONTAP 9.3 release is going to be a limited release for this feature, with the following caveats:

  • A700, FAS9000 only
  • 100km limit
  • Dedicated ISL with extended VLAN currently required
  • 1 iWARP card per node

We cover MetroCluster over IP in this podcast:

SolidFire to ONTAP SnapMirror

A few years back, the concept of a data fabric (where all of your data can be moved anywhere with the click of a button) was introduced.

That vision continued this year with the inclusion of SnapMirror from SolidFire (and NetApp HCI systems) to ONTAP.

sf-snapmirror.png

ONTAP 9.3 will allow storage administrators to implement a disaster recovery plan for their SolidFire systems.

This includes the following:

  • Baseline and incremental replication using NetApp SnapMirror from SolidFire to ONTAP
  • Failover storage to ONTAP for disaster recovery
  • Failback storage from ONTAP to SolidFire
    • Only for LUNs replicated from SolidFire
    • Replication from ONTAP to SolidFire only for failback

That covers a deeper look at some of the new ONTAP 9.3 features. Feel free to comment if you want to learn more about these features, or any not listed in the overview.

Behind the Scenes: Episode 117 – Storage QoS in ONTAP 9.3

Welcome to Episode 117, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”

tot-gopher

This week on the podcast, we invited the NTAPFLIGuy, Mike Peppers, to talk about QoS and performance in ONTAP 9.3. Listen for a general overview of QoS maximums and minimums, as well as the new Adaptive QoS feature!

Finding the Podcast

The podcast is all finished and up for listening. You can find it on iTunes or SoundCloud or by going to techontappodcast.com.

This week’s episode is here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

ONTAP 9.3RC1 is now available!

ONTAP 9.3 was announced at NetApp Insight 2017 in Las Vegas and was covered at a high level by Jeff Baxter in the following blog:

Announcing NetApp ONTAP 9.3: The Next Step in Modernizing Your Data Management

I also did a brief video summary here:

We also did a podcast with ONTAP Chief Evangelist Jeff Baxter (@baxontap) and ONTAP SVP Octavian Tanase (@octav) here:

ONTAP releases are delivered every 6 months, with the odd-numbered releases landing around the time of NetApp Insight. Now, the first release candidate for 9.3 is available here:

http://mysupport.netapp.com/NOW/download/software/ontap/9.3RC1

For info on what a release candidate is, see:

http://mysupport.netapp.com/NOW/products/ontap_releasemodel/

Also, check out the documentation center:

docs.netapp.com/ontap-9/index.jsp

The general theme around ONTAP 9.3 is modernization of the data center. I cover this at Insight in session 30682-2, which is available as a recording from Las Vegas for those with a login. If you’re going to Insight in Berlin, feel free to add it to your schedule builder. Here’s a high-level list of features, with more detail on some of them later in this blog.

Security enhancements

  • Multifactor authentication (MFA)
  • SnapLock enhancements (legal hold, event-based retention, volume append mode)

Simplicity innovations

  • MongoDB support added to application provisioning
  • Simplified data protection flows in System Manager
  • Guided cluster setup and expansion
  • Adaptive QoS

Performance and efficiency improvements

  • Up to 30% performance improvement for specific workloads via WAFL improvements, parallelization, and flash optimizations
  • Automatic schedules for deduplication
  • Background inline aggregate deduplication (AFF only; automatic schedule only)

NetApp FlexGroup volume features

This is covered in more detail in What’s New for NetApp FlexGroup Volumes in ONTAP 9.3?

  • Qtrees
  • Antivirus
  • Volume autogrow
  • SnapVault/Unified SnapMirror
  • SMB Change/notify
  • QoS Maximums
  • Improved automated load balancing logic

Data Fabric additions

  • SolidFire to ONTAP SnapMirror
  • MetroCluster over IP

Now, let’s look at a few of the features in a bit more detail. If you have things you want covered more, leave a comment.

Multifactor Authentication (MFA)

Traditionally, to log in to an ONTAP system as an admin, all you needed was a username and password and you’d get root-level access to all storage virtual machines in a cluster. If you’re the benevolent storage admin, that’s great! If you’re a hostile actor, great!* (*unless you’re the benevolent storage admin… then, not so great)

ONTAP 9.3 introduces the ability to configure an external Identity Provider (IdP) server to interact with OnCommand System Manager and Unified Manager to require a key to be passed in addition to a username and password. Initial support for IdP will include Microsoft Active Directory Federation Services and Shibboleth.

MFA

For the command line, the second authentication factor is currently provided via SSH public keys in addition to the password.
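Here’s a minimal sketch of what that configuration looks like, assuming a new cluster admin account; the account name and key string are placeholders, and the exact parameters should be verified against the ONTAP 9.3 MFA documentation:

cluster::> security login create -user-or-group-name mfa_admin -application ssh -authentication-method publickey -second-authentication-method password -role admin
cluster::> security login publickey create -username mfa_admin -index 0 -publickey "ssh-rsa AAAA<placeholder-key> mfa_admin@host"

With that in place, an SSH session for that account only succeeds if the client presents the matching private key *and* the correct password.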

SnapLock Enhancements

SnapLock is a NetApp ONTAP feature that provides WORM-style data compliance for businesses that need to preserve data for regulatory reasons, such as HIPAA (SnapLock Compliance), or for internal requirements, such as record preservation (SnapLock Enterprise).

ONTAP 9.3 provides a few enhancements to SnapLock, including one that isn’t currently available from any other storage vendor.

legal-hold.png

Legal hold is useful in the event that a court has ordered specific documents to be preserved for an ongoing case or investigation. This can be applied to multiple files and remains in effect until you choose to remove it.

event-based

Event-based retention allows storage administrators to set protections on data based on defined events, such as an employee leaving the company (to avoid disgruntled deletions), or for insurance use cases (such as death of a policy holder).

vol-append.png

Volume append mode is the SnapLock feature I alluded to that no other storage vendor can currently match. Essentially, it’s for media workloads (audio and video): it write-protects the portions of a file that have already been streamed, while still allowing new data to be appended to that file. It’s kind of like having a CD-R on your storage system.

Performance improvements

improve-perf

Every release of ONTAP strives to improve performance in some way. ONTAP 9.3 introduces performance enhancements (mostly for SAN/block) via the following changes:

  • Read latency reductions via WAFL optimizations for All Flash FAS SAN (block) systems
  • Better parallelization for all workloads on mid-range and high-end systems (FAS and AFF) to deliver more throughput/IOPS at lower latencies
  • Parallelization of the iSCSI layer to allow iSCSI to use more cores (best results on systems with 20 or more cores)

The following graphs show some examples of that performance improvement versus ONTAP 9.2.

a700-fcp

a700-iscsi

Adaptive Quality of Service (QoS)

Adaptive QoS is a way for storage administrators to let ONTAP scale a volume’s IOPS limits automatically, based on an IOPS-per-TB ratio, as the volume grows or shrinks, with no need to intervene. You simply set a service level class and let ONTAP handle the rest.

The graphic below shows how it works.

adaptive-qos
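For a concrete feel: if an adaptive policy allows, say, a peak of 4,096 IOPS per provisioned TB, a 10TB volume gets a ceiling of roughly 40,960 IOPS, and that ceiling scales to about 81,920 IOPS on its own if the volume grows to 20TB. ONTAP 9.3 ships with predefined adaptive policy groups (extreme, performance, and value), and assigning one is a one-liner. The SVM and volume names below are placeholders, and the parameter names are worth verifying against the 9.3 QoS documentation:

cluster::> qos adaptive-policy-group show
cluster::> volume modify -vserver svm1 -volume datavol1 -qos-adaptive-policy-group performance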

MetroCluster over IP

MetroCluster is a way for clusters to operate in a highly available manner over long distances (hundreds of kilometers). Traditionally, MetroCluster has been deployed over Fibre Channel networks because of the low-latency requirements needed to guarantee that writes are committed to both sites.

However, now that IP networks are getting more robust, ONTAP is able to support MetroCluster over IP, which provides the following benefits:

  • Reduced CapEx and OpEx (no more dedicated Fibre Channel networks, cards, or bridges)
  • Simplicity of management (use existing IP networks)

mcc-ip.png

The ONTAP 9.3 release is going to be a limited release for this feature, with the following caveats:

  • A700, FAS9000 only
  • 100km limit
  • Dedicated ISL with extended VLAN currently required
  • 1 iWARP card per node

SolidFire to ONTAP SnapMirror

A few years back, the concept of a data fabric (where all of your data can be moved anywhere with the click of a button) was introduced.

That vision continued this year with the inclusion of SnapMirror from SolidFire (and NetApp HCI systems) to ONTAP.

sf-snapmirror.png

ONTAP 9.3 will allow storage administrators to implement a disaster recovery plan for their SolidFire systems.

This includes the following:

  • Baseline and incremental replication using NetApp SnapMirror from SolidFire to ONTAP
  • Failover storage to ONTAP for disaster recovery
  • Failback storage from ONTAP to SolidFire
    • Only for LUNs replicated from SolidFire
    • Replication from ONTAP to SolidFire only for failback

That covers a deeper look at some of the new ONTAP 9.3 features. Feel free to comment if you want to learn more about these features, or any not listed in the overview.