How to Map File and Folder Locations to NetApp ONTAP FlexGroup Member Volumes with XCP

The concept behind a NetApp FlexGroup volume is that ONTAP presents a single large namespace for NAS data while handling the balance and placement of files and folders across the underlying FlexVol member volumes, rather than a storage administrator having to manage that placement.

I cover it in more detail in:

There’s also this USENIX presentation:

However, while not knowing/caring where files and folders live in your cluster is nice most of the time, there are occasions where you need to figure out where a file or folder *actually* lives in the cluster – such as when a member volume has a large imbalance of capacity usage and you need to know which files to delete or move out of that volume. Previously, there was no really good way to do that, but thanks to the efforts of one of our global solutions architects (and one of the inventors of XCP), we now have a way – and we don’t even need a treasure map.


What is NetApp XCP?

If you’re unfamiliar with NetApp XCP, it’s NetApp’s FREE copy utility/data mover that can also be used for file analytics. There are other use cases, too:

Using XCP to delete files en masse: A race against rm

How to find average file size and largest file size using XCP

Because XCP can run in parallel from a client, it can perform tasks (such as find) much faster in high file count environments, so you’re not sitting around waiting for a command to finish for minutes/hours/days.

Since a FlexGroup is pretty much made for high file count environments, we’d want a way to quickly find files and their locations.

ONTAP NAS and File Handles

In How to identify a file or folder in ONTAP in NFS packet traces, I covered how to find inode information and a little bit about how ONTAP file handles are created/presented. The deep details aren’t super important here, but the general concept – that FlexGroup member volume information is stored in file handles that NFS can read – is.

Using that information and some parsing, there’s a Python script that can be used as an XCP plugin to translate file handles into member volume index numbers and present them in easy-to-read formats.
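Conceptually, what the plugin’s fgid() function does is pull the member volume index out of each object’s NFS file handle. Here’s a rough sketch of that idea in Python – note that the byte offset below is made up purely for illustration; the real plugin understands the actual ONTAP file handle layout and XCP’s plugin interface:

import struct

def fgid_sketch(filehandle_bytes, offset=12):
    # Illustration only: read a 4-byte big-endian integer out of an NFS file
    # handle at an assumed offset. The real FlexGroup File Mapper plugin knows
    # the actual ONTAP file handle format and which field is the member index.
    (index,) = struct.unpack_from(">I", filehandle_bytes, offset)
    return index

# Fabricated 32-byte file handle with the value 6 planted at the assumed offset:
fh = bytes(12) + struct.pack(">I", 6) + bytes(16)
print(fgid_sketch(fh))   # 6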

That Python script can be found here:

FlexGroup File Mapper

How to Use the “FlexGroup File Mapper” plugin with XCP

First of all, you’d need a client that has XCP installed. The version isn’t super important, but the latest release is generally the best release to use.

There are two methods we’ll use here to map files to member volumes.

  1. Scan All Files/Folders in a FlexGroup and Map Them All to Member Volumes
  2. Use a FlexGroup Member Volume Number and Find All Files in that Member Volume

To do this, I’ll use a FlexGroup that has ~2 million files.

::*> df -i FGNFS
Filesystem iused ifree %iused Mounted on Vserver
/vol/FGNFS/ 2001985 316764975 0% /FGNFS DEMO

Getting the XCP Host Ready

First, copy the FlexGroup File Mapper plugin to the XCP host. The file name isn’t important, but when you run the XCP command, you’ll either want to specify the plugin’s location or run the command from the folder the plugin lives in.

On my XCP host, I have the plugin named fgid.py in /testXCP:

# ls -la | grep fgid.py
-rw-r--r-- 1 502 admin 1645 Mar 25 17:34 fgid.py
# pwd
/testXCP

Scan All Files/Folders in a FlexGroup and Map Them All to Member Volumes

In this case, we’ll map all files and folders to their respective FlexGroup member volumes.

This is the command I use:

xcp diag -run fgid.py scan -fmt '"{} {}".format(x, fgid(x))' 10.10.10.10:/exportname

The -fmt argument is a Python expression that XCP evaluates for each scanned object: x is the file or folder being scanned and fgid(x) calls the plugin’s function to return its member volume index. You can also include -parallel <n> to control how many processes spin up to do the work, and you can add > filename at the end to redirect the output to a file (recommended).

For example, scanning ~2 million files in this volume took just 37 seconds!

# xcp diag -run fgid.py scan -fmt '"{} {}".format(x, fgid(x))' 10.10.10.10:/FGNFS > FGNFS.txt
402,061 scanned, 70.6 MiB in (14.1 MiB/s), 367 KiB out (73.3 KiB/s), 5s
751,933 scanned, 132 MiB in (12.3 MiB/s), 687 KiB out (63.9 KiB/s), 10s
1.10M scanned, 193 MiB in (12.2 MiB/s), 1007 KiB out (63.6 KiB/s), 15s
1.28M scanned, 225 MiB in (6.23 MiB/s), 1.14 MiB out (32.6 KiB/s), 20s
1.61M scanned, 283 MiB in (11.6 MiB/s), 1.44 MiB out (60.4 KiB/s), 25s
1.91M scanned, 335 MiB in (9.53 MiB/s), 1.70 MiB out (49.5 KiB/s), 31s
2.00M scanned, 351 MiB in (3.30 MiB/s), 1.79 MiB out (17.4 KiB/s), 36s
Sending statistics…

Xcp command : xcp diag -run fgid.py scan -fmt "{} {}".format(x, fgid(x)) 10.10.10.10:/FGNFS
Stats : 2.00M scanned
Speed : 351 MiB in (9.49 MiB/s), 1.79 MiB out (49.5 KiB/s)
Total Time : 37s.
STATUS : PASSED

The file created was 120MB, though… that’s a LOT of text to sort through.

-rw-r--r--. 1 root root 120M Apr 27 15:28 FGNFS.txt
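If you do keep the full output around, it’s easy to boil down. Each line is just a path followed by a member volume index (you can see the format in the snippets below), so a few lines of Python – a quick sketch that assumes that “path index” format – will give you a per-member file count:

from collections import Counter

counts = Counter()
with open("FGNFS.txt") as f:
    for line in f:
        parts = line.rsplit(None, 1)   # split the trailing member index off the path
        if len(parts) == 2 and parts[1].isdigit():
            counts[int(parts[1])] += 1

for member, total in sorted(counts.items()):
    print("member {}: {} files".format(member, total))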

So, there’s another way to do this, right? Correct!

If I know the folder I want to filter on, or even a file name pattern to match, I can use -match in the command. In this case, I want to find all folders named dir_33.

This is the command:

# xcp diag -run fgid.py scan -fmt '"{} {}".format(x, fgid(x))' -match "name=='dir_33'" 10.10.10.10:/FGNFS > dir_33_FGNFS.txt

This is the output of the file. Two folders – one in member volume 3, one in member volume 4:

# cat dir_33_FGNFS.txt
x.x.x.x:/FGNFS/files/client1/dir_33 3
x.x.x.x:/FGNFS/files/client2/dir_33 4

If I want to use pattern matching for file names (i.e., I know I want all files with “moarfiles3” in the name), then I can do this using regex and/or wildcards. More examples can be found in the XCP user guides.
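If you want to sanity-check a wildcard before kicking off a big scan, plain Python’s fnmatch module evaluates the same style of shell wildcards locally (this is just a quick local test of the pattern logic – XCP’s own fnm() matching behavior is documented in the XCP user guides):

import fnmatch

names = ["moarfiles3158.txt", "moarfiles4000.txt", "dir_33"]
print([n for n in names if fnmatch.fnmatch(n, "moarfiles3*")])
# ['moarfiles3158.txt']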

Here’s the command I used. It found 444,400 files with that pattern in 27s.

# xcp diag -run fgid.py scan -fmt '"{} {}".format(x, fgid(x))' -match "fnm('moarfiles3*')" 10.10.10.10:/FGNFS > moarfiles3_FGNFS.txt

507,332 scanned, 28,097 matched, 89.0 MiB in (17.8 MiB/s), 465 KiB out (92.9 KiB/s), 5s
946,796 scanned, 132,128 matched, 166 MiB in (15.4 MiB/s), 866 KiB out (80.1 KiB/s), 10s
1.31M scanned, 209,340 matched, 230 MiB in (12.8 MiB/s), 1.17 MiB out (66.2 KiB/s), 15s
1.73M scanned, 297,647 matched, 304 MiB in (14.8 MiB/s), 1.55 MiB out (77.3 KiB/s), 20s
2.00M scanned, 376,195 matched, 351 MiB in (9.35 MiB/s), 1.79 MiB out (48.8 KiB/s), 25s
Sending statistics…

Filtered: 444400 matched, 1556004 did not match

Xcp command : xcp diag -run fgid.py scan -fmt "{} {}".format(x, fgid(x)) -match fnm('moarfiles3*') 10.10.10.10:/FGNFS
Stats : 2.00M scanned, 444,400 matched
Speed : 351 MiB in (12.6 MiB/s), 1.79 MiB out (65.7 KiB/s)
Total Time : 27s.

And this is a sample of some of those entries (the file is 27MB):

x.x.x.x:/FGNFS/files/client1/dir_45/moarfiles3158.txt 3
x.x.x.x:/FGNFS/files/client1/dir_45/moarfiles3159.txt 3

I can also look for files over a certain size. In this volume, the files are all 4K in size, but in my TechONTAP volume, I have varying file sizes. In this case, I want to find all .wav files greater than 100MB. This command didn’t seem to redirect to a file for me, but the output was only 16 files.

# xcp diag -run fgid.py scan -fmt '"{} {}".format(x, fgid(x))' -match "fnm('.wav') and size > 100*M" 10.10.10.11:/techontap > TechONTAP_ep.txt

10.10.10.11:/techontap/Episodes/Episode 20x - Genomics Architecture/ep20x-genomics-meat.wav 4
10.10.10.11:/techontap/archive/combine.band/Media/Audio Files/ep104-webex.output.wav 5
10.10.10.11:/techontap/archive/combine.band/Media/Audio Files/ep104-mics.output.wav 3
10.10.10.11:/techontap/archive/Episode 181 - Networking Deep Dive/ep181-networking-deep-dive-meat.output.wav 6
10.10.10.11:/techontap/archive/Episode 181 - Networking Deep Dive/ep181-networking-deep-dive-meat.wav 2

Filtered: 16 matched, 7687 did not match

xcp command : xcp diag -run fgid.py scan -fmt "{} {}".format(x, fgid(x)) -match fnm('.wav') and size > 100M 10.10.10.11:/techontap
Stats : 7,703 scanned, 16 matched
Speed : 1.81 MiB in (1.44 MiB/s), 129 KiB out (102 KiB/s)
Total Time : 1s.
STATUS : PASSED

But what if I know that a member volume is getting full and I want to see what files are in that member volume?

Use a FlexGroup Member Volume Number and Find All Files in that Member Volume

In the case where I know what member volume needs to be addressed, I can use XCP to search using the FlexGroup index number. The index number lines up with the member volume numbers, so if the index number is 6, then we know the member volume is 6.

In my 2 million file FlexGroup, I want to filter by member 6, so I use this command, which shows there are ~95,019 files in member 6:

# xcp diag -run fgid.py scan -match 'fgid(x)==6' -parallel 10 -l 10.10.10.10:/FGNFS > member6.txt

 615,096 scanned, 19 matched, 108 MiB in (21.6 MiB/s), 563 KiB out (113 KiB/s), 5s
 1.03M scanned, 5,019 matched, 180 MiB in (14.5 MiB/s), 939 KiB out (75.0 KiB/s), 10s
 1.27M scanned, 8,651 matched, 222 MiB in (8.40 MiB/s), 1.13 MiB out (43.7 KiB/s), 15s
 1.76M scanned, 50,019 matched, 309 MiB in (17.3 MiB/s), 1.57 MiB out (89.9 KiB/s), 20s
 2.00M scanned, 62,793 matched, 351 MiB in (8.35 MiB/s), 1.79 MiB out (43.7 KiB/s), 25s

Filtered: 95019 matched, 1905385 did not match

Xcp command : xcp diag -run fgid.py scan -match fgid(x)==6 -parallel 10 -l 10.10.10.10:/FGNFS
Stats       : 2.00M scanned, 95,019 matched
Speed       : 351 MiB in (12.5 MiB/s), 1.79 MiB out (65.0 KiB/s)
Total Time  : 28s.
STATUS      : PASSED

When I check against the files-used for that member volume, it lines up pretty well:

::*> vol show -vserver DEMO -volume FGNFS__0006 -fields files-used
vserver volume      files-used
------- ----------- ----------
DEMO    FGNFS__0006 95120

And the output file shows not just the file names, but also the sizes!

rw-r--r-- --- root root 4KiB 4KiB 18h22m FGNFS/files/client2/dir_143/moarfiles1232.txt
rw-r--r-- --- root root 4KiB 4KiB 18h22m FGNFS/files/client2/dir_143/moarfiles1233.txt
rw-r--r-- --- root root 4KiB 4KiB 18h22m FGNFS/files/client2/dir_143/moarfiles1234.txt

And, if I choose, I can filter further with the sizes. Maybe I just want to see files in that member volume that are 4K or less (in this case, that’s all of them):

# xcp diag -run fgid.py scan -match 'fgid(x)==6 and size < 4*K' -parallel 10 -l 10.10.10.10:/FGNFS

In my “TechONTAP” volume, I look for 500MB files or greater in member 6:

# xcp diag -run fgid.py scan -match 'fgid(x)==6 and size > 500*M' -parallel 10 -l 10.10.10.11:/techontap

rw-r--r-- --- 501 games 596MiB 598MiB 3y219d techontap/Episodes/Episode 1/Epidose 1 GBF.band/Media/Audio Files/Tech ONTAP Podcast - Episode 1 - AFF with Dan Isaacs v3_1.aif
rw-r--r-- --- 501 games 885MiB 888MiB 3y219d techontap/archive/Prod - old MacBook/Insight 2016_Day2_TechOnTap_JParisi_ASullivan_GDekhayser.mp4
rw-r--r-- --- 501 games 787MiB 790MiB 1y220d techontap/archive/Episode 181 - Networking Deep Dive/ep181-networking-deep-dive-meat.output.wav

Filtered: 3 matched, 7700 did not match

Xcp command : xcp diag -run fgid.py scan -match fgid(x)==6 and size > 500*M -parallel 10 -l 10.10.10.11:/techontap
Stats : 7,703 scanned, 3 matched
Speed : 1.81 MiB in (1.53 MiB/s), 129 KiB out (109 KiB/s)
Total Time : 1s.
STATUS : PASSED

So, there you have it! A way to find files in a specific member volume inside of a FlexGroup! Let me know if you have any comments or questions below.

How to identify a file or folder in ONTAP in NFS packet traces

When you’re troubleshooting NFS issues, sometimes you have to collect a packet capture to see what’s going on. But the issue is, packet captures don’t really tell you the file or folder names. I like to use Wireshark for Mac and Windows, and regular old tcpdump for Linux. For ONTAP, you can run packet captures using this KB (requires NetApp login):

How to capture packet traces (tcpdump) on ONTAP 9.2+ systems

By default, Wireshark shows NFS packets like this ACCESS call. We see a FH, which is in hex, and then we see another filehandle that’s even more unreadable. We’ll occasionally see file names in the trace (like copy-file below), but if we need to find out why an ACCESS call fails, we’ll have difficulty:

Luckily, Wireshark has some built-in stuff to crack open those NFS file handles in ONTAP.

Also, check out this new blog:

How to Map File and Folder Locations to NetApp ONTAP FlexGroup Member Volumes with XCP

Changing Wireshark Settings

First, we’d want to set the NFS preferences. That’s done via Edit -> Preferences and then by clicking on “Protocols” in the left hand menu and selecting NFS:

Here, you’ll see some options that you can read more about by mousing over them:

I just select them all.

When we go to the packet we want to analyze, we can right click and select “Decode As…”:

This brings up the “Decode As” window. Here, we have “NFS File Handle Types” pre-populated. Double-click (none) under “Current” and you get a drop down menu. You’ll get some options for NFS, including…. ONTAP! In this case, since I’m using clustered ONTAP, I select ontap_gx_v3. (GX is what clustered ONTAP was before clustered ONTAP was clustered ONTAP):

If you click “OK” it will apply to the current session only. If you click “Save” it will keep those preferences every time.

Now, when the ACCESS packet is displayed, I get WAY more information about the file in question and they’re translated to decimal values.

Those still don’t mean a lot to us, but I’ll get to that.

Mapping file handle values to files in ONTAP

Now, we can use the ONTAP CLI and the packet capture to discern exactly what file has that ACCESS call.

Every volume in ONTAP has a unique identifier called a “Master Set ID” (or MSID). You can see the volume’s MSID with the following diag priv command:

cluster::*> vol show -vserver DEMO -volume vol2 -fields msid
vserver volume  msid
------- ------- -----------
DEMO    vol2    2163230318

If you know the volume name you’re troubleshooting, then that makes life easier – just use find in the packet details.

If you don’t, the MSID can be found in a packet trace in the ACCESS reply as the “fsid”:

You can then find the volume name and exported path with the MSID in the ONTAP CLI with:

cluster::*> set diag; vol show -vserver DEMO -msid  2163230318 -fields volume,junction-path
vserver volume  junction-path
------- ------- ----------- 
DEMO    vol2    /vol2 

File and directory handles are constructed using that MSID, which is why each volume is considered a distinct filesystem. But we don’t care about that, because Wireshark figures all that out for us and we can use the ONTAP CLI to figure it out as well.

The pertinent information in the trace as it maps to the files and folders are:

  • Spin file id = inode number in ONTAP
  • Spin file unique id = file generation number
  • File id = inode number as seen by the NFS client

If you know the volume and file or folder’s name, you can easily find the inode number in ONTAP with this command:

cluster::*> set advanced; showfh -vserver DEMO /vol/vol2/folder
Vserver                Path
---------------------- ---------------------------
DEMO                   /vol/vol2/folder
flags   snapid fileid    generation fsid       msid         dsid
------- ------ --------- ---------- ---------- ------------ ------------
0x8000  0      0x658e    0x227ed312 -          -            0x1639

In the above, the values are in hex, but we can translate with a hex converter, like this one:

https://www.rapidtables.com/convert/number/hex-to-decimal.html

So, for the values we got:

  • file ID (inode) 0x658e = 25998
  • generation ID 0x227ed312 = 578736914
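Or skip the website and let Python do the conversion:

print(int("0x658e", 16))       # 25998
print(int("0x227ed312", 16))   # 578736914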

In the trace, that matches up:

Finding file names and paths by inode number

But what happens if you don’t know the file name and just have the information from the trace?

One way is to use the nodeshell level command “inodepath.”

::*> node run -node node1 inodepath -v files 15447
Inode 15447 in volume files (fsid 0x142a) has 1 name.
Volume UUID is: 76a69b93-cc2f-11ea-b16f-00a098696eda
[ 1] Primary pathname = /vol/files/newapps/user1-file-smb

This will work with a FlexGroup volume as well, provided you know the node and the member volume where the file lives (see “How to Map File and Folder Locations to NetApp ONTAP FlexGroup Member Volumes with XCP” for a way to figure that info out).

::*> node run -node node2 inodepath -v FG2__0007 5292
Inode 5292 in volume FG2__0007 (fsid 0x1639) has 1 name.
Volume UUID is: 87b14652-9685-11eb-81bf-00a0986b1223
[ 1] Primary pathname = /vol/FG2/copy-file-finder

There’s also a diag privilege command in ONTAP for that. The caveat is it can be dangerous to run, especially if you make a mistake in running it. (And when I say dangerous, I mean best case, it hangs your CLI session for a while; worst case, it panics the node.) If possible, use inodepath instead.

Here’s how we could use the volume name and inode number to find the file name. For a FlexVol volume, it’s simple:

cluster::*> vol explore -format inode -scope volname.inode -dump name

For example:

cluster::*> volume explore -format inode -scope files.15447 -dump name
name=/newapps/user1-file-smb

With a FlexGroup volume, however, it’s a little more complicated: there are member volumes to take into account, and there’s no easy way for ONTAP to discern which FlexGroup member volume holds the file, since ONTAP inode numbers can be reused in different member volumes. That’s because the file IDs presented to NFS clients are created using the inode numbers and things like the member volume’s MSID (which is different from the FlexGroup’s MSID).

To make this happen with volume explore, we’d be working in reverse – listing the contents of the volume’s files/folders, then using the inode number of the parent folder, listing those, etc. With high file count environments, this is basically an impossibility.

In that case, we’d need to use an NFS client to discover the file name associated with the inode number in question.

From the client, we have two commands to find an inode number for a file. In this case we know the file’s location and name:

# ls -i /mnt/client1/copy-file-finder
4133624749 /mnt/client1/copy-file-finder
# stat copy-file-finder
File: ‘copy-file-finder’
Size: 12 Blocks: 0 IO Block: 1048576 regular file
Device: 2eh/46d Inode: 4133624749 Links: 1
Access: (0555/-r-xr-xr-x) Uid: ( 1102/ prof1) Gid: (10002/ProfGroup)
Access: 2021-04-14 11:47:45.579879000 -0400
Modify: 2021-04-14 11:47:45.588875000 -0400
Change: 2021-04-14 17:34:07.364283000 -0400
Birth: -
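The same number is available programmatically, too – for example, with Python’s os.stat (using the mount path from the example above):

import os

st = os.stat("/mnt/client1/copy-file-finder")
print(st.st_ino)   # 4133624749 -- the same value ls -i and stat report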

In a packet trace, that inode number is “fileid” and found in REPLY calls, such as GETATTR:

If we only know the inode number (as if we got it from a packet trace), we can use the number on the client to find the file name. One way is with “find”:

# find /path/to/mountpoint -inum <inodenumber>

For example:

# find /mnt/client1 -inum 4133624749
/mnt/client1/copy-file-finder

“find” can take a while – especially in a high file count environment, so we could also use XCP.

# xcp -l -match 'fileid== <inodenumber>' server1:/export

In this case:

# xcp -l -match 'fileid== 4133624749' DEMO:/FG2
XCP 1.6.1; (c) 2021 NetApp, Inc.; Licensed to Justin Parisi [NetApp Inc] until Tue Jun 22 12:34:48 2021

r-xr-xr-x --- 1102 10002 12 0 12d23h FG2/copy-file-finder

Filtered: 8173 did not match

Xcp command : xcp -l -match fileid== 4133624749 DEMO:/FG2
Stats : 8,174 scanned, 1 matched
Speed : 1.47 MiB in (2.10 MiB/s), 8.61 KiB out (12.3 KiB/s)
Total Time : 0s.
STATUS : PASSED

Hope this helps you find files in your NFS filesystem! If you have questions or comments, leave them below.

How pNFS could benefit cloud architecture

** Edited on April 2, 2021 **
Funny story about this post. Someone pointed out I had some broken links, so I went in and edited the links. When I clicked “publish” it re-posted the article, which was actually a pointer back to an old DatacenterDude article I wrote from 2015 – which no longer exists. So I started getting *more* pings about broken links and plenty of people seemed to be interested in the content. Thanks to the power of the Wayback Machine, I was able to resurrect the post and decided to do some modernization while I was at it.

Yesterday, I was speaking with a customer who is a cloud provider. They were discussing how to use NFSv4 with Data ONTAP for one of their customers. As we were talking, I brought up pNFS and its capabilities. They were genuinely excited about what pNFS could do for their particular use case. In the cloud, the idea is to remove the overhead of managing infrastructure, so most cloud architectures are geared towards automation, limiting management, etc. In most cases, that’s great, but for data locality in NAS environments, we need a way to make those operations seamless, as well as providing the best possible security available. That’s where pNFS comes in.


So, let’s talk about what pNFS is and in what use cases you may want to use it.

What is pNFS?

pNFS is “parallel NFS,” which is a little bit of a misnomer in ONTAP, as it doesn’t do parallel reads and writes across single files (i.e., striping). In the case of pNFS on Data ONTAP, NetApp currently supports file-level pNFS, so the object store would be a flexible volume on an aggregate of physical disks.

pNFS in ONTAP establishes a metadata path to the NFS server and then splits off the data path to its own dedicated path. The client works with the NFS server to determine which path is local to the physical location of the files in the NFS filesystem via LAYOUTGET and GETDEVICEINFO metadata calls (specific to NFSv4.1 and later) and then dynamically redirects the path to be local. Think of it as ALUA for NFS.

The following graphic shows how that all takes place.


pNFS defines the notion of a device that is generated by the server (that is, an NFS server running on Data ONTAP) and sent to the client. This process helps the client locate the data and send requests directly over the path local to that data. Data ONTAP generates one pNFS device per flexible volume. The metadata path does not change, so metadata requests might still be remote. In a Data ONTAP pNFS implementation, every data LIF is considered an NFS server, so pNFS only works if each node owns at least one data LIF per NFS SVM. Doing otherwise negates the benefits of pNFS, which is data locality regardless of which IP address a client connects to.

The pNFS device contains information about the following:

  • Volume constituents
  • Network location of the constituents

The device information is cached to the local node for improved performance.

To see pNFS devices in the cluster, use the following command in advanced privilege:

cluster::> set diag
cluster::*> vserver nfs pnfs devices cache show

pNFS Components

There are three main components of pNFS:

  • Metadata server
    • Handles all nondata traffic such as GETATTR, SETATTR, and so on
    • Responsible for maintaining metadata that informs the clients of the file locations
    • Located on the NetApp NFS server and established via the mount point
  • Data server
    • Stores file data and responds to READ and WRITE requests
    • Located on the NetApp NFS server
    • Inode information also resides here
  • Clients

pNFS is covered in further detail in NetApp TRs 4067 (NFS) and 4571 (FlexGroup volumes).

How Can I Tell pNFS is Being Used?

To check whether pNFS is in use, you can use statistics counters to look at the “pnfs_layout_conversions” counter. If pnfs_layout_conversions is incrementing, then pNFS is in use. Keep in mind that if you try to use pNFS with a single network interface, the data layout conversions won’t take place and pNFS won’t be used, even if it’s enabled.

cluster::*> statistics start -object nfsv4_1_diag
cluster::*> statistics show -object nfsv4_1_diag -counter pnfs_layout_conversions


Object: nfsv4_1_diag
Instance: nfs4_1_diag
Start-time: 4/9/2020 16:29:50
End-time: 4/9/2020 16:31:03
Elapsed-time: 73s
Scope: node1

    Counter                                                     Value
   -------------------------------- --------------------------------
   pnfs_layout_conversions                                      4053

Gotta keep ’em separated!

One thing that is beneficial about the design of pNFS is that the metadata paths are separated from the read/write paths. Once a mount is established, the metadata path is set on the IP address used for mount and does not move without manual intervention. In Data ONTAP, that path could live anywhere in the cluster. (Up to 24 physical nodes with multiple ports on each node!)

That buys you resiliency, as well as flexibility to control where the metadata will be served.

The data path, however, is only established on reads and writes. That path is determined in conversations between the client and server and is dynamic. Any time the physical location of a volume changes, the data path changes automatically, without the need for intervention by the clients or the storage administrator. So, unlike with NFSv3 or even NFSv4.0, you no longer need to break the TCP connection (via unmount or LIF migrations) to make the read/write path local. And with NFSv4.x, the statefulness of the connection can be preserved.

That means more time for everyone. Data can be migrated in real time, non-disruptively, based on the storage needs of the client.

For example, I have a volume that lives on node cluster01 of my cDOT cluster:

cluster::> vol show -vserver SVM -volume unix -fields node
 (volume show)

vserver volume node
------- ------ --------------
SVM     unix   cluster01

I have data LIFs on each node in my cluster:

 cluster::> net int show -vserver SVM
(network interface show)

Logical     Status     Network                       Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM
             data1      up/up     10.63.57.237/18    cluster01     e0c     true
             data2      up/up     10.63.3.68/18      cluster02     e0c     true
2 entries were displayed.

In the above list:

  • 10.63.3.68 will be my metadata path, since that’s where I mounted.
  • 10.63.57.237 will be my data path, as it is local to the physical node (cluster01) where the volume lives.

When I mount, the TCP connection is established to the node where the data LIF lives:

nfs-client# mount -o minorversion=1 10.63.3.68:/unix /unix

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local             Port Host:Port              Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                                TCP/nfs

My metadata path is established to cluster02, but my data volume lives on cluster01.

On a basic cd and ls into the mount, all the traffic is seen on the metadata path. (stuff like GETATTR, ACCESS, etc):

83     6.643253      10.228.225.140       10.63.3.68    NFS    270    V4 Call (Reply In 85) GETATTR
85     6.648161      10.63.3.68    10.228.225.140       NFS    354    V4 Reply (Call In 83) GETATTR
87     6.652024      10.228.225.140       10.63.3.68    NFS    278    V4 Call (Reply In 88) ACCESS 
88     6.654977      10.63.3.68    10.228.225.140       NFS    370    V4 Reply (Call In 87) ACCESS

When I start I/O to that volume, the path gets updated to the local path by way of new pNFS calls (specified in RFC-5663):

28     2.096043      10.228.225.140       10.63.3.68    NFS    314    V4 Call (Reply In 29) LAYOUTGET
29     2.096363      10.63.3.68    10.228.225.140       NFS    306    V4 Reply (Call In 28) LAYOUTGET
30     2.096449      10.228.225.140       10.63.3.68    NFS    246    V4 Call (Reply In 31) GETDEVINFO
31     2.096676      10.63.3.68    10.228.225.140       NFS    214    V4 Reply (Call In 30) GETDEVINFO
  1. In LAYOUTGET, the client asks the server “where does this filehandle live?”
  2. The server responds with the device ID and physical location of the filehandle.
  3. Then, the client asks “what devices are available to me to access that physical data?” via GETDEVINFO.
  4. The server responds with the list of available devices/IP addresses.


Once that communication takes place (and note that the conversation occurs in sub-millisecond times), the client then establishes the new TCP connection for reads and writes:

32     2.098771      10.228.225.140       10.63.57.237  TCP    74     917 > nfs [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=937300318 TSecr=0 WS=128
33     2.098996      10.63.57.237  10.228.225.140       TCP    78     nfs > 917 [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=2452178641 TSecr=937300318
34     2.099042      10.228.225.140       10.63.57.237  TCP    66     917 > nfs [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=937300318 TSecr=2452178641

And we can see the connection established on the cluster to both the metadata and data locations:

cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local             Port Host:Port              Protocol/Service
---------- ---------------------- ---------------------------- ----------------
Node: cluster01
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                               TCP/nfs
Node: cluster02 
SVM         data2:2049             nfs-client.domain.netapp.com:912 
                                                               TCP/nfs

Then we start our data transfer on the new path (data path 10.63.57.237):

38     2.099798      10.228.225.140       10.63.57.237  NFS    250    V4 Call (Reply In 39) EXCHANGE_ID
39     2.100137      10.63.57.237  10.228.225.140       NFS    278    V4 Reply (Call In 38) EXCHANGE_ID
40     2.100194      10.228.225.140       10.63.57.237  NFS    298    V4 Call (Reply In 42) CREATE_SESSION
42     2.100537      10.63.57.237  10.228.225.140       NFS    194    V4 Reply (Call In 40) CREATE_SESSION

157    2.106388      10.228.225.140       10.63.57.237  NFS    15994  V4 Call (Reply In 178) WRITE StateID: 0x0d20 Offset: 196608 Len: 65536
163    2.106421      10.63.57.237  10.228.225.140       NFS    182    V4 Reply (Call In 127) WRITE

If I do a chmod later, the metadata path is used (10.63.3.68):

341    27.268975     10.228.225.140       10.63.3.68    NFS    310    V4 Call (Reply In 342) SETATTR FH: 0x098eaec9
342    27.273087     10.63.3.68    10.228.225.140       NFS    374    V4 Reply (Call In 341) SETATTR | ACCESS

How do I make sure metadata connections don’t pile up?

When you have many clients mounting to an NFS server, you generally want to try to control which nodes those clients are mounting to. In the cloud, this becomes trickier to do, as clients and storage system management may be handled by the cloud providers. So, we’d want to have a noninteractive way to do this.

With ONTAP, you have two options to load balance TCP connections for metadata. You can use the tried and true DNS round-robin method, but the NFS server doesn’t have any idea what IP addresses have been issued by the DNS server, so as a result, there are no guarantees the connections won’t pile up.

Another way to deal with connections is to leverage the ONTAP feature for on-box DNS load balancing. This feature allows storage administrators to set up a DNS forwarding zone on a DNS server (BIND, Active Directory or otherwise) to forward requests to the clustered Data ONTAP data LIFs, which can act as DNS servers complete with SOA records! The cluster will determine which IP address to issue to a client based on the following factors:

  • CPU load
  • overall node throughput

This helps ensure that any TCP connection that is established is done so in a logical manner, based on the performance of the physical hardware.

I cover both types of DNS load balancing in TR-4523: DNS Load Balancing in ONTAP.

What about that data agility?

What’s great about pNFS is that it is a perfect fit for storage operating systems like ONTAP. NetApp and RedHat worked together closely on the protocol enhancement, and it shows in its overall implementation.

In ONTAP, there is the concept of non-disruptive volume moves. This feature gives storage administrators agility and flexibility in their clusters, as well as enabling service and cloud providers a way to charge based on tiers (pay as you grow!).

For example, if I am a cloud provider, I could have a 24-node cluster as a backend. Some HA pairs could be All-Flash FAS (AFF) nodes for high-performance/low latency workloads. Some HA pairs could be SATA or SAS drives for low performance/high capacity/archive storage. If I am providing storage to a customer that wants to implement high performance computing applications, I could sell them the performance tier. If those applications are only going to run during the summer months, we can use the performance tier, and after the jobs are complete, we can move them back to SATA/SAS drives for storage and even SnapMirror or SnapVault them off to a DR site for safekeeping. Once the job cycle comes back around, I can nondisruptively move the volumes back to flash. That saves the customer money, as they only pay for the performance they’re using, and that saves the cloud provider money since they can free up valuable flash real estate for other customers that need performance tier storage.

What happens when a volume moves in pNFS?

When a volume move occurs, the client is notified of the change via the pNFS calls I mentioned earlier. When the file attempts to OPEN for writing, the server responds, “that file is somewhere else now.”

220    24.971992     10.228.225.140       10.63.3.68    NFS    386    V4 Call (Reply In 221) OPEN DH: 0x76306a29/testfile3
221    24.981737     10.63.3.68    10.228.225.140       NFS    482    V4 Reply (Call In 220) OPEN StateID: 0x1077

The client says, “cool, where is it now?”

222    24.992860     10.228.225.140       10.63.3.68    NFS    314    V4 Call (Reply In 223) LAYOUTGET
223    25.005083     10.63.3.68    10.228.225.140       NFS    306    V4 Reply (Call In 222) LAYOUTGET
224    25.005268     10.228.225.140       10.63.3.68    NFS    246    V4 Call (Reply In 225) GETDEVINFO
225    25.005550     10.63.3.68    10.228.225.140       NFS    214    V4 Reply (Call In 224) GETDEVINFO

Then the client uses the new path to start writing, with no interaction needed.

251    25.007448     10.228.225.140       10.63.57.237  NFS    7306   V4 Call WRITE StateID: 0x15da Offset: 0 Len: 65536
275    25.007987     10.228.225.140       10.63.57.237  NFS    7306   V4 Call WRITE StateID: 0x15da Offset: 65536 Len: 65536

Automatic Data Tiering

If you have an on-premises storage system and want to save storage infrastructure costs by automatically tiering cold data to the cloud or to an on-premises object storage system, you could leverage NetApp FabricPool, which allows you to set tiering policies to chunk off cold blocks of data to more cost effective storage and then retrieve those blocks whenever they are requested by the end user. Again, we’re taking the guesswork and labor out of data management, which is becoming critical in a world driven towards managed services.

For more information on FabricPool:

TR-4598: FabricPool Best Practices

Tech ONTAP Podcast Episode 268 – NetApp FabricPool and S3 in ONTAP 9.8

What about FlexGroup volumes?

As of ONTAP 9.7, NFSv4.1 and pNFS are supported with FlexGroup volumes, which is an intriguing solution.

Part of the challenge of a FlexGroup volume is that you’re guaranteed to have remote I/O across a cluster network when you span multiple nodes. But since pNFS automatically redirects traffic to local paths, you can greatly reduce the amount of intracluster traffic.

A FlexGroup volume operates as a single entity, but is constructed of multiple FlexVol member volumes. Each member volume contains unique files that are not striped across volumes. When NFS operations connect to FlexGroup volumes, ONTAP handles the redirection of operations over a cluster network.

With pNFS, these remote operations are reduced, because the data layout mappings track the member volume locations and local network interfaces; they also redirect reads/writes to the local member volume inside a FlexGroup volume, even though the client only sees a single namespace. This approach enables a scale-out NFS solution that is more seamless and easier to manage, and it also reduces cluster network traffic and balances data network traffic more evenly across nodes.

FlexGroup pNFS differs a bit from FlexVol pNFS. Even though a FlexGroup load-balances file opens across metadata servers, pNFS uses a different algorithm: it tries to direct traffic to the node on which the target file is located. If there are multiple data LIFs per node, connections can be made to each of those LIFs, but only one LIF from that set is used to direct traffic to the volumes on that node.

What workloads should I use with pNFS?

pNFS is leveraging NFSv4.1 and later as its protocol, which means you get all the benefits of NFSv4.1 (security, Kerberos and lock integration, lease-based locks, delegations, ACLs, etc.). But you also get the potential negatives of NFSv4.x, such as higher overhead for operations due to the compound calls, state ID handling, locking, etc. and disruptions during storage failovers that you wouldn’t see with NFSv3 due to the stateful nature of NFSv4.x.

Performance can be severely impacted with some workloads, such as high file count workloads/high metadata workloads (think EDA, software development, etc). Why? Well, recall that pNFS is parallel for reads and writes – but the metadata operations still use a single interface for communication. So if your NFS workload is 80% GETATTR, then 80% of your workload won’t benefit from the localization and load balancing that pNFS provides. Instead, you’ll be using NFSv4.1 as if pNFS were disabled.

Plus, with millions of files, even if you’re doing heavy reads and writes, that means you’re redirecting paths constantly with pNFS (creating millions of GETDEVICEINFO and LAYOUTGET calls), which may prove more inefficient than simply using NFSv4.1 without pNFS.

pNFS also would need to be supported by the clients you’re using, so if you want to use it for something like VMware datastores, you’re out of luck (for now). VMware currently supports NFSv4.1, but not pNFS (they went with session trunking, which ONTAP does not currently support).

File-based pNFS works best with workloads that do a lot of sequential IO, such as databases, Hadoop/Apache Spark, AI training workloads, or other large file workloads, where reads and writes dominate the IO.

What about the performance?

In TR-4067, I did some basic performance testing on NFSv3 vs. NFSv4.1 for those types of workloads and the results were that pNFS stacked up nicely with NFSv3.

These tests were done using dd in parallel to simulate a sequential I/O workload. They aren’t intended to show the upper limits of the system (I used an AFF8040 and some VM clients with low RAM and 1Gb networks), but rather to give an apples-to-apples comparison of NFSv3 and NFSv4.1 with and without pNFS, using different wsize/rsize values. Be sure to do your own tests before implementing in production.

Note that the completion time for this workload using pNFS was nearly 5 minutes faster than NFSv3 when using a 1MB wsize/rsize value.

Test (wsize/rsize setting)            Completion Time
NFSv3 (1MB)                           15m23s
NFSv3 (256K)                          14m17s
NFSv3 (64K)                           14m48s
NFSv4.1 (1MB)                         15m6s
NFSv4.1 (256K)                        12m10s
NFSv4.1 (64K)                         15m8s
NFSv4.1 (1MB; pNFS)                   10m54s
NFSv4.1 (256K; pNFS)                  12m16s
NFSv4.1 (64K; pNFS)                   13m57s
NFSv4.1 (1MB; delegations)            13m6s
NFSv4.1 (256K; delegations)           15m25s
NFSv4.1 (64K; delegations)            13m48s
NFSv4.1 (1MB; pNFS + delegations)     11m7s
NFSv4.1 (256K; pNFS + delegations)    13m26s
NFSv4.1 (64K; pNFS + delegations)     10m6s

The IOPS were lower overall for NFSv4.1 than NFSv3; that’s because NFSv4.1 combines operations into single packets. Thus, NFSv4.1 will be less chatty over the network than NFSv3. On the downside, the payloads are larger, so the NFS server has more processing to do for each packet, which can impact CPU, and with more IOPS, you can see a drop in performance due to that overhead.

Where NFSv4.1 beat out NFSv3 was with the latency and throughput – since we can guarantee data locality, we get benefits of fastpathing the reads/writes to the files, rather than the extra processing needed to traverse the cluster network.

Test (wsize/rsize setting)            Average Read    Average Read         Average Write    Average Write         Average Ops
                                      Latency (ms)    Throughput (MB/s)    Latency (ms)     Throughput (MB/s)
NFSv3 (1MB)                           6               654                  27.9             1160                  530
NFSv3 (256K)                          1.4             766                  2.9              1109                  2108
NFSv3 (64K)                           .2              695                  2.2              1110                  8791
NFSv4.1 (1MB)                         6.5             627                  36.8             1400                  582
NFSv4.1 (256K)                        1.4             712                  3.2              1160                  2352
NFSv4.1 (64K)                         .1              606                  1.2              1310                  7809
NFSv4.1 (1MB; pNFS)                   3.6             840                  26.8             1370                  818
NFSv4.1 (256K; pNFS)                  1.1             807                  5.2              1560                  2410
NFSv4.1 (64K; pNFS)                   .1              835                  1.9              1490                  8526
NFSv4.1 (1MB; delegations)            5.1             684                  32.9             1290                  601
NFSv4.1 (256K; delegations)           1.3             648                  3.3              1140                  1995
NFSv4.1 (64K; delegations)            .1              663                  1.3              1000                  7822
NFSv4.1 (1MB; pNFS + delegations)     3.8             941                  22.4             1110                  696
NFSv4.1 (256K; pNFS + delegations)    1.1             795                  3.3              1140                  2280
NFSv4.1 (64K; pNFS + delegations)     .1              815                  1                1170                  11130

For high file count workloads, NFSv3 did much better. This test created 800,000 small files (512K) in parallel. For this high metadata workload, NFSv3 completed 2x as fast as NFSv4.1. pNFS added some time savings versus NFSv4.1 without pNFS, but overall, we can see where we may run into problems with this type of workload. Future releases of ONTAP will get better with this type of workload using NFSv4.1 (these tests were on 9.7).

Test (wsize/rsize setting)    Completion Time    CPU %    Average throughput (MB/s)    Average total IOPS
NFSv3 (1MB)                   17m29s             32%      351                          7696
NFSv3 (256K)                  16m34s             34.5%    372                          8906
NFSv3 (64K)                   16m11s             39%      394                          13566
NFSv4.1 (1MB)                 38m20s             26%      167                          7746
NFSv4.1 (256K)                38m15s             27.5%    167                          7957
NFSv4.1 (64K)                 38m13s             31%      172                          10221
NFSv4.1 pNFS (1MB)            35m44s             27%      171                          8330
NFSv4.1 pNFS (256K)           35m9s              28.5%    175                          8894
NFSv4.1 pNFS (64K)            36m41s             33%      171                          10751

Enter nconnect

One of the keys to pNFS performance is parallelization of operations across volumes, nodes, etc. But it doesn’t necessarily parallelize network connections across these workloads. That’s where the new NFS mount option nconnect comes in.

The purpose of nconnect is to provide multiple TCP transport connections per mount point on a client. This helps increase parallelism and performance for NFS mounts – particularly for single client workloads. Details about nconnect and how it can increase performance for NFS in Cloud Volumes ONTAP can be found in the blog post The Real Baseline Performance Story: NetApp Cloud Volumes Service for AWS. ONTAP 9.8 offers official support for the use of nconnect with NFS mounts, provided the NFS client also supports it. If you would like to use nconnect, check to see if your client version provides it and use ONTAP 9.8 or later. ONTAP 9.8 and later supports nconnect by default with no option needed.

Client support for nconnect varies, but the latest RHEL 8.3 release supports it, as do the latest Ubuntu and SLES releases. Be sure to verify if your OS vendor supports it.

Our Customer Proof of Concept lab (CPOC) did some benchmarking of nconnect with NFSv3 and pNFS using a sequential I/O workload on ONTAP 9.8 and saw some really promising results.

  • Single NVIDIA DGX-2 client
  • Ubuntu 20.04.2
  • NFSv4.1 with pNFS and nconnect
  • AFF A400 cluster
  • NetApp FlexGroup volume
  • 256K wsize/rsize
  • 100GbE connections
  • 32 x 1GB files

In these tests, the following throughput results were seen. Latency for both were sub 1ms.

Test            Bandwidth
NFSv3           10.2 GB/s
NFSv4.1/pNFS    21.9 GB/s

Both NFSv3 and NFSv4.1 used nconnect=16.

In these tests, NFSv4.1 with pNFS doubled the performance for the sequential read workload at 250us latency. Since the files were 1GB in size, the reads were almost entirely from the controller RAM, but it’s not unreasonable to see that as the reality for a majority of workloads, as most systems have enough RAM to see similar results.

David Arnette and I discuss it a bit in this podcast:

Episode 283 – NetApp ONTAP AI Reference Architectures

Note: Benchmark tests such as SAS iotest will purposely recommend setting file sizes larger than the system RAM to avoid any caching benefits and instead will measure the network bandwidth of the transfer. In real world application scenarios, RAM, network, storage and CPU are all working together to create the best possible performance scenarios.

pNFS Best Practices with ONTAP

pNFS best practices in ONTAP don’t differ much from normal NAS best practices, but here are a few to keep in mind. In general:

  • Use the latest supported client OS version.
  • Use the latest supported ONTAP patch release.
  • Create a data LIF per node, per SVM to ensure data locality for all nodes.
  • Avoid using LIF migration on the metadata server data LIF, because NFSv4.1 is a stateful protocol and LIF migrations can cause brief outages as the NFS states are reestablished.
  • In environments with multiple NFSv4.1 clients mounting, balance the metadata server connections across multiple nodes to avoid piling up metadata operations on a single node or network interface.
  • If possible, avoid using multiple data LIFs on the same node in an SVM.
  • In general, avoid mounting NFSv3 and NFSv4.x on the same datasets. If you can’t avoid this, check with the application vendor to ensure that locking can be managed properly.
  • If you’re using NFS referrals with pNFS, keep in mind that referrals establish a local metadata server, but data I/O still redirects. With FlexGroup volumes, the member volumes might live on multiple nodes, so NFS referrals aren’t of much use. Instead, use DNS load balancing to spread out connections.

Drop any questions into the comments below!

Behind the Scenes: Episode 228 – FlexPod for Industry Verticals – Healthcare

Welcome to Episode 228, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we discuss FlexPod and the new initiative to create validated designs for industry verticals. First up – Healthcare and Epic software with NetApp Sr. Product Manager for Converged Infrastructure, Ketan Mota (ketan.mota@netapp.com) and NetApp Solutions Architect for Healthcare, Brian O’Mahony (omahony@netapp.com).

For links to the FlexPod technical reports:

FlexPod for Epic TRs

FlexPod for MEDITECH TRs

And general FlexPod information:

https://flexpod.com/

https://www.cisco.com/c/en/us/solutions/design-zone/data-center-design-guides/flexpod-design-guides.html

https://www.netapp.com/us/products/converged-systems/flexpod-converged-infrastructure.aspx

Podcast Transcriptions

We also are piloting a new transcription service, so if you want a written copy of the episode, check it out here (just set expectations accordingly):

Episode 228: FlexPod for Industry Verticals: Healthcare – Transcription

Just use the search field to look for words you want to read more about. (For example, search for “storage”)


Be sure to give us feedback on the transcription in the comments here or via podcast@netapp.com! If you have requests for other previous episode transcriptions, let me know!

Finding the Podcast

You can find this week’s episode here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

Behind the Scenes: Episode 227 – Pacific Biosciences, ONTAP and Unstructured NAS

Welcome to Episode 227, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, Adam Knight (@damknight) of Pacific Biosciences joins us to discuss how PacBio uses ONTAP for all of its unstructured NAS workload requirements, with a focus on FlexGroup volumes!


Also, check out these other podcast episodes:

Behind the Scenes: Episode 126 – Komprise

Behind the Scenes: Episode 209 – Designing an End-to-End Genomics Solution Using NetApp

And if you want to review the Insight presentation that Adam and I did together, check it out here (requires login):

NetApp Insight 2019 Presentations

Podcast Transcriptions

We also are piloting a new transcription service, so if you want a written copy of the episode, check it out here (just set expectations accordingly):

Episode 227: Pacific Biosciences, ONTAP and Unstructured NAS – Transcription

Just use the search field to look for words you want to read more about. (For example, search for “storage”)


Be sure to give us feedback on the transcription in the comments here or via podcast@netapp.com! If you have requests for other previous episode transcriptions, let me know!

Finding the Podcast

You can find this week’s episode here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

Updated FlexGroup Technical Reports now available for ONTAP 9.7!

ONTAP 9.7 is now available, so that means the TRs need to get a refresh.


There are some new features in ONTAP 9.7 for FlexGroup volumes, including:

The TRs cover those features, and there are some updates to other areas that might not have been as clear as they could have been. I also added some new use cases.

Also, check out the newest FlexGroup episode of the Tech ONTAP Podcast:

Behind the Scenes: Episode 219 – FlexVol to FlexGroup Conversion

TR Update List

Here’s the list of FlexGroup TRs that have been updated for ONTAP 9.7:

TR-4678: Data Protection and Backup – FlexGroup volumes

This covers backup and DR best practices/support for FlexGroup volumes.

TR-4557: FlexGroup Volume Technical Overview

This TR is a technical overview, which is intended just to give information on how FlexGroups work.

TR-4571-a is an abbreviated best practice guide for easy consumption.

TR-4571: FlexGroup Best Practice Guide

This is the best practices TR and also offers information on new features, including details on FlexVol to FlexGroup convert!

Most of these updates came from feedback and questions I received. If you have something you want to see added to the TRs, let me know!

Behind the Scenes: Episode 219 – FlexVol to FlexGroup Conversion

Welcome to Episode 219, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we invite the NetApp FlexGroup Technical Director, Dan Tennant, and FlexGroup developer Jessica Peters, to talk to us about the ins and outs of converting a FlexVol to a FlexGroup in-place, with no copy and no outage!

I also cover the process in detail in this blog post:

FlexGroup Conversion: Moving from FlexVols to FlexGroups the Easy Way

Expect official documentation on it in the coming weeks.

For more information or questions about FlexGroup volumes, email us at flexgroups-info@netapp.com!

Podcast Transcriptions

We also are piloting a new transcription service, so if you want a written copy of the episode, check it out here (just set expectations accordingly):

Episode 219: FlexVol to FlexGroup Conversion Transcription

Just use the search field to look for words you want to read more about. (For example, search for “storage”)


Be sure to give us feedback on the transcription in the comments here or via podcast@netapp.com! If you have requests for other previous episode transcriptions, let me know!

Finding the Podcast

You can find this week’s episode here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

Behind the Scenes: Episode 217 – ONTAP 9.7

Welcome to Episode 217, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


This week on the podcast, we talk about the latest release of ONTAP, as well as the new All-SAN array!

Featured in this week’s podcast:

  • NetApp SVP Octavian Tanase
  • NetApp Director Jeff Baxter
  • NetApp Product Marketing Manager Jon Jacob
  • NetApp Technical Product Marketing Manager Skip Shapiro
  • NetApp TMEs Dan Isaacs and Mike Peppers

Podcast Transcriptions

We also are piloting a new transcription service, so if you want a written copy of the episode, check it out here (just set expectations accordingly):

Episode 217: ONTAP 9.7 Transcription

Just use the search field to look for words you want to read more about. (For example, search for “storage”)


Be sure to give us feedback on the transcription in the comments here or via podcast@netapp.com! If you have requests for other previous episode transcriptions, let me know!

Finding the Podcast

You can find this week’s episode here:

Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.

http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Our YouTube channel (episodes uploaded sporadically) is here:

FlexGroup Conversion: Moving from FlexVols to FlexGroups the Easy Way

UPDATED: Added 500 million file test to this to show that file count doesn’t matter. 🙂

NetApp announced ONTAP 9.7 at Insight 2019 in Las Vegas, which included a number of new features. But mainly, ONTAP 9.7 focuses on making storage management in ONTAP simpler.


One of the new features that will help make things easier is the new FlexGroup conversion feature, which allows in-place conversion of a FlexVol to a FlexGroup volume without the need to do a file copy.

Best of all, this conversion takes a matter of seconds without needing to remount clients!

I know it sounds too good to be true, but what would you rather do: spend days copying terabytes of data over the network, or run a single command that converts the volume in place without touching the data?

As you can imagine, a lot of people are pretty stoked about being able to convert volumes without copying data, so I wanted to write up something to point people to as the questions inevitably start rolling in. This blog will cover how it works and what caveats there are. The blog will be a bit long, but I wanted to cover all the bases. Look for this information to be included in TR-4571 soon, as well as a new FlexGroup conversion podcast in the coming weeks.

Why would I want to convert a volume to a FlexGroup?

FlexGroup volumes offer a few advantages over FlexVol volume, such as:

  • Ability to expand beyond 100TB and 2 billion files in a single volume
  • Ability to scale out capacity or performance non-disruptively
  • Multi-threaded performance for high ingest workloads
  • Simplification of volume management and deployment

For example, perhaps you have a workload that is growing rapidly and you don’t want to have to migrate the data, but still want to provide more capacity. Or perhaps a workload’s performance just isn’t cutting it on a FlexVol, so you want to provide better performance handling with a FlexGroup. Converting can help here.

When would I not want to convert a FlexVol?

Converting a FlexVol to a FlexGroup might not always be the best option. If you have features you require in FlexVol that aren’t available in FlexGroup volumes, then you should hold off. For example, SVM-DR and cascading SnapMirrors aren’t supported in ONTAP 9.7, so if you need those, you should stay with FlexVols.

Also, if you have a FlexVol that’s already very large (80-100TB) and already very full (80-90%), then you might want to copy the data rather than convert, as the converted FlexGroup volume would then have a very large, very full member volume. That could create performance issues and doesn’t really resolve your capacity issues – particularly if the dataset contains files that grow over time.

For example, if you have a FlexVol that is 100TB in capacity and 90TB used, it would look like this:

FV-100t

If you were to convert this 90% full volume to a FlexGroup, then you’d have a 90% full member volume. Once you add new member volumes, they’d be 100TB each and 0% full, meaning they’d take on a majority of new workloads. The data would not rebalance and if the original files grew over time, you could still run out of space with nowhere to go (since 100TB is the maximum member volume size).

fg-400t.png

Things that would block a conversion

ONTAP will block conversion of a FlexVol for the following reasons:

  • The ONTAP version isn’t 9.7 on all nodes
  • ONTAP upgrade issues preventing conversion
  • A FlexVol volume was transitioned from 7-Mode using 7MTT
  • Something is enabled on the volume that isn’t supported with FlexGroups yet (SAN LUNs, Windows NFS, SMB1, part of a fan-out/cascade snapmirror, SVM-DR, Snapshot naming/autodelete, vmalign set, SnapLock, space SLO, logical space enforcement/reporting, etc.)
  • FlexClones are present (The volume being converted can’t be a parent nor a clone)
  • The volume is a FlexCache origin volume
  • Snapshots with snap-ids greater than 255
  • Storage efficiencies are enabled (can be re-enabled after)
  • The volume is a source of a snapmirror and the destination has not been converted yet
  • The volume is part of an active (not quiesced) snapmirror
  • Quotas enabled (must be disabled first, then re-enabled after)
  • Volume names longer than 197 characters
  • Running ONTAP processes (mirrors, jobs, wafliron, NDMP backup, inode conversion in process, etc)
  • SVM root volume
  • Volume is too full

You can check for upgrade issues with:

cluster::*> upgrade-revert show
cluster::*> system node image show-update-progress -node *

You can check for transitioned volumes with:

cluster::*> volume show -is-transitioned true
There are no entries matching your query.

You can check for snapshots with snap-ids >255 with:

cluster::*> volume snapshot show -vserver DEMO -volume testvol -logical-snap-id >255 -fields logical-snap-id
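
Some of the other blockers can be checked in a similar way – for example, whether the FlexVol is a parent of any FlexClones, or whether quotas are currently enabled on it. This is just a rough sketch using the example SVM/volume names from this post; check the command reference for your ONTAP release:

cluster::*> volume clone show -vserver DEMO -parent-volume flexvol
cluster::*> volume quota show -vserver DEMO -volume flexvol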

How it works

To convert a FlexVol volume to a FlexGroup volume in ONTAP 9.7, you run a single, simple command in advanced privilege:

cluster::*> volume conversion start ?
-vserver <vserver name> *Vserver Name
[-volume] <volume name> *Volume Name
[ -check-only [true] ] *Validate the Conversion Only
[ -foreground [true] ] *Foreground Process (default: true)

When you run this command, it will take a single FlexVol and convert it into a FlexGroup volume with one member. You can even run a validation of the conversion before you do the real thing!

The process is 1:1, so you can’t currently convert multiple FlexVols into a single FlexGroup. Once the conversion is done, you will have a single member FlexGroup volume, which you can then add more member volumes of the same size to increase capacity and performance.

convert.png

Other considerations/caveats

While the actual conversion process is simple, there are some considerations to think about before converting. Most of these will go away in future ONTAP releases as feature support is added, but it’s still prudent to call them out here.

Once the initial conversion is done, ONTAP will unmount the volume internally and remount it to get the new FlexGroup information into the appropriate places. Clients won’t have to remount or reconnect, but they will see a brief disruption that lasts less than a minute while this takes place. The data doesn’t change at all; file handles all stay the same.

  • FabricPool doesn’t need anything. It just works. No need to rehydrate data on-prem.
  • Snapshot copies will remain and be available for clients to access data from, but you won’t be able to restore the volume from them via SnapRestore commands. Those snapshots get marked as “pre-conversion.”
  • SnapMirror relationships will pick up where they left off without needing a rebaseline, provided both the source and destination volumes have been converted (the destination must be converted first). Volume-level SnapMirror restores aren’t possible; clients can still retrieve individual files.
  • FlexClones will need to be deleted or split from the volume to be converted.
  • Storage efficiencies will need to be disabled during the conversion, but your space savings will be preserved after the convert.
  • FlexCache instances with an origin volume being converted will need to be deleted.
  • Space guarantees can impact how large a FlexGroup volume can get if they’re set to volume guarantee. New member volumes will need to be the same size as the existing members, so you’d need adequate space to honor those.
  • Quotas are supported in FlexGroup volumes, but are handled a bit differently than in FlexVol volumes. So, while the convert is being done, quotas have to be disabled (quota off) and then re-enabled afterward (quota on); a minimal command sequence is sketched just after this list.
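
For the "disable before, re-enable after" items (storage efficiency and quotas), the flow looks roughly like this. This is only a sketch using the example SVM and volume names from this post:

cluster::*> volume efficiency off -vserver DEMO -volume flexvol
cluster::*> volume quota off -vserver DEMO -volume flexvol
cluster::*> volume conversion start -vserver DEMO -volume flexvol
cluster::*> volume efficiency on -vserver DEMO -volume flexvol
cluster::*> volume quota on -vserver DEMO -volume flexvol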

Also, conversion to a FlexGroup volume is a one-way street once you expand it, so be sure you’re ready to make the jump. If anything goes wrong during the conversion process, there is a “rescue” method that support can help you use to get out of the pickle, so your data will be safe.

When you expand the FlexGroup to add new member volumes, they will be the same size as the converted member volume, so be sure there is adequate space available. Additionally, the existing data that resides in the original volume will remain in that member volume. Data does not re-distribute. Instead, the FlexGroup will favor newly added member volumes for new files.

Nervous about convert?

Well, ONTAP has features for that.

If you don’t feel comfortable about converting your production FlexVol to a FlexGroup right away, you have options.

First of all, remember that we have the ability to run a check on the convert command with -check-only true. That tells us what prerequisites we might be missing.

cluster::*> volume conversion start -vserver DEMO -volume flexvol -foreground true -check-only true

Error: command failed: Cannot convert volume "flexvol" in Vserver "DEMO" to a FlexGroup. Correct the following issues and retry the command:
* The volume has Snapshot copies with IDs greater than 255. Use the (privilege: advanced) "volume snapshot show -vserver DEMO -volume flexvol -logical-snap-id >255 -fields logical-snap-id" command to list the Snapshot copies
with IDs greater than 255 then delete them using the "snapshot delete -vserver DEMO -volume flexvol" command.
* Quotas are enabled. Use the 'volume quota off -vserver DEMO -volume flexvol' command to disable quotas.
* Cannot convert because the source "flexvol" of a SnapMirror relationship is source to more than one SnapMirror relationship. Delete other Snapmirror relationships, and then try the conversion of the source "flexvol" volume.
* Only volumes with logical space reporting disabled can be converted. Use the 'volume modify -vserver DEMO -volume flexvol -is-space-reporting-logical false' command to disable logical space reporting.

Also, remember, ONTAP has the ability to create multiple storage virtual machines, which can be fenced off from network access. This can be used to test things, such as volume conversion. The only trick is getting a copy of that data over… but it’s really not that tricky.
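
If you want a dedicated sandbox SVM for this, creating one is a one-liner. This is a hypothetical example (the SVM name "DEMO-TEST", the root volume name and the aggregate are all made up; adjust for your environment):

cluster::*> vserver create -vserver DEMO-TEST -rootvolume demo_test_root -aggregate aggr1_node2 -rootvolume-security-style unix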

Option 1: SnapMirror

You can create a SnapMirror of your “to be converted” volume to the same SVM or a new SVM. Then, break the mirror and delete the relationship. Now you have a sandbox copy of your volume, complete with snapshots, to test out conversion, expansion and performance.
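
As a rough command sketch (the destination volume name and the "DEMO-TEST" sandbox SVM are hypothetical, the SVMs are assumed to be peered, and the DP volume should be sized to match the source):

cluster::*> volume create -vserver DEMO-TEST -volume flexvol_sandbox -aggregate aggr1_node2 -size 10TB -type DP
cluster::*> snapmirror create -source-path DEMO:flexvol -destination-path DEMO-TEST:flexvol_sandbox -type XDP -policy MirrorAllSnapshots
cluster::*> snapmirror initialize -destination-path DEMO-TEST:flexvol_sandbox
cluster::*> snapmirror break -destination-path DEMO-TEST:flexvol_sandbox
cluster::*> snapmirror delete -destination-path DEMO-TEST:flexvol_sandbox
cluster::*> volume conversion start -vserver DEMO-TEST -volume flexvol_sandbox -check-only true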

Option 2: FlexClone/volume rehost

If you don’t have SnapMirror or want to try a method that is less taxing on your network, you can use a combination of FlexClone (instant copy of your volume backed by a snapshot) and volume rehost (instant move of the volume from one SVM to another). Keep in mind that FlexClones themselves can’t be rehosted, but you can split the clone and then rehost.

Essentially, the process is (a command sketch follows these steps):

  1. FlexClone create
  2. FlexClone split
  3. Volume rehost to new SVM (or convert on the existing SVM)
  4. Profit!
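
As a rough sketch, again with hypothetical names ("flexvol_clone" and the "DEMO-TEST" sandbox SVM):

cluster::*> volume clone create -vserver DEMO -flexclone flexvol_clone -parent-volume flexvol
cluster::*> volume clone split start -vserver DEMO -flexclone flexvol_clone -foreground true
cluster::*> volume rehost -vserver DEMO -volume flexvol_clone -destination-vserver DEMO-TEST
cluster::*> volume conversion start -vserver DEMO-TEST -volume flexvol_clone -check-only true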

Sample conversion

Before I converted a volume, I added around 300,000 files to help determine how long the process might take with a lot of files present.

cluster::*> df -i lotsafiles
Filesystem iused ifree %iused Mounted on Vserver
/vol/lotsafiles/ 330197 20920929 1% /lotsafiles DEMO

cluster::*> volume show lots*
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
DEMO      lotsafiles   aggr1_node1  online     RW         10TB     7.33TB    0%

First, let’s try out the validation:

cluster::*> volume conversion start -vserver DEMO -volume lotsafiles -foreground true -check-only true

Error: command failed: Cannot convert volume "lotsafiles" in Vserver "DEMO" to a FlexGroup. Correct the following issues and retry the command:
* SMB1 is enabled on Vserver "DEMO". Use the 'vserver cifs options modify -smb1-enabled false -vserver DEMO' command to disable SMB1.
* The volume contains LUNs. Use the "lun delete -vserver DEMO -volume lotsafiles -lun *" command to remove the LUNs, or use the "lun move start" command to relocate the LUNs to other
FlexVols.
* NFSv3 MS-DOS client support is enabled on Vserver "DEMO". Use the "vserver nfs modify -vserver DEMO -v3-ms-dos-client disabled" command to disable NFSv3 MS-DOS client support on the
Vserver. Note that disabling this support will disable access for all NFSv3 MS-DOS clients connected to Vserver "DEMO".

As you can see, we have some blockers, such as SMB1 and the LUN I created (to intentionally break conversion). So I clear them using the recommended commands, run the check again, and see some of our caveats:

cluster::*> volume conversion start -vserver DEMO -volume lotsafiles -foreground true -check-only true
Conversion of volume "lotsafiles" in Vserver "DEMO" to a FlexGroup can proceed with the following warnings:
* After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.
* Converting flexible volume "lotsafiles" in Vserver "DEMO" to a FlexGroup will cause the state of all Snapshot copies from the volume to be set to "pre-conversion". Pre-conversion Snapshot copies cannot be restored.

Now, let’s convert. But, first, I’ll start a script that takes a while to complete, while also monitoring performance during the conversion using Active IQ Performance Manager.

The conversion of the volume took less than 1 minute, and the only disruption I saw was a slight drop in IOPS:

cluster::*> volume conversion start -vserver DEMO -volume lotsafiles -foreground true

Warning: After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.
Do you want to continue? {y|n}: y
Warning: Converting flexible volume "lotsafiles" in Vserver "DEMO" to a FlexGroup will cause the state of all Snapshot copies from the volume to be set to "pre-conversion". Pre-conversion Snapshot
         copies cannot be restored.
Do you want to continue? {y|n}: y
[Job 23671] Job succeeded: success
cluster::*> statistics show-periodic
cpu cpu total fcache total total data data data cluster cluster cluster disk disk pkts pkts
avg busy ops nfs-ops cifs-ops ops spin-ops recv sent busy recv sent busy recv sent read write recv sent
---- ---- -------- -------- -------- -------- -------- -------- -------- ---- -------- -------- ------- -------- -------- -------- -------- -------- --------
34% 44% 14978 14968 10 0 14978 14.7MB 15.4MB 0% 3.21MB 3.84MB 0% 11.5MB 11.6MB 4.43MB 1.50MB 49208 55026
40% 45% 14929 14929 0 0 14929 15.2MB 15.7MB 0% 3.21MB 3.84MB 0% 12.0MB 11.9MB 3.93MB 641KB 49983 55712
36% 44% 15020 15020 0 0 15019 14.8MB 15.4MB 0% 3.24MB 3.87MB 0% 11.5MB 11.5MB 3.91MB 23.9KB 49838 55806
30% 39% 15704 15694 10 0 15704 15.0MB 15.7MB 0% 3.29MB 3.95MB 0% 11.8MB 11.8MB 2.12MB 4.99MB 50936 57112
32% 43% 14352 14352 0 0 14352 14.7MB 15.3MB 0% 3.33MB 3.97MB 0% 11.3MB 11.3MB 4.19MB 27.3MB 49736 55707
37% 44% 14807 14797 10 0 14807 14.5MB 15.0MB 0% 3.09MB 3.68MB 0% 11.4MB 11.4MB 4.34MB 2.79MB 48352 53616
39% 43% 15075 15075 0 0 15076 14.9MB 15.6MB 0% 3.24MB 3.86MB 0% 11.7MB 11.7MB 3.48MB 696KB 50124 55971
32% 42% 14998 14998 0 0 14997 15.1MB 15.8MB 0% 3.23MB 3.87MB 0% 11.9MB 11.9MB 3.68MB 815KB 49606 55692
38% 43% 15038 15025 13 0 15036 14.7MB 15.2MB 0% 3.27MB 3.92MB 0% 11.4MB 11.3MB 3.46MB 15.8KB 50256 56150
43% 44% 15132 15132 0 0 15133 15.0MB 15.7MB 0% 3.22MB 3.87MB 0% 11.8MB 11.8MB 1.93MB 15.9KB 50030 55938
34% 42% 15828 15817 10 0 15827 15.8MB 16.5MB 0% 3.39MB 4.10MB 0% 12.4MB 12.3MB 4.02MB 21.6MB 52142 58771
28% 39% 11807 11807 0 0 11807 12.3MB 13.1MB 0% 2.55MB 3.07MB 0% 9.80MB 9.99MB 6.76MB 27.9MB 38752 43748
33% 42% 15108 15108 0 0 15107 15.1MB 15.5MB 0% 3.32MB 3.91MB 0% 11.7MB 11.6MB 3.50MB 1.17MB 50903 56143
32% 42% 16143 16133 10 0 16143 15.1MB 15.8MB 0% 3.28MB 3.95MB 0% 11.8MB 11.8MB 3.78MB 9.00MB 50922 57403
24% 34% 8843 8843 0 0 8861 14.2MB 14.9MB 0% 3.70MB 4.44MB 0% 10.5MB 10.5MB 8.46MB 10.7MB 46174 53157
27% 37% 10949 10949 0 0 11177 9.91MB 10.2MB 0% 2.45MB 2.84MB 0% 7.46MB 7.40MB 5.55MB 1.67MB 31764 35032
28% 38% 12580 12567 13 0 12579 13.3MB 13.8MB 0% 2.76MB 3.26MB 0% 10.5MB 10.6MB 3.92MB 19.9KB 44119 48488
30% 40% 14300 14300 0 0 14298 14.2MB 14.7MB 0% 3.09MB 3.68MB 0% 11.1MB 11.1MB 2.66MB 600KB 47282 52789
31% 41% 14514 14503 10 0 14514 14.3MB 14.9MB 0% 3.15MB 3.75MB 0% 11.2MB 11.2MB 3.65MB 728KB 48093 53532
31% 42% 14626 14626 0 0 14626 14.3MB 14.9MB 0% 3.16MB 3.77MB 0% 11.1MB 11.1MB 4.84MB 1.14MB 47936 53645
ontap9-tme-8040: cluster.cluster: 11/13/2019 22:44:39
cpu cpu total fcache total total data data data cluster cluster cluster disk disk pkts pkts
avg busy ops nfs-ops cifs-ops ops spin-ops recv sent busy recv sent busy recv sent read write recv sent
---- ---- -------- -------- -------- -------- -------- -------- -------- ---- -------- -------- ------- -------- -------- -------- -------- -------- --------
30% 39% 15356 15349 7 0 15370 15.3MB 15.8MB 0% 3.29MB 3.94MB 0% 12.0MB 11.8MB 3.18MB 6.90MB 50493 56425
32% 42% 14156 14146 10 0 14156 14.6MB 15.3MB 0% 3.09MB 3.68MB 0% 11.5MB 11.7MB 5.49MB 16.3MB 48159 53678

This is what the performance looked like from AIQ:

convert-perf.png

And now, we have a single member FlexGroup volume:

cluster::*> volume show lots*
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
DEMO lotsafiles - online RW 10TB 7.33TB 0%
DEMO lotsafiles__0001
aggr1_node1 online RW 10TB 7.33TB 0%
2 entries were displayed.

And our snapshots are still there, but are marked as “pre-conversion”:

cluster::> set diag
cluster::*> snapshot show -vserver DEMO -volume lotsafiles -fields is-convert-recovery,state
vserver volume snapshot state is-convert-recovery
------- ---------- -------- -------------- -------------------
DEMO lotsafiles base pre-conversion false
DEMO lotsafiles hourly.2019-11-13_1705
pre-conversion false
DEMO lotsafiles hourly.2019-11-13_1805
pre-conversion false
DEMO lotsafiles hourly.2019-11-13_1905
pre-conversion false
DEMO lotsafiles hourly.2019-11-13_2005
pre-conversion false
DEMO lotsafiles hourly.2019-11-13_2105
pre-conversion false
DEMO lotsafiles hourly.2019-11-13_2205
pre-conversion false
DEMO lotsafiles clone_clone.2019-11-13_223144.0
pre-conversion false
DEMO lotsafiles convert.2019-11-13_224411
pre-conversion true
9 entries were displayed.

Snap restores will fail:

cluster::*> snapshot restore -vserver DEMO -volume lotsafiles -snapshot convert.2019-11-13_224411

Error: command failed: Promoting a pre-conversion Snapshot copy is not supported.

But we can still grab files from the client:

[root@centos7 scripts]# cd /lotsafiles/.snapshot/convert.2019-11-13_224411/pre-convert/
[root@centos7 pre-convert]# ls
topdir_0 topdir_14 topdir_2 topdir_25 topdir_30 topdir_36 topdir_41 topdir_47 topdir_52 topdir_58 topdir_63 topdir_69 topdir_74 topdir_8 topdir_85 topdir_90 topdir_96
topdir_1 topdir_15 topdir_20 topdir_26 topdir_31 topdir_37 topdir_42 topdir_48 topdir_53 topdir_59 topdir_64 topdir_7 topdir_75 topdir_80 topdir_86 topdir_91 topdir_97
topdir_10 topdir_16 topdir_21 topdir_27 topdir_32 topdir_38 topdir_43 topdir_49 topdir_54 topdir_6 topdir_65 topdir_70 topdir_76 topdir_81 topdir_87 topdir_92 topdir_98
topdir_11 topdir_17 topdir_22 topdir_28 topdir_33 topdir_39 topdir_44 topdir_5 topdir_55 topdir_60 topdir_66 topdir_71 topdir_77 topdir_82 topdir_88 topdir_93 topdir_99
topdir_12 topdir_18 topdir_23 topdir_29 topdir_34 topdir_4 topdir_45 topdir_50 topdir_56 topdir_61 topdir_67 topdir_72 topdir_78 topdir_83 topdir_89 topdir_94
topdir_13 topdir_19 topdir_24 topdir_3 topdir_35 topdir_40 topdir_46 topdir_51 topdir_57 topdir_62 topdir_68 topdir_73 topdir_79 topdir_84 topdir_9 topdir_95

Now, I can add more member volumes using “volume expand”:

cluster::*> volume expand -vserver DEMO -volume lotsafiles -aggr-list aggr1_node1,aggr1_node2 -aggr-list-multiplier 2

Warning: The following number of constituents of size 10TB will be added to FlexGroup "lotsafiles": 4. Expanding the FlexGroup will cause the state of all Snapshot copies to be set to "partial".
Partial Snapshot copies cannot be restored.
Do you want to continue? {y|n}: y

Warning: FlexGroup "lotsafiles" is a converted flexible volume. If this volume is expanded, it will no longer be able to be converted back to being a flexible volume.
Do you want to continue? {y|n}: y
[Job 23676] Job succeeded: Successful

But remember, the data doesn’t redistribute. The original member volume will keep the files in place:

cluster::*> df -i lots*
Filesystem iused ifree %iused Mounted on Vserver
/vol/lotsafiles/ 3630682 102624948 3% /lotsafiles DEMO
/vol/lotsafiles__0001/ 3630298 17620828 17% /lotsafiles DEMO
/vol/lotsafiles__0002/ 96 21251030 0% --- DEMO
/vol/lotsafiles__0003/ 96 21251030 0% --- DEMO
/vol/lotsafiles__0004/ 96 21251030 0% --- DEMO
/vol/lotsafiles__0005/ 96 21251030 0% --- DEMO
6 entries were displayed.

cluster::*> df -h lots*
Filesystem total used avail capacity Mounted on Vserver
/vol/lotsafiles/ 47TB 2735MB 14TB 0% /lotsafiles DEMO
/vol/lotsafiles/.snapshot
2560GB 49MB 2559GB 0% /lotsafiles/.snapshot DEMO
/vol/lotsafiles__0001/ 9728GB 2505MB 7505GB 0% /lotsafiles DEMO
/vol/lotsafiles__0001/.snapshot
512GB 49MB 511GB 0% /lotsafiles/.snapshot DEMO
/vol/lotsafiles__0002/ 9728GB 57MB 7505GB 0% --- DEMO
/vol/lotsafiles__0002/.snapshot
512GB 0B 512GB 0% --- DEMO
/vol/lotsafiles__0003/ 9728GB 57MB 7766GB 0% --- DEMO
/vol/lotsafiles__0003/.snapshot
512GB 0B 512GB 0% --- DEMO
/vol/lotsafiles__0004/ 9728GB 57MB 7505GB 0% --- DEMO
/vol/lotsafiles__0004/.snapshot
512GB 0B 512GB 0% --- DEMO
/vol/lotsafiles__0005/ 9728GB 57MB 7766GB 0% --- DEMO
/vol/lotsafiles__0005/.snapshot
512GB 0B 512GB 0% --- DEMO
12 entries were displayed.

Converting a FlexVol in a SnapMirror relationship

Now, let’s take a look at a volume that is in a SnapMirror.

cluster::*> snapmirror show -destination-path data_dst -fields state
source-path destination-path state
----------- ---------------- ------------
DEMO:data   DEMO:data_dst    Snapmirrored

If I try to convert the source, I get an error:

cluster::*> vol conversion start -vserver DEMO -volume data -check-only true

Error: command failed: Cannot convert volume "data" in Vserver "DEMO" to a FlexGroup. Correct the following issues and retry the command:
       * Cannot convert source volume "data" because destination volume "data_dst" of the SnapMirror relationship with "data" as the source is not converted.  First check if the source can be converted to a FlexGroup volume using "vol
       conversion start -volume data -convert-to flexgroup -check-only true". If the conversion of the source can proceed then first convert the destination and then convert the source.

So, I’d need to convert the destination first. To do that, I need to quiesce the snapmirror:

cluster::*> vol conversion start -vserver DEMO -volume data_dst -check-only true

Error: command failed: Cannot convert volume "data_dst" in Vserver "DEMO" to a FlexGroup. Correct the following issues and retry the command:
* The relationship was not quiesced. Quiesce SnapMirror relationship using "snapmirror quiesce -destination-path data_dst" and then try the conversion.

Here we go…

cluster::*> snapmirror quiesce -destination-path DEMO:data_dst
Operation succeeded: snapmirror quiesce for destination "DEMO:data_dst".

cluster::*> vol conversion start -vserver DEMO -volume data_dst -check-only true
Conversion of volume "data_dst" in Vserver "DEMO" to a FlexGroup can proceed with the following warnings:
* After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.
* Converting flexible volume "data_dst" in Vserver "DEMO" to a FlexGroup will cause the state of all Snapshot copies from the volume to be set to "pre-conversion". Pre-conversion Snapshot copies cannot be restored.

When I convert the volume, it lets me know my next steps:

cluster::*> vol conversion start -vserver DEMO -volume data_dst

Warning: After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.
Do you want to continue? {y|n}: y
Warning: Converting flexible volume "data_dst" in Vserver "DEMO" to a FlexGroup will cause the state of all Snapshot copies from the volume to be set to "pre-conversion". Pre-conversion Snapshot copies cannot be restored.
Do you want to continue? {y|n}: y
[Job 23710] Job succeeded: SnapMirror destination volume "data_dst" has been successfully converted to a FlexGroup volume. You must now convert the relationship's source volume, "DEMO:data", to a FlexGroup. Then, re-establish the SnapMirror relationship using the "snapmirror resync" command.

Now I convert the source volume…

cluster::*> vol conversion start -vserver DEMO -volume data

Warning: After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.
Do you want to continue? {y|n}: y
Warning: Converting flexible volume "data" in Vserver "DEMO" to a FlexGroup will cause the state of all Snapshot copies from the volume to be set to "pre-conversion". Pre-conversion Snapshot copies cannot be restored.
Do you want to continue? {y|n}: y
[Job 23712] Job succeeded: success

And resync the mirror:

cluster::*> snapmirror resync -destination-path DEMO:data_dst
Operation is queued: snapmirror resync to destination "DEMO:data_dst".

cluster::*> snapmirror show -destination-path DEMO:data_dst -fields state
source-path destination-path state
----------- ---------------- ------------
DEMO:data DEMO:data_dst Snapmirrored

While that’s fine and all, the most important part of a snapmirror is the restore. So let’s see if I can access files from the destination volume’s snapshot.

First, I mount the source and destination and compare ls output:

# mount -o nfsvers=3 DEMO:/data_dst /dst
# mount -o nfsvers=3 DEMO:/data /data
# ls -lah /data
total 14G
drwxrwxrwx 6 root root 4.0K Nov 14 11:57 .
dr-xr-xr-x. 54 root root 4.0K Nov 15 10:08 ..
drwxrwxrwx 2 root root 4.0K Sep 14 2018 cifslink
drwxr-xr-x 12 root root 4.0K Nov 16 2018 nas
-rwxrwxrwx 1 prof1 ProfGroup 0 Oct 3 14:32 newfile
drwxrwxrwx 5 root root 4.0K Nov 15 10:06 .snapshot
lrwxrwxrwx 1 root root 23 Sep 14 2018 symlink -> /shared/unix/linkedfile
drwxrwxrwx 2 root bin 4.0K Jan 31 2019 test
drwxrwxrwx 3 root root 4.0K Sep 14 2018 unix
-rwxrwxrwx 1 newuser1 ProfGroup 0 Jan 14 2019 userfile
-rwxrwxrwx 1 root root 6.7G Nov 14 11:58 Windows2.iso
-rwxrwxrwx 1 root root 6.7G Nov 14 11:37 Windows.iso
# ls -lah /dst
total 14G
drwxrwxrwx 6 root root 4.0K Nov 14 11:57 .
dr-xr-xr-x. 54 root root 4.0K Nov 15 10:08 ..
drwxrwxrwx 2 root root 4.0K Sep 14 2018 cifslink
dr-xr-xr-x 2 root root 0 Nov 15 2018 nas
-rwxrwxrwx 1 prof1 ProfGroup 0 Oct 3 14:32 newfile
drwxrwxrwx 4 root root 4.0K Nov 15 10:05 .snapshot
lrwxrwxrwx 1 root root 23 Sep 14 2018 symlink -> /shared/unix/linkedfile
drwxrwxrwx 2 root bin 4.0K Jan 31 2019 test
drwxrwxrwx 3 root root 4.0K Sep 14 2018 unix
-rwxrwxrwx 1 newuser1 ProfGroup 0 Jan 14 2019 userfile
-rwxrwxrwx 1 root root 6.7G Nov 14 11:58 Windows2.iso
-rwxrwxrwx 1 root root 6.7G Nov 14 11:37 Windows.iso

And if I ls to the snapshot in the destination volume…

# ls -lah /dst/.snapshot/snapmirror.7e3cc08e-d9b3-11e6-85e2-00a0986b1210_2163227795.2019-11-15_100555/
total 14G
drwxrwxrwx 6 root root 4.0K Nov 14 11:57 .
drwxrwxrwx 4 root root 4.0K Nov 15 10:05 ..
drwxrwxrwx 2 root root 4.0K Sep 14 2018 cifslink
dr-xr-xr-x 2 root root 0 Nov 15 2018 nas
-rwxrwxrwx 1 prof1 ProfGroup 0 Oct 3 14:32 newfile
lrwxrwxrwx 1 root root 23 Sep 14 2018 symlink -> /shared/unix/linkedfile
drwxrwxrwx 2 root bin 4.0K Jan 31 2019 test
drwxrwxrwx 3 root root 4.0K Sep 14 2018 unix
-rwxrwxrwx 1 newuser1 ProfGroup 0 Jan 14 2019 userfile
-rwxrwxrwx 1 root root 6.7G Nov 14 11:58 Windows2.iso
-rwxrwxrwx 1 root root 6.7G Nov 14 11:37 Windows.iso

Everything is there!

Now, I expand the FlexGroup source to give us more capacity:

cluster::*> volume expand -vserver DEMO -volume data -aggr-list aggr1_node1,aggr1_node2 -aggr-list-multiplier 2

Warning: The following number of constituents of size 30TB will be added to FlexGroup "data": 4. Expanding the FlexGroup will cause the state of all Snapshot copies to be set to "partial". Partial Snapshot copies cannot be restored.
Do you want to continue? {y|n}: y
[Job 23720] Job succeeded: Successful

If you notice, my source volume now has 5 member volumes. My destination volume… only has one:

cluster::*> vol show -vserver DEMO -volume data*
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
DEMO data - online RW 150TB 14.89TB 0%
DEMO data__0001 aggr1_node2 online RW 30TB 7.57TB 0%
DEMO data__0002 aggr1_node1 online RW 30TB 7.32TB 0%
DEMO data__0003 aggr1_node2 online RW 30TB 7.57TB 0%
DEMO data__0004 aggr1_node1 online RW 30TB 7.32TB 0%
DEMO data__0005 aggr1_node2 online RW 30TB 7.57TB 0%
DEMO data_dst - online DP 30TB 7.32TB 0%
DEMO data_dst__0001
aggr1_node1 online DP 30TB 7.32TB 0%
8 entries were displayed.

No worries! Just update the mirror and ONTAP will fix it for you.

cluster::*> snapmirror update -destination-path DEMO:data_dst
Operation is queued: snapmirror update of destination "DEMO:data_dst".

The update will initially fail with the following:

Last Transfer Error: A SnapMirror transfer for the relationship with destination FlexGroup "DEMO:data_dst" was aborted because the source FlexGroup was expanded. A SnapMirror AutoExpand job with id "23727" was created to expand the destination FlexGroup and to trigger a SnapMirror transfer for the SnapMirror relationship. After the SnapMirror transfer is successful, the "healthy" field of the SnapMirror relationship will be set to "true". The job can be monitored using either the "job show -id 23727" or "job history show -id 23727" commands.

The AutoExpand job expands the destination volume and triggers another transfer; once it completes, the relationship is healthy again:

cluster::*> job show -id 23727
Owning
Job ID Name Vserver Node State
------ -------------------- ---------- -------------- ----------
23727 Snapmirror Expand cluster
node1
Success
Description: SnapMirror FG Expand data_dst


cluster::*> snapmirror show -destination-path DEMO:data_dst -fields state
source-path destination-path state
----------- ---------------- ------------
DEMO:data DEMO:data_dst Snapmirrored

Now both FlexGroup volumes have the same number of members:

cluster::*> vol show -vserver DEMO -volume data*
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
DEMO data - online RW 150TB 14.88TB 0%
DEMO data__0001 aggr1_node2 online RW 30TB 7.57TB 0%
DEMO data__0002 aggr1_node1 online RW 30TB 7.32TB 0%
DEMO data__0003 aggr1_node2 online RW 30TB 7.57TB 0%
DEMO data__0004 aggr1_node1 online RW 30TB 7.32TB 0%
DEMO data__0005 aggr1_node2 online RW 30TB 7.57TB 0%
DEMO data_dst - online DP 150TB 14.88TB 0%
DEMO data_dst__0001
aggr1_node1 online DP 30TB 7.32TB 0%
DEMO data_dst__0002
aggr1_node1 online DP 30TB 7.32TB 0%
DEMO data_dst__0003
aggr1_node2 online DP 30TB 7.57TB 0%
DEMO data_dst__0004
aggr1_node1 online DP 30TB 7.32TB 0%
DEMO data_dst__0005
aggr1_node2 online DP 30TB 7.57TB 0%
12 entries were displayed.

So, there you have it… a quick and easy way to move from FlexVol volumes to FlexGroups!

Addendum: Does a High File Count Impact the Convert Process?

Short answer: No!

In a comment a few weeks ago, someone pointed out, rightly, that my 300k file volume wasn’t a *true* high file count. So I set out to create 500 million files and convert that volume. The hardest part was creating 500 million files, but I finally got there:

cluster::*> vol show -vserver DEMO -volume fvconvert -fields files,files-used,is-flexgroup
vserver volume    files      files-used is-flexgroup
------- --------- ---------- ---------- ------------
DEMO    fvconvert 2040109451 502631608  false

Since it took me so long to create that many files, I went ahead and created a FlexClone volume of it and split it, so I could keep the origin volume intact, because who doesn’t need 500 million files lying around?

Fun fact: That process *did* take a while – about 30 minutes:

cluster::*> vol clone split start -vserver DEMO -flexclone fvconvert -foreground true

Warning: Are you sure you want to split clone volume fvconvert in Vserver DEMO ? {y|n}: y
[Job 24230] 0% inodes processed.

cluster::*> job history show -id 24230 -fields starttime,endtime
node               record  vserver         endtime        starttime
------------------ ------- --------------- -------------- --------------
node1              2832338 cluster         12/09 10:27:08 12/09 09:58:16

After the clone split, I ran the check to see if we were good to go. I did have to run a “volume clone sharing-by-split undo” to get rid of shared FlexClone blocks, which took a while, but after that:

cluster::*> volume conversion start -vserver DEMO -volume fvconvert -foreground true -check-only true
Conversion of volume "fvconvert" in Vserver "DEMO" to a FlexGroup can proceed with the following warnings:
* After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.

I went ahead and ran the same script I was running earlier to generate load and watched the statistics on the cluster to see if we hit any outage. Again, the convert took seconds (with 500 million files) and there was just a small blip.

cluster::*> volume conversion start -vserver DEMO -volume fvconvert -foreground true

Warning: After the volume is converted to a FlexGroup, it will not be possible to change it back to a flexible volume.
Do you want to continue? {y|n}: y
[Job 24259] Job succeeded: success

convert-blip.png

Then, as the job was running I added new member volumes to the FlexGroup volume – again, no disruption.

cluster::*> volume expand -vserver DEMO -volume fvconvert -aggr-list aggr1_node1 -aggr-list-multiplier 3 -foreground true

Info: Unable to get information for Snapshot copies of volume "fvconvert" on Vserver "DEMO". Reason: No such snapshot.

Warning: The following number of constituents of size 40TB will be added to FlexGroup "fvconvert": 3.
Do you want to continue? {y|n}: y
[Job 24261] Job succeeded: Successful

Then 4 more member volumes:

cluster::*> volume expand -vserver DEMO -volume fvconvert -aggr-list aggr1_node2 -aggr-list-multiplier 4

Info: Unable to get information for Snapshot copies of volume "fvconvert" on Vserver "DEMO". Reason: No such snapshot.

Warning: The following number of constituents of size 40TB will be added to FlexGroup "fvconvert": 4.
Do you want to continue? {y|n}: y
[Job 24264] Job succeeded: Successful

Plus, I started to see more IOPS for the workload, and the job itself took much less time overall than when I ran it on a FlexVol.

convert-member-add.png

For a video of the capture, check it out here:

This was the job on the FlexVol:

# python file-create.py /fvconvert/files
Starting overall work: 2019-12-09 10:32:21.966337
End overall work: 2019-12-09 12:11:15.990707
total time: 5934.024611

This is how long it took on the FlexVol converted to a FlexGroup (with added member volumes):

# python file-create.py /fvconvert/files2
Starting overall work: 2019-12-10 11:02:28.621532
End overall work: 2019-12-10 12:22:48.523772
total time: 4819.95753193

This was the file distribution:

cluster::*> volume show -vserver DEMO -volume fvconvert* -fields files,files-used
vserver volume files files-used
------- --------- ---------- ----------
DEMO fvconvert 8160437804 502886230
DEMO fvconvert__0001
2040109451 502848737
DEMO fvconvert__0002
2040109451 12747
DEMO fvconvert__0003
2040109451 12749
DEMO fvconvert__0004
2040109451 12751

At the end of the job:

cluster::*> volume show -vserver DEMO -volume fvconvert* -fields files,files-used
vserver volume files files-used
------- --------- ----------- ----------
DEMO fvconvert 16320875608 530132794
DEMO fvconvert__0001
2040109451 506770209
DEMO fvconvert__0002
2040109451 3345330
DEMO fvconvert__0003
2040109451 3345330
DEMO fvconvert__0004
2040109451 3345319
DEMO fvconvert__0005
2040109451 3331657
DEMO fvconvert__0006
2040109451 3331635
DEMO fvconvert__0007
2040109451 3331657
DEMO fvconvert__0008
2040109451 3331657

And, for fun, I kicked it off again on the new FlexGroup. This time, I wanted to see how much faster the job ran, as well as how the files distributed on the more empty FlexVol members.

Remember, we started out with the newer member volumes all at less than 1% of files used (3.3 million of 2 billion possible files). The member volume that was converted from a FlexVol was using 25% of the total files (500 million of 2 billion).

After the job ran, we saw a ~3.2 million file count delta on the original member volume and a ~3.5 million file count delta on each of the other members, which means we’re still balancing across all member volumes, but favoring the less full ones.

cluster::*> volume show -vserver DEMO -volume fvconvert* -fields files,files-used
vserver volume files files-used
------- --------- ----------- ----------
DEMO fvconvert 16320875608 557633288
DEMO fvconvert__0001
2040109451 509958440
DEMO fvconvert__0002
2040109451 6808792
DEMO fvconvert__0003
2040109451 6809225
DEMO fvconvert__0004
2040109451 6806843
DEMO fvconvert__0005
2040109451 6798959
DEMO fvconvert__0006
2040109451 6800054
DEMO fvconvert__0007
2040109451 6849375
DEMO fvconvert__0008
2040109451 6801600

With the new FlexGroup, converted from a FlexVol, our job time dropped from 5,900 seconds to 4,656 seconds, and we were able to push 2x the IOPS:

# python file-create.py /fvconvert/files3
Starting overall work: 2019-12-10 13:14:26.816860
End overall work: 2019-12-10 14:32:03.565705
total time: 4656.76723099

convert-newjob.png

As you can see, there’s an imbalance of files and data in these member volumes (way more in the original FlexVol), but performance still blows away the previous FlexVol performance because we are doing more efficient work across multiple nodes.

Not too shabby!