One of Clustered Data ONTAP’s Best Features That No One Knows About

Some questions I’ve gotten a few times go like this:

OMG, I deleted my volume. How do I get it back?

Or:

I deleted my volume and I’m not seeing the space given back to my aggregate. How do I fix that?

These questions started around clustered Data ONTAP 8.3. This is not a coincidence.

A little backstory

Back in my support days, we’d occasionally get an unfortunate call from a customer who had accidentally deleted a volume (or the wrong volume) and was frantically trying to get it back. Luckily, if you caught it in time, you could power down the filers and have one of our engineering wizards work his magic and recover the volume, since deletes take time as blocks are freed.

This issue came to a head when we had a System Manager design flaw that made deleting a volume *way* too easy and did not prompt the user for confirmation. Something had to be done.

Enter the Volume Recovery Queue

As a way to prevent catastrophe, clustered Data ONTAP 8.3 introduced a safety mechanism called the “volume recovery queue.” This feature isn’t well known because it’s buried at the diag privilege level, which means it isn’t covered in the official product docs. However, I feel like it’s a cool feature that people need to know about, and one that should help answer questions like the ones I listed above.

Essentially, the recovery queue will take a deleted volume and keep it in the active file system (renamed and hidden from normal viewing) for a default of 12 hours. That means you have 12 hours to recover the deleted volume. It also means you have 12 hours until that space is reclaimed by the OS.
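To make the timing concrete, here is a minimal Python sketch of the arithmetic. It assumes the window is simply the deletion time plus the retention hours; the actual purge may run somewhat later. The function name and the timestamp are made up for illustration and are not part of ONTAP.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION_HOURS = 12  # default retention described in this post


def reclaim_time(deleted_at: datetime,
                 retention_hours: int = DEFAULT_RETENTION_HOURS) -> datetime:
    """Return the earliest time the space could be reclaimed.

    Assumption: a fixed window of deletion time + retention hours.
    """
    return deleted_at + timedelta(hours=retention_hours)


# Hypothetical deletion at 18:30 on January 5th.
deleted_at = datetime(2016, 1, 5, 18, 30)
print(reclaim_time(deleted_at))  # 2016-01-06 06:30:00
```

Until that time the volume can still be recovered, and the aggregate will not show the space as free.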

From the CLI man pages:

cluster::*> man volume recovery-queue
volume recovery-queue Data ONTAP 8.3 volume recovery-queue
NAME
 volume recovery-queue -- Manage volume recovery queue
DESCRIPTION
 The recovery-queue commands enable you to manage volumes that are deleted and kept in the recovery queue.
COMMANDS
 modify - Modify attributes of volumes in the recovery queue
 purge-all - Purge all volumes from the recovery queue belonging to a Vserver
 purge - Purge volumes from the recovery queue belonging to a Vserver
 recover-all - Recover all volumes from the recovery queue belonging to a Vserver
 recover - Recover volumes from the recovery queue belonging to a Vserver
 show - Show volumes in the recovery queue

The above commands, naturally, should be used with caution, especially the purge commands. Likewise, don’t use the modify command to shorten the retention hours so far that volumes are purged too aggressively. Definitely don’t set it to zero (which disables it).

How it works

When a volume is deleted, it gets renamed with its unique data set ID (DSID) appended, and its entry is removed from the volume table in the replicated database. From then on, it’s viewable only via the recovery queue for the default 12-hour retention period. During that time, space is not reclaimed, but the volume is still available to be recovered.
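The renaming rule is simple enough to sketch in a few lines of Python. This only illustrates the name_DSID pattern described above and is not ONTAP code; the function names, volume names, and DSIDs are hypothetical.

```python
def recovery_queue_name(volume: str, dsid: int) -> str:
    """Build the name a deleted volume gets in the recovery queue."""
    return f"{volume}_{dsid}"


def split_recovery_queue_name(queued: str) -> tuple[str, int]:
    """Split a queued name back into (original name, DSID).

    Splits on the last underscore, since volume names may contain
    underscores themselves. Assumes the suffix is always the numeric DSID.
    """
    name, _, dsid = queued.rpartition("_")
    return name, int(dsid)


# Hypothetical volume "home_dirs" with DSID 2044.
print(recovery_queue_name("home_dirs", 2044))       # home_dirs_2044
print(split_recovery_queue_name("home_dirs_2044"))  # ('home_dirs', 2044)
```

Knowing the pattern helps when you need to match an entry in the queue back to the volume someone says they deleted.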

For example, my volume called “testdel” has a DSID of 1037:

cluster::*> vol show testdel -fields dsid
vserver volume  dsid
------- ------- ----
nfs     testdel 1037

When I delete the volume, we can’t see it in the volume table, but we can see it in the recovery queue, renamed to testdel_1037 (recall 1037 is the volume DSID):

cluster::*> vol offline testdel
Volume "nfs:testdel" is now offline.
cluster::*> vol delete testdel
Warning: Are you sure you want to delete volume "testdel" in Vserver "nfs" ? {y|n}: y
[Job 490] Job succeeded: Successful
cluster::*> vol show testdel -fields dsid
There are no entries matching your query.
cluster::*> volume recovery-queue show
Vserver   Volume       Deletion Request Time     Retention Hours
--------- ------------ ------------------------- ---------------
nfs       testdel_1037 Fri Mar 11 19:02:40 2016  12
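If you capture that output in a script (over SSH, for example), a small parser can turn it into records. The Python below is a rough sketch that assumes the table looks exactly like the sample above, with one row per line and an English-formatted timestamp. The QueuedVolume type and parse_recovery_queue function are my own names, not anything ONTAP provides.

```python
from datetime import datetime
from typing import NamedTuple


class QueuedVolume(NamedTuple):
    vserver: str
    volume: str
    deleted_at: datetime
    retention_hours: int


def parse_recovery_queue(text: str) -> list[QueuedVolume]:
    """Parse 'volume recovery-queue show' output shaped like the sample."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has 8 tokens: vserver, volume, five timestamp
        # tokens, and the retention hours. Headers and rules are skipped.
        if len(fields) != 8 or not fields[-1].isdigit():
            continue
        deleted_at = datetime.strptime(" ".join(fields[2:7]),
                                       "%a %b %d %H:%M:%S %Y")
        entries.append(
            QueuedVolume(fields[0], fields[1], deleted_at, int(fields[7])))
    return entries


SAMPLE = """\
Vserver   Volume       Deletion Request Time     Retention Hours
--------- ------------ ------------------------- ---------------
nfs       testdel_1037 Fri Mar 11 19:02:40 2016  12
"""

for entry in parse_recovery_queue(SAMPLE):
    print(entry.vserver, entry.volume,
          entry.deleted_at.isoformat(), entry.retention_hours)
```

Long volume names or a different ONTAP release may change the layout, so treat this as a starting point rather than a robust parser.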

That volume will stay in the system for 12 hours unless I purge it out of the queue. Purging frees the space immediately, but it also removes any chance of recovering the volume. Run this command only if you’re sure the volume should be deleted.

cluster::*> volume recovery-queue purge -volume testdel_1037
Initializing
cluster::*> volume recovery-queue show
This table is currently empty.

Pretty straightforward, eh?

Pretty cool, too. I am a big fan of this feature, even if it means an extra step to delete a volume quickly. Better safe than sorry and all.

There is also a KB article on this, with a link to a video. It requires a valid NetApp support login to view:

https://kb.netapp.com/support/index?page=content&id=1014958

This KB shows how to enable it (if it’s somehow disabled):

https://kb.netapp.com/support/index?page=content&id=1015626

 
