Welcome to Episode 157, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”
This week on the podcast, we welcome Mr. Performance himself, Tony Gaddis (gaddis@netapp.com), to give us a tutorial on easily finding performance issues with OnCommand Unified Manager, as well as some common “rules of thumb” for how much latency and node utilization is too much.
Also, check out Tony’s NetApp Insight 2018 session in Las Vegas and Barcelona:
1181-1 – ONTAP Storage Performance Design Considerations for Emerging Technologies
Podcast listener Mick Landry was kind enough to document, in the comments, the “rules of thumb” that I forgot to add to the blog. Here they are (with a quick sketch after the list showing how you might check them):
- Performance utilization on a node > 85% points to a latency issue on the node (broad latency for volumes on that node).
- Performance capacity used on a node > 100% points to one or more volumes on the node that have latency because CPU resources are running out.
  - This is not an indicator of CPU headroom.
  - 100% is “optimal” – below that is wiggle room.
- Spinning disk
  - This is about aggregate performance utilization – not capacity.
  - Above 50%, the impact of disk latency will increase.
  - Once queueing starts, it will double or triple latency on slow platters.
  - Watch the performance utilization of the disk drives.
- Fragmented free space on spinning disk
  - Increases CP processing time.
  - At > 85% utilization of aggregate capacity, this becomes a problem.
  - > 90% will impact heavy workloads.
- Node utilization from an HA point of view
  - Keep the sum of the two nodes’ utilizations under 100% and you will be okay.
  - This applies during “user hours” on “revenue generating systems.”
- Disk
  - Keep spinning disk utilization < 50%.
- Aggregate latency expectations
  - SATA latency < 12ms
  - SAS latency < 8ms
  - SSD latency < 2ms
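
To make those thresholds a little easier to act on, here’s a minimal Python sketch that encodes them as simple checks. The function names and inputs are hypothetical illustrations (this is not a Unified Manager API); in practice you’d feed in the node, HA pair, and aggregate numbers you pull from OnCommand Unified Manager or ONTAP performance counters.

```python
# A minimal, hypothetical sketch of the rules of thumb above as threshold checks.
# Function names and inputs are illustrative only -- the numbers would come from
# OnCommand Unified Manager or ONTAP performance counters.

def check_node(perf_utilization_pct, perf_capacity_used_pct):
    """Node-level rules of thumb."""
    warnings = []
    if perf_utilization_pct > 85:
        warnings.append("Node performance utilization > 85%: expect broad latency "
                        "for volumes on this node.")
    if perf_capacity_used_pct > 100:
        # Performance capacity used is not a CPU headroom indicator;
        # 100% is "optimal" and anything below that is wiggle room.
        warnings.append("Performance capacity used > 100%: one or more volumes are "
                        "likely seeing latency because CPU resources are running out.")
    return warnings


def check_ha_pair(node_a_util_pct, node_b_util_pct):
    """HA rule of thumb: keep the sum of both nodes' utilization under 100%."""
    if node_a_util_pct + node_b_util_pct >= 100:
        return ["Combined HA pair utilization >= 100%: a takeover during user hours "
                "on a revenue generating system would overload the surviving node."]
    return []


def check_aggregate(media, latency_ms, disk_utilization_pct, capacity_used_pct):
    """Aggregate/disk rules of thumb, keyed by media type."""
    latency_targets_ms = {"sata": 12, "sas": 8, "ssd": 2}
    warnings = []
    target = latency_targets_ms.get(media.lower())
    if target is not None and latency_ms > target:
        warnings.append(f"{media.upper()} latency {latency_ms} ms exceeds the "
                        f"~{target} ms expectation.")
    if media.lower() in ("sata", "sas"):  # spinning disk only
        if disk_utilization_pct > 50:
            warnings.append("Spinning disk performance utilization > 50%: once queueing "
                            "starts, latency can double or triple on slow platters.")
        if capacity_used_pct > 90:
            warnings.append("Aggregate > 90% full: fragmented free space will impact "
                            "heavy workloads.")
        elif capacity_used_pct > 85:
            warnings.append("Aggregate > 85% full: fragmented free space increases "
                            "CP processing time.")
    return warnings


# Example with made-up numbers:
for msg in (check_node(88, 105)
            + check_ha_pair(60, 55)
            + check_aggregate("sas", 9.5, 62, 87)):
    print(msg)
```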
Finding the Podcast
You can find this week’s episode here:
Also, if you don’t like using iTunes or SoundCloud, we just added the podcast to Stitcher.
http://www.stitcher.com/podcast/tech-ontap-podcast?refid=stpr
I also recently got asked how to leverage RSS for the podcast. You can do that here:
http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss
Our YouTube channel (episodes uploaded sporadically) is here: