Behind the Scenes Episode 380: NetApp in the Gaming Industry with Justin Monast (former Naughty Dog)

Welcome to Episode 380, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”


NetApp ONTAP offers a wide range of features for enterprise workloads that boost performance, disaster protection, resiliency and much more. Many industries rely on that broad feature set for their most important business use cases, including the video game industry.

This week, a former director of IT at Naughty Dog – Justin Monast – stops by to discuss how NetApp ONTAP powered some of the most influential games in gaming history.

Finding the Podcast

You can find this week’s episode here:

I’ve also resurrected the YouTube playlist. You can find this week’s episode here:

You can also find the Tech ONTAP Podcast on:

I also recently got asked how to leverage RSS for the podcast. You can do that here:

http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss

Transcription

The following transcript was generated using Descript’s speech to text service and then further edited. As it is AI generated, YMMV.

Tech ONTAP Podcast Episode 380 – NetApp in Gaming with Justin Monast, former Naughty Dog
===

Justin Parisi: This week on the Tech ONTAP podcast, we talk to a gaming industry veteran about how NetApp powers some of the most famous games in the world.

Podcast Intro/Outro: [Intro]

Justin Parisi: Hello and welcome to the Tech ONTAP podcast. My name is Justin Parisi. I’m here in the basement of my house and with me today, I have a special guest to talk to us all about NetApp and the gaming industry and storage management and probably cloud. To do that, Justin Monast is here. Justin, what do you do and how do we reach you?

Justin Monast: Hey, Justin, how are you doing?

Justin Parisi: Good. Justin, how are you doing? We can do this all day.

Justin Monast: That’d be awesome. Again, my name is Justin Monast. I just left Naughty Dog in April of this year. I’d been with the company for 28 years. I started off as kind of an IT jack of all trades, just doing anything that needed to be done.

And at the time, the company was maybe four people in an office on the back lot of Universal Studios. I had been in the game industry for two years prior to that, actually as a game designer, and I had just picked up learning how to build PCs when I had the opportunity to work at Naughty Dog, where I stayed for about 28 years and loved working there.

And I just decided it was time for a new journey and adventures. I’ve obviously been blessed to work with some of the best people in the gaming industry, and I wish nothing but the best for Naughty Dog in its future.

Justin Parisi: Yeah, it’s pretty much a dream come true, right… wanting to be in the gaming industry and then landing at a company like Naughty Dog.

Justin Monast: Absolutely. I think in my very first interview, I looked at the team and I listened to them talk, and I didn’t even know about the Crash game at that time. They didn’t show it to me; back then it was called Willie the Wombat. And just from talking to Jason Rubin and Andy Gavin, I was enthralled. I was just like, you know, I want to work with you guys. I love your attitude. I love the direction you want to go and how you want to make games.

And it just started from there.

Justin Parisi: This is Crash Bandicoot, right?

Justin Monast: Crash Bandicoot, that was released in 1996. I started at Naughty Dog in February of 1995, and we spent a good amount of time figuring out what that game was, because it was kind of slightly open world at that time.

But obviously the PlayStation hardware was fairly restrictive, and the easiest and best way to actually get a product done was to simplify the gameplay. And as you can see, it became more into-the-screen tracking.

Justin Parisi: So Crash Bandicoot is I guess the flagship for Naughty Dog, but there’ve been plenty of other games, including some of the game of the year games, right?

So tell me a little bit about things that you’ve worked on or that you’ve been involved with at Naughty Dog in terms of gaming titles and that sort of thing.

Justin Monast: Within the first couple of years at Naughty Dog, obviously everyone kind of just has their hands in everything, right? It was a very small company.

And at that point, I started actually doing artwork on Crash Bandicoot, which was the game lighting. That was actually done by hand, every vertex controlled by RGB colors. I also helped out with gameplay, specifically the boss rounds on Crash 1. I think by the time we started doing other games like Jak and Daxter, I had basically cemented myself to focus on IT-driven technologies.

And especially when the company was growing to about, I think, 25 people, I decided it was time to focus just on that. Up until four years ago, we were running with just two other people, which is actually a pretty small team when you think about handling all your own internal NetApp, network switches, Active Directory domain controllers, Exchange servers. We were basically just a very small, lean team up until, I think it was, about 2018, 2019. And then we had done Uncharted, which is now a movie, and the critically acclaimed Last of Us series.

Justin Parisi: Yeah. And those are like two of the most revolutionary games I think of the last, I don’t know, 20 years or so, 40 years, right?

Last of Us basically to me was almost like an immersive movie. Right. You basically are in the game, you’re actually a part of the story more or less.

Justin Monast: No, absolutely. It’s interesting, because for many, many years, the video game industry has been trying to get more movie-like, trying to get into that space of telling a story but also having gameplay involved at the same time, or having some kind of interactivity without it just being kind of Dragon’s Lair-ish, right? Which, I don’t know if people remember that game from the 1980s; it was just a series of videos and combinations of joystick inputs to play different video streams. But yeah, Naughty Dog really hit it out of the park with that one, and even so, the Last of Us series on HBO was phenomenal.

I think it’s one of the few times that a game IP gets translated to television or big screen and just hits it out of the park.

Justin Parisi: Yeah. They didn’t mess with the story. That’s usually what happens. These movies or these TV shows say, okay, we’re going to change this part and this part, for whatever reason, whether it’s cost or special effects, but they were just like, let’s just remake the game story. And that’s basically what they did. And you don’t mess with a good thing, right? You kind of stick with it. You mentioned working on Crash Bandicoot and doing lighting and having a lot done by hand, and things have changed quite a bit since you started doing that.

So talk to me about the evolution of that type of work. Tell me how you would do it back then and how they’re doing it now.

Justin Monast: So the original development of Crash Bandicoot was actually done on Silicon Graphics workstations, which we were kind of enamored of from watching Jurassic Park. We said, well, we want this type of hardware, because that was the only thing you could really utilize if you wanted to do animation or modeling.

I mean, there were no PC graphics cards at that time, or anything powerful enough to do what we needed to do. So we were always kind of on the forefront of technology and hardware, but at the same time, this required us to write a lot of proprietary tools running on the SGIs. And so Andy Gavin, the co-president of the company, wrote a program called "Neaten" – to neaten something.

And what it did was allow us to import a 3D model, add textures to it, and actually work per vertex – and a vertex is, just imagine when you’re looking at the polygonal thing, there are just boxes, right? They’re either squares or triangles. And from what I remember, the PlayStation rendered out in triangles, but the PowerAnimator modeler was in squares, so we were able to import that into the SGIs and a custom-built piece of software, and literally take the vertex and say what RGB value it would have, right? So you’d have three sliders, and you pick a color and you can save it. And then you can also have that vertex just color within that triangle itself, or anything that touches other triangles on a vertex around it.

So you can do soft or hard shading. Just little touches like that I think make Crash Bandicoot one of the best looking games, right? Cause we really had that ability, that fine fidelity back then to paint on the screen per se.
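[Editor’s note: to make the vertex-painting idea concrete, here is a minimal sketch of how per-vertex RGB values can drive hard versus soft shading. This is illustrative Python, not Naughty Dog’s actual SGI tool; the mesh and colors are invented.]

```python
# Minimal sketch of per-vertex RGB shading (hypothetical, not the actual
# "Neaten" tool). Hard shading: each triangle keeps its own corner colors.
# Soft shading: shared vertices average the colors of adjacent triangles.
from collections import defaultdict

# Each triangle: three vertex indices; colors assigned per (triangle, corner).
triangles = [(0, 1, 2), (1, 2, 3)]
corner_rgb = {
    (0, 0): (200, 40, 40), (0, 1): (200, 40, 40), (0, 2): (200, 40, 40),
    (1, 0): (40, 40, 200), (1, 1): (40, 40, 200), (1, 2): (40, 40, 200),
}

def soft_shade(triangles, corner_rgb):
    """Average colors across every triangle touching each vertex."""
    sums = defaultdict(lambda: [0, 0, 0, 0])  # r, g, b, count
    for t, tri in enumerate(triangles):
        for c, v in enumerate(tri):
            r, g, b = corner_rgb[(t, c)]
            acc = sums[v]
            acc[0] += r; acc[1] += g; acc[2] += b; acc[3] += 1
    return {v: (r // n, g // n, b // n) for v, (r, g, b, n) in sums.items()}

print(soft_shade(triangles, corner_rgb))
# Vertices 1 and 2 blend red and blue; vertices 0 and 3 keep hard colors.
```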

Justin Parisi: So as you’re painting on the screen, I would imagine there’s a lot of compute happening in the background and that’s taking place on the desktop PC you’re working on. Is that accurate?

Justin Monast: Yeah. Well, that would be on the SGI, and it would take several hours to render out a single level in Crash Bandicoot. Iteration was quite slow. But we did have a distributed build system back in the day, in 1995, where we would actually use all six SGIs on the network and try to do as much as we could back then.
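[Editor’s note: the 1995 SGI build system was proprietary, but the idea maps directly onto any modern job pool: independent level builds farmed out to parallel workers. A rough sketch, with invented level names and a sleep standing in for hours of render work.]

```python
# Sketch of a distributed level-build queue (illustrative only; the 1995
# SGI system was custom). Independent levels build in parallel workers.
from concurrent.futures import ProcessPoolExecutor
import time

def build_level(name: str) -> str:
    time.sleep(0.1)          # stand-in for hours of lighting/render work
    return f"{name}: built"

if __name__ == "__main__":
    levels = [f"level_{i:02d}" for i in range(6)]   # one job per SGI, say
    with ProcessPoolExecutor(max_workers=6) as pool:
        for result in pool.map(build_level, levels):
            print(result)
```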

Justin Parisi: I imagine the bottleneck then was the CPU.

Justin Monast: CPU and memory, right? And then once we started working on the Octanes, we had about four megs of texture memory, which made things a little bit better.

Justin Parisi: And I guess where I’m going with this is, as things evolve, like as compute evolves, as you get GPUs, your bottlenecks start to shift back and forth to other things.

Justin Monast: For game lighting in today’s games, let’s say Last of Us or Uncharted, there was some baking of lighting within the game itself, right? And that compute would take quite a bit of time, done on a separate set of servers that precompute light.

When we were talking about PlayStation 3, PlayStation 4, PlayStation 5 games, and the tech that went along with making them, by that time we had some really amazing pieces of software that would help us with lighting, with close to thousands of lights within the scene itself, which would take hours and hours to render out. Because there are two different ways of doing lighting in games.

And that’s baked-in lighting, where you can precompute the lighting of a room, right? As long as you don’t have a lot of moving lights within it, if everything’s pretty static, it’s simple. And a lot of that stuff you want to precompute; there’s no point in having your game console, PlayStation or Xbox, do any of that for you, right?

It’s just wasted compute cycles. Now, when you have live lighting, that’s when you can obviously have someone walking around with a flashlight. That’s very useful and helpful. But then the cycles are actually dedicated to the hardware itself. So the better the game console, the more live lighting you have. And with fancier graphics cards these days, you can actually use ray tracing to really do some amazing light effects. Not to mention doing some good audio stuff through ray tracing, which a lot of people don’t think about. There are so many different things besides just live lighting that you can do with that.
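[Editor’s note: as a toy illustration of the "baked" half of that split, a simple diffuse term can be precomputed per vertex, offline, for static lights, so the console only reads stored values at runtime. This assumes a plain Lambertian model and invented scene data, not Naughty Dog’s pipeline.]

```python
# Toy lightmap bake (illustrative; real pipelines are far more involved).
# For static lights, compute diffuse lighting per vertex once, offline;
# the console then just reads the stored values at runtime.
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def bake_vertex(pos, normal, lights):
    """Sum Lambertian contributions from every static light."""
    total = 0.0
    for lpos, intensity in lights:
        to_light = normalize(tuple(l - p for l, p in zip(lpos, pos)))
        total += intensity * max(0.0, sum(n * d for n, d in zip(normal, to_light)))
    return min(1.0, total)

static_lights = [((0.0, 10.0, 0.0), 0.8), ((5.0, 3.0, 2.0), 0.5)]
print(bake_vertex((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), static_lights))
```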

Justin Parisi: And when you’re doing these types of lighting renders, how many machines might drive that render workload?

Justin Monast: Oh, we’re talking not so much machines, but cores. Hundreds and hundreds, thousands of cores sitting in racks within the office itself. I always tried to keep everything centralized. I’m a firm believer that as much as cloud services are quite amazing and needed, if you lose your connection to the outside world, you cannot keep iterating. And so for doing a lot of this live lighting, or even just iterations of levels, we would have thousands and thousands of CPU cores, usually in 1U to 2U machines. Toward the end, in the last two years, we were actually grabbing some AMD stuff that had like 256 cores within a 1U configuration.

It’s amazing. But once you go to that density, you’re starting to talk about AC and power being an issue. So then you start thinking, well, okay, co-location might be best.

Justin Parisi: I imagine it gets pretty warm in those rooms.

Justin Monast: Yeah, pretty warm. I ended up learning quite a bit about facilities work over the years at Naughty Dog based on having to deal with air conditioning problems. If you have a room that doesn’t have a hot and cold aisle, then how are you going to make that work? And then what do you do about power backups? Are you going to do battery? You’re going to do generator? How long do you want to keep an office running for?

In those days prior to COVID, we didn’t really have a lot of people working from home. It was mostly the IT department if something needed to be taken care of, right? And that was, like, three people at the most. And then maybe some of the co-presidents or a couple of leads would ask for that.

Justin Parisi: When you were putting these assets out there, I’m guessing there’s millions of files. Terabytes to petabytes of capacity, is that accurate?

Justin Monast: Yeah, I would say that by the time I left, we had about 3.5 petabytes of object storage – StorageGRID, which was amazing. And we had maybe one and a half to two petabytes across spinning disks and NVMe. I think we were getting rid of some 8060s that we had and converting completely over to the A700 and A800 systems.

Justin Parisi: You mentioned object storage. When would you use object storage in these workflows versus a NAS environment?

Justin Monast: When we first brought that in, I think it was around 2019, it was more of a way of regaining some space off the NVMe and retiring some things off the 8060s’ spinning disks.

So it was kind of a two-fold attack on how to handle something that had been growing since our first NetApp, the 3050, right? And I think the 3050 was probably around 15 years ago, when we first had that. We had an environment where there wasn’t much shifting or changing you could do rapidly.

That’s where the StorageGRID came in. It allowed us to do a volume move, which was an amazing piece of tech that could do that flawlessly, and then basically state, if the data is 30 days old, you just start moving things over there. Or if we had some other volumes that were retired on spinning disk, it was like, okay, now we can just put those on the A700, have only a little bit of it – basically the metadata – live on NVMe, which was great, and have everything else pushed back to StorageGRID.

It really allowed us to pivot and really focus on where we’re going to spend our money and how our data is going to be protected and used.
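[Editor’s note: in spirit, the manual tiering described here boils down to "if it hasn’t been touched in 30 days, push it to object storage." A minimal sketch assuming an S3-compatible StorageGRID endpoint and the boto3 library; the endpoint, bucket, and paths are invented, and this is not how ONTAP’s volume move or FabricPool works internally.]

```python
# Sketch of age-based manual tiering to S3-compatible object storage
# (hypothetical bucket/paths; StorageGRID exposes an S3 API, but this
# script is an illustration, not ONTAP's internal mechanism).
import os
import time
import boto3

THRESHOLD = 30 * 24 * 3600  # 30 days in seconds

s3 = boto3.client("s3", endpoint_url="https://storagegrid.example.com")

def tier_old_files(root: str, bucket: str) -> None:
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > THRESHOLD:
                key = os.path.relpath(path, root)
                s3.upload_file(path, bucket, key)
                os.remove(path)  # reclaim NVMe/spinning-disk space

tier_old_files("/mnt/projects/archive", "cold-tier")
```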

Justin Parisi: Would you leverage things like the FabricPool technology, where it would automatically tier off to StorageGRID, or would you do that manually?

Justin Monast: I did it manually, because I like to have a little bit more control of my environment. If something happened and someone forgot, like, oh, I set up this particular rule to do this, that could cause problems.

Justin Parisi: And as far as your DR and backup and recovery, how did you handle those? With all those files, it tends to be very time consuming to back up and recover.

So what were you using as a backup strategy?

Justin Monast: So we were using NetWorker, and I know some people might snicker when they hear this, but it worked. And leveraging NDMP, we were able to back up 100 terabyte volumes with millions and millions of files.

I think we had about 700 million files in one of our volumes, which is quite a lot of files, very small files. We can discuss the data sets a little bit later, but yeah, we leveraged NDMP, which again worked out really, really well. Probably the biggest thing is that normally, knock on wood, for a DR setup, you’re not going to have a lot of times where you lose almost everything. It’s more for the one-offs. And this is where snapshots took care of 99.99 percent of the cases where someone needed something back. My rule was to keep at least three weeks, which again took care of most of the problems. Now, if you ever needed to go off tape, that’s when you’ve got a bigger issue.

And I think maybe less than a handful of times did I ever have to come back and say, I need a couple of files from, like, two or three years ago. Because one of the other things I would do is make a SnapMirror of all the main volumes that were being used. So once a day, they were being snapped off to another cluster, with its own drive array. So if I lost a head unit and a whole shelf of disks, I could just repoint using the NetApp and say, okay, the data is here now. It could be older by up to 24 hours, but that doesn’t matter. Luckily, I never even had to go that direction either. But with NetApp ONTAP, I was able to do all these multiple little levels before it even went to tape. So the tape was more for if the office burnt down; those tapes are offsite with an Iron Mountain type of company.
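[Editor’s note: the "three weeks of snapshots" rule is easy to picture as a pruning sweep against the ONTAP REST API. A rough sketch; the endpoint and field names follow NetApp’s public REST docs, but verify them against your ONTAP version, and the cluster address, credentials, and volume UUID are placeholders.]

```python
# Rough sketch of a 3-week snapshot retention sweep via the ONTAP REST API.
# Endpoint/field names per the public docs; treat details as assumptions.
import requests
from datetime import datetime, timedelta, timezone

CLUSTER = "https://cluster.example.com"
AUTH = ("admin", "password")          # use a real credential store
KEEP = timedelta(weeks=3)

def prune_snapshots(volume_uuid: str) -> None:
    url = f"{CLUSTER}/api/storage/volumes/{volume_uuid}/snapshots"
    resp = requests.get(url, auth=AUTH, params={"fields": "create_time"},
                        verify=False)
    cutoff = datetime.now(timezone.utc) - KEEP
    for snap in resp.json().get("records", []):
        created = datetime.fromisoformat(snap["create_time"])
        if created < cutoff:
            requests.delete(f"{url}/{snap['uuid']}", auth=AUTH, verify=False)
```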

Justin Parisi: Yeah, and I know we’ve had instances, not necessarily in the gaming industry, but there’s the famous Toy Story 2 situation where they lost their entire movie and somebody found a backup and, like, yay! Saved the day. But yeah, that’s not a situation you want to be in.

Justin Monast: Knock on wood, we never had a situation like that. And that’s kind of one of the reasons why I retired from IT. It is not for the faint of heart to be in these positions of CIO, vice president, or director of IT. For me, that was it. I was the one running it. And I look back and I loved it, because there was so much cool stuff you were learning to do every day. But when someone says a volume is missing, your heart sinks. You’re like, wait, what do you mean a volume is missing? Often it’s just that someone’s not looking in the right place. But…

Justin Parisi: Yeah. It’s the phrasing, right?

Justin Monast: Yeah. I didn’t personally purchase this, but one of our first RAID units was kind of an off-the-shelf system that really had problems.

And so I started looking into it, and I think our first real big boy RAID array was an SGI TP9300S, which was a great system. Because again, we were kind of an SGI house, right? Not just for workstations; we actually used them as servers. And I actually got rid of my NTP server running on an Origin 200 about 10 years ago.

Like, it was still running. I mean, it was a 12 year old piece of hardware, but it was still giving us our time sync.

Justin Parisi: Was there more than one server? Or is it just that one?

Justin Monast: It’s just that one. And to go off of that question, when you create a volume within NetApp, you have the option to do mixed, Windows, or NFS security styles, and we were always NFS. I think we felt that the permission sets were a little bit easier to do. The metadata was much simpler. Plus, at the time we started, we were still Windows NT driven, right?

So there was not a lot of good integration. And if anything, it would have been maybe a NetApp appliance that was running through Samba at that time. But I digress. That was a long time ago.

Justin Parisi: So you were using NFS, and you said you were Windows NT based. So how did you work that out? Were you using, like, Cygwin or something? Were you using Windows NFS?

Justin Monast: So Cygwin, we definitely started using Cygwin, but we did have Sun at one point. Sun Microsystems had a PC-NFS client.

Justin Parisi: Oh, yeah, I remember that.

Justin Monast: It was between that and Samba. But once we went to the NetApp, Samba disappeared from our environment, because you guys were running SMB version 2.0, and it was definitely stable and strong enough to do everything we needed to do. Obviously, Samba at that time was not necessarily hailed or liked by Microsoft. Basically, they were reverse engineering the Active Directory structure, which was awesome, and I commend them even to this day for what they have done.

But it was just easier at that point, once we got the 3050, to just say, okay, this is going to be our central way of doing things. And I think by then we probably had Windows Server 2000, which had a little bit better Active Directory integration.

Justin Parisi: Yeah. Having all that in one place makes it a lot easier to manage. Honestly, Cygwin and stuff like that, they’re great for what they are, but they are kind of kludgy hacks.

I mean, they’re not built for that, right? The Windows machines are not built to do NFS, so you’re going to run into some weirdness there. But once you start to integrate more native SMB and CIFS, that kind of streamlines things, makes things easier to manage, probably performs a bit better, and you’ll probably see a lot fewer weird errors popping up.

Justin Monast: Absolutely. It was great doing a translation from NFS, because NFS really has some very simple permission sets, right? Either you have access to it or you don’t, or you’re within a group, right? And so that allowed us on the back end to say, well, this group can have access or that group can’t, and it’s just cut and dried.
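[Editor’s note: that "you have access or you don’t" model is essentially just POSIX mode bits. A tiny sketch of the owner/group/other check, simplified to ignore ACLs, root squash, and other NFS subtleties.]

```python
# Tiny sketch of classic NFS/POSIX permission evaluation: owner, group,
# other. Simplified; ignores NFSv4 ACLs, root squash, and supplementary
# group subtleties.
R, W, X = 4, 2, 1

def allowed(mode: int, file_uid: int, file_gid: int,
            uid: int, gids: set[int], want: int) -> bool:
    if uid == file_uid:
        bits = (mode >> 6) & 7       # owner triplet
    elif file_gid in gids:
        bits = (mode >> 3) & 7       # group triplet
    else:
        bits = mode & 7              # other triplet
    return bits & want == want

# rw-r----- (0o640): owner can write, group can only read, others nothing.
print(allowed(0o640, 1000, 100, 1000, {100}, W))   # True  (owner)
print(allowed(0o640, 1000, 100, 2000, {100}, R))   # True  (group)
print(allowed(0o640, 1000, 100, 3000, {200}, R))   # False (other)
```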

Justin Parisi: Yeah. And you’re letting the NetApp handle that authentication and permission negotiations sort of stuff.

Justin Monast: Yes. That’s when they used to have a translation file.

Justin Parisi: This is back in the 7-Mode days, right?

Justin Monast: And in the 7.0 days, they had, I think it was called sys translation.conf or something like that.

Justin Parisi: Yeah, yeah.

It actually got a little bit easier later on because there are native one-to-one mappings, right? So if you have the same usernames, then they’ll map automatically. And then you have your name services like LDAP and NIS, right? So you can leverage those, and then Active Directory, of course.

So that all is built in. The only time it gets tricky is when you start to have this mismatch of usernames and then you got to start getting tricky with the name mapping rules and that sort of thing. But overall, it’s the bread and butter of what we’ve always done.
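[Editor’s note: conceptually, that flow tries the identical name first and only falls back to explicit rules when names differ. A sketch of the idea; the rule patterns below are invented examples, not real ONTAP name-mapping syntax.]

```python
# Conceptual sketch of Windows-to-UNIX name mapping: try the implicit
# one-to-one match first, then fall back to explicit rules. The rules
# below are invented examples, not actual ONTAP configuration.
import re

unix_users = {"jsmith", "monast", "builduser"}
explicit_rules = [
    (re.compile(r"^CORP\\(.+)$"), r"\1"),          # strip the domain
    (re.compile(r"^svc-render$"), "builduser"),    # service-account alias
]

def map_windows_to_unix(win_name: str) -> str | None:
    short = win_name.split("\\")[-1].lower()
    if short in unix_users:          # implicit 1:1 mapping
        return short
    for pattern, repl in explicit_rules:
        if pattern.match(win_name):
            candidate = pattern.sub(repl, win_name).lower()
            if candidate in unix_users:
                return candidate
    return None                      # mapping failed; auth would fail

print(map_windows_to_unix("CORP\\jsmith"))     # jsmith via implicit match
print(map_windows_to_unix("svc-render"))       # builduser via explicit rule
```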

Justin Monast: Oh, absolutely. And NIS is actually nice. I really like that. Pretty simple directory name lookups and whatnot.

Justin Parisi: Yeah, it’s pretty easy and simple. I think the downside of NIS is it’s not as secure and not as scalable. So LDAP, you have like that built in replication with AD. You can do the extensions into AD and leverage the Unix attributes and if you lose a domain controller, you don’t lose your name services. But NIS, you lose that NIS service. Like, Oh, can’t resolve names anymore.

Justin Monast: Yeah. Yeah. So there was that one time. Our NIS boxes were hardware, physical. And then we had one that was virtual, and somehow someone had made the virtual NIS box the master. Then we rebooted the NetApp, and the NetApp was hosting VMware…

Justin Parisi: Oh, the circular dependency thing, where it’s on the VMs…

Justin Monast: Yep.

Justin Parisi: Like, where’s my NIS server? Same thing with DNS.

Justin Monast: Exactly. And the reason we were running it on VMware was just as a backup, because we were testing some of that stuff out. But I think during the testing, we were pushing it back and forth to say, okay, if it’s the master, how are its response times, right? Because even 20 years ago, PC stuff was not the greatest.

So we’re like, okay, what’s our worst case scenario? And really, I love working on game preservation and stuff like that. So it’s, how much can we virtualize of our services, so that if we ever had to go back and convert a game, I could give it to another company and say, here’s a virtualization of our NAS or some of our workstations?

But yeah, we had a chicken-or-the-egg type of thing. We spent a couple of hours on the weekend scratching our heads, first trying to figure out what happened and then realizing, oh, it’s just a simple fix: reboot this one thing and everything’s good to go.

But yes, you’re right. You can get in trouble, because even NIS+ has not grown.

Justin Parisi: Yeah. And it’s legacy, right? It’s probably never going to be fully gone. I mean, you still have, I don’t know, IBM mainframes running out there. You still have Solaris boxes running out there. So if there’s a need for something, it’ll stick around. But it is, I think, definitely going away in favor of LDAP servers that have more functionality, more robustness, more security.

’Cause you can encrypt those packets. With NIS+ and NIS, I don’t think you really have that encryption functionality unless you’re doing IPsec and that sort of thing.

Justin Monast: Yeah. And with that, Microsoft has really embraced Linux environments, knowing that things are going to be mixed. I mean, I’m running WSL 2 and Ubuntu on my Windows 11 PC, and I love it, right?

It allows me to just do some simple little things, some simple little tools that I need, so I don’t have to open up a whole virtual machine. I do have a separate PC that is an Ubuntu box – 22.04 LTS – that I like, but it’s nice that I can do a quick little test or just run an application that way.

Justin Parisi: Yeah. They definitely have stepped up their game because they’ve kind of had to, right? Even moving SQL server to Linux support, being able to use that type of stuff. So they understand the landscape is changing.

Justin Monast: Yes. And not fighting it, which is nice.

Justin Parisi: Yeah, they fought it for a while, but now it’s I guess getting a little better there.

The Windows Subsystem for Linux is basically like a virtual box, though. Like, it’s a VM kind of thing, right?

Justin Monast: Yes. It’s very simple and there’s very low overhead, especially when you’re having an operating system running on the same CPU.

Justin Parisi: All right. So let’s take a step back where we were talking about what you’re using your NetApp for and all those files. In the older days, because you’ve used this since 7-mode and you moved on to the clustered ONTAP, you had this concept of FlexVols, and they were very good for the most part, but you had this serialness of the workload, right?

Especially when you’re dealing with high file counts. So talk to me about the challenges that you had at Naughty Dog with that and how you moved past that with burgeoning technologies within ONTAP.

Justin Monast: A lot of people in the game industry or in film have other versioning systems, right?

Especially paid ones. You can use Perforce. And then there was Alienbrain, or something else that escapes my mind. For us, it was a proprietary thing. It was a very simple solution where basically you just have a repository folder that has the actual data that you’re working on. And then that is symbolically linked to every single user.

So the user gets added to a system that says, anytime that file gets created, it makes a link in everyone’s personal directory. So you have a copy of that, say, Photoshop file. Now, when someone’s accessing it on Windows, they don’t see it as a symbolic link. They actually see it as a real file, which is great.

But the scaling factor of this is, let’s say the project itself has a million files in it; well, then you have a million symbolic links in every single user directory. And we can get a little further on into how each individual used it. But what we really need to discuss is the fact that we had volumes with hundreds of millions of files.
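[Editor’s note: a minimal sketch of that symlink fan-out pattern, with invented paths and users; the real Naughty Dog tool was proprietary. One physical copy lives in the repository, and every user gets a link, which is why a million-file project means a million links per user tree.]

```python
# Minimal sketch of the symlink-based versioning layout described above
# (paths and users are invented; the real tool was proprietary). One
# physical file lives in the repository; every user gets a symlink.
import os

REPO = "/mnt/projects/repo"
USERS = ["alice", "bob", "carol"]

def publish(rel_path: str, data: bytes) -> None:
    src = os.path.join(REPO, rel_path)
    os.makedirs(os.path.dirname(src), exist_ok=True)
    with open(src, "wb") as f:
        f.write(data)
    for user in USERS:
        link = os.path.join("/mnt/users", user, rel_path)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(src, link)        # each user "has" the file

publish("textures/crash_fur.psd", b"...photoshop data...")
```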

Now, we could offset that once we got a new project, using junction points to basically pull the project out of the main volume that everyone accesses, but that can still get to a pretty large amount of data – not so much large data sets, just the number of files. And so we had a challenge where you had a volume with everything from tons of 1K files to several 50 gig files. It was a mixture, so it wasn’t a case of saying, well, your data set is very small, so you need to focus on this and this is the type of hardware to get. What really got us was when we were running, like I said at the beginning, a 3050 NetApp, and then a 6040, which was a great piece of hardware.

I think that was the last one that was ONTAP 7. Then we moved over to an 8060. Now, our particular problem was that the 6000 series was running a 2.4 gigahertz CPU, or even a little bit faster. We ended up downgrading, without realizing it, to a 2.1 gigahertz server, which was the 8060, although it had quite a few cores.

And that’s what NetApp was telling us: well, you can do so much, and ONTAP’s getting more distributed as far as its threading goes. Right, great, okay. It sounds like a good next step after having a system for about four years. But we ended up getting in trouble and realizing we were having a 20 percent drop in performance.

And that’s where, working with NetApp support, we realized that since we’re really write heavy on our volumes, when you have a single threaded operation on a write, that’s the only thing you can be doing. So if you have 100 people wanting to write a file at a specific time, only one write can be done, no matter how many cores you have on the system itself.

And so that’s where we started seeing slowdowns on a brand new 8060. And coming in from the 6000 line, you’re expecting at least 50 percent better performance, something to be faster. But that’s not what we got. And I think that’s when we started working with NetApp and saying, there has to be a better solution. FlexGroups were a great solution for this, where we could transition a volume over, have constituent sub-volumes within it, and say, okay, now we can start distributing the load a little bit. Obviously it wasn’t as smart back in the day about where the data went, but it gave us the ability to leverage newer NetApp hardware, especially the A700, which I think was just a beast of a system when it came out.

Justin Parisi: Yeah, and a lot of that has to do with the serial processing of that write metadata. And that’s all done at an affinity level per volume. And the reason why FlexGroups help is that you have more volumes to work with, but it’s all masked, so you don’t have to deal with more volumes. You’re dealing with a single namespace. Say, hey, I want to point to this share, just dump the data in there. Then ONTAP does the rest: figures out where things are going to go, places things across volumes, and balances them. And that’s where the performance gains really come from, because you’re not necessarily changing ONTAP.

You’re just taking what ONTAP has and then scaling it out and turning it into more parallel processing, which is really where everything’s going today, anyway. More parallelism because you have limits on anything. Hardware, CPU, RAM, you expect systems to be able to do more. And if you’re throwing more at a system, you expect it to be able to do more.

But if there’s a software problem, or if there’s a scale problem, or there’s a bottleneck somewhere, then that’s where you have to…

Justin Monast: Well, it just didn’t scale for a single volume.

Justin Parisi: Well, I think what was happening was for years, the single volume was enough because systems couldn’t push them hard enough.

Right. Now systems start to push things harder, and you hit those bottlenecks, because I think the FlexVol has like a 60,000 IOPS limit, or something right around there. And once you hit that, you stop. And then you look at the serial processing things like you mentioned and the high file counts, and it becomes a problem.

And now that systems can go further to the edge and scale out faster, you’re now dealing with the storage as the bottleneck. And you take that approach of parallelism with the volumes. And that’s how you get around that because now you can scale because you can just add more volumes to the workload, more cores.
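[Editor’s note: conceptually, that scale-out looks something like the sketch below: each new file lands on whichever constituent member volume is least full, so writes spread across many volume affinities instead of serializing on one. ONTAP’s real placement heuristics weigh capacity, inodes, and recent traffic; this is a deliberate simplification.]

```python
# Conceptual sketch of FlexGroup-style file placement: each new file goes
# to the least-full constituent member volume, so write load spreads
# across many volume affinities. A simplification of ONTAP's heuristics.
class Constituent:
    def __init__(self, name: str, capacity: int):
        self.name, self.capacity, self.used = name, capacity, 0

class FlexGroup:
    def __init__(self, members: list[Constituent]):
        self.members = members
        self.catalog: dict[str, str] = {}   # path -> member (one namespace)

    def create(self, path: str, size: int) -> str:
        target = min(self.members, key=lambda m: m.used / m.capacity)
        target.used += size
        self.catalog[path] = target.name
        return target.name

fg = FlexGroup([Constituent(f"member{i}", 100_000) for i in range(8)])
for i in range(5):
    print(fg.create(f"/build/asset_{i}.bin", 2_000))
```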

Justin Monast: Oh, absolutely. No, I mean, absolutely. And it’s much like when you guys were clustering hardware. I mean, you had to start clustering, in essence, volumes in order to basically get that same kind of gain, because, you know, with Moore’s law, there’s only so much CPU you can actually put in the machine. And even then, you can make that next step of being all NVMe, but what’s next after that to get performance gains?

And I think it’s great, the direction that you guys were heading.

Justin Parisi: Yeah, there are also network constraints, memory constraints, disk constraints. I mean, there are a lot of bottlenecks that can happen in any storage system, NetApp or not, that you just work around the best you can by spreading the love across multiple nodes.

Justin Monast: Absolutely. And there are still some environments where you can make a case for spinning disks over NVMe.

Justin Parisi: Yeah, cost wise especially, because you’re dealing with much more effective costs for things like archival… like object storage. You wouldn’t necessarily put all your archival stuff in object storage on flash, because maybe that’s just overkill for it.

Justin Monast: 100%. And I think NetApp’s tech with compaction, compression, and deduplication was just amazing. That was really a game changer. We were able to get a 100 terabyte volume down to 60 terabytes. And we’re talking about mixed data, from video to disk images to small individual files.

It was astonishingly good. And then on top of that, you add StorageGRID on the back end and you recover so much space. The tech that NetApp has come out with in ONTAP and hardware within the last five years is the only thing that makes me wish I had stayed in IT, just to keep playing around with NetApp.

Justin Parisi: Yeah. Yeah. I think you would have liked playing with FlexCache, because that’s another way to scale further, right? So yeah, you have limits with FlexGroups; you can only get a certain capacity. Well, with FlexCache, you can parallelize that performance more for your read workloads, right?

Because you can now add more cores and more nodes and more clusters. You could be in the cloud. So you have a lot of options there with FlexCache to scale things out even more.

Justin Monast: I would like to talk about one of the reasons why I used NetApp for over 15 years: I’ve just really loved the stability that’s been built in. From HA pair configurations to double parity, NetApp has always been leading the charge on making sure data is safe.

Justin Parisi: Right. Yeah. As a customer that’s worked with NetApp for probably over 20 years, how many times have you had an HA pair completely go out on you? And I’m sure it’s happened, but percentage wise, is it 1 percent of the time, 0 percent of the time?

What have you seen where you have an entire HA pair go down?

Justin Monast: Never.

Justin Parisi: That’s actually surprising. I would expect it at least to happen once because I mean, it’s hardware, right? Hardware fails. And sometimes…

Justin Monast: Well, I mean, this is where I go back to the sleepless nights of like, knowing that you have one head unit running out of two because the other one failed and it happened at one o’clock in the morning and you know you’re not going to get a piece of hardware back until that afternoon.

Justin Parisi: Yeah. You definitely are kind of holding your breath there.

Justin Monast: Yeah. I mean, there’s always Murphy’s law. I mean, no matter how good stuff is, but from my recollection, we never had a double head unit fail. And I don’t think we ever even had a double disk failure within a RAID group itself. Now, I’ve had close calls because once you’re into a situation like that, you go into the CLI and start looking at the disks within the RAID group itself and saying, okay, let me go take a look with, I forgot which command it was.

You could really get into the weeds on the CLI side of things with NetApp and really see what’s going on underneath it. But then you would say, Oh, okay, so the rebuild is going good. But now it’s showing that there’s a particular drive that it looks like it’s going to fail because it’s having a higher latency than other things.

One of the beautiful things we had done was set up Grafana connectivity, polling the NetApp like once a minute and dumping out all these commands, and then we would throw it into Grafana. We could basically look at all 200 spinning disks and see the latency for each one, every minute. And now you guys have something very close to that with Prometheus, I believe, right? Which we were using just a couple of years ago, which would help…

Justin Parisi: Yeah, Grafana or Harvest or whatever.

Justin Monast: Yeah, Harvest. And that’s great. The more information, the better, especially for someone like me, because as good as NetApp was at saying, well, we can kind of tell when a disk is going to fail, I could see it even sooner than that by looking at the latency on one particular disk, and I could go fail that one or do a background move and be really ahead of the game. We don’t leave things be. We don’t just set it up, rack and stack it, and go, okay, whenever it sends an email, we’ll give it some attention. I was probably logged into that machine every single day, looking at it.
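[Editor’s note: the once-a-minute per-disk polling loop described here maps naturally onto the Prometheus client library that Harvest and Grafana setups build on. A sketch with a placeholder stats source; in practice, NetApp Harvest collects these metrics for you.]

```python
# Sketch of a once-a-minute per-disk latency exporter in the spirit of
# the Grafana setup described above. get_disk_latencies() is a placeholder
# for however you pull stats (ONTAP CLI/REST); NetApp Harvest does this
# for real deployments.
import random
import time
from prometheus_client import Gauge, start_http_server

DISK_LATENCY = Gauge("disk_read_latency_ms",
                     "Per-disk read latency in milliseconds", ["disk"])

def get_disk_latencies() -> dict[str, float]:
    # Placeholder: in reality, poll the array here.
    return {f"1.0.{i}": random.uniform(2.0, 8.0) for i in range(200)}

if __name__ == "__main__":
    start_http_server(9101)          # Prometheus scrapes this port
    while True:
        for disk, ms in get_disk_latencies().items():
            DISK_LATENCY.labels(disk=disk).set(ms)
        time.sleep(60)               # once a minute, like the original setup
```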

Justin Parisi: All right. So you mentioned FlexGroups, what other ONTAP features were you using at Naughty Dog to enhance your workloads?

Justin Monast: SAN. And lots of it.

Justin Parisi: Oh, what did you do with SAN?

Justin Monast: We probably had, oh boy, I think about 250 terabytes of SAN storage, in individual 16 terabyte volumes, on Linux machines.

Justin Parisi: And what were the workloads you were throwing at those?

Justin Monast: They were basically large hash files.

Justin Parisi: Okay. And why would you choose SAN for that workload? I think I know the answer to this, but go ahead.

Justin Monast: We realized that NFS and CIFS were not fast enough for us.

Justin Parisi: Yeah. I’m guessing these are very metadata heavy workloads.

Is that, is that correct?

Justin Monast: Extremely metadata heavy. And that’s when we went really deep with NetApp engineers a couple of years ago, figuring out what our workload was. We’re metadata heavy and we do iterations constantly. There are some game companies, different studios, that basically iterate once a day, right?

Everyone does all their work and then throws it into a central server. It does its computational stuff, and in the morning they get a new version of a level or a new version of the game. We strove to have as many iterations as possible of what you’re working on, constantly, if not live within the game itself, which allows you to make quicker, better, easier games. Now, obviously, NFS is very, very fast, especially when it’s configured correctly, but it just wasn’t doing what we needed it to. So we basically started having a bunch of SAN volumes running hash files and hash directory structures. So you had 0 to ff and so on from there. Very simple, because the name of the file basically is the directory path.
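[Editor’s note: the "0 to ff" layout is a classic content-addressed store: hash the content, and the hash spells out the directory path. A minimal sketch; the two-level fan-out depth here is a guess, not the actual Naughty Dog layout.]

```python
# Minimal sketch of a content-addressed hash store with "0 to ff" fan-out:
# the file's name *is* its path. Two levels of 256-way fan-out shown;
# the actual depth used isn't stated in the episode.
import hashlib
import os

ROOT = "/mnt/hashstore"

def store(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    # e.g. digest "3a7bd3..." -> /mnt/hashstore/3a/7b/3a7bd3...
    path = os.path.join(ROOT, digest[:2], digest[2:4], digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path

def load(digest: str) -> bytes:
    with open(os.path.join(ROOT, digest[:2], digest[2:4], digest), "rb") as f:
        return f.read()

print(store(b"compiled shader blob"))
```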

Justin Parisi: Really, the genesis of why that would be better is that with NFS, you’re sending a million GETATTRs back and forth over the network. Each one probably comes back very fast, but if there’s a million of them, that’s a lot.

Whereas with SAN, you’re doing all that pretty much on the client side. You’re not really working so much with that high metadata workload over a network. It’s happening all within the system itself. And it’s basically like a local disk. And that’s why it’s so much faster. You’re not traversing the network every time.
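[Editor’s note: the back-of-the-envelope math makes the point; the latency figures below are illustrative round numbers, not measurements.]

```python
# Illustrative arithmetic: why a million GETATTRs hurt over NFS but not
# on a local block device. Latency figures are round, made-up numbers.
calls = 1_000_000
nfs_rtt_s = 0.0005        # 0.5 ms network round trip per GETATTR
local_s = 0.00001         # ~10 µs for a cached local stat on a SAN LUN

print(f"NFS, serialized:   {calls * nfs_rtt_s:>10.0f} s")   # ~500 s
print(f"Local filesystem:  {calls * local_s:>10.0f} s")     # ~10 s
```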

Justin Monast: Oh, absolutely. And before I left, I was actually working on a system where everything was going to be SAN. I guess I should explain that we were using iSCSI for these volumes, and that was pretty darn fast anyway over TCP/IP. But the long term solution was to move everything to our own private SAN solution, or doing, I think it was the…

Justin Parisi: NVMe over TCP.

Justin Monast: NVMe over TCP, or NVMe over SAN. I think that was the first test I was going to do, and then maybe, when we felt our network was going to be able to handle it, go over to TCP/IP. SAN still has a place. I remember 10 years ago, someone told me, I don’t know, Justin, why are you using SAN? It’s just dead. It’s gone. I’m like, no, no, no, no. And it’s the same thing, actually, believe it or not, with LTO drives. I was talking to a manufacturer when we were getting some LTO 7s just a couple of years ago, and they said that sales had rocketed because cloud based storage was getting too expensive and it was cheaper for people to go back to tape. Which is interesting, because everyone said tape was too expensive and cloud was cheap. Cloud gets expensive. People go back to tape.

Justin Parisi: Tape is dead. No, it’s not.

Justin Monast: Tape’s not dead. SAN’s not dead. Tape’s not dead.

Justin Parisi: I think what happens is… people get creative, and they find new use cases, and new technologies come out, and networks get better, and you find new life in these old things.

Justin Monast: No, absolutely. And at a certain point, I think something’s going to have to happen, because there’s only so much density in disk that you can actually squeeze out. And the same thing with tape. I think I remember hearing LTO 9 was going to need a very specific temperature set for the library itself when it’s running, because the tolerances were such that the temperature rising by a couple of degrees would cause expansion, changing the size of the tape itself, which would then change where the data is supposed to be written or read by the head unit. That’s how tight the tolerances were going to be.

Justin Parisi: That sounds pretty stressful. Worrying about whether your tape is going to expand or shrink while it’s backing up.

Justin Monast: Yeah. But then you would have half a terabyte, or half a petabyte, on a tape, or something like that. So you balance that out.

Justin Parisi: What about doing things like SnapMirror to object storage? What sort of use cases do you see for that? Or do you see anything valuable there?

Justin Monast: So we talked about this. What I would love to see NetApp have is the ability to actually have clones of StorageGRID devices. The StorageGRID device itself is just an object, and you say, well, I have one on prem and I have one somewhere else, and they both make copies of themselves; or your primary object storage can back itself up to another one somewhere else.

And I don’t think that technology was there yet when I was talking to my salesperson, but that’s where I’d like to have gone. I wanted to have as much on prem as possible, but then clone as much as possible to somewhere else, to some other data center.

Justin Parisi: Yeah. Your 3:2:1 backup strategy type of stuff.

Justin Monast: Yeah. And really, going back to NetWorker, that wasn’t a good long term solution anyway, because what it really comes down to is that as an IT manager or director, you need to start taking a look at all the data that you have. We had petabytes and petabytes of stuff, and very few people remember where that stuff is located. And I believe NetApp is doing some AI stuff, right, that might help with finding out where things are? Or am I incorrect on that?

Justin Parisi: So there’s BlueXP, which is kind of like the front end to the System Manager type of stuff. It’s a new interface into the system where you can manage multiple systems, and it’s cloud based. And I know that has a backup solution that also does indexing, and I think what you were talking about is fast finding of files: being able to have an index of files, so when I do a search, it doesn’t take ages on the metadata; instead, it knows where things are right away. And that’s one of the main challenges of restores. Where is that file? Well, let me do a find *. No, don’t do that.

Justin Monast: I can’t imagine really thinking about all the data that’s around this world, right? Where it’s all going to be located…

Justin Parisi: Yeah. And that’s the problem, even at your house, right? Like, I know that I have tons of photos. I am terrible at naming them, organizing them.

If I want to find a photo from a trip five years ago, I’m never going to find it. All right, Justin, thanks so much for joining us and talking to me about your NetApp experience and your Naughty Dog experience. There’s lots of good information there. If you haven’t worked in the gaming industry, you don’t necessarily know all the ins and outs.

So it’s interesting to learn that. If we wanted to get in touch with you to ask you more questions, how do we do that?

Justin Monast: LinkedIn is the best way to do that.

Justin Parisi: Okay. And you mentioned your last name is pronounced Mona, but I guess you’re dropping the S-T there; it’s spelled Monast.

Justin Monast: Yeah, it’s one of those weird French things.

It was originally Mona, then they added an S-T, but they still pronounce it as Mona. So I go either way, Monast or Mona; it doesn’t really matter. But if you were to look for me on LinkedIn, it’d be M-O-N-A-S-T.

Justin Parisi: Cool. And we’ll include that link in the blog that accompanies this podcast. So again, thank you so much for joining us and talking to us all about your experiences in the gaming industry, as well as the storage aspect.

Justin Monast: Right. It’s been wonderful. From one Justin to another. Thank you very much.

Justin Parisi: That’s right. Dual Justins.

Alright, that music tells me it’s time to go. If you’d like to get in touch with us, send us an email to podcast@netapp.com or send us a tweet @NetApp. As always, if you’d like to subscribe, find us on iTunes, Spotify, Google Play, iHeartRadio, SoundCloud, Stitcher, or via techontappodcast.com. If you liked the show today, leave us a review. On behalf of the entire Tech ONTAP Podcast team, I’d like to thank Justin Monast for joining us today. As always, thanks for listening.

Podcast Intro/Outro: [Outro]

