Welcome to Episode 367, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”
!!WARNING: OLD SCHOOL IT AHEAD!!
Back in the early 2000s, if you were a sysadmin who had to set up desktops or servers, you probably used Norton Ghost to create OS images.
When virtualization became big, that process moved to creating VM templates.
But there was still often a lot of manual intervention in these tasks – password setting, BIOS updates, vCenter installations and more. As the IT world moves more and more towards automating software tasks, what about automating infrastructure provisioning? Sure, there’s cloud, but what if you’re still on-prem?
This week, Rob Hirschfeld (@zehicle) of RackN joins us to discuss how RackN helps automate the previously un-automate-able.
For more information on Generative DevOps, see:
Finding the Podcast
You can find this week’s episode here:
I’ve also resurrected the YouTube playlist. You can find this week’s episode here:
You can also find the Tech ONTAP Podcast on:
I also recently got asked how to leverage RSS for the podcast. You can do that here:
http://feeds.soundcloud.com/users/soundcloud:users:164421460/sounds.rss
Transcription
The following transcript was generated using Descript’s speech to text service and then further edited. As it is AI generated, YMMV.
Episode 367: Infrastructure Automation with RackN
===
Justin Parisi: This week on the Tech ONTAP Podcast, Rob Hirschfeld of RackN joins us to talk to us all about the joys of automation.
Podcast Intro/Outro: [Intro]
Justin Parisi: Hello and welcome to the Tech ONTAP podcast. My name is Justin Parisi. I’m here in the basement of my house, and with me today I have a special guest to talk to us all about cloudy stuff. So we have Rob Hirschfeld. So Rob, what do you do and how do I reach you?
Rob Hirschfeld: I don’t know if I can match your energy level, but I am going to try.
My name is Rob Hirschfeld.
Justin Parisi: I bring the energy, you bring the knowledge.
Rob Hirschfeld: I am CEO and co-founder of RackN. I’ve been in the cloud and related data center automation space for 25 years, going back to an early startup I did where we literally filed the first patents on cloud technology. We were early ESX and Xen users, incredibly early virtualization users.
And then in my journey, I’ve been involved in API design and automation, and now data center automation at RackN, which is going on almost nine years now. We started with this idea of making automation much more like a software product, commoditizing a lot of that knowledge of how people use automation and reuse automation, because what we saw going on in the cloud was all this cool infrastructure and all this API-driven everything, but people kept reinventing the wheel when it came to building the automation to drive it.
And that to us was really frustrating. It’s sort of something that I look back on my whole career, going back to the first times we were doing cloud work and trying to standardize and create more repeatable, more usable patterns out of all that. So that’s what RackN is.
Justin Parisi: Yeah, it feels like the automation stuff has leveled off a bit, but for a while there it felt very much like Betamax versus VHS or LaserDisc versus DVD, right? Where you had a lot of options, but you didn’t really know what the best option was. But today, you see more standardization and fewer overall options.
Mm-hmm. But they still are floating out there and there’s still a lot of companies that do it as a roll your own type of situation. So from your experience, what are you seeing out there for automation/standardization?
Rob Hirschfeld: Unfortunately, I don’t. I mean, what RackN does tries to flip some of these stories on their heads. I think that we have seen a lot of standardization, like around Ansible being a dominant configuration tool and Terraform being a dominant provisioning tool. They’re certainly not the only tools that do it, but neither of those tools was really designed to have the automation be reusable.
And if I’m writing a Terraform plan, it’s really, really hard to make that plan usable across multiple teams in my organization, or to pull a plan from the community and have it just work. And Ansible playbooks are a little better, but most people, as soon as they get a playbook, they look at it and they either tear out the things they don’t need, or they make a copy and then it forks.
And then what happens with all that stuff, and this is where I don’t think we have the standardization I want, is if I fix a playbook or a plan and I make it better, it’s not composable, it’s not broken into reusable pieces, so I can’t share that with anybody easily.
And that, to me, is a tools problem. We have a lot of desire to do shared work, or some desire to do shared work, maybe. But it’s been really, really hard to create repeatable operational success in any real scalable way.
Justin Parisi: Okay. Well, tell me about RackN and how you accomplish that.
How do you get that repeatability for those automation tasks and are you automating the automation?
Rob Hirschfeld: We do automate the automation. Trying to create a new version of Ansible, or a new version of Terraform, or, since we also do a lot of bare metal, a new version of the Dell, HP, Lenovo, Cisco tooling, just makes no sense for us.
But what we do see is that a lot of the things that people build are very similar. They’re building the same components, but in slightly different ways. And what we looked at was how you make those tools connect together better. So think of it this way: if you’re building a Terraform plan, there’s a whole bunch of information that you put into that plan that you could actually templatize and abstract as variables or state information.
And once that Terraform plan has been built, there’s a lot of information that comes out of that plan that you want to be able to then pass into an Ansible playbook, or into another script, or into an inventory tracking system. What we actually work to do is make the things that call the tools standardized, and then make it really easy to inject information into that action and then pull information out of that action. That’s what most people would call a workflow. It lets you connect all these pieces together, and that creates in itself a ton of magic. And then we went a step further: any one of those tasks is actually designed to be an immutable object, versioned, and then connected together as separate pieces.
So all the Terraform automation that we build is actually in a Terraform content pack. And it’s versioned, and it can be upgraded and patched and changed in a standardized way, but the templates that get called, you could put in a different content pack and then make those just your own.
So you’re bringing in community stuff, but also bringing in your own custom pieces. And then the system is designed to let those fit together, like snapping together Lego blocks in a lot of ways. Then once they’re in production, it’s all immutable. So you don’t have somebody sneaking in behind the scenes and tweaking a script.
You have a dev process that lets you create a template, create a content pack, bring that into a system to test, pick it up, move it to a production system, copy production sites to production sites, and know exactly what versions of all the automation have been deployed.
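To picture the information hand-off Rob describes, Terraform outputs feeding the next tool in the workflow, here’s a minimal generic sketch in plain Terraform and Ansible. This is not RackN’s actual tooling; the `web_ips` output name and `./infra` directory are hypothetical.

```yaml
---
# Sketch: pull outputs from an applied Terraform plan and feed them into
# an Ansible play. Assumes `terraform apply` has already run in ./infra
# and the plan defines an output named "web_ips" (both hypothetical).
- name: Bridge Terraform state into Ansible
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Read the plan's outputs as JSON
      ansible.builtin.command: terraform output -json
      args:
        chdir: ./infra
      register: tf_out
      changed_when: false

    - name: Add the provisioned machines to the in-memory inventory
      ansible.builtin.add_host:
        name: "{{ item }}"
        groups: provisioned
      loop: "{{ (tf_out.stdout | from_json).web_ips.value }}"

- name: Configure the machines Terraform just created
  hosts: provisioned
  gather_facts: false
  tasks:
    - name: Confirm each new machine is reachable
      ansible.builtin.ping:
```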
Justin Parisi: So give me an example of a workflow that would fit into this type of automation process. I understand that automation in general is for eliminating or reducing the mundane, tedious tasks that administrators loathe doing. Right? But gimme some examples of workflows of where RackN can help improve the lives of these people.
Rob Hirschfeld: Yeah. I’ll start with a sophisticated one, but it’s one that has a lot of people flummoxed and that we do a lot of, which is installing ESXi servers and then building them into vCenter clusters. In order to do that work, to get a working vCenter cluster at the end of your journey, you actually start from a raw bare metal server, right? When that thing first boots, you have to inventory it, check that it got shipped to you correctly, scan it, make sure everything’s right, that it’s connected to all the ports and switches that it’s supposed to be connected to. That’s a workflow, a standardized task. But then that automatically connects into a BIOS and RAID configuration test. So you detect what piece of hardware you have, you look up the profiles that you’re supposed to have on that machine, and you apply that configuration to the machine for its use. All of that is driven by configuration at the start, or discovered and then pulled from a library to figure out exactly what it should be set to.

So that’s a whole bunch of automation, but you’ll notice that the discovery piece gathered information that got fed into that next piece downstream, the BIOS and RAID setup. Then you chain to another piece of automation that talks to the out-of-band management system and maybe sets credentials on it. It sets certificates so it can be managed, builds all that stuff up to the spec, and then sends that information out to a CMDB so it’s all tracked and updated in the right places. Maybe it brings in credentials from Active Directory so that you’re getting the right credentials on that box. And we still haven’t even gotten to installing ESXi.

Then you have to lay down the ESXi pieces. To make ESXi work, you actually have to log into that system and set credentials for it, passwords. You might have to change the quality of the passwords that are required. We deal with banks a lot, and they actually have to reset it, but that command can’t be done by remote. It has to be done on the system. All of these individual actions have to be added up.
All of the things I’ve described are out-of-the-box functionality for RackN. So you can start a pipeline that connects these workflows together, and you get all that behavior. Once ESXi gets set up, the system actually is aware of that, it checks in, and it can then run a process that does Cloud Builder, taking all the machines and all the information you’ve discovered. At the cluster level, it waits until the machines are ready, and then it starts a process that says, oh, I know what my machines are. I know their MAC addresses and their passwords and all this stuff. And then it injects that into Cloud Builder, which then builds the vCenter cluster.
None of that is particularly differentiated, certainly not value-adding for any company. It’s not that different, but I can tell you every customer has slightly different requirements. And so what we’ve been able to do is take those standard processes and allow customers to inject “oh, I need to do this one step” replacements inside these pipelines, custom steps that stay within the whole pipeline. So as we upgrade the pipelines, as we constantly improve and add and evolve, or VMware changes versions, or the hardware changes, which constantly happens, you can bring in the appropriate components for each one of those, plug ’em all together, and still have this working end-to-end process.
The customer doesn’t have to do anything but take care of the little pieces they need to add. It’s not exactly abstract, cuz it’s not an abstraction, but the way the automation connects together, all of those tools work in concert and they flow information across them.
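As an illustration of the pipeline Rob walks through, here’s how those stages might be written down as data. To be clear, this is not Digital Rebar’s actual syntax, just a generic YAML sketch of the stages and the information that flows between them:

```yaml
# Illustrative only: not Digital Rebar syntax. A generic sketch of the
# ESXi/vCenter pipeline stages, showing how each stage's outputs feed
# the next stage's inputs.
pipeline: esxi-vcenter-cluster
stages:
  - name: discovery          # inventory the raw server, verify cabling
    outputs: [serial, mac_addresses, switch_ports]
  - name: bios-raid-config   # look up the profile for this hardware model
    inputs: [serial]
    outputs: [firmware_versions]
  - name: oob-management     # out-of-band credentials and certificates
    inputs: [serial]
  - name: cmdb-register      # push discovered facts to inventory tracking
    inputs: [serial, mac_addresses, firmware_versions]
  - name: esxi-install       # lay down ESXi, set local passwords on-box
    inputs: [mac_addresses]
    outputs: [esxi_hosts, credential_refs]
  - name: cluster-build      # gate until all hosts ready, then Cloud Builder
    gate: all-hosts-ready
    inputs: [esxi_hosts, credential_refs]
```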
Justin Parisi: That’s almost like a better, more robust Norton Ghost. Remember Ghost?
Rob Hirschfeld: I do remember Ghost. We used to use Ghost back, boy, this is going back 20 years, to build that first image and lay it down on the machines.
Yeah, no, it’s huge. But even with Ghost, when we would do that, somebody would have to log into each system and reset the password, and so all that stuff, you have to automate every one of those processes and steps.
Justin Parisi: Yeah. It’s like I said, a better, more robust version of that. But, you know, it sounds simple at its root, but it’s very important, because I remember doing that and it sucked.
Rob Hirschfeld: The thing that we’ve learned is that at a lot of companies, it’s 15 to 20 systems, with protocols and APIs, that have to be coordinated, really orchestrated, to perform these operations.
And if you can’t go through one of those steps, or if it’s not reliable to go through those steps and you can’t synchronize them correctly, then the whole system falls apart. So it’s been one of those things of just constantly extending the map, if you will, of all those systems that you have to coordinate and orchestrate.
We’ve been doing this long enough that we’ve gotten to a point where I think we’ve covered pretty much everything. Sometimes people show up with some new, interesting hardware widget requirement; then we incorporate that into the product, it becomes a standardized piece, and for the next person to ask, it just works out of the box.
Justin Parisi: Repeatability is one aspect of that. The other aspect of that is gonna be scale, right? So when you’re dealing with, say, 10 servers, that’s not so bad. But when you’re dealing with a thousand servers, then that gets a little hairier. So what kind of scale does RackN provide? Like, what sort of benefits can it give you when you have a very large deployment to deal with?
Rob Hirschfeld: Yeah, we have customers that are in the 25,000-plus machine range from a deployment scale. So we hit this quite a bit. And the funny thing is, even at 10, there’s two dimensions to scale with this. There are sheer numbers, but there’s also what I consider to be an even harder problem, which is your churn rate. You could have 10 machines, and if you’re re-imaging them every day to get a new image, or they’re part of a CI/CD cycle, those 10 machines are going through as much activity as a thousand-machine data center, maybe more. And so the scale dimension is two things, right? It’s, can I do the work across a lot of machines?
But it’s also, can I do it reliably in a highly repeatable system? That becomes a measure for it. And we see both. One of the things that we find critical is that if you don’t have a high reliability factor in what you build, then everything else you do falls apart. So the number one thing for scale, more than any other factor, is the reliability of the automation that you build.
And there’s a great story of one of our early customers. They came to us with a 10,000-machine HPC cluster. They had a two-hour window to reset the machines in that cluster on a rolling basis. So every weekend, they would do a quarter of the cluster, and they had a two-hour window they were supposed to do this in, cuz the system’s supposed to be busy. And they had a 20% failure rate, an 80% success rate, on the automation in that two-hour window. So they would have to reset the machines and bring them back up to standard, but two out of 10 of the machines didn’t complete in that timeframe, and it would take two hours before they’d find out there was a failure.
So they were never making their window, and worse, that meant they had to have a person watching to see what would happen and fixing that 20% dropout rate. It was a nightmare for them, and it meant the operators were working every weekend.
There was always a catastrophe. There were always issues. We were able to take that reset window and get the systems down into a 30-minute reset process, because we were able to sequence operations a little bit more dynamically. That’s an important thing. But even more importantly, we eliminated retries, and we actually fixed how things worked, and got the whole system operating up into the 99-plus percent range.
So not only faster, but the reliability meant that you were gonna hit that window every time, and that the people were gonna be willing to do the resets and push the button to make things go. Without focusing on reliability, you never get to scale. And one of the things that surprises people in this is that part of getting to that reliability is eliminating automatic retries in the system.
Justin Parisi: How do you do that? How do you identify that and how do you eliminate them?
Rob Hirschfeld: It took me a while to get used to how important this was, and we had to see it in production, cuz we’re really used to building automation where if it fails, we just go, oh, I’ll protect myself by trying it again. Especially in infrastructure automation, that pattern is very destructive, because a lot of times a retry will put a system into an even worse state than if it had just stopped when it hit a place where it didn’t know what to do. So when we’re building automation, if the system gets into a state where the code isn’t working, instead of trying to fix it or just banging on it, the automation stops.
Now, one of the things that’s really nice is that we’ve made it so that if you fix whatever the issue is, you can continue from where you are. You don’t have to rerun everything; in Ansible, you have to rerun the whole playbook. For us, since we store all the state, you can just pick up at that one spot, that one task, and then continue on from there.
So we built a whole bunch of infrastructure to make it very easy to have systems stop when they don’t know what they’re doing or when they get confused. But what that also does is it means that you fix problems. So if you’re constantly running automation and it breaks at step six, rather than just keep trying step six or having the system retry it for you, we stop.
We say, wait, you better fix whatever caused this thing to break. And that actually builds up defense mechanisms in the automation really quickly over time. And so you start actually fixing root causes, or you get information, or you handle the exception correctly. And when you do that, then when things stop, it’s stopping when you need it to stop.
And it’s telling an operator, hey, I can’t continue because your network is misconfigured here, or I ran into something I didn’t expect, or I didn’t pass a check. We have a ton of diagnostics and checks that are now routine parts of how the system operates, and they’re incredibly valuable, because that means that the system is gonna stop before it breaks something.
And with infrastructure, you do break things. You know, you can take a system off a network, or put a password on that nobody can access, or you could actually break it if you put the wrong BIOS on it. So there are consequences to not fixing that. But that is the secret to scale: having the discipline to make sure that your automation is doing the right thing and, as you improve it, continues to do the right thing.
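Plain Ansible can approximate that stop-and-fix discipline, even without Digital Rebar’s stored state for true mid-run resume. A rough sketch, with hypothetical playbook and task names: assert preconditions up front, fail with an actionable message, and use `--start-at-task` as a crude manual resume instead of wrapping steps in retries.

```yaml
---
# Fail-fast sketch: assert preconditions and halt with a clear message
# instead of retrying. Playbook and task names are hypothetical.
- name: Provision with fail-fast checks
  hosts: all
  any_errors_fatal: true        # one host failing halts the whole play
  tasks:
    - name: Verify the network looks the way we expect
      ansible.builtin.assert:
        that:
          - ansible_default_ipv4 is defined
          - ansible_default_ipv4.gateway is defined
        fail_msg: >-
          Network misconfigured on {{ inventory_hostname }}; fix it, then
          resume with: ansible-playbook site.yml
          --start-at-task='Install packages'

    - name: Install packages    # the step people usually wrap in retries
      ansible.builtin.package:
        name: nginx
        state: present
      # deliberately no retries/until loop: if this fails, fix the cause
```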
Justin Parisi: Is there any sort of machine learning involved here? Are you leveraging that type of technology to help your automation succeed even more by learning from the mistakes that you might make down the line?
Rob Hirschfeld: Not as much yet. We are expecting some pretty radical things to start showing up as some of the new AI techniques come in. One of the things that makes it very hard to do machine learning here is that a lot of people don’t wanna share the data to do the machine learning.
And it’s very hard to build a closed-loop feedback system to train the models, because there is a lot of bespoke information here. So people get very excited about AIOps, which can help you with the system. With the stuff we’re doing, where you’re actually building the configuration, we haven’t yet found a lot of machine learning that can do that tuning for you. But I think we’re on the eve of having some more of what I would call generative DevOps capabilities across the industry, not just for RackN. We’re very excited about it.
Justin Parisi: It’s like the natural progression, right? We go back to our Norton Ghost example. Norton Ghost wasn’t gonna stay Norton Ghost forever. And now we have more automation built in that takes care of a lot of the stuff that we didn’t like about that. So now we need to automate the un-automateable, the things that might break that we don’t realize are breaking constantly, that the machine can find faster than we can and then store it into memory and say, okay, I recognize this pattern.
Let me go ahead and address that for you.
Rob Hirschfeld: This is where just even finding the patterns gets to be really powerful. The thing that made this hard before was that we kept rewriting the automation. So it was very hard to create a learning pattern if your playbook was not the same as somebody else’s playbook, right?
The first thing that we wanted to do is make sure that we could actually reuse each one of those pieces. And there’s a machine learning piece here, yes, but I would put human learning as the higher priority. What we focused on first was being able to say, if I fixed a bug with a script that did an install and made it more reliable, which is our goal, what I need to be able to do is take that, put it back into the shared library, version it, and then make it so that everybody who’s using that code could confidently take that update and pull it into their systems also.
That’s human learning. That’s not even machine learning; it’s just being able to get the advantage of a community effect, where as the code improves across all of the community, we actually get, you know, the “all bugs are shallow” type of idea. We can go in and say, oh, you know what, this is a bug in how we set up ESX, or an improvement to make it faster. If you can come back and say, for one person, one customer, we’ve improved things, pull it back into the code base, and put it in the next version, then when you upgrade to the new version of the automation, you will get that benefit. That is the thing that really changes the dynamics here, because you’re getting the benefit of other people exercising the code. One of my favorite things to talk about is that complexity is a fact of life for infrastructure. We’re not gonna get out of infrastructure complexity by having less infrastructure or having fewer types of infrastructure. That cat’s already out of the bag. And so the only thing you can do with complexity is accept it and then defend against it.
And the defense for complexity is exercise, meaning the more you use the automation, the more you set up and tear down and go through that process, the stronger and more resilient your infrastructure is gonna be. And by making code reusable across different sites, across different companies, you actually get a community effect on the exercise.
And that’s incredibly powerful.
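In Ansible terms, one way to get that “version it and confidently take the update” behavior is to consume shared roles from a versioned requirements file instead of copying them into your repo. A small sketch; the repo URL and role name are hypothetical:

```yaml
# requirements.yml: consume shared automation as versioned dependencies
# rather than forking it. Install with:
#   ansible-galaxy role install -r requirements.yml
roles:
  - name: esxi_install
    src: https://git.example.com/ops/ansible-role-esxi-install.git
    scm: git
    version: v2.3.1   # bump this tag to pull in community fixes
```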
Justin Parisi: So that brings us to another aspect of this, and you kind of got me thinking about it when you started talking about the impact on the end user and the impact on the admin. So there’s a human element here, right? And the human element is admins have a job, and now when you automate, they don’t have as much of a job. Like, that’s the way they feel, right? But the reality is not that. So talk to me about that.
Rob Hirschfeld: Yeah. One of the things, especially as we talk about some of these new generative AI capabilities and AI in general, is that there’s so much expertise in the operators and the admins that they don’t get to leverage, because they end up spending a lot of time writing a script or, you know, just doing toil work. Our experience has been, very definitively, that no company has looked at us and said, we have enough operators. Especially as you get higher in the architectural-level skills, those people have so many more projects, so much more work, than they can do.
If you could take out writing scripts to set up BIOS, or install operating systems, or deal with a provisioning operation, or not even that, but just make it more reliable so they’re spending less time troubleshooting that work, then they get to actually look at how the systems work.
They could focus on security, they could focus on performance improvements. There’s so much opportunity for operators to move up the stack if they can be relieved of that burden of, “oh wait a second, the script I have keeps breaking. I’m gonna have to log in.” That’s hugely disruptive work. And so we can take some of that away from people, and you don’t even need AI to do this. This is what’s so funny. Just the software without the AI has so much embedded knowledge and expertise that you can actually elevate people’s roles to being much more effective: thinking about architecture and performance and AI, and even just helping customers and listening to what customers need, internal customers or external customers.
I’m really not seeing that type of thing. I was joking with somebody just this morning about this idea of a 10x operator. We’ve all heard about 10x developers, who are way more productive than average. I do think AI might unlock this idea of a 10x operator, where a lot of the things that used to take a ton of time, building Ansible playbooks or debugging something, we might actually speed up dramatically, and let people really focus on less of the toil of those jobs.
Justin Parisi: Yeah. And earlier we talked about addressing the mundane tasks, like the system setup, the machine setup. It’s also mundane to create an Ansible recipe, right? It’s mundane to do the automation tasks. So if you can take away that mundane task, now you free them up for other, less mundane, more interesting tasks, more architecture-based stuff.
Rob Hirschfeld: Even doing things like starting to get some advice on, I wanna build this in Ansible.
What are some alternatives? Is this well structured? I’ve been playing with this generative DevOps concept where I will ask ChatGPT for this. There’s a lot of alternatives, but I’ll say, hey, build me an Ansible playbook that does this. And then I’m like, yeah, but now I want you to put those things in roles. And it does that. And so you can go through the process and really change it. It was fun. As part of this exercise, I actually asked it to convert a Terraform plan into CLI calls, and it did a really nice job of converting Terraform into direct bash and CLI calls to make everything go.
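For reference, the kind of role-structured playbook Rob describes asking for looks something like this; the role names and host groups are hypothetical placeholders, and the roles themselves would still need to exist:

```yaml
---
# A flat "web server backed by MySQL" playbook refactored into roles.
# Role names and groups are hypothetical.
- name: Web tier
  hosts: web
  become: true
  roles:
    - common   # baseline packages, users, hardening
    - nginx    # install nginx and template the vhost config

- name: Database tier
  hosts: db
  become: true
  roles:
    - common
    - mysql    # install MySQL, manage credentials, create the schema
```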
So it opens up all these interesting ways that somebody with operational expertise, instead of being replaced, is actually coming back and being like, oh, wait a second. I can review my playbooks. I could refactor things more effectively. I could look at converting something from one platform to another as a test criterion. One of the things I’m excited about, but we haven’t had a chance to play with yet, is actually letting the AI do some of the exercise. So you could take your infrastructure and say, my infrastructure’s idle at night; I want a script that deploys all the variants of the operating system, runs the security scan on them, and then tears them down. We’re getting to that point. But that’s an example where a human could actually look at the system and say, “Wait a second. I actually want more scanning here and more security.” You still have to prompt these systems to know what needs to be done. That knowledge will probably be augmented if you’re willing to ask the AIs for some help, or you’re coming back to a vendor like RackN and being like, I wanna follow best practice. What do you see as best practice? There’s so much expertise around; we just don’t do a good job of tapping into it in infrastructure.
Justin Parisi: So what are some of the best practices that you see when you’re trying to be effective at automation? What are you telling your end users and your customers about the best ways of doing this whole process?
Rob Hirschfeld: The number one ROI we get is a dev/test/prod cycle. So, this idea of testing in production is bad, but a lot of people don’t have ways to do automation dev/test/prod and actually go through that cycle. And the reason they don’t have it is cuz it’s been very hard to lift automation reliably from one environment to another.
So you end up with bespoke dev, bespoke test, bespoke prod. But what we’ve been able to do with the composability of the automation is you can actually reuse the same workflows in each environment. And that means that you can test, you can build a test cycle, you can put things in Git, you can actually do sort of a real test process.
And then take that code and then move it into a pre-prod environment, test it, make sure everything runs, and then move that into production. And then what’s even better is then your production sites become high fidelity between each other and you have a lot less variation between the production sites also.
That makes a huge difference in sleep time for operators, which is actually a reasonable KPI for people to be measuring. Right? How many times are you waking up operators at night? Or how stressed are they, going into production, that things aren’t gonna go? And the only way you fix that is by testing and exercising it.
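A minimal sketch of that promotion model in plain Ansible terms: one playbook moves unchanged through environments, and only the inventory differs. The directory names and `env_name` variable are hypothetical, set per environment in each inventory’s group_vars:

```yaml
---
# Same playbook, three inventories; promote the playbook unchanged.
#   ansible-playbook -i inventories/dev  site.yml   # develop here
#   ansible-playbook -i inventories/test site.yml   # CI runs this
#   ansible-playbook -i inventories/prod site.yml   # only after test passes
- name: Environment-agnostic provisioning
  hosts: all
  gather_facts: false
  tasks:
    - name: Show which environment this inventory defines
      ansible.builtin.debug:
        msg: "Running automation in {{ env_name | default('unknown') }}"
```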
So, that’s a top one for us. There’s another one that I think is really important for people to think about, which is transparency in how things operate. And this is a little bit of a catch-22. You know, ideally once the system’s running, a lot of times you can boil it down to on/off, green/yellow/red type thinking.
But for any system, as it’s running, and if it has errors, how easy it is to find and see what the system’s doing makes a huge difference in how well you can maintain and support it and keep it going. And so you really need to think through, can I watch the heartbeat of my system?
Can I see those logs? Can I watch stuff happen? Do I get good feedback as things are going on? Does it break a big task into small tasks that I can then track and put through performance monitoring? That type of transparency in how the system works is really, really helpful for having systems that are maintainable and scalable. Ultimately, that may feel a little bit more complex if you’re new to that system, but if you’re walking up to a system and having to maintain and fix it, it’s a breath of fresh air to have that type of transparency.
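One way to read that advice in Ansible terms: prefer many small, named tasks over one opaque script, so the run log itself becomes the heartbeat. A sketch with illustrative package and URL values:

```yaml
---
# Transparency sketch: break "install the web server" into small named
# steps so progress, timing, and failures are visible per task.
- name: Observable web server setup
  hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Start and enable the service
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

    - name: Verify the server answers locally
      ansible.builtin.uri:
        url: http://localhost/
        status_code: 200
```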
Justin Parisi: Yeah, it’s good to be able to see what you’re getting yourself into before you get yourself into it.
Rob Hirschfeld: I don’t know how many people like a tachometer nowadays, right? With an automatic car. But if you’re driving and you’re getting data back from your engine, it can really help you know if your engine’s performing well or not.
It’s the same thing with any infrastructure and operations tool. Unfortunately, we haven’t built a lot of that into most of the tools that we have, especially the ones that are sort of desktop tools that have morphed into services. And so we end up with a lot of black boxes. I start a process, and if it fails, I know that it failed, but it’s very hard to figure out what’s happening in the middle. It goes back to the whole retry thing, right? Stop in the middle, tell somebody it broke, show them where it broke, let them fix that problem, and then continue.
Whereas if you’re presented with, oh, it didn’t work, let me try and tear things down, hope I tear everything down, and then restart, there’s a lot of orphaned infrastructure because of that. And that’s not good. It’s very hard on operators. They spend a lot of time chasing down partial success or partial failure.
Justin Parisi: So you touched on ChatGPT and OpenAI earlier. What is your experience with that so far? I know there’s a lot of mixed feelings about how well it works, how accurate it is. Does it do a pretty good job with this particular field? Or is it something that still needs some work?
Rob Hirschfeld: You know, this idea of generative DevOps and applying AI to the automation and infrastructure, I really do not see a lot of people talking about, which surprises me. It’s a lot more marketing copy and dev coding and things like that. I think infrastructure is harder from that perspective, because you’re actually building something, and so you can write a script, but then you have to go test it to see if it actually works.
You have to see if systems actually come up; if it’s like Ansible, you need a machine to go run that Ansible on. And so I haven’t seen people playing with that as much. This is a really interesting thing, and oh boy, it’s hard to break down. I definitely believe that we will see these tools used to write automation, but without some of the framework pieces to test that automation and exercise that automation, I think there’s actually risk there. Let me be specific. I asked ChatGPT to write an Ansible playbook. Great, right? And this is happening. If you’re listening to this and you have operators and DevOps people or coders, they are using these tools. In our company we said, look, don’t do this under the surface.
We want to know when you’re doing it and how it’s working. I think companies that are not embracing these technologies are having people use them under the covers, and I think that’s absolutely worse. But here’s the scenario. So I say, developer, I need you to go set up a web server backed by MySQL on this Amazon site. Go. And they’re gonna plug that into ChatGPT or Bard or whatever, and they’re gonna get back a playbook that is going to look fine. They’ll probably test it, and it may work with only a little bit of tweaking. But fundamentally? All set. The challenge here is that you now have a bespoke playbook, right?
Just like you had before, except that one was written by a human; this one actually might be written better than the person would’ve written it. But it still doesn’t necessarily conform to your standards or your compliance; you aren’t reusing that code, or that code might not be reusable. Now, what I haven’t figured out is if you care. If ChatGPT can generate workable Ansible code, you’re gonna have every person on your team generating workable Ansible code, with no idea of sharing it or checking it. And I guess eventually you’ll have an AI that’s gonna come and conform that, maybe? We’re not there yet. And so we run this risk of people using these tools without understanding what they’re really doing behind the covers. And I do not mean not understanding the model.
I mean you’re building a playbook, and auditing that playbook is something that humans do not necessarily do well. So they’re gonna start taking that work, putting it in production, and then being like, oh, I’ve finished my task, I’m gonna move on. And there’s no institutional knowledge of how it was built, why it was built that way, or what was factored into it.
Maybe that’s not a bad thing? It’s hard for me to judge, right? You have an operator who just took an hour to complete a task instead of two days. That’s remarkable. That’s a 10x operator right there. But I don’t know how those pieces fit into your operations strategy overall.
And I think that that’s missing. On the flip side, I think it might have companies completely rethink some of their infrastructure purchase decisions, because all of a sudden you might have somebody who’s like, oh, I never could finish that project that I needed to get done. Now they can knock out the project in a day and start the next project.
And so the operators that you have might actually be able to be much more effective. Or even more interestingly, you could be like, oh, I’d never go back to private cloud, because I don’t have the security expertise that Amazon has, and I don’t know how to scale it, and I could never install operating systems as effectively as Amazon.
And all of a sudden you’re like, wait a second, I can type in a couple of commands and get a really high quality automation system up and running, and then actually use it to review the code and help maintain things. So it could actually re-tip some of those moats that cloud providers depend on, their expertise moats, right? We’re better operators than you ever will be. And suddenly, enterprise operators might be like, oh, wait a second, I have access to that type of expertise? I don’t need to know it. Does that make sense?
Justin Parisi: Yeah, that makes sense. And it kind of makes you think about, what does this mean for the future of public cloud or hybrid cloud or both?
And how does the automation piece compare to doing something in the cloud? Is it still viable to do something in the cloud if you can just do it on your own on-prem? Or is there the consideration of CapEx and OpEx that really comes into play here?
Rob Hirschfeld: You know, CapEx and servers are relatively cheap, all things considered. With renting a server, in about a year or less of rent, you pay back the money that it would cost to buy it. So people are discovering cloud is pretty expensive. But that expertise barrier will make people be like, oh, wait a second.
I could spend that money, buy servers, and my one, now 10x, operator can manage my fleet. Where before, they were too busy re-flashing the BIOS, and I never got those jobs done. The thing I don’t think it does is turn us into super cloud users necessarily.
So maybe it turns us into more effective cloud users. But I haven’t done the math to figure out what a 10x cloud operator looks like. Maybe there is one also, where you’re writing scripts and building clusters and making things go more effectively with the cloud. But that could actually take away: I don’t need to use Lambda anymore.
I could install my own Lambda service on the VMs I have. We’re opening up a whole bunch of cans of worms here, and it’s not clear yet where things are gonna go. Except what I’ve been starting to do is look at where there are expertise-based barriers for things, and then starting to look at those as much more fragile towers than they used to be.
Justin Parisi: Yeah, I think cloud will be around, I think you’ll still see the burst workloads there cuz it is cheaper probably to burst to cloud than to on-prem.
Rob Hirschfeld: If you need fast, bursty work, cloud is amazing. Don’t get me wrong. What we see is that our best users for on-prem automation are the ones who are very successful in the cloud, have great automation, have great automation expectations, and know how to consume APIs.
Those are the ones that show up with us, and they’re like rockets going through. They’re like, oh, okay. I can automate all this infrastructure on premises and keep control of it. I’m delighted. And then they just take their working cloud strategies and move it on premises and they go like gangbusters.
The ones who are trying to stay out of cloud are, for us, not as effective, right? Because we’re all about infrastructure as code, dev processes, and APIs, and those are all cloud disciplines. And if you don’t have that, you need to build it up, or you’re not gonna stay on premises successfully. The people who are effective on cloud are gonna race circles around you, and they’re gonna do it on cloud, or they’re gonna be able to start doing it on premises.
Justin Parisi: As far as NetApp goes, we do both, so I don’t really care one way or the other. I mean, as long as you’re using our stuff, whether it’s cloud or on-prem, I don’t care.
Rob Hirschfeld: And the things that NetApp does definitely stay relevant and high value, right? Because AI is not gonna help you write proven, battle-tested storage infrastructure or a virtual machine management infrastructure. For a lot of these pieces, AI might help you write a script.
It’s not gonna help you build battle-tested infrastructure tooling. You still need those things. And that’s part of what the whole value proposition is.
Justin Parisi: But what things like RackN can do is take storage solutions, or any vendor’s solutions, right? I’m guessing you’re agnostic to vendors, and you can do everything under a single umbrella.
So tell me more about this. It’s called Digital Rebar.
Rob Hirschfeld: So Digital Rebar provides that framework to connect these infrastructure pipelines together. Fundamentally we are allowing people to build that end-to-end process, very declaratively, start a workflow, and then connect all those pieces together.
And you’re right, we have a lot of abstractions around different storage vendors, different compute vendors, different switch vendors, different operating systems. We’ve been able to make those workflows have places where you can substitute in those pieces or switch to a different workflow that’s appropriate.
So if it’s Windows or Linux or mainframe (it’s amazing what we bump into), the main thing that we’re helping you do is connect all those pieces together, because we find that the individual tools are great at their individual pockets of infrastructure. People are always like, ah, bare metal vendors, they’re not doing anything well.
And I’m like, no, they’re doing great stuff. They just don’t also do your IP and DNS. They do their thing, and then you need to connect everything else together. That’s where RackN comes in and adds a lot of value. And that’s all very day one sounding.
There’s a lot of stuff we do to take that same automation and make it day two, so that you can apply snippets of the automation to a running system, connect things together, do security scans, do patching, do audits, with the same automation. That’s really, really helpful to people, so they keep building on that reuse story that’s so important.
Justin Parisi: Where would I find it? Do I have access to free demos, or is it a try-and-buy type of situation?
Rob Hirschfeld: It is software. We are unique in that we actually sell Digital Rebar as a software product.
So our customers want the control of running things themselves. And with automation, that’s really important, because you don’t want somebody else to change your automation. You need that control. But yeah, it’s very easy. You can download Digital Rebar. There’s an integrated self-trial in the system, and most people within an hour are booting and provisioning systems, connecting to the cloud, and driving our out-of-the-box Terraform stuff to build machines in clouds.
So we’ve made it very, very easy for people to do self-trials. No money, usually anonymous. Operators like to come in, kick the tires a little bit, and see where things go. And we have a ton of videos and documentation, tons of ways to get started running with that.
And it’s just RackN.com, slash try it, for Digital Rebar.
Justin Parisi: All right, Rob. Well, thanks so much for joining us today and talking to us all about RackN, and how you can help us automate things a lot better and a lot easier, and get rid of that Norton Ghost instance we still have floating out there.
So Rob, how do we reach you?
Rob Hirschfeld: Justin, it’s been a pleasure to come in. These questions have been amazing, right? I love connecting all of the automation and the pieces together. I am online as Zehicle, Z-E-H-I-C-L-E; it goes back to my early electric car building days, and so that handle has been with me for quite a long time.
If people wanna reach out to me individually, that’s how you can find me on most platforms. And as always, RackN.com is a great front door to do all things RackN.
Justin Parisi: All right, cool. We’ll include all those links in the show notes and a link to RackN and your social media handles. And again, thanks for joining us and talking to us all about automation.
All right. That music tells me it’s time to go. If you’d like to get in touch with us, send us an email to podcast@netapp.com or send us a tweet @NetApp. As always, if you’d like to subscribe, find us on iTunes, Spotify, Google Play, iHeartRadio, SoundCloud, Stitcher, or via techontappodcast.com.
If you liked the show today, leave us a review.
On behalf of the entire Tech ONTAP podcast team, I’d like to thank Rob Hirschfeld for joining us today. As always, thanks for listening.
Podcast Intro/Outro: [Outro]