Behind the Scenes Episode 356: Kubernetes, Prometheus and ChatGPT w/ Natan Yellin

Welcome to Episode 356, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”



ChatGPT is the latest in a series of AI tools available to help people get their work done, but can it help you manage your Kubernetes cluster? Natan Yellin (Natan’s LinkedIn) from the Robusta open source project joins us to help us find out!

For a video created by Natan on this:

For the Robusta project:

For ChatGPT integration for Robusta:

Tech ONTAP Community

We also now have a presence on the NetApp Communities page. You can subscribe there to get emails when we have new episodes.

Tech ONTAP Podcast Community


Finding the Podcast

You can find this week’s episode here:

I’ve also resurrected the YouTube playlist. You can find this week’s episode here:

You can also find the Tech ONTAP Podcast on:

I also recently got asked how to leverage RSS for the podcast. You can do that here:


The following transcript was generated using Descript’s speech to text service and then further edited. As it is AI generated, YMMV.

Episode 356: Kubernetes, Prometheus and ChatGPT w/ Natan Yellin

Justin Parisi: This week on the Tech ONTAP podcast, Nate Yellin joins us to talk about Kubernetes administration, Prometheus observability, and ChatGPT integration.

Podcast Intro/Outro: [Intro]

Justin Parisi: Hello and welcome to the Tech ONTAP podcast. My name is Justin Parisi. I’m here in the basement of my house and with me today I have Nate Yellin. So Nate, what do you do? How do I reach you?

Natan Yellin: Hi. I guess I’m most known on LinkedIn as that guy who posts a lot about Kubernetes and all things observability and Prometheus related. And you can find me on LinkedIn, you can find me on Twitter, @aantn and you can also just send me an email, natan –

Justin Parisi: So yeah, I did find you on LinkedIn. I saw the posts that you had, and we’ll talk more about those posts. So let’s just start off with, where did you first get into Kubernetes?

Like, what made you interested in it and how did you learn more about it?

Natan Yellin: So I come from a cybersecurity background. I worked at companies like Check Point, worked on firewalls, worked on a lot of cybersecurity stuff, been a programmer for many years. And a few years ago I went to a startup to work on your average cybersecurity project.

And we wrote a firewall for Kubernetes. And at the time, despite having a lot of knowledge about network security and low level stuff, I actually knew very, very little about the cloud. And I remember I did something very embarrassing that I probably shouldn’t admit to, given that I’m now someone who writes a lot about Kubernetes.

So I probably shouldn’t admit to this, but I had a very embarrassing moment in my first week or so of the work where I was running this thing on Kubernetes, and I had some pod I had deployed and I wanted to stop running it. So I kept on trying to delete the pod. And of course Kubernetes kept on recreating that pod, cuz there was a Deployment that was controlling it.

And I couldn’t figure out what was going on. I was like trying to delete this. And then Kubernetes would bring it back up. And I realized I didn’t actually know that much about how Kubernetes worked. And if I wanted to do my job and even to write network software and to write a firewall and do the security related stuff, I actually had to dive in and really learn Kubernetes first.

And that started a long and interesting journey.
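For readers who want to see that reconciliation behavior for themselves, here is an illustrative kubectl session (the deployment and pod names are hypothetical, and the output lines are what you would typically see, not a captured run):

```shell
# Delete a pod that is managed by a Deployment...
kubectl delete pod my-app-6d4b75cb6d-abcde

# ...and the Deployment's ReplicaSet immediately recreates it with a new name
kubectl get pods

# To actually stop it, delete (or scale down) the owning Deployment instead
kubectl delete deployment my-app
# or: kubectl scale deployment my-app --replicas=0
```

The recreation happens because the Deployment controller continuously reconciles the observed pod count with the declared replica count, which is exactly the behavior Natan ran into.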

Justin Parisi: Yeah, when I was working more with Kubernetes, I was doing similar things, like trying new commands, trying to do things that are pretty straightforward and simple in most environments. But what I realized was I just need to build one of these and see how it ticks, and I wanted to do it from scratch and not use one of the cloud engines, because that’s just too easy.

You just basically just run a command and it’s done. But yeah, really doing it from scratch, I think, teaches you a lot about how it works, how all the pieces fit together. So did you take a similar approach? Did you just build a cluster and just start pounding on it? Where did you find your information?

Natan Yellin: So I started reading through a lot of the documentation and I started to play around with the APIs. So I’ve always approached it kind of from a programmer perspective, right? Okay, what’s the API? What happens when I call this API? And then what is the API actually doing behind the scenes as well? And how can you stretch this, where are the limits and where are the breaking points?

And then when you’re writing software that uses the Kubernetes API and you’re interacting with the API server, you also start to hit various boundaries performance-wise. And then you have to really start learning the internals. So I guess I approached it not from the principled, bottom-up approach that you described, but more from the approach of someone who’s writing software for this and then kept reaching different places where that software was breaking. Not due to my own software, but due to the pieces I was interacting with.

Justin Parisi: Okay. So there’s a lot of different ways to kind of skin that cat, right? Kubernetes itself, learning it can be as complex as the infrastructure piece.

So with that said, it sounds like it’s something that’s kind of hard for admins that are running this to get their heads around. So what would you say are some of the biggest challenges for Kubernetes admins?

Natan Yellin: So the first thing is that it’s fundamentally different than things that you’re used to using before.

I think Kubernetes is worth learning for everyone. And I argue about this sometimes with people, whether Kubernetes is a good choice for small startups or not. And I think by learning Kubernetes, you do have to learn a lot of new stuff, but it really simplifies all these common patterns about how you deploy your software, how you do blue-green deployments and how you do canaries and how you do all these sorts of things. You suddenly gain this power to do all these sorts of things that previously were very hard to do outside of your own custom-designed tooling. But the challenge with all that is there’s so much to learn at the beginning, and I think the two major challenges for people getting started with Kubernetes are, one, to figure out the right path to learn it incrementally. So to not drown in too much information, too much new stuff to learn at once, to be able to get up and running with a good, simple, basic thing that works and that satisfies your company’s business needs. And then to learn more as you go on, as it’s appropriate.

And then the second challenge is also around the observability and the monitoring of all that. And then of course, the third challenge in the world that I came from is Kubernetes security.
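As one concrete example of the deployment patterns mentioned above, a rolling update in Kubernetes is just a few lines of declarative YAML; the names and image below are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # at most one extra pod during a rollout
      maxUnavailable: 0   # never drop below the desired replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example.registry.local/my-app:1.2.3
```

Changing the image tag and re-applying this manifest triggers the rolling update automatically, which is the kind of pattern that previously required custom tooling.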

Justin Parisi: The monitoring piece is interesting cuz if you think about Kubernetes, it’s built for scale, being able to deploy very large applications.

And not only that, they can auto scale. So now as an admin, you’re kind of chasing the rabbit down the hole. You’re having a hard time finding where problems might exist because there’s just so much out there. There’s so many pods, there’s so many nodes, there’s so many logs. So tell me a little bit about how that challenge is solved by using something like Prometheus.

Natan Yellin: Before I address the solution, the challenge here is really… it’s twofold. One, the only thing that’s worse than having a little bit of observability data is having a huge volume of observability data, and then finding the right data within that, understanding the big picture and understanding what you should be looking at.

And two, Kubernetes took away this linear cause and effect that we used to have. So it used to be you looked at a server and it had, like, high CPU, right? So back in the day, the solution was simple. You got a bigger server with a bigger CPU, or you optimized your application to use less CPU. But now when you have high CPU, it doesn’t just have to do with the size of the node.

It has to do with what stuff is running on that node. It has to do with how you set your requests and limits. So suddenly, there’s this nonlinear cause and effect. You have a problem, but finding what actually caused that problem is no longer, oh, I just need a bigger server. And then the way that Prometheus fits into all that is, when something happens, one, you need to be able to identify the problem, and Prometheus is really the de facto standard for defining alerts, and sending those alerts out and firing them. And then two, once an alert fires, then you need to pull in the relevant data to understand why that alert fired, and you’re pivoting there. So maybe an alert fired on one metric, but then when you go and you investigate, you’re actually pulling in other data and looking at that data.

And Prometheus is also used there as the storage for all this different observability data.
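To ground this, a minimal Prometheus alerting rule might look like the following sketch. The threshold and group name are illustrative; `container_cpu_usage_seconds_total` is the standard cAdvisor CPU counter scraped in most Kubernetes setups:

```yaml
groups:
  - name: example.rules
    rules:
      - alert: PodHighCpu
        # rate() over the CPU counter gives per-second CPU usage per pod
        expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is using high CPU"
```

When the expression stays true for ten minutes, Prometheus fires the alert with those labels attached, and that label metadata is what you later pivot on during the investigation.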

Justin Parisi: You said it acts as the storage for those observability pieces?

Natan Yellin: Yeah. So Prometheus at its heart is a time series database. Okay. And what I mean by that is if you think of a traditional database, then you have a bunch of columns and you’re querying like select star from wherever and you’re filtering and so on and doing all that.

And Prometheus is kind of like that, but not quite. It’s built for a specific use case that it does really well, which is when you have a set of numbers that changes over time. And there are different types of series like that. Like some only go up all the time, right?

When you think about the number of packets that has reached a machine, that’s a number that is constantly going up. It can never go down. So Prometheus has stuff that’s optimized around that specific type of time series. If you think of tracking the latency in an application, then you’re taking the latency and you’re saying, okay, what’s the P99, the 99th percentile, and, like, how much latency is between zero milliseconds and 10 milliseconds? Prometheus is really optimized to store data that answers that type of question.
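The two time series shapes Natan describes map to standard PromQL idioms. The metric names below are the usual node_exporter and client-library instrumentation defaults, assumed here for illustration:

```promql
# A counter only goes up, so you query its rate of change, not its raw value:
rate(node_network_receive_packets_total[5m])

# Latency is stored as a histogram of buckets; this computes the P99 over 5 minutes:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

The `le` ("less than or equal") bucket label is exactly the "how much latency is between zero and 10 milliseconds" counting that makes percentile queries cheap to answer.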

Justin Parisi: So does Prometheus also keep track of deltas and trends? Can you go to a wider view where you can see how things have happened over the course of time?

Or is it very limited to a certain subset of events in a certain time period?

Natan Yellin: We’re looking at the trends over time. Prometheus doesn’t store events. It stores essentially graphs, right? Like, you think of a graph back from mathematics class in high school or university, right? You’ve got a graph that’s moving over time, you have an X-axis and you have a Y-axis… that is Prometheus. Just the X-axis is always time, and the Y-axis is some arbitrary number, and then you can have multiple lines on that graph, and Prometheus knows how to track all those. But Prometheus is not for tracking discrete events, like at this point in time, this occurred and then this occurred, and then this occurred. That’s not Prometheus. Prometheus is for saying, here’s a number, like the number of pods running in my environment, like the number of customers connecting to you right now, like the number of applications that have crashed in the past second, and for tracking that numerical value over time.

That’s what Prometheus is really excellent at.

Justin Parisi: Okay, so walk me through how finding that information can lead me to finding a problem in the Kubernetes cluster. Can it get me all the way to where that problem exists, or do I need something else there?

Natan Yellin: So it’s good at identifying the problem, and I’ll give an example.

So let’s say you wanna identify a job that failed in Kubernetes. There are two ways that you can do that. The first way is you do it by looking at a Prometheus time series. And that time series would be, like, number of failed jobs. And then you write a Prometheus alert that says, when the number of failed Kubernetes jobs goes up by one, then fire an alert, and that’s the identification. So that’s the identification that you had a job that you ran on Kubernetes. And by job I mean the Kubernetes object that’s literally called "job," which you could think of as kind of like a traditional cron job, although it could be a one-off job as well. And you can now identify that someone ran this job on Kubernetes and it failed.
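Assuming kube-state-metrics is running in the cluster (it exports `kube_job_status_failed` per job), this first approach can be sketched as a rule like:

```yaml
- alert: KubeJobFailed
  # kube_job_status_failed comes from kube-state-metrics
  expr: kube_job_status_failed > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed"
```

The `job_name` label on the metric identifies which Kubernetes Job object failed, which is the starting point for the investigation step described next.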

And then the other way is you can identify that not with Prometheus, but with a project like Kubewatch, just by listening to the API server directly. So you have some triggering condition, right? Like, a job on Kubernetes failed. And for that sort of stuff, Prometheus is really good.

And then comes the investigation part. You wanna say, okay, well, why did this alert fire? And now you wanna pull in all this other data. So there, too, Prometheus is very useful if you know where to look. So you’re gonna wanna pull in a time series of what was the CPU usage, what was the memory usage of that job that ran, and then if it got killed cuz it ran out of memory, you would see that. So Prometheus is useful for pulling in all the different graphs that then show you why that might have failed. And then you often wanna correlate it with other data as well, to pull in the logs from that job, or to pull in other data that’s structured, that’s not just time series data.

Justin Parisi: You mentioned that Prometheus doesn’t really do event tracking. Can you get to the Kubernetes logs from Prometheus, or do you have to go into the nodes themselves or some other monitoring tool to do that?

Natan Yellin: So we have an open source project called Robusta that does that.

And what we do is we take Prometheus as the trigger. So Prometheus is the trigger that fires an alert, and then we use the metadata: oh, this alert occurred on this node, on this pod. And then we have an observability engine where there are rules defined that say, okay, well, when this happens, go correlate with the logs.

And then we pull in the logs and we pull in other stuff. So we do that, and then there are other open source tools that you could use to do that as well. You could also do it manually. For example, you could set up Loki, or you could set it up with Elasticsearch, and then when the Prometheus alert fires, you could do that whole process yourself and go and look at the logs yourself and pull those in.

So there are different approaches that you can take, but Prometheus itself does not store the logs, and Prometheus itself does not store discrete events. It only stores that time series. So it only stores what you would essentially view as a graph over time. It doesn’t store a set of events and doesn’t store logs, and it doesn’t store that type of structured data.

Justin Parisi: So this open source Robusta thing, does it have pre-canned scripts or pre-canned API module interactions that people can use without having to create their own? I know some people wanna do their own, but it’s nice to have examples. That kind of thing is what people really use when they’re trying to figure out how to manage their environment. So does that exist out there?

Natan Yellin: Yeah, so you can install Robusta in two ways. The first way is you install it as an all in one solution.

So we then install Prometheus alongside Robusta, using a Helm chart called Kube Prometheus Stack, which is the most popular way to install that. So we give that as an all-in-one package, and then it’s all preconfigured. And then the other way is you could take your existing alerts and you could just send them to Robusta in your cluster by webhook.

And then we have predefined rules for all of the default Prometheus alerts that you would use on Kubernetes. So for example, let’s say a pod crashes, then we would automatically pull in the logs from that crashing pod, and then we would forward that to Slack or to PagerDuty or OpsGenie or wherever you’re consuming alerts today.

Or if you had a Prometheus alert that fired because a node ran out of disk space, then we would pull in data with a graph about why that node ran out of disk space and attach an analysis to that using the pre-built rules.

Justin Parisi: Yeah. That’s one of those scenarios where you spend a lot of time trying to find out what’s wrong, and then you discover it’s something like a failed disk or running out of space, and then you kick yourself cuz you spent so much time doing that. And it sounds like Prometheus can really eliminate that kicking-yourself phase, because it can find those types of really low hanging fruit pretty easily.

Natan Yellin: Yeah, the data is all there. I mean, very likely if you’re running Prometheus today on Kubernetes, then you are probably already running what’s called a node exporter.

And probably whenever you have an error like that occur, all the data is actually right there. So you just have to know where to go and look. But if you know where to go and look, then really solving a common error like this, like a pod that has some issue, or an issue on the node, yeah, really, it’s not so hard. You have all the data you need to do so.

Justin Parisi: The natural inclination of an admin is not to think, oh, it’s something really simple. It’s always, like, the hardest thing: no, something horrible’s broken. But if you can knock out those really simple things pretty quickly, you save yourself a lot of time and it really just enhances the overall experience.

Natan Yellin: Yeah, yeah. I’m constantly surprised in my career by how many errors are actually very simple things.

Justin Parisi: Yeah. And no matter how many times it happens, I always get caught doing the same thing. It’s like, oh man, this has gotta be hard. It can’t be so simple. But it really is, and this is where automation and tools like Prometheus and Robusta really come into play, because that’s going to kill a lot of those really dumb mistakes that we make.

You know, we’re not stupid people, but we do make stupid mistakes.

Natan Yellin: Yeah. I mean, it’s not just about not making mistakes. We’re all really, really busy. At previous companies, like, I had a Slack channel with way too many Prometheus alerts firing there. I wanna know faster which of those actually matter and which are important, to be able to solve that.

And Prometheus has all the data to do so. So I wanna be able to make that data more accessible faster. And if a thousand other companies have had the same alert for a deployment that does not have the right number of pods, then I want to be able to use their knowledge in order to fix this alert a lot faster, and ideally to do so without needing to even generate new observability data.

Justin Parisi: So Prometheus itself, is it Kubernetes aware? I mean, it sounds like it can find pods. Is it running actually within the Kubernetes cluster as a pod itself? Is it installed separately? Does it run in the cloud? Tell me more about how that all works.

Natan Yellin: So there are a few different setups, and you can set this up multiple different ways.

The most common thing that people do is they install Prometheus using Kube Prometheus Stack. That’s just the name of a Helm chart that’s very popular that people use, and that essentially installs two different things. So it installs Prometheus itself, and with Prometheus, Alertmanager and Grafana and all the other standard parts of the stack.

And none of those are Kubernetes aware. And then it installs alongside that, something that’s called the Prometheus operator. And what that does is it adds on custom resource definitions or CRDs for Kubernetes, which is a fancy way of saying it essentially installs a Kubernetes extension that makes Prometheus native in Kubernetes.

So instead of adding some config map somewhere and defining my rules in there, I can run "kubectl edit prometheusrule," and I can define my Prometheus rules and my alerts and all of that as if it was a native Kubernetes object, just the way I would define a Deployment, or I would define a Service or some other Kubernetes object, and that makes it a little more native when you set it up and when you define it.
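With the Prometheus operator installed, that native object is the `PrometheusRule` custom resource. A minimal sketch (the names, namespace, and alert are illustrative; the metrics are standard kube-state-metrics series):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: monitoring
spec:
  groups:
    - name: example
      rules:
        - alert: DeploymentReplicasMismatch
          # fires when a deployment has fewer available pods than it asked for
          expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
          for: 15m
```

You can then manage it like any other Kubernetes object, e.g. `kubectl edit prometheusrule example-rules -n monitoring`, and the operator reloads Prometheus with the new rule.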

And then the one challenge, though, is when an alert fires, sometimes the alerts aren’t really Kubernetes aware. From Prometheus’s perspective, let’s say you have an alert that’s firing on a pod that’s running on a certain node. It isn’t aware of the topology. Prometheus is aware that you have a time series, like the number of pods that crashed in the last 10 minutes, right?

And it’s looking at that time series, and now it notes that it went up by one. And then on that there are labels, there are key-values, like node equals the name of the node, pod equals the name of the pod. But Prometheus itself doesn’t have that contextual awareness to understand, okay, this is a pod, which is part of this deployment, which is running on that node.

From its perspective, it’s all just metadata.

Justin Parisi: Okay. And does this mean that you can run Prometheus in the cloud or on-prem? Can you run it against a GKE or an AKS?

Natan Yellin: Yeah, you can run it against anything. GKE now has managed Prometheus as well. So you can use the managed Prometheus that you can add on to GKE. You can just take a regular GKE cluster and install Prometheus yourself there. There are options for running Prometheus outside of the cluster. You can use different cloud solutions that will run Prometheus for you. So there are a lot of different options available.

Justin Parisi: Okay. So it sounds like Prometheus can do certain things and it can’t do certain things, so it’s not actually like a sentient AI. It can’t solve problems for you. Is there anything out there that Prometheus can use that does that? You mentioned Robusta. Does that tie in here at all?

Natan Yellin: So about the sentient AI…

So Prometheus is definitely not a sentient AI. What Prometheus is, like I said earlier, is just a time series database. When an alert fires from that time series database, though, then you can forward that to a sentient AI like ChatGPT, or perhaps not sentient, but you can forward that to different destinations.

And where Robusta comes into this picture is we essentially have a webhook receiver. So an alert fired in Prometheus, and Prometheus can now send that alert to different destinations. So imagine that you have a Prometheus set up and you’re monitoring your cluster, and now you get a message in Slack. So an alert fired, and then Prometheus just sent that message to Slack by webhook, and it arrived in your Slack application. And what we do with Robusta, which is an open source project that runs inside your cluster, is when that Prometheus alert fires, then instead of sending that message directly to Slack, Prometheus sends that message to Robusta running inside your cluster.
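In Alertmanager terms, that rerouting is just a webhook receiver. A sketch of the idea; the in-cluster service URL below is hypothetical and would be whatever the actual Robusta endpoint is in your install:

```yaml
route:
  receiver: robusta
receivers:
  - name: robusta
    webhook_configs:
      # hypothetical in-cluster endpoint; check the Robusta docs for the real one
      - url: http://robusta-runner.monitoring.svc.cluster.local/api/alerts
        send_resolved: true
```

Instead of Alertmanager posting directly to Slack, every alert is POSTed to the in-cluster receiver, which enriches it before forwarding it on.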

And then the Robusta open source takes those Prometheus alerts. It maps them onto the relevant Kubernetes object using the metadata, and then it pulls in extra data. So imagine an alert fired on a pod. Then it can pull in the pod logs. Imagine that an alert fired on a Kubernetes node. Then it can pull in a graph with the memory usage of that node over the last half hour, if it’s relevant to that alert.

And we have a ChatGPT integration now that’s a separate open source project, and it can actually go and ask ChatGPT how it thinks you should solve this alert. And then it can send that alert to Slack along with ChatGPT’s advice on how to fix it.

Justin Parisi: Yeah. And that’s really where I found you was the discussion about ChatGPT, because it’s the hot buzz thing now. Everybody’s playing with it and everybody’s creating different use cases for it. It’s kind of like they’re presenting it as a panacea of things to reduce the amount of work you have to do or to create code for you.

Like all things, it doesn’t seem like that’s gonna be the reality. So where do you see ChatGPT going? Do you see it as something that’s gonna eventually replace the need to do your own Kubernetes administration? Or is it something that’s just gonna kind of act as a tool to enhance the experience?

Natan Yellin: So it’s just a tool, and I think the number one limitation of ChatGPT, the strength of it and the weakness of it, is that it’s very good at doing what you ask it to do, even when you’re asking it to do nonsense. So as an example, if I said to ChatGPT, explain to me why this alert is firing, it might gimme a reasonable explanation. If I said to ChatGPT, explain to me how this alert is firing because one of the developers on my team just did something awful, it’ll give me a viable explanation for that too. So the data that you get from it… I like to say you have to view the ChatGPT answer the same way you would view an answer from someone during a job interview if you were interviewing them. If they’re good at interviewing, then the answer will sound good, but you still need to see whether that answer is factually correct or not.

Justin Parisi: You mean I shouldn’t just take it at face value?

Natan Yellin: You shouldn’t take it at face value.

So we’ve done some testing around this. I did a video on YouTube, where I took a specific Kubernetes alert, and then I ran it through the ChatGPT integration we built.

And then I looked over each of the things there that it recommended. Some of them made sense. Some of them were good advice, and some of them were generalities that kind of looked like they made sense, but didn’t actually have much meaning. It’s the same way that I might call up a pizza place and say, can I order a pepperoni pizza? And ChatGPT, if it was replacing that person, would say, yeah, when you order pizza, it’s generally possible to order pepperoni pizza. Like, well, of course that’s true, but that’s not what I’m asking. Can I order a pepperoni pizza? How much will it cost? Well, the cost will generally be between $2 per topping and $3 per topping. That might be generally true, but it’s not what I’m asking.

Justin Parisi: So it sounds like a really frustrating friend. Like: where do you wanna go? I don’t know. Where do you wanna go to eat?

Natan Yellin: Yeah. Yeah. I think what’s interesting is that if you had asked me five years ago what AI would replace in terms of jobs, then I would’ve said AI is going to replace like maybe some of the hard sciences, or it’ll be better at mathematics, or it’ll be better at physics, or it’ll be better at very well defined problems.

And I think actually what we’ve seen is that ChatGPT and the other AI stuff is really good at replacing the creative stuff because with the creative stuff, then it’s harder for something to be like completely unfactual. Rather there’s more creative flexibility.

Justin Parisi: That’s interesting cuz I mean, it’s built on data that is very rigid, so it has to kind of figure out an answer from multiple data sets.

So I guess to me that’s not so much creative, and it’s more of just, I don’t know, there’s so much information, which is really back where you are when you started as a person. You’re like, I don’t know, there’s so much information. So you mentioned ChatGPT is great if you know how to interview it.

So what questions do you think you would need to ask ChatGPT to help you narrow down an issue that you have with your Kubernetes cluster while you’re integrating Prometheus?

Natan Yellin: So, I can tell you how I know one company is using ChatGPT, actually. There’s a company I know of where there’s a Slack channel that the platform team uses, and they use that channel to communicate with the developers. And every now and then developers come along and they ask a question that’s typically fairly trivial. Like, how can I access the logs for this pod? Or what do I need to do in order to connect to production and run this, or to get this data? And it used to be that they would come along and they would ask that, and then the platform engineers or the DevOps team would say, oh, it’s this question again. And they’d go to some wiki, or they’d copy paste an answer, cuz people ask it all the time, or they’d write a one-liner if it’s a new question. And now they told me that they go to ChatGPT, and they just plug the developer’s question into ChatGPT, and then they get back a whole big answer.

They glance it over to see that it’s actually correct, that it makes sense. And then they just copy paste it and send it to the developer, and the developer goes, oh man, you guys took all this time to write that, and it’s written in perfect English, and it’s very long and, like, step by step. So it’s actually very useful in that case, but it’s being verified there by a human.

Justin Parisi: So that sounds like it’s really good for taking the emotion out of things, cuz I run into the same stuff. You get asked a question eight times, you get a little frustrated, you’re a little annoyed, and your answers become shorter and shorter and shorter. But ChatGPT don’t care.

Natan Yellin: It’s good at generating plausible outputs. You still often need a human to say whether that output is correct or not.

Justin Parisi: Yeah, absolutely. And that’s the key for most things that are like that, right? I mean, you need to verify, but it stops you from giving that one line answer that isn’t gonna be helpful.

They’re just gonna keep asking the question. Ultimately what might happen though is maybe they don’t need to ask Slack anymore. Maybe they need to just ask ChatGPT. But then there’s a danger there because if you’re asking the question that you don’t know and you need to verify the answer, there’s a disconnect there, right?

Natan Yellin: Yeah, exactly. If I look at this now with my Robusta team hat on, then let’s say you have a Prometheus alert that fires, and that goes to our open source, and then we add on data about why that occurred. We won’t add raw ChatGPT output on to that and say, this is how you fix this problem.

And, like, we would never do that. But what we would do is, we have a set of data that we know about, that’s based on rules that we wrote and that we verify, that pulls in extra data. I could potentially see us using ChatGPT to generate an explanation and then having a human look at that, or tweak that, and see that it’s correct.

So I mean, maybe that would make sense. And then of course we have the integration where you can push a button and you can ask ChatGPT, but you have to take the output with a grain of salt and interpret that for yourself.

Justin Parisi: So I would imagine you could eventually have an AI that can look at ChatGPT answers and say, oh, this is wrong, this is not correct.

Or, oh, this is, you know, problem number 17 that we’ve already resolved here. So is that something that you see potentially happening where an AI is interacting with an AI, which is interacting with an AI and then we just kind of sit back and watch the chaos? Or is there still gonna need to be some human interaction there?

Natan Yellin: So both of those have actually happened a little bit under the hood in how they trained ChatGPT, I believe. What they did, if I recall correctly, when they trained ChatGPT, was they had humans generate a bunch of questions, and then they had the precursor to ChatGPT generate a bunch of different plausible answers.

And then they had humans go and actually rank which of those answers were good, proper answers, actually answered the question, were factual, and so on. And then they used that to train another machine learning model that is actually used to supervise the ChatGPT output.

Justin Parisi: It’s interesting where this AI machine learning stuff is taking us. There’s some really interesting use cases. So where do you see Prometheus in the future? Where could you see Prometheus enhancing the overall Kubernetes observability experience?

Natan Yellin: Prometheus today is the de facto standard really for how you monitor Kubernetes and I’m not an official Prometheus maintainer. We do a lot of work with Prometheus, but I’m not one of the maintainers, so I can’t speak on that aspect. I’m not the right authority for that. But I think we’re going to see more and more tooling around Prometheus, about taking the data in Prometheus and making that better, and about surfacing the right data at the right time, and probably more also around making Prometheus more native on Kubernetes itself, and tying it in with logging and with other stuff as well.

Justin Parisi: So actually becoming an official part of the Kubernetes deployment stack, kind of like any CSI driver might be. It might have a Prometheus pod that spins up in addition to everything else.

Natan Yellin: Yeah. So we have some of that today.

I mean, it is a big part of that standard deployment for many companies. And then there's also stuff like the metrics server. There's various stuff also around bringing Prometheus metrics into Kubernetes. Think about an autoscaling scenario: the Kubernetes API server actually has to be aware of the Prometheus metrics, because it's going to scale the deployment, scaling the number of pods up and down based on some Prometheus metric.

So Prometheus has to be Kubernetes aware.
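As a concrete illustration of the autoscaling scenario Natan describes, here is a sketch of a HorizontalPodAutoscaler that scales a deployment on a Prometheus-derived metric. This assumes an adapter such as prometheus-adapter is installed to expose Prometheus metrics through the Kubernetes custom metrics API; the metric and deployment names here are hypothetical.

```yaml
# Illustrative sketch only: scale "my-app" on a Prometheus-sourced metric.
# Requires an adapter (e.g. prometheus-adapter) serving the custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served via the custom metrics API
        target:
          type: AverageValue
          averageValue: "100"
```

This is the tie-in Natan mentions: the HPA controller in the Kubernetes control plane makes scaling decisions from a number that ultimately originates in Prometheus.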

Justin Parisi: Yeah. If it doesn't know that someone has auto scaled, you might see that as a problem.

Natan Yellin: Yep.

One interesting thing that we see sometimes is that people will open up ChatGPT when they have a question instead of going directly to Google, which is interesting because ChatGPT in theory has more potential to be incorrect. Like, say you have some issue, and now you go and you Google that issue and then you find a bunch of results on Google.

Or you go and you look it up in ChatGPT, and then you get an AI generated answer like, how do I fix this problem that occurred? Both of those actually could be incorrect, right? Because when you Google the question, you're also just arriving on a random blog post. Who says the blog post is correct either, right?

Justin Parisi: Yeah, that's kind of a buyer beware scenario. This stuff is where you really want to not do it in production first. You want to kind of vet the answer, try it out in a lab somewhere. Don't just start monkeying with the pods because ChatGPT told you it was good to do.

Natan Yellin: Yeah. I think one of the challenges is how we can also capture that tribal knowledge and then make that tribal knowledge more accessible, right? Like today, the tribal knowledge is all in a way that’s human understandable. A human can go and they can read Stack Overflow, they can read stuff on Google, but that tribal knowledge about what to solve and like what it means if this alert fires or how you should fix it, that’s not machine understandable.

Justin Parisi: All right, Nate, so Prometheus sounds pretty cool for the observability aspect of Kubernetes administration.

I understand that there's a Substack out there that you've contributed to.

Natan Yellin: Yeah, so we recently launched a Substack called Why This Kubernetes Thing. And what we do is we take different things on Kubernetes, different tooling like Kustomize or Skaffold, or stuff related to security, like network policies, and we look at why they exist. A lot of stuff out there is the nitty gritty of, okay, how would you go and do this, and what would the YAML look like, and how is this working behind the scenes? And that's all important. But given the vast number of tools out there, there's also a question like, okay, what needs my attention?

Which tools do I need for my business and why should I even care about them? So what we try and do is we try and touch on the why these things exist and whether you need them or not, and then to give you a teaser so that if you are interested in it, then you can go and you can learn about that and read more about it.

Justin Parisi: Okay, and where else could I find more information?

Natan Yellin: You can just find the first result on Google and go to the official page on GitHub. If you wanna install Prometheus on Kubernetes, I would recommend people also look at the Robusta open source project, which I work on.

We install Prometheus and we install it with the observability engine from Robusta, so that you can connect your Prometheus to pod logs and to other data that's Kubernetes specific, and that's all open source. We also have a SaaS platform and a whole other offering, but you don't need to use any of that just in order to install the Prometheus bundle. And then there's the Prometheus documentation, so just the standard resources, I guess.

Justin Parisi: And you also have a YouTube channel and a LinkedIn, and we’ll add that to the links in the blog that accompany this show as well.

Natan Yellin: Yeah, I'm fairly active on social media and I write that Substack, which goes out every week as well. And please feel free to reach out directly. I like to hear from people. Sometimes you do a podcast like this, and people listen, but you never know. They listen and you don't hear from anyone. So I love to hear from people. Just send me an email, natan, n-a-t-a-n, at robusta dot dev. Say, I heard the podcast and I liked this part, I didn't like this part.

It’s always nice to hear from people.

Justin Parisi: Well, Nate, thanks so much for joining us and talking to us all about Prometheus and ChatGPT, and all sorts of other interesting Kubernetes related content.

All right, that music tells me it's time to go. If you'd like to get in touch with us, send us an email or send us a tweet @NetApp.

As always, if you'd like to subscribe, find us on iTunes, Spotify, Google Play, iHeartRadio, SoundCloud, Stitcher, or via RSS. If you liked the show today, leave us a review. On behalf of the entire Tech ONTAP podcast team, I'd like to thank Nate Yellin for joining us today. As always, thanks for listening.

Podcast Intro/Outro: [Outro]


