Welcome to Episode 360, part of the continuing series called “Behind the Scenes of the NetApp Tech ONTAP Podcast.”
Did you know that Google created TensorFlow and Kubernetes?
Or that Sun invented NFS? That Amazon created S3?
Or that Uber created Horovod?
Some of the key technologies that companies use today were originally created by businesses trying to solve their own internal challenges.
In this episode of the Tech ONTAP Podcast, NetApp TME Rick Huang (firstname.lastname@example.org) and Solutions Architect Ken Hillier (email@example.com) join us to discuss Rick’s new blog on using NetApp AI with Apache Spark Horovod for deep learning and inference use cases. We even manage to talk about ChatGPT.
For more information:
- Hybrid cloud solutions with Apache Spark and NetApp AI (blog)
- Deep learning with Apache Spark and NetApp AI—Horovod distributed training (blog)
- Using XCP to Move Data from a Data Lake and High-Performance Computing to ONTAP NFS
- Map-R to NFS
- HDFS to NFS
- TR-4863: Best-Practice Guidelines for NetApp XCP – Data Mover, File Migration, and Analytics
- poster link (available after Jensen’s keynote at GTC)
Finding the Podcast
You can find this week’s episode here:
I’ve also resurrected the YouTube playlist. You can find this week’s episode here:
You can also find the Tech ONTAP Podcast on:
I also recently got asked how to leverage RSS for the podcast. You can do that here:
The following transcript was generated using Descript’s speech to text service and then further edited. As it is AI generated, YMMV.
Episode 360: NetApp AI and Nvidia with Apache Spark Horovod
Justin Parisi: This week on the Tech ONTAP podcast, we discuss Apache Spark, HDFS data lakes, and other AI solution topics with Rick Huang and Ken Hillier.
Podcast Intro/Outro: [Intro]
Justin Parisi: Hello and welcome to the Tech ONTAP podcast. My name is Justin Parisi. I’m here in the basement of my house and with me today I have a couple of special guests to talk to us all about Apache Spark, Hadoop, big data, data lakes, AI, all these sorts of good topics. So to do that today we have Ken Hillier here.
So Ken, what do you do here at NetApp?
Ken Hillier: I can be reached at firstname.lastname@example.org. I’ve been here for a while. I’m an executive architect. It’s a fancy name for being an enterprise architect. I look at our solutions and my involvement with analytics and NetApp AI has been helping our customers get data to NFS and be able to access and do the workflows that they want to do without the contentions of other storage protocol restrictions.
Justin Parisi: All right. Also with us today, we have Rick Huang. So Rick, what do you do here at NetApp? How do we reach you?
Rick Huang: I’m a data scientist at NetApp’s AI Solutions engineering team.
I can be reached at email@example.com and what I do, I develop, publish and evangelize AI solutions across multiple industry verticals. Our goal is to empower customers with innovative AI solutions that provide maximum value to their businesses.
Leveraging our expertise in data science, machine learning, and CloudOps, we work tirelessly to offer cutting edge solutions that drive business outcomes and help our customers stay ahead of the competition. I recently published a series of blogs on Apache Spark and NetApp AI.
Justin Parisi: So, as far as AI goes, we are a big partner with Nvidia and we do a lot of work with them. And my understanding is there’s a conference coming up pretty soon, Rick.
So are we gonna have a presence there? When is that conference? And what sort of things will we be doing there?
Rick Huang: We actually have a virtual poster in GTC. The name is Automated Insurance Risk Detection Solution with Nvidia and Quantify. It will be available after Jensen’s keynote.
I will share the link with you.
Justin Parisi: All right, excellent. We’ll add that to the blog. So we’re here to talk about the new blog that you actually put out that talks about NetApp AI and solutions with ONTAP and Apache Spark specifically. So let’s talk about that. What is Apache Spark? What does it do and how does it work?
Rick Huang: Sure. Apache Spark is a powerful open source, big data processing engine that enables efficient and fast processing of large data sets in a distributed computing environment.
Spark provides a unified analytics engine for large scale data processing with support for various data sources and types. One of its main advantages is the in-memory processing capability, which enables handling iterative algorithms and AI workloads with great speed and efficiency. For example, let’s say you’re analyzing a large data set of customer transactions for a banking company.
With Spark, you can quickly perform tasks such as data cleansing, feature engineering, and model training and inferencing on the dataset, and efficiently scale the processing capability to handle increasingly large datasets and complex environments, from on premises or cloud native to hybrid multi-cloud, as your business grows.
As for Horovod, it is an open source distributed training framework that enables scalable and efficient deep learning model training across multiple GPUs and nodes. With Horovod, you can distribute the workload across multiple workers, allowing you to process large amounts of data and train complex models much faster than with a single node setup. In addition, Horovod supports popular deep learning libraries such as TensorFlow, PyTorch, and MXNet, making it easy to integrate with your existing ML workflows. Let’s say you are training a computer vision model for object detection using TensorFlow. This is a common application during document digitization for automatic risk detection.
If you check out our virtual poster, you’ll see more details. Besides training faster, you can also fine tune the model with additional data and optimize the hyperparameters to further improve its performance. However, there are some challenges associated with using Horovod, such as the need to carefully manage the communication overhead and ensure that the data is distributed correctly across your nodes.
Another issue is dealing with hardware and software heterogeneity, as different nodes may have different GPUs, CPUs, or software libraries and frameworks installed. Finally, it is important to consider the trade-offs between model accuracy, training time, and data architecture costs when using Horovod or any other distributed training framework.
My blog, "Deep Learning with Apache Spark and NetApp AI: Horovod Distributed Training," covers other considerations, questions to ask yourself before deploying or adopting Horovod, challenges and how NetApp solutions mitigate them, and includes a beautiful figure of performance testing results.
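For readers who want to see what "distributing the workload across workers" and "communication overhead" mean in practice, here is a minimal sketch in plain Python of ring allreduce, the gradient-averaging pattern Horovod is known for. It is a single-process simulation, not Horovod's API: the function name `ring_allreduce_mean` and the list-of-lists representation of per-worker gradients are assumptions made for illustration.

```python
def ring_allreduce_mean(grads):
    """Average equal-length gradient vectors, one per simulated worker.

    Hypothetical helper: simulates a ring of workers inside one process.
    Returns one averaged copy per worker, as each worker would hold it.
    """
    n = len(grads)
    size = len(grads[0])
    # Split the vector into n contiguous chunks; chunk c covers bounds[c]:bounds[c+1].
    bounds = [c * size // n for c in range(n + 1)]
    data = [list(g) for g in grads]  # each worker's local buffer

    def chunk(c):
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (reduce-scatter): after n-1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Capture what every worker sends before anyone receives.
        sends = [[data[i][k] for k in chunk(i - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(chunk(i - step), sends[i]):
                data[dst][k] += v

    # Phase 2 (allgather): pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [[data[i][k] for k in chunk(i + 1 - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(chunk(i + 1 - step), sends[i]):
                data[dst][k] = v

    return [[v / n for v in buf] for buf in data]
```

Each worker only ever talks to its neighbor and sends one chunk per step, which is why the pattern scales well, and also why a slow or mismatched node in the ring slows everyone down.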
Justin Parisi: What sort of real world use cases have we seen with deep learning? From a regular person’s perspective, we hear deep learning and machine learning and AI, our eyes kind of go glassy, right? We’re like, oh man, that’s really advanced stuff. But overall, it’s used in some really simple use cases, I would imagine.
So what sort of use cases would we find that are surprising that Apache Spark is in the background, doing the work.
Rick Huang: Right. I can provide a financial sentiment analysis example using deep learning. We analyzed earnings call transcripts from the NASDAQ top 10 companies. From the transcripts, you can see that the sentiment of the CEO/CIO’s prepared remarks is mostly neutral or positive. However, the interesting part is the analysts’ questions, right? From their wording and their questions, you can tell whether a remark is positive or negative, and then you can correlate it with next-day stock performance. So over time, let’s say with 30 years of NetApp’s earnings call transcripts, you can make a deep learning system that can predict your stock price for the next day, and then your CEOs and CIOs can prepare their remarks accordingly, right?
They can put their prepared transcripts into the system, and then the system can make suggestions like, I don’t know, “please don’t use this term, otherwise the analysts from some banks or investment firms will say something.”
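For readers who want to picture the "score the questions, then correlate with next-day performance" step, here is a toy sketch in plain Python. It deliberately swaps the deep learning model for a tiny word-list scorer so it stays self-contained; the word lists and function names are hypothetical and are not taken from the blog.

```python
import math

# Hypothetical word lists, for illustration only.
POSITIVE = {"growth", "strong", "record", "beat"}
NEGATIVE = {"decline", "miss", "weak", "concern"}


def sentiment_score(text):
    """(positive words - negative words) / total words; 0.0 for empty text."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)
```

In use, you would compute `sentiment_score` for the analyst questions on each call, pair each score with the next-day return, and pass both lists to `pearson`. A real system would replace the word lists with a trained language model.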
Justin Parisi: Okay. And Ken, you have some experience in this industry as well. So from a customer’s perspective that you work with, what sort of things are you seeing out there with the use of Apache Spark and that sort of thing?
Ken Hillier: Well, with some of the customers that I’m working with, the big challenge that they’ve been having is getting access to the data. So Spark is a layer above the traditional data lakes that in the past have been defined by Hadoop servers, which could be white box servers with both compute and storage integrated, or, as things have progressed and things have been optimized, there have been shared storage options layered in. However, at the end of the day, you still have a data lake that needs to manage massive amounts of data, both ingest and all of the operations that are going on on top of that. One of the big challenges that we ran into with one customer is that in order for them to access the data, they were a low priority behind the other stuff that had to be maintained within the data lake.
So a big challenge was just getting the data out from the data lake itself so that application frameworks like Spark could run against it. So that’s the biggest thing that I’ve been seeing, and it’s one of the ways we’ve been helping our customers: getting the data that they need, where they need it, so they can conduct whatever operations against it they want to.
Rick Huang: Yeah, Ken, that is a great use case and I want to add that nowadays people tier data lakes into hot, warm, and cold layers for active production, prototyping, and historical archives. However, as they move towards hybrid multi-cloud architectures, migrating these data lakes is difficult, as you said. For example, the data may need to be transformed or reformatted to be compatible with the new environment, and security governance policies may need to be reevaluated.
NetApp’s BlueXP Copy and Sync (previously known as Cloud Sync), XCP, Cloud Data Sense, and CVO help our customers migrate their data lakes securely while ensuring reliability, reducing costs, and improving performance.
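For readers new to tiering, here is a minimal sketch of assigning a dataset to a hot, warm, or cold layer. It assumes tiers are chosen by time since last access, which is one common policy but not something stated in the conversation; the 30-day and 180-day thresholds and the function name `tier_for` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real policies vary by organization.
HOT_DAYS = 30
WARM_DAYS = 180


def tier_for(last_access, now):
    """Return 'hot', 'warm', or 'cold' based on time since last access."""
    age = now - last_access
    if age <= timedelta(days=HOT_DAYS):
        return "hot"    # e.g. active production data
    if age <= timedelta(days=WARM_DAYS):
        return "warm"   # e.g. prototyping data
    return "cold"       # e.g. historical archives
```

A migration plan can then treat each tier differently, for example moving hot data first and leaving cold archives for a later batch.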
Ken Hillier: Absolutely. I would actually double click on that, Rick because in the customers I’ve been working with, that’s been one of our defining differentiators is our ability to help them manage their data in and out of the data lakes and get it to where they want it to go, whether it is on a NetApp system on-prem, or a solution in the cloud, and going for true hybrid.
Justin Parisi: So this data has to go somewhere, right? It has to be stored somewhere and traditionally, Apache Spark has leveraged Hadoop and HDFS as a backend. But that has its drawbacks and its benefits. So let’s talk about that. Let’s discuss HDFS, what it’s good for and maybe what some of the downsides of using HDFS are, especially with regard to the Apache Spark application.
Rick Huang: Yeah, I’m sure everyone knows HDFS is a distributed file system designed for storing large data sets. Its key benefit is parallel processing of data across multiple servers, making it ideal for big data and AI/ML applications. It’s used by giant corporations in the social media, healthcare, life science, and financial services industries to store and process massive amounts of data.
However, HDFS has some drawbacks, such as the high overhead required for data replication, and the fact that it’s not well suited for small files. Additionally, it requires a cluster to run, which can be costly and complex to set up and maintain. Here is where NFS comes to the rescue. NFS has been around for decades.
It allows files to be accessed and shared across the network and is often used for a wide range of scenarios from traditional web servers, file sharing on premises and hybrid cloud data pipelines, to more involved AI model training and inferencing. We have NFS direct access that works seamlessly with Spark. Compared to HDFS, NFS direct access is simple to set up and maintain and can help improve data processing speed and efficiency. I focus on this speed part. In my blogs you can read the performance comparison. We achieved pretty good runtime speedups. This is especially important in today’s data-driven business environment where real-time analytics and machine learning models require fast and reliable access to large volumes of data spread across multiple locations.
It’s a great choice for many small files in large data sets, like text logs generated hourly or even by the minute, depending on the use case and demand.
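For readers who want rough numbers behind the replication overhead and small-file drawbacks, here is a back-of-the-envelope sketch. It uses the HDFS defaults of a replication factor of 3 and a 128 MB block size; the figure of about 150 bytes of NameNode memory per file or block object is a commonly cited rule of thumb, not a measurement, and the function names are hypothetical.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def raw_capacity_needed(logical_bytes, replication=3):
    """Raw disk consumed when every block is stored `replication` times."""
    return logical_bytes * replication


def namenode_objects(num_files, file_size, block_size=128 * MIB):
    """Metadata objects tracked: one per file plus one per block of each file."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file)


def namenode_memory_estimate(objects, bytes_per_object=150):
    """Rule-of-thumb memory estimate; bytes_per_object is an assumption."""
    return objects * bytes_per_object


# The same 1 TiB of data stored two ways.
small = namenode_objects(TIB // MIB, MIB)   # about a million 1 MiB files
large = namenode_objects(TIB // GIB, GIB)   # 1,024 files of 1 GiB each
```

Under these assumptions the small-file layout needs over two hundred times as many metadata objects as the large-file layout for the same amount of data, while both layouts consume 3 TiB of raw capacity.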
Ken Hillier: Yeah, I would agree with all of that, Rick. Taking a step back and a step up, one thing I would add just for perspective is that as our customers, as our whole industry is really looking at how they can leverage dynamic, different resources, whether it’s in the cloud or a hybrid model or building something new up on-prem, they need to be able to be agile and be able to create these things. And traditionally, a Hadoop infrastructure has been very siloed, so access to it is highly controlled. How it curates the data is designed to deal with massive amounts of data, so it has to control access to it, and that makes it difficult for us to leverage different resources or be able to get the data near enough to the cloud where customers might be able to actually leverage cloud resources against some of the data that they have.
I think that’s one of the main reasons why I’ve personally been seeing, and I think the industry is seeing Spark separating from Hadoop and trying to focus on being able to run against different data sources instead of just only on Hadoop itself.
Justin Parisi: So with Hadoop I would imagine that things like snapshots and replication and data management aren’t exactly key features, right? Or if they are features, they’re not as robust as maybe what ONTAP offers. Would you agree with that, or would you say that there’s a lot of similarities with what a massive FlexGroup volume hosted with NFS would offer with HDFS?
Rick Huang: I will say that with Hadoop there are many settings that you have to tweak. It’s not as intuitive as ONTAP, where with a click of a button your replication, your snapshots, those are taken care of, right?
Ken Hillier: What I would add to that is I think Hadoop was solving a problem in its day of managing large data footprints. They really did define what data lakes could be. And I think their focus in the architecture they set up within HDFS or within Hadoop was focusing on the data.
How can they manage and maintain, and provide durability to data when we’re all very familiar that things fail, hardware fails. So I think if we were to compare what Hadoop does to protect data and its data management capabilities against ONTAP, I think we do have a much richer set of tools that could be used with the data, with snapshots, replication and all of that. Because we not only control the file system and how things are written out to the storage behind it, but we also control the RAID on top of it. And that’s all very optimized.
And I would say that is one of the key differences between say, like ONTAP versus the storage part of a Hadoop system. They don’t really have that level or that focus of optimization on the backend. They’re focusing more on protecting the data.
Rick Huang: Yeah, and I would add another point for data scientists and data engineers, working on Hadoop itself takes a lot of time.
And if you want to add those data management capabilities, it’s even more complex. With our data ops toolkit, data scientists and data engineers can quickly prototype and develop, and also put their models into production. And they don’t have to worry about the underlying data lake, how it works and what it is using: NFS, iSCSI, or other protocols.
Justin Parisi: So are you telling me that data scientists don’t want the IT nerd street cred of the harder solution? They want the easy stuff?
Rick Huang: They want the quick stuff, right? They want to play around with their models instead of dealing with the underlying storage and IT management.
Justin Parisi: Yeah, I guess maybe back in the day that was cool, but now it’s like, man, I really don’t wanna deal with this.
I wanna just get my work done, right? I wanna get this stuff done and get it done quickly and efficiently and have the confidence that it’s gonna be protected.
Rick Huang: Yeah. They don’t want to copy a data set, wait for 16 hours, and then not know what to do. With a snapshot, it’s done within a few minutes, and then they can provision other Jupyter workspaces and play in them.
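For readers wondering why a snapshot takes minutes where a full copy takes hours, here is a toy model in plain Python. It shows the general idea of a pointer-based snapshot: the snapshot records which blocks make up each file and never duplicates the data itself. This is a conceptual sketch, not a description of how ONTAP implements snapshots, and the `Volume` class is hypothetical.

```python
class Volume:
    """Toy volume where files are lists of pointers to immutable blocks."""

    def __init__(self):
        self.blocks = {}     # block id -> payload (never modified in place)
        self.active = {}     # file name -> list of block ids
        self.snapshots = {}  # label -> frozen copy of the pointer table
        self._next_id = 0

    def write(self, name, chunks):
        """Write a file as new blocks; old blocks stay for any snapshot using them."""
        ids = []
        for payload in chunks:
            self.blocks[self._next_id] = payload
            ids.append(self._next_id)
            self._next_id += 1
        self.active[name] = ids

    def snapshot(self, label):
        """Copy only the pointer table; cost depends on file count, not data size."""
        self.snapshots[label] = {n: list(ids) for n, ids in self.active.items()}

    def read(self, name, snapshot=None):
        """Read from the active file system or from a named snapshot."""
        table = self.active if snapshot is None else self.snapshots[snapshot]
        return b"".join(self.blocks[i] for i in table[name])
```

After a snapshot, overwriting a file creates new blocks for the active view while the snapshot keeps pointing at the old ones, so a data scientist can experiment freely and still get back to the original.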
Justin Parisi: So you said the magic word, copy, and that kind of takes us into our next topic where we talk about migrating a data lake. And really what it comes down to is if you’re trying to move off of Hadoop and you’ve got millions or billions of files or multiple petabytes of data, that’s gonna be a challenge.
So Ken, talk to me about that. How would we accomplish this to begin with and what are some of the challenges that we really need to look out for when we are planning this?
Ken Hillier: Well, speaking to the challenges first, there’s going to be a set of processes or governance that’s in place over the data lake, and all of that’s gonna have to be looked at. For whatever portion of the data that’s being moved, how is that gonna translate to a new solution or a new architecture. So that’s something that’s very real and it’s going to be something that as customers are looking to make a change with how they’re managing and maintaining their data.
What is this gonna look like if they’re moving to a different platform like ONTAP for NFS or StorageGRID for Object or something like that. What’s gonna be the data governance around that? So that’s definitely gonna be a very important part of the consideration.
But as for the mechanics, how a customer can approach this: there are ways of getting data natively out of HDFS. They provide NFS gateways, they have API access and that kind of stuff. And it’s usable. But I think one of the things that has been very successful with us is that NetApp has been focusing on how we can integrate more directly, so that given a certain data set that we want to copy, we can natively access HDFS or Hadoop and be able to copy that data to ONTAP, and then from there we can get it to StorageGRID. So that’s been a big differentiator. It’s made it very easy for our customers to set up migration or batch jobs to grab data out of the data lake and be able to bring it over to the NetApp AI infrastructure so they can run their jobs against it.
Rick Huang: That’s a great point, Ken. The challenge overall is managing the complexity of multiple environments and technologies. A hybrid multi-cloud architecture involves a mix of on-premises infrastructure and public cloud services, each with its own set of tools and technologies. And if you have data lakes in the cloud and also on premises, managing and integrating these different components can be complex and time consuming, requiring careful planning and coordination.
Ken Hillier: Yes. I’m glad you mentioned that, Rick, cuz that actually really starts shining a light on some of the challenges our customers are dealing with. When we start looking at taking a step away from AI and analytics itself. If the data scientists want to work with the data, and they need to move that to the cloud, there’s going to be certain controls that need to be put in place to make sure that compliance is adhered to. So those are things that, with our experience managing data, we can certainly help guide our customers around our partner solution.
But I think that’s gonna be something that all of our customers are gonna be facing more and more of. And I think the key really is how can we make the data motion behind that – from data lakes to different parts of stuff on prem and/or in the hybrid model – getting the data where they need it, when they need it.
Rick Huang: Yes, and I would add two more use cases from our customers in media and entertainment. One is real-time streaming analytics for personalized recommendations. The other, which is more interesting, is the end-to-end production of TV and films. I’m in the M&E capital of the world, so I know how difficult it is to make a movie. From a production perspective, you may need real-time dailies transferred from your current filming sets or data centers in different locations to the editors or VFX studios. This requires a reliable and efficient data transfer solution that can securely transfer large amounts of data over long distances while maintaining data integrity and ensuring data security against ransomware attacks or other cyber threats that can compromise data privacy and availability.
So do you want to really just leave your project deadlines and high quality results to chance? With our BlueXP Copy and Sync, CVO, Cloud Data Sense, we’ve got you covered, ready to take your data transfer game to the next level. Oh, and for those not familiar with the term, historically, a daily represents the prints of takes of camera footage from one day’s shooting, usually without correction or editing or examination by the director before the next day’s shooting.
That’s why it’s called a daily. In recent decades, most films are shot digitally, enabling directors to monitor the shot without added VFX. But terabytes of source data get generated on set these days and will eventually need to make it back to BDS servers, so a well-designed, cost-effective, secure, high-performance data fabric is crucial to the timely success of global film and TV production. Okay, I digress. I always get too excited talking about films.
Justin Parisi: No, it’s really cool. It’s a really cool industry and hearing you talk about this, I start to think why would I copy that data? Why not use some of the integrated ONTAP features that require no copying, right? You’ve got things like FlexCache, you’ve got things like the ability to expose object storage using NAS shares. So we have stuff that allows you to copy things, of course, such as your BlueXP and your Cloud Sync and that sort of thing. But in some cases, you might not even have to copy any data. You would only use the data that you need as you access it, and it all happens automatically behind the scenes.
Ken Hillier: Justin. Yeah, I think you’re bringing up a really good point. The ability to get the data to where it needs to go is definitely an important aspect. But in a scenario like you’re describing and what Rick was talking about, there’s no reason to actually move the data if we have the data available. Even if we need to make sure, from a compliance or data management perspective, that nothing happens to the original copy, we have the ability to do clones or snapshots and set up entirely different environments, maybe for that director so he can review and make some firsthand edits without actually touching the source, or for any other operation that might be going on.
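For readers unfamiliar with the "only fetch what you access" idea Justin describes, here is a minimal read-through cache sketch in plain Python. It illustrates the general caching pattern only. It is not how FlexCache is implemented and it omits invalidation and eviction entirely; the `ReadThroughCache` class and the dictionary standing in for the origin volume are hypothetical.

```python
class ReadThroughCache:
    """Toy remote-site cache: fetch from the origin only on first access."""

    def __init__(self, origin):
        self.origin = origin   # dict standing in for the origin volume
        self.cache = {}        # locally cached copies
        self.origin_reads = 0  # how many times we crossed the WAN

    def read(self, path):
        if path not in self.cache:
            # Cache miss: pull just this file from the origin.
            self.origin_reads += 1
            self.cache[path] = self.origin[path]
        return self.cache[path]
```

If a remote site only ever touches two of a thousand files, only those two cross the network, and repeat reads are served locally.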
Justin Parisi: And Ken, you’ve got background in the media and entertainment space. I remember this from my support days. So you’ve seen this firsthand that customers in that industry, they wanna try to save as much real estate as possible. They wanna make it as easy as possible, and that’s really what these feature sets in ONTAP are doing.
Ken Hillier: Absolutely. Yeah. So getting data from one site to another, like you said, moving all of the data, especially if only pieces of it are needed. FlexCache makes a really great way for different sites within some of these media companies to be able to share data and also so that they can keep their assets where they want them, ’cause when you have multiple sites, more often than not in my experience, there’s going to be a primary site where they put in the infrastructure to be able to maintain data on a much larger scale than the rest of the sites.
Justin Parisi: So we’ve gotten a little off topic here, so let’s get back to the AI discussion.
And Rick, you kind of touched on it in media and entertainment with the algorithms that help you decide on what you wanna watch next, right? So if you open up your Amazon or your Netflix and you go in there and there’s a suggested for you topic. That’s not by accident. There’s something going on in the background that tells you what you’ve been watching and guesses what you’re gonna like based on a giant, probably data lake of selections from other people.
So is Apache Spark involved in that sort of thing or is it something similar in the background that does that?
Rick Huang: Yeah, for sure. Something in the background would do it, and Apache Spark can handle the processing and the data pipelines, and also some AI and machine learning. But I cannot say for sure if those companies are using Spark or a variant or their in-house development for these applications.
But yeah. You get the idea.
Justin Parisi: Yeah. I know Netflix is very open with their architecture. Like they have blogs on this stuff and I know that they’ve pretty much rolled their own private cloud. I mean, that’s virtually what’s going on there. Yeah. And it’s very impressive what they’ve done…
Rick Huang: Yeah, once these corporations are big enough, they start to take the open source stuff and make it their own.
This is actually how Horovod came to life. It was originally developed by Uber.
Justin Parisi: So you said originally developed by Google?
Rick Huang: Uber.
Justin Parisi: Oh, Uber. Okay. So Uber actually developed Horovod. That’s interesting.
Rick Huang: Yeah. They develop it, use it, open source it. Yeah.
Justin Parisi: Oh, cool.
It’s good to see that these tech companies are not just hoarding the technology, right.
They’re releasing it and delivering it to others to use because it actually benefits them in the long run. Cuz if you open source something, people can improve it. They can secure it. And it’s very transparent to everybody.
Rick Huang: Yeah. This is similar to Facebook with PyTorch and Google with TensorFlow.
Justin Parisi: Google with Kubernetes. I mean, there’s lots out there.
Rick Huang: Yeah. TensorFlow and Kubernetes.
Justin Parisi: All right. So speaking of Kubernetes, that is one of those things that powers a cloud architecture. So let’s talk about the hybrid cloud AI workflow. Like what does that look like? How do they interact with the AI?
Tell me end to end, how an AI workflow would take the cloud and leverage that for its benefit.
Rick Huang: Right. In my blog, I provide an example of a cloud service partner providing multi cloud connectivity for an IoT big data analytics environment.
If you look at the figure in that blog, it starts from the sensors. They send real-time events via REST API to an EC2 VPC. In that VPC you have Kafka first, and then the data goes to Spark jobs for processing, which talk to our NFS direct access as the storage backend, and you can send the processed data to your on-premises cluster or another cloud provider via Direct Connect or ExpressRoute.
And in your Spark cluster, you can do fancy stuff on premises with our ONTAP storage and your compute of choice.
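For readers who want to trace the flow Rick describes (sensor events, a message buffer, a processing job, shared file storage), here is a small single-process stand-in written with only the Python standard library. A `queue.Queue` plays the role of the Kafka topic, a plain function plays the role of the Spark job, and an ordinary directory plays the role of the NFS mount. All names are hypothetical, and none of this uses the real Kafka or Spark APIs.

```python
import json
import os
import queue
import tempfile
from collections import defaultdict


def process_batch(events):
    """Stand-in for a Spark job: average the readings per sensor."""
    sums, counts = defaultdict(float), defaultdict(int)
    for event in events:
        sums[event["sensor"]] += event["value"]
        counts[event["sensor"]] += 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


def run_pipeline(events, out_dir, batch_size=3):
    """Buffer events, process them in micro-batches, write results to out_dir."""
    topic = queue.Queue()  # stand-in for a Kafka topic
    for event in events:
        topic.put(event)

    written = []
    batch_no = 0
    while not topic.empty():
        batch = []
        while len(batch) < batch_size and not topic.empty():
            batch.append(topic.get())
        # out_dir stands in for an NFS export mounted on the workers.
        path = os.path.join(out_dir, f"batch-{batch_no:04d}.json")
        with open(path, "w") as handle:
            json.dump(process_batch(batch), handle)
        written.append(path)
        batch_no += 1
    return written


if __name__ == "__main__":
    demo = [{"sensor": "a", "value": 1.0}, {"sensor": "b", "value": 5.0}]
    with tempfile.TemporaryDirectory() as mount:
        print(run_pipeline(demo, mount))
```

Because the results land as ordinary files on a shared path, any other cluster that mounts the same export can pick them up without a separate copy step.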
Ken Hillier: You know, that’s actually a really good point about hybrid cloud. Many of our customers are gonna be faced with the need to build or leverage resources that it’s not cost effective to build up in-house, and it’s going to be the reason why they take a look at the cloud to be able to augment the services they can provide to their own end users.
And really why we see the industry moving towards a hybrid, multi-cloud architecture across the board.
Rick Huang: Yeah. Customers want to run analytics and AI jobs on the same data by using multiple clouds, but the main challenge is to build a cost effective and efficient solution that delivers hybrid analytics and AI/ML/DL services among different environments.
Ken Hillier: And another example of working with one customer, they only have so much resources on-prem, so many GPUs available. There’s different groups within their data science teams that don’t have the resources available for them, or it’s restricted. And that’s another reason why bursting into the cloud is, is very attractive and part of the reasons why they’re looking at leveraging cloud resources to augment what they already have on-prem.
Rick Huang: Yes. And speaking of bursting into the cloud, actually the public cloud providers all offer very good AI/ML capabilities, like SageMaker or Azure ML. These are good, but once the demand gets large, then you will keep accruing the cloud costs. There’s instance usage, and then especially if you want to move data, there are egress charges. It can be very intimidating.
Ken Hillier: Yeah, that’s actually another really good point. And the reason why I think we, as a company have seen a lot of our customers journey to the cloud, right? Even though they’re all using various resources from different hyperscalers, at the end of the day, there’s still a need to maintain on-prem resources.
And you nailed one of ’em. I mean, if there’s something that is being sustained, constant resource requirement, it’s probably gonna be cheaper building and maintaining that on-prem than paying for that in one of the clouds.
Justin Parisi: Would you say that’s the case when you’re dealing with like GPU servers? Right? So you have these very expensive GPU servers and maybe you’re not using them all the time, right? But the cloud has GPU servers that you can rent or you can lease. So in those use cases, it might be more cost effective to just use the cloud for your compute because that CapEx and OpEx comparison is gonna be vastly different.
Ken Hillier: Yeah, that’s the use case I was talking about earlier with the one customer. Another thing is, is that the clouds, they have a lot of different analytic and AI services available, and some of these boutique services, it’d be cost prohibitive for all customers to start building the stuff up themselves.
So if they are not going to be using something all the time, renting it like we rent cars is definitely a good way to approach that, but it also allows them access to resources that they don’t have to build on-prem for when they need it.
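For readers weighing the rent-versus-own question for GPU servers, here is a small break-even sketch. All dollar figures in the example are made up for illustration, and the function names are hypothetical; the point is only the shape of the calculation.

```python
HOURS_PER_YEAR = 8760


def breakeven_hours(onprem_capex, onprem_hourly_opex, cloud_hourly_rate):
    """Hours of use at which buying costs the same as renting.

    Returns None when renting is never more expensive per hour than
    running your own hardware, so there is no break-even point.
    """
    saving_per_hour = cloud_hourly_rate - onprem_hourly_opex
    if saving_per_hour <= 0:
        return None
    return onprem_capex / saving_per_hour


def breakeven_utilization(hours, lifetime_years):
    """Fraction of the hardware lifetime it must be busy to break even."""
    return hours / (lifetime_years * HOURS_PER_YEAR)


# Made-up numbers: a $200,000 server costing $5/hour to run,
# compared with a $30/hour cloud instance, over a 3-year lifetime.
hours = breakeven_hours(200_000, 5.0, 30.0)
utilization = breakeven_utilization(hours, 3)
```

With these invented inputs the server has to be busy roughly 30 percent of the time over three years before owning wins, which matches the conversation: sustained workloads favor on-prem, occasional bursts favor renting.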
Rick Huang: Yeah, I think we’ve talked about some drawbacks in the cloud, but I just want to say that you can also use our Spot Ocean and associated products in AWS to manage your cloud costs.
Ken Hillier: You know, that’s actually a really great point, Rick. I had a conversation, actually it was really recently: we’re great at data management and data motion, you know, getting, moving the data around, protecting it, whatever a customer needs from us, but we’re not just a data storage company anymore. We have a portfolio that allows us to help customers in so many more ways than just focusing only on the data. And you’re mentioning one of the key areas as well. There are a few other solutions that we have out there that could also help customers increase visibility into how their data’s being used, either with Cloud Insights or with Data Sense.
Rick Huang: Yes. And we are currently also developing a solution that runs Apache Spark in a hypervisor.
Justin Parisi: So basically an OVA that does all that for you? Is that what you’re talking about?
Rick Huang: Yes.
Justin Parisi: I think people would like that. Like an appliance, a virtual appliance that takes care of my hybrid cloud needs.
Rick Huang: Yeah. It sounds cool, right?
Justin Parisi: Yeah, it does.
Ken Hillier: Well, Justin, you had mentioned a little earlier, kind of in jest about not wanting to get down into the details of the technology or whatnot. Just having something available so that data scientists could get the job done that they’re looking at doing.
I think these are the type of solutions that we’re providing our customers that are going to help the data scientists focus on what they need to do versus building up the infrastructure so they can begin the work that they want to.
Justin Parisi: So with an Apache Spark solution, you touched on some of the protocols in use and mainly we’re looking at NFS and object. But why? Why would I be interested in using those and what are some of the benefits to each?
Ken Hillier: From an industry standpoint, the trend that I’ve been seeing is that object is becoming much more prolific. It was born in the cloud, it was born with Amazon, but now I’m seeing customers looking at object not just for deep storage on-prem. I’m seeing customers trying to use object as a first-party protocol, not only for analytics, but also for just general object storage to augment. One customer in particular is trying to start leveraging object directly, rather than HDFS, as mass storage for the next generation of their data lake. That’s a perfect example of the industry trend we’re seeing. I don’t see CIFS and NFS going anywhere anytime soon. I mean, obviously they’ve been around for decades and they’ll probably be around for decades more. But one of the things that’s starting to get much more common is customers looking at how they’re gonna leverage object as another protocol for allowing their end users access to the storage.
Rick Huang: I would say for AI applications directly using NFS gives you a much faster result and you can also take advantage of our ONTAP capabilities. But as Ken said, people are moving towards directly using objects, and we’ve worked with several ISVs like Domino Data Lake that can directly take objects and then train your model and put the model into production without ever needing to use NFS.
Justin Parisi: So Domino’s has a data lake, is it pepperoni?
Rick Huang: Domino data lake. We, yeah…
Ken Hillier: [laughs]
Justin Parisi: I gotcha! You’re like, what are you talking about, you idiot?
Anyway. So, ONTAP 9.12.1 introduced the ability to present NAS file shares as objects. So when I heard the feature first announced, I was like, okay, this is the perfect feature for something like an AI workflow.
Am I off base there?
Rick Huang: This duality is actually very important and we are waiting for it to be released.
Ken Hillier: And from my experience, no, you’re not off base, Justin. When I’ve been involved with testing with customers with analytics, they wanna see primarily NFS, right? That’s where the performance is at, that’s where the lion’s share of the protocol support for these analytic applications is at.
But there is growing object support and we’re testing object access. We’re testing it out within Spark workloads. We’ve done it within CPOC and customers are making sure that this is a path forward that we can help them go down should they decide to travel that road.
Justin Parisi: Yeah, and it’s great for customers that aren’t quite ready to make the leap to object. They still have NFS file shares, right? But they do wanna start dabbling in it a bit. And it’s a good way to transition. And it does all this without having to move any data. And this goes back to our data migration and how challenging that can be. If it’s already in place, if you’re just presenting it in another way, there’s very little you have to do other than just presenting it.
Ken Hillier: Absolutely. And then when you have these hybrid environments where some part of the workflow requires NFS and another part of the workflow may require object, you’re often working with data in place.
So you’re not copying data, you’re not doing anything unnatural or spending more time working with it. You write it out and it’s accessed in place in a storage-efficient process.
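The data-in-place idea Justin and Ken describe can be sketched in a few lines of Python. This is a minimal illustration, not NetApp’s implementation: it assumes an S3 bucket has been mapped to the same directory that clients mount over NFS, so that a file’s object key is simply its path relative to that directory. The mount point `/mnt/datalake`, the bucket name `datalake`, and both helper functions are hypothetical names invented for this example.

```python
from pathlib import PurePosixPath

# Hypothetical values for illustration only; neither comes from the episode.
NFS_MOUNT = PurePosixPath("/mnt/datalake")  # where clients mount the NFS export
BUCKET = "datalake"                         # bucket assumed to map to that same directory


def nfs_to_s3a(nfs_path: str) -> str:
    """Return the s3a:// URI for a file that lives under the NFS mount."""
    # relative_to raises ValueError if the path is outside the mount point.
    relative = PurePosixPath(nfs_path).relative_to(NFS_MOUNT)
    return f"s3a://{BUCKET}/{relative.as_posix()}"


def s3a_to_nfs(uri: str) -> str:
    """Return the NFS path for an object in the assumed bucket."""
    prefix = f"s3a://{BUCKET}/"
    if not uri.startswith(prefix):
        raise ValueError(f"{uri!r} is not in bucket {BUCKET!r}")
    return str(NFS_MOUNT / uri[len(prefix):])


if __name__ == "__main__":
    path = "/mnt/datalake/train/part-0001.parquet"
    print(nfs_to_s3a(path))
    print(s3a_to_nfs(nfs_to_s3a(path)))
```

Under those assumptions, one stage of a workflow could read the file by its NFS path and another could read the same bytes by its `s3a://` URI, with no copy in between.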
Rick Huang: I think at the end of the day, customers just want a single pane of glass where they can manage their workflows.
Ken Hillier: I would agree with that, Rick. I think our customers are becoming more conscious about how much time they have to spend on care and feeding of the infrastructure versus how much they need to pay attention to what the business actually needs, what the data scientists need in order to achieve the goals that the whole business has.
So operations I think is a very important aspect of it and with ONTAP being able to present both object as well as NFS to analytic applications, it makes it really easy to manage the storage infrastructure piece through the same workflows without doing anything different.
Rick Huang: Have you guys played with ChatGPT?
Justin Parisi: I personally have not. I’ve read a bit about it, but yeah, I have not played with it myself.
Ken Hillier: I’m in the same boat. But it’s very exciting what I’ve been reading about it.
Rick Huang: I have played with it and I’m very excited about its future prospects.
In the news you read that high school English is dead, but on the other hand some teachers have started to use ChatGPT or something similar to teach their students, and the way to do it is actually called co-editing, right? You ask it for a book summary and then it gives you something, but you ask further questions.
You don’t just submit an essay that ChatGPT wrote. And these large language models for sure will make our lives easier and increase productivity. One of the key benefits of using a language model like GPT-3 in an NLP application is its ability to learn from large amounts of data and generate high-quality text output that closely resembles human language, even in complex and nuanced contexts such as legal, medical, or technical writing.
Justin Parisi: I think I’m just gonna use that to do my job now.
Ken Hillier: I was talking with somebody who is actually leveraging a ChatGPT based application to help her code.
Justin Parisi: Yeah, I mean, coding’s a really good use case for that. And scripting. Yes, like you’re an admin, like trying to create a script. And traditionally you go in and you’re searching through Stack Overflow and all these other blog examples and some things may or may not work.
If you just plug it into ChatGPT and you say, “Hey man, ChatGPT, write me a script that looks for all my users in my AD environment,” it’s just gonna find all that stuff for you and it’s gonna give you something that’s probably worlds better than you could have cobbled together on your own, in a lot less time.
Rick Huang: That’s true. But ChatGPT does not, well, it generates code, but it’s not very good. If you just use GitHub Copilot, it integrates with your IDEs like PyCharm, Eclipse or whatever. It generates pretty good code based on natural language input. And the problem with that is for enterprises, if you have intellectual property like us, 30 years of ONTAP code, you wouldn’t want to open it for Copilot to learn.
So yeah, that would be another opportunity that we can develop in-house for code documentation, translation, searching, such that the ONTAP engineers can do their job in less time.
Ken Hillier: Yeah. You know what, you’re absolutely right, Rick, but you know what thought just struck me about ChatGPT? You guys remember the controversies around calculators when we were back in school?
Justin Parisi: Yes.
Ken Hillier: I’m beginning to wonder if ChatGPT is going to become the modern version of the calculator controversy.
Justin Parisi: There’s something to be said for using tools in real world situations, right? Like calculators or ChatGPT or whatever. I think the point of schools not using those is you don’t get that intellectual base, right?
And I don’t know how much it matters, cuz when’s the last time you used calculus? I mean, maybe Rick uses it…
Rick Huang: Yeah, I think…
Justin Parisi: I don’t use it.
Rick Huang: The more important thing is how to leverage all the tools, right? It’s not just memorizing knowledge. It’s how you leverage those tools to achieve something, make a better future.
Justin Parisi: Yeah. But I mean, you’re basically talking about “I know how to put it into the calculator” versus “I know why two plus two equals four.” Memorizing facts is one thing, but knowing how and why things work is another.
And I do this on my day-to-day job too. Like if I’m trying to learn something new, yeah. I could read about it and use a tool to try to help me with that. But what really works for me and for many people, I think is going in, trying it out, breaking it, trying to fix it.
And then that kind of really teaches you all the little base inner workings of everything. And that way if you run into a problem that something like ChatGPT can’t solve, cuz I mean, ultimately these AI models need data. And if you’re not feeding it data, if everyone’s just using ChatGPT, then where’s it getting its information? How is it gonna learn if we can’t learn?
Ken Hillier: Yeah.
Rick Huang: Yeah, and a caution I have to add is that these generative AI models are very good at saying something that makes sense and looks good. Suppose we want to buy an EV, and ask it, “Hey, what are the top five selling EVs from the past year?”
And it gives you five EV models. You have to really check because out of those five, I will say one of them, from an automotive industry expert perspective, looks dubious.
Justin Parisi: Is that right? Is it listing something that is like, whoa, I can’t, I can’t get that. That’s gonna be terrible. Is that…
Rick Huang: Yeah, yeah, yeah. Something like that.
Justin Parisi: So I wonder if these companies are manipulating, right? Because you can manipulate the datasets…
Rick Huang: At the end of the day. Yeah. Yeah. That’s also another thing, but you have to be careful. Yeah. You don’t just use whatever they generate.
Justin Parisi: Yeah. Yeah, and I guess that’s the case with pretty much anything, right? I mean, anything you read on the internet, you should always trust but verify. I wouldn’t even say trust it. Mistrust, and verify, and then maybe trust it. But you know, ultimately, yeah, information’s only as good as the source.
Rick Huang: Take it with a grain of salt, and then make your own decisions and take responsibility.
Justin Parisi: Yep. All right, so, one area that you can find information that you can trust is Rick’s blog. So Rick, tell us about this blog you wrote.
Rick Huang: Right, so the blogs I’ve covered are Hybrid Cloud Solutions with Apache Spark and NetApp AI, and the other one, Deep Learning with Apache Spark and NetApp AI: Horovod Distributed Training. If you want to dive deeper into data lake migrations, I will suggest technical reports from my colleagues, and I will share the links to my blog and also the GTC poster.
Justin Parisi: So if we wanted to reach you, Rick, for more information, how would we do that again?
Rick Huang: Right. My email is rick.huang – r-i-c-k.h-u-a-n-g – @netapp.com
Justin Parisi: And Ken.
Ken Hillier: I can be reached at Ken Hillier, k-e-n dot h-i-l-l-i-e-r at netapp.com. And as I said in the beginning, you can also get me at firstname.lastname@example.org.
Justin Parisi: That’s way easier.
Ken Hillier: It is.
Justin Parisi: You should have led with that…
Rick Huang: Yeah. That’s good. Very good email.
Ken Hillier: When I got that, I got a lot of spam with it too.
Justin Parisi: Oh yeah. That’s probably accurate there.
Rick Huang: You gotta take the good with the bad. A good email address comes with a lot of stress.
Justin Parisi: All right. That music tells me it’s time to go. If you’d like to get in touch with us, send us an email to email@example.com or send us a tweet @NetApp.
As always, if you’d like to subscribe, find us on iTunes, Spotify, GooglePlay, iHeartRadio, SoundCloud, Stitcher, or via techontappodcast.com. If you liked the show today, leave us a review. On behalf of the entire Tech ONTAP podcast team, I’d like to thank Rick Huang and Ken Hillier for joining us today. As always, thanks for listening.
Podcast Intro/Outro: [Outro]