How many hops does it take to get to the center of a GPU? For most data architectures in the AI Era, the answer is an alarming 11. That’s a lot of steps to take for the massive volumes of data required to set all those weights and biases in billions of parameters.
Imagine building a skyscraper instead of a deep learning model. You’d need to get all the materials to the job site in a timely fashion, and in a particular order. Now, consider adding 10 stops to every item’s path, from source to destination. The cost increase would be massive!
And yet, that’s what happens every day, in almost every scenario where companies are training AI models. The challenge largely stems from the complexities that occur at the intersection of networks and file systems. Remember that Jimi Hendrix song, “Crosstown Traffic”?!
In a past life, yours truly lived in Hastings-on-Hudson, New York, for a brief, shining few months. While I’d often take the train, sometimes I’d drive into town. The time it took to go 10 miles from home to midtown Manhattan was maybe 12 minutes. But what about the final leg of that trip?
I worked in the Empire State Building, 45th floor, such a beautiful view! But getting from the West Side Highway through the handful of blocks necessary to reach my usual parking garage? That could take 10, 15, 20, even 30 minutes! Why? Because of… Crosstown Traffic!
In our AI training scenario, there are many chokepoints for data that can muck up the process. Maybe the network is slow. Sometimes there’s a lot of activity in the file system. Perhaps there’s a hardware failure, a flash drive crashes, or a whole slew of requests come through at once.
The end result is time lost, anywhere from minutes to days or more. That pushes out time-to-value, which puts pressure on the entire project. It’s time lost before realizing, perhaps, that the design is suboptimal, or that the training data did not generate a desired result.
Hammer Time
In a recent Briefing Room Webinar, several executives from Hammerspace joined me in describing the nuts and bolts of this very challenge. And they should know! The company’s data platform is used in the training of Llama 2 and 3, Meta’s open-source Large Language Models.
First up, CMO Molly Presley took us down memory lane (both literally and figuratively), by explaining the critical nature of data processing architectures, and noting how they’ve changed over time. Some of those changes have been so serious as to require major refactoring.
“So if you think about the last 10, 20, 30 years, and Eric and I have talked about this on my Data Unchained podcast, and you think about what we’ve been doing, the architectures were really designed for the storage system owning the data and dictating how it’s retained, protected,” explained Presley.
To appreciate the complexity here, keep those 11 hops in mind. Several of those hops happen inside the storage layer, navigating the inherent file systems. Some of them happen at the network layer. We’ll get to more on that shortly.
Presley continued: “Anything that was done outside the storage system meant you had to make a copy of the data to move it to another application, make a copy of the data to get it into the cloud.”
Anyone who’s ever been bitten by version control knows that there are many devils in those details! Copy proliferation is an issue in almost every information architecture on the planet. And even in cases where vendors focus specifically on solving for that, other issues can arise.
“I can’t always rely on compute,” commented Dave Malek during a research project we conducted last year. He was referring to cloud data warehouse vendor Snowflake’s architecture, which cleverly keeps just one copy of the data while allowing zero-copy clones, so ideas can be explored in a sandbox.
With this example, it’s true that the zero-copy cloning solves the copy proliferation challenge. Well done there! But to Malek’s point, there are times when making a copy of the data would greatly expedite some slicing and dicing that would otherwise rack up serious compute charges.
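For readers who haven’t bumped into zero-copy cloning before, here’s a minimal Python sketch of the copy-on-write idea behind it. To be clear, this is a conceptual illustration, not Snowflake’s actual mechanism, and the class and method names are hypothetical.

```python
# Conceptual sketch of zero-copy ("copy-on-write") cloning.
# This illustrates the general idea, not Snowflake's implementation;
# all names here are hypothetical.

class Table:
    def __init__(self, blocks):
        # A table is just a list of references to immutable data blocks.
        self.blocks = list(blocks)

    def clone(self):
        # A "zero-copy" clone shares the same block references.
        # No data is duplicated at clone time.
        return Table(self.blocks)

    def update_block(self, index, new_data):
        # Only when a clone is modified does it get its own version of that block.
        self.blocks[index] = new_data


base = Table(["block-a", "block-b"])
sandbox = base.clone()               # instant, nothing copied
sandbox.update_block(1, "block-b2")  # divergence happens lazily, per block

print(base.blocks)     # ['block-a', 'block-b']
print(sandbox.blocks)  # ['block-a', 'block-b2']
```

The clone is free until someone writes to it, which is exactly why it shines for sandboxing but can still leave you leaning on compute for heavy slicing and dicing.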
Presley continues: “With Hammerspace, a core innovation we had was moving the file system out of the stack instead of embedding it in the storage system. We’ve raised it above the stack so when users and applications connect to their data plane, to their file system, they just do that once.” There’s a name for that kind of innovation: it’s called User Experience!
But the story gets better. She notes: “If additional data sets are added, if additional data locations are added, different storage tiers are added, given if the user application has the permissions, they can immediately see all of the data in their global data environment. That gives visibility to the entire data set on a global level in all different types of storage environments and storage tiers.”
Think of it as the ultimate layer of data abstraction, an enterprise backplane that manages the shepherding of unstructured data wherever it needs to go, whenever it needs to get there.
“And then when the data needs to be moved, we aren’t making copies, we aren’t adding to this problem of copy proliferation,” says Presley. “Instead, through intelligent automation, it is placed where it needs to be used. If you have remote GPUs or you’re spreading your data across compute clusters, the data is automatically orchestrated to the compute. If you need to tier it for disaster recovery reasons or lower cost storage reasons, that’s also automated.”
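To make that orchestration idea a bit more concrete, here’s a minimal Python sketch of a rule-driven placement policy: a catalog records which tier currently holds each file, and a couple of simple rules push hot data toward the GPUs and cold data toward cheaper storage. The tier names, thresholds, and catalog structure are hypothetical illustrations, not Hammerspace’s actual policy engine.

```python
# Hypothetical sketch of rule-based data placement across storage tiers.
# Tier names, thresholds, and the catalog structure are illustrative only.

CATALOG = {
    "datasets/train/shard-0001": {"tier": "tier1-nas",    "last_access_hours": 0.1},
    "datasets/train/shard-0002": {"tier": "tier2-object", "last_access_hours": 0.2},
    "logs/2023/archive.tar":     {"tier": "tier1-nas",    "last_access_hours": 900.0},
}

def place(path, meta):
    """Decide where a file should live; return the target tier."""
    if meta["last_access_hours"] < 1.0:
        return "tier0-gpu-nvme"   # hot: stage it next to the compute
    if meta["last_access_hours"] > 720.0:
        return "tier3-archive"    # cold: move it to low-cost storage
    return meta["tier"]           # otherwise leave it where it is

for path, meta in CATALOG.items():
    target = place(path, meta)
    if target != meta["tier"]:
        # In a real system this would trigger a background move;
        # here we just record the decision.
        print(f"orchestrate {path}: {meta['tier']} -> {target}")
        meta["tier"] = target
```

The point is that the user never issues a copy command; the policy decides, and the data follows.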
A Shortcut for Data
We’ve established that the traditional data path to the GPU is a long and winding road, fraught with potential bottlenecks. Hammerspace’s Tier 0 dramatically shortens this journey, reducing the number of “hops” from a staggering 11 down to a mere 4. But how exactly does this magic happen?
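Before we answer that, a quick back-of-the-envelope calculation shows why the hop count matters at all. In the Python sketch below, only the 11-versus-4 hop counts come from the discussion above; the per-hop cost and request count are made-up placeholders.

```python
# Back-of-the-envelope comparison of an 11-hop vs. a 4-hop data path.
# The per-hop cost and request count are placeholder assumptions, not measurements.

PER_HOP_MS = 0.5          # hypothetical average cost per hop, in milliseconds
REQUESTS = 1_000_000      # e.g., a million small reads during a training run

for hops in (11, 4):
    total_seconds = hops * PER_HOP_MS * REQUESTS / 1000
    print(f"{hops:>2} hops: ~{total_seconds:,.0f} s of cumulative path overhead")
```

Whatever the real per-hop figure is in your shop, cutting the path from 11 steps to 4 shrinks that overhead by roughly two-thirds.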
Hammerspace CEO and Co-Founder David Flynn explains that Tier 0 leverages existing resources within the GPU server itself:
“What we have learned is that it’s very common for folks to have high-performance flash storage, NVMe, inside of their compute nodes or GPU nodes. Tier 0 is part of a continuum. It’s based on the notion of data orchestration and that data is able to move across a continuum from tier one, even tier two or tier three, even across to other data centers, and flow all the way into the GPU servers themselves.”
Essentially, Tier 0 unlocks the potential of the NVMe drives already present in most GPU servers. This “trapped capacity,” as Flynn calls it, often goes unused because it’s difficult to manage these isolated islands of storage. Hammerspace’s software, however, integrates these NVMe drives into its global data environment, transforming them into a high-performance, shared resource.
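One way to picture that trapped capacity is simply to enumerate the NVMe block devices already sitting inside a GPU server. The short Python sketch below does exactly that on a typical Linux box; it illustrates the raw resource Tier 0 taps into, and is not Hammerspace’s discovery code.

```python
# List local NVMe block devices and their capacity on a Linux host.
# This only illustrates the "trapped capacity" idea; it is not Hammerspace code.

import os

SYS_BLOCK = "/sys/block"

for dev in sorted(os.listdir(SYS_BLOCK)):
    if not dev.startswith("nvme"):
        continue
    with open(os.path.join(SYS_BLOCK, dev, "size")) as f:
        sectors = int(f.read().strip())   # the kernel reports size in 512-byte sectors
    print(f"{dev}: {sectors * 512 / 1e12:.2f} TB")
```

Run that on a GPU node and you’ll usually find terabytes of fast flash that no one else in the cluster can see, which is precisely the problem.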
A Kernel of Truth
This innovation wouldn’t be possible without some clever tinkering within the Linux kernel itself. Trond Myklebust, CTO of Hammerspace and Linux NFS client kernel maintainer, sheds light on this crucial aspect:
“What we’re basically doing is allowing the client to identify the particular situation where it and the NFS server are co-located. Rather than going out of the network stack, once the client has identified that it is actually co-located, it can talk directly to the NFS server, get authorized to access the file, and then basically it is given an ordinary file descriptor. Then it can talk directly to the file system using this file descriptor and read and write the data directly without having to go through any extra hoops or impediments.”
By optimizing the communication between the client and the server within the kernel, Hammerspace eliminates unnecessary network trips, significantly reducing latency and boosting performance.
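Here’s a user-space caricature, in Python, of the decision Myklebust describes. The real optimization lives inside the Linux kernel’s NFS client and server; this sketch only shows the shape of it, and the mount-to-export mapping is a hypothetical stand-in.

```python
# User-space caricature of the co-location check described above.
# The actual optimization is implemented inside the Linux kernel's NFS client
# and server; this sketch only illustrates the decision: if the data is local,
# skip the network path and use an ordinary file descriptor.

import os

# Hypothetical mapping of an NFS mount point to the server-side export path
# on this same machine, if the server happens to be co-located here.
LOCAL_EXPORTS = {"/mnt/hammerspace": "/srv/exports/hammerspace"}

def open_for_read(nfs_path):
    for mount, export in LOCAL_EXPORTS.items():
        if nfs_path.startswith(mount) and os.path.isdir(export):
            # Co-located: translate to the local export and open it directly.
            local_path = export + nfs_path[len(mount):]
            return open(local_path, "rb")
    # Not co-located: fall back to reading through the NFS mount (network path).
    return open(nfs_path, "rb")

# Example (hypothetical path):
#   fh = open_for_read("/mnt/hammerspace/datasets/shard-0001")
```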
Hyperscale NAS: The Backbone of Tier 0
Underlying Tier 0 is Hammerspace’s Hyperscale Network Attached Storage (NAS) architecture. This innovative approach, as Flynn explains, takes the control path out of the data path, similar to high-performance file systems like Lustre or IBM GPFS. However, unlike those systems, Hammerspace achieves this using standard NFS, making it much easier to deploy and manage.
“This is just using the built-in data path that’s inside of the kernel that’s made to move data zero copy and with Remote Direct Memory Access (RDMA) capability. It’s the NFS client to the NFS server, the entire data path, top to bottom, unadulterated virgin Linux,” he says.
This streamlined architecture, coupled with the kernel optimizations, allows for incredibly efficient data movement, both within the server and across the network. And because the capability is built into the Linux kernel, it opens the door to just about any enterprise environment in the world, since Linux has been the standard server OS in those environments for years.
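If you want to feel the zero-copy principle from user space, the Python sketch below uses os.sendfile(), which asks the kernel to move bytes between two file descriptors without ever bouncing them through an application buffer. The in-kernel NFS and RDMA path Flynn describes is a different mechanism, but the underlying idea of keeping data out of needless copies is the same.

```python
# Small illustration of a kernel-mediated, zero-copy transfer from user space.
# os.sendfile() has the kernel move the bytes directly between descriptors;
# the Python process never touches the data itself.

import os

def copy_zero_copy(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent

# Example (hypothetical paths):
#   copy_zero_copy("/data/shard-0001", "/nvme/shard-0001")
```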
While the performance gains of Tier 0 are undeniable, the benefits extend beyond just speed. Presley highlights the broader impact: “With our linear scalability, whether you have 10 nodes or a thousand, the benefits are the same, just in the appropriate ratio to nodes.”
This scalability makes Tier 0 a viable solution for a wide range of deployments, from small clusters to massive data centers. Additionally, by utilizing existing NVMe capacity, Tier 0 can lead to significant cost savings by reducing the need for external storage. That’s a message every CIO and CFO can agree on!
A Paradigm Shift
Tier 0 represents a fundamental shift in how we think about data management in the era of AI. By shortening the distance between data and compute, Hammerspace unlocks new levels of efficiency and performance. This innovation not only accelerates AI workloads but also simplifies data management, making it easier for organizations to extract value from their data.
As Flynn puts it: “Hyperscale NAS is what has allowed us to introduce data orchestration and to decouple data from storage, which has always been kind of that holy grail.”
With Tier 0, Hammerspace has achieved this holy grail, paving the way for a future where data is no longer a bottleneck but a catalyst for innovation. If Jimi Hendrix were still around today, and were given a demo of Hammerspace in action, he’d likely say something like this: “Excuse me, while I kiss the cloud!”