When you are designing applications that run at the scale of an entire data center, are made up of hundreds to thousands of microservices running on countless individual servers, and must be invoked within microseconds to give the impression of a monolithic application, building a fully connected, high-bandwidth, two-tier Clos network is a must.
This is especially true since application servers, middleware servers, database servers, and storage servers can be located anywhere in a data center. You never know what will need to talk to what on the network, so you want to overprovision bandwidth and connectivity and keep latencies as low as possible.
But a high-bandwidth Clos network is not necessarily the best architecture for an AI training system, especially given how expensive and complex networking has become for AI clusters. Something needs to change. And that is why researchers at MIT's Computer Science and Artificial Intelligence Laboratory have been working with their networking colleagues at Meta Platforms to think outside the box – or, perhaps more accurately, to think about what is already in the box – in an effort to eliminate an expensive switching layer from AI networks, thereby dramatically reducing costs without compromising AI training performance.
The resulting rail-only network architecture that CSAIL and Meta Platforms have developed was described in a recent paper and presented this week at the Hot Interconnects 2024 conference, and it definitely wins the "Well, that's pretty obvious when you think about it" award from The Next Platform. We love such "obvious" insights because they often turn technologies on their head, and we believe that the insights of the CSAIL and Meta Platforms researchers have the potential to transform network architectures for AI systems in particular.
Before we get into this rail-only architecture – which we might have called an inverted spine network based on how it is actually implemented – let's set the stage a bit.
Clos networks are a way to connect every node, or every element within a node (like a GPU or DPU), to all of the other nodes and elements across an entire data center. Clos networks are not the only way to create all-to-all connectivity between devices on a network. Many supercomputing centers today use Dragonfly topologies, but when you add machines to a Dragonfly you have to rewire the entire network, unlike a Clos topology, which lets you expand relatively easily but does not deliver the consistent latency across the network that a Dragonfly does. (We discussed these topology issues back in April 2022 when we analyzed Google's proprietary "Aquila" network interconnect, which is based on a Dragonfly topology.)
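To make the shape of that fabric concrete, here is a toy sketch in Python – our own illustration, not something from the paper – showing that in a two-tier leaf/spine Clos every pair of endpoints is at most three switch hops apart:

```python
# Toy two-tier (leaf/spine) Clos fabric, purely illustrative. Every leaf uplinks
# to every spine, so any endpoint can reach any other via leaf -> spine -> leaf.
# The port counts below are made up for the sketch, not taken from the paper.

NUM_SPINES = 4
HOSTS_PER_LEAF = 16

def path(src_host: int, dst_host: int) -> list[str]:
    """Return the switch hops between two hosts in the toy fabric."""
    src_leaf = src_host // HOSTS_PER_LEAF
    dst_leaf = dst_host // HOSTS_PER_LEAF
    if src_leaf == dst_leaf:
        return [f"leaf{src_leaf}"]                    # one hop inside the leaf
    spine = src_host % NUM_SPINES                     # any spine will do
    return [f"leaf{src_leaf}", f"spine{spine}", f"leaf{dst_leaf}"]

print(path(3, 7))     # same leaf: ['leaf0']
print(path(3, 100))   # across leaves: ['leaf0', 'spine3', 'leaf6']
```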
As you know, today's largest AI training systems require somewhere between 24,000 and 32,000 GPUs to train a large model with trillions of parameters in a reasonably timely fashion. As we previously reported, Meta Platforms is using 24,576 GPUs in the cluster that trains its Llama 3.1 405B model, and CSAIL and Meta Platforms expect next-generation models to be trained on 32,768 GPUs in a single cluster. These Clos networks are built from Ethernet leaf and spine switches, all supporting Remote Direct Memory Access (RDMA), which lets any GPU share data with any other GPU on the network using this all-to-all topology.
Weiyan Wang, a PhD student at CSAIL, gave the rail-only architecture presentation at Hot Interconnects, and said that building a high-bandwidth Clos network to connect more than 32,000 GPUs would cost $153 million, with the network alone consuming 4.7 megawatts of power. For further comparison, the paper is a bit more specific about network speed, saying that a full-bisection-bandwidth Clos fabric connecting 30,000 GPUs with 400 Gb/sec links would cost $200 million. Suffice it to say, that is a lot of money – much more than any hyperscaler or cloud builder typically spends to link 4,096 server nodes to each other.
Here is a very interesting chart that Wang put together showing the interplay between network cost and network performance as AI clusters scale:
Doubling up to 65,536 GPUs would push the network cost to $300 million at 400 Gb/sec port speeds, with the network consuming around 6 megawatts of power.
Most GPU clusters running large language models use what is known as a rail-optimized network, a variant of the leaf/spine network that is familiar to readers of The Next Platform, and that is what the comparisons in the data above are based on. It looks like this:
You have to somehow organize the compute elements and the way work is dispatched to them. The interesting thing about these rail-optimized networks is that they group compute engines of the same rank, across nodes, onto the same rail. So the first compute engine in each node is connected to one leaf switch, the second compute engine in each node is connected to another leaf switch, and so on.
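As a rough sketch of that wiring pattern – the names and structure are our own illustration, not code from the paper – the rail a GPU sits on is simply its local rank within its node:

```python
# Minimal sketch of rail-optimized wiring, assuming eight-GPU nodes and one
# rail per local GPU rank. Illustrative only.

GPUS_PER_NODE = 8   # e.g. an HGX/DGX H100 node

def rail_of(node_id: int, local_rank: int) -> int:
    """GPU `local_rank` in every node attaches to the leaf switches of the same rail."""
    return local_rank

def build_rails(num_nodes: int) -> dict[int, list[tuple[int, int]]]:
    """Group the (node, local_rank) pairs of the cluster by rail."""
    rails = {r: [] for r in range(GPUS_PER_NODE)}
    for node in range(num_nodes):
        for rank in range(GPUS_PER_NODE):
            rails[rail_of(node, rank)].append((node, rank))
    return rails

rails = build_rails(num_nodes=128)   # the 128-node example that follows
print(len(rails), "rails,", len(rails[0]), "GPUs per rail")   # 8 rails, 128 GPUs per rail
```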
To give a more precise – and, as you will see, relevant – example, Wang showed how a cluster of 128 of Nvidia's eight-way DGX H100 nodes lashes its GPUs together with a total of 16 leaf switches, with two leaf switches per rail to cover the eight different GPU ranks in the cluster:
Here is the insight that the researchers at CSAIL and Meta Platforms had. They wondered what the traffic patterns across the rails and up to the spine switches look like during the training of an LLM, and they made an amazing and very useful discovery: the vast majority of the traffic stays within the rails and does not cross between them:
The tests that CSAIL and Meta Platforms ran were not against the social network's own Llama 3 models, but against variants of the OpenAI GPT family of models with different parameter counts.
And here is another breakdown of the Megatron GPT-1T model’s traffic patterns:
Whether it is pipeline parallelism, tensor parallelism, or data parallelism, traffic very rarely has to cross those expensive spine switches that link the rail leaf switches to each other. Aha!
So what you can do is just cut the heads off the network. Get rid of spine aggregation switches completely.
But wait, you say. What about the rare cases where you do need to move data across the rails? Well, it just so happens that the HGX system board inside a DGX server (or one of its clones) contains a set of very high bandwidth, very low latency NVSwitch memory fabric switches. And instead of pumping data from the leaves up to the spines and back down across the rails, you can use the NVSwitch fabric to push it over to an adjacent rail.
Genius!
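A minimal way to picture that forwarding decision – our own sketch, reusing the rail numbering from the earlier snippet, not code from the paper – looks something like this:

```python
# Sketch of rail-only forwarding: same-rail traffic goes out through the rail's
# leaf switch; cross-rail traffic first hops across the node's NVSwitch fabric
# to the local GPU that sits on the destination rail. Illustrative only.

GPU = tuple[int, int]   # (node_id, local_rank); the local rank doubles as the rail id

def route(src: GPU, dst: GPU) -> list[str]:
    src_node, src_rail = src
    dst_node, dst_rail = dst
    if src_node == dst_node:
        return ["NVSwitch (stays inside the node)"]
    if src_rail == dst_rail:
        # The common case, handled exactly as in a rail-optimized network.
        return [f"leaf switch of rail {src_rail}"]
    # The rare cross-rail case: no spine to climb, so cross the NVSwitch first.
    return [f"NVSwitch hop to the local GPU on rail {dst_rail}",
            f"leaf switch of rail {dst_rail}"]

print(route((0, 0), (5, 0)))   # same rail: one leaf-switch hop
print(route((0, 0), (5, 3)))   # cross rail: NVSwitch hop, then rail 3's leaf switch
```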
Behold, what was right in front of our eyes all along, the rail-only network:
And that is why we would call it an inverted spine network. It is not that a spine is never needed, but rather that the NVSwitch has enough capacity to do the job for the modest amount of bandwidth and time for which it is needed. (AMD does not ship an Infinity Fabric switch, so this trick may not work with AMD's GPUs.)
Of course, you have to be careful here. For this to work, the model shards and data replicas used for tensor and data parallelism have to be placed so that the GPUs that talk to each other sit on the same rail – or inside the same node, as sketched below.
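One way to think about that placement constraint – a sketch assuming a conventional Megatron-style 3D parallel layout, not the authors' actual scheduler – is that tensor-parallel groups live inside a node while data-parallel peers keep the same local rank:

```python
# Sketch of rail-friendly placement for 3D parallelism. Tensor-parallel groups
# stay inside a node (their traffic rides the NVSwitch fabric), while
# data-parallel peers share a local rank, so their traffic never leaves its rail.
# Assumes eight-GPU nodes; names are illustrative.

GPUS_PER_NODE = 8

def tensor_parallel_group(node_id: int) -> list[tuple[int, int]]:
    """All eight GPUs of one node: communication stays on the NVSwitch fabric."""
    return [(node_id, r) for r in range(GPUS_PER_NODE)]

def data_parallel_group(local_rank: int, num_nodes: int) -> list[tuple[int, int]]:
    """The same local rank in every node: every member sits on the same rail."""
    return [(n, local_rank) for n in range(num_nodes)]

# Quick sanity check that a data-parallel group never crosses rails.
group = data_parallel_group(local_rank=3, num_nodes=128)
assert len({rank for _, rank in group}) == 1   # exactly one rail
```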
Below you can see the result of this (relatively speaking) simple conversion from rail-optimized to rail-only networking in terms of the cost reductions for switches and transceivers, which by far dominate the cost of the overall network:
Nvidia may regret having such a powerful switch at the heart of the HGX system board. . . . But probably not. Even Nvidia knows that networking cannot account for 20 percent to 25 percent or more of the system cost if AI is to proliferate.
Going back to the cluster of 128 DGX H100 servers used in the example above: with the rail-optimized network, you need twenty 128-port switches across the spines and leaves to link those 1,024 GPUs to each other on the backend network, plus 2,688 transceivers to connect the GPUs to the leaves and the leaves to the spines. With the rail-only network, you end up with eight switches for the rails and only 1,152 transceivers to put the GPUs onto eight separate rails, and you use the hundreds of NVSwitch ASICs already on the HGX boards as a rarely used inverted spine aggregation layer. That saves 41 kilowatts of power and shaves $1.3 million off the cost of the network.
In benchmark tests, this rail-only approach exacted no performance penalty on 3D parallelism for LLM training, and only an 11.2 percent performance overhead on all-to-all communication in the cluster. In the LLM models tested, all-to-all communication made up only 26.5 percent of total communication, so we are talking about a penalty of roughly 3 percent (0.112 × 0.265 ≈ 0.03) on communication performance. And remember that communication time is only a fraction of the total wall time of an LLM training run, so the effect of using the NVSwitches as an occasional spine is negligible.
This may or may not be the case for other types of data analytics, AI training, or HPC simulation workloads. But it will be interesting for people to find out.