In the technology world, 2023 was dominated by headlines about machine learning (ML) programs such as ChatGPT and DALL-E. Large language models (LLMs) and generative AI fascinated people with their ability to generate text from almost any prompt, and an image created with the generative AI program Midjourney even won an art contest. Alongside the fascination with these new capabilities, many articles have been written about the potential future impact of AI on jobs and society. While the societal impact of artificial intelligence (AI) and machine learning will undoubtedly be felt in the future, AI is impacting the design and operation of data centers today. The growth of ML has upended hyperscalers' data center designs and growth plans, forcing them to rethink traditional data center operations to meet the unique demands of ML workloads while finding ways to scale even faster than they have in the past.

Deploying the GPUs used for machine learning differs from deploying traditional CPUs in many ways. The scale of the challenge can be seen in Meta's decision to pause construction on multiple data center projects as part of a broader plan to drastically redesign 11 data centers around the world to meet the needs of advanced AI workloads [1]. Some of these differences include the following:

  • GPUs consume significantly more power than traditional CPUs. This affects everything from rack design, which must support much higher power densities, to securing power availability in locations where utility supply is limited.

  • Higher-power GPUs require new cooling approaches to deal with the heat they generate, which means considering liquid cooling rather than relying solely on traditional forced-air cooling.

  • ML training uses very large parameter sets that must be exchanged among servers during each iteration with extremely low latency. This increases the demand for high-bandwidth network connections across the distributed infrastructure used for ML workloads and will accelerate the transition to next-generation high-speed optics such as 800G optics.

  • For large ML clusters with large parameter sets, the time spent exchanging data during training can range from 30% to over 50% of the overall training time. Because the very expensive GPUs sit idle during the parameter exchange, optimizing the network becomes a key lever for improving the efficiency of ML workloads, as sketched just below.
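
As a rough illustration of how much that idle time costs, the short sketch below converts the 30% and 50% communication figures above into effective GPU utilization; the overlap parameter (communication hidden behind computation) is hypothetical and included only to show the lever it provides.

```python
# Rough illustration using the 30%-50% figures above: the fraction of each
# training iteration spent on parameter exchange translates directly into
# idle time on expensive GPUs (the overlap parameter is hypothetical).

def effective_utilization(comm_fraction: float, overlap: float = 0.0) -> float:
    """Fraction of wall-clock time the GPUs actually spend computing."""
    exposed_comm = comm_fraction * (1.0 - overlap)   # communication not hidden behind compute
    return 1.0 - exposed_comm

for comm in (0.30, 0.50):
    print(f"{comm:.0%} communication time -> {effective_utilization(comm):.0%} GPU utilization")
```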

While articles could be written on each of the individual items above, this article focuses on optimizing the optical network to address the needs of machine learning.

While traditional data center workloads are small and bursty (think of a Google search result or a Facebook post with an adjacent ad placement), machine learning workloads tend to be long-lived with very large data transfers. As an example, GPT-3.5, the model behind the original ChatGPT, was trained with roughly 175 billion parameters and required an estimated 2 million GPU-hours of computation, or about three weeks on a cluster of 4,000 GPUs. Machine learning also requires ultra-low latency for training and inference, whereas the latency requirements for traditional data center workloads are much less demanding [2].
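
The wall-clock figure above is simple arithmetic, sketched below under the optimistic assumption of near-perfect scaling across the cluster.

```python
# Back-of-the-envelope check of the figures above (estimates, and assuming
# near-perfect scaling across the cluster).

gpu_hours = 2_000_000    # estimated total GPU-hours for the training run
cluster_size = 4_000     # GPUs training in parallel

wall_clock_hours = gpu_hours / cluster_size
wall_clock_weeks = wall_clock_hours / (24 * 7)

print(f"{wall_clock_hours:.0f} hours, about {wall_clock_weeks:.1f} weeks")   # ~500 h, ~3 weeks
```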

The network design for a traditional data center is a multi-stage Clos network, with racks of servers and storage devices connected through a hierarchical tree structure to all other devices. To manage cost, the uplink bandwidth is oversubscribed and statistically multiplexed, with capacity shared among flows based on need. Since the data packets are small, latency caused by contention at the switches has a minimal impact on the overall response time for items like a search query.
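
A quick way to see how statistical multiplexing keeps costs down is to compute a leaf switch's oversubscription ratio. The sketch below uses hypothetical port counts and link speeds, not any particular production design.

```python
# Hypothetical leaf switch in a Clos/leaf-spine fabric: the uplinks offer less
# capacity than the servers could demand, and statistical multiplexing of
# small, bursty flows makes that acceptable.

servers_per_leaf = 48      # server-facing (downlink) ports
server_link_gbps = 100     # speed of each server link
uplinks_per_leaf = 8       # ports toward the spine layer
uplink_gbps = 400          # speed of each uplink

downlink_capacity = servers_per_leaf * server_link_gbps   # 4,800 Gb/s of possible demand
uplink_capacity = uplinks_per_leaf * uplink_gbps          # 3,200 Gb/s toward the spines

print(f"oversubscription ratio: {downlink_capacity / uplink_capacity:.1f}:1")   # 1.5:1
```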

As mentioned earlier, the data flow for machine learning training involves the transfer of extremely large data sets among all the GPUs after each iteration. Processing this data involves collective operations such as All-Reduce, in which data from different GPUs is combined into a global result and then distributed back to each processor. Since All-Reduce operates in stages, it suffers from a tail problem: the next step cannot begin until the slowest data exchange among all the processors completes. For this reason, contention in a hierarchical Clos network is not ideal for machine learning applications. With the drive for the lowest possible latency, direct rack-to-rack interconnection that looks more like a mesh network is preferred in ML data centers. Since a full mesh network can be impractical to implement, network architectures such as dragonfly or rail-connected networks are being proposed to optimize networks for ML models with very large data sets [3].
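
The tail effect can be made concrete with a simple timing model. The sketch below assumes a ring-style All-Reduce (one common implementation) and hypothetical link speeds, and shows how a single slow or contended link gates the whole collective.

```python
# Illustrative timing model (not a simulation of any specific fabric): a ring
# All-Reduce moves roughly 2*(N-1)/N of the gradient volume over each GPU's
# link, and the collective finishes only when the slowest exchange finishes.

def ring_allreduce_seconds(payload_gb: float, link_gbps: list[float]) -> float:
    n = len(link_gbps)
    gb_per_gpu = 2 * (n - 1) / n * payload_gb   # classic ring All-Reduce traffic volume
    slowest = min(link_gbps)                    # the tail link gates the whole step
    return gb_per_gpu * 8 / slowest             # GB -> Gb, then divide by Gb/s

payload = 10.0                          # GB of gradients exchanged per iteration
uniform = [400.0] * 16                  # every GPU has a clean 400G path
congested = [400.0] * 15 + [100.0]      # one contended link in the cluster

print(f"uniform links: {ring_allreduce_seconds(payload, uniform):.2f} s")
print(f"one slow link: {ring_allreduce_seconds(payload, congested):.2f} s")
```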

An additional fact that can be used to improve the optical network performance of ML training is that, due to their size, these models are often distributed, or parallelized, across multiple GPUs. The types of parallelization include data, model, and pipeline parallelism. Parallelization requires different GPUs to compute different parts of the deep neural network (DNN) model, which leads to varying bandwidth needs for exchanging parameters between iterations during training runs. The impact of this varying bandwidth is shown in Figure 1 below [4]. The figure shows a heat map of the data exchanged between different GPUs in the cluster. It is clear that some node pairs require high bandwidth for data exchange (dark blue) while others exchange much less information (light blue and white). Adjusting the bandwidth between nodes can therefore provide a dramatic efficiency improvement when training ML workloads. A collaboration between MIT, Meta, and Telescent that used a robotic patch panel to optimize connectivity demonstrated a 3.4x improvement in training efficiency with no increase in communication cost [4]. Using this method, a three-week training run could be reduced to less than a week.
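
The intuition behind exploiting this skew with reconfigurable connectivity can be shown with a toy example: rank node pairs by measured traffic and give the scarce direct optical circuits to the hottest pairs. The greedy sketch below, with a made-up traffic matrix, only illustrates the idea; it is not the optimization algorithm used in TopoOpt [4].

```python
# Toy sketch of traffic-aware reconfiguration: give the scarce direct optical
# circuits to the node pairs that exchange the most data. The traffic matrix
# is made up, and this greedy pass is only an illustration of the idea, not
# the TopoOpt optimization from [4].

traffic_gb = {                      # GB exchanged per iteration between node pairs
    ("gpu0", "gpu1"): 120.0,        # e.g. tensor-parallel partners: heavy exchange
    ("gpu2", "gpu3"): 110.0,
    ("gpu1", "gpu3"): 90.0,
    ("gpu0", "gpu2"): 5.0,
    ("gpu1", "gpu2"): 4.0,
    ("gpu0", "gpu3"): 3.0,
}
available_circuits = 2              # spare ports on the optical circuit switch

ranked = sorted(traffic_gb.items(), key=lambda kv: kv[1], reverse=True)
direct = [pair for pair, _ in ranked[:available_circuits]]
shared = [pair for pair, _ in ranked[available_circuits:]]

print("direct optical circuits:", direct)
print("left on shared fabric:  ", shared)
```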

Figure 1: Heat map of the bandwidth exchanged between GPUs for various production workloads in BigNet [4].

While we have discussed the challenges of meeting the needs of current ML data center designs, even more challenging for data center owners is that the field of machine learning and AI will continue to advance extremely rapidly. A chart published by Google highlights the rapidly evolving nature of machine learning models [5]. Table 1 shows the percentage of TPU time devoted to different deep neural network (DNN) models for selected months over the past seven years. The dominant model type in 2016, the deep learning recommendation model (DLRM), accounts for only 24% of processor time in the most recently reported period. Large language models, meanwhile, have grown from nothing to 31% of processing time in just 2.5 years and will likely dominate the processing time when the chart is next updated. This rapid change will place additional demands on the optical network, and maintaining flexibility will be critical to meeting future needs efficiently.

Table 1: Workloads by Deep Neural Network (DNN) model type (% TPUs used) at Google for selected months showing the changing mix of DNN models.  DLRM = Deep Learning Recommendation Model, RNN = Recurrent Neural Network, CNN = Convolutional Neural Network, BERT = Bidirectional Encoder Representations from Transformers [5].

Optimizing the connectivity between GPU nodes to address the optical networking needs of ML data centers can be done with optical circuit switches (OCS). Google has published multiple papers demonstrating improved performance in both machine learning clusters and traditional data center networks using a 136-port MEMS switch [5, 6]. While MEMS devices offer fast switching, their low port count and relatively high optical loss through the switch require bidirectional optics, circulators, and improved FEC chips to work around [6].
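
A simplified way to see why bidirectional optics matter for a low-port-count switch is the port arithmetic below; the framing is illustrative rather than a description of the exact Apollo/TPU v4 design [5, 6].

```python
# Illustrative port arithmetic (simplified; see [5, 6] for the actual design):
# with circulators and bidirectional optics, each endpoint attaches to the
# optical circuit switch with a single fiber instead of a separate
# transmit/receive pair, so the same small switch serves twice as many endpoints.

ocs_ports = 136

endpoints_with_fiber_pairs = ocs_ports // 2   # two ports (tx + rx fiber) per endpoint
endpoints_bidirectional = ocs_ports           # one bidirectional fiber per endpoint

print(endpoints_with_fiber_pairs, endpoints_bidirectional)   # 68 vs 136
```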

Telescent has an all-fiber, high-port-count, low-loss optical switch that can be scaled to thousands of fibers per OCS to manage connectivity in a machine learning cluster. The Telescent system consists of a short fiber link between two ports, with a robot that moves the selected port to the requested new location. The key element that allows the Telescent system to scale to high port counts is the routing algorithm the robot uses to weave the fiber around other fibers to its new location. The original Telescent system was designed with 1,008 simplex LC ports, but this has been extended to multiple fibers per port: using an MT-style connector allows the system to scale to 8 fibers per port and many thousands of fibers per system. The Telescent system has passed NEBS Level 3 certification and has been used in production networks. Both single-mode and multimode fiber have been deployed in the Telescent system, allowing use with lower-cost, short-reach multimode transmitters.
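
The fiber-count arithmetic behind that scaling is straightforward, as the short sketch below shows using the port and connector figures quoted above.

```python
# Fiber-count arithmetic using the figures quoted above: 1,008 ports with an
# MT-style connector carrying up to 8 fibers per port.

ports = 1_008
fibers_per_port = 8

print(f"{ports * fibers_per_port:,} fibers per system")   # 8,064 fibers
```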

In a recent ACM Special Interest Group on Data Communication (SIGCOMM) presentation, Ryohei Urata from Google pointed out that the reconfigurability of optical circuit switches not only improves current performance but provides a “powerful form of future-proofing for future models” [7]. In a LinkedIn post, Dan Golding, an executive at Google with extensive AI/ML and data center networking experience, stated the following about ML: “It’s important for those operating and designing data centers to check their assumptions and get ahead of the curve. Being able to support ML means not just greater power/cooling density, but greater network density!” [8]

To learn more about the Telescent robotic cross-connect system and how it can improve your machine learning performance both today and in the future, please contact Telescent at www.telescent.com.

[1] “Meta Stops Construction On 2 U.S. Data Centers As Part Of AI-Focused Redesign,” Bisnow (bisnow.com).

[2] “4 Ways to Optimize Your Data Center for AI Workloads,” Data Center Knowledge.

[3] rail_llm_hotnets_2023.pdf (mit.edu).

[4] “TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs,” arXiv:2202.00433 (arxiv.org).

[5] “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings,” arXiv:2304.01433 (arxiv.org).

[6] “Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale,” arXiv:2208.10041 (arxiv.org).

[7] “Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter [& ML] Systems,” SIGCOMM ’23 presentation (Session 8), YouTube.

[8] D. Golding, post on LinkedIn.