If you are reading this blog, you are probably already familiar with Large Language Models (LLMs) and generative Artificial Intelligence (AI). Many articles have covered the remarkable capabilities of LLMs and their potential applications, such as generating computer code or summarizing intricate legal documents, and have discussed the profound influence LLMs may exert on various industries.

It is equally important, however, to understand the far-reaching implications of generative AI, LLMs, and other machine learning (ML) models for the data centers that underpin them. Training a large language model requires thousands of processors exchanging an enormous number of parameters, a process that often runs for weeks. These workloads differ significantly from the operations data centers were designed to handle, and the existing architecture needs to be rethought to serve them efficiently.

Compounding this challenge is the breakneck pace at which machine learning and AI are advancing. The seminal paper that introduced the Transformer model behind the recent leaps in machine learning was published only in 2017 [1]. Google's publicly available data, summarized in Table 1, shows how quickly deep neural network (DNN) models have evolved: it lists the share of TPU time consumed by different DNN model types for selected months over the past seven years. The model type that dominated in 2016, the Deep Learning Recommendation Model (DLRM), accounted for just 24% of processor time in the most recent reporting period, while LLMs grew from essentially nothing to 31% of processing time in roughly 2.5 years and are poised to take an even larger share when the data is next updated.

Table 1: Workloads by Deep Neural Network (DNN) model type (% TPUs used) at Google for selected months, depicting the shifting landscape of DNN models. DLRM = Deep Learning Recommendation Model, RNN = Recurrent Neural Network, CNN = Convolutional Neural Network, BERT = Bidirectional Encoder Representations from Transformers [2].

The evolution of DNN models has direct consequences for the network. The models are now so large that they must be distributed, or parallelized, across many GPU nodes, using data, model, or pipeline parallelism; each strategy assigns a different slice of the model and the training data to each GPU node, as sketched in the example below.
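To make the three strategies concrete, here is a minimal sketch in plain Python of how a hypothetical 8-layer model and a 32-sample batch could be divided across four nodes under each scheme; the layer names, sizes, and node count are illustrative only and are not tied to any particular framework.

```python
# Toy illustration of data, model, and pipeline parallelism.
# All sizes and names below are hypothetical.

layers = [f"layer_{i}" for i in range(8)]   # a tiny 8-layer model
batch = list(range(32))                     # a global batch of 32 samples
nodes = 4

# Data parallelism: every node holds the full model and trains on a slice of the batch.
data_parallel = {n: {"layers": layers, "samples": batch[n::nodes]} for n in range(nodes)}

# Model (tensor) parallelism: every node holds a shard of each layer and sees the full batch.
model_parallel = {n: {"layers": [f"{layer}/shard{n}" for layer in layers], "samples": batch}
                  for n in range(nodes)}

# Pipeline parallelism: each node holds a contiguous group of layers (a stage)
# and passes activations to the next stage.
stage_size = len(layers) // nodes
pipeline_parallel = {n: {"layers": layers[n * stage_size:(n + 1) * stage_size], "samples": batch}
                     for n in range(nodes)}

for name, plan in [("data", data_parallel), ("model", model_parallel),
                   ("pipeline", pipeline_parallel)]:
    print(f"{name:8s} parallelism -> node 0 holds {len(plan[0]['layers'])} layer entries "
          f"and {len(plan[0]['samples'])} samples")
```

Each scheme produces a different communication pattern: data parallelism exchanges gradients for every parameter, model parallelism exchanges activations within layers, and pipeline parallelism hands activations between adjacent stages. That is why bandwidth demand is far from uniform across a cluster.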

Consequently, bandwidth requirements vary across GPU nodes during the parameter exchange of the training process, as shown in Figure 1 below [3]. The heat map depicts data exchanged between GPUs within a cluster, highlighting node pairs that need high bandwidth (dark blue) alongside pairs that exchange far less data (light blue and white). Matching the bandwidth allocated between nodes to this demand can yield substantial gains in ML training efficiency: a collaborative effort involving MIT, Meta, and Telescent, using a robotic patch panel to optimize connectivity, achieved a 3.4x increase in training efficiency without incurring additional cost [3].

Figure 1: Heat map illustrating bandwidth exchange for various production workloads in BigNet [3].
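The intuition behind that result can be shown in a few lines: measure how much traffic each pair of nodes exchanges and give the heaviest pairs direct high-bandwidth circuits, reaching the remaining pairs over the rest of the fabric. The snippet below is only a greedy toy with made-up traffic numbers, not the co-optimization algorithm described in [3].

```python
import itertools

# Hypothetical traffic matrix: GB exchanged per iteration between 4 GPU nodes.
traffic = {
    (0, 1): 120.0, (0, 2): 5.0, (0, 3): 4.0,
    (1, 2): 3.0,   (1, 3): 110.0,
    (2, 3): 95.0,
}

direct_circuits_available = 3   # circuits the robotic patch panel can provision

# Greedy demand-aware allocation: dedicate direct circuits to the hottest pairs first.
hot_pairs = sorted(traffic, key=traffic.get, reverse=True)[:direct_circuits_available]
print("Direct high-bandwidth circuits:", hot_pairs)

# Remaining pairs are reached over the rest of the (lower-bandwidth, multi-hop) fabric.
other_pairs = [p for p in itertools.combinations(range(4), 2) if p not in hot_pairs]
print("Served indirectly:", other_pairs)
```

In practice the topology and the parallelization strategy are chosen together, since the parallelization strategy is what determines the traffic matrix in the first place [3].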

In addition to these challenges, contemporary AI chips are power hungry. A single GPU can draw 500 to 700 watts, five to ten times the demand of a traditional CPU, and a full rack of GPUs can consume 30 to 50 kilowatts, straining data center cooling infrastructure. Power constraints have already delayed data center construction in locations such as Ashburn, VA, Dublin, Ireland, and Singapore, underscoring the urgency of energy-efficient, adaptable networking solutions.
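Those rack-level figures follow from the per-device numbers; here is a quick back-of-envelope check, with a hypothetical GPU count per rack and an assumed overhead factor for host CPUs, NICs, and fans:

```python
# Back-of-envelope rack power. GPU counts per rack and the overhead factor
# are assumptions for illustration; the per-GPU wattage is quoted above.
gpu_watts = (500, 700)       # per-GPU draw
gpus_per_rack = (32, 64)     # e.g. 4 to 8 servers with 8 GPUs each
overhead = 1.15              # host CPUs, NICs, fans, power conversion losses

low_kw = gpus_per_rack[0] * gpu_watts[0] * overhead / 1000
high_kw = gpus_per_rack[1] * gpu_watts[1] * overhead / 1000
print(f"Rack power: roughly {low_kw:.0f} to {high_kw:.0f} kW")   # ~18 to 52 kW
```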

Optimizing the optical networking between GPUs offers another avenue for efficiency. As the data sets and compute clusters used for machine learning training grow, so does the time spent on networking and parameter exchange after each iteration. Figure 2 below shows the network overhead incurred by different DNN models as cluster size increases [3]: in large clusters, networking consumes a substantial share of the run, with GPUs sitting idle for nearly half of the training process while waiting for updated parameters before the next iteration.

Figure 2: Networking overhead in machine learning training relative to cluster size for various DNN workloads [3].
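A rough model shows why the idle fraction climbs with cluster size. The sketch below assumes synchronous data-parallel training with a ring all-reduce, no overlap of compute and communication, and a fixed global batch, so per-GPU compute shrinks as nodes are added while the gradient exchange does not. The parameter count, link bandwidth, and compute time are hypothetical.

```python
# Fraction of each training iteration spent exchanging parameters, assuming
# synchronous data parallelism with a ring all-reduce and no compute/communication
# overlap. All numbers are hypothetical and chosen only to illustrate the trend.

params = 10e9                    # 10B-parameter model
bytes_per_param = 2              # fp16 gradients
link_gbps = 400                  # per-node network bandwidth
global_batch_compute_s = 400.0   # time for one node to process the full global batch

def comm_fraction(n_nodes):
    compute_s = global_batch_compute_s / n_nodes
    # A ring all-reduce moves ~2*(N-1)/N of the gradient bytes over each node's link.
    comm_bits = 2 * (n_nodes - 1) / n_nodes * params * bytes_per_param * 8
    comm_s = comm_bits / (link_gbps * 1e9)
    return comm_s / (comm_s + compute_s)

for n in (8, 64, 512):
    print(f"{n:4d} nodes -> {comm_fraction(n):.0%} of iteration time on parameter exchange")
# prints roughly 1%, 11%, and 51% for 8, 64, and 512 nodes
```

Faster links and demand-matched topologies push that curve back down, which is the motivation for the optical circuit switching discussed next.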

Optimizing GPU node connectivity can be achieved with optical circuit switches. Google has published multiple papers showing improved performance, both within machine learning clusters and in traditional data center networks, using a 136-port MEMS switch [2,4]. MEMS devices switch quickly, but their limited port count led Google to add bi-directional optics, circulators, and stronger Forward Error Correction (FEC) chips to work around it.

Telescent offers an all-fiber, high-port-count, low-loss optical switch that scales to thousands of fibers per OCS to manage connectivity in a machine learning cluster. Each connection is a short fiber link between two ports, and a robot moves the selected port to the requested new location. The key element that allows the system to scale to high port counts is the routing algorithm the robot uses to weave the fiber around other fibers on its way to the new location. The original system was designed with 1,008 simplex LC ports and has since been extended to multiple fibers per port: using an MT-style connector, it scales to 8 fibers per port and many thousands of fibers per system. The system has passed NEBS Level 3 certification, has been used in production networks, and has been deployed with both single-mode and multimode fiber, allowing the use of lower-cost, short-reach multimode transmitters.
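From the network's point of view, a robotic cross-connect behaves like a remotely programmable patch panel: software requests a port-to-port mapping and the robot re-routes the physical fiber. The toy model below captures only that abstraction; the class and method names are hypothetical and are not Telescent's API.

```python
# Toy model of an optical cross-connect as a programmable port-to-port mapping.
# Class and method names are hypothetical, for illustration only.

class CrossConnect:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.links = {}                      # port -> port, stored symmetrically

    def connect(self, a, b):
        if a in self.links or b in self.links:
            raise ValueError("port already in use; disconnect it first")
        self.links[a] = b
        self.links[b] = a                    # one duplex fiber link

    def disconnect(self, a):
        b = self.links.pop(a)
        del self.links[b]

# Reconfigure GPU-to-GPU connectivity between jobs without touching the cabling by hand.
ocs = CrossConnect(num_ports=1008)
ocs.connect(1, 17)       # link the node on port 1 to the node on port 17
ocs.disconnect(1)
ocs.connect(1, 503)      # re-route the same node to a different peer
print(ocs.links)         # {1: 503, 503: 1}
```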

In a recent presentation at SIGCOMM, Ryohei Urata of Google underscored the reconfigurability benefits of optical circuit switches, not only for enhancing current performance but also "as a powerful form of future-proofing for future models" [5]. To learn more about the Telescent robotic cross-connect system and its potential to enhance machine learning performance both today and in the future, please contact Telescent through our website at www.telescent.com.

[1] "Attention Is All You Need," arXiv:1706.03762, 2017.

[2] "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," arXiv:2304.01433, 2023.

[3] "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs," arXiv:2202.00433, 2022.

[4] "Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale," arXiv:2208.10041, 2022.

[5] "Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems," SIGCOMM 2023 presentation (YouTube).