Machine learning algorithms affect a huge variety of our daily activities, from Siri's speech recognition to Netflix movie recommendations to which ad shows up in your timeline when you open Facebook.  Machine learning even shapes how data centers themselves are run, since predictive-maintenance algorithms are used there to reduce downtime.

To create these capabilities, machine learning models and training sets have grown astronomically large.  Natural language models are an example of the expanding sizes of datasets and machine clusters: models such as Gopher, Megatron-Turing NLG, and PaLM each have over 200 billion parameters and are trained on clusters of 2,000 to over 6,000 GPUs/TPUs [1].  However, as these datasets and xPU cluster sizes have increased, worker communication has become a bottleneck.  A graphic displayed at the recent OCP conference illustrates this.  Even with just a few tens of parallel workers communicating over Ethernet, the time spent on worker communication exceeds the time spent on computation [2].  For clusters of a few hundred parallel xPUs, nearly 90% of the training time is spent on communication.  Communication over InfiniBand is faster, but a significant share of the time is still spent on communication rather than computation.
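The trend behind this bottleneck can be sketched with a toy strong-scaling model (the compute time, gradient size, and link bandwidth below are illustrative assumptions, not measured values): as workers are added, the per-worker compute time shrinks while the all-reduce cost stays roughly fixed, so communication's share of each iteration grows.

```python
# Toy strong-scaling model of data-parallel training (illustrative numbers).
def iteration_split(workers, total_compute_s=64.0, grad_gb=2.0, link_gbps=100.0):
    """Return (compute_s, comm_s, comm_fraction) for one training iteration."""
    compute_s = total_compute_s / workers          # work is divided among workers
    # A ring all-reduce moves ~2*(N-1)/N of the gradient bytes over each link.
    comm_s = 2 * (workers - 1) / workers * grad_gb * 8 / link_gbps
    return compute_s, comm_s, comm_s / (compute_s + comm_s)

for n in (8, 64, 256):
    _, _, frac = iteration_split(n)
    print(f"{n:3d} workers: {frac:.0%} of each iteration spent communicating")
```

Under these assumed numbers the communication fraction rises from a few percent at 8 workers to more than half the iteration at 256 workers, the same qualitative trend shown in the OCP slide.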

Figure 1: Slide shown during OCP conference, Oct 19, 2022

While faster communication links between xPU clusters help address the communication bottleneck, there are also cases where the communication links themselves will need to change.  The first reason is that workloads come in different sizes.  The language models mentioned above required several thousand GPUs working together, but other workloads, such as video and ad recommendation for individual users on TikTok, require far fewer workers per training run.  Although each such dataset is much smaller than a natural language model's, the training must stay current (to provide good video recommendations) for hundreds of millions of users, so these smaller training runs have a similar need for speed and performance.

Changing the communication links between GPUs also offers other benefits, for example when employing model parallelism to run different ML algorithms.  With model parallelism, the data exchange between workers is not uniform across the cluster, but it is predictable and stable over the full training run.  Some worker pairs exchange far more data between iterations, and optimizing the bandwidth between these high-load workers can improve the overall speed of the training run considerably.  Researchers have demonstrated a 3 to 4x improvement in training time by optimizing the bandwidth between workers [3].
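A small example makes the idea concrete (the traffic matrix and bandwidth budget below are hypothetical numbers, not from any measured job): with a fixed bandwidth budget, giving heavily communicating worker pairs a proportionally larger share of capacity finishes the gating transfer sooner than splitting bandwidth uniformly.

```python
# Hypothetical per-iteration traffic (Gb) between worker pairs in a
# model-parallel job: one pair exchanges far more data than the others.
traffic = {("w0", "w1"): 40.0, ("w1", "w2"): 5.0, ("w2", "w3"): 5.0}
budget_gbps = 100.0  # total link capacity to divide among the pairs

uniform = {pair: budget_gbps / len(traffic) for pair in traffic}
proportional = {pair: budget_gbps * gb / sum(traffic.values())
                for pair, gb in traffic.items()}

def comm_time(alloc):
    # Pairs transfer concurrently; the slowest pair gates the iteration.
    return max(gb / alloc[pair] for pair, gb in traffic.items())

print(f"uniform allocation     : {comm_time(uniform):.2f} s")
print(f"proportional allocation: {comm_time(proportional):.2f} s")
```

In this toy case the proportional allocation cuts the communication phase from 1.2 s to 0.5 s per iteration, which is the kind of gain a reconfigurable interconnect can capture when the traffic pattern is known and stable.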

A final reason to adjust communication links between xPU workers is to handle machine failures in a torus-style ML cluster.  As individual nodes fail, routing around dead nodes within a static network becomes increasingly complex.  The issue is already visible at smaller scale with hundreds of nodes, and it becomes unmanageable at larger scale with thousands of nodes.  Again, the solution is a reconfigurable optical network with the ability to change the communication links between nodes, but here the motivation is reliability rather than performance.
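The routing problem can be illustrated with a minimal sketch (a toy 2D torus with breadth-first shortest-path search, not the routing scheme of any particular cluster): once enough neighbors of a node fail, a static torus has no route left at all, while a reconfigurable optical layer could rewire links around the dead nodes.

```python
from collections import deque

def torus_route(n, src, dst, failed=frozenset()):
    """BFS shortest path between nodes on an n x n torus, skipping failed nodes."""
    queue, seen = deque([(src, [src])]), {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        x, y = node
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = ((x + dx) % n, (y + dy) % n)  # % n gives the wrap-around links
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # destination unreachable: failures block every route

healthy = torus_route(4, (0, 0), (2, 2))
# Failing all four neighbors of (0, 0) isolates it in the static torus.
isolated = torus_route(4, (0, 0), (2, 2), failed={(1, 0), (3, 0), (0, 1), (0, 3)})
```

In the healthy torus the route is four hops; after the four neighbor failures, no route exists and the node is stranded unless the physical links can be reconfigured.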

Recently, the use of optical circuit switches (OCS) based on micro-electromechanical systems (MEMS) has been demonstrated in hyperscale optical networks at Google [4].  In that work, using a 138-port MEMS OCS to replace the electrical spine switches in the data center's Clos-type fabric resulted in a 30% reduction in capital expense and a 40% reduction in power requirements.  Because of the relatively low port count of the MEMS switch, implementation also required developing supporting technologies: bi-directional optics with an optical circulator, and an improved FEC algorithm to handle the increased loss and reduced return loss in the optical link.

For optical switching in machine learning applications, the use of 8-lane parallel optics and the size of xPU clusters drive the need for an optical switch that can handle thousands of fibers.  This would be extremely challenging for a MEMS-based OCS.  An alternative technology is a robotic cross-connect system.  Telescent offers an all-fiber, high-port-count, low-loss optical switch that can be scaled to thousands of fibers per OCS to manage connectivity in a machine learning cluster.  The Telescent system consists of a short fiber link between two ports, with a robot that moves the selected port to the requested new location.  The key element that allows the system to scale to high port counts is the routing algorithm the robot uses to weave the fiber around other fibers to its new location.  The original Telescent system was designed with 1,008 simplex LC ports, but this has been extended to multiple fibers per port: using MT-style connectors allows the system to scale to 8 fibers per port and many thousands of fibers per system.  The Telescent system has passed NEBS Level 3 certification and has been used in production networks.  Both single-mode and multimode fiber have been deployed in the system, allowing use with lower-cost, short-reach multimode transmitters.
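The fiber-count scaling is straightforward arithmetic (the port and per-port fiber counts are the figures stated above):

```python
simplex_lc_ports = 1008   # original system: 1,008 simplex LC ports
fibers_per_mt_port = 8    # MT-style connectors carry 8 fibers per port
fibers_per_system = simplex_lc_ports * fibers_per_mt_port
print(fibers_per_system)  # prints 8064
```

That is, the MT-connector variant reaches over eight thousand fibers in a single system, the scale needed for 8-lane parallel optics across a large xPU cluster.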

To learn more about the Telescent robotic cross-connect system and how it can improve your machine learning performance, please contact Telescent at www.telescent.com.


[1] Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance – Google AI Blog (googleblog.com)

[2] https://arxiv.org/pdf/2104.06069.pdf

[3] W. Wang et al., "TOPOOPT: Optimizing the Network Topology for Distributed DNN Training," USENIX Symposium on Networked Systems Design and Implementation (NSDI '22), April 4-6, 2022, https://doi.org/10.48550/arXiv.2202.00433

[4]  Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking – Google Research