High Performance Computing (HPC) and Machine Learning (ML) rely on collective communication – concurrently aggregating and distributing data across processes running on clusters of interconnected compute nodes.  This exchange of information is not only used frequently during machine learning training and other computationally intensive tasks but is often the primary source of communication cost and delay.  Indeed, recent work suggests that with the explosive growth in the size of deep learning models and improved computational capabilities, the collective communication operations used to synchronize model parameters across GPUs have become a major source of overhead in distributed ML training.  Even with just a few tens of parallel workstations communicating over Ethernet, the time required for worker communication exceeds that required for computation [1].  For clusters of a few hundred parallel xPUs, nearly 90% of the training time is spent on communication.
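
To make the operation concrete, the toy Python sketch below simulates the semantics of an all-reduce, the collective most commonly used for gradient synchronization: every worker contributes a vector, and every worker receives the same aggregated result.  It models only the semantics, not the network transfers that real libraries such as MPI or NCCL perform.

```python
# Toy illustration of all-reduce semantics: every worker contributes a
# gradient vector, and every worker ends up with the same averaged result.
# This simulates the outcome only; real systems (MPI, NCCL) move the data
# over the network in latency- or bandwidth-optimized schedules.
import numpy as np

def allreduce_mean(worker_grads):
    """Return the averaged gradient that every worker would receive."""
    total = np.sum(worker_grads, axis=0)   # aggregate step
    return total / len(worker_grads)       # identical copy on each worker

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(allreduce_mean(grads))  # [3. 4.] -- same on all workers
```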

Compute nodes can be connected in a variety of configurations – as a ring, mesh, torus or hypercube, to name just a few.  When optimizing the network for the workload, the runtime of an algorithm can be broken down into a per-hop latency component and a bandwidth component.  A network topology must balance minimizing the number of hops needed to perform the collective operation against the load imbalances incurred on specific network links.  A network with the minimal number of hops is latency optimized, while a network where the link loads are balanced across all transfers is bandwidth optimized.  For the all-reduce operation, in which data from every node is combined (for example, summed or averaged) and the result is distributed back to every node, a ring network is bandwidth optimized while a double binary tree is latency (node) optimized.  The trade-off between bandwidth and latency optimization often comes down to resource limits and cost efficiency.
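
This trade-off can be made concrete with the standard alpha-beta cost model, where alpha is the per-hop latency and beta the per-byte transfer time.  The textbook formulas below for ring and (non-pipelined) binary-tree all-reduce are simplifications – real implementations pipeline and segment messages, which shifts the constants – but they show the crossover: trees win for small messages, rings for large ones.  The parameter values are illustrative assumptions, not measurements.

```python
# Simplified alpha-beta cost model for all-reduce on p nodes with n bytes
# per node. alpha = per-hop latency (s), beta = per-byte transfer time (s).
# Textbook forms; treat the constants as a sketch, not a benchmark.
from math import log2

def ring_allreduce_time(p, n, alpha, beta):
    # Reduce-scatter + all-gather: 2(p-1) steps, each moving n/p bytes.
    # Bandwidth-optimal, but latency grows linearly with node count.
    return 2 * (p - 1) * alpha + 2 * (p - 1) * (n / p) * beta

def tree_allreduce_time(p, n, alpha, beta):
    # Reduce then broadcast over a binary tree: ~2*log2(p) hops,
    # each carrying the full n-byte buffer (no pipelining assumed).
    return 2 * log2(p) * alpha + 2 * log2(p) * n * beta

p, alpha, beta = 64, 5e-6, 1e-9        # 64 nodes, 5 us/hop, ~1 GB/s links
for n in (1e3, 1e9):                    # small vs large messages
    print(f"n={n:.0e}  ring={ring_allreduce_time(p, n, alpha, beta):.6f}s"
          f"  tree={tree_allreduce_time(p, n, alpha, beta):.6f}s")
```

Running this shows the tree winning for the 1 KB message (latency dominated) and the ring winning for the 1 GB message (bandwidth dominated).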

Since HPC and ML compute nodes are often connected in a static configuration using patch panels, performance tuning has typically meant optimizing the algorithm to fit a fixed topology.  The extremely large size of some machine learning data sets has led researchers to focus intensely on overall runtime, still with a fixed topology.  However, if a low-cost, high-port-count reconfigurable patch panel could be added to the network, the topology itself could be optimized for the workload.  This creates another parameter for optimization and can bring significant gains in performance and efficiency for machine learning and high performance computing workloads.

A robotic patch panel has recently been used by researchers at MIT, Raytheon, and Meta to modify the network topology in a machine learning cluster [2, 3].  Depending on what the system is being trained to achieve – natural language recognition, image detection or video recommendations, for example – different algorithms are employed.  With model parallelism, the communication bandwidth varies between pairs of workers, but this per-worker bandwidth is stable and predictable over the full training run.  This stability allows the topology to be reconfigured before each long-running application, which for large training sets can take hours to days.  Using a reconfigurable patch panel on a machine learning testbed, these researchers demonstrated a 3x improvement in training time relative to a binary fat-tree network [3].  As you can imagine, this can have a significant impact on machine learning – allowing ever larger data sets to be used for training, or allowing training to run more often to keep recommendations current for millions of customers.
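
As a rough illustration of how a measured traffic matrix could drive the patch panel's configuration, the sketch below greedily assigns direct links to the heaviest-communicating worker pairs under a per-node port budget.  This is a toy heuristic for intuition only – it is not the optimization algorithm of [2] – and the traffic numbers are hypothetical.

```python
# Toy workload-aware topology sketch: given a symmetric traffic matrix
# measured from a training job, greedily patch direct links between the
# heaviest-communicating node pairs, subject to each node's port budget.
# Illustration only; not the algorithm from reference [2].
def greedy_topology(traffic, ports_per_node):
    n = len(traffic)
    degree = [0] * n
    links = []
    # Consider all node pairs, heaviest traffic first.
    pairs = sorted(((traffic[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    for volume, i, j in pairs:
        if volume > 0 and degree[i] < ports_per_node and degree[j] < ports_per_node:
            links.append((i, j))
            degree[i] += 1
            degree[j] += 1
    return links

# Hypothetical 4-worker traffic matrix (GB exchanged per iteration):
traffic = [[0, 9, 1, 4],
           [9, 0, 8, 1],
           [1, 8, 0, 7],
           [4, 1, 7, 0]]
print(greedy_topology(traffic, ports_per_node=2))
# -> [(0, 1), (1, 2), (2, 3), (0, 3)]: a ring matched to this workload
```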

The Telescent robotic patch panel was used for the results described above.  Telescent builds an all-fiber, high-port-count, low-loss optical switch that can be scaled to thousands of fibers per system to manage connectivity in a machine learning cluster.  The system consists of short fiber links between ports, with a robot that moves a selected port to the requested new location.  The key element that allows the Telescent system to scale to high port counts is the patented algorithm the robot uses to weave a fiber around the other fibers on its way to the new location.  The system has passed NEBS Level 3 certification and has been deployed in production networks.  Both single-mode and multimode fiber have been deployed in the Telescent system, allowing use with lower-cost, short-reach multimode transmitters.

Contact Telescent today to learn about using robotic patch panel systems to improve your high-performance computing or machine learning application.


[1] arXiv:2104.06069, https://arxiv.org/abs/2104.06069

[2] Optimal Direct-Connect Topologies for Collective Communications, arXiv:2202.03356, https://arxiv.org/abs/2202.03356

[3] paper (4).pdf (mit.edu)