The desire for a low-cost optical switch has been around for decades. As seen with long-haul networks in the late 1990s, the value of all-optical networks enabled by optical switches is greatest where the demand for transmission bandwidth is greatest. This is because these networks require the newest, fastest transmitters and receivers, which are still somewhat expensive and in limited supply. In the dot-com era, the explosive growth of the internet created the need for high-capacity long-haul networks. Today, the “fattest pipes” are needed between clusters in machine learning applications because of the enormous size of the training sets.

The size of machine learning training sets has skyrocketed in an effort to improve models and recommendation systems. Natural language models are one example: models such as Gopher, Megatron-Turing NLG, and PaLM each have over 200 billion parameters and are trained on 2,000 to over 6,000 GPUs/TPUs [1]. However, as these datasets and xPU cluster sizes have increased, communication between nodes has become a significant bottleneck. A chart displayed at the recent Open Compute Project (OCP) conference demonstrated this: even with just a few tens of parallel workstations interacting over Ethernet, the time needed for worker communication exceeds the time needed for processing [2]. For clusters of a few hundred parallel xPUs, nearly 90% of the training time is spent on communication between nodes to exchange the updated parameter values. While communication over InfiniBand is faster, significant time is still spent on communication rather than computation. Optical networks that use optical switches to match the bandwidth between workers to the communication demands between nodes offer the promise of more efficient networks and up to 3 times faster processing of machine learning training sets [3].

While a number of technologies have been pursued for optical switches, most have remained in the lab and have not been deployed in large volumes in production networks. Since a major driver for optical switches is to reduce the cost of the network by optimizing the use of high-speed transceivers, the design of the optical switch should be based on low-cost components and should avoid custom designs. After all, replacing a transceiver with an optical switch doesn’t make sense if the port cost of the optical switch is greater than the transceiver cost.

Since the optical switch will be used in large networks, another key to keeping the cost down is to avoid a design whose complexity grows quadratically with the number of ports. In other words, the design should allow linear scaling rather than N^2 scaling as the port count grows. Since transceiver cost depends on the output power, the optical switch should have the lowest loss possible. The overall cost of the optical network isn’t improved if high losses in the optical switch require the use of higher-power, more expensive transceivers. Also, the design should be manufacturable by a contract manufacturing firm whose scale and expertise can be applied to reduce the cost in volume. This again means using common parts and common manufacturing steps that will be familiar to the contract manufacturer. Finally, since training runs on these large data sets can run for hours to days, the optical switch needs to be reliable enough not to disrupt a training run.
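The difference between linear and N^2 scaling can be made concrete with a small Python sketch. It assumes, purely for illustration, that one design needs a fixed number of elements per port while the other needs one element per input/output pair; the function names and these element models are hypothetical and do not describe any specific product.

```python
def linear_elements(ports: int, per_port: int = 1) -> int:
    """Element count when hardware grows in proportion to port count (assumed model)."""
    return per_port * ports


def quadratic_elements(ports: int) -> int:
    """Element count when one element is needed per input/output pair (assumed model)."""
    return ports * ports


# Compare the two assumed models at a few illustrative port counts.
for n in (16, 128, 1024):
    lin = linear_elements(n)
    quad = quadratic_elements(n)
    print(f"{n:>5} ports: linear={lin:>5}  N^2={quad:>8}  ratio={quad // lin}")
```

Under these assumptions, the gap between the two designs is itself proportional to the port count, so it widens steadily as systems grow toward a thousand ports.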

Some early attempts at an optical switch used micro-electromechanical systems (MEMS). These used lithographic processing to create an array of mirrors that could be electrically steered to direct the light signal to the desired output port. Since the MEMS approach includes a free-space propagation section where the light leaves the fiber, the total optical loss for MEMS devices can be large. This approach also suffers from the N^2 scaling problem because of the array design. In addition, mirror yield and loss become problems as the system scales to larger port counts. As an example of the yield challenge, in a MEMS device developed by Google and deployed within its network, the MEMS die had 176 individually controllable mirrors, which yielded 136 mirrors in the operational system [4]. Highlighting the manufacturing challenges of this design, Google reported in the same paper that, because of the difficulties in maintaining reliability and quality of this technology at scale, it decided to develop the MEMS system internally over the past decade.
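As a rough illustration of the yield figure quoted above, the short Python sketch below computes the fraction of mirrors on the die that ended up in the operational system. The function name is hypothetical; only the counts of 176 and 136 come from [4].

```python
def used_fraction(total_mirrors: int, used_mirrors: int) -> float:
    """Fraction of the mirrors on a die that are used in the operational system."""
    if total_mirrors <= 0 or not 0 <= used_mirrors <= total_mirrors:
        raise ValueError("counts must satisfy 0 <= used <= total and total > 0")
    return used_mirrors / total_mirrors


frac = used_fraction(176, 136)
print(f"{frac:.1%} of mirrors used, {176 - 136} unused")  # 77.3% of mirrors used, 40 unused
```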

Another design for an optical switch uses a piezoelectric stack to aim input fibers at output fibers. This approach again includes a free-space propagation section, which increases loss, especially as the port count increases. The design also uses a 2-D array on the input and output sides of the system, leading to an N^2 scaling problem: higher-port-count systems become disproportionately harder to build. In both the MEMS and the piezo stack approaches, the port count of the system is limited by the actuation angle that the individual mirror or piezo stack can achieve. Also, since the piezo stack aiming of fibers is unique to this product, the component is specialized, leading to high cost and difficulty in transferring production to a contract manufacturer.

A different approach has been developed by Telescent. Telescent has an all-fiber, high-port-count, low-loss optical switch that can be scaled to thousands of fibers per optical switch to manage connectivity in a machine learning cluster. The Telescent system consists of a short fiber link between two ports with a robot that moves the selected port to the requested new location. The key element that allows the Telescent system to scale to high port counts is the routing algorithm that the robot uses to weave the fiber around other fibers to the new location, allowing over one thousand ports per system. The algorithm also ensures that the system can always reconfigure a connection without affecting any other port; in other words, the system is reconfigurable and non-blocking. Since the signal always remains in the fiber, the total loss of the Telescent system is equivalent to just 2 fiber connections, typically 0.3 dB. The Telescent system has passed NEBS Level 3 certification and has been used in production networks. Both single-mode and multimode fiber have been deployed in the Telescent system, allowing use with lower-cost, short-reach multimode transmitters.
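To put the 0.3 dB figure in perspective, a loss in dB can be converted to the fraction of optical power that passes through the switch. The Python sketch below does this conversion. The 3 dB comparison value is a hypothetical placeholder for a higher-loss switch, not a measured figure for any product, and the function name is likewise an assumption.

```python
def db_to_power_fraction(loss_db: float) -> float:
    """Fraction of optical power remaining after an insertion loss given in dB."""
    return 10 ** (-loss_db / 10)


# 0.3 dB is the typical loss quoted in the text; 3.0 dB is a hypothetical comparison value.
for label, loss_db in (("typical loss from the text", 0.3), ("hypothetical higher-loss switch", 3.0)):
    frac = db_to_power_fraction(loss_db)
    print(f"{label}: {loss_db} dB -> {frac:.1%} of power transmitted")
```

Every additional dB of switch loss has to be made up by transmitter output power or receiver sensitivity, which is why a low-loss switch allows the use of less expensive transceivers.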

The Telescent system is manufactured by a contract manufacturer, which brings its manufacturing expertise and scale to implementing robust processes and reducing cost. With contract manufacturing and the fiber-based design of the Telescent system, at reasonable volumes the per-port price of the Telescent system is approximately an order of magnitude lower than that of the other approaches discussed in this article. With proven reliability, low loss, and a scalable manufacturing design, the long-desired goal of a low-cost optical switch for networks has arrived.

[1] “Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance,” Google AI Blog (googleblog.com)

[2] https://arxiv.org/pdf/2104.06069.pdf

[3] W. Wang et al., “TOPOOPT: Optimizing the Network Topology for Distributed DNN Training,” USENIX Symposium on Networked Systems Design and Implementation (NSDI '22), April 4-6, 2022, https://doi.org/10.48550/arXiv.2202.00433

[4] “Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale,” arXiv:2208.10041 (arxiv.org)

The Value of a Low-Cost Optical Switch