Automation can address the “most labor intensive and error-prone step” during restriping

Figure 1: Expansion of data center capacity to match demand [1].

Hyperscale data centers are a significant investment. The buildings can run to almost 1 million square feet of space, house over a million servers, and cost more than $1 billion. Since these data centers are built to support the ever-growing user demand for compute and storage capacity, construction and equipment deployment are planned to bring capacity online in stages, as shown in Figure 1. This staged buildout balances investment with revenue by reducing stranded capacity, and it improves the technology deployment and refresh cycle by allowing the use of the latest technology [1].

Since every server must be reachable from anywhere in the data center, the compute capacity in hyperscale data centers is connected through multiple layers of switches arranged in a Clos network – from Top-of-Rack (ToR) switches up through fabric- and spine-layer switches. Data halls, or clusters of compute capacity, can be brought online as they are completed, and new data halls can be added as they become ready in the future.

Figure 2: Schematic example of a data center with multiple data halls and layers of switches for connectivity.
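To make the layered structure concrete, the sketch below builds a toy model of the hierarchy described above: ToR switches in each data hall uplink to fabric switches, which in turn connect to a shared spine layer. The switch counts and names are illustrative assumptions, not taken from any specific deployment.

```python
# Toy model of the switch hierarchy in a hyperscale data center.
# Switch counts and names are illustrative, not from a real deployment.

def build_hall(hall_id: int, num_tors: int = 4, num_fabric: int = 2) -> dict:
    """Return the ToR-to-fabric uplinks for one data hall."""
    links = {}
    for t in range(num_tors):
        tor = f"hall{hall_id}-tor{t}"
        # Each ToR uplinks to every fabric switch in its own hall.
        links[tor] = [f"hall{hall_id}-fab{f}" for f in range(num_fabric)]
    return links

def connect_to_spine(halls: list, num_spine: int = 4) -> dict:
    """Connect every fabric switch in every hall to every spine switch."""
    spine_links = {}
    for hall in halls:
        for fabrics in hall.values():
            for fab in fabrics:
                spine_links[fab] = [f"spine{s}" for s in range(num_spine)]
    return spine_links

# Two data halls online today; a third can be brought online later
# by adding its own ToR/fabric links and connecting them to the spine layer.
halls = [build_hall(h) for h in range(2)]
spine = connect_to_spine(halls)
print(len(spine), "fabric switches uplinked to the spine layer")
```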

Since the racks and data halls are connected in a Clos network architecture, the connectivity must change during expansion to accommodate the added data halls and compute clusters. This is shown schematically in the figure below, which goes from a network with three server blocks (SB) and three spine blocks (SP) to a network with four of each. This modification of the connectivity is called restriping. Restriping can occur at any level of the data center where capacity is added, from the ToR switches to the fabric and spine layers, and even to external connections between data centers in a metro region.

Figure 3: Restriping of connectivity during expansion of the Clos network architecture [2].
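As a rough illustration of why a single expansion touches so many links, the sketch below assumes a simple uniform striping rule (each server block spreads its uplinks evenly across all spine blocks) and counts how many connections change when going from three blocks to four. The link counts and the striping rule are illustrative assumptions, not the optimized schemes described in [1] and [2].

```python
from itertools import product

def stripe(num_blocks: int, uplinks_per_sb: int = 12) -> dict:
    """Uniformly stripe each server block's uplinks across all spine blocks.
    Returns {(server_block, spine_block): link_count}. Illustrative rule only;
    assumes uplinks_per_sb is divisible by num_blocks."""
    per_spine = uplinks_per_sb // num_blocks
    return {(sb, sp): per_spine for sb, sp in product(range(num_blocks), repeat=2)}

before = stripe(3)   # 3 server blocks x 3 spine blocks: 4 links per pair
after = stripe(4)    # 4 server blocks x 4 spine blocks: 3 links per pair

# A physical change is needed wherever a pair loses links (rewired away)
# or gains links (new connections, including the new block's own uplinks).
rewired = sum(max(before.get(pair, 0) - n, 0) for pair, n in after.items())
added   = sum(max(n - before.get(pair, 0), 0) for pair, n in after.items())
print(f"links rewired away from existing pairs: {rewired}")
print(f"new connections to make: {added}")
```

Even in this tiny example dozens of connections change; at the scale of a real fabric the same arithmetic produces the thousands of rewired links per expansion described in [1].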

While Figure 3 shows direct connections between the server and spine blocks, the connections are usually made through an intermediate patch panel to simplify cable management. Since the servers and switches are distributed throughout a data center, changing direct connections between these systems would require a large variety of cable lengths and weeks spent laying new cables in cable trays. By cabling each device to a patch panel instead, restriping can be accomplished by changing the connections between ports on the panel. Of course, there will still be tens of thousands of patch panel connections to manage across the data center halls.
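One way to picture the patch panel's role is as a level of indirection: the long cable runs from device ports to panel ports are laid once and never move, and a restripe only changes the short jumpers between panel ports. The sketch below models that idea with plain dictionaries; the port names are invented for illustration.

```python
# Long, fixed cable runs from devices to the patch panel (laid once during construction).
device_to_panel = {
    "SB1-uplink1": "panel-A1", "SB1-uplink2": "panel-A2",
    "SP1-down1":   "panel-B1", "SP2-down1":   "panel-B2",
}

# Short jumpers between panel ports -- the only thing that changes during a restripe.
jumpers = {"panel-A1": "panel-B1", "panel-A2": "panel-B2"}

def end_to_end(device_port: str) -> str:
    """Follow a server-block port through the panel to the far-side device port."""
    far_panel = jumpers[device_to_panel[device_port]]
    # Reverse lookup: which device cable lands on that panel port?
    return next(dev for dev, panel in device_to_panel.items() if panel == far_panel)

print(end_to_end("SB1-uplink1"))   # SP1-down1

# Restripe: swap the jumpers only; no new cables are pulled through the trays.
jumpers = {"panel-A1": "panel-B2", "panel-A2": "panel-B1"}
print(end_to_end("SB1-uplink1"))   # SP2-down1
```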

To quote a paper presented by Google at a recent USENIX Networked Systems Design and Implementation (NSDI) conference: “rewiring links via patch-panel changes is the most labor intensive and error-prone step. Typically, thousands of links need to be rewired during one expansion, creating the possibility of multiple physical mistakes during the rewiring process. To check for errors, we perform a cable audit…This audit results in automated tickets to repair faulty links, followed by a repeat of the audit” [1].
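The audit described in the quote can be thought of as a diff between the intended topology and what is actually observed on the links (for example, from switch neighbor discovery). The sketch below is a generic illustration of that idea, not Google's audit tooling; the port names, data sources and ticket format are placeholders.

```python
def cable_audit(intended: dict, observed: dict) -> list:
    """Compare the intended link map against observed neighbors and return
    one 'ticket' per mismatched or missing link. Placeholder ticket format."""
    tickets = []
    for local_port, expected_peer in intended.items():
        actual_peer = observed.get(local_port)
        if actual_peer != expected_peer:
            tickets.append({
                "port": local_port,
                "expected": expected_peer,
                "found": actual_peer or "no link detected",
                "action": "re-patch and re-audit",
            })
    return tickets

intended = {"SB1-fab0:eth49": "SP2:eth3", "SB1-fab0:eth50": "SP3:eth3"}
observed = {"SB1-fab0:eth49": "SP3:eth3"}   # one link swapped on the panel, one dark

for ticket in cable_audit(intended, observed):
    print(ticket)
```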

While much of network management has been automated through software-defined networking, restriping has, as the quote above describes, traditionally been done manually. In one presentation, a large-scale data center operator reported a 5% connection error rate even when using two of their most experienced and careful technicians [3]. Aside from connection errors, the quality of the fiber behind those connections is also unknown. While the compute hardware in a data center is expected to be refreshed every few years, the fiber is installed during construction and expected to last the life of the facility. In a separate paper discussing data center operations, 39% of fiber links required remediation during installation, and even after remediation 9% remained unusable [4]. These problems are expected to grow as higher-rate optics are introduced into data centers whose fiber was installed years ago.

While manual patching may have been required in the past due to a lack of automation technology, Telescent has developed and built a robotic patch panel that brings automation to the restriping process. The Telescent Network Topology Manager (NTM) preserves the advantages of static patch panels, such as latched, low-loss connections, while adding the ability to remotely manage and control connectivity at large scale with high reliability. Using a patented fiber routing algorithm, the Telescent NTM manages an internal array of fibers to allow remote connection, disconnection and reconfiguration of over 1,000 duplex ports in the system. The Telescent system is non-blocking and can provide the any-to-any connectivity required for the Clos network design. The robot is machine-accurate and avoids the human errors found during manual reconfiguration. The system can also include an optional OTDR to analyze the fiber, which should help reduce the percentage of unusable links. Finally, since the Telescent robot can move SN and MT-type connectors, the system can handle over 10,000 fibers. With NEBS Level 3 certification and over 1 billion port-hours of in-service lifetime, the Telescent NTM offers the proven reliability needed for network deployment.
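Put together, a software-driven restripe might look like the sketch below: a planned set of jumper changes is pushed to the robotic patch panel as disconnect/connect operations, with each new path verified by OTDR before moving on. The NTMClient class and its method names are hypothetical placeholders for illustration, not the actual Telescent API.

```python
class NTMClient:
    """Hypothetical controller interface for a robotic patch panel.
    Class and method names are placeholders, not the actual Telescent API."""

    def disconnect(self, port_a: str, port_b: str) -> None:
        print(f"robot: unlatch {port_a} <-> {port_b}")

    def connect(self, port_a: str, port_b: str) -> None:
        print(f"robot: latch {port_a} <-> {port_b}")

    def otdr_check(self, port: str) -> bool:
        # In practice this would return loss/reflectance data for the new path.
        print(f"otdr: trace {port} ... ok")
        return True

def apply_restripe(ntm: NTMClient, changes: list) -> None:
    """Apply (panel_port, old_peer, new_peer) changes one connection at a time."""
    for port, old_peer, new_peer in changes:
        ntm.disconnect(port, old_peer)
        ntm.connect(port, new_peer)
        if not ntm.otdr_check(port):
            raise RuntimeError(f"link {port} -> {new_peer} failed OTDR check")

apply_restripe(NTMClient(), [("panel-A1", "panel-B1", "panel-B2"),
                             ("panel-A2", "panel-B2", "panel-B1")])
```

Because the robot executes each step deterministically and the OTDR data is captured as part of the workflow, the manual audit-and-repair loop described earlier can, in principle, be folded into the reconfiguration step itself.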

Of course, a new technology will only be adopted if it can be offered at a price that creates value. The Telescent switch is manufactured by Flex, one of the largest contract manufacturers in the world. While the system is currently produced at Flex's Austin, TX facility, production can be transferred to Mexico or East Asia as volumes increase to reduce manufacturing costs. While the current price of the Telescent system is approximately twice that of manual patch panels, the cost is expected to approach that of high-quality manual patch panels as volumes grow.


References:

[1] S. Zhao et al., “Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks,” USENIX NSDI 2019 (nsdi19-zhao.pdf, usenix.org).

[2] “Minimal Restriping for Data Center Expansion,” Technical Disclosure Commons (tdcommons.org).

[3] “Demystifying DataCenter CLOS networks,” presentation video (YouTube).

[4] “Low-margin optical networking at cloud scale [Invited],” Journal of Optical Communications and Networking, 2019.