Telescent, MIT Partner to Speed up ML Workflows

Telescent Inc., a manufacturer of automated fiber patch panels and cross-connects for networks and data centers, will announce the results of its collaboration with the Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory (MIT CSAIL), aimed at accelerating training time for machine learning workflows, in an invited presentation at the Networked Systems Design and Implementation (NSDI) conference, April 17-19 in Boston.

The NSDI conference focuses on the design principles, implementation, and practical evaluation of networked and distributed systems. Its goal is to bring together researchers from across the networking and systems community to foster a broad approach to overlapping research challenges.

Today’s machine learning (ML) training systems are deployed on top of traditional data center fabrics built from electrical packet switches arranged in a multi-tier topology. The performance and efficiency of this architecture are severely limited by localized network bandwidth bottlenecks. The Telescent programmable patch panel can provision and deliver network connections with essentially unlimited aggregate bandwidth (i.e., thousands of terabits per second) within a massive GPU cluster while consuming minimal energy.

The collaboration between Telescent and MIT CSAIL focused on reducing training time for machine learning workflows by optimizing communication between workers in a graphics processing unit (GPU) cluster through programmable network connections. The approach accelerated workflows by a factor of 3.4, a significant performance improvement that overcomes limitations of current GPU clusters in ML training applications.

According to Manya Ghobadi, Associate Professor at MIT CSAIL and program co-chair of NSDI, large-scale ML clusters require enormous computational resources and consume a significant amount of energy. As a prime example, training a ChatGPT model with 65 billion parameters requires 1 million GPU hours and costs over $2.4 million. In January, ChatGPT served 600 million live inference queries and used as much electricity as 175,000 people. As a result, “this trend is not sustainable,” said Ghobadi.

To address the challenge, the MIT CSAIL researchers proposed TopoOpt, a reconfigurable optical data center fabric for DNN (deep neural network) training that leverages the unique performance and scalability of the Telescent programmable patch panel. TopoOpt is the first ML-centric network architecture that co-optimizes the distributed training process across three dimensions (computation, communication, and network topology) to improve performance.
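The co-optimization idea can be illustrated with a toy search. The sketch below is not Telescent or MIT CSAIL code; every name, strategy, and number in it is hypothetical. It simply shows the shape of the problem: each parallelization strategy induces a communication demand, each reconfigurable topology delivers different bandwidth for that demand, and the best training time comes from choosing both together rather than fixing the network first.

```python
# Illustrative sketch only -- hypothetical strategies, topologies, and numbers.
from itertools import product

# Candidate parallelization strategies: each induces a traffic demand
# (GB exchanged per training iteration) and a per-iteration compute time.
strategies = {
    "data_parallel":   {"demand_gb": 8.0, "compute_s": 1.0},
    "hybrid_parallel": {"demand_gb": 3.0, "compute_s": 1.2},
}

# Candidate topologies a programmable patch panel could realize: each gives
# the effective bandwidth (GB/s) on the busiest link for a given strategy.
topologies = {
    "static_multitier": {"data_parallel": 2.0, "hybrid_parallel": 2.0},
    "demand_matched":   {"data_parallel": 4.0, "hybrid_parallel": 6.0},
}

def iteration_time(strategy, topology):
    """Per-iteration time = compute time + transfer time on the bottleneck link."""
    s = strategies[strategy]
    bandwidth = topologies[topology][strategy]
    return s["compute_s"] + s["demand_gb"] / bandwidth

def co_optimize():
    """Search the joint (strategy, topology) space for the fastest iteration."""
    return min(product(strategies, topologies),
               key=lambda pair: iteration_time(*pair))

best = co_optimize()
print(best, round(iteration_time(*best), 2))
```

In this toy instance the joint search picks the hybrid strategy on the demand-matched topology, whereas fixing either dimension in isolation would land on a slower combination; real systems replace the exhaustive search with a scalable optimization over full traffic matrices.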

The team at MIT CSAIL integrated TopoOpt with NVIDIA’s NCCL collective communications library and built a prototype of TopoOpt using the Telescent robotic patch panel and remote direct memory access (RDMA) forwarding at 100 Gbps.

For more information, visit Telescent at www.telescent.com.