Meta Works with NVIDIA to Build AI Research Supercomputer

Meta’s AI supercomputer, the largest NVIDIA DGX A100 customer system to date, will deliver five exaflops of AI performance to Meta AI researchers. It features cutting-edge NVIDIA systems, InfiniBand fabric and software that enable optimization across thousands of GPUs.

Meta announced that its AI Research SuperCluster (RSC) is training new models to advance AI. Once fully deployed, Meta’s RSC is expected to be the largest customer installation of NVIDIA DGX A100 systems.

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they could seamlessly collaborate on a research project or play an AR game together,” the company said in a blog.

The new AI supercomputer uses 760 NVIDIA DGX A100 systems as its compute nodes. It packs a total of 6,080 NVIDIA A100 GPUs linked on an NVIDIA Quantum 200Gb/s InfiniBand network to deliver 1,895 petaflops of TF32 performance.
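The figures above can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not NVIDIA's or Meta's own calculation: the eight-GPUs-per-system count and the 312-teraflops peak TF32 rate (the A100 datasheet figure with sparsity) are assumptions that do not appear in this article. The result lands close to, but not exactly on, the 1,895 petaflops quoted; the article does not say how that number was derived.

```python
# Rough check of the GPU count and peak TF32 throughput quoted above.
DGX_SYSTEMS = 760            # from the article
GPUS_PER_DGX_A100 = 8        # assumption: standard DGX A100 configuration
TF32_PEAK_TFLOPS = 312       # assumption: A100 datasheet peak TF32 with sparsity

total_gpus = DGX_SYSTEMS * GPUS_PER_DGX_A100

# Convert teraflops to petaflops (1 petaflop = 1,000 teraflops).
peak_tf32_petaflops = total_gpus * TF32_PEAK_TFLOPS / 1000

print(f"GPUs: {total_gpus:,}")
print(f"Peak TF32: {peak_tf32_petaflops:,.2f} petaflops")
```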

RSC took 18 months to go from an idea on paper to a working AI supercomputer, thanks in part to the NVIDIA DGX A100 technology at the foundation of Meta RSC.

It’s the second time Meta has picked NVIDIA technologies as the base for its research infrastructure. In 2017, Meta built the first generation of this AI research infrastructure, which uses 22,000 NVIDIA V100 Tensor Core GPUs and handles 35,000 AI training jobs a day.

Meta’s early benchmarks showed RSC can train large NLP models three times faster and run computer vision jobs 20 times faster than the prior system.

In a second phase later this year, RSC will expand to 16,000 GPUs, which Meta believes will deliver five exaflops of mixed-precision AI performance. Meta also aims to expand RSC’s storage system to deliver up to an exabyte of data at 16 terabytes per second.
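As a rough illustration of the second-phase numbers, the sketch below estimates peak mixed-precision throughput and how long one pass over a full exabyte would take at the quoted bandwidth. It rests on assumptions not stated in this article: a per-GPU peak of 312 teraflops (the A100 datasheet figure for dense FP16/BF16 Tensor Core math) and decimal units (1 exabyte = 10^18 bytes, 1 terabyte = 10^12 bytes). These are theoretical peaks, not measured results.

```python
# Rough estimate of the phase-two figures quoted above.
PHASE_TWO_GPUS = 16_000          # from the article
MIXED_PRECISION_TFLOPS = 312     # assumption: A100 datasheet peak, dense FP16/BF16
STORAGE_BYTES = 1e18             # one exabyte, decimal units (assumption)
BANDWIDTH_BYTES_PER_S = 16e12    # 16 terabytes per second, decimal units (assumption)

# Convert teraflops to exaflops (1 exaflop = 1,000,000 teraflops).
peak_exaflops = PHASE_TWO_GPUS * MIXED_PRECISION_TFLOPS / 1_000_000

# Time for one pass over the full exabyte at the quoted bandwidth.
full_scan_hours = STORAGE_BYTES / BANDWIDTH_BYTES_PER_S / 3600

print(f"Peak mixed precision: {peak_exaflops:.3f} exaflops")
print(f"One pass over an exabyte: {full_scan_hours:.1f} hours")
```

Under these assumptions the peak comes out just under five exaflops, in line with the figure Meta cites.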

NVIDIA AI technologies are available to enterprises of any size.

NVIDIA DGX, which includes a full stack of NVIDIA AI software, scales easily from a single system to a DGX SuperPOD running on-premises or at a colocation provider. Customers can rent DGX systems through NVIDIA DGX Foundry.

When RSC is fully built out, Meta aims to use it to train AI models with more than a trillion parameters. That could advance fields such as natural-language processing for jobs like identifying harmful content in real time.

In addition to performance at scale, Meta cited extreme reliability, security, privacy and the flexibility to handle “a wide range of AI models” as its key criteria for RSC.