Elon Musk is catching up with Mark Zuckerberg on AI chips

The billionaire said on Monday that xAI, a company he launched in 2023, had brought a massive training cluster of new chips online over the weekend, claiming it was “the world’s most powerful AI training system.”

The system, called Colossus, was built with 100,000 Nvidia H100 GPUs at a facility in Memphis, Tennessee. Musk said the cluster was built in 122 days and will “double in size” as more GPUs are added in the coming months.

Musk confirmed the size of the cluster in July, but bringing it online marks a major step in his AI ambitions and, crucially, will enable him to catch up with his Silicon Valley nemesis, Mark Zuckerberg.

Zuckerberg’s and Musk’s ambitions both depend on high-performance GPUs to provide the computing power needed to train powerful AI models. In Musk’s case, the goal is to turn xAI into a company that advances “our collective understanding of the universe” with its Grok chatbot.

These are not easy to come by, nor are they cheap.

Since the release of ChatGPT in late 2022, the hype around AI has sent companies scrambling to acquire Nvidia GPUs. Surging demand and constrained supply have led to a shortage, with individual chips selling for more than $40,000 in some cases.

Despite the barriers to access, companies have gone to great lengths to secure GPU supplies and use them to gain an edge over their competitors.

Llama vs Grok

Nathan Benaich, founder and general partner at Air Street Capital, has been tracking how many H100 GPUs tech companies have acquired. He puts Meta’s total at 350,000 and xAI’s at 100,000. Musk’s other company, Tesla, has 35,000.

In January, Zuckerberg said Meta would have 600,000 GPUs by the end of the year, about 350,000 of which would be Nvidia H100s.

Microsoft, OpenAI, and Amazon have not disclosed the size of their H100 stockpiles.

Meta hasn’t said exactly how many of the 600,000 GPUs in Zuckerberg’s target have been secured, or how many are in use. But Meta said in a research paper published in July that the largest version of its Llama 3 large language model was trained on 16,000 H100 GPUs. The company said in March that it was deploying two 24,000-GPU clusters to support Llama 3 development, calling them a “major investment in Meta’s AI future.”

This indicates that xAI’s latest training cluster, with 100,000 H100 GPUs, is much larger than the cluster used to train Meta’s largest AI model.

The scale of the achievement has drawn notice across the industry.

On X, Nvidia’s data center account responded to Musk: “Exciting to watch Colossus, the world’s largest GPU supercomputer, come online in record time.”

xAI co-founder Greg Yang had a more colorful response to the news, referencing a song by the American rapper Tyga.

Sean Maguire, a partner at venture capital firm Sequoia, wrote on X that the xAI team is currently “accessing the most powerful training clusters in the world” to build the next version of the Grok chatbot. He added that “in the last few weeks, Grok-2 has rapidly grown to a level nearly on par with state-of-the-art models.”

But as with most AI companies, a big question mark hangs over how xAI will commercialize the technology. “It’s great that xAI has raised significant funding from Elon Musk and is making progress, but the company’s product strategy remains unclear,” Benaich told Business Insider.

In July, Musk said the next version of Grok, after training on 100,000 H100s, “should be really special.”

It will soon become clear how he stacks up against Zuckerberg when it comes to AI.