AI's New Frontier: Training Trillion-Parameter Models with Far Fewer GPUs

Researchers used just 8% of the world's most powerful supercomputer to train a model the size of ChatGPT.

Frontier supercomputer at the Oak Ridge National Laboratory
OAK RIDGE NATIONAL LABORATORY

This article was originally published on AI Business.

Training a language model the size of OpenAI’s ChatGPT would normally require a sizable supercomputer. But scientists working on the world’s most powerful supercomputer have found techniques for training gigantic models on far less hardware.

In a new research paper, scientists from the famed Oak Ridge National Laboratory trained a one-trillion-parameter model using just a few thousand GPUs in their Frontier supercomputer, the most powerful non-distributed supercomputer in the world and one of only two exascale systems globally.

They used just 3,072 of the 37,888 AMD GPUs housed in Frontier to train the giant large language model. That means the researchers trained a model comparable to ChatGPT’s rumored size of a trillion parameters on just 8% of Frontier's computing power.
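The 8% figure follows directly from the two GPU counts quoted above. A quick check in Python, with illustrative names that do not come from the paper:

```python
# GPU counts as reported in the article.
GPUS_USED = 3_072
GPUS_TOTAL = 37_888


def fraction_used(used: int, total: int) -> float:
    """Return the share of the machine in use, as a fraction between 0 and 1."""
    return used / total


share = fraction_used(GPUS_USED, GPUS_TOTAL)
print(f"{share:.1%} of Frontier's GPUs")  # prints "8.1% of Frontier's GPUs"
```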

The Frontier team achieved this feat using distributed training strategies that spread the model across the system's parallel architecture. Using techniques like sharded data parallelism to reduce communication between layers of nodes and tensor parallelism to handle memory constraints, the team was able to distribute the training of the model more efficiently.

The researchers also employed pipeline parallelism, which splits the model into stages placed on different nodes, to improve training speed.
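To picture how the three forms of parallelism combine, each GPU can be given a coordinate along a data, pipeline, and tensor axis. The sketch below is only an illustration: the degrees 8, 64, and 6 are hypothetical values chosen so that their product is 3,072, and the rank ordering is an assumption rather than the layout the researchers used.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int      # which model replica this GPU belongs to
    pipeline: int  # which pipeline stage (group of layers) it holds
    tensor: int    # which slice of each layer's weights it holds


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates.

    The tensor index varies fastest, so GPUs in the same tensor-parallel
    group get consecutive ranks (an assumed ordering).
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipeline, tensor)


# Hypothetical degrees, picked only so that the product equals 3,072.
TP, PP, DP = 8, 64, 6
assert TP * PP * DP == 3_072

print(rank_to_coord(0, TP, PP, DP))      # Coord(data=0, pipeline=0, tensor=0)
print(rank_to_coord(3_071, TP, PP, DP))  # Coord(data=5, pipeline=63, tensor=7)
```

Every GPU ends up with a unique coordinate, so each one holds one tensor slice of one pipeline stage of one model replica.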

The team reported 100% weak scaling efficiency for models of 175 billion and 1 trillion parameters. The project also achieved strong scaling efficiencies of 89% and 87%, respectively, for these two models.
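For readers unfamiliar with the two metrics, the sketch below uses their common textbook definitions, which are assumed here rather than taken from the paper. Weak scaling holds the work per GPU fixed as GPUs are added, while strong scaling holds the total work fixed. The timings are made up for illustration and are not results from the study.

```python
def weak_scaling_efficiency(base_time: float, scaled_time: float) -> float:
    """Work per GPU is fixed; the ideal is an unchanged step time as GPUs are added."""
    return base_time / scaled_time


def strong_scaling_efficiency(
    base_gpus: int, base_time: float, scaled_gpus: int, scaled_time: float
) -> float:
    """Total work is fixed; the ideal is time shrinking in proportion to GPU count."""
    speedup = base_time / scaled_time
    ideal_speedup = scaled_gpus / base_gpus
    return speedup / ideal_speedup


# Made-up timings in seconds per step, for illustration only.
weak = weak_scaling_efficiency(base_time=10.0, scaled_time=10.0)
strong = strong_scaling_efficiency(1_024, 20.0, 2_048, 12.5)

print(f"weak: {weak:.0%}")      # prints "weak: 100%"
print(f"strong: {strong:.0%}")  # prints "strong: 80%"
```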

A Trillion Parameters

Training a large language model with a trillion parameters is a challenging undertaking. The authors said the model alone requires a minimum of 14 terabytes of memory. By contrast, one MI250X GPU found in Frontier has only 64 gigabytes.
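The gap between those two numbers sets a floor on how many GPUs are needed just to hold the model. A rough calculation, assuming decimal units and ignoring everything else that competes for GPU memory during training:

```python
import math

# Figures as quoted in the article, interpreted in decimal units (an assumption).
MODEL_BYTES = 14 * 10**12  # 14 terabytes
GPU_BYTES = 64 * 10**9     # 64 gigabytes per GPU

# Lower bound only: real training also needs room for activations and buffers.
min_gpus = math.ceil(MODEL_BYTES / GPU_BYTES)
print(min_gpus)  # prints 219
```

So even before any computation, the model has to be split across a couple hundred GPUs.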

Because a model of this size cannot fit on a single GPU, methods like the ones the researchers explored are needed to overcome these memory limits.

However, one issue they faced was loss divergence due to large batch sizes. Their paper states that future research into bringing down training time for large-scale systems must see an improvement in large-batch training with smaller per-replica batch sizes.

The researchers also called for more work to be done around AMD GPUs. They wrote that most large-scale model training is done on platforms that support Nvidia solutions. While the researchers created what they called a blueprint for efficient training of LLMs on non-Nvidia platforms, they wrote: “There needs to be more work exploring efficient training performance on AMD GPUs.”

Frontier held onto its crown as the most powerful supercomputer in the most recent Top500 list, pipping the Intel-powered Aurora supercomputer.

About the Author(s)

Ben Wodecki

Assistant Editor, AI Business

Ben Wodecki is assistant editor at AI Business, a publication dedicated to the latest trends in artificial intelligence.
