How Twitter Shrunk Its Hadoop Clusters and Their Energy Consumption

Rearchitecting Hadoop clusters to remove a storage bottleneck opened the door to further improvements.

Matt Singer, a Twitter senior staff hardware engineer (left), and Navin Shenoy, executive VP and general manager of the Data Center Group at Intel, speaking at an Intel event in San Francisco on April 2, 2019. Photo: Yevgeniy Sverdlik

Twitter’s Hadoop infrastructure is enormous. Users’ every tweet and retweet gets streamed into its Hadoop clusters for analytics, and, according to the company’s CTO Parag Agrawal, they are among the largest Hadoop clusters in the world.

Hosting and managing this infrastructure in data centers is expensive, and an opportunity to shrink its footprint – or at a minimum slow down its growth rate – is welcome news. It’s especially welcome when there are performance gains to be had along with the footprint reduction.

By rearchitecting Twitter’s Hadoop clusters, the company’s infrastructure team recently ticked both boxes, speeding them up while shrinking their physical size.

That’s according to Matt Singer, senior staff hardware engineer at Twitter, who spoke at an Intel event in San Francisco Tuesday, where the chip giant rolled out a wide-ranging portfolio of its latest and greatest data center tech. Twitter now expects to get up to 50 percent faster runtimes on its Hadoop clusters, reduce the clusters’ energy consumption by about 75 percent, and spend 30 percent less on owning and operating them, Singer said.

An upgrade to Intel’s second-generation Xeon Scalable server chips – which played a central role at Tuesday’s event – was partially responsible for improving Twitter’s Hadoop infrastructure. But that upgrade was only possible because of the broader, “system-level” changes the team made.

“There are hundreds of millions of tweets every day, and when users interact with those tweets, it turns into actually over a trillion events per day, and that’s a lot of data,” Singer said. The physical storage capacity of Hadoop clusters that store and analyze that data adds up to more than 1 exabyte, he said. A typical cluster can have more than 100,000 hard drives, translating into 100 petabytes of logical storage.

Last year, the company said it had moved some of its Hadoop capacity into Google’s cloud to improve scalability. But it appears to have kept much of it in-house.

Because they’re cheap, hard drives are the workhorses of Twitter’s Hadoop clusters. But while hard drive capacity has increased over time, the number of IOPS a hard drive can perform has remained essentially flat. “And that’s resulted in a storage bottleneck,” he said.
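
That bottleneck is easy to see with a little arithmetic. The sketch below is a generic, back-of-the-envelope illustration; the drive figures are assumptions, not Twitter’s numbers.

```python
# Back-of-the-envelope sketch of the hard drive bottleneck: capacity keeps
# growing while random IOPS per drive stays roughly flat, so IOPS per
# terabyte of data keeps falling. Figures are generic assumptions.
RANDOM_IOPS_PER_DRIVE = 150          # roughly constant for a 7,200 RPM drive

for capacity_tb in (2, 4, 8, 14):    # successive hard drive generations
    iops_per_tb = RANDOM_IOPS_PER_DRIVE / capacity_tb
    print(f"{capacity_tb:>2} TB drive: ~{iops_per_tb:.0f} random IOPS per TB")

# Prints roughly 75, 38, 19, and 11 IOPS per TB: the same amount of data
# has ever fewer disk operations available to it as drives get bigger.
```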

Because of the way its Hadoop clusters had been architected, Twitter ran up against this bottleneck. Typical data flow in and out of a Hadoop server consists of two streams: HDFS traffic for persistent stored data and YARN traffic for temporary data. Both hit the same hard drives at the same time, often clogging up access to them.

Twitter and Intel engineers ran a series of experiments to get around this bottleneck. The solution they eventually arrived at selectively caches YARN-managed temporary data on a fast SSD, removing the competition between YARN and HDFS for disk I/O and reducing hard drive utilization.
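
Singer didn’t detail the exact mechanism, but the “selective” part of the idea can be sketched simply: prefer the SSD for temporary data while it has headroom and spill back to the hard drives only under pressure. The sketch below is a hypothetical illustration, not Twitter’s implementation; in a stock Hadoop deployment, the closest standard knobs are pointing the NodeManager’s local directories (yarn.nodemanager.local-dirs) at SSD mounts while HDFS block directories (dfs.datanode.data.dir) stay on the hard drives.

```python
import shutil

# Hypothetical sketch of "selectively cache temporary data to a fast SSD":
# use the SSD for YARN-style temp/shuffle files while it has free space,
# fall back to the hard drives otherwise. The paths and the 10 percent
# free-space threshold are illustrative assumptions.
SSD_TEMP_DIR = "/mnt/ssd0/yarn-temp"
HDD_TEMP_DIR = "/mnt/hdd0/yarn-temp"

def pick_temp_dir(min_free_fraction: float = 0.10) -> str:
    """Return the SSD temp directory unless it is nearly full."""
    usage = shutil.disk_usage(SSD_TEMP_DIR)
    if usage.free / usage.total >= min_free_fraction:
        return SSD_TEMP_DIR
    return HDD_TEMP_DIR  # spill to disk only when the SSD is under pressure

if __name__ == "__main__":
    print("Writing temporary data under:", pick_temp_dir())
```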

Removing the storage I/O bottleneck enabled Twitter to reduce the number of racks in its Hadoop clusters, in turn shrinking its data center footprint. The infrastructure team moved from 12 smaller hard drives per system to eight larger ones, cutting the number of hard drives in a cluster without hurting performance.
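
The article doesn’t give the drive sizes, but the trade is straightforward to illustrate with assumed numbers: fewer, larger drives can hold the same capacity per node, or more, with a third fewer spindles.

```python
# Illustrative arithmetic only -- the drive sizes here are assumptions.
# Say the old layout used 12 x 4 TB drives and the new one 8 x 8 TB.
old_drives, old_tb = 12, 4
new_drives, new_tb = 8, 8

print(f"Old node: {old_drives} drives, {old_drives * old_tb} TB raw")  # 48 TB
print(f"New node: {new_drives} drives, {new_drives * new_tb} TB raw")  # 64 TB
print(f"Drives per node cut by {1 - new_drives / old_drives:.0%}")     # 33%
```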

The rearchitected servers could also now put a lot more CPU horsepower to work, so the team switched from four-core processors to 24-core second-generation Xeon Scalable chips.

“It’s really great that we can have the same result for about 75 percent less energy consumption,” Singer said. “We expect that caching temp data and bumping up processor core counts results in up to 50 percent faster runtimes, and the increased density results in 30 percent lower TCO.”
