How Twitter Shrunk Its Hadoop Clusters and Their Energy Consumption
Rearchitecting Hadoop clusters to remove a storage bottleneck opened doors for more improvements.
Twitter’s Hadoop infrastructure is enormous. Every tweet and retweet gets streamed into its Hadoop clusters for analytics, and, according to the company’s CTO Parag Agrawal, those clusters are among the largest in the world.
Hosting and managing this infrastructure in data centers is expensive, and an opportunity to shrink its footprint – or at a minimum slow down its growth rate – is welcome news. It’s especially welcome when there are performance gains to be had along with the footprint reduction.
By rearchitecting Twitter’s Hadoop clusters, the company’s infrastructure team recently ticked both boxes, speeding them up while shrinking their physical size.
That’s according to Matt Singer, senior staff hardware engineer at Twitter, who spoke at an Intel event in San Francisco Tuesday, where the chip giant rolled out a wide-ranging portfolio of its latest and greatest data center tech. Twitter now expects to get up to 50 percent faster runtimes on its Hadoop clusters, reduce the clusters’ energy consumption by about 75 percent, and spend 30 percent less on owning and operating them, Singer said.
An upgrade to Intel’s second-generation Xeon Scalable server chips – which played a central role at Tuesday’s event – was partially responsible for improving Twitter’s Hadoop infrastructure. But that upgrade was only possible because of the broader, “system-level” changes the team made.
“There are hundreds of millions of tweets every day, and when users interact with those tweets, it turns into actually over a trillion events per day, and that’s a lot of data,” Singer said. The physical storage capacity of the Hadoop clusters that store and analyze that data adds up to more than 1 exabyte, he said. A typical cluster can have more than 100,000 hard drives, translating into 100 petabytes of logical storage.
Last year, the company said it had moved some of its Hadoop capacity into Google’s cloud to improve scalability. But it appears to have kept much of it in-house.
Because they’re cheap, hard drives are the workhorses of Twitter’s Hadoop clusters. But while hard drive capacity has increased over time, the number of IOPS a hard drive can perform has remained essentially flat. “And that’s resulted in a storage bottleneck,” Singer said.
Because of the way its Hadoop clusters had been architected before, Twitter ran up against this bottleneck. Typical data flow in and out of a Hadoop server consists of two parts: HDFS traffic for persistent data and YARN traffic for temporary data. Both hit the same hard drives at the same time, often clogging up access to them.
Twitter and Intel engineers ran a series of experiments to get around this bottleneck. The solution they eventually arrived at selectively caches YARN-managed temporary data on a fast SSD. This removes the competition for disk resources between YARN and HDFS, reducing hard drive utilization.
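The article doesn’t name the caching software Twitter used, but the general idea of separating the two I/O streams can be sketched with standard Hadoop settings: point YARN’s container scratch and log directories at SSD-backed mounts while leaving the HDFS DataNode block directories on the hard drives. The property names below are standard Hadoop configuration; the mount paths are placeholders, and this outright relocation is a simplification of the selective caching Singer described.

  <!-- yarn-site.xml: point YARN container scratch space and logs at an
       SSD-backed mount (paths are hypothetical placeholders) -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/ssd1/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/mnt/ssd1/yarn/logs</value>
  </property>

  <!-- hdfs-site.xml: DataNode block storage stays on the hard drives -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/hdd1/hdfs/data,/mnt/hdd2/hdfs/data,/mnt/hdd3/hdfs/data</value>
  </property>

In Twitter’s actual setup the SSD acts as a cache in front of the drives rather than a wholly separate tier, but the goal is the same either way: keep shuffle and temp I/O off the spindles that serve HDFS reads and writes.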
Removing the storage I/O bottleneck enabled Twitter to reduce the number of racks in its Hadoop clusters, and with it its data center footprint. The infrastructure team moved from 12 smaller hard drives per system to eight larger ones, shrinking the number of hard drives in a cluster without negatively impacting performance.
The team could now also put a lot more CPU horsepower to use, so it switched from four-core processors to 24-core second-generation Xeon Scalable chips.
“It’s really great that we can have the same result for about 75 percent less energy consumption,” Singer said. “We expect that caching temp data and bumping up processor core counts results in up to 50 percent faster runtimes, and the increased density results in 30 percent lower TCO.”