Stories Claiming Google ‘Beat Nvidia’ Greatly Overstated

Google researchers last week published a paper on the performance of the company’s home-grown TPUv4 AI accelerator chips, and it garnered a lot of attention. In the paper, the researchers compare their chips’ performance with GPUs from market leader Nvidia. CNBC, among others, reported this as “Google reveals its newest A.I. supercomputer, says it beats Nvidia.”

In the paper, the researchers do not claim the TPUv4 outpaces Nvidia’s current-gen flagship AI accelerator, the H100. Instead, Google compares its TPUv4 to the previous-gen A100. The way this has been reported has prompted some industry watchers to claim either that Google is “beating Nvidia” or that Google is somehow making unfair comparisons. Neither is correct.

The A100 is an appropriate comparison for the TPUv4; both the TPUv4 and the A100 were deployed in 2020 and both use 7-nm process technologies. The Google paper does make clear which generation of Nvidia hardware it is comparing against.

“The newer, 700W H100 was not available at AWS, Azure or Google Cloud in 2022. The appropriate H100 match would be a successor to TPU v4 deployed in a similar time frame and technology (e.g., in 2023 and 4 nm),” the Google researchers say in the paper.

Google’s TPU supercomputer is designed for AI acceleration. (Source: Google)

The renewed media interest in AI chips is based on the surge in demand for AI workloads in the data center, thanks to the need for at-scale training and inference of generative AI models like ChatGPT.

Google submitted MLPerf training results for its TPUv4 in July 2020, January 2021, June 2021 and July 2022. Google did not submit scores in November 2022, the first round in which H100 training scores appeared. However, published benchmarks can give a rough idea of whether TPUv4 or H100 is faster for training ChatGPT-style models.

Bert is the MLPerf benchmark that most closely resembles ChatGPT today. In June 2021, 64 Google TPUv4s trained Bert in 4.68 minutes. When H100 scores debuted in November 2022, 32 Nvidia H100s trained Bert in 1.797 minutes. Unfortunately, Google has never submitted inference results for its TPUv4 systems.
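For a very rough sense of scale, one can normalize those two published results by chip count. This is only a back-of-the-envelope sketch: the runs come from different MLPerf rounds, use different system sizes and software stacks, and scaling is not linear, so the ratio should not be read as a rigorous comparison.

```python
# Back-of-the-envelope chip-minutes comparison of the two published MLPerf
# Bert training results mentioned above. It ignores scaling efficiency and
# software differences, so it is illustrative only.

tpu_v4 = {"chips": 64, "minutes": 4.68}   # Google, June 2021 round
h100 = {"chips": 32, "minutes": 1.797}    # Nvidia, November 2022 round

def chip_minutes(result):
    """Total accelerator-minutes consumed by the training run."""
    return result["chips"] * result["minutes"]

print(f"TPUv4: {chip_minutes(tpu_v4):.1f} chip-minutes")   # ~299.5
print(f"H100:  {chip_minutes(h100):.1f} chip-minutes")      # ~57.5
print(f"H100 advantage: {chip_minutes(tpu_v4) / chip_minutes(h100):.1f}x")  # ~5.2x
```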

It is clear that TPUv4 is not beating Nvidia’s current-gen technology, and Google does not claim this in its paper. Rather, Google uses published MLPerf training results to come to the following conclusion: “For similar sized systems, TPUv4 is 1.15× faster for Bert than the A100 and ~4.3× faster than the [Graphcore Bow] IPU.”

However, there are plenty of interesting things about Google’s TPUv4 and its TPUv4 supercomputer in the paper:

  • Over 90% of training in Google data centers is done on TPUs. As of October 2022, 57% of that workload was training transformers (26% of the overall total on Bert and the remaining 31% on large language models (LLMs)). Twenty-four percent of training done on TPUs was for recommendation models.
  • Google’s TPUv4 supercomputer uses a new type of entirely optical circuit switch of Google’s own invention, called Palomar, a design based on 3D MEMS mirrors that switch in milliseconds. This optical switching lets the interconnect topology be reconfigured to match the mix of parallelism (data parallel, tensor parallel and/or pipeline parallel) required for very large neural network training. Further, Google uses AI to configure the topology of its supercomputer for the most efficient results when training LLMs; AI-designed topologies achieved 1.2× the performance of a human expert’s design for GPT-3 pre-training.
  • TPUv4 moves to a 7-nm node from the 16-nm node used for TPUv3. It has double the number of matrix multipliers and an 11% faster clock, which works out to 2.2× the peak performance (the arithmetic is sketched after this list). Peak performance is 275 TFLOPS for BF16.
  • Each TPUv4 die has two tensor cores, each with 128×128 matrix multiply units and a vector processing unit with 128 lanes.
  • Google has used a domain-specific accelerator called the SparseCore for embedding training in several generations of TPUs. It uses about 5% of the die area and 5% of the power. Embeddings are a key feature of recommendation models, and they are very difficult to accelerate.
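As a quick check on the 2.2× peak-performance figure in the list above, the doubled matrix-multiply count and the 11% clock increase compound multiplicatively. The implied TPUv3 baseline below is derived from those factors for illustration; it is not a figure quoted from the paper.

```python
# Minimal sketch of the TPUv4 vs. TPUv3 peak-performance arithmetic.

mxu_factor = 2.0     # twice the matrix multipliers
clock_factor = 1.11  # 11% higher clock speed

speedup = mxu_factor * clock_factor
print(f"Peak-performance factor vs. TPUv3: {speedup:.2f}x")  # ~2.22x

# Working backwards from TPUv4's 275 TFLOPS (BF16) peak gives an implied
# TPUv3 baseline of roughly 124 TFLOPS (illustrative, not from the paper).
tpu_v4_peak_tflops = 275.0
print(f"Implied TPUv3 peak: {tpu_v4_peak_tflops / speedup:.0f} TFLOPS")
```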

The Google blog on the paper is here, and the paper itself is here, but we’ll have to wait until Google builds a TPUv5 to see who is really winning this battle.


