TPUv4 Adds Large On-Chip Memory
The TPUv4 is now generally available through Google Cloud, although the company has used it internally for a year. The ASIC doubles the number of matrix units relative to the TPUv3.
For more than a year, Google has broadly deployed the fourth-generation TPU chip for its internal workloads, but it only recently made the AI accelerator generally available to cloud customers. The company has trickled out details of the TPUv4 design, which delivers much better performance per watt than its predecessors.
The TPUv4 doubles the number of matrix units per core relative to the TPUv3, raising peak performance to 275 trillion operations per second (TOPS) using the Bfloat16 format. It also adds a large shared memory, reducing the number of power-wasting accesses to the external High Bandwidth Memory (HBM). This change, along with a shrink to 7nm, helps slash the chip’s power.
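The Bfloat16 format the matrix units operate on keeps float32's 8-bit exponent (and thus its dynamic range) but trims the mantissa to 7 bits. A minimal Python sketch of the conversion, assuming simple truncation of the low mantissa bits rather than the round-to-nearest behavior hardware typically implements:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate a value as bfloat16 by zeroing the low 16 bits of its
    float32 encoding (bfloat16 = float32's sign and 8-bit exponent, but
    only the top 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Precision drops to roughly 2-3 decimal digits...
print(to_bfloat16(3.141592653589793))  # -> 3.140625
# ...but the float32 exponent range survives, so values near 1e38 do not
# overflow the way they would in IEEE float16 (max ~6.5e4).
print(to_bfloat16(1e38))
```

Tolerating this precision loss is what lets each matrix unit pack far more multipliers into the same area than float32 would allow, which is how the chip reaches its 275-TOPS peak.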
Google began designing its own AI accelerators in 2014, using a small team to quickly produce the first TPU. The success of that design led the company to deploy the TPUv2 in 2017, a more complex chip that could handle both training and inference. The TPUv3 was a fast upgrade, doubling the number of matrix units per core. Google debuted its next architecture in the TPUv4i, a single-core chip optimized for inference, and then in the dual-core TPUv4, which mainly targets training. It offered the TPUv4 to cloud customers "in preview" for several months before making it broadly available.
The company employs TPUs for all its AI training and inference work, eliminating purchases of Nvidia GPUs. According to MLPerf results, however, Nvidia’s A100 performs similarly to the TPUv4 in both large and small clusters, and its new H100 sets a high bar for the next TPU.
Subscribers can view the full article in the TechInsights Platform.