Editorial: It's Getting Hot In Here
Server processors, AI accelerators, and switch chips are consuming more power and generating more heat, in some cases as much as 950W. Data centers must change to accommodate these hot chips.
Server processors, switch chips, and AI accelerators generate lots of heat, which is why data centers resound with fan noise. For high-end chips, 300W was once considered the upper limit, but recent announcements have smashed through that barrier, in some cases topping 900W. New process nodes bring greater transistor counts but only minor power-efficiency gains. Chip designers wanting greater performance have little choice but to use the extra transistors and let their customers worry about cooling.
AI accelerators have led this trend. Nvidia limited its older V100 to 300W but stretched to 400–500W for its A100 two years ago, and its new H100 unabashedly carries a 700W TDP. Broadcom’s 12.8Tbps Tomahawk 3 switch chip used 300W (typical), but the 25.6Tbps Tomahawk 4 jumps to about 450W. Startup Tachyum rates its 128-core server processor, which has large matrix units, as high as 950W.
Transistor progress has slowed, with 5nm delivering modest benefits and 3nm looking even less promising. Although foundries claim that each new node reduces power per transistor by 20–30%, these figures fail to account for rising wire resistance as metal interconnects become narrower. In a typical processor design, this greater resistance cancels most or all of the power-efficiency gain. Supply voltages are no longer dropping, leaving no easy way to save power.
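To see why stalled supply voltages matter so much, consider the textbook first-order model of CMOS switching power, P ≈ αCV²f. This model is not taken from any vendor's data; it is a standard approximation that ignores leakage, and the function names below are purely illustrative. A minimal sketch:

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS switching power in watts: P = a * C * V^2 * f.

    Textbook approximation only; it ignores leakage and short-circuit power.
    """
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def relative_power(voltage_scale, frequency_scale=1.0):
    """Switching power relative to a baseline when voltage and clock
    frequency are scaled while activity and capacitance stay fixed."""
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Hypothetical scenario: a new node that allowed a 10% lower supply voltage.
    print(f"Relative power at 0.9x voltage: {relative_power(0.9):.2f}")
    # Same voltage as before: no savings from this term.
    print(f"Relative power at 1.0x voltage: {relative_power(1.0):.2f}")
```

Because voltage enters as a square, a 10% reduction would by itself trim switching power by about 19%. When voltage stays flat, that lever is gone, and any savings must come from lower capacitance or slower clocks.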
The changes affect data centers as well. Many are built to supply around 15kW per rack, enough for dozens of standard servers. But such a rack can power only a single DGX H100 system (with eight GPUs), which burns 10kW; a second system would exceed the budget. A full rack of H100 systems requires 70kW. Thus, data-center operators need new wiring to supply these voracious devices.
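The rack arithmetic can be checked with a few lines of Python. This is a rough sketch using only the figures quoted above (15kW per rack, 10kW per DGX H100 system, 70kW for a full rack). The function names are illustrative, and the calculation ignores networking gear, power-conversion losses, and cooling overhead.

```python
import math

RACK_BUDGET_KW = 15.0  # typical per-rack supply cited above
DGX_H100_KW = 10.0     # power of one DGX H100 system cited above
FULL_RACK_KW = 70.0    # power of a full rack of H100 systems cited above


def systems_per_rack(rack_budget_kw, system_kw):
    """Whole systems that fit within a rack's power budget."""
    return math.floor(rack_budget_kw / system_kw)


def stranded_power_kw(rack_budget_kw, system_kw):
    """Budgeted power left unused after fitting whole systems."""
    return rack_budget_kw - systems_per_rack(rack_budget_kw, system_kw) * system_kw


def supply_multiple(required_kw, rack_budget_kw):
    """How many times the existing rack budget a given load requires."""
    return required_kw / rack_budget_kw


if __name__ == "__main__":
    print("Systems per 15kW rack:", systems_per_rack(RACK_BUDGET_KW, DGX_H100_KW))
    print("Stranded power (kW):", stranded_power_kw(RACK_BUDGET_KW, DGX_H100_KW))
    print("Systems implied by a 70kW rack:", round(FULL_RACK_KW / DGX_H100_KW))
    print(f"Supply multiple needed: {supply_multiple(FULL_RACK_KW, RACK_BUDGET_KW):.1f}x")
```

On these figures, a 15kW rack fits one system and strands 5kW of its budget, while a full 70kW rack implies seven systems and needs nearly five times the power that such a rack delivers.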