Cerebras Dives Into WSE Architecture
The startup disclosed new details about how its tiny cores deliver tremendous performance and how its sparsity support speeds up the training of large AI models.
Secretive Cerebras spilled some more beans at last month’s Hot Chips conference, disclosing new details about its tiny compute core and how it can process even the biggest AI models. The startup is known for its wafer-scale processor, now in its second generation. The WSE2 packs 850,000 of these cores into a slab of silicon the size of a baking pan. The accelerator contains 40GB of SRAM and can perform 7,500 trillion FP16 operations per second (7,500 Tflop/s). Maximum power for the CS-2 system, which contains the WSE2, is 23,000 watts.
For years, Cerebras withheld the flops rate of its design because, despite its sheer magnitude, the design falls short of the leading GPUs when measured in flop/s per watt or per unit of die area. But large neural networks are typically limited by memory size and bandwidth, making the compute rate less relevant. In the WSE2, each FP16 multiply-accumulate (MAC) unit can access 12KB of stored operands at full speed; for Nvidia’s new Hopper GPU, the corresponding figure is only 128 bytes.
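As a rough check on these figures, the short Python sketch below recomputes the per-MAC memory and the flop/s-per-watt ratio from the numbers quoted above. The count of four FP16 MAC units per core is an assumption, not something stated here; it is chosen because it reproduces the roughly 12KB figure. The variable names are illustrative only.

```python
# Figures quoted in the article
CORES = 850_000            # compute cores on the WSE2
SRAM_BYTES = 40e9          # 40GB of on-wafer SRAM, read as decimal gigabytes
PEAK_FLOPS = 7_500e12      # 7,500 Tflop/s at FP16
MAX_POWER_W = 23_000       # maximum power of the CS-2 system
HOPPER_BYTES_PER_MAC = 128 # Hopper figure quoted in the article

# Assumption (not stated in the article): four FP16 MAC units per core
MACS_PER_CORE = 4

sram_per_core = SRAM_BYTES / CORES              # bytes of SRAM per core
sram_per_mac = sram_per_core / MACS_PER_CORE    # bytes of SRAM per MAC unit
flops_per_watt = PEAK_FLOPS / MAX_POWER_W       # flop/s per watt at max power
ratio_vs_hopper = sram_per_mac / HOPPER_BYTES_PER_MAC

print(f"SRAM per core: {sram_per_core / 1e3:.1f} KB")
print(f"SRAM per MAC:  {sram_per_mac / 1e3:.1f} KB")
print(f"Efficiency:    {flops_per_watt / 1e12:.2f} Tflop/s per watt")
print(f"Per-MAC memory vs. Hopper: about {ratio_vs_hopper:.0f}x")
```

Under that assumption, each core holds about 47KB and each MAC unit about 12KB, which is roughly 90 times the 128 bytes cited for Hopper.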
Nearly all AI models have fewer than 20 billion weights and can fit entirely in the WSE2’s memory. For the biggest ones, Cerebras offers a separate MemoryX box that can store trillions of weights in DRAM and stream them to one or more CS-2 systems. This approach enables the company to take advantage of model sparsity by removing zero weights instead of feeding them to the compute cores. Thus, the cores can focus on useful operations.
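The 20-billion-weight threshold follows from storing each weight as a two-byte FP16 value in 40GB of SRAM. The Python sketch below shows that arithmetic along with a toy illustration of the sparsity idea: zero weights are dropped before they reach the compute step, so only nonzero values are multiplied. It is a simplified model under stated assumptions, not Cerebras's implementation; the function names `stream_nonzero` and `sparse_dot` are hypothetical, and the capacity estimate ignores memory needed for activations and other training state.

```python
SRAM_BYTES = 40e9       # 40GB of on-wafer SRAM, read as decimal gigabytes
BYTES_PER_FP16 = 2      # one FP16 weight occupies two bytes

# Upper bound on weights held on the wafer, ignoring all other storage needs
max_weights_on_wafer = SRAM_BYTES / BYTES_PER_FP16


def stream_nonzero(weights):
    """Yield (index, value) pairs for nonzero weights only (hypothetical helper)."""
    for i, w in enumerate(weights):
        if w != 0.0:
            yield i, w


def sparse_dot(weights, activations):
    """Dot product that performs multiplies only for the streamed nonzero weights."""
    return sum(w * activations[i] for i, w in stream_nonzero(weights))


weights = [0.0, 2.0, 0.0, 0.0, 3.0]
activations = [1.0, 1.0, 1.0, 1.0, 2.0]

streamed = list(stream_nonzero(weights))
print(f"Weights that fit on the wafer: {max_weights_on_wafer / 1e9:.0f} billion")
print(f"Streamed {len(streamed)} of {len(weights)} weights")
print(f"Dot product: {sparse_dot(weights, activations)}")
```

In this toy case, three of the five weights are zero, so only two multiplies are performed while the result matches the dense dot product.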
Although Cerebras promotes its systems for training these enormous models, they’ve seen deployment mainly in high-performance computing (HPC) and niche AI applications. Customers include the US national laboratories at Argonne and Livermore as well as Germany’s Leibniz Supercomputing Centre. Major corporations such as AstraZeneca, Bayer, Genentech, and GlaxoSmithKline (GSK) use the systems for biological research, whereas other customers run physics simulations and similar tasks.