Behind Intel’s HPC Chip that Will Pierce the Exascale Barrier – IEEE Spectrum

Ponte Vecchio packs in a lot of silicon to power the Aurora supercomputer
The Ponte Vecchio processor is destined for the Aurora supercomputer at the U.S. Argonne National Laboratory, slated to be unveiled later this year.
On Monday, Intel unveiled new details of the processor that will power the Aurora supercomputer, which is designed to become one of the first U.S.-based high-performance computers (HPCs) to pierce the exaflop barrier—a billion billion high-precision floating-point calculations per second. Intel Fellow Wilfred Gomes told engineers virtually attending the IEEE International Solid-State Circuits Conference this week that the processor pushed Intel’s 2D and 3D chiplet-integration technologies to their limits.
The processor, called Ponte Vecchio, is a package that combines multiple compute, cache, networking, and memory silicon tiles, or “chiplets.” Each of the tiles in the package is made using different process technologies, in a stark example of a trend called heterogeneous integration.

The result is that Intel packed 3,100 square millimeters of silicon—nearly equal to four Nvidia A100 GPUs—into a 2,330-square-millimeter footprint. That’s more than 100 billion transistors across 47 pieces of silicon.
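As a rough check on those figures, here is a back-of-envelope sketch in Python; the roughly 826 mm² A100 die area is an outside assumption from Nvidia’s public specs, not a number from this article. The total silicon can exceed the package footprint only because the tiles are stacked.

    total_silicon_mm2 = 3100       # silicon spread across the 47 tiles
    footprint_mm2 = 2330           # package footprint quoted above
    a100_die_mm2 = 826             # assumed Nvidia A100 die area (not from this article)

    print(f"~{total_silicon_mm2 / a100_die_mm2:.1f}x an A100 die")                     # ~3.8x, i.e. "nearly four"
    print(f"silicon-to-footprint ratio: {total_silicon_mm2 / footprint_mm2:.2f}")      # ~1.33, only possible with stacking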

Ponte Vecchio is made of multiple compute, cache, I/O, and memory tiles connected using 3D and 2D technology. Source: Intel Corp.
Ponte Vecchio is, among other things, a master class in 3D integration. Each Ponte Vecchio processor is really two mirror-image sets of chiplets tied together using Intel’s 2D integration technology, Co-EMIB. Co-EMIB forms a bridge of high-density interconnects between two 3D stacks of chiplets. The bridge itself is a small piece of silicon embedded in the package’s organic substrate, and interconnect lines on silicon can be made narrower than on the organic substrate. Ponte Vecchio’s ordinary connections to the package substrate are 100 micrometers apart, whereas those across the Co-EMIB bridges are nearly twice as dense. Co-EMIB dies also connect high-bandwidth memory (HBM) and the Xe Link I/O chiplet to the “base silicon,” the largest chiplet, upon which the others are stacked.
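To put that in perspective, here is a minimal sketch that assumes a simple square grid of connections and reads “nearly twice as dense” as areal density; both are modeling simplifications on my part, not details from the article. Density scales with the inverse square of the pitch, so twice the density implies a bridge pitch of roughly 100/sqrt(2), or about 71 micrometers.

    import math

    def density_per_mm2(pitch_um):
        # Connections per square millimeter on a square grid (simplified model).
        return (1000.0 / pitch_um) ** 2

    substrate_pitch_um = 100.0                            # figure quoted in the article
    bridge_pitch_um = substrate_pitch_um / math.sqrt(2)   # implied by "nearly twice as dense"

    print(f"substrate: {density_per_mm2(substrate_pitch_um):.0f} connections/mm^2")                         # 100
    print(f"bridge:    {density_per_mm2(bridge_pitch_um):.0f} connections/mm^2 at ~{bridge_pitch_um:.0f} um")  # ~200 at ~71 um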

The parts of Ponte Vecchio. Source: Intel Corp.
Each set of eight compute tiles, four SRAM cache chiplets called RAMBO tiles, and eight blank “thermal” tiles meant to remove heat from the processor is connected vertically to a base tile. This base provides cache memory and a network that allows any compute tile to access any memory.
Notably, these tiles are made using different manufacturing technologies, according to what suited their performance requirements and yield. The latter term, the fraction of usable chips per wafer, is particularly important in a chiplet integration like Ponte Vecchio, because attaching bad tiles to good ones means you’ve ruined a lot of expensive silicon. The compute tiles needed top performance, so they were made using TSMC’s N5 (often called 5-nanometer) process. The RAMBO tiles and the base tile were both made using the Intel 7 (often called 7-nanometer) process. HBM, a 3D stack of DRAM, uses a completely different process from the logic technology of the other chiplets, and the Xe Link tile was made using TSMC’s N7 process.
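Here is a hedged sketch of why yield dominates the economics of an assembly like this, using a simple Poisson yield model; the defect density is purely illustrative, and the tile size is just the 3,100 mm² of silicon split evenly across 47 tiles, neither of which comes from the article.

    import math

    def poisson_yield(area_mm2, defects_per_mm2):
        # Probability that a die of the given area has zero defects.
        return math.exp(-defects_per_mm2 * area_mm2)

    D = 0.001                                  # assumed defect density, defects per mm^2 (illustrative only)
    monolithic = poisson_yield(3100, D)        # one hypothetical 3,100 mm^2 die
    avg_tile = poisson_yield(3100 / 47, D)     # an average ~66 mm^2 tile

    print(f"hypothetical monolithic die yield: {monolithic:.1%}")   # ~4.5%
    print(f"average individual tile yield:     {avg_tile:.1%}")     # ~93.6%
    # Testing tiles before assembly ("known good die") keeps most of the silicon
    # usable; the remaining risk is wasting good tiles on a bad attach, which is
    # why yield looms so large in chiplet integration.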
The different parts of the processor are made using different manufacturing processes, such as Intel 7 and TSMC N5. Intel’s Foveros technology creates the 3D interconnects and its Co-EMIB makes the horizontal connections. Source: Intel Corp.
The base die also used Intel’s 3D stacking technology, called Foveros. The technology makes a dense array of die-to-die vertical connections between two chips. These connections are just 36 micrometers apart and are made by connecting the chips “face to face”; that is, the top of one chip is bonded to the top of the other. Signals and power get into this stack by means of through-silicon vias, fairly wide vertical interconnects that cut right through the bulk of the silicon. The Foveros technology used on Ponte Vecchio is an improvement over the one used to make Intel’s Lakefield mobile processor, doubling the density of signal connections.
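Since connection density scales with the inverse square of pitch, that doubling checks out if Lakefield’s Foveros pitch was around 50 micrometers, which is an assumption on my part rather than a figure from this article.

    lakefield_pitch_um = 50.0   # assumed pitch of the earlier Foveros generation (not from this article)
    ponte_pitch_um = 36.0       # pitch quoted above
    print(f"density ratio: {(lakefield_pitch_um / ponte_pitch_um) ** 2:.2f}x")  # ~1.93, i.e. roughly double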
Expect the “zettascale” era of supercomputers to kick off sometime around 2028.
Needless to say, none of this was easy. It took innovations in yield, clock circuits, thermal regulation, and power delivery, Gomes said. To ramp performance up or down as needed, each compute tile can run at its own voltage and clock frequency. The clock signals originate in the base die, but each compute tile runs at its own rate. Providing the voltage was even more complicated. Intel engineers chose to supply the processor at a higher-than-normal voltage (1.8 volts) because the lower current that implies simplifies the package’s power-delivery structure. Circuits in the base tile reduce the voltage to something closer to 0.7 volts for use by the compute tiles, and each compute tile has its own power domain in the base tile. Key to this ability were new high-efficiency inductors, called coaxial magnetic integrated inductors. Because these are built into the package substrate, the circuit actually snakes back and forth between the base tile and the package before supplying the voltage to the compute tile.
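A quick back-of-envelope shows why the higher supply voltage helps; it ignores conversion losses and uses the 600-watt figure quoted below, so it illustrates only the current scaling, not Intel’s actual power-delivery budget.

    # For a fixed power draw, current through the package scales as 1/V.
    power_w = 600.0
    for supply_v in (1.8, 0.7):
        print(f"{supply_v} V supply -> ~{power_w / supply_v:.0f} A")   # ~333 A vs ~857 A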

Getting the heat out of a complex 3D stack of chips was no easy feat. Source: Intel Corp.

Ponte Vecchio is meant to consume 600 watts, so making sure heat could be extracted from the 3D stack was always a high priority. Intel engineers used tiles that have no function other than to draw heat away from the active chiplets in the design. They also coated the top of the entire chiplet agglomeration in heat-conducting metal, despite the various parts having different heights. Atop that went a solder-based thermal interface material (STIM) and an integrated heat spreader. The different tiles each have different operating-temperature limits under liquid cooling and air cooling, yet this solution managed to keep them all in range, said Gomes.
“Ponte Vecchio started with a vision that we wanted to democratize computing and bring petaflops to the mainstream,” said Gomes. Each Ponte Vecchio is capable of more than 45 trillion 32-bit floating-point operations per second (teraflops). Six of them fit together with two Sapphire Rapids CPUs in a complete compute system. These will be combined for a total exceeding 54,000 Ponte Vecchios and 18,000 Sapphire Rapids to form Aurora, a machine targeting 2 exaflops.
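As a rough scale check, a sketch only: aggregating the quoted 32-bit figure is not the machine’s official rating, since Aurora’s 2-exaflop target is a separate headline number.

    gpus = 54_000
    tflops_per_gpu = 45                     # 32-bit teraflops per Ponte Vecchio, as quoted
    exaflops = gpus * tflops_per_gpu / 1e6  # 1 exaflop = 1,000,000 teraflops
    print(f"~{exaflops:.1f} EFLOPS of aggregate 32-bit throughput")   # ~2.4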
It’s taken 14 years to go from the first petaflop supercomputers in 2008—capable of one million billion calculations per second—to exaflops today, Gomes pointed out. A 1000-fold increase in performance “is a really difficult task, and it’s taken multiple innovations across many fields,” he said. But with improvements in manufacturing processes, packaging, power delivery, memory, thermal control, and processor architecture, Gomes told engineers, the next thousandfold increase could be accomplished in just six years rather than another 14.
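The “around 2028” pull quote above follows from that arithmetic; a minimal sketch:

    years = 6
    annual_factor = 1000 ** (1 / years)     # growth needed for a thousandfold gain in six years
    print(f"~{annual_factor:.2f}x improvement per year")   # ~3.16
    print(f"zettascale around {2022 + years}")             # 2028, counting from exascale in 2022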
Samuel K. Moore is the senior editor at IEEE Spectrum in charge of semiconductors coverage. An IEEE member, he has a bachelor's degree in biomedical engineering from Brown University and a master's degree in journalism from New York University.
Interesting, and well done by Intel. We’ve been up against the physics, and therefore the economics, for some time, although we’ve found ways around it. The future will probably be optical computing, with its lower heat per operation. And of course it hasn’t just been about calculation rates: better caches, RAM (now moving to DDR5), and SSDs (now moving to PCIe 5), all more closely integrated; RAM on chip; a 3D lattice of processor, AI, ML, and GPU blocks plus RAM and SSD.

In supercomputing, superconduction offers a way around the heat problem: with no resistance there is no resistive heating, and at higher operating temperatures a liquid-nitrogen generator can be as small as a coffee pot. Much progress has also been made on the consumer side, with compute per watt increasing and utility improving through tailored design. Just one example: machine-learning performance went from 6 TFLOPS in the A12 (say, in an iPad mini 5) to 16 TFLOPS in the A15. Some of that comes from moving from a 7 nm to a 5 nm process, but some comes from giving machine learning a larger share of the transistor area.

To achieve the things consumers want, for example going from photo enhancement to video enhancement, you have to get around the bottlenecks through combined progress. For example, I get about three-quarters of a gigabyte per second over Wi-Fi 6 on a dedicated frequency band, with no buffering, and gigabytes per second through USB-C and Thunderbolt for my backups, using a fan-cooled PCIe 4 stick. Then there’s 5G mobile, with its lower latency.
