Higher density, more data create new bottlenecks in AI chips
Data movement is becoming a bigger problem at advanced nodes and in advanced packaging due to denser circuitry, more physical effects that can affect the integrity of signals or the devices themselves, and a significant increase in data from AI and machine learning.
Just shrinking features in a design is no longer sufficient, given the scaling mismatch between SRAM-based L1 cache and digital logic. Chip and system architectures need to be rethought based on real-world workloads, which in turn determine where and how much data is created, where that data is processed and stored, and where potential impediments can crop up to slow or block the flow of data.
“As the number of components and connections increases, managing interconnect density and routing challenges becomes crucial to avoid congestion and performance bottlenecks,” said Chowdary Yanamadala, senior director of technology strategy at Arm. “Additionally, securing sensitive data necessitates cryptographic operations, which can impact data transfer performance.”
The increase in resistance and capacitance due to pushing signals through thinner wires adds another thorny set of issues. “The cost of data transmission, of course, is both power and latency,” said Marc Swinnen, director of product marketing at Ansys. “It takes power to move data around, and then it just slows down because it takes time to move it. Those are the two technical choke points. The core of the problem, certainly at the chip level, is that the transistors have scaled faster than the interconnect, so the transistors are getting smaller and smaller, faster and faster, but the wires are not scaling at the same rate.”
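To see why interconnect lags transistor scaling, a rough first-order estimate helps. The Python sketch below uses a simple parallel-plate coupling model with illustrative material constants and geometries (all assumptions, not foundry data) to show how wire RC delay grows as wire cross-sections shrink.

```python
# Rough first-order estimate of how wire RC delay grows as wires shrink.
# All material constants and geometries are illustrative assumptions,
# not values for any specific process node.

RHO_CU = 1.7e-8          # copper resistivity, ohm*m (bulk value; thin wires
                         # are worse due to surface and grain scattering)
EPS_0 = 8.85e-12         # vacuum permittivity, F/m
K_DIELECTRIC = 3.0       # assumed low-k dielectric constant

def wire_rc_delay(width_nm, thickness_nm, spacing_nm, length_um):
    """Very rough RC delay of a wire (seconds), using a parallel-plate
    capacitance model to the two neighboring wires. Illustrative only."""
    w = width_nm * 1e-9
    t = thickness_nm * 1e-9
    s = spacing_nm * 1e-9
    l = length_um * 1e-6
    r = RHO_CU * l / (w * t)                    # series resistance
    c = 2 * K_DIELECTRIC * EPS_0 * (t * l) / s  # coupling cap to both sides
    return r * c

# Same 100 um route at three hypothetical metal widths/pitches:
for width in (100, 50, 25):  # nm
    d = wire_rc_delay(width, width, width, 100)
    print(f"width {width:>3} nm -> RC delay ~ {d*1e12:.2f} ps per 100 um")
```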
Moreover, while the number of transistors per mm2 continues to increase, the amount of data that needs to be moved has grown even faster. “It’s widely acknowledged, for AI in particular, that the memory system is a big bottleneck in terms of keeping the processing engines working,” said Steven Woo, fellow and distinguished inventor at Rambus. “They’re often just waiting for data. You have AI training and AI inference. In training, the challenge is to get these big training data sets in and out of memory, and back to the processor so it can actually learn what to do. The size of these models has been growing pretty close to 10 times or more per year. If you’re going to do that, you need the appropriate growth in the amount of data you’re training it with, as well. The thing that’s been a big challenge is how to get memory systems to keep up with that growth rate.”
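A roofline-style calculation makes the imbalance concrete. The sketch below assumes a hypothetical accelerator's peak compute rate, memory bandwidth, and kernel arithmetic intensities (all illustrative figures) and estimates how much of the compute would sit idle waiting on memory.

```python
# Roofline-style back-of-the-envelope: is a workload compute-bound or
# memory-bound? All numbers below are illustrative assumptions.

PEAK_TFLOPS = 500          # hypothetical accelerator peak, TFLOP/s
MEM_BW_TB_S = 3            # hypothetical memory bandwidth, TB/s

def attainable_tflops(arithmetic_intensity_flops_per_byte):
    """Attainable throughput = min(peak compute, bandwidth * intensity)."""
    bw_limited = MEM_BW_TB_S * arithmetic_intensity_flops_per_byte  # TB/s * FLOP/B = TFLOP/s
    return min(PEAK_TFLOPS, bw_limited)

# Arithmetic intensity (FLOPs per byte moved) for a few illustrative kernels:
kernels = {
    "elementwise add (low reuse)": 0.25,
    "attention-like kernel": 20,
    "large dense matmul (high reuse)": 300,
}
for name, ai in kernels.items():
    t = attainable_tflops(ai)
    util = 100 * t / PEAK_TFLOPS
    print(f"{name:32s}: ~{t:6.1f} TFLOP/s ({util:.0f}% of peak)")
```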
While one solution is simply to increase the number of circuits to avoid the distortion, that can push power requirements beyond what a project’s constraints allow. Woo noted that in many designs, the amount of power needed to move data now far outstrips the power budget for compute itself.
“It turns out about two-thirds of the power is spent simply getting the data out of memory and moving it between the chips,” he said. “You’re not even doing anything with the data. You’ve got to get it in and out of the DRAM. It’s pretty crazy how much it costs you, and a lot of that is really just driven by the physical distance. The longer you go, the more power you need to drive it that distance. But there are other things with electrical signaling, like making sure you can process the signal because it distorts a little bit. There’s also interference. So not surprisingly, people are realizing that’s the big part of the energy pie, and they have to cut that distance down.”
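Woo’s two-thirds figure can be approximated with a back-of-the-envelope energy budget. In the sketch below, the per-bit and per-operation energies are assumed order-of-magnitude values, not measurements, chosen so the result lands near the ratio he describes.

```python
# Back-of-the-envelope energy budget: moving data vs. computing on it.
# Energy figures are assumed order-of-magnitude values, not measurements.

E_MAC_PJ = 1.0            # ~1 pJ per 16-bit multiply-accumulate (assumed)
E_SRAM_BIT_PJ = 0.05      # on-chip SRAM access per bit (assumed)
E_DRAM_BIT_PJ = 5.0       # off-chip DRAM access plus link, per bit (assumed)

def energy_pj(num_macs, bytes_from_dram, bytes_from_sram):
    compute = num_macs * E_MAC_PJ
    movement = (bytes_from_dram * 8 * E_DRAM_BIT_PJ +
                bytes_from_sram * 8 * E_SRAM_BIT_PJ)
    return compute, movement

# Hypothetical layer: 1M MACs with moderate operand reuse.
compute, movement = energy_pj(num_macs=1_000_000,
                              bytes_from_dram=50_000,
                              bytes_from_sram=500_000)
total = compute + movement
print(f"compute : {compute/1e6:.2f} uJ ({100*compute/total:.0f}%)")
print(f"movement: {movement/1e6:.2f} uJ ({100*movement/total:.0f}%)")
```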
Partitioning of the logic creates additional challenges. “Multi-core architectures require effective data sharing between processing units, leading to increased bandwidth demands,” Arm’s Yanamadala said. “Another critical challenge is maintaining high data transfer rates in a power-efficient manner. It is also essential to implement proper thermal management solutions to prevent performance degradation and ensure overall system efficiency.”
Increasing complexity
There are no silver bullets, and no single approach solves all problems. But there are numerous options available, and each comes with tradeoffs.
Included in that list is I/O disaggregation, which can include high-speed Ethernet or PCI. “They can be disaggregated into possibly a larger process geometry,” said Manmeet Walia, executive director of product management at Synopsys. “We have use cases around multi-functions coming together — an RF chip, a digital chip, an analog chip — to form a highly dense SoC. Even the 2.5D technologies are now evolving to higher density. We used to have what is generically called an interposer, which is a passive die at the bottom, with active dies on the top. Now, that is evolving and getting fragmented into multiple different technologies that are based on RDL.”
To address the issues inherent in battling physics-based restraints to moving increasing amounts of data around, designers will have to get creative. “Addressing these challenges requires innovative approaches, such as advanced interconnect technologies, efficient memory architectures, and sophisticated power management techniques,” said Yanamadala, pointing to pre-integrated subsystems such as Arm’s Neoverse Compute Subsystems as a way of freeing up developers to focus on building differentiated, market-customized solutions. “As chip architectures continue to evolve, the ability to overcome these obstacles will be critical to unlocking the full potential of future computing systems.”
Woo agreed that the solutions must be multi-faceted, citing better compression algorithms as a way to speed up data movement, and more parallel processing as a way to process it faster so that less needs to be moved in the first place.
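Whether compression actually speeds up data movement depends on the link bandwidth versus the (de)compression throughput. The sketch below uses assumed figures for data size, link speed, compression ratio, and codec throughput to estimate the break-even point for a hypothetical transfer.

```python
# When does compressing data before moving it pay off?
# Ratios and throughputs below are illustrative assumptions.

def transfer_time_s(bytes_to_move, link_gb_s,
                    compression_ratio=1.0, codec_gb_s=None):
    """Total time to (optionally compress,) transfer, and decompress data.
    compression_ratio = original_size / compressed_size."""
    t = (bytes_to_move / compression_ratio) / (link_gb_s * 1e9)
    if compression_ratio > 1.0 and codec_gb_s:
        # pay compression on one side and decompression on the other,
        # both modeled over the uncompressed size
        t += 2 * bytes_to_move / (codec_gb_s * 1e9)
    return t

DATA = 8 * 1e9            # 8 GB of activations/weights to move (assumed)
LINK = 50                 # 50 GB/s effective link bandwidth (assumed)

raw = transfer_time_s(DATA, LINK)
packed = transfer_time_s(DATA, LINK, compression_ratio=2.5, codec_gb_s=400)
print(f"uncompressed: {raw*1e3:.0f} ms")
print(f"compressed  : {packed*1e3:.0f} ms "
      f"({100*(1 - packed/raw):.0f}% faster)")
```

With a slower codec the compression overhead can erase the gain, which is why the decision has to be made against the actual link and compute budgets.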
“There is some limit on how fast you can go,” he said. “It’s kind of like when airplanes first were approaching the speed of sound, for example. The design of the airplane had to change, and there was an exponential increase in how fast they could make the airplanes go. Then, of course, over the last 20, 30 or 40 years, there really hasn’t been as much of an increase because you start to get to physical limits of how fast you can pass data over these wires that form a bus. One of those limits is simply the distance you’re trying to go. It turns out that the signals will distort, and start to disturb each other. You typically have a whole bunch of wires next to each other, and so the faster you go, and the more of these wires you try to cram together, the more they can interfere, and the more the signals can distort.”
Memory changes
The rapid increase in the amount of data being processed also accounts for the rapid product cycles in high-bandwidth memory, which is being used for both L2 and L3 cache. While new versions of DDR typically appeared every five years or so in the past, new versions of HBM are being released every couple of years, augmented by more layers in the stack.
“This is because the demand is so high for bandwidth that we’re having to change our architectures, and we’re having to heavily tune the design to the process technologies that are available at the time,” Woo explained.
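The bandwidth gap between conventional DIMMs and stacked memory follows directly from interface width times per-pin data rate. The sketch below uses rounded, publicly cited figures for DDR5 and HBM3, which should be treated as approximate.

```python
# Peak bandwidth = interface width (bits) * per-pin data rate (Gb/s) / 8.
# Figures are rounded public numbers; treat as approximate.

def peak_gb_s(bus_width_bits, data_rate_gbps_per_pin):
    return bus_width_bits * data_rate_gbps_per_pin / 8

configs = {
    "DDR5-6400 channel (64-bit)":       (64,   6.4),
    "HBM3 stack (1024-bit, 6.4 Gb/s)":  (1024, 6.4),
    "4x HBM3 stacks on one package":    (4096, 6.4),
}
for name, (width, rate) in configs.items():
    print(f"{name:36s}: ~{peak_gb_s(width, rate):7.1f} GB/s")
```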
One approach in research today trades off some of that memory for compute engines as one way to alleviate the traffic jam. That necessitates a rethinking of physical architecture, as well as important decisions on functionality.
“If it’s so hard to bring all this data over to a processor, maybe we ought to put little compute engines much closer to the DRAM core itself,” he said. “Companies like Samsung and Hynix have demonstrated in sample silicon that they can do this kind of thing. If you do that, it does take away some of the DRAM capacity, because you’re removing the bit cells that store data, and you’re putting some compute engines in there. But there is an effort within the industry to determine the minimum necessary amount of compute logic needed to get the biggest bang for the buck.”
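The appeal of near-memory compute is that data-reducing operations (sums, filters, searches) can run next to the DRAM arrays so only results cross the interface. A minimal sketch, assuming a hypothetical interface energy per byte, compares the bytes moved with and without an in-memory reduction.

```python
# Illustrative comparison: moving raw data to the host vs. reducing it
# near memory and moving only the result. Energy number is an assumption.

E_INTERFACE_PJ_PER_BYTE = 40.0   # assumed off-chip interface energy per byte

def bytes_moved(num_elements, elem_bytes, near_memory_reduce):
    if near_memory_reduce:
        return elem_bytes                 # only the reduced result crosses
    return num_elements * elem_bytes      # every operand crosses the interface

N = 64 * 1024 * 1024                      # 64M values to sum (hypothetical)
for near in (False, True):
    moved = bytes_moved(N, 4, near)
    energy_uj = moved * E_INTERFACE_PJ_PER_BYTE / 1e6
    mode = "near-memory sum" if near else "host-side sum  "
    print(f"{mode}: {moved:>12,} bytes moved, ~{energy_uj:,.1f} uJ on the link")
```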
Do more, earlier
Figuring out which are the best architectural options for mitigating bottlenecks requires an increasing level of experimentation using high-level models early in the design cycle.
“In a tool for physical analysis you can build a power map on top of the floor plan and start to do thermal analysis,” said Tim Kogel, principal engineer for virtual prototyping in the Systems Design Group at Synopsys. “Putting things together to make the distances shorter might have an adverse effect on the thermal aspect. If these two blocks are both getting too hot and you need to spread out the computations on a bigger area — so as to have the power dissipation, but not run into thermal issues — it should be modeled and analyzed in a quantitative way earlier, so you don’t leave it to chance.”
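Kogel’s tradeoff between shorter distances and higher power density can be explored with even a toy model. The sketch below places two hypothetical hot blocks on a floor plan and checks whether packing them into a smaller footprint exceeds an assumed areal power-density budget; block powers, areas, and the limit are all illustrative.

```python
# Toy floor-plan check: does placing two hot blocks close together exceed an
# assumed areal power-density budget? All numbers are illustrative.

LIMIT_W_PER_MM2 = 0.5     # assumed thermal budget for this package

def region_density(block_powers_w, region_area_mm2):
    """Average power density if these blocks share the given region."""
    return sum(block_powers_w) / region_area_mm2

blocks = {"npu": 12.0, "hbm_phy": 6.0}   # hypothetical block powers, W

for area in (25.0, 50.0):                # candidate footprints, mm^2
    d = region_density(blocks.values(), area)
    verdict = "OK" if d <= LIMIT_W_PER_MM2 else "exceeds budget -> spread out"
    print(f"area {area:5.1f} mm^2: {d:.2f} W/mm^2 ({verdict})")
```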
In addition to more data passing through a chip or system, there also is more data to consider in the design process. This is especially true with more transistor density and different advanced packaging approaches, and it becomes even more complex with new approaches such as RDL circuitry, bridges, chiplets, and various interconnect schemes. For a designer working on a new chip or system, that can become overwhelming.
“You have more data, to the point where some companies will tell you that when they archive a design at the end of a project, they’re now talking about the need for petabytes of disk space to manage all this data,” said Michael Munsey, vice president of semiconductor industry for Siemens EDA. “You can basically equate the size of the transistor to the amount of data that needs to be managed and handed off from point to point to point. For the digital designer, that just means more and more files, more and more people collaborating on the design, maybe more IP coming in. Having to manage IP, from third parties, from other parts of your organization where you’re sharing design information, or maybe even with companies that you’re collaborating with, you have this explosion of data because the transistors have started to get small. This necessitates having formal processes to manage the traceability of the information along the entire design.”
As a result, what traditionally were discrete steps in the design process now have to be handled concurrently, and all of that has to happen earlier. This includes high-level tradeoffs between power, latency, bandwidth, and density, which must be addressed by building architectural models that take the workload you want to execute into account, said Kogel. “How do you partition that workload, either within a chip, within an SoC, between different types of processing and engines, CPUs, GPUs, AI accelerators? Or when it comes to multi-die partitioning, between multiple dies, and within that partitioning, how do you make the decision where to process something? Where do you store data, and how do you organize the data movements? That gives you a way to analyze these tradeoffs before going to implementation.”
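An early architectural model of the kind Kogel describes can start as a coarse estimator. The sketch below compares two hypothetical partitionings of a workload across two dies, using assumed compute rates and die-to-die bandwidth to estimate where the data-movement cost lands.

```python
# Early, coarse partitioning model: for each candidate split of a workload
# across two dies, estimate compute time and cross-die transfer time.
# All figures are assumed, illustrative values.

D2D_BW_GB_S = 100.0       # assumed die-to-die bandwidth
DIE_TFLOPS = 50.0         # assumed per-die compute throughput

def evaluate(partition):
    """partition: list of (tflops_of_work, gb_crossing_die_boundary) per stage."""
    compute_s = sum(tf / DIE_TFLOPS for tf, _ in partition)
    transfer_s = sum(gb / D2D_BW_GB_S for _, gb in partition)
    return compute_s, transfer_s

candidates = {
    # per stage: (TFLOPs of work, GB that must cross between dies)
    "split by layer":  [(40, 2.0), (40, 2.0)],
    "split by tensor": [(40, 12.0), (40, 12.0)],
}
for name, part in candidates.items():
    c, t = evaluate(part)
    print(f"{name:16s}: compute {c:.2f} s, die-to-die transfer {t:.2f} s")
```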
Conclusion
Growing complexity in chips has made moving data a complicated endeavor. While compute continues to grow rapidly, the ability of wires to move the data that compute generates is limited by the laws of physics.
Research continues into new architectures that can reduce the flow of data, such as computing closer to, or inside of, DRAM, and new ways to shorten the distance between transistors and memory, such as stacking logic on SRAM on a substrate. But in the end, it all comes down to the best way to process, store, and move data for a specific workload, and the number of options and hurdles makes that a daunting challenge.
Related Reading
Striking A Balance On Efficiency, Performance, And Cost
More efficient designs can save a lot of power, but in the past those savings have been co-opted for higher performance.
CPU Performance Bottlenecks Limit Parallel Processing Speedups
Hardware optimizations and well-thought-out software architectures can help, but only incrementally.
New AI Processor Architectures Balance Speed With Efficiency
Hot Chips 24: Large language models ratchet up pressure for sustainable computing and heterogeneous integration; data management becomes key differentiator.