Chiplets Make Progress, Using Interconnects As The Glue
Breaking up SoCs into their component parts and putting those and other pieces together in some type of heterogeneous assembly is beginning to take shape, fueled by advances in interconnects, complex partitioning, and industry learnings about what works and what doesn’t.
While the vision of plug-and-play remains intact, getting there is a lot more complicated than initially imagined. It can vary greatly by application and by workload, which in turn can affect timing, latency, and cost. And it can vary by package type, whether AI is included or not, how much software is needed for scheduling and prioritization, and the type of interconnects being used.
Interconnects are the glue, according to Rob Dimond, system architect and fellow at Arm. They encompass the network on chip (NoC) on the chiplet, all the other interconnects that reside within the chiplet, as well as the UCIe die-to-die connection, which carries data across chiplets. The interconnects work with other high-speed interfaces, as well, moving data from one boundary to another.
“The fundamental challenge with chiplet interconnects is understanding how you are disaggregating,” said Arif Khan, senior product marketing group director at Cadence. “You’re dividing your compute and data flow problem. What is your architecture all about? How have you partitioned it? You’ve got data flow within the chiplet, and then data flow across these chiplets. It all depends where your data is going and what the context is. For example, what is the problem you’re solving? If you’re looking at a GPU-type application, you can’t even fit that large language model into a single GPU. You’re looking instead at an AI factory of millions of them. Then you’re looking at different coherency models. You’re looking at the fact that even the standard protocols don’t cut it.”
Today those interconnects typically are wires (although in the future there may be optical interconnects between, and potentially even within, packages, or some combination of both). But not all wires behave the same way. They can have different diameters, be packed together at different densities with different insulation, and even be made of different materials.
“The number of wires you can get, and the characteristics of those wires, are very different,” said Elad Alon, CEO and co-founder of Blue Cheetah. “That’s what drives you to have to do things differently. The other piece of this — which is not fundamentally physics-driven, but is more just practical engineering-driven — is that oftentimes people want to isolate the timing interfaces across the chiplet boundaries. When a chiplet is in a 2.5D or 3D package, there’s some room for maneuvering, but it’s a typical design decision to isolate those timing interfaces from each other. This is primarily driven from the idea that it’s physically partitioned in a different die. ‘I don’t want to have to do this multiple cross-die timing closure exercise.’ It’s not that you can’t. It’s just that one is motivated not to do so for practical reasons. That’s the other place where chiplet interconnect will tend to be different than an on-die interconnect. The on-die interconnect will be within a single clock domain and can be driven by a more ‘standard’ place-and-route type of flow. But the fact that you have fewer wires means you need to run them faster. An isolated timing interface is where the analog people come into the picture to do that. And obviously, it’s as low-area, low-power as possible.”
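A rough back-of-envelope calculation illustrates the point about fewer, faster wires. The sketch below uses purely illustrative numbers (an assumed 512-bit on-die NoC at 2GHz and 64 die-to-die data wires), not figures from any particular design:

```python
# Back-of-envelope: matching on-die NoC bandwidth over a narrower die-to-die link.
# All numbers are illustrative assumptions, not from any specific product.

noc_width_bits = 512        # assumed on-die NoC datapath width
noc_clock_ghz = 2.0         # assumed on-die clock frequency

on_die_bw_gbps = noc_width_bits * noc_clock_ghz   # aggregate on-die bandwidth, Gb/s

d2d_wires = 64              # assumed data wires available at the die edge
per_wire_rate_gbps = on_die_bw_gbps / d2d_wires   # rate each D2D wire must sustain

print(f"On-die bandwidth: {on_die_bw_gbps:.0f} Gb/s")
print(f"Required rate over {d2d_wires} wires: {per_wire_rate_gbps:.0f} Gb/s per wire")
# 512 bits x 2 GHz = 1,024 Gb/s; spread over 64 wires, each must run at 16 Gb/s,
# which is why D2D PHYs run far faster than on-die wires and sit behind
# isolated, separately timed interfaces.
```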
Because a chiplet interconnect needs to transport data over a die-to-die connection, these physical interfaces often are very high speed but relatively narrow. Unlike an SoC interconnect, however, a chiplet interconnect normally is packetized, behaving more like a communications protocol than an on-chip bus.
“A chiplet interconnect normally will allow data to be sent across the inter-die link over a very wide interface in packetized format, which can be serialized and sent over the link,” explained Ashley Stevens, director of product management at Arteris. “Die-to-die interconnects need to support various sideband signaling, which in a SoC often is handled by point-to-point signaling, such as interrupts and power management. These also will need to be transported from die-to-die in packetized format over the same link as normal memory and peripheral transactions, and should not be forgotten.”
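The packetization Stevens describes can be pictured with a simple sketch. The format below is purely conceptual; it is not the UCIe, BoW, or AMBA CHI C2C packet layout, just an illustration of memory transactions and sideband events sharing one packetized die-to-die link:

```python
from dataclasses import dataclass
from enum import Enum, auto

# Conceptual sketch only: a hypothetical wrapper showing how normal memory
# transactions and sideband events (interrupts, power management) could share
# one packetized die-to-die link. Not an actual standard's packet format.

class PacketType(Enum):
    MEM_READ = auto()
    MEM_WRITE = auto()
    INTERRUPT = auto()     # sideband: point-to-point wires on an SoC, packetized here
    POWER_MGMT = auto()    # sideband: e.g., a low-power entry request

@dataclass
class D2DPacket:
    ptype: PacketType
    src_chiplet: int
    dst_chiplet: int
    address: int = 0       # meaningful only for memory transactions
    payload: bytes = b""   # write data, interrupt vector, or power-management message

# Both traffic classes are serialized over the same physical link.
link_queue = [
    D2DPacket(PacketType.MEM_WRITE, 0, 1, address=0x8000_0000, payload=b"\x00" * 64),
    D2DPacket(PacketType.INTERRUPT, 1, 0, payload=bytes([42])),  # hypothetical IRQ number
]
```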
Those interconnects also need to be matched to the application. “Chiplets demand a highly efficient D2D (die-to-die) interconnect that excels on critical parameters,” said Letizia Giuliano, vice president, product marketing and management at Alphawave Semi. “We need to tailor the D2D interconnect for the chiplet applications to optimize overall TCO (total cost of ownership) for that interface on a given system in the package. Area efficiency is measured in bandwidth shoreline density, which enables the highest Tb/s of data per millimeter of shoreline. Power is energy efficiency, and pJ/bit needs to be as low as possible. When we use a D2D interconnect in a chiplet, we duplicate I/O circuitry. Both physical layer and digital logic are added, and they need to reduce the impact on the overall power budget and fit in the overall TCO.”
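The two metrics Giuliano cites can be made concrete with some simple arithmetic. The numbers below are assumptions chosen for illustration, not figures for any particular PHY:

```python
# Illustrative arithmetic for shoreline bandwidth density and energy efficiency.
# All values are assumptions for the example, not data for any specific product.

total_bandwidth_tbps = 8.0    # assumed aggregate die-to-die bandwidth, Tb/s
shoreline_mm = 4.0            # assumed die-edge length occupied by the PHY, mm
energy_pj_per_bit = 0.5       # assumed link energy efficiency, pJ/bit

shoreline_density = total_bandwidth_tbps / shoreline_mm                      # Tb/s per mm
link_power_w = (total_bandwidth_tbps * 1e12) * (energy_pj_per_bit * 1e-12)   # watts

print(f"Shoreline density: {shoreline_density:.1f} Tb/s/mm")
print(f"Link power at full bandwidth: {link_power_w:.1f} W")
# 8 Tb/s at 0.5 pJ/bit burns 4 W in the D2D I/O alone, which is why pJ/bit has
# to stay low for the interface to fit the overall power budget and TCO.
```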
Latency is a critical performance metric, and transmitter (TX) plus receiver (RX) travel time needs to be minimized. “The design of a D2D interconnect must strike a delicate balance between circuit complexity and best-in-class PPA,” Giuliano said. “This ensures that we do not oversize circuits and lose focus on the application space. For instance, a simple interface with a single-ended architecture and good balance in voltage regulations aids in power efficiency. Simultaneously, compact circuitry in the analog TX and RX requires careful study for mismatch and noise.”
Maximizing the benefits of heterogeneous integration requires a deep understanding of the end application and workload and how best to design a solution for that specific domain. “We cannot lose touch with the application space and minimize overall TCO, so the D2D architecture needs to be designed for different types of packages and bump pitches. When designing a system, we need to consider all circuit impairments for a realistic implementation,” Giuliano noted. “We are moving from the on-die to the on-package. The natural way of breaking down our SoC die in a chiplet system in the package is to transport our SoC network on chip on the package, so we are adding a physical layer transport to our nominal on-die transport layer.”
Moving data in chiplets
There are numerous competing protocols available for moving data. AMBA CHI, UCIe, and BoW are the best known. Which one, or combination, ultimately wins remains to be seen. But they essentially perform the same function of moving data quickly between chiplets.
“[AMBA CHI] is packetized, widely used and openly licensed, and is the basis for AMBA CHI C2C to enable it to be connected between chiplets using a suitable chiplet physical and link layer,” said Arm’s Dimond. “For aggregating components across the motherboard into a package, it works best to use established interconnect standards atop a new physical layer that is optimized for chiplets. For disaggregating an SoC into multiple chiplets, it similarly makes sense to use an established on-SoC interconnect.”
Arm sees chiplet interconnects evolving from either existing on-board, or existing on-SoC interconnects. But with chiplet architectures, there are more and different layers to consider.
“For the physical layer, a die-to-die interconnect between chiplets will likely support fewer physical connections running over longer distances,” Dimond explained. “A SerDes may be required. In the case of AMBA CHI C2C, the protocol is packetized to support running atop a physical layer. The protocol layer will need an architecture specification to give the required long-term stability to support reuse over time, and potentially between different players in a value chain as an ecosystem emerges.”
To a large extent, chiplet-to-chiplet communication is a partitioning problem, and it’s one that is particularly challenging in automotive designs.
“Here’s an example — I can get a chiplet from Company X that has a perfect CPU complex on it, but it doesn’t have a GPU,” said David Fritz, vice president, Hybrid and Virtual Systems at Siemens Digital Industries Software. “I’m trying to do something for IVI, so I need a GPU there for rendering. There’ll be companies that say, ‘How about if I just take our GPU and I put it in a chiplet of its own, and I’m going to call that chiplet a droplet?’ It’s just one subsystem block that can’t stand on its own. People will create these droplets, then what they’ll do is say, ‘You go ahead and take our droplet and go to some other company, and they’ll put what they need around it.’ So what’s happened is we’ve gone right back to essentially selling hard macros. ‘I’ve got the GPU here, but my memory is on another chiplet? Oh, wait a minute, that’s not going to work because I don’t have the bandwidth that I need for a GPU, for high res, multi display.’ So again, if you don’t have the tools to explore the complexities of the space and derive the deeper, hard requirements that are not intuitive or obvious, then you’re just going to end up making bad decisions and you’re not going to end up with a competitive product.”
Partitioning in heterogeneous systems is about more than just hardware. Software also needs to be compatible across chiplets.
“If you think about inference, inference usually works with a smaller data set and makes decisions on that,” said Kevin Donnelly, vice president of strategic marketing at Eliyan. “The processing elements might be all contained within one chip, and the interconnect that you need to do is to the outside world and to memory. That drives what kind of interconnect you have, and what kind of bandwidth you need over those interconnects. That would drive the partitioning of an inference-like chipset. If it’s training and you’re dealing with massive data sets like NVIDIA does, what they care about is taking lots of very large disaggregated chips and making them look seamlessly, like they’re actually just bigger and bigger monolithic chips. In those, they need to interconnect the GPU cores as tightly as they can and get as much bandwidth between the chiplets. That off-chip interconnect issue is what drove their partitioning decision, and it’s the reason they rotated it 90 degrees versus what everyone else had done prior to that, which was to try to make two massive, monolithic die look like one even larger, huge, monolithic die. Then the connections outside that go to the I/O world and to other memory. That’s how the on-chip interconnects played a role in their partitioning. At a software level, they’re able to make it look like one huge processor versus two disaggregated ones, and that lets them get great performance benchmarks, according to what they’ve published versus what had been available prior to that.”
This also can be referred to as cross-sectional bandwidth and energy consumption. “These are two things you need to pay attention to when you’re partitioning things away from each other, off of a monolithic chip and into two heterogeneous pieces that need to get reconnected, or homogeneous, for that matter,” noted Patrick Soheili, chief strategy and business officer at Eliyan. “You’re looking at areas where you can afford to put more power in because now you can connect them outside the die. It’s always more efficient to do it inside, but if you don’t have room you have no choice. So one decision is made by that. The other one is how fast one chip needs to talk to the other, i.e., what the cross-sectional bandwidth needs to be, and whether I can afford to put it far from each other and not in a monolithic chip. Those two are software partitioning, and making sure the whole system sees the SIP as one — which is always a critical piece of that — has nothing to do with the chiplet strategy other than making sure that everything works together as a subsystem.”
What chiplets bring to interconnect implementation
The advent of chiplet systems brings with it a new challenge of creating production-ready implementations. “This necessitates a new way of testing the D2D interface at increasing data rates and allowing the testing and screening of good die,” Alphawave’s Giuliano said. “How do we test a D2D interconnect physical layer on the wafer or on the package? Do we know if the HBM learning applies here, or do we need to do things differently? We are now talking about links at higher data rates of 32Gbps, and moving to 64Gbps per pin, which are connecting more and more chiplets. Typically, this is implemented in advanced bump pitches that are not probe-able at the wafer level. It is essential to design test-level structures inside our PHY that can provide insight into the health of the silicon and observability of critical timing parameters over time.”
Alphawave has implemented advanced testing and debugging methodologies that allow its engineering team to test the link with internal loopback and register access. The company is also collaborating with OSATs to implement structural tests that ensure comprehensive test coverage across the D2D structures.
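In broad strokes, such a register-driven loopback check follows a simple flow. The sketch below is only an illustration of that flow; the register names, offsets, and PHY interface are hypothetical placeholders, not Alphawave’s actual test interface:

```python
# Hedged sketch of a register-driven internal-loopback check: enable loopback,
# run a test pattern, read the error counter. Register names/offsets and the
# `phy` object are hypothetical placeholders, not a real vendor interface.

LOOPBACK_CTRL = 0x10   # hypothetical: route TX output back into RX inside the PHY
PRBS_CTRL     = 0x14   # hypothetical: PRBS pattern generator/checker enable
ERR_COUNT     = 0x18   # hypothetical: accumulated bit-error counter

def internal_loopback_check(phy, dwell_seconds=1.0):
    phy.write_reg(LOOPBACK_CTRL, 1)   # close the loop inside the PHY
    phy.write_reg(PRBS_CTRL, 1)       # start generating and checking the pattern
    phy.wait(dwell_seconds)           # accumulate errors for a fixed window
    errors = phy.read_reg(ERR_COUNT)
    phy.write_reg(PRBS_CTRL, 0)       # stop the pattern
    phy.write_reg(LOOPBACK_CTRL, 0)   # restore the normal datapath
    return errors == 0                # pass only if no bit errors were observed
```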
Another new problem stems from integrating D2D interconnects and chiplets from different vendors and implementations that need to interoperate. “Today, most of the systems we are deploying have one single vendor implementation, but we are working with ecosystem partners and customers to pave the way for multi-vendor interoperability. We have created test-vehicle and release chiplets that can be used with other parties to pipe-clean electrical interoperability testing and protocol testing,” Giuliano noted.
System discovery is another area that will need to be standardized in chiplets, Arteris’ Stevens said. “To create an ecosystem of chiplets, they will need the ability to ‘discover’ what is out there and align to form a system if the requirement is to support a true chiplet mix and match. Today, chiplets are designed and verified together as a single system, but that lacks flexibility in how they’re used together. Verification IP is also key to chiplets. To enable interoperability, there must be trusted ‘golden’ verification IP that is relied on in the industry. This enables chiplet designs to verify to the VIP and not need to verify to other chiplets.”
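What such discovery could eventually look like is sketched below. Since no such mechanism is standardized today, the descriptor fields and enumeration routine are assumptions for illustration only:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch of chiplet discovery: each chiplet exposes a small
# descriptor that the system reads at boot to learn what is attached. No such
# format is standardized today; every field here is an assumption.

@dataclass
class ChipletDescriptor:
    vendor_id: int
    function: str             # e.g., "cpu", "gpu", "io", "memory"
    d2d_protocol: str         # e.g., "UCIe + CHI C2C"
    link_width_lanes: int
    max_lane_rate_gbps: float

def enumerate_chiplets(read_descriptor: Callable[[int], Optional[ChipletDescriptor]],
                       num_ports: int):
    """Walk each die-to-die port and collect whatever responds."""
    found = []
    for port in range(num_ports):
        desc = read_descriptor(port)   # platform-specific read, assumed to exist
        if desc is not None:
            found.append((port, desc))
    return found
```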
The overall memory map also must be considered from the perspective of the interconnect. “The memory map is how accesses to specific addresses map to memory controllers in the system,” Stevens said. “In a chiplet system, memory accesses can go across chiplets. The mapping of this can have performance effects. A fine-grained mapping would spread accesses evenly across chiplets, but may cause performance issues due to the longer latency of remote chiplets. A coarse-grained mapping may be better, but then the accesses may not be spread as evenly, so there’s a tricky tradeoff to be made. System architects should model this, but another approach is to make this boot-time configurable so that it can be experimented with after silicon bring-up.”
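The fine- versus coarse-grained tradeoff Stevens describes can be shown with a minimal sketch. The granule sizes and chiplet count below are assumptions for illustration; a real memory map would also handle holes, offsets, and non-power-of-two configurations:

```python
# Minimal sketch of address-to-chiplet interleaving at different granularities.
# Granule size and chiplet count are illustrative, boot-time-style parameters.

def target_chiplet(address: int, granule_bytes: int, num_chiplets: int) -> int:
    """Map a physical address to the chiplet whose memory controller owns it."""
    return (address // granule_bytes) % num_chiplets

# Fine-grained interleave (256 B granules): consecutive blocks rotate across
# chiplets, spreading load evenly but sending many accesses to remote chiplets
# with longer latency.
fine = [target_chiplet(a, 256, 4) for a in range(0, 2048, 256)]
# -> [0, 1, 2, 3, 0, 1, 2, 3]

# Coarse-grained interleave (1 GiB granules): most of a workload's footprint
# stays local to one chiplet, keeping latency low but risking uneven loading.
coarse = [target_chiplet(a, 1 << 30, 4) for a in (0, 1 << 30, 3 << 30)]
# -> [0, 1, 3]
```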
Yet another important consideration for chiplet architecture is that no single D2D interconnect fits every chiplet partitioning and architecture. “It is essential to understand the target KPI to select the correct configurations for the D2D link and chiplet partitions,” Giuliano noted. “We bring our chiplet custom silicon expertise and D2D interconnect leadership to guide our customers to partition the system correctly and find the best compromise between hitting optimal TCO and time to market. An important example is the packaging technology, and the D2D configurations needed for that given configuration. The selections need to involve all layers of the chiplet interconnect, from the electrical PHY layer and package type up to the interconnect protocol and the chiplet partitions specific to the domain architecture.”
Fig. 1: Alphawave’s multi-standard I/O chiplet. Source: Alphawave Semi
With better understanding of chiplet interconnects, the big question is how soon until there is a commercial chiplet marketplace. While companies such as Intel, AMD, NVIDIA, and Apple already are using chiplets, those chiplets are designed specifically for their own devices. Having commercial chiplets that are essentially plug-and-play is still a ways off.
“The next level we will see is the current players opening up ecosystems around their IP, allowing companion chiplets,” said Tim Kogel, senior director, technical product management at Synopsys. “That will require a whole methodology of architecture and tooling for collaboration. Especially in the automotive industry, that is a very important trend. In Europe there is the imec Automotive Chiplet Program (ACP). Japan has the Advanced SoC Research for Automotive (ASRA) consortium. There are working groups for architecture collaboration, and the physical aspects. How do we make it work at the signal level? How do we make it work in terms of macro architecture to fit things together? Especially in automotive, there is this big drive, because they clearly see the benefits of using the chiplet concept to have this scalable architecture. They want to go from a low-end car to a mid-end to a high-end by just saying in a simplified way, ‘Okay, this is one chiplet, this is two, this is four chiplets.’ They see the big economic scale, and they are going to enable this going through the chiplet path.”
Much work still needs to be done before that happens, however. “As an industry, we are still learning about chiplets and the standards, all of which work on different areas,” said Chun-Ting “Tim” Wang Lee, signal integrity application scientist and high speed digital applications product manager at Keysight. “The big challenge for the industry is to focus on making sure they all work together since there will be a time when they all have to interconnect and function together.”
Related Reading
Chiplets: 2023 (EBook)
What chiplets are, what they are being used for today, and what they will be used for in the future.