Bug Hunting in NoCs. Innovation in Verification
Even though NoCs are finely tuned within their legacy subsystems, when those subsystems are connected in larger designs, or even across multi-die structures, differing traffic policies and system-level delays between NoCs can introduce new opportunities for deadlocks, livelocks and other hazards. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
The Innovation
This month’s pick is NoCFuzzer: Automating NoC Verification in UVM, published in the 2024 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. The authors are from Peking University, Hong Kong University and Alibaba.
Functional bugs should be relatively uncommon in production-grade NoCs, but performance bugs are highly dependent on expected traffic and configuration choices. By their nature, NoCs will almost unavoidably include cycles; the mesh and toroidal topologies common in many-core servers and AI accelerators are obvious examples. Traffic in such cases may be subject to deadlock or livelock problems under sufficient load. Equally, weaknesses in scheduling algorithms can lead to resource starvation. Such hazards need not block traffic in a formal sense (never clearing) to undermine product success. If they take sufficiently long to clear, the system will still fail to meet its expected service level agreements (SLAs).
There are traffic routing and scheduling solutions to mitigate such problems – many such solutions. Any of these will work fine within one NoC designed by one system integration team, but what happens when you must combine multiple legacy or third-party subsystems, each with a NoC designed according to its own policy preferences and connected through a top-level NoC with its own policies? This issue takes on even more urgency in chiplet-based designs, which add interposer NoCs to connect between chiplets. Verification solutions become essential to tease out potential bugs between these interconnected networks.
Paul’s view
A modern server CPU can have 100+ cores, all connected through a complex coherent mesh-based network-on-chip (NOC). Verifying this NOC for correctness and performance is a very hard problem and a hot topic with many of our top customers.
This month’s paper takes a concept called “fuzzing” from the software verification world and applies it to UVM-based verification of a 3×3 OpenPiton NOC. The results are impressive: line and branch coverage hit 95% in 120hrs with the UVM bench vs. 100% in 2.5hrs with fuzzing; functional covergroups reach 89-99% in 120hrs with the UVM bench vs. 100% across all covergroups in 11hrs with fuzzing. The authors also try injecting a corner-case starvation bug into the design. The baseline UVM bench was not able to hit the bug after 100M packets, whereas fuzzing hit it after only 2M packets.
To achieve these results the authors use a fuzzing tool called AFL – check out its Wikipedia page. A key innovation in the paper is the way the UVM bench is connected to AFL: the authors invent a simple 4-byte XYLF format to represent a packet on the NOC. XY is the destination location, L the length, and F a “free” flag. The UVM bench reads a binary file containing a sequence of 4-byte chunks and injects each packet in the sequence into the NOC round-robin style: first packet from cpu 00, then cpu 01, 02, 10, 11, and so on. If F is below some static threshold T then the UVM bench has that cpu put nothing into the NOC for the equivalent length of that packet. The authors set T for a 20% chance of a “free” packet.
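For concreteness, here is a minimal Python sketch of how such a stimulus file might be consumed, assuming the 4-byte XYLF layout Paul describes; the function and constant names are illustrative, not taken from the paper.

```python
import struct

GRID = [(x, y) for x in range(3) for y in range(3)]  # node ids for a 3x3 mesh
FREE_THRESHOLD = 51  # F < 51 out of 0-255 gives roughly a 20% chance of a "free" slot

def decode_stimulus(path):
    """Decode a binary file of 4-byte XYLF records into injection commands.

    Packets are dealt out to nodes round-robin: cpu 00, then 01, 02, 10, ...
    """
    with open(path, "rb") as f:
        data = f.read()
    commands = []
    for i in range(len(data) // 4):
        x, y, length, free = struct.unpack_from("4B", data, i * 4)
        src = GRID[i % len(GRID)]        # round-robin source node
        dest = (x % 3, y % 3)            # clamp destination into the mesh
        if free < FREE_THRESHOLD:
            commands.append(("idle", src, length))  # cpu injects nothing for this slot
        else:
            commands.append(("send", src, dest, length))
    return commands
```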
AFL is given an initial seed set of binary files taken from a non-fuzzed UVM bench run, applies them to the UVM bench, and is given back coverage data from the simulator – each line, branch, and covergroup is simply treated as a coverpoint. AFL then starts applying mutations: randomly modifying bytes, splicing and re-stitching binary files, and so on. A genetic algorithm guides the mutation towards increasing coverage. It’s a wonderfully abstract, simple, and elegant utility, completely blind to the goals for which it is aiming to improve coverage.
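A toy version of this loop might look like the sketch below, with run_simulation standing in for launching the UVM bench and collecting its coverage report; AFL’s real mutation stack and seed scheduling are far richer than this.

```python
import random

def mutate(data: bytes) -> bytes:
    """Apply one of AFL's basic mutation moves to a seed (illustrative subset)."""
    buf = bytearray(data)
    op = random.choice(["flip", "byte", "splice"])
    if op == "flip" and buf:
        i = random.randrange(len(buf) * 8)
        buf[i // 8] ^= 1 << (i % 8)                 # single bit flip
    elif op == "byte" and buf:
        buf[random.randrange(len(buf))] = random.randrange(256)  # random byte overwrite
    else:
        cut = random.randrange(len(buf) + 1)        # splice: re-stitch two pieces
        buf = buf[cut:] + buf[:cut]
    return bytes(buf)

def fuzz(seeds, run_simulation, iterations=10_000):
    """Greedy coverage-guided loop: keep any mutant that hits new coverpoints."""
    corpus = list(seeds)
    covered = set()
    for _ in range(iterations):
        parent = random.choice(corpus)
        child = mutate(parent)
        new_points = run_simulation(child) - covered  # coverpoints hit this run
        if new_points:                                # reward coverage progress
            covered |= new_points
            corpus.append(child)                      # child becomes a new seed
    return corpus, covered
```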
Great paper. Lots of potential to take this further commercially!
Raúl’s view
Fuzzing is a technique for automated software testing in which a program is fed malformed or partially malformed data. These test inputs are usually variations on valid samples, modified either by mutation or according to a defined grammar. This week’s paper uses AFL (American Fuzzy Lop, named after a breed of rabbit), which employs mutation; its description offers a good introduction to fuzzing. Note that fuzzing differs from the random or constrained-random verification commonly applied in hardware verification.
The authors apply fuzzing techniques to hardware verification, specifically targeting Network-on-Chip (NoC) systems. The paper details the development of a UVM-based environment connected to the AFL fuzzer within a standard industrial verification process. They use Verilog and the Synopsys VCS simulator, and focus on conventional coverage metrics, predominantly code coverage. To interface the AFL fuzzer to the UVM test environment, the test output of the fuzzer must be translated into a sequence of inputs for the NoC. Every NoC packet is represented as a 40-bit string which contains the destination address, the length, the port (each node in the NoC has several ports) and a control flag that determines whether the packet is executed or the port remains idle. These strings are mutated by AFL, and a simple grammar converts them into inputs for the NoC; this is one of the main contributions of the paper, as sketched below. The fuzzing framework is adaptable to any NoC topology.
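A hedged Python sketch of such a grammar follows, with assumed (not the paper’s actual) field widths; the key idea is that modulo-clamping maps any mutated bit string onto a legal stimulus, so AFL can mutate freely without ever producing an unusable test.

```python
def parse_packet(bits: str):
    """Map an arbitrary 40-bit string onto a legal NoC stimulus.

    Field widths here are assumptions for illustration, not the paper's layout.
    """
    assert len(bits) == 40
    dest   = int(bits[0:8], 2) % 9        # destination node in a 3x3 mesh
    length = int(bits[8:24], 2) % 64 + 1  # bounded payload length (assumed max 64)
    port   = int(bits[24:32], 2) % 4      # injection port (assumed 4 ports per node)
    flag   = int(bits[32:40], 2)          # control flag: execute vs. idle
    if flag < 51:                         # ~20% idle slots, mirroring Paul's description
        return ("idle", port, length)
    return ("send", dest, port, length)
```

For example, parse_packet("0" * 40) decodes to ("idle", 0, 1): the all-zeros string still yields a valid, well-formed stimulus.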
NoCs are the communication fabric of choice for digital systems containing hundreds of nodes, and they are hard to verify. The paper presents a case study of a compact 3×3 mesh NoC element from OpenPiton. The results are impressive: fuzz testing achieved 100% line coverage in 2.6 hours, while constrained random verification (CRV) only reached 97.3% in 120 hours. For branch coverage, fuzz testing achieved full coverage in 2.4 hours while CRV only reached 95.2% in 120 hours.
The paper is well written and offers impressive detail, with a practical focus that underscores its relevance in an industrial context. While occasionally somewhat verbose, it is certainly an excellent read.