Preparing for a Shake-up in Edge NPUs
When the potential for AI at the edge first fired our imagination, semiconductor designers recognized that performance (and low power) required an accelerator, and many decided to build their own. Requirements weren't too complicated, commercial alternatives were limited, and who wanted to add another royalty to further reduce margins? We saw NPUs popping up everywhere: in-house, in startups, and in extensions to commercial IP portfolios. We're still in that mode, but there are already signs that this free-for-all must come to an end, particularly for AI at the edge.
Accelerating software complexity
The flood of innovation around neural-net architectures, AI models, and foundation models has been inescapable. Architectures have run from CNNs to DNNs, to RNNs, and ultimately (so far) to transformers. Models span vision, audio/speech, radar and lidar, and large language models. Foundation models include ChatGPT, Llama, and Gemini. The only certainty is that whatever you think is state-of-the-art today will have to be upgraded next year.
The operator/instruction-set complexity required to support these models has also exploded. Where once a simple convolutional model might need fewer than 10 operators, the ONNX standard now supports 186 operators, and NPUs make allowance for extensions to this core set. Models today combine a mix of matrix/tensor, vector, and scalar operations, plus math operations (activation, softmax, etc.). Supporting this range requires a software compiler to map standard (reduced) network models onto the underlying hardware, plus an instruction set simulator to validate behavior and check performance against the target platform.
NPU providers must now commonly provide a ModelZoo of pre-proven/optimized models (CV, audio, etc.) on their platforms to allay cost-of-adoption and cost-of-ownership concerns for buyers faced with this complexity.
Accelerating hardware complexity
Training platforms are now architecturally quite bounded, today mostly a question of whose GPU or TPU you want to use. The same cannot be said for inference platforms. Initially these were viewed somewhat as scaled-down versions of training platforms, mostly switching floats to fixed and more tightly quantizing word sizes. That view has now changed dramatically. Most of the hardware innovation today is happening in inference, especially for edge applications where there is significant pressure on competitive performance and power consumption.
In optimizing trained networks for edge deployment, a pruning step zeroes out parameters that have little impact on accuracy. Keeping in mind that some models today host billions of parameters, zeroing such parameters can in theory dramatically boost performance (and reduce power), because the calculations involving them can be skipped.
This "sparsity" enhancement works if the hardware runs one calculation at a time, but modern hardware exploits massive parallelism in systolic-array accelerators for speed, and such accelerators can't skip calculations scattered through the array. There are software and hardware workarounds to recapture the benefits of pruning, but these are still evolving and unlikely to settle soon.
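A toy NumPy sketch of magnitude pruning, the simplest form of the pruning step described above (the 50% sparsity target and the example weights are illustrative only; real flows also fine-tune to recover accuracy):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights.

    A toy post-training magnitude pruner; ties at the threshold may
    zero slightly more than the requested fraction.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

w = np.array([[0.9, -0.05, 0.4],
              [0.01, -0.7, 0.02]])
pw = magnitude_prune(w, sparsity=0.5)
# Half the entries are now zero; in principle the multiply-accumulates
# for those entries could be skipped at inference time -- the hard part,
# as noted above, is skipping them inside a parallel systolic array.
```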
Convolutional networks, for many of us the start of modern AI, remain a very important component, for example for feature extraction in many AI models, even in vision transformers (ViT). These networks can also run on systolic arrays, but less efficiently than the regular matrix multiplications common in LLMs. Finding ways to further accelerate convolution is a very hot topic of research.
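The standard trick for mapping convolution onto a matrix engine is im2col lowering: each receptive field is unrolled into a column so the convolution becomes one matrix multiply. The inefficiency mentioned above comes from the duplication this creates, since overlapping windows copy the same input pixels many times. A minimal single-channel NumPy sketch (stride 1, valid padding):

```python
import numpy as np

def im2col_conv2d(x, k):
    """2-D convolution (cross-correlation form, as in most ML
    frameworks) lowered to a matrix multiply via im2col."""
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    # Each output pixel's receptive field becomes one column.
    # Note the data duplication: overlapping windows repeat pixels.
    cols = np.empty((kh * kw, oh * ow))
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[i:i + kh, j:j + kw].ravel()
            idx += 1
    # The convolution is now a single (1 x kh*kw) @ (kh*kw x oh*ow)
    # matmul -- exactly the shape a systolic array is built for.
    return (k.ravel() @ cols).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))          # 2x2 box filter: sums each window
y = im2col_conv2d(x, k)
```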
Beyond these big acceleration challenges there are vector calculations such as activation and softmax, which either require math functions not supported in a standard systolic array or could run on such an array only inefficiently, since most of the array would sit idle during single-row or single-column operations.
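Softmax illustrates the mismatch: it needs exponentials and a reduction across a whole vector rather than multiply-accumulates, which is why it typically lands on a vector/DSP unit rather than the matrix array. A minimal numerically stable version in NumPy:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector.

    Exponentials and a full-vector reduction: operations a systolic
    MAC array does not natively provide.
    """
    shifted = v - np.max(v)   # subtract max to guard exp() overflow
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs is a probability distribution: non-negative, sums to 1
```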
A common way to address this set of challenges is to combine a tensor engine (a systolic array), a vector engine (a DSP), and a scalar engine (a CPU), possibly in multiple clusters. The systolic array handles whatever operations it can serve best, hands off vector operations to the DSP, and passes everything else (including custom/math operations) to the CPU.
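A hypothetical dispatch table makes this division of labor concrete. Engine and operator names below are illustrative only, not taken from any real NPU SDK:

```python
# Illustrative operator-to-engine partitioning for a tri-engine NPU.
# Operator sets are hypothetical examples, not a real SDK's taxonomy.
TENSOR_OPS = {"conv", "matmul"}            # systolic array's sweet spot
VECTOR_OPS = {"softmax", "layernorm", "gelu"}  # row/column-wide math

def assign_engine(op):
    """Route each graph operator to the engine that serves it best."""
    if op in TENSOR_OPS:
        return "systolic_array"
    if op in VECTOR_OPS:
        return "dsp"
    return "cpu"   # fallback for custom, scalar, and exotic math ops

graph = ["conv", "gelu", "matmul", "softmax", "topk"]
plan = [(op, assign_engine(op)) for op in graph]
# Every hand-off between engines is a synchronization and data-movement
# point -- the source of the programming-model complexity noted below.
```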
That makes sense, but this solution requires a minimum of three compute engines. Product cost goes up both in die area and possibly in royalties, power consumption goes up, and the programming and support model becomes more complex, since software must be managed, debugged, and updated across all three engines. You can understand why software developers would prefer to see all this complexity handled within a common NPU engine with a single programming model.
Growing supply chain/ecosystem complexity
Intermediate builders in the supply chain, a Bosch or a ThunderSoft for example, must build or at least tune models to be optimized for the end-system application, considering say different lens options for cameras. They don't have the time or the margin to accommodate a wide range of different platforms. Their business realities will inevitably limit which NPUs they will be prepared to support.
A little further out, but not by much, software ecosystems are eager to grow around high-volume edge markets. One example is software/models for earbuds and hearing aids in support of audio personalization. These value-add software companies will also gravitate toward a small number of platforms they will be prepared to support.
Survival of the fittest is likely to play out even faster here than it did around a much earlier proliferation of CPU platforms. We still need competition between a few options, but the current Cambrian explosion of edge NPUs must come to an end fairly quickly, one way or another.