Vision Is Why LLMs Matter
Large Language Models (LLMs) have taken the world by storm since the 2017 Transformer paper, but pushing them to the edge has proved problematic. Just this year, Google had to revise its plans to roll out Gemini Nano on all new Pixel models: the down-spec’d hardware options proved unable to host the model as part of a positive user experience. But the implementation of language-focused models at the edge is perhaps the wrong metric to look at. If you are forced to host a language-focused model for your phone or car in the cloud, that may be acceptable as an intermediate step in development. Vision applications of AI, on the other hand, are not so flexible: many of them rely on low latency and high dependability. If a vehicle relies on AI to identify that it should not hit the obstacle in front of it, a blip in contacting the server can be fatal. Accordingly, the most important LLMs to fit on the edge are vision models, the models whose purpose is most undermined by the reliance on remote resources.
“Large Language Models” can be an imprecise term, so it is worth defining. The original 2017 Transformer LLM that many see as kickstarting the AI rush had 215 million parameters. BERT was giant for its time (2018) at 335 million parameters. Both of these models might be relabeled as “Small Language Models” by some today to distinguish them from models like GPT-4 and Gemini Ultra, with reportedly as many as 1.7 trillion parameters, but for the purposes here, all fall under the LLM category. All of these are language models, though, so why do they matter for vision? The trick here is that language is an abstract system of deriving meaning from a structured ordering of arbitrary objects. There is no “correct” association of meaning and form in language which we could base these models on. Accordingly, these arbitrary units are substitutable: nothing forces architecture developed for language to be applied only to language, and all the language objects are converted to multidimensional vectors anyway. LLM architecture is thus highly generalizable, and it typically retains the core strength that comes from having been developed for language: a strong ability to carry through semantic information. Thus, when we talk about LLMs at the edge, we may mean a language model cross-trained on image data, or a vision-only model built on the foundation of technology designed for language. At the software and hardware levels, for bringing models to the edge, this distinction makes little difference.
Vision LLMs on the edge apply flexibly across many different use cases, but the key applications where they show the greatest advantages are: embodied agents (an especially striking example of the benefits of cross-training embodied agents on language data can be seen in Dynalang’s advantages over DreamerV3 in interpreting the world, due to superior semantic parsing), inpainting (as seen with the latent diffusion models), decision-making in self-driving vehicles (as seen with LINGO-2), context-aware security (such as ViViT), information extraction (Gemini’s ability to find and report data from video), and user assistance (physician aids, driver assist, etc.). Especially notable and exciting here is the ability of vision LLMs to leverage language as a lossy storage and abstraction of visual data for decision-making algorithms to then interact with, as seen in LINGO-2 and Dynalang. Many of these vision-oriented LLMs depend on edge deployment to realize their value, and they benefit from the work that has already been done to optimize language-oriented LLMs. Despite this, vision LLMs still struggle with edge deployment, just as the language-oriented models do. The improvements for edge deployment come in three classes: model architecture, system resource utilization, and hardware optimization. We will briefly review the first two and look more closely at the third, since it often gets the least attention.
Model architecture optimizations are the changes that must be made at the model level: “distilling” models to create leaner imitators, restructuring where models spend their resource budget (such as the redistribution of transformer modules in Stable Diffusion XL), and pursuing alternate architectures (state-space models, H3 modules, etc.) to escape the quadratically scaling costs of transformers.
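To make the quadratic scaling concrete, here is a minimal sketch that counts the multiply-accumulates needed to form the attention score matrix for one head as an image grows. The 16-pixel patch size, the 64-wide head dimension, and the function names are illustrative assumptions, not figures from any model discussed here.

```python
def num_patches(image_side: int, patch_side: int = 16) -> int:
    """Tokens produced by cutting a square image into square patches (assumed ViT-style)."""
    per_side = image_side // patch_side
    return per_side * per_side


def attention_score_macs(tokens: int, head_dim: int = 64) -> int:
    """Multiply-accumulates to build the tokens x tokens score matrix for one head."""
    return tokens * tokens * head_dim


# Doubling the image side quadruples the token count,
# which multiplies the attention score cost by sixteen.
for side in (256, 512, 1024):
    tokens = num_patches(side)
    print(side, tokens, attention_score_macs(tokens))
```

Under these assumptions, the cost of the score matrix grows with the fourth power of the image side, which is why architectures that avoid the all-pairs comparison are attractive for vision workloads.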
System resource optimizations are all the things that can be done in software to an already complete model. Quantization (to INT8, INT4, or even INT2) is a common focus here for reducing both latency and memory burden, but it can compromise accuracy. Speculative decoding can improve utilization and latency. Tiling, as seen with FlashAttention, has become near-ubiquitous for improving utilization and latency.
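As a rough illustration of the quantization trade-off, the sketch below applies symmetric per-tensor INT8 quantization to a handful of made-up weights and measures the rounding error it introduces. This is only one simple scheme among many; the weight values, the one-billion parameter count, and the helper names are assumptions chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


def weight_bytes(param_count, bits_per_weight):
    """Storage needed for the weights alone at a given precision."""
    return param_count * bits_per_weight // 8


weights = [0.42, -1.27, 0.003, 0.9]  # made-up values for illustration
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)

# Hypothetical one-billion-parameter model at several precisions.
for bits in (32, 8, 4, 2):
    print(bits, weight_bytes(1_000_000_000, bits))
```

The rounding error per weight is bounded by half the scale, so small weights in a tensor with one large outlier lose proportionally more information. That is the accuracy cost paid for the smaller footprint.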
Finally, there are hardware optimizations. The first option here is a general-purpose GPU, TPU, NPU, or similar, but those tend to be best suited for settings where capability is needed without demanding streamlined optimization, as might be the case on a home computer. Custom hardware, such as purpose-built NPUs, generally has the advantage when the application is especially sensitive to latency or resource consumption, and this covers many of the applications for vision LLMs.
Exploring this trade-off further: Stable Diffusion’s architecture and resource demands have been discussed here before, but it is worth circling back to it as an example of why hardware solutions are so important in this space. Using Stable Diffusion 1.5 for simplicity, let us focus specifically on the U-Net component of the model. In this diagram, you can see the rough construction of the model: it downsamples repeatedly on the left until it hits the bottom of the U, and then upsamples up the right side, bringing back in residual connections from the left at each stage.
This U-Net implementation has 865 million parameters and entails 750 billion operations. The parameters are a fair proxy for the memory burden, and the operations are a direct representation of the compute demands. The distribution of these burdens on resources is not even, however. If we plot the parameters and operations for each layer, a clear picture emerges:
These graphs show a model that is destined for gross inefficiencies at every step. Most of the memory burden peaks in the center, whereas the compute is heavily taxed at the two tails but underutilized in the center. These inefficiencies come with costs. The memory peak can overwhelm on-chip storage, thus incurring I/O operations, or else require a large excess of memory that sits unused for most of the graph. Similarly, storing residuals for later incurs I/O latency and higher power draw. The underutilization of compute at the center of the graph means that the processor draws power wastefully, since it cannot use the tail of the power curve while it performs sparser operations. While software interventions can also help here, this is exactly the kind of problem that custom hardware solutions are meant to address. Custom silicon tailored to the model can let you trade some of that memory burden for additional compute cycles at the center of the graph, without incurring extra I/O operations, by recomputing the residual connections instead of kicking them out to memory. In doing so, the total required memory drops, and the processor can remain at full utilization. Rightsizing the resource allotment and finding ways to redistribute the burdens are key components of how these models can be best deployed at the edge.
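The store-versus-recompute trade-off can be sketched with a toy cost model. This is not a description of any particular chip or of Stable Diffusion's real numbers: the `Stage` class, the unit-less operation and byte counts, and the assumption that a residual is recomputed by re-running the encoder from its input are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    ops: int         # operations to run this encoder stage once (arbitrary units)
    skip_bytes: int  # size of the residual it hands to the decoder (arbitrary units)


def store_residuals(stages):
    """Keep every residual until the decoder consumes it: more memory, no extra compute."""
    held_bytes = sum(stage.skip_bytes for stage in stages)
    extra_ops = 0
    return held_bytes, extra_ops


def recompute_residuals(stages):
    """Hold no residuals; re-run the encoder up to stage i whenever residual i is needed."""
    extra_ops = 0
    for i in range(len(stages)):
        extra_ops += sum(stage.ops for stage in stages[: i + 1])
    held_bytes = 0
    return held_bytes, extra_ops


# Hypothetical three-stage encoder.
encoder = [Stage(ops=40, skip_bytes=8), Stage(ops=30, skip_bytes=4), Stage(ops=20, skip_bytes=2)]
print("store:", store_residuals(encoder))
print("recompute:", recompute_residuals(encoder))
```

In this toy model, recomputation removes the held residuals entirely at the price of extra operations. Whether that price is worth paying depends on whether the compute units would otherwise sit idle, which is the situation described above for the center of the graph.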
Despite their name, LLMs are important to the vision domain for their flexibility in handling different inputs and their strength at interpreting meaning in images. Whether they are used for embodied agents, context-aware security, or user assistance, their use at the edge requires dependably low latency, which precludes cloud-based solutions, in contrast to other AI applications on edge devices. Bringing them successfully to the edge calls for optimizations at every level, and we have already seen some of the possibilities at the hardware level. Conveniently, the common architecture with language-oriented LLMs means that many of the solutions needed to bring these most essential models to the edge may in turn generalize back to the language-oriented models which donated the architecture in the first place.
Ben Gomes
Ben Gomes is a linguistics engineer at Expedera. He holds a PhD, Master’s Degree, and BA in linguistics from UC Davis.