Safety Grading in DNNs. Innovation in Verification

How do you measure safety for a DNN? There is no obvious way to screen for a subset of safety-critical nodes in the systolic array at the heart of a DNN accelerator. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.

Safety Grading in DNNs

The Innovation

This month’s pick is Toward Functional Safety of Systolic Array-Based Deep Learning Hardware Accelerators, published in IEEE Transactions on VLSI Systems. The authors are from Intel and UT Dallas. The paper has 49 citations (and still climbing!).

Think FMEDA has safety compliance locked down? Not so in AI accelerators, where what constitutes a safety-critical error in the hardware cannot be decoupled from the model running on the hardware. Research here is quite new, but it is already clear that the rulebook for fault injection, measurement, and ultimately safety mitigation must be rethought for this class of hardware.

There are multiple recent papers in this field, some of which we may review in later blogs. This paper is a start, taking one view of what errors mean in this context (misclassification) and what tests should be run (a subset of representative test images). On the second point, the authors provide important suggestions on how to trim the test set to a level amenable to repeated re-analysis in realistic design schedules.

Paul’s view

Intriguing paper this month: how do you check that an autonomous-drive AI accelerator in a car is free of hardware faults? This can be done at manufacturing time with scan chains and test patterns, but what about faults occurring later during the lifetime of the car? One way is to have duplicate AI hardware (known in the industry as “dual lockstep”) and continuously compare outputs to check they are the same. But of course this literally doubles the cost, and AI accelerators are not cheap. They also consume a lot of power, so dual lockstep drains the battery faster. Another way is built-in self-test (BIST) logic, which can be run each time the car is turned on. This is a good practical solution.

This paper proposes using a special set of test images that are carefully selected to be edge cases for the AI hardware to correctly classify. These test images can be run at power on and checked for correct classification, giving a BIST-like confidence but without needing the overhead of BIST logic. Unfortunately, the authors don’t give any direct comparisons between their method and commercial BIST approaches, but they do clearly explain their method and credibly argue that with only a very small number of test images, it is possible to detect almost all stuck-at faults in a 256×256 MAC AI accelerator.
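To make the idea concrete, here is a minimal sketch of what such a power-on check might look like, assuming the accelerator is exposed as a classify function and the expected (golden) labels for the edge-case images were recorded on known-good hardware; the names are illustrative, not from the paper:

```python
def power_on_self_check(classify, test_images, golden_labels):
    """Run the stored edge-case images through the accelerator and
    flag a fault if any classification differs from its golden label.

    classify      -- callable mapping an image to a predicted class id,
                     assumed to invoke the accelerator under test
    test_images   -- pre-selected edge-case images (see below)
    golden_labels -- class ids recorded on known-good hardware
    """
    for image, golden in zip(test_images, golden_labels):
        if classify(image) != golden:
            return False  # mismatch: suspect a hardware fault
    return True  # all edge cases classified as expected
```

A mismatch on any image signals a possible fault, at which point the system could fall back to a safe state or trigger a fuller diagnostic.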

The authors propose two methods to pick the edge-case images. The second method is by far the better and is also easy to explain: the final layer of a neural network used to classify images has one neuron for each object type being classified (e.g. person, car, road sign, …). Each of these neurons outputs a numerical confidence score (0 to 1) that the image contains that object, and the neuron with the highest confidence score wins. The authors sort all the training images by the max confidence score across their neuron outputs and then use the n images with the lowest max confidence scores as their edge-case BIST images. They present results on 3 different open-source image classification benchmarks. With injected faults that cause a 5% misclassification rate in the AI accelerator, 10 randomly selected BIST images achieve only 25% fault coverage; 10 images picked with their edge-case selection method achieve 100% fault coverage. An intuitive and effective result.
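Paul’s description maps to a few lines of NumPy. A minimal sketch of the confidence-based selection, assuming the final-layer confidence scores have already been computed for every training image (array names are ours, not the paper’s):

```python
import numpy as np

def select_edge_case_images(images, confidences, n=10):
    """Pick the n images whose winning class has the lowest confidence.

    images      -- array of training images, shape (num_images, ...)
    confidences -- array of shape (num_images, num_classes) holding the
                   final-layer confidence scores (0 to 1) per image
    n           -- number of BIST images to keep (10 in the results above)
    """
    max_conf = confidences.max(axis=1)   # winning neuron per image
    edge_idx = np.argsort(max_conf)[:n]  # the n least-confident images
    return images[edge_idx]
```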

Raúl’s view

In March, we blogged about SiFI-AI, which simulated transient faults in DNN accelerators using fast AI inference and cycle-accurate RTL simulation. The results showed a “low” error probability of 2-8%, confirming the resilience of DNNs but still not acceptable for functional safety (FuSa) in many applications. This month’s paper explores both transient and permanent faults in DNN accelerators to assess FuSa, aiming to create a small set of test vectors that covers all FuSa violations.

The configuration used consists of 1) a deep neural network (DNN) with three fully connected hidden layers in a 784-256-256-256-10 architecture, 2) a systolic array accelerator similar to Google’s Tensor Processing Unit (TPU) with a 256 x 256 array of 24-bit multiply-accumulate (MAC) units, and 3) three datasets for image classification with 60,000 training and 10,000 test images: MNIST, a benchmark of digit images; F-MNIST, a set of fashion images in 10 classes; and CIFAR-10, a set of images in 10 classes.
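To make the topology concrete, here is a minimal NumPy sketch of inference through the reported 784-256-256-256-10 network, assuming trained weights are available; ReLU is shown, one of the three activations the paper evaluates:

```python
import numpy as np

# Layer widths from the paper: 784 inputs (a 28x28 image), three
# fully connected hidden layers of 256 neurons, 10 output classes.
LAYER_SIZES = [784, 256, 256, 256, 10]

def relu(x):
    return np.maximum(0.0, x)

def forward(weights, biases, x):
    """Inference through the 784-256-256-256-10 network. Each matmul
    is the work mapped onto the 256 x 256 MAC array; fault injection
    perturbs these multiply-accumulate results."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return x @ weights[-1] + biases[-1]  # logits; argmax = predicted class
```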

The paper performs a FuSa assessment of inference on the fully trained DNN running on the systolic array, injecting various faults into the systolic array, which is reused across all DNN layers. The error-free DNN shows a classification accuracy of 97.4%. Obvious and not-so-obvious findings include (see the fault-injection sketch after this list):

  • Errors in less significant bit positions have lower impact; in the accumulator, faults in the 9th bit position and below have no effect.
  • Accumulator errors drop accuracy by about 80%, while multiplier errors cause only a 10% drop.
  • More injected faults lead to greater accuracy reductions.
  • The activation function matters: with faults injected, ReLU shows about an 80% accuracy drop, whereas Sigmoid and Tanh show around a 40% drop.
  • The dataset also matters: MNIST and F-MNIST show about an 80% drop, while CIFAR-10 shows only a 30% drop.
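As a rough illustration of the kind of perturbation behind these findings, here is a minimal stuck-at fault sketch, assuming integer fixed-point MAC/accumulator values; the bit-numbering convention (0 = least significant) is ours:

```python
def inject_stuck_at(value, bit, stuck_high):
    """Force one bit of a 24-bit fixed-point result to 0 or 1.

    value      -- integer MAC or accumulator output
    bit        -- bit position to corrupt, 0 = least significant
    stuck_high -- True for stuck-at-1, False for stuck-at-0
    """
    mask = 1 << bit
    return (value | mask) if stuck_high else (value & ~mask)

# A stuck-at-1 fault in a low-order bit barely moves the result,
# consistent with the first bullet above; a high-order fault is large.
print(inject_stuck_at(640, 2, True))   # 644      (small perturbation)
print(inject_stuck_at(640, 20, True))  # 1049216  (large perturbation)
```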

The key section of the paper focuses on how to select test cases for detecting FuSa violations (any reduction in accuracy). The primary insight is that random patterns drawn from the entire input space mostly do not correspond to an image and cannot be classified by a DNN; instead, the proposed algorithms choose specific patterns from the pool of test vectors in the application data set. The first algorithm calculates the Euclidean distance of each test case from multiple classes in the data set and selects those that resemble multiple classes. The outcomes are remarkable: with only 14-109 test cases, 100% FuSa coverage is achieved. A second algorithm picks the k patterns that have the lowest prediction confidence values, where the number k of test patterns is set by the user. With just k=10 test patterns, 0.1% of the total 10,000, all FuSa violations are identified. The authors also present results for a larger DNN with 5 hidden layers and a bigger data set containing 112,800 training and 18,800 test images, achieving similar outcomes.
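For the first algorithm, here is a minimal sketch of one plausible reading: measure Euclidean distance to per-class centroids of the training data and keep the test images whose two nearest classes are almost equidistant. The centroid formulation is our interpretation; the paper’s exact distance criterion may differ:

```python
import numpy as np

def select_ambiguous_tests(test_images, train_images, train_labels, n=14):
    """Pick test images that resemble more than one class."""
    classes = np.unique(train_labels)
    # One centroid per class, computed over the training images.
    centroids = np.stack([train_images[train_labels == c].mean(axis=0)
                          for c in classes])
    flat = test_images.reshape(len(test_images), -1)
    cents = centroids.reshape(len(centroids), -1)
    # Euclidean distance of every test image to every class centroid.
    dists = np.linalg.norm(flat[:, None, :] - cents[None, :, :], axis=2)
    # A small gap between the two nearest centroids means the image
    # sits between two classes, i.e. it resembles multiple classes.
    two_nearest = np.sort(dists, axis=1)[:, :2]
    gap = two_nearest[:, 1] - two_nearest[:, 0]
    return test_images[np.argsort(gap)[:n]]
```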

This is an enjoyable paper to read. The title “Toward Functional Safety” hints that the approach is not yet ready for practical application, given the limited datasets of merely 10,000 test images and just 10 classes. It remains open whether the approach would be effective in scenarios with significantly larger datasets and more categories, such as automotive applications, face recognition, or weapon detection.
