Why Verification of SoCs is Critical in High Integrity Applications
By Enrique Martinez-Asensio, Functional Safety Manager in Silicon Characterization at EnSilica
High integrity/reliability electronic systems (hi-rel) refer to applications where failures are simply not an option. Industries that fit this category include automotive, aerospace, medical, and manufacturing, where reliability and safety functions are not just critical, but mandated through various regulatory standards.
Many of these applications are driven by a dedicated system-on-chip (SoC) that incorporates processing units, memory, analog, RF, and more in a device containing millions of transistors and running thousands of lines of embedded code. Such numbers are understandably daunting; how can we be sure that nothing will go wrong?
Some of the most infamous accidents in the aviation and automotive fields have been attributed to bugs in vehicle control hardware or software. In the Boeing 737 MAX accidents, pilots lost control of the aircraft due to a faulty sensor reading, causing two fatal crashes and grounding the 737 MAX for more than a year while investigations took place. The Toyota “unintended acceleration” problem in 2009 led to numerous deaths and the emergency recall of around 10 million vehicles. Both cases led to lengthy and expensive lawsuits against the companies involved.
Design vulnerabilities have since come to light, which point to insufficient verification of the electronic systems involved. This sends a very clear message: we still have a lot of learning to do when it comes to the design and deployment of hi-rel electronic systems.
Some changes are required in the development process
Having a robust development process is a must when dealing with hi-rel systems. From top-level requirement specifications to detailed implementation, having a clean documentation management system with full traceability will ensure that any changes along the design cycle are properly monitored, analyzed, and approved. The so-called V-Model, which splits the development process into design, implementation, and integration/testing, is commonly used to guide this process. But more can be done.
A deep analysis of how things can fail, the consequences, and the remedies is absolutely necessary. This can be achieved by using standard techniques like FMEA (Failure Mode and Effects Analysis) and/or FTA (Fault Tree Analysis). The relevant standards require developers to provide objective evidence of the achieved level of safety through specific metrics, like unsafe FIT rates, the SPFM (Single-Point Fault Metric), or the SFF (Safe Failure Fraction).
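As an illustration of how such metrics are computed, the sketch below evaluates the SPFM as defined in ISO 26262 (one minus the ratio of single-point and residual fault rates to the total failure rate of the safety-related hardware) and the SFF as defined in IEC 61508 (safe plus dangerous-detected failures over all failures). The element names and FIT values are invented for the example and do not come from any real design.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    fit_total: float   # total failure rate in FIT (failures per 1e9 hours)
    fit_spf_rf: float  # single-point + residual fault rate in FIT


def spfm(elements):
    """Single-Point Fault Metric: 1 - sum(SPF + RF) / sum(total)."""
    total = sum(e.fit_total for e in elements)
    unsafe = sum(e.fit_spf_rf for e in elements)
    return 1.0 - unsafe / total


def sff(lambda_safe, lambda_dd, lambda_du):
    """Safe Failure Fraction: (safe + dangerous detected) / all failures."""
    return (lambda_safe + lambda_dd) / (lambda_safe + lambda_dd + lambda_du)


# Hypothetical safety-related hardware elements (illustrative values only).
elements = [
    Element("cpu_core", 50.0, 0.4),
    Element("sram", 120.0, 0.6),
    Element("adc", 30.0, 1.0),
]

print(f"SPFM = {spfm(elements):.2%}")
print(f"SFF  = {sff(60.0, 130.0, 10.0):.2%}")
```

In ISO 26262 the SPFM target for ASIL D hardware is 99% or higher, so a result like this one would sit right at the limit.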
Existing industry standards
Several industry standards have already been published around the concepts of reliability and functional safety with the purpose of ensuring that compliant products will be safe. The entire product life-cycle is covered in such documents: product definition, project management, design, implementation, integration, verification, validation, production and even service and decommissioning.
For instance, the automotive standards ISO 26262 and ISO/PAS 21448 (SOTIF) apply to most of the non-entertainment electronics present in a car: engine control, braking (ABS), airbags, radar/lidar anti-collision systems, and especially the newest generation of ADAS systems. Industrial control systems must also follow the IEC 61508 standard when safety is critical, and robotic systems are subject in particular to an adapted standard, ISO 13849.
What do these standards have in common? The need for a tight control of all the design and verification processes, the analysis of how things can fail, and the adoption of new approaches to the hardware and software development methodologies. With this in mind, the verification phase becomes a crucial milestone in achieving both reliability and functional safety.
All the standards mentioned, and more, have sections dedicated to product verification. In the context of semiconductors, verifying a complex SoC containing millions of gates is not an easy task, but it becomes even harder if such pieces of silicon serve high integrity systems: specific scenarios where faults are present must be taken into account to make sure that the system will react properly.
The critical role of verification
Verifying the correct behaviour of an SoC against the specified safety or reliability requirements is probably the most critical step in the chip product life-cycle, and the previously mentioned standards dedicate entire sections to this topic. At the hardware, software, and hardware-software integration levels, different methods are recommended to guarantee that the product won’t cause issues when doing its job. As an example, ISO 26262 recommends the following hardware verification methods when a product must handle ASIL D safety requirements:
- Requirements compliance, especially safety ones
- Internal and external interfaces
- Boundary values
- Knowledge or experience based error guessing (lessons learned)
- Functional dependencies
- Common limit conditions, sequences and sources of dependent failures
- Environmental conditions and operational use cases
- Process worst cases and significant variants
- Fault injection simulation
Verifying compliance is especially important where safety requirements are concerned. By using safety analysis techniques (FMEA, FTA, etc.), safety engineers can determine what mechanisms are necessary to tackle safety issues, and it then becomes the verification engineer’s task to prove their effectiveness.
Safety standards don’t concern themselves with the technical details of implementation; they just say that specific test cases must be created for each of the bullet points above. It is up to the verification engineer to determine the best technical approach, using the standard verification techniques in silicon design: RTL simulation, STA, Monte-Carlo, etc. Needless to say, all steps must come with the right documentation and traceability through a verification plan, verification specs and, finally, a verification report.
Fault injection
Knowing the response of a SoC under faulty conditions is of paramount importance in the verification of high integrity systems, and the technique called “fault injection” provides the solution.
A common mistake of newcomers to hi-rel engineering is to confuse the terms “fault simulation” and “fault injection”. The term “fault simulation” refers to the standard technique of verifying how good a test pattern is in terms of observability; that is, whether a fault occurring at an internal node (stuck-at-1, stuck-at-0) will be detected at the I/O pins. This technique helps to build effective test patterns for device screening at the industrial production stage, and it is normally available in simulation EDA tools. The figure of merit when using this methodology is the percentage of faults covered by a given test pattern.
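The fault coverage figure of merit can be illustrated with a toy stuck-at fault simulator. The sketch below assumes a made-up three-input circuit, y = (a AND b) OR c, with a single internal node `n1`; real EDA fault simulators work on full gate-level netlists, so this is only meant to show how the percentage is obtained.

```python
from itertools import product

# Nodes of the hypothetical circuit y = (a AND b) OR c.
NODES = ["a", "b", "c", "n1", "y"]


def simulate(a, b, c, fault=None):
    """Evaluate the circuit; `fault` is (node, stuck_value) or None."""
    def force(node, value):
        # Override the node value when it is the stuck-at target.
        return fault[1] if fault and fault[0] == node else value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)
    return force("y", n1 | c)


def fault_coverage(patterns):
    """Fraction of stuck-at faults observable at the output y."""
    faults = list(product(NODES, (0, 1)))  # stuck-at-0 and stuck-at-1 per node
    detected = [
        f for f in faults
        if any(simulate(*p, fault=f) != simulate(*p) for p in patterns)
    ]
    return len(detected) / len(faults)


weak_set = [(1, 1, 0)]
full_set = [(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
print(f"coverage with 1 pattern : {fault_coverage(weak_set):.0%}")
print(f"coverage with 4 patterns: {fault_coverage(full_set):.0%}")
```

A fault counts as covered as soon as one pattern makes the faulty output differ from the fault-free output, which is exactly the observability criterion described above.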
The term “fault injection” means something different. It is a technique aimed at verifying whether the internal chip mechanisms designed to mitigate failures react properly. For example, in a digital chip containing memories equipped with error detection and correction (EDC), a soft error toggling a memory cell must have no consequences if the EDC works properly.
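The EDC example can be sketched in a few lines. The code below assumes a Hamming(7,4) single-error-correcting code as a stand-in for the memory's EDC (production memories typically use wider codes), then flips each stored bit in turn and checks that the decoder still returns the original data.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit; return (data bits, syndrome)."""
    cw = list(codeword)
    s1 = cw[0] ^ cw[2] ^ cw[4] ^ cw[6]
    s2 = cw[1] ^ cw[2] ^ cw[5] ^ cw[6]
    s4 = cw[3] ^ cw[4] ^ cw[5] ^ cw[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if syndrome:
        cw[syndrome - 1] ^= 1
    return [cw[2], cw[4], cw[5], cw[6]], syndrome


def single_bit_campaign():
    """Inject every single-bit flip into every codeword; count failures."""
    failures = 0
    for word in product((0, 1), repeat=4):
        for pos in range(7):
            cw = encode(word)
            cw[pos] ^= 1  # the injected soft error
            data, syndrome = decode(cw)
            if data != list(word) or syndrome != pos + 1:
                failures += 1
    return failures


print("uncorrected single-bit faults:", single_bit_campaign())
```

The pass criterion here is not "the output matches a stored pattern" but "the mitigation mechanism reacted as specified", which is the essence of fault injection.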
The user interface of fault injection tools is much more complex than that of fault simulation tools, since the pass/fail criteria are not so obvious: in some cases the system requirements stipulate that in the event of a failure, the SoC must jump to a safe state (e.g. ISO 26262), so a simple comparison with a known-good reference pattern is not sufficient and a more elaborate comparison is needed. For silicon chips, fault injection must be performed at the pre-tape-out stage or during device validation/qualification.
Different approaches are available for this task at the SoC RTL or gate level, including some dedicated EDA tools, and extensive literature on the topic is available on the web. A possible solution is to use the Verilog Programming Language Interface (PLI), creating dedicated test cases that contain PLI-coded fault injectors, as depicted in the following figure.
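One way to picture such a more elaborate pass/fail criterion is a classifier that compares each faulty run with a fault-free ("golden") run and also checks whether the safe state was reached in time. The sketch below is an assumption about how a campaign post-processor might be organized; the `RunResult` fields, the category names and the deadline parameter are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    outputs: tuple                       # values observed at the SoC outputs
    safe_state_time_us: Optional[float]  # None if the safe state was never entered


def classify(golden, faulty, fault_time_us, deadline_us):
    """Classify one fault injection run against the golden run."""
    if faulty.safe_state_time_us is not None:
        # The safety mechanism fired: check the reaction time.
        latency = faulty.safe_state_time_us - fault_time_us
        return "detected_safe" if latency <= deadline_us else "detected_late"
    if faulty.outputs == golden.outputs:
        return "no_effect"           # fault masked or corrected silently
    return "undetected_failure"      # wrong outputs and no reaction


golden = RunResult(outputs=(1, 2, 3), safe_state_time_us=None)
runs = [
    RunResult((1, 2, 3), None),
    RunResult((0, 0, 0), 130.0),
    RunResult((0, 0, 0), 200.0),
    RunResult((1, 9, 3), None),
]
for run in runs:
    print(classify(golden, run, fault_time_us=100.0, deadline_us=50.0))
```

Only the last category represents a violation of the safety goal; a run whose outputs differ from the golden pattern can still pass if the safe state was reached within the allowed time.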
Figure: Verilog RTL test case with fault injection
Once the test campaign is defined, fault injector instances can be placed in the Verilog code, targeting the specific nodes where faults should be injected.
Fault injection at the silicon validation or qualification stage can be done in different ways, depending on the product complexity and category. Aerospace chips normally require a radiation qualification campaign to verify their immunity to soft errors and other radiation effects. Even automotive standards like AEC-Q100 recommend a soft-error qualification for chips containing more than 1 Mbit of RAM.
Testability
In the design of high integrity SoCs there is a pitfall: what is good for reliability is not necessarily good for testability. Indeed, the use of redundancy, like the standard three-vote TMR (Triple Modular Redundancy) flip-flops, may lead to unscreened faults during chip production, since the TMR will “correct” them. Therefore, special measures must be taken during the design and verification phases to disable such redundancy when the chip is not in mission mode. For example, a chip using scan-path as a DFT strategy must treat every TMR flip-flop as three different ones.
It goes without saying that entry into test mode must be designed in such a way that it is virtually impossible to trigger accidentally during mission mode. This is a test case that must be part of every fault injection simulation campaign.
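The masking effect can be shown with a minimal model of one TMR flip-flop. The sketch below assumes a plain majority voter and a manufacturing defect modelled as a stuck-at value on one of the three flops; the function names are illustrative only.

```python
def majority(a, b, c):
    """Two-out-of-three voter."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(stored, stuck=None):
    """Write `stored` into three flops, apply an optional defect, then vote.

    `stuck` is (flop_index, stuck_value) and models a production defect.
    Returns the individual flop values and the voted output.
    """
    flops = [stored] * 3
    if stuck is not None:
        index, value = stuck
        flops[index] = value
    return flops, majority(*flops)


def detectable_via_voter(stuck):
    """Mission-mode view: only the voted output is observable."""
    return any(tmr_read(v, stuck)[1] != v for v in (0, 1))


def detectable_via_scan(stuck):
    """Test-mode view: each flop is observed separately on the scan chain."""
    return any(tmr_read(v, stuck)[0] != [v] * 3 for v in (0, 1))


defect = (1, 0)  # middle flop stuck at 0
print("seen through the voter:", detectable_via_voter(defect))
print("seen through scan     :", detectable_via_scan(defect))
```

The defective part would pass a production test that only observes the voted output, and would then ship with its redundancy already consumed, which is why the three flops need to be individually observable in test mode.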
Embedded Software
Complex SoCs normally contain embedded processors that manage the overall data processing flow. Internal ROM, OTP or flash memories store the firmware image which, depending on the application, is loaded at production time or at system startup through a communication interface (bootloader). Moreover, systems based on flash memories offer the possibility of upgrading the image once the application is in the field.
The software development process is subject to exhaustive verification steps, and in the case of ISO 26262 ASIL D, the standard proposes different methods, which often need the use of auxiliary tools.
The software architecture definition and the subsequent code writing stage must guarantee clean and readable code. A classic code analysis tool called “lint” gave rise to the verb “linting” for this kind of verification. Coding rules like the well-known MISRA C must be respected to avoid the obfuscation that some languages, like C or C++, can introduce.
Prior to integrating the different software units, each one must be verified individually through so-called “unit testing”. Some of the methods recommended by ISO 26262 at this step are:
- Analysis of requirements, especially safety ones
- Analysis of boundary values; bugs normally hide in the corners.
- Error guessing, using the lessons learned process
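The boundary value analysis listed above can be sketched as follows. The unit under test, `clamp_percent`, is a hypothetical saturating function invented for this example; the point is the choice of test inputs just below, at, and just above each limit.

```python
def boundary_values(low, high):
    """Test inputs just below, at, and just above each limit of [low, high]."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


def clamp_percent(value):
    """Hypothetical unit under test: saturate a command to 0..100 %."""
    return max(0, min(100, value))


def run_boundary_tests():
    """Return (input, output) pairs that violate the expected behaviour."""
    violations = []
    for x in boundary_values(0, 100):
        y = clamp_percent(x)
        expected = 0 if x < 0 else 100 if x > 100 else x
        if y != expected:
            violations.append((x, y))
    return violations


print("inputs    :", boundary_values(0, 100))
print("violations:", run_boundary_tests())
```

Six inputs are enough to exercise both saturation limits, whereas a random sample from the middle of the range would likely miss an off-by-one error at either end.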
Testing individual software units that will be integrated in the SoC flash, or other non-volatile memory, may be challenging, since the companion hardware is normally not yet available at verification time. Techniques that allow hardware-software co-verification must be used, like so-called “hardware-in-the-loop” (HIL), through the use of emulation FPGAs or other EDA tools dedicated to this purpose.
Such tools monitor the code behaviour and also provide the reports on branch and condition coverage that the standards require.
At the final hardware-software verification phase, once silicon samples are available, some verification methods are recommended by ISO 26262:
- Interface test
- Fault injection test
- Resource usage test
- Back-to-back comparison test between model and code
Again, the fault injection test shows up. In such an environment, with the real hardware available, another methodology is necessary. Several approaches have been proposed, for example, using the JTAG debug interface with a script-based fault injection campaign.
Figure: Fault injection at the prototype verification
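The back-to-back comparison test from the list above can also be illustrated briefly. The sketch assumes a floating-point reference model and an integer implementation of the same 0.75 gain, both invented for this example; the test feeds identical inputs to each and reports any deviation beyond a tolerance.

```python
def model_gain(x):
    """Reference model: ideal floating-point behaviour."""
    return 0.75 * x


def code_gain(x):
    """Implementation under test: integer arithmetic as it might run on target."""
    return (3 * x) >> 2


def back_to_back(inputs, tolerance):
    """Return (input, model output, code output) for every mismatch."""
    return [
        (x, model_gain(x), code_gain(x))
        for x in inputs
        if abs(model_gain(x) - code_gain(x)) > tolerance
    ]


inputs = range(256)
print("mismatches at 1 LSB tolerance  :", len(back_to_back(inputs, 1.0)))
print("mismatches at 0.5 LSB tolerance:", len(back_to_back(inputs, 0.5)))
```

The tolerance has to be justified by the requirements: here the truncating implementation stays within one LSB of the model but not within half an LSB.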
Conclusions
Different standards tailored to different industrial contexts have been published to guide hi-rel product development, but all of them have something in common: high integrity systems require a careful verification plan, able to reproduce critical situations that could occur in the field, to ensure that the implemented safety measures do their job properly. Even though effective verification methods have been proposed, the high complexity and time-consuming nature of these tasks show that there is still a lot of room for improvement in making this process more reliable and efficient.
At EnSilica, we have a robust development process, as well as the experience and tools necessary to produce the most demanding hi-rel SoCs for applications in the automotive, aerospace, medical, and industrial fields.