[RFC] Functional Safety in HAL

Postby **Giovanni** » Thu Dec 05, 2019 12:22 pm

Hi,

I just wanted to open a discussion about functional safety. RT and NIL, while not officially qualified for safety, have been designed with it in mind. The exception is HAL, which was created simply as a lightweight driver framework.

HAL is definitely not designed with functional safety certifications in mind; it would require some level of redesign:

Error Detection

Operations could no longer be assumed to be infallible; every function would have to implement some kind of error handling. For example, all the xxxStart() functions should be changed to return an error code: initialization should be assumed to be able to fail, and upper layers should be informed.
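
As an illustration of the idea above, here is a minimal sketch of a start function that reports failure instead of assuming success. Everything in it (`hal_err_t`, the `HAL_ERR_*` codes, `demoStart()`, the `hw_ready` field) is a hypothetical name made up for the example, not part of the current ChibiOS HAL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error type and codes; not part of the current ChibiOS HAL. */
typedef int hal_err_t;
#define HAL_OK           0
#define HAL_ERR_CONFIG  (-1)   /* invalid or missing configuration */
#define HAL_ERR_HW      (-2)   /* peripheral did not come up */

/* Hypothetical driver, reduced to what the sketch needs. */
typedef struct {
  bool started;
  bool hw_ready;   /* stands in for a real "peripheral ready" status bit */
} demo_driver_t;

typedef struct {
  unsigned speed_hz;
} demo_config_t;

/* Start function that reports failure instead of assuming success. */
hal_err_t demoStart(demo_driver_t *drv, const demo_config_t *cfg) {
  if (drv == NULL || cfg == NULL || cfg->speed_hz == 0U) {
    return HAL_ERR_CONFIG;
  }
  if (!drv->hw_ready) {
    drv->started = false;   /* leave the driver in a known, stopped state */
    return HAL_ERR_HW;
  }
  drv->started = true;
  return HAL_OK;
}
```

The upper layer can then decide whether a failed start is fatal or tolerable, rather than discovering the problem on first use.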

Centralized Error Management

Errors should also be routed to some kind of central handler and classified as: info, warning, error, critical (just examples).
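
A rough sketch of what such a central handler could look like, assuming the four example classes named above. The names `hal_severity_t` and `halReportError()` and the counters are hypothetical, and a real handler would log or forward the driver ID and code rather than drop them.

```c
#include <stdint.h>

/* Hypothetical severity classes, following the examples in the post. */
typedef enum {
  HAL_SEV_INFO = 0,
  HAL_SEV_WARNING,
  HAL_SEV_ERROR,
  HAL_SEV_CRITICAL,
  HAL_SEV_COUNT
} hal_severity_t;

/* Per-severity counters kept by the central handler. */
static uint32_t hal_error_counts[HAL_SEV_COUNT];
static int hal_safe_state_requested;

/* Single entry point that every driver would call on failure. */
void halReportError(hal_severity_t sev, uint16_t driver_id, uint16_t code) {
  (void)driver_id;   /* a real handler would log or forward these */
  (void)code;
  if ((unsigned)sev >= (unsigned)HAL_SEV_COUNT) {
    sev = HAL_SEV_CRITICAL;   /* treat unknown classes as the worst case */
  }
  hal_error_counts[sev]++;
  if (sev == HAL_SEV_CRITICAL) {
    hal_safe_state_requested = 1;   /* the application polls this flag */
  }
}
```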

System-wide Mechanisms for Failure Detection and Handling

- Timeouts.
- Storm prevention.
- Permanent checks (unlike assertions, which can be disabled).
- Centralized code for HW access (that SE library would be part of this).

Other ideas?

Being professionally involved in automotive-level projects, I know perfectly well that FuSa (functional safety) requires a wholly different approach. It is expensive and takes time and resources, and some solutions often make the code bloated and less efficient.

Of course the main question is: is it worth it? The operation would have a significant cost and HAL is not generating profits in itself; it is just a driver for RT/NIL sales. How could the whole operation be justified? HAL is already taking most of our bandwidth as-is.

Giovanni

Postby **steved** » Thu Dec 05, 2019 3:07 pm

I think progress along this path would be an excellent move.

Not everyone needs formal certification (although would it widen the market for ChibiOS if achieved?), but improved robustness and error detection would definitely be of benefit to many applications. Often there are economic implications (as well as safety) to software problems - both in the commercial and non-commercial worlds. Tridge is a good example of the latter; probably the most likely outcome of a software problem in a drone is that it crashes or gets lost, with a small chance that it also injures someone. In the commercial world, a software failure can cause loss of service, sometimes with substantial costs such as loss of revenue and compensation as a result.

My own interest is in this middle ground, and some of the thoughts I posted in viewtopic.php?f=3&t=5294 relate to that.

I personally don't see a great need to test for hardware failures within a single device; my own experience is that they are extremely rare. And testing external hardware is necessarily application-dependent; although helper functions, and a proper reporting mechanism, could be useful here.

I suggest that, as a general principle, the intent is that all functions must return from a call eventually, even under error conditions, unless non-return is a specific intention. The magnitude of "eventually" could be anything from microseconds upwards.
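
One simple way to honour the "must return eventually" principle is to bound every wait. The sketch below polls a condition a limited number of times; `waitReadyBounded()`, `ready_fn_t` and the `WAIT_*` codes are hypothetical names, and a real driver would more likely bound the wait in system ticks than in poll counts.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAIT_OK        0
#define WAIT_TIMEOUT (-1)

/* Hypothetical predicate type: returns true once the awaited condition holds. */
typedef bool (*ready_fn_t)(void *arg);

/* Polls a condition at most max_polls times, so the call always returns. */
int waitReadyBounded(ready_fn_t is_ready, void *arg, uint32_t max_polls) {
  for (uint32_t i = 0U; i < max_polls; i++) {
    if (is_ready(arg)) {
      return WAIT_OK;
    }
  }
  return WAIT_TIMEOUT;
}

/* Example predicate: becomes true after a given number of polls. */
bool readyAfterCountdown(void *arg) {
  uint32_t *remaining = (uint32_t *)arg;
  if (*remaining == 0U) {
    return true;
  }
  (*remaining)--;
  return false;
}
```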

A uniform mechanism for handling problems would also be welcome; I'm sure I'm not alone in tending to implement recovery solutions piecemeal. For error handling, I think I've already suggested use of a wide range of error codes; they would presumably be returned in a 16-bit or 32-bit value, so we may as well take advantage of that. And if a function can return many different error codes (rather than just pass/fail) it can make it very quick to identify the problem area. For error classification, I've used the standard *nix levels 0..7.

Finally, on processing overhead, I would divide processors into "small" (M0 etc) and "big" (M4, M7). I see a move towards functional safety as being more useful on the big processors, which also conveniently are likely to have spare processing capacity.

Postby **Giovanni** » Fri Dec 06, 2019 1:49 pm

Excellent suggestions, steved.

My doubts are mainly about error handling: making each and every function return an error code would force the application developer into intensive error handling, with all the inherent problems, like:
- Multiple exit points in functions.
- Extra code because each call to the HAL would have to be checked for errors.
- Extra effort in understanding whether something can -really- return an error or it is just a pattern and it always returns OK in practice.

We could consider different approaches as alternatives (or in addition), for example:

- Something like a thread-local (or driver-local) setjmp/longjmp with extra info about what failed and how.

Code:

/* Thread-local errors handling.*/
if (halSetJump(/* current thread implicitly */)) {
  /* Errors handling for all HAL functions called by this thread.*/
  halerr_t err = halGetError(/* current thread implicitly */);
}
else {
  /* Multiple operations */
}

Code:

/* Driver-local errors handling.*/
if (adcSetJump(&ADCD1)) {
  /* All errors associated to ADCD1, executed in the context of the thread that called ADCD1 methods.*/
}
else {
  /* Multiple operations */
}

- Make functions return void but call a centralized handler in case of failures; the handler could implement logging, events, messages or other mechanisms, left to the implementer.

About the commercial aspects, HAL could stay free and be moved to API compatibility with this hypothetical "hardened HAL", which would be a commercial extension. In the normal HAL all the safety functionalities would be stubbed and have no code impact. Note that a lot of safety solutions could be implemented already in the HLD, so there would be no impact on LLDs. LLDs would implement platform-specific countermeasures.
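
The "stubbed, no code impact" idea could be done with a compile-time switch, roughly as sketched here. `HAL_USE_SAFETY`, `halSafetyReport()` and `demoConvert()` are hypothetical names for the example, not existing ChibiOS options or functions.

```c
/* Hypothetical build switch; not an existing ChibiOS configuration option. */
#ifndef HAL_USE_SAFETY
#define HAL_USE_SAFETY 1
#endif

/* Counts reports in the hardened build; stands in for real reporting code. */
static unsigned hal_safety_reports;

#if HAL_USE_SAFETY
/* Hardened build: the hook forwards to the reporting code. */
#define halSafetyReport(code) \
  do { hal_safety_reports++; (void)(code); } while (0)
#else
/* Normal build: the hook expands to nothing that generates code. */
#define halSafetyReport(code) \
  do { (void)(code); } while (0)
#endif

/* Hypothetical high-level driver function using the hook. */
void demoConvert(int sample) {
  if (sample < 0) {
    halSafetyReport(1);   /* 1 is an arbitrary example error number */
  }
}
```

Because the call sites are identical in both builds, the free and the hardened variants can share the same high-level driver source.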

All of this without any HAL certification in mind, just to move the HAL code base in that direction. RT and/or NIL will be certified at some point; those generate revenue and certification is a common request.

Giovanni

Postby **steved** » Fri Dec 06, 2019 5:13 pm

Error handling in the HAL need not affect the application programmer at all - they will have the option of just ignoring error returns - but they are there if needed. The main effect will be in the HAL itself, since LLDs will have to identify and return errors where appropriate (and ensure the driver and its associated hardware are left in a safe and recoverable state).
I can confirm that handling errors arising deep in a driver is sometimes challenging!

On the actual error values, I find a similar approach to file I/O to be generally useful:
0 - success
<0 - error states
>0 - useful information (if applicable) (implied success)

Maybe encode the severity into a few bits of the error value, to make filtering easier.
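
A possible encoding along these lines, purely as a sketch: errors are negative, and the magnitude packs a 3-bit severity (the *nix-style 0..7 levels) above a 12-bit error number. The bit layout, `hal_msg_t` and the `HAL_*` macros are assumptions made for illustration; the error number must be non-zero so that an error never collapses to 0 (success).

```c
#include <stdint.h>

typedef int32_t hal_msg_t;   /* hypothetical result type */

/* Errors are negative; the magnitude packs a 3-bit severity (0..7) above a
   12-bit error number. Layout chosen only for illustration. */
#define HAL_ERR_CODE_BITS  12
#define HAL_ERR_CODE_MASK  ((1 << HAL_ERR_CODE_BITS) - 1)

/* Builds an error value; code must be in 1..4095. */
#define HAL_MAKE_ERR(sev, code)                                   \
  (-(hal_msg_t)((((sev) & 0x7) << HAL_ERR_CODE_BITS) |            \
                ((code) & HAL_ERR_CODE_MASK)))

#define HAL_IS_ERR(v)       ((v) < 0)
#define HAL_ERR_SEVERITY(v) ((int)(((-(v)) >> HAL_ERR_CODE_BITS) & 0x7))
#define HAL_ERR_NUMBER(v)   ((int)((-(v)) & HAL_ERR_CODE_MASK))
```

Zero stays "success" and positive values stay free for useful information such as a byte count, so callers that only test `< 0` keep working.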

Fine-grained error returns are very useful for debugging (including remote debugging - if the actual error number is accessible, a user can relay it so the application programmer can quickly identify the affected area).
They can also be relevant operationally - taking our old friend I2C as an example, a device not responding may be acceptable (e.g. expansion module not fitted), while other values indicate a genuine malfunction.
Overall I think it's essential that error codes are returned in the function call - because the result of the call may affect future actions.
I also like the idea of being able to call a thread-specific logging entry - both for errors, and also logging (especially during debug). Use of severity levels here would allow easy filtering of information.
I can see uses for the thread-specific error reporting/logging entry point, since it allows control of what is logged and, optionally, forwarding to a central logging point. So this gives the following additions to a thread's data structures:
- Entry point for logging of errors and other information.
- Flags: enable processing of error calls (maybe one for each level of severity).
- Flags: enable forwarding of calls to a central logging point (one for each level of severity).
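
The three additions listed above might look something like this sketch, with one enable bit per severity level 0..7. The structure, `threadLog()`, `central_log` and the two counting sinks are all hypothetical names; in a real system the context would live in (or hang off) the thread structure.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*log_fn_t)(int severity, int code);

/* Hypothetical per-thread additions: an entry point plus one enable bit per
   severity level 0..7, for local processing and for forwarding. */
typedef struct {
  log_fn_t local_log;
  uint8_t  process_mask;
  uint8_t  forward_mask;
} thread_log_ctx_t;

/* Hypothetical central logging point, set by the application. */
static log_fn_t central_log;

/* Filters a report by severity, then logs locally and/or forwards it. */
void threadLog(const thread_log_ctx_t *ctx, int severity, int code) {
  if (ctx == NULL || severity < 0 || severity > 7) {
    return;
  }
  uint8_t bit = (uint8_t)(1U << severity);
  if ((ctx->process_mask & bit) != 0U && ctx->local_log != NULL) {
    ctx->local_log(severity, code);
  }
  if ((ctx->forward_mask & bit) != 0U && central_log != NULL) {
    central_log(severity, code);
  }
}

/* Example sinks that only count calls. */
static int local_calls;
static int central_calls;
void countLocal(int severity, int code)   { (void)severity; (void)code; local_calls++; }
void countCentral(int severity, int code) { (void)severity; (void)code; central_calls++; }
```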

As far as setjmp/longjmp is concerned, I've never used it, so can't really comment. It seems to have a number of potential pitfalls. A possible killer was in one of the web posts I looked at (so usual caveats!):
MISRA (MISRA-C:2004 Rule 20.7) and JSF (AV Rule 20): "The setjmp macro and the longjmp function shall not be used." And I can see why they might say that.

If allowed, I can see that it might be useful where there are deeply nested function calls. And it could make it easy to turn on this level of debugging with a couple of macro calls. Does the C library include a suitable implementation, or is that something which would have to be done?
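
For reference, standard C does provide `setjmp`/`longjmp` in `<setjmp.h>`; whether a particular embedded toolchain's C library ships a usable version needs checking per target. A minimal single-threaded sketch of the recovery pattern follows. `demoOperation()` and `runOperations()` are hypothetical, and the single static `jmp_buf` stands in for what would have to be per-thread storage.

```c
#include <setjmp.h>

/* Hypothetical per-thread error context; one instance stands in for
   thread-local storage in this single-threaded sketch. */
static jmp_buf hal_jmp;
static volatile int hal_last_error;

/* Hypothetical HAL call that fails by jumping back to the recovery point. */
void demoOperation(int fail_code) {
  if (fail_code != 0) {
    hal_last_error = fail_code;
    longjmp(hal_jmp, 1);
  }
}

/* Returns 0 if every operation succeeded, else the recorded error code. */
int runOperations(int fail_code) {
  if (setjmp(hal_jmp) != 0) {
    return hal_last_error;   /* error path, reached via longjmp */
  }
  demoOperation(0);          /* succeeds */
  demoOperation(fail_code);  /* may jump back to the setjmp above */
  return 0;
}
```

Note that anything acquired between the `setjmp` and the `longjmp` (locks, buffers, DMA channels) is not released automatically, which is one of the pitfalls mentioned above.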

Postby **tridge** » Mon Dec 16, 2019 11:30 am

Thanks for starting this discussion! A few random thoughts ....
I'm not keen on setjmp/longjmp if it can be avoided. It can make debugging hard and locking hard to reason about. There are places, in things like interpreters, where the flow of control needed means using jumps makes sense, but I'm not really keen on it being a feature used in this way in the HAL.
One correction on the commercial/non-commercial distinction. ArduPilot is used a lot in both commercial and non-commercial contexts. The code is always free under GPLv3, but a lot of companies use it in products. There is also increasing scrutiny with regard to reliability. The recent I2C issues led to this advisory from NZ civil aviation authority for example: https://www.aviation.govt.nz/about-us/m ... 1-firmware
Meeting formal standards would be nice, but also comes with a lot of administrative overheads, and can rather perversely lead to buggy code being used for longer due to the difficulty of applying changes. I'm much more interested in applying good practical software engineering methods to making the code more reliable than I am in formal standards. The expense and time commitment of formal standards is one of the reasons we have deliberately steered ArduPilot away from use in manned aviation (or any use case where the autopilot is in control of human life).

Postby **FXCoder** » Tue Dec 17, 2019 12:26 am

Hi,
A few thoughts...
1. Regarding detecting hardware failures
Harsh environments are a good reason for detecting and handling partial hardware failures.
It may not be common for peripheral blocks of an MCU to fail in isolation but the connected devices may/do behave differently.
e.g. an errant I2C device which fails (prematurely) due to low temp can kill the whole I2C device chain irrecoverably.
I have first hand experience of this in high altitude balloon systems.
The only recovery option for the particular board design was sledge hammer model (assert and watchdog).
The I2C device had lost its mind, locked the bus up and couldn't be convinced to respond.

2. Using the events system
I've implemented a rudimentary form of error notification by using the thread events for application level.
To do this I've extended the standard ChibiOS return codes by adding a MSG_ERROR value and using msg_t extensively in returned results.
A calling function can immediately respond to the returned MSG_ERROR code.
Detail is in the thread event data posted by the lower level(s) or ISR.
The 32 event bits are divided up between "normal" events and "error" events and defined in a common header.
Apart from direct action on the MSG_ERROR returned data the application can choose to broadcast INFO/WARNING events.
For example health monitor thread(s) can subscribe to receive events and compute system performance statistics.

For HAL a similar approach might be relevant where a new instance of event flags in the thread structure would be used purely for HAL safety?
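
A sketch of how the 32 event bits might be partitioned in a common header. The 16/16 split, the `APP_EVT_*` names and the helper functions are assumptions for illustration only, and `app_eventmask_t` merely stands in for a 32-bit event mask type.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t app_eventmask_t;   /* stand-in for a 32-bit event mask */

/* Illustrative split: low 16 bits for normal events, high 16 for errors. */
#define APP_EVT_NORMAL_MASK  UINT32_C(0x0000FFFF)
#define APP_EVT_ERROR_MASK   UINT32_C(0xFFFF0000)

/* Example event bits; the names are made up for the sketch. */
#define APP_EVT_RX_DONE      (UINT32_C(1) << 0)
#define APP_EVT_I2C_TIMEOUT  (UINT32_C(1) << 16)
#define APP_EVT_SENSOR_LOST  (UINT32_C(1) << 17)

/* True if any error event is pending in the mask. */
bool appHasError(app_eventmask_t events) {
  return (events & APP_EVT_ERROR_MASK) != 0U;
}

/* Strips normal events, leaving only the error bits for a health monitor. */
app_eventmask_t appErrorsOnly(app_eventmask_t events) {
  return events & APP_EVT_ERROR_MASK;
}
```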
--
Bob

Postby **Giovanni** » Tue Dec 17, 2019 10:09 am

In the automotive world, the standard is the AUTOSAR MCAL layer, which loosely (VERY loosely) inspired the ChibiOS HAL.

In MCAL there are two pseudo drivers for error handling: DEM and DET.

DET is similar to assertions; it is for development time, and we have this covered.
DEM is for runtime errors. It is called by the various drivers, passing a driver ID (a number) and error codes; we don't have an equivalent. DEM cannot be disabled.

Note that DEM is just called: it is used for reporting run-time anomalies and does nothing to handle errors. The implementation of DEM is system-dependent. I would introduce something like this, implementing it using events as Bob suggested.

- One event source for each driver instance.
- Driver-specific event flags for errors.

I would create a set of standard event flags common to all drivers:
- Driver started.
- Driver stopped.
- Warning condition (recoverable error happened, recovery action taken).
- Error condition, unrecoverable driver error (those drivers with final error states). This could trigger a stop()/start() for re-initialization.
- Others?

Note that some drivers already have event sources for their own errors, Serial and CAN for example, those would not be changed.

In addition we also need:
- Return codes for all functions interacting with outside world (the send/receive/transmit/exchange/convert/etc kind).
- Return codes for xxxStart() functions. Some systems can tolerate a driver failing to start but need to be informed.
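
To make the standard-flags idea concrete, here is a small sketch with one flag accumulator per driver instance. The `DRV_FLAG_*` names, the bit positions and the helper functions are hypothetical; the plain `pending` word only stands in for a real event source, and real code would need to protect it against concurrent access from ISRs and threads.

```c
#include <stdint.h>

/* Hypothetical standard flags common to all drivers. */
#define DRV_FLAG_STARTED  (UINT32_C(1) << 0)
#define DRV_FLAG_STOPPED  (UINT32_C(1) << 1)
#define DRV_FLAG_WARNING  (UINT32_C(1) << 2)   /* recovered, action taken */
#define DRV_FLAG_ERROR    (UINT32_C(1) << 3)   /* unrecoverable */
/* Bits from 8 upward are left to each driver for its own errors. */
#define DRV_FLAG_SPECIFIC(n)  (UINT32_C(1) << (8 + (n)))

/* Stands in for a per-instance event source. */
typedef struct {
  uint32_t pending;
} drv_events_t;

/* Called by the driver to report one or more conditions. */
void drvReport(drv_events_t *ev, uint32_t flags) {
  ev->pending |= flags;
}

/* Called by the listener: returns the accumulated flags and clears them. */
uint32_t drvGetAndClear(drv_events_t *ev) {
  uint32_t flags = ev->pending;
  ev->pending = 0U;
  return flags;
}

/* An unrecoverable error is the cue for a stop()/start() cycle. */
int drvNeedsRestart(uint32_t flags) {
  return (flags & DRV_FLAG_ERROR) != 0U;
}
```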

Giovanni

Postby **steved** » Thu Dec 19, 2019 1:40 pm

tridge wrote: Meeting formal standards would be nice, but also comes with a lot of administrative overheads, and can rather perversely lead to buggy code being used for longer due to the difficulty of applying changes. I'm much more interested in applying good practical software engineering methods to making the code more reliable than I am in formal standards. The expense and time commitment of formal standards is one of the reasons we have deliberately steered ArduPilot away from use in manned aviation.

This is a very valid point.

Postby **tridge** » Fri Dec 20, 2019 2:12 am

FXCoder wrote: The only recovery option for the particular board design was sledge hammer model (assert and watchdog).
The I2C device had lost its mind, locked the bus up and couldn't be convinced to respond.

Did you try an RCC reset approach on the specific I2C bus? That has worked well for ArduPilot using the interrupt quota. We have yet to find a situation where the RCC reset does not recover the bus. It is a much less drastic solution than the watchdog.
We still have the watchdog enabled, but that is a last resort and should really only be for application coding errors. It shouldn't trigger for something like a bad I2C bus.

Postby **Giovanni** » Fri Dec 20, 2019 8:28 am

Tridge, have you tried the SW I2C implementation? That would be the safest one. You can make it go fast enough without using much CPU.

Giovanni
