So I'm developing an embedded application based on STM32F446 with ChibiOS v16.1.6. The RTOS is configured in tickless mode with non-simplified IRQ handling (using BASEPRI for critical sections instead of PRIMASK); the two highest IRQ priority levels are reserved for the application's fast IRQ handlers.
The application runs some communication interfaces and some basic logic in regular OS threads with a couple of regular IRQs; alongside that it runs a hard real-time control process driven by the two fast IRQs, which preempt the kernel (due to strict hard real-time requirements). The hard real-time part performs intensive computation involving matrices and the FPU. In the typical mode of operation the firmware spends about 90-95% of the time performing computations in the hard real-time IRQ handlers; the rest of the time it operates in the normal RTOS mode.
Obviously, the hard real time part is completely isolated from all RTOS services, and all standard precautions relevant to preemptible kernels are taken into account.
While debugging the application, I noticed that once in a while it would fail in one of two ways:
- Trip on the assertion check at chVTDoTickI.
- Make all threads that are in the SLEEPING state sleep forever. Threads that are blocked in other states, e.g. waiting for synchronization objects or channels, continue to function normally.
Two days of investigation led me to the system tick timer. According to my understanding, when the kernel needs to schedule a new alarm in tickless mode, it does the standard read-modify-write on the system timer registers:
1. read the current system time from the timer;
2. add the desired duration to the obtained value;
3. store the value into the compare register.
Pretty standard. The timer keeps running at all times, so the necessary precautions are taken:
- the update is performed in a critical section;
- the minimum duration is limited by a configuration parameter named CH_CFG_ST_TIMEDELTA which cannot be less than 2.
Makes perfect sense so far.
The problem creeps in when the kernel itself is preemptible, i.e. when the entire RTOS can be interrupted at any point, including inside a critical section, to serve a hard real-time IRQ, which is exactly what happens in my application. If a hard real-time interrupt occurs anywhere between steps 1 and 3 and takes longer than CH_CFG_ST_TIMEDELTA to execute, the newly computed deadline will already be in the past by the time it is written, freezing the SLEEPING threads until the counter wraps around (which takes about forever). Sometimes the assertion check in chVTDoTickI catches this and crashes the system, sometimes it does not.
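To make the failure concrete, here is a minimal host-side model of the race. It is not ChibiOS code: `Ticks`, `deadlineMissed` and `alarmLostAfterPreemption` are names made up for this illustration, a 32-bit free-running counter is assumed, and the preemption is modelled as a fixed number of ticks that elapse between the read and the write.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the kernel's system time type (assumed 32-bit here).
using Ticks = std::uint32_t;

// Wraparound-safe test: true if `alarm` is behind `now` on a free-running counter.
inline bool deadlineMissed(Ticks now, Ticks alarm)
{
    const Ticks half_range = std::numeric_limits<Ticks>::max() / 2;
    return static_cast<Ticks>(alarm - now) >= half_range;
}

// Models steps 1-3 with a fast IRQ that steals `preemption` ticks
// between reading the counter and writing the compare register.
inline bool alarmLostAfterPreemption(Ticks now, Ticks delta, Ticks preemption)
{
    const Ticks alarm = now + delta;              // steps 1 and 2
    const Ticks now_at_write = now + preemption;  // fast IRQ ran in between
    return deadlineMissed(now_at_write, alarm);   // state right after step 3
}
```

With a delta of 2 ticks, a preemption of 0 or 1 ticks leaves the alarm ahead of the counter, while a preemption of 3 ticks or more leaves it behind, so the compare match will not happen until the counter wraps around.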
In order to verify this theory, I sketched the shim shown below and put it into the context switch and tick hooks:
using TimeType = decltype(st_lld_get_counter());
static constexpr TimeType HalfRange = std::numeric_limits<TimeType>::max() / 2;
// Tolerate alarms that are up to 2 seconds late before reporting a failure.
static constexpr TimeType DetectionThreshold = S2ST(2);
const TimeType counter = st_lld_get_counter();
const TimeType real_alarm = st_lld_get_alarm();
const TimeType alarm_with_offset = real_alarm + DetectionThreshold;
// Wraparound-safe comparison: fires when the alarm is more than
// DetectionThreshold behind the current counter value.
if (TimeType(alarm_with_offset - counter) >= HalfRange)
{
    chibios_rt::System::halt(os::heapless::concatenate(
        "OS TIMER DEADLINE MISSED: CNT=", counter,
        " ALARM=", real_alarm).c_str());
}
The shim reliably detected the problem, which, as expected, correlated well with the computational load in the hard real-time IRQ handlers.
Having confirmed that, I thought about solutions. I see three of them:
- Disable tickless mode. I wouldn't like this, because I'm already having a hard time squeezing the application into the tight performance limits of this MCU, and the ticked mode will increase the RTOS overhead even further (although probably not significantly, I haven't checked this yet).
- Increase CH_CFG_ST_TIMEDELTA. This is unreliable: it only narrows the window, and any preemption longer than the new delta still breaks the alarm.
- Fix chVTDoTickI. This is the only proper solution to the problem, so I'll focus on it below.
We want to fix chVTDoTickI in a way that makes preemptible operation in tickless mode reliable while minimally affecting non-preemptible systems. A possible solution is to wrap the whole function (or, more precisely, the part of it that handles the case CH_CFG_ST_TIMEDELTA > 1) into a loop, and add a check at the end of the loop to see whether the freshly installed deadline is already in the past. If it is, the function has to try again; otherwise it exits. The overhead for non-preemptible systems would be unnoticeable (just one more check, which always succeeds), and it can be avoided completely if the loop is included via conditional compilation only when CORTEX_SIMPLIFIED_PRIORITY is false.
One might argue that the try-check-retry approach is fundamentally non-deterministic, but that should not pose a problem, since in preemptible kernels the OS itself typically does not have to adhere to strict real time requirements, delegating these to the preempting logic.
Another possible (very unlikely) issue is accidental synchronization of the try-check-retry loop with the hard real-time IRQ: if each pass of the loop happens to perform steps 1-2-3 (see above) exactly when the fast IRQ fires, every iteration will fail until the loop slides out of phase with the hard real-time IRQ. A possible solution is to double the time delta (starting from CH_CFG_ST_TIMEDELTA) after every failed iteration, until it exceeds the duration of the preemption.
This solution can be explored further. At any rate, even if the proposed solution is not implemented, the documentation certainly must be updated with a warning explaining why tickless mode + preemptible kernel is a bad idea.
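The try-check-retry idea with a doubling delta can be sketched on a host-side model as follows. This is only an illustration under stated assumptions, not a patch for chVTDoTickI: `TimerModel`, `isInPast` and `armAlarmWithRetry` are hypothetical names, the counter is assumed to be 32 bits wide, and the fast IRQ is simulated by a callback that returns how many ticks it steals on each attempt.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using Ticks = std::uint32_t;  // hypothetical stand-in for the system time type

// Hypothetical model of the timer: free-running counter plus compare register.
struct TimerModel
{
    Ticks counter;
    Ticks alarm;
};

static const Ticks HalfRange = std::numeric_limits<Ticks>::max() / 2;

// Wraparound-safe test: true if `alarm` is behind `now`.
inline bool isInPast(Ticks now, Ticks alarm)
{
    return static_cast<Ticks>(alarm - now) >= HalfRange;
}

// Arms the alarm `delta` ticks ahead and retries if the deadline was missed.
// `preempt(attempt)` returns the ticks consumed by the simulated fast IRQ
// between the counter read and the compare register write on that attempt.
// Returns the number of attempts that were needed.
template <typename Preempt>
unsigned armAlarmWithRetry(TimerModel& timer, Ticks delta, Preempt preempt)
{
    unsigned attempts = 0;
    for (;;)
    {
        const Ticks now = timer.counter;      // step 1: read current time
        const Ticks alarm = now + delta;      // step 2: compute the deadline
        timer.counter += preempt(attempts);   // simulated preemption
        timer.alarm = alarm;                  // step 3: store the deadline
        ++attempts;
        if (!isInPast(timer.counter, timer.alarm))
        {
            return attempts;                  // deadline is still ahead
        }
        if (delta <= HalfRange / 2)
        {
            delta *= 2;                       // back off before retrying
        }
    }
}
```

In this model, a 5-tick preemption on every attempt with an initial delta of 2 succeeds on the third attempt, once the delta has grown to 8. In the kernel the loop would only be compiled in when CORTEX_SIMPLIFIED_PRIORITY is false, as suggested above.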
Feedback is welcome.
Pavel.
P.S. The title is clickbait, I know.
P.P.S. Any plans to move the project away from SourceForge, e.g. to GitHub?