Sorry, it's a long post, but there's lots of information!
Using an F767, initially with compiler V7.2.1, latterly with V9.3.1; Chibi 19.1.3 with a selection of updates from SVN, plus the QSPI routines from trunk/V20. No non-Chibi interrupts.
There's more explanation at the end; the key point is that the corruption seems to be happening in a context switch, usually when a "double switch" (preemption) is required.
Probably the clearest example is shown in .
Steps -12 and -4 are the start of a QSPI transaction: set it going, then wait in the idle thread.
Steps -11 and -3 are the QSPI interrupt which occurs on completion of the transaction.
Steps -10 and -2 show AR and SR early on in the ISR.
Steps -9 and -1 show AR and SR after executing the QSPI "end of transaction" macro.
Step 0 is an abnormal completion - AR is OK at the ISR exit, but zero in the idle->main context switch.
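For anyone wanting to reproduce this kind of trace, here is a minimal sketch of a ring buffer that snapshots AR and SR at each step. It is not the code I actually use: the names (trace_rec_t, trace_put, trace_at, TRACE_DEPTH) are made up for illustration, and on target trace_put() would need to be called from inside a lock (e.g. after chSysLockFromISR()) because both ISRs and threads write to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace record; field names are illustrative only. */
typedef struct {
    uint8_t  tag;   /* identifies where the sample was taken */
    uint32_t ar;    /* snapshot of the QSPI address register */
    uint32_t sr;    /* snapshot of the QSPI status register  */
} trace_rec_t;

#define TRACE_DEPTH 16u   /* number of records kept (assumed value) */

typedef struct {
    trace_rec_t rec[TRACE_DEPTH];
    uint32_t    head;     /* total number of records written so far */
} trace_buf_t;

/* Append one record, overwriting the oldest once the buffer is full.
   Must be called with interrupts masked on a real target. */
static void trace_put(trace_buf_t *tb, uint8_t tag, uint32_t ar, uint32_t sr) {
    trace_rec_t *r = &tb->rec[tb->head % TRACE_DEPTH];
    r->tag = tag;
    r->ar  = ar;
    r->sr  = sr;
    tb->head++;
}

/* Look back from the newest record: back == 0 is the latest step,
   back == 1 the one before it, and so on. Returns NULL if that step
   was never written or has already been overwritten. */
static const trace_rec_t *trace_at(const trace_buf_t *tb, uint32_t back) {
    if (back >= tb->head || back >= TRACE_DEPTH) {
        return NULL;
    }
    return &tb->rec[(tb->head - 1u - back) % TRACE_DEPTH];
}
```

Numbering the steps backwards from the failure (0, -1, -2, ...) then just means calling trace_at() with 0, 1, 2, ... once a check has fired.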
Looking at the code flow from step 0, it's as follows:
OSAL_IRQ_EPILOGUE();              // Checks AR; non-zero
  _port_irq_epilogue();
    _port_switch_from_isr()
      chSchDoReschedule()
        thread_t *otp = currp;

        /* Picks the first thread from the ready queue and makes it current.*/
        currp = queue_fifo_remove(&ch.rlist.queue);
        currp->state = CH_STATE_CURRENT;

        /* Handling idle-leave hook.*/
        if (otp->prio == IDLEPRIO) {
          CH_CFG_IDLE_LEAVE_HOOK();   <--- Corruption detected here
        }
I have other examples showing the problem arising where two interrupts occur back to back, without an intervening thread switch. Here the corruption occurs between the start of the QSPI interrupt and the chSysLockFromISR() immediately before the next trace write.
In these, the corruption is consistently picked up in the OSAL_IRQ_EPILOGUE() macro. On occasion it has been picked up in other ISRs, not just the QSPI one!
My QSPI usage never sets the QSPI address register to zero after initialisation, and as far as I can tell nor should any on-chip mechanism.
And the puzzling thing is that the corruption appears when Chibi is in control.
Has anyone else encountered something like this? Or any suggestions on how to debug further? Or am I missing something very obvious?
Further explanation and notes
=============================
The example code is in the startup sequence (After the normal halInit() and chSysInit()), with very little other activity (as can be seen from the trace).
I disable caching on all RAM.
The QSPI address register can only be written to when the QSPI is busy, which limits the time when this can happen to a short period between transfer start and transfer complete. So according to the logged status, it shouldn't be possible to update AR.
I have corruption checks in CH_CFG_IDLE_ENTER_HOOK(), CH_CFG_IDLE_LEAVE_HOOK(), CH_CFG_CONTEXT_SWITCH_HOOK(), CH_CFG_IRQ_PROLOGUE_HOOK() and CH_CFG_IRQ_EPILOGUE_HOOK(), as well as immediately after writing to the register.
In the example, CH_CFG_IDLE_LEAVE_HOOK() was triggered.
All interrupts which might be enabled are from normal Chibi drivers.
The detail and frequency of the problem varies as I add and subtract code, and also as I swap between -O0 and -Og. But I can usually trigger the problem.
There's plenty of stack space, and all Chibi debug options are enabled.
Statistics enabled (also tried disabled; no change).
FPU disabled.
The "ready list" threads look good (just main, idle)
No relevant errata on the QSPI from ST (although there's one for other F7 family devices; doesn't change anything).
There is a slight possibility that CAN-related code plays a part: if I strip out all my CAN code, leaving the Chibi-level CAN drivers enabled, the problem still occurs. If I disable the Chibi CAN drivers as well, the problem goes away.
I've checked the DMA registers, and there's nothing to suggest that DMA is responsible. (QSPI is the only active user of DMA).
(I have relatively briefly tried both GCC V5.4.1 and GCC V8.3.1 - no crashes at the time, but have changed things a bit since then.)
The above tests were done with STM32_WSPI_QUADSPI1_PRESCALER_VALUE 5 (43 MHz I think).
I have also tried a few runs with prescaler values of 8 and 11, all of which failed in the same way.
The same problem occurs on two different sets of hardware (essentially an F767 Nucleo plugged into a carrier board which buffers up all the ports).
File hal_wspi_lld_extract.c shows the relevant parts of the LLD, including my debug checks
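For readers without the attachment, the hook-based corruption checks mentioned in the notes can be sketched roughly as below. This is a simplified illustration, not the code from the extract: reg_watch_t, watch_check() and the SITE_* identifiers are hypothetical names, and the "register must never read zero" rule is specific to my usage as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identifiers for the places a check is called from. */
enum {
    SITE_NONE = 0,
    SITE_AFTER_WRITE,      /* immediately after the driver writes AR */
    SITE_IDLE_ENTER,       /* CH_CFG_IDLE_ENTER_HOOK()     */
    SITE_IDLE_LEAVE,       /* CH_CFG_IDLE_LEAVE_HOOK()     */
    SITE_CONTEXT_SWITCH,   /* CH_CFG_CONTEXT_SWITCH_HOOK() */
    SITE_IRQ_PROLOGUE,     /* CH_CFG_IRQ_PROLOGUE_HOOK()   */
    SITE_IRQ_EPILOGUE      /* CH_CFG_IRQ_EPILOGUE_HOOK()   */
};

typedef struct {
    volatile uint32_t *reg;   /* register being watched (AR on target)   */
    uint32_t faults;          /* number of checks that have failed       */
    uint8_t  first_site;      /* site of the first failure, or SITE_NONE */
} reg_watch_t;

/* Returns true if the register still holds a non-zero value.
   On failure it counts the fault and remembers only the first site,
   so later hooks do not overwrite the earliest evidence. */
static bool watch_check(reg_watch_t *w, uint8_t site) {
    if (*w->reg != 0u) {
        return true;
    }
    if (w->faults == 0u) {
        w->first_site = site;
    }
    w->faults++;
    return false;
}
```

Each hook macro would then expand to a call such as watch_check(&w, SITE_IDLE_LEAVE), with the watched pointer aimed at the real register. Recording only the first failing site matters because once the register is zero every subsequent hook fails too.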