SDIOv1/SDMMCv1 hangs system forever on transfer error Topic is solved

tpw_rules
Posts: 6
Joined: Wed Nov 12, 2025 4:35 am
Been thanked: 3 times

Postby tpw_rules » Sat Nov 15, 2025 8:42 pm

Using the ArduPilot project ChibiOS fork on an STM32F765 based Pixhawk 4 Mini flight controller and an SD card prone to errors (in this case created by mechanical vibration), the SDMMCv1 peripheral can signal a CRC error in the middle of a transfer.

If this happens, the peripheral stops transferring bytes, so the DMA transfer never completes and the processor hangs forever here: https://github.com/ChibiOS/ChibiOS/blob ... lld.c#L289 . Only a watchdog reset can recover the system.

I did some additional investigation:
* SDIOv1 is vulnerable to the same problem (it is the only other driver that uses the dmaWaitCompletion function anyway?).
* SDMMCv2 is not vulnerable as it stops DMA unconditionally (SDMMCv1 cannot as the DMA is external and therefore we can't know it's complete by the time the DATAEND interrupt is taken, unlike SDMMCv2 where the reference manual says this is known).

I therefore propose the following patches (attached in one file):
* Unlock the system as soon as possible, so that a failure hangs only the calling thread instead of the whole system.
* Only wait for DMA completion on the success case defined in the reference manual (note the special check required by the reference manual for a late error). This should only be needed on the read path anyway but I did not want to complicate the patch too much. We properly shut down the DMA in sdc_lld_error_cleanup so no additional code is needed.
* Also turn on interrupts for an undocumented status bit to avoid cases where no other status bits get asserted after a read and the driver sleeps forever.
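The decision described in the second patch can be sketched roughly as follows. This is a minimal illustration and not the attached patch: the `DEMO_` flag names, their bit positions, and both helper functions are hypothetical, and the late error is assumed to be an RX overflow, as stated later in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status flags. Names and bit positions are placeholders
   for illustration, not the real SDMMC_STA register layout. */
#define DEMO_STA_DCRCFAIL (1u << 0)
#define DEMO_STA_DTIMEOUT (1u << 1)
#define DEMO_STA_RXOVERR  (1u << 2)
#define DEMO_STA_STBITERR (1u << 3)
#define DEMO_STA_DATAEND  (1u << 4)
#define DEMO_STA_ERRORS   (DEMO_STA_DCRCFAIL | DEMO_STA_DTIMEOUT | \
                           DEMO_STA_RXOVERR | DEMO_STA_STBITERR)

typedef enum { DEMO_PENDING, DEMO_SUCCESS, DEMO_FAILURE } demo_outcome_t;

/* Classify a snapshot of the status flags. */
demo_outcome_t demo_classify(uint32_t sta) {
  if (sta & DEMO_STA_DATAEND) {
    /* End of data reached: still check for a late RX overflow. */
    return (sta & DEMO_STA_RXOVERR) ? DEMO_FAILURE : DEMO_SUCCESS;
  }
  if (sta & DEMO_STA_ERRORS) {
    /* The peripheral has stopped requesting DMA; do not wait for it. */
    return DEMO_FAILURE;
  }
  return DEMO_PENDING; /* Nothing decisive yet, keep sleeping. */
}

/* Only a successful transfer may block on DMA completion; a failure
   goes straight to the error cleanup, which stops the DMA itself. */
bool demo_should_wait_for_dma(demo_outcome_t outcome) {
  return outcome == DEMO_SUCCESS;
}
```

In this sketch a CRC failure in the middle of a transfer yields `DEMO_FAILURE`, so the caller never reaches the DMA wait that caused the original hang.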

With these patches I can now repeatedly vibrate the SD card and cause dozens of disconnections/re-connections without hanging the thread or the system. If you are happy with the proposed patches, I can also port them and test them on the SDIOv1 driver.
Attachments
chibios-sdmmcv1-fixes-2025-11-15.patch.zip
(1.98 KiB) Downloaded 34 times

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby tpw_rules » Sun Nov 30, 2025 6:11 pm

I have updated the SDMMCv1 patches to take the start bit error detection logic from SDIOv1. It turns out that is the mysterious reserved status bit I saw asserted sometimes. I don't know why the reference manual documents it as reserved.

The new patches are attached (and replace the old ones). I also verified that they apply to current ChibiOS master.
Attachments
chibios-sdmmcv1-fixes-2025-11-30.zip
(2.41 KiB) Downloaded 26 times

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby tpw_rules » Sun Nov 30, 2025 6:15 pm

Additionally, attached are the corresponding patches for SDIOv1. It already does the start bit error detection logic on some chips (but not all; this may be another fib on ST's part to correct later). I also incorporated a patch from ChibiOS, already applied to SDMMCv1, that moves transfer preparation before transfer start. I did not include the special error check because the reference manual does not mention it.

This has been tested in the same situations on an STM32F427 based flight controller and again brings it from hanging to working properly. It also applies to ChibiOS master.
Attachments
chibios-sdiov1-fixes-2025-11-30.zip
(1.82 KiB) Downloaded 33 times

Giovanni
Site Admin
Posts: 14773
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1170 times
Been thanked: 974 times

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby Giovanni » Mon Dec 01, 2025 6:19 am

Hi,

It is "in queue". I first need to finish some important changes in HAL, among them waiting loops with timeout capability, which could be relevant to this problem.

Giovanni

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby tpw_rules » Tue Dec 02, 2025 2:58 am

Thanks for the information.

I think adding a timeout could be useful for the sleeps in these drivers. I'm not sure it's the best idea for the DMA waits, but it could be useful insurance after the proposed adjustments. I am also unsure why only this driver needs to do those waits.

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby Giovanni » Tue Dec 02, 2025 5:12 am

Loops with timeout will be gradually introduced for all drivers (there are not many drivers doing this anyway). The first use case is clock initialization; those loops have been implemented in U0, U3 and H5, with others to follow.

The idea is to move the whole HAL and RT toward a concept of functional safety.

About DMA, it is probably a good idea to add timeouts in waiting loops. My understanding is that the peripheral stops triggering the DMA and the driver is stuck waiting for it.
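A waiting loop with a timeout might look like the sketch below. It is not the ChibiOS HAL implementation: `wait_bounded`, the poll budget, and the simulated DMA stream are all hypothetical, and a real driver would more likely bound the wait by system time than by a poll count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback type: returns true once the awaited event
   (for example, DMA stream idle) has happened. */
typedef bool (*done_fn_t)(void *ctx);

/* Poll done(ctx) at most max_polls times.
   Returns true if the event happened within the budget, false on timeout. */
bool wait_bounded(done_fn_t done, void *ctx, uint32_t max_polls) {
  for (uint32_t i = 0; i < max_polls; i++) {
    if (done(ctx)) {
      return true;
    }
  }
  return false; /* Timed out: the caller can run its error cleanup. */
}

/* Simulated DMA stream, for demonstration only. */
typedef struct {
  uint32_t remaining; /* Items still to transfer.                   */
  bool     stalled;   /* True if the peripheral stopped requesting. */
} fake_dma_t;

/* Each poll moves one item unless the stream is stalled. */
bool fake_dma_done(void *ctx) {
  fake_dma_t *dma = ctx;
  if (!dma->stalled && dma->remaining > 0u) {
    dma->remaining--;
  }
  return dma->remaining == 0u;
}
```

With a stalled stream, `wait_bounded(fake_dma_done, &dma, 10)` returns false after ten polls instead of spinning forever, which is the behaviour the unbounded wait lacks.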

Giovanni

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby Giovanni » Thu Dec 11, 2025 7:08 am

Hi,

I passed patches to Codex AI for a quick review, could you look into those findings?

- os/hal/ports/STM32/LLD/SDMMCv1/hal_sdc_lld.c:309-315 — The new RXDAVL drain
loop has no timeout and no early break on RXOVERR. If the DMA stalls and
RXDAVL stays set (e.g., FIFO overflow with DMA halted), this loop can spin
forever and reintroduce the hang the patch tries to avoid. Consider bounding
the wait or breaking out when error bits are present so the error cleanup
path can run.
- os/hal/ports/STM32/LLD/SDMMCv1/hal_sdc_lld.c:304-324 —
sdc_lld_wait_transaction_end() now returns success as soon as DATAEND is
set and will clear STA without inspecting other error flags (including
the newly-enabled STBITERR). If DATAEND and an error bit are set together,
the transfer reports success and the error is lost. Suggest checking for
SDMMC_STA_ERROR_MASK plus SDMMC_STA_STBITERR before clearing ICR, and
returning failure so sdc_lld_error_cleanup() can record the error.
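The first finding's suggestion, a drain loop that is bounded and exits early on an error bit, could be sketched like this. It is an illustration of the review comment, not code from the patch or from trunk: the `SIM_` flags, `drain_fifo`, and the simulated status register are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status flags; placeholders, not the real register layout. */
#define SIM_STA_RXOVERR (1u << 0) /* Receive FIFO overrun.          */
#define SIM_STA_RXDAVL  (1u << 1) /* Data still available in FIFO.  */

typedef uint32_t (*read_sta_fn_t)(void *ctx);

typedef enum { DRAIN_OK, DRAIN_ERROR, DRAIN_TIMEOUT } drain_result_t;

/* Wait for the FIFO to empty, giving up on an overrun or after
   max_polls status reads. */
drain_result_t drain_fifo(read_sta_fn_t read_sta, void *ctx,
                          uint32_t max_polls) {
  for (uint32_t i = 0; i < max_polls; i++) {
    uint32_t sta = read_sta(ctx);
    if (sta & SIM_STA_RXOVERR) {
      return DRAIN_ERROR;   /* Let the error cleanup path run. */
    }
    if (!(sta & SIM_STA_RXDAVL)) {
      return DRAIN_OK;      /* FIFO drained by the DMA. */
    }
  }
  return DRAIN_TIMEOUT;     /* RXDAVL never cleared within the budget. */
}

/* Simulated status register, for demonstration only. */
typedef struct {
  uint32_t reads_until_empty; /* Reads for which RXDAVL stays set. */
  bool     overrun;           /* True to report RXOVERR.           */
} fake_sta_t;

uint32_t fake_read_sta(void *ctx) {
  fake_sta_t *f = ctx;
  uint32_t sta = f->overrun ? SIM_STA_RXOVERR : 0u;
  if (f->reads_until_empty > 0u) {
    f->reads_until_empty--;
    sta |= SIM_STA_RXDAVL;
  }
  return sta;
}
```

Whether such a bound is needed depends on whether RXDAVL can actually stay set with the DMA halted, which is the question debated in the replies.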


I committed on trunk anyway, you can checkout from there.

Giovanni

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby tpw_rules » Mon Dec 15, 2025 6:56 pm

I would much prefer to correspond with you than an AI!

The first concern is not likely to be a problem because it's unclear how DMA could stall or halt. If it does, we are probably going to hang in the dmaWaitCompletion function anyway so a hang a few lines earlier is not a new concern. In any case, we will now only hang the thread instead of the system.

The second concern is not possible according to my interpretation of the reference manual. DATAEND being set implies no other errors occurred, except for the RX overflow that we specifically check for.

Both could certainly be a problem if the reference manual is incorrect (which this patch demonstrates it is in at least one regard!), but I will let you address those with the functional safety effort. I wonder how we determine an appropriate level of (mis)-trust of the hardware and manual :)

Re: SDIOv1/SDMMCv1 hangs system forever on transfer error

Postby Giovanni » Mon Dec 15, 2025 7:54 pm

Oh don't worry, you will be dealing with me. I am submitting entire subsystems for review to Codex-CLI. Among all the noise it can sometimes find real issues, and of course I am the one deciding what is real and what is not. It is not bad for reviews, it just sucks at writing real code.

This is why I gave you the result, human-in-the-loop rule.

Giovanni

