Threads are dying, I cannot find the issue :(
Posted: Fri Feb 27, 2015 4:59 am
Sorry to bother you with a boring issue that I have been fighting for the last five days. I've created a stress test for my application that reliably kills it in about 20 minutes, and I cannot figure out why.
Symptoms: all of my "while(true) {doJob(); chThdSleepMilliseconds(X);}" threads are getting lost. My only explicit VirtualTimer looks OK, and my interrupts appear to be processed fine.
Some details:
Code:
__main_stack_size__ = 0x1000;
__process_stack_size__ = 0x0600;
#define PORT_IDLE_THREAD_STACK_SIZE 1024
#define PORT_INT_REQUIRED_STACK 32
#define CH_DBG_ENABLE_STACK_CHECK TRUE
#define CH_DBG_ENABLE_CHECKS TRUE
#define CH_DBG_ENABLE_ASSERTS TRUE
#define CH_DBG_SYSTEM_STATE_CHECK TRUE
I know that stack overflow is the main suspect, so I have a LOT of assert(getRemainingStack() > XX); statements all over my code, and none of them trigger. I also know that the main stack (4 KB total) never uses even 1 KB.
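To illustrate the limitation of point-in-time assertions like these: they only see stack depth at the moment they run, so a deeper excursion between two checks goes unnoticed. A fill-pattern (high-water mark) scan covers that gap. The sketch below is portable C that simulates a stack in a plain buffer; the names stack_paint, stack_never_used and simulate_usage are hypothetical, and the 0x55 fill byte is an assumption based on what ChibiOS 2.x is understood to use with CH_DBG_FILL_THREADS enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed fill pattern. ChibiOS 2.x is believed to use 0x55 for thread
   stacks when CH_DBG_FILL_THREADS is TRUE; verify against your sources. */
#define STACK_FILL_BYTE 0x55u

/* Paint the whole stack area once, before the thread starts running. */
void stack_paint(uint8_t *base, size_t size) {
    memset(base, STACK_FILL_BYTE, size);
}

/* Count bytes still holding the fill pattern, scanning up from the lowest
   address. For a descending stack this is the smallest free space ever
   reached, not just the free space at the time of the call. */
size_t stack_never_used(const uint8_t *base, size_t size) {
    size_t n = 0;
    while (n < size && base[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}

/* Hypothetical stand-in for real code: dirties the top `depth` bytes,
   as a descending stack would after a call chain that deep. */
void simulate_usage(uint8_t *base, size_t size, size_t depth) {
    assert(depth <= size);
    memset(base + size - depth, 0xA5, depth);
}
```

Scanning every thread's working area this way from a low-priority monitor thread reports the worst case seen so far. The scan can overestimate free space if the deepest stack data happens to equal the fill byte.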
Here's the typical state of chSysTimerHandlerI while things are OK, note all my threads in the list:

And then, after about 20 minutes, I get:

I am currently using 2.6.7; I believe I had the same issue with 2.6.6.
If this were a stack overflow, I would expect a more randomly damaged vtlist, but I am seeing a valid list. My sleeping threads are simply missing from it.
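One way to make "valid list" something that can be asserted at runtime, rather than eyeballed in the debugger, is a link-consistency walk. The sketch below is portable C with hypothetical names (timer_node, validate_timer_list, timer_list_append). It models a circular doubly linked list with a sentinel header, which is an assumption about how the 2.x vtlist is laid out; it does not use the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type modelling a circular doubly linked timer list
   with a sentinel header. Not the real kernel structure. */
struct timer_node {
    struct timer_node *next;
    struct timer_node *prev;
};

/* Walk the list starting at the sentinel. Returns the number of linked
   timers, or -1 if a pointer is NULL, a back link does not match, or the
   walk exceeds max_nodes (which suggests a cycle that skips the sentinel). */
int validate_timer_list(const struct timer_node *head, int max_nodes) {
    int count = 0;
    const struct timer_node *prev = head;
    const struct timer_node *cur;

    if (head == NULL) {
        return -1;
    }
    cur = head->next;
    while (cur != head) {
        if (cur == NULL || cur->prev != prev || count >= max_nodes) {
            return -1;
        }
        count++;
        prev = cur;
        cur = cur->next;
    }
    /* The sentinel's back link must point at the last node visited. */
    return (head->prev == prev) ? count : -1;
}

/* Link a node at the tail of the list, just before the sentinel. */
void timer_list_append(struct timer_node *head, struct timer_node *node) {
    assert(head != NULL && node != NULL);
    node->next = head;
    node->prev = head->prev;
    head->prev->next = node;
    head->prev = node;
}
```

Logging the returned count periodically would show whether the number of timers drops while every link stays consistent, which would at least suggest the entries are being unlinked cleanly rather than overwritten at random.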
Could it be anything but a stack overflow? I really want to catch the issue so that I can fix it knowing the actual cause. I have so many stack assertions, which have saved me before, that I believe they would have caught an overflow by now.
What kind of additional state validation or troubleshooting technique can I try?