
Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 4:59 am
by russian
Sorry to bother you with a boring issue, but I have been fighting it for the last five days. I created a stress test for my application that reliably kills it in about 20 minutes, and I cannot figure out why.

Symptoms: all of my "while(true) {doJob(); chThdSleepMilliseconds(X);}" threads are getting lost. My only explicit VirtualTimer looks OK, and my interrupts appear to be processed fine.

Some details:

Code:

__main_stack_size__     = 0x1000;
__process_stack_size__  = 0x0600;

#define PORT_IDLE_THREAD_STACK_SIZE     1024

#define PORT_INT_REQUIRED_STACK         32

#define CH_DBG_ENABLE_STACK_CHECK       TRUE
#define CH_DBG_ENABLE_CHECKS            TRUE
#define CH_DBG_ENABLE_ASSERTS           TRUE
#define CH_DBG_SYSTEM_STATE_CHECK       TRUE


I know that stack overflow is the main suspect, so I have a LOT of assert(getRemainingStack() > XX); statements all over my code, and none of them trigger. I also know that the main stack (total size 4 KB) never uses even 1 KB.
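
An assert like the one above only sees the stack depth at the point where it is called. A complementary technique is a high-water mark: paint the stack area with a known byte before the thread starts, then count how many bytes are still untouched later on. The following is a minimal, host-runnable sketch of that idea, not ChibiOS code. The names stack_paint, stack_unused and stack_simulate_use and the value FILL_BYTE are hypothetical and chosen only for illustration, and the sketch assumes a stack that grows downward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill byte; an assumption for this sketch. */
#define FILL_BYTE 0x55u

/* Paint the whole stack area with the fill pattern before the thread starts. */
static void stack_paint(uint8_t *base, size_t size) {
    memset(base, FILL_BYTE, size);
}

/* Count untouched bytes starting from the low end of the area.
 * On a downward-growing stack the low addresses are used last,
 * so the result is the smallest free space seen so far. */
static size_t stack_unused(const uint8_t *base, size_t size) {
    size_t n = 0;
    while (n < size && base[n] == FILL_BYTE) {
        n++;
    }
    return n;
}

/* Simulate a thread that has written to the top `used` bytes of its stack. */
static void stack_simulate_use(uint8_t *base, size_t size, size_t used) {
    assert(used <= size);
    memset(base + (size - used), 0xA5, used);
}
```

Painting a 256-byte area and simulating 100 bytes of use leaves stack_unused reporting 156. Unlike a point-in-time assert, the mark also records depth reached inside interrupt handlers or deep call chains that contain no assert. It can under-report if the thread happens to write the fill value itself.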

Here's the typical state of chSysTimerHandlerI while things are OK, note all my threads in the list:
Image

And then after about 20 minutes I get:
Image

I am currently using 2.6.7; I believe I had the same issue with 2.6.6.

If this were a stack overflow, I would expect a more randomly damaged vtlist. Instead I am seeing a valid list that simply does not contain my sleeping threads.

Could it be anything other than a stack overflow? I really want to catch the issue so that I can fix it knowingly. I have so many stack assertions, which have saved me before, that I believe I would have caught an overflow by now.

What kind of additional state validation or troubleshooting technique can I try?

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 9:04 am
by Giovanni
Hi,

You could try the Eclipse debug plugin to inspect the state of threads and the trace buffer.

Alternatively, prepare a minimal application that triggers the problem.

Giovanni

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 2:24 pm
by russian
http://www.chibios.org/dokuwiki/doku.ph ... bug_plugin says
Starting from versions 2.2.7 stable and 2.3.3 unstable the ChibiOS/RT distribution includes a Debug Plugin for eclipse enhancing it with RTOS awareness.


but I only see the plugin inside ChibiOS_2.6.0.zip. I have installed that version, but it does not show anything :(

I have just added
void assertVtList(void) {
   if (!main_loop_started)
      return;
   /* walk the circular timer list, giving up after 20 steps */
   VirtualTimer *first = vtlist.vt_next;
   VirtualTimer *cur = first->vt_next;
   int c = 0;
   while (c++ < 20 && cur != first) {
      cur = cur->vt_next;
   }
   /* too few steps means the sleeping threads' timers are gone */
   efiAssertVoid(c > 3, "VT list?");
}

into chSysTimerHandlerI. I believe it gives me the exact moment when the threads disappear; here is the trace:
Image

I will now prepare a package which reproduces the problem.

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 2:38 pm
by Giovanni
Hi,

The plugin is part of ChibiStudio now.

Giovanni

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 4:17 pm
by russian
SVN: https://svn.code.sf.net/p/rusefi/code/branches/20150227_fatal_issue/
Same stuff as one zip: https://svn.code.sf.net/p/rusefi/code/branches/20150227_fatal_issue.zip

There I have the firmware, a Makefile, and Eclipse and IAR projects. By the way, I still need to check whether I can reproduce this issue with IAR.

Once the firmware starts on the STM32F4, it opens a virtual serial port over USB. The bundle also contains a Java testing utility, rusefi_console.jar.

Code:

java -cp rusefi_console.jar com.rusefi.EnduranceTest COM41

where COM41 is the serial port name.

The blue LED blinks to show that the code is alive; once it stops blinking, the code is dead :( The red LED means a fatal error. It now turns on because of assertVtList in chSysTimerHandlerI.

This time it took 62 minutes and 456 test cycles to get to the error :(

Fri Feb 27 09:02:14 EST 2015<EOT>: Starting COM41
Fri Feb 27 09:02:14 EST 2015<EOT>: SerialConnector: connecting
Fri Feb 27 09:02:14 EST 2015<EOT>: Sending command [set_engine_type 3]
Fri Feb 27 09:02:14 EST 2015<EOT>: postMessage CommandQueue: SerialIO started
Fri Feb 27 09:02:14 EST 2015<EOT>: Opening COM41 @ 115200
Fri Feb 27 09:02:27 EST 2015<EOT>: Starting COM41
Fri Feb 27 09:02:27 EST 2015<EOT>: SerialConnector: connecting
...
...
...
...
...

Fri Feb 27 10:06:41 EST 2015<EOT>: postMessage EngineState: setting fan No
Fri Feb 27 10:06:41 EST 2015<EOT>: postMessage EngineState: setting pump No
Fri Feb 27 10:06:41 EST 2015<EOT>: postMessage EngineState: setting fan No
Fri Feb 27 10:06:41 EST 2015<EOT>: ++++++++++++++++++++++++++++++++++++ 456 +++++++++++++++
Fri Feb 27 10:06:41 EST 2015<EOT>: Sending command [set_engine_type 3]
Fri Feb 27 10:06:41 EST 2015<EOT>: Sending [sec!17!set_engine_type 3]
Fri Feb 27 10:06:41 EST 2015<EOT>: postMessage PortHolder: Sending [sec!17!set_engine_type 3]
Fri Feb 27 10:06:41 EST 2015<EOT>: EngineState: unexpected header: sec!17!set_engine_type 3 while looking for line:
Fri Feb 27 10:06:42 EST 2015<EOT>: msg,setting pump No,msg,setting fan No,msg,setting pump No,msg,setting fan No,msg,setting pump No,msg,setting fan No,msg,confirmation_set_engine_type 3:17,msg,applyNonPersistentConfiguration(),msg,initializeTriggerShape(),msg, !!!!!!!!!!!!!!!!!!!! BE SURE NOT WRITE WITH IGNITION ON !!!!!!!!!!!!!!!!!!!!,msg,flash compatible with 6667,msg,Reseting flash: size=15172,msg,Flashing with CRC=208,msg,Flash programmed in (ms): 65,msg,Flashing result: 0,msg,Template Aspire/3 trigger TT_FORD_ASPIRE/LM_PLAIN_MAF,msg,configurationVersion=928,msg,RPM bin: 800.00 1213.32 1626.65 2040.00 2453.32 2866.65 3280.00 3693.32 4106.65 4520.00 4933.33 5346.65 5760.00 6173.33 6586.65 7000.00 ,msg,Y bin: 1.19 1.40 1.62 1.83 2.04 2.25 2.48 2.69 2.89 3.11 3.32 3.53 3.75 3.97 4.17 4.40 ,msg,CLT: 1.50 1.50 1.41 1.36 1.27 1.19 1.12 1.10 1.05 1.05 1.02 1.00 1.00 1.00 1.00 1.00 ,msg,CLT bins: -40.00 -30.00 -20.00 -10.00 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 100.00 110.00 ,msg,IAT: 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 ,msg,IAT bins: -40.00 -30.00 -20.00 -10.00 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 100.00 110.00 ,msg,vBatt: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ,


This is where I am stuck. I am pretty sure it could be my own bug, but I need advice on how to catch it as it develops, assuming I am corrupting a ChibiOS memory region.

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 4:35 pm
by Giovanni
Hi,

The images you posted are not a list of threads but a list of virtual timers. What do you mean by "threads dying"? In ChibiOS threads are static and cannot disappear.

Giovanni

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 5:49 pm
by russian
I have one virtual timer which I control explicitly:

Code:

   chVTSetAny(&periodicTimer, period * TICKS_IN_MS, (vtfunc_t) &periodicCallback, engine);


and I have about 15 threads, each of which follows the same pattern:

Code:

static void blinkingThread(void *arg) {
   while (true) {
      int delay = isConsoleReady() ? 3 * blinkingPeriod : blinkingPeriod;
      chThdSleepMilliseconds(delay);
   }
}

As I understand it, chThdSleepMilliseconds is implemented via a VirtualTimer vt; allocated on the stack of the calling thread. With 15 threads like that, I expect to always see a long list of virtual timers in the vtlist. That holds for about an hour, and then suddenly my explicit periodicTimer is still there, while the implicit VirtualTimer vt; entries for all my utility threads are missing from the vtlist.
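
The expectation described above can be modelled on a host machine with a simplified circular list. This is only a sketch under stated assumptions: timer_node_t and the list_* functions are hypothetical stand-ins, not the real ChibiOS VirtualTimer structure or API. Each sleeping thread contributes one node, so a healthy list should count roughly as many nodes as there are sleeping threads. The counter is bounded so that a corrupted list cannot hang the checker.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a virtual timer node; not the real ChibiOS struct. */
typedef struct timer_node {
    struct timer_node *next;
    struct timer_node *prev;
} timer_node_t;

/* An empty circular list is a sentinel that points at itself. */
static void list_init(timer_node_t *head) {
    head->next = head;
    head->prev = head;
}

/* Link a node just before the sentinel, i.e. at the tail. */
static void list_insert(timer_node_t *head, timer_node_t *n) {
    assert(head != NULL && n != NULL);
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink a node, as would happen when a sleep expires. */
static void list_remove(timer_node_t *n) {
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Count linked nodes. Returns -1 if a link is NULL, if the back link
 * does not match the forward link, or if more than `limit` nodes are
 * seen (which suggests the list no longer closes on the sentinel). */
static int list_count(const timer_node_t *head, int limit) {
    int count = 0;
    const timer_node_t *cur;
    for (cur = head->next; cur != head; cur = cur->next) {
        if (cur == NULL || cur->prev == NULL || cur->prev->next != cur) {
            return -1;
        }
        if (++count > limit) {
            return -1;
        }
    }
    return count;
}
```

In this model, a list that is still well-formed but shorter than expected (the situation described above) shows up as a low count rather than as -1. That would be consistent with nodes having been unlinked cleanly rather than overwritten, although the model by itself cannot say who unlinked them.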

I will now try ChibiStudio. Should I create a ticket for the wiki update? It looks like the current http://www.chibios.org/dokuwiki/doku.php?id=chibios:guides:debug_guide#debug_plugin is not up to date.

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 6:15 pm
by Giovanni
Always about one hour or is it random?

Giovanni

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 6:34 pm
by russian
Giovanni wrote: Always about one hour or is it random?

Random. Sometimes it's 15 minutes, sometimes I get a good 3-hour run. I am trying to isolate the issue to a particular layer of my code: I can conditionally compile individual layers of functionality in or out, the same idea as in halconf.h.
With most of my functional modules off, I have a copy that has been running for 13 hours and counting.

Re: Threads are dying, I cannot find the issue :(

Posted: Fri Feb 27, 2015 6:46 pm
by Giovanni
What is chVTSetAny()?

Giovanni