[NOTES] Multi Core support

Postby **Giovanni** » Tue Jun 04, 2019 3:58 pm

Hi,

Just sharing some ideas about evolving RT for multi-core support.

- The ch structure is renamed ch0, other cores would use ch1, ch2 etc.
- The ch structure is accessed through a macro, as in currcore->xxxx, which selects the right structure for the "current" core. The macro hides an identity register or some other architecture-specific core-private variable. On single-core systems currcore is defined as &ch0. All instances of "ch." are replaced with "currcore->".
- Add a new chSysObjectInit(&chx); chSysInit() is retained as chSysObjectInit(&ch0) plus the system-wide initializations (oslib).
- Each core initializes its own OS instance from its own main(), there are multiple main threads, one per core.
- Threads are tied to the cores that created them (a pointer to the owner ch structure is added to thread_t). Threads never migrate between cores.
- Virtual timers are tied to cores that started them.
- Stats, trace buffer, system state are per-core.
- MC support will require inter-core notifications as SW interrupts (one core makes a thread ready then notifies the other core to reschedule).
- MC support will require some kind of HW semaphore for accessing "ch" structures or synchronization objects, all cores can manipulate those.
- chSysLock()/chSysUnlock() also take/release the HW semaphore.
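
To make the list above more concrete, here is a minimal sketch of per-core instances selected through currcore. It is not the actual RT code: os_instance_t, the instances[] table, port_get_core_id(), thread_object_init() and sim_set_core() are hypothetical names invented for the illustration, and the "identity register" is faked with a plain variable so the sketch runs on a host.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 2                 /* assumption: dual-core target */

/* Hypothetical, heavily reduced stand-in for the kernel "ch" structure. */
typedef struct os_instance {
  unsigned core_id;                 /* index of the owning core */
  unsigned initialized;             /* set by the per-instance initializer */
} os_instance_t;

/* Threads carry a pointer to the instance that created them. */
typedef struct thread {
  os_instance_t *owner;
} thread_t;

os_instance_t ch0, ch1;             /* one OS instance per core */
static os_instance_t *const instances[NUM_CORES] = { &ch0, &ch1 };

/* Stand-in for an identity register; a real port would read hardware. */
static unsigned fake_core_id;
static unsigned port_get_core_id(void) { return fake_core_id; }

/* Resolves to the instance of the core executing the code. */
#define currcore (instances[port_get_core_id()])

/* Per-instance initializer, called by each core from its own main(). */
void chSysObjectInit(os_instance_t *oip) {
  assert(oip != NULL);
  oip->core_id = (oip == &ch0) ? 0U : 1U;
  oip->initialized = 1U;
}

/* A new thread is tied to the core that creates it. */
void thread_object_init(thread_t *tp) {
  assert(tp != NULL);
  tp->owner = currcore;
}

/* Host-only helper: pretend the code now runs on core 'id'. */
void sim_set_core(unsigned id) {
  assert(id < NUM_CORES);
  fake_core_id = id;
}
```

On a single-core build the table and the identity read would disappear and currcore would simply be &ch0, as described above.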

Problems:
- Any core can make threads ready for other cores, which means the priority order condition can be violated: the current thread may have a lower priority than the highest-priority thread in the ready list. Notification interrupts should have the highest priority.
- In the HAL, chSysLock()/chSysUnlock() are often used just to disable interrupts. Once they also take the HW semaphore this becomes an overhead, so the HAL should be reworked a bit to use plain interrupt enable/disable where no OS functions are called.
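
The first problem can be illustrated with a small host-side sketch. Everything here is hypothetical (os_instance_t, ready_on(), resched_handler(), the resched_pending flag standing in for a SW interrupt); it only shows the window in which the target core still runs a lower-priority thread and how the notification closes it. The caller is assumed to hold the kernel lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct thread {
  struct thread *next;
  unsigned prio;
} thread_t;

typedef struct os_instance {
  thread_t *ready;          /* ready list, highest priority first */
  thread_t *current;        /* thread running on this core */
  bool resched_pending;     /* stands in for a pending inter-core SW IRQ */
} os_instance_t;

/* Insert keeping descending priority order, FIFO among equal priorities. */
static void ready_insert(os_instance_t *oip, thread_t *tp) {
  thread_t **pp = &oip->ready;
  while (*pp != NULL && (*pp)->prio >= tp->prio) {
    pp = &(*pp)->next;
  }
  tp->next = *pp;
  *pp = tp;
}

/* Called from ANY core: make tp ready on 'target' and notify if needed. */
void ready_on(os_instance_t *target, thread_t *tp) {
  assert(target != NULL && tp != NULL);
  ready_insert(target, tp);
  if (target->current == NULL || tp->prio > target->current->prio) {
    target->resched_pending = true;   /* real port: trigger the SW IRQ */
  }
}

/* Runs on the target core when the notification is served. */
void resched_handler(os_instance_t *self) {
  assert(self != NULL);
  self->resched_pending = false;
  if (self->ready != NULL &&
      (self->current == NULL || self->ready->prio > self->current->prio)) {
    thread_t *next = self->ready;
    self->ready = next->next;
    next->next = NULL;
    if (self->current != NULL) {
      ready_insert(self, self->current);  /* preempted thread goes back */
    }
    self->current = next;
  }
}
```

Between ready_on() and resched_handler() the ordering condition is violated on the target core, which is why the notification interrupt would want the highest priority.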

Giovanni

Postby **mobyfab** » Tue Jun 04, 2019 6:00 pm

I think it's better to have one instance of ch running across all cores dynamically.
Also less duplicate ch code in memory, only one main, probably easier to manage peripheral mutexes, etc.

References:
https://www.cs.york.ac.uk/fp/multicore- ... mistry.pdf
https://docs.espressif.com/projects/esp ... s-smp.html

Postby **Giovanni** » Tue Jun 04, 2019 8:13 pm

Hi,

You cannot escape multiple "main" functions: those cores have to start somewhere, and regardless of the name it is a main function.

About having no affinity, there are several problems with that:

- You would need one ready list per core for threads with affinity, plus another ready list for threads with no affinity. The ready list is basically that "ch" structure, so you would have N+1 instead of N. You could use a single ready list, but then threads with affinity would have to be picked from the middle of the list, which is very complex and inefficient (basically you would search the list instead of picking the thread at the top).

- Having threads both with and without affinity forces the OS to check two lists in order to decide which thread runs next. This makes the context switch much slower, and it is a critical path.

- There is also a determinism and predictability problem: you need to consider threads that may or may not be executed by other cores, which makes estimations even harder. This is an RTOS, not a generic OS, so it is quite a critical point. Desktop and server OSes have no such problems; they try to maximize reactivity and bandwidth respectively.

- Threads need to have affinity because micros are not perfect SMP machines (most of those I have seen): there are RAM areas closer to certain cores. If I put the stack of thread A in the TCM of core 0 then I don't want it to be executed by core 1. And that assumes core 0 and core 1 can access each other's TCMs, which is not a given.

- There are micros with different CPUs, for example Cortex-A with M or R. There you definitely cannot have threads migrate between cores, yet you still want those threads to manipulate the same objects like semaphores, queues etc.
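
The cost argument in the first two points can be sketched as follows. This is only an illustration with hypothetical names (pick_per_core(), pick_shared(), ANY_CORE, the steps counter); it compares taking the head of a per-core ready list with searching a single shared list for the first thread allowed to run on a given core.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_CORE (-1)               /* hypothetical "no affinity" marker */

typedef struct thread {
  struct thread *next;
  unsigned prio;
  int affinity;                     /* core index or ANY_CORE */
} thread_t;

/* Per-core ready list: the next thread is always the head, O(1). */
thread_t *pick_per_core(thread_t **list) {
  thread_t *tp = *list;
  if (tp != NULL) {
    *list = tp->next;
    tp->next = NULL;
  }
  return tp;
}

/* Single shared list: walk until a thread that may run on 'core', O(n).
   'steps' reports how many elements were inspected. */
thread_t *pick_shared(thread_t **list, int core, unsigned *steps) {
  assert(steps != NULL);
  *steps = 0U;
  for (thread_t **pp = list; *pp != NULL; pp = &(*pp)->next) {
    thread_t *tp = *pp;
    (*steps)++;
    if (tp->affinity == ANY_CORE || tp->affinity == core) {
      *pp = tp->next;               /* unlink from the middle of the list */
      tp->next = NULL;
      return tp;
    }
  }
  return NULL;
}
```

With the shared list the number of inspected elements depends on what the other cores have queued, which is also the predictability concern raised in the third point.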

I am sure there are other problems too, but the above are the ones at the top of my mind right now.

I would have no problems with threads able to switch cores voluntarily using an API (and with limitations), but dynamic execution is problematic, IMHO.

Giovanni

Postby **mobyfab** » Tue Jun 04, 2019 8:49 pm

I see, indeed a more static and deterministic approach would better fit ChibiOS in the end.

Thanks for the explanation!

Postby **Giovanni** » Wed Jun 05, 2019 10:26 am

I am going to "objectify" all internal initializers while the other details become clearer to me; it does not hurt and it improves readability.

Giovanni

Postby **Giovanni** » Sat Jun 08, 2019 6:26 pm

A few more notes:

- I created a branch for RT7, it is where I will experiment with new things.
- RT7 will be capable of multi-core operation in two modes, "loose" and "strong". In loose mode the various OS instances cannot interact with each other; it is like having multiple RTOSes without any overhead. In strong mode threads of all instances can use the same synchronization primitives and coordinate, but there is some inter-core communication overhead.
- There will be strong thread/timer affinity and no automatic migration. However, it would be possible to introduce a migration API that threads could call to switch side. Load balancing could also work.
- There will be one global spinlock, I considered having one spinlock per instance and one for each synchronization object but apparently there is no real advantage in doing so, just the overhead of having to take multiple spinlocks.
- Changes to the kernel probably are going to be easier than I initially thought. Single core and multi core RT are going to be the same product, just different settings. Most things are handled in the port layer.
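
As a rough illustration of the single global spinlock mentioned above, here is a sketch using C11 atomics. It is not the RT7 port code: sys_lock(), sys_unlock(), other_core_try_lock() and the irq_disabled flag are hypothetical, and the interrupt mask is simulated with a variable (a real port would keep it per core).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The one kernel-wide spinlock shared by all OS instances. */
static atomic_flag kernel_lock = ATOMIC_FLAG_INIT;

/* Stand-in for the local interrupt mask of the calling core. */
static bool irq_disabled;
static void port_disable_irq(void) { irq_disabled = true; }
static void port_enable_irq(void)  { irq_disabled = false; }

/* Enter the kernel critical zone: mask local IRQs, then spin on the lock. */
void sys_lock(void) {
  port_disable_irq();
  while (atomic_flag_test_and_set_explicit(&kernel_lock,
                                           memory_order_acquire)) {
    /* busy wait until the owning core releases the lock */
  }
}

/* Leave the critical zone: release the lock, then unmask local IRQs. */
void sys_unlock(void) {
  assert(irq_disabled);
  atomic_flag_clear_explicit(&kernel_lock, memory_order_release);
  port_enable_irq();
}

/* Models a single non-blocking attempt made by another core. */
bool other_core_try_lock(void) {
  return !atomic_flag_test_and_set_explicit(&kernel_lock,
                                            memory_order_acquire);
}
```

With one lock there is never a second lock to take, so no lock-ordering rules are needed; the trade-off is that all cores serialize on it whenever they enter the kernel.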

There are also problems:
- Asymmetric systems, for example M4+M7, would have problems with compile options, FPU compiler options are different between the two. The M7 would have to be used as an M4 if it is a single compilation unit.
- Cache coherence is also a problem on Cortex-Ms, shared objects would have to be placed in non-cacheable zones, same for Cortex-Rs. Cortex-As have a cache coherence mechanism in HW.

Giovanni

Postby **steved** » Tue Jun 18, 2019 1:26 pm

I'm thinking of applications where a multi-processor device might replace several individual processors which intercommunicate using some form of comms link (serial, I2C, CAN etc.).

Giovanni wrote:- RT7 will be able of multi core operations in two modes "loose" and "strong", in loose mode the various OS instances cannot interact with each other, it is like having multiple RTOSes without any overhead. In strong mode threads of all instances can use the same synchronization primitives and coordinate but there is some inter core communication overhead.

How would non-time-critical inter-core communication using, say, a shared memory area fit into this scheme? At one level it's not really synchronisation/interaction - potentially just signal an event to the processor which must respond. Depends how you interpret the words.

Giovanni wrote:- Asymmetric systems, for example M4+M7, would have problems with compile options, FPU compiler options are different between the two. The M7 would have to be used as an M4 if it is a single compilation unit.

This could constrain performance; I can imagine that some applications will be using the higher performing cores for processor-intensive tasks, while the smaller cores do some of the less demanding stuff. Is this solvable with a cleverer compiler?

Postby **Giovanni** » Tue Jun 18, 2019 1:43 pm

Hi,

steved wrote:How would non-time-critical inter-core communication using, say, a shared memory area fit into this scheme? At one level it's not really synchronisation/interaction - potentially just signal an event to the processor which must respond. Depends how you interpret the words.

For mutual exclusion on shared areas you can use a spin-lock (ldrex/strex on ARM) or some kind of HW semaphore; the latest STM32H7 devices already include those. Exchanging SW-triggered IRQs would be an option for events (for example to signal a semaphore belonging to the other OS instance, so you would still be able to wake up a thread on the other side).

I suspect this loose mode will be quite useful and still be lightweight.

steved wrote:This could constrain performance; I can imagine that some applications will be using the higher performing cores for processor-intensive tasks, while the smaller cores do some of the less demanding stuff. Is this solvable with a cleverer compiler?

Not so easy, you can use tricks like compiling different code with different options then linking but this introduces a lot of other problems (libraries, LTO etc).

With asymmetric cores I would make 2 separate images (and accept space overhead). Shared areas could be handled in the .ld file, there is also cache coherence to consider so you will want to make those shareable (non cached).

Giovanni

Postby **Giovanni** » Sun Jun 23, 2019 6:18 am

Update: the change is mostly complete in the RT7 branch.

Now I need to set up a platform for testing, a dual Cortex-Ax or Cortex-R52 or dual PowerPC. I have access to several but it is automotive stuff not easily available. Suggestions?

Giovanni

Postby **alex31** » Sun Jun 23, 2019 10:27 pm

Hello,

You are probably aware of the STM32MP1: 2x Cortex-A7 + Cortex-M4.

The dev board (STM32MP157A-DK1) should be available soon at a reasonable price ($69).

But it's perhaps too much work to port ChibiOS to the Cortex-A architecture...

Alexandre
