Stmlib: startup files and global C++ constructors


#1

I’m playing around with the stmlib makefiles and linker scripts and found that my C++ constructors for static objects are not called. After some diving into the linker and startup files provided with stmlib, I found that there is no .ctors section in the linker script and no late initialization routine in the startup files.

I guess this is on purpose, to allow for specific initialization sequences? If so, I’d love to know why. Initializing things in a strict sequence in main() doesn’t necessarily require global constructors to be disabled.

Or am I just doing something wrong here?


#2

The linker scripts for STM32F currently used in stmlib are discarding .init_array. This means, among other things, that constructors of static objects are not called, and that their vtbl pointer is not initialized. The vtbl are not generated anyway. So virtuals don’t work either.

Then why don’t I fix the linker scripts to include .init_array? The machinery that is invoked through .init_array is kind of heavy:

  • It generates extra code for initializing static objects, in particular assigning their vtables.
  • It generates extra code for registering destructors of static objects (__cxa_atexit, __register_exitproc etc) which is useless because the objects are never destroyed.
  • This cleanup code also uses some RAM.
  • The “garbage collection” of unused code sections seems to be less efficient whenever virtuals are allowed.

It’s just not worth imposing this cost (in RAM and code size) on every project using stmlib, especially since the constructors in my code very rarely do anything useful (explicit Init), and since most projects don’t need anything that looks like virtuals.


#3

@TheSlowGrowth The linker scripts/startup files provided by ST and in stmlib don’t call the functions listed in .init_array, so static objects’ constructors are never called. It’s quite easy to add (see e.g. https://bitbashing.io/embedded-cpp.html), and that’s what I do now: I think the advantages largely outweigh the costs mentioned by @pichenettes.

It generates extra code for initializing static objects, in particular assigning their vtables.

This is probably a couple instructions per class executed only once on startup; it simply doesn’t count. Plus AFAIK the vtable pointer is always allocated even if it is not assigned.

It generates extra code for registering destructors of static objects (__cxa_atexit, __register_exitproc etc) which is useless because the objects are never destroyed. This cleanup code also uses some RAM.

Why would dtor code not be garbage-collected by ld, since atexit is never called?

The “garbage collection” of unused code sections seems to be less efficient whenever virtuals are allowed.

What do you mean by “virtuals are allowed”? How would gcc know?


#4

Thank you both for your valuable input. Unfortunately, for what I have in mind, I need virtuals. Sigh. I’ll have to dive into this mess and figure out how to properly deal with this stuff.

From what I read so far, you can disable the creation of those destructors with a compiler argument.

Ah, finally a good read on this topic. The web is full of forum posts that barely give any detailed explanation on what’s happening. Thank you very much.

If you have any other insights or useful web sources, please let me know. Things like this can be a pain to figure out, as documentation is quite sparse. I’m still hoping to stumble across a complete, good (!) linker script & corresponding startup file, but I’m afraid I’ll have to make them myself.

On a side note...

There are many occasions where I find no way around dynamic memory allocation. I know how much pain comes with that in embedded systems … but how would I properly implement callback messages and message queues without dynamic allocation?

class CallbackMessage
{
   ... 
   virtual void callback() = 0;
};


// somewhere in my code
if (somethingAsynchronousHasToBeDone)
{
   // I can - on the fly - define a new custom message that handles some UI interaction or update event, etc.
   class MyCustomCallbackMessage: public CallbackMessage 
   {
      ...
      virtual void callback() override 
      { 
         doSomethingInterestingHere();
      }
   };
   MessageQueue::post(new MyCustomCallbackMessage());
}

I know I can use function pointers and instead of inserting objects of CallbackMessage into the queue, I could insert function pointers to static functions. But my above example is just so much more readable and open for any use case…


#5

Glad it helps. About dynamic allocation:
1/ make your callback instance a static member of the enclosing class, or a global variable.
2/ implement your queues with ring buffers.
If you want to translate code written with malloc into completely statically allocated code, just remember that each object will have to be assigned the maximum space it will ever need: one CallbackMessage instance in your case (the maximum between 0 and 1), 42 CallbackMessage instances for your queue (42 being the maximum queue size allowed).
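
A minimal sketch of points 1/ and 2/, under some assumptions: the names MessageQueue, CountingMessage and run_demo are made up for illustration, the capacity of 4 is arbitrary, and the queue is not interrupt-safe as written. The message object is a static instance and the queue is a fixed-size ring buffer of pointers, so nothing is allocated at run time.

```cpp
#include <cassert>
#include <cstddef>

// Same interface as in the question above.
class CallbackMessage {
 public:
  virtual void callback() = 0;
};

// Fixed-capacity ring buffer of pointers. The storage is a plain array,
// so posting never allocates. (Hypothetical helper, not stmlib code.)
template <std::size_t N>
class MessageQueue {
 public:
  // Returns false instead of allocating when the queue is full.
  bool post(CallbackMessage* m) {
    if (count_ == N) return false;
    slots_[(head_ + count_) % N] = m;
    ++count_;
    return true;
  }

  // Runs the oldest pending message; returns false if the queue is empty.
  bool dispatch() {
    if (count_ == 0) return false;
    CallbackMessage* m = slots_[head_];
    head_ = (head_ + 1) % N;
    --count_;
    m->callback();
    return true;
  }

 private:
  CallbackMessage* slots_[N] = {};
  std::size_t head_ = 0;
  std::size_t count_ = 0;
};

// A message type with exactly one statically allocated instance.
class CountingMessage : public CallbackMessage {
 public:
  void callback() override { ++calls; }
  int calls = 0;
};

static CountingMessage counting_message;  // static storage, never freed
static MessageQueue<4> queue;

// Posts the static instance twice, drains the queue, returns the call count.
int run_demo() {
  const bool posted =
      queue.post(&counting_message) && queue.post(&counting_message);
  assert(posted);  // the queue has room for both posts
  (void)posted;
  while (queue.dispatch()) {}
  return counting_message.calls;
}
```

The price compared to new/delete is that each kind of message needs its own pre-declared instance, and a full queue has to be handled by the caller (here, post() simply returns false).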


#6

I needed virtuals for a project I was working on, made the (small) change in the linkerscript in stmlib, and found incompatibilities with another project (code size was inflated by about 1k).

No idea!

What I mean: Classes A and B are both subclasses of S. S has a virtual method DoThing(), implemented by A and B. A project contains a call to s->DoThing(); – but there is no code path through which s can be an instance of B. B::DoThing is still linked, because gcc cannot do a deep enough analysis to figure out that s cannot be an instance of B.
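
The situation can be sketched like this. It assumes that an instance of B exists somewhere in the program (otherwise B’s vtable would not be referenced at all); the int return values and the names CallSite and RunWithA are only there for illustration. Whether a given toolchain and set of flags ends up discarding B::DoThing is something to measure, not something this sketch proves.

```cpp
#include <cassert>

struct S {
  virtual int DoThing() = 0;
};

struct A : S {
  int DoThing() override { return 1; }
};

struct B : S {
  int DoThing() override { return 2; }  // never reached through CallSite()
};

static A a;
static B b;  // constructing b references B's vtable, which points at B::DoThing

// The only virtual call in the program.
int CallSite(S* s) {
  assert(s != nullptr);  // caller must pass a valid object
  return s->DoThing();
}

// The only caller: s is always an A, but proving that needs whole-program analysis.
int RunWithA() { return CallSite(&a); }
```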


#7

If I remember well, the change to include back the vtables and all the initialization machinery simply consists in adding KEEP (*(.init_array*)) in the .text : block of the linkerscript.


#8

But I don’t see any calls to the init_array members in the startup file, so it must be more complicated than that, I think. I hope to find some time today to check this. I’ll report back


#9

I can publish the patch to stmlib when I get home if you need. It’s in two parts: include .init_array in the binary and expose the global names for its start and end (ld script), and then iterate over the array, in your own code or in the HAL initialization procedure (SystemConfig, I think it was called).


#10

Okay, so I managed to get virtuals working. It was actually very simple, following the advice in the previously posted link.
Here are my changes for anyone reading along.

I started with stmlib/linker_scripts/stm32f4xx_flash.ld as provided by stmlib. Then I added a section into the .text: part like this:

.text :
  {
    . = ALIGN(16);
    *(.text)                   /* remaining code */
    *(.text.*)                 /* remaining code */
    *(.rodata)                 /* read-only data (constants) */
    *(.rodata*)
    *(.glue_7)
    *(.glue_7t)
    KEEP (*(.init))
    KEEP (*(.fini))

    . = ALIGN(4);
    __init_array_start = .;
    KEEP(*(.init_array))      /* C++ constructors */
    KEEP(*(.ctors))           /* and vtable init */
    __init_array_end = .;

    . = ALIGN(16);
     _etext = .;
     _sidata = _etext;
  } >FLASH

I decided to not call the constructors in the startup file, but in my main(), so I can add RAM initialization and other low level init functions before allowing the C++ machinery to kick off. The function was basically copied from the mentioned link.

static void callConstructors()
{
    // Start and end points of the constructor list,
    // defined by the linker script.
    extern void (*__init_array_start)();
    extern void (*__init_array_end)();

    // Call each function in the list.
    // We have to take the address of the symbols, as __init_array_start *is*
    // the first function pointer, not the address of it.
    for (void (**p)() = &__init_array_start; p < &__init_array_end; ++p) {
        (*p)();
    }
}

int main()
{
   // low level init things here
   callConstructors(); 

   ...
}

The only thing left now is to fix some of the complaints the linker had at this point. I was missing an operator delete(void*) (mentioned in the link as well) and a __cxa_pure_virtual() handler. I added both to the project like this (a good place to put those is where you have your handlers for hard fault, systick, bus fault, etc.):

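// __cxa_pure_virtual is called by name from the C++ runtime, so it needs C linkage.
extern "C" void __cxa_pure_virtual() { while (1); } // nice to pick this up with the debugger
// operator delete is a C++ function, so it stays outside extern "C".
void operator delete(void*) { while (1); } // will never need this in this project, so this is a dummy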

This seems to do it for me. Constructors for globals are called and virtual functions seem to work as well.
If you get undefined reference to 'vtable for XYZ', then you probably forgot to give an implementation for a non-pure virtual function of that class (had that error before & had it again… some mistakes you make over and over…)

Thanks @pichenettes and @mqtthiqs!


#11

Ok, I’ve spent a(nother) good two hours tonight to get to the bottom of the virtuals vs. ctors vs. dtors debate. What I found is that there is virtually no reason not to use C++ facilities, if you build your code correctly:

  1. the size of the .init_array section is exactly 4 * the number of global objects that have constructor code. In other words, it costs exactly the same as manually calling Init() functions (and is safer since it’s automatic).
  2. no need for a custom callConstructors function, a function from libc does exactly this: __libc_init_array. In the startup file given by ST, it is called right before main. Incidentally, it also calls the functions in .preinit_array, but I don’t know what this one contains.
  3. the global destructors code is linked because of libc: the _exit function, which is what is called by the OS when main() returns, is forced in. It traverses the section .fini_array and calls all its functions. Of course this makes sense only in the context of an OS, which we don’t have so we’d better get rid of it.
  4. By default, arm-none-eabi-gcc links a small, portable libc called newlib. To reduce code size, you can:
    • swap newlib for its super-compact equivalent newlib-nano. It’s a matter of passing --specs=nano.specs at link time (it is an option of the gcc driver, so this assumes you link through arm-none-eabi-gcc rather than calling ld directly) and it will reclaim ~1KB code size.
    • get rid of libc altogether (pass -nostdlib at link time, again through the gcc driver). In this case, the destructor code is not referenced anymore so it is discarded: no more .fini_array nor _exit function. Of course, in this case you’ll have to use your own memcpy, printf etc. This will reclaim ~100B more than using newlib-nano (… which is proof that the destructor code was really not that big in the first place).
  5. For some reason, function __libc_init_array will still be linked even if you don’t have a libc (-nostdlib). As part of its job, __libc_init_array also calls _init (in newlib, as far as I can tell, between the .preinit_array and .init_array passes), which is normally provided to you by libc but which you now have to define yourself.
  6. Now to the interesting part: using virtuals will have little impact on binary size (~200B), provided that you use neither RTTI (-fno-rtti) nor exceptions (-fno-exceptions). The code below compiles to 1.47KB in -O0, against 1.27KB if I modify it trivially to not use virtuals. I still have to test the runtime overhead.

So here is the code I used, a simple blinky that runs on the stm32f3discovery board:

#include "stm32f3xx.h"

struct C {
  virtual void delay() = 0;
};

struct D1 : C {
  void delay() {
    for (float x=0.0f; x<1.0f; x += 0.00001f);
  }
} d1;

struct D2 : C {
  void delay() { // 3x faster
    for (float x=0.0f; x<1.0f; x += 0.00003f);
  }
} d2;

struct Main {
  Main() {
    /* LEDs initialization */
    RCC->AHBENR |= (1 << 21);     /* enable GPIO E clock */
    GPIOE->MODER |= 0x55550000;   /* configure E8-E15 for output */

    /* Button initialization */
    RCC->AHBENR |= (1 << 17);     /* enable GPIO A clock */
    GPIOA->MODER |= 0x00000000;   /* configure A0 for digital input */

    while(1) {
      GPIOE->ODR ^= 0x0000FF00;   /* toggle pins 8-15 */
      C *t[2] = {&d1, &d2};
      t[GPIOA->IDR & 1]->delay();
    }
  }
} main;

extern "C" {
  __weak void _init() { }
  void __cxa_pure_virtual() { while (1); }
}
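
For readers wondering what __libc_init_array does with these sections, here is a host-side sketch of the sequence, as far as I understand newlib’s behaviour: .preinit_array first, then _init, then .init_array. It is a simulation, not the library source. The two arrays stand in for the linker sections, and every name in it (InitFn, init_stub, run_init_arrays, the trace helpers) is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

typedef void (*InitFn)();

// Records the order in which the functions below are called.
static int trace[4];
static std::size_t trace_len = 0;
static void record(int id) {
  assert(trace_len < 4);  // the trace buffer holds four entries
  trace[trace_len++] = id;
}

static void preinit_fn() { record(1); }
static void init_stub()  { record(2); }  // plays the role of _init()
static void ctor_a()     { record(3); }  // stands for one global constructor
static void ctor_b()     { record(4); }  // stands for another one

// Stand-ins for the .preinit_array and .init_array sections. On the target,
// the bounds come from linker-script symbols instead of array sizes.
static InitFn preinit_array[] = { preinit_fn };
static InitFn init_array[]    = { ctor_a, ctor_b };

// Calls every function pointer in [begin, end).
static void call_all(InitFn* begin, InitFn* end) {
  for (InitFn* p = begin; p < end; ++p) (*p)();
}

// Assumed order: preinit functions, then _init, then the constructors.
void run_init_arrays() {
  call_all(preinit_array,
           preinit_array + sizeof(preinit_array) / sizeof(preinit_array[0]));
  init_stub();
  call_all(init_array,
           init_array + sizeof(init_array) / sizeof(init_array[0]));
}

std::size_t trace_count() { return trace_len; }
int trace_at(std::size_t i) { return trace[i]; }
```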

#12

This is very interesting, thanks for sharing this information.

The benefit from this is that you can have low level initialization of critical board functionality (external RAM, power delivery stuff for other hardware, external watchdogs or whatever else must be initialized at the very beginning) before your global ctors.
Yes, of course you can add all those things to your startup file before the call to __libc_init_array.
I guess ultimately it’s a matter of personal taste and project requirements where the initialization should go.

That’s assuming all global constructors are independent from each other. Custom Init() functions give you the freedom to define the exact sequence of initialization. (By the way: If I have two global objects, where one references the other in its ctor - will the linker make sure they are called in the right order? Same goes for dtors. Just thinking loudly here. I’ll try this later and see what happens).

What means “using virtuals” and “not using virtuals” in this case? Are we talking about including/not including the whole .init_array or simply modifying the code not to use any virtuals?


#13

The benefit from this is that you can have low level initialization of critical board functionality (external RAM, power delivery stuff for other hardware, external watchdogs or whatever else must be initialized at the very beginning) before your global ctors.

Well, “low level initialization” is only what you make of it… Why do you need it to be done before the ctors? Why not run this code in a constructor, so as to encapsulate a “driver” for each peripheral in a class? Then just by putting an instance of it in your Main you initialize it… seems more elegant to me.

That’s assuming all global constructors are independent from each other. Custom Init() functions give you the freedom to define the exact sequence of initialization. (By the way: If I have two global objects, where one references the other in its ctor - will the linker make sure they are called in the right order? Same goes for dtors. Just thinking loudly here. I’ll try this later and see what happens).

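Within a single source file, the standard guarantees that global constructors run in the order the objects are defined, and that class members are constructed in the order they are declared, recursively. Across source files the standard leaves the order unspecified; in practice it depends on the link order. So you have complete control over what executes when, as long as the globals that depend on each other live in the same file. Custom Init() functions give you nothing more except the freedom to forget to call them, or call them multiple times by mistake :slight_smile: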
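
A small host-side sketch of the ordering within one source file. Tracer, Board and the accessor functions are hypothetical names; the sketch says nothing about the order between globals defined in different files.

```cpp
#include <cassert>

// Plain ints are zero-initialized before any constructor runs,
// so they are safe to use from the constructors below.
static int order[4];
static int order_len = 0;

// Appends its id to the log when constructed.
struct Tracer {
  explicit Tracer(int id) {
    assert(order_len < 4);  // the log holds four entries
    order[order_len++] = id;
  }
};

// Members are constructed in declaration order: first, then second.
struct Board {
  Board() : first(1), second(2) {}
  Tracer first;
  Tracer second;
};

static Board board;     // defined first: logs 1, then 2
static Tracer late(3);  // defined later in the same file: logs 3

int constructed_count() { return order_len; }
int constructed_at(int i) { return order[i]; }
```

Swapping the two definitions at the bottom would make 3 appear first in the log.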

What means “using virtuals” and “not using virtuals” in this case? Are we talking about including/not including the whole .init_array or simply modifying the code not to use any virtuals?

Yes sorry, I was a bit quick. .init_array is completely orthogonal to virtuals; it is only used to call global constructors.

“Using virtuals” here meant writing code that forces the compiler to include a vtable in the runtime representation of your class’s object. Just because you write “virtual” in your code doesn’t mean that the compiler will have to distinguish methods at run time; it could optimize away the “virtualness” of it by for example realizing that the virtual method is never called, or that there is only one implementation of it etc.

In my example, the call to a particular implementation of delay() depends on something that can only be determined at run time (the state of an input pin, GPIOA->IDR & 1), so we have to be able to distinguish at run time between the two instances d1 and d2. If I change this line into simply d1.delay(), the compiler is smart enough to see that only the code from D1 will ever be called, and therefore does not include the vtables.

Hope it helps.


#14

I think the issue arises when you have globals that somehow rely on certain peripherals to be initialized. I’m trying to think of an example.
Let’s assume I have a class (A) that manages large amounts of storage that sit in external RAM. The constraint is that the memory controller must be initialized before this class starts its work. I see multiple ways of achieving this:

  1. Instantiate my class A in main() and do the RAM initialization before that (drawback: My class A is not global, so I have to pass references to everything using it - or have a globally defined pointer to it somewhere - but then I have to make sure no global ctor tries to access it. Meh.)
  2. Have my class A instantiated as a global (via .init_array) and the RAM initialization in the startup file before the .init_array is invoked.
  3. Have my class A instantiated as a global and the RAM initialization in my main(), followed by a custom callConstructors()
  4. Have my class A instantiated as a global and my RAM initialization inside a “driver” class B that is also instantiated at top level. Now I must make sure that the latter is called first. No problem if the two global instances are defined in the same file, then I can simply put one before the other. But if they are in different files and are - as you said - instantiated in link order, then I’d have to make sure the link order is right. That sounds like a horrible solution to me.
  5. Do the RAM initialization in the class A's ctor. That only works if A is the only thing ever dealing with any data in the external RAM. But even then I would consider this a horrible design.

To my (dumb) understanding, option 2 or 3 sounds best here. But maybe I’m not seeing the whole picture clearly enough yet. I guess it boils down to whether you want the benefits of global ctors at all. And this is all probably mostly dependent on the specific requirements of a project and the preferred style.

Oh yes it does! Once again, thanks for sharing. I’m learning a ton right now.


#15

Yes I agree with your point about option 4; it can get ugly if you start putting globals everywhere. But if you restrict to using them in one file only (main.cc) you get something like:

RamDriver.cc: 
  struct RamDriver { RamDriver() { your_init_code_here(); } };
ADriver.cc:
  struct ADriver { ADriver() { uses_ram(); } };
main.cc:
  RamDriver ram;
  ADriver a;
  struct Main { Main() { do_stuff_using_ram_and_a(); }} main;

It’s very clear that Ram will be initialized before A, isn’t it?


#16

Super helpful tip! Seems to push the code size of the project right back to what it was before I enabled the ctors/virtuals.

Now the dilemma: should I deploy the code to the new batch of modules, or should I keep the old (virtual-free) version that got many, many more hours of tests…