[Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
@ 2012-12-20 15:04 freddie_chopin at op dot pl
  2012-12-20 15:24 ` [Bug rtl-optimization/55757] " freddie_chopin at op dot pl
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: freddie_chopin at op dot pl @ 2012-12-20 15:04 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

             Bug #: 55757
           Summary: Suboptimal interrupt prologue/epilogue for ARMv7-M
                    (Cortex-M3)
    Classification: Unclassified
           Product: gcc
           Version: 4.6.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: freddie_chopin@op.pl


With a Cortex-M3 microcontroller (ARMv7-M) and an empty interrupt handler
function:

void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{

}

Results in suboptimal prologue/epilogue:

000023b4 <DMA_IRQHandler>:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    23b4:    4668          mov    r0, sp
    23b6:    f020 0107     bic.w    r1, r0, #7
    23ba:    468d          mov    sp, r1
    23bc:    b501          push    {r0, lr}
}
    23be:    e8bd 4001     ldmia.w    sp!, {r0, lr}
    23c2:    4685          mov    sp, r0
    23c4:    4770          bx    lr
    ...

Without the __attribute__ the function is fine, just a single "bx lr".

This behavior is incorrect not only because r0 and lr are unused, but also
because on ARMv7-M these registers (r0-r3, lr, ip, sp, pc, psr) are saved by
hardware, so there's no point in saving them again.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
@ 2012-12-20 15:24 ` freddie_chopin at op dot pl
  2012-12-20 16:52 ` rearnsha at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: freddie_chopin at op dot pl @ 2012-12-20 15:24 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #1 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-20 15:23:25 UTC ---
BTW - it seems that optimization settings don't make any difference here - the
code below was compiled with -Os, on all other levels (1,2,3) the assembly
looks like this:

00002e90 <DMA_IRQHandler>:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    2e90:    4668          mov    r0, sp
    2e92:    f020 0107     bic.w    r1, r0, #7
    2e96:    468d          mov    sp, r1
    2e98:    b401          push    {r0}
}
    2e9a:    bc01          pop    {r0}
    2e9c:    4685          mov    sp, r0
    2e9e:    4770          bx    lr


So it just saves r0 only, without saving lr. It's actually 2 bytes smaller than
the assembly done for size optimizations (;

Without optimization (-O0) I get:

0000473c <DMA_IRQHandler>:
void DMA_IRQHandler(void) __attribute((interrupt));
void DMA_IRQHandler(void)
{
    473c:    4668          mov    r0, sp
    473e:    f020 0107     bic.w    r1, r0, #7
    4742:    468d          mov    sp, r1
    4744:    b481          push    {r0, r7}
    4746:    af00          add    r7, sp, #0
}
    4748:    46bd          mov    sp, r7
    474a:    bc81          pop    {r0, r7}
    474c:    4685          mov    sp, r0
    474e:    4770          bx    lr

The commandline options used to compile:

arm-none-eabi-gcc -c -mcpu=cortex-m3 -mthumb -O0 -ffunction-sections
-fdata-sections -Wall -Wstrict-prototypes -Wextra -std=gnu99 -g -ggdb3
-fverbose-asm -Wa,-ahlms=out/uart.lst   -MD -MP -MF out/uart.d <some include
dirs> <input file> <output file>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
  2012-12-20 15:24 ` [Bug rtl-optimization/55757] " freddie_chopin at op dot pl
@ 2012-12-20 16:52 ` rearnsha at gcc dot gnu.org
  2012-12-20 17:08 ` freddie_chopin at op dot pl
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2012-12-20 16:52 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P5
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-12-20
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2012-12-20 16:52:05 UTC ---
The code is there to re-align the stack to 64-bit alignment as required by the
ABI (early versions of the M3 did not have the ability to do this in HW).  The
reason two registers are pushed, rather than one is that this is also needed to
keep the stack aligned and pushing two registers uses less code than adjusting
the stack in a separate insn.

Of course, in this trivial case, the stack realignment isn't necessary as the
compiler should be able to tell that nothing requires re-alignment of the
stack.  But it's a corner case and it's much more common for this to be needed.

If you really know that you don't need stack-alignment on an M3, then just
remove the interrupt attribute.  It really doesn't serve any other purpose on
M-profile cores other than to cause the stack realignment.

Marking as an enhancement.  The code generated today is correct, but
sub-optimal.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
  2012-12-20 15:24 ` [Bug rtl-optimization/55757] " freddie_chopin at op dot pl
  2012-12-20 16:52 ` rearnsha at gcc dot gnu.org
@ 2012-12-20 17:08 ` freddie_chopin at op dot pl
  2012-12-21  3:23 ` joey.ye at arm dot com
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: freddie_chopin at op dot pl @ 2012-12-20 17:08 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #3 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-20 17:07:47 UTC ---
Indeed that's a trivial case, but other - useful - cases also show strange
behavior which I cannot clearly explain, so while we're at it I'd be grateful
for some explanation...

An interrupt handler function (void something(void)), but without attribute,
doing something inside (posts a FreeRTOS semaphore, calls vPortYieldFromISR()
if it's needed) actually saves a lot of registers on entry:
    23b4:    b507          push    {r0, r1, r2, lr}
>From what I know r0-r3 as scratch registers don't need to be saved on entry, as
it's the callers duty. There are also no parameters to be saved, as it's a void
function...

I observed the same behavior with some non-trivial functions from the lwIP
TCP/IP stack - they are also save scratch registers on entry, even when they
are void ...(void):

00005d00 <dns_init>:
void
dns_init()
{
    5d00:    b537          push    {r0, r1, r2, r4, r5, lr}

Is that a bug or maybe I don't understand the calling conventions? <;

BTW:
> The reason two registers are pushed, rather than one is that this is also needed to
> keep the stack aligned and pushing two registers uses less code than adjusting the stack in a separate insn.

But for optimization level 1, 2 and 3 only one reg is pushed...

Thx in advance!


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
                   ` (2 preceding siblings ...)
  2012-12-20 17:08 ` freddie_chopin at op dot pl
@ 2012-12-21  3:23 ` joey.ye at arm dot com
  2012-12-21  3:32 ` joey.ye at arm dot com
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: joey.ye at arm dot com @ 2012-12-21  3:23 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Joey Ye <joey.ye at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joey.ye at arm dot com

--- Comment #4 from Joey Ye <joey.ye at arm dot com> 2012-12-21 03:23:07 UTC ---
> An interrupt handler function (void something(void)), but without attribute,
> doing something inside (posts a FreeRTOS semaphore, calls vPortYieldFromISR()
> if it's needed) actually saves a lot of registers on entry:
>     23b4:    b507          push    {r0, r1, r2, lr}
Pushing of scratch registers can be used to 
1. align stack, which Richard has explained
2. allocate stack frame, as a code size optimization of sub sp, #x

Explain with following example:
extern void bar(int *, int *);
void foo()
{
    int a, b;
    bar(&a, &b);
}
Built with -Os -mcpu=cortex-m3:
push {r0, r1, r2, lr} 

Here, pushing of r0 and r1 allocates a 8-byte frame for local variables.
Pushing of r2 is to make sp aligned to 8 bytes together with pushing lr. Values
of r0-r2 pushed to stack don't really matter.

But built with -O2:
    push    {lr}
    sub sp, sp, #12

Former is better on code size, latter wins on performance. Hopefully this
explains everything.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
                   ` (3 preceding siblings ...)
  2012-12-21  3:23 ` joey.ye at arm dot com
@ 2012-12-21  3:32 ` joey.ye at arm dot com
  2012-12-21  7:09 ` freddie_chopin at op dot pl
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: joey.ye at arm dot com @ 2012-12-21  3:32 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #5 from Joey Ye <joey.ye at arm dot com> 2012-12-21 03:32:21 UTC ---
However, there is room to improve both performance and stack consumption in
case of Os:

extern void bar(int *);

void foo()
{
    int a;
    bar(&a);
}

Built with -mcpu=cortex-m3 -Os:
    push    {r0, r1, r2, lr}
    add    r0, sp, #4
    bl    bar
    pop    {r1, r2, r3, pc}

Apparently it should be optimized to save 8 bytes of stack consumption and two
stores:
    push    {r0, lr}
    mov    r0, sp
    bl    bar
    pop    {r1, pc}


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
                   ` (4 preceding siblings ...)
  2012-12-21  3:32 ` joey.ye at arm dot com
@ 2012-12-21  7:09 ` freddie_chopin at op dot pl
  2013-03-02 15:23 ` web at brolinembedded dot se
  2023-11-28 23:10 ` yann at poupet dot eu
  7 siblings, 0 replies; 9+ messages in thread
From: freddie_chopin at op dot pl @ 2012-12-21  7:09 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

--- Comment #6 from Freddie Chopin <freddie_chopin at op dot pl> 2012-12-21 07:08:59 UTC ---
(In reply to comment #4)
> Former is better on code size, latter wins on performance. Hopefully this
> explains everything.

Indeed, it's clear now. Thank you for your time!


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
                   ` (5 preceding siblings ...)
  2012-12-21  7:09 ` freddie_chopin at op dot pl
@ 2013-03-02 15:23 ` web at brolinembedded dot se
  2023-11-28 23:10 ` yann at poupet dot eu
  7 siblings, 0 replies; 9+ messages in thread
From: web at brolinembedded dot se @ 2013-03-02 15:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Timmy Brolin <web at brolinembedded dot se> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |web at brolinembedded dot
                   |                            |se

--- Comment #7 from Timmy Brolin <web at brolinembedded dot se> 2013-03-02 15:23:11 UTC ---
(In reply to comment #2)
> If you really know that you don't need stack-alignment on an M3, then just
> remove the interrupt attribute.  It really doesn't serve any other purpose on
> M-profile cores other than to cause the stack realignment.

What you suggest requires a change in the C-code depending on the processor.
That is, one piece of C-code will not compile optimally for different Cortex-M3
revisions without modifications to the C-code itself. This is not good for code
which is intended to be used on multiple platforms.

Cortex-M3 r0p0 needs the prologue/epilogue.

Cortex-M3 r1p0 has a new configuration bit called STKALIGN which when enabled
makes the prologue/epilogue unnecessary. (But the default setting is that it
still needs the prologue/epilogue)

Cortex-M3 r2p0 changed the default setting of STKALIGN so that the
prologue/epilogue are unnecessary by default.

I would suggest that the prologue/epilogue should be removed by default when
compiling for r2p0 or higher, but be kept by default for older revisions.
There should also be a compilation switch to manually enable/disable the
prologue/epilogue according to the chosen setting of STKALIGN.

Interrupts can often be time critical, so ISR entry is probably the worst
possible place for extra instructions.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug rtl-optimization/55757] Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3)
  2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
                   ` (6 preceding siblings ...)
  2013-03-02 15:23 ` web at brolinembedded dot se
@ 2023-11-28 23:10 ` yann at poupet dot eu
  7 siblings, 0 replies; 9+ messages in thread
From: yann at poupet dot eu @ 2023-11-28 23:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55757

Yann Poupet <yann at poupet dot eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yann at poupet dot eu

--- Comment #8 from Yann Poupet <yann at poupet dot eu> ---
Hi,

I had the same issue and modified GCC so that the prologue/epilogue do not
save/restore R4-R11 if it's not required - assuming these are caller-saved
(same effect as -fcall-used-[r4...r11]), with a new function attribute. I'm not
sure if this has a chance to be accepted upstream though.

I'm using it in 2 cases:

- ISR as described above
- for the tasks launched by my own home made microkernel. Indeed, when the
kernel starts a task starting its entry function, there's no need to save any
register, it's just a waste of stack space.

Anyone still interested with a solution ?
The patch is very small, maybe 10 lines.

Cheers
Yann

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-11-28 23:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-20 15:04 [Bug rtl-optimization/55757] New: Suboptimal interrupt prologue/epilogue for ARMv7-M (Cortex-M3) freddie_chopin at op dot pl
2012-12-20 15:24 ` [Bug rtl-optimization/55757] " freddie_chopin at op dot pl
2012-12-20 16:52 ` rearnsha at gcc dot gnu.org
2012-12-20 17:08 ` freddie_chopin at op dot pl
2012-12-21  3:23 ` joey.ye at arm dot com
2012-12-21  3:32 ` joey.ye at arm dot com
2012-12-21  7:09 ` freddie_chopin at op dot pl
2013-03-02 15:23 ` web at brolinembedded dot se
2023-11-28 23:10 ` yann at poupet dot eu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).