Non-optimal code generated for H8

public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed

* Non-optimal code generated for H8
@ 2019-10-29 20:03 Mikael Tillenius
  2019-10-29 20:19 ` Jeff Law
  0 siblings, 1 reply; 20+ messages in thread
From: Mikael Tillenius @ 2019-10-29 20:03 UTC (permalink / raw)
  To: gcc-help

Hi

I am using a cross compiler for Renesas H8S. In a few places it 
generates really bad code. Given the following program:

struct s {
 Â Â Â  char a, b;
 Â Â Â  char c[11];
} x[2];

void test(int n)
{
 Â Â Â  struct s *sp = &x[n];

 Â Â Â  sp->a = 1;
 Â Â Â  sp->b = 1;
}

I would expect that the pointer "sp" is calculated once and reused to 
access the fields "a" and "b". But instead the pointer is recalculated 
for each access. This generates a lot of extra code, including calls to 
__mulhi3. I have tested with gcc 8.2 and 9.2 and with different 
optimization levels (-O1, -O2, -Os) all with the same result. With -O0 
"sp" is only calculated once and kept as a variable on the stack but the 
rest of the code is not as good as it could be. The best work around 
seems to be to declare "sp" as volatile: "struct s *volatile sp = 
&x[n];". Then "sp" is only calculated once and kept on the stack and the 
surrounding code can be optimized.

So my question is: where should I start looking for a fix to this? The 
other targets I tried (ARM, x86, x86_64) behave as expected and 
calculates the pointer once and keeps it in a register.

/Mikael

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-29 20:03 Non-optimal code generated for H8 Mikael Tillenius
@ 2019-10-29 20:19 ` Jeff Law
  2019-10-30  0:34   ` Segher Boessenkool
  2019-10-30 16:12   ` Mikael Tillenius
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Law @ 2019-10-29 20:19 UTC (permalink / raw)
  To: Mikael Tillenius, gcc-help

On 10/29/19 2:03 PM, Mikael Tillenius wrote:
> Hi
> 
> I am using a cross compiler for Renesas H8S. In a few places it
> generates really bad code. Given the following program:
> 
> struct s {
>     char a, b;
>     char c[11];
> } x[2];
> 
> void test(int n)
> {
>     struct s *sp = &x[n];
> 
>     sp->a = 1;
>     sp->b = 1;
> }
> 
> I would expect that the pointer "sp" is calculated once and reused to
> access the fields "a" and "b". But instead the pointer is recalculated
> for each access. This generates a lot of extra code, including calls to
> __mulhi3. I have tested with gcc 8.2 and 9.2 and with different
> optimization levels (-O1, -O2, -Os) all with the same result. With -O0
> "sp" is only calculated once and kept as a variable on the stack but the
> rest of the code is not as good as it could be. The best work around
> seems to be to declare "sp" as volatile: "struct s *volatile sp =
> &x[n];". Then "sp" is only calculated once and kept on the stack and the
> surrounding code can be optimized.
> 
> So my question is: where should I start looking for a fix to this? The
> other targets I tried (ARM, x86, x86_64) behave as expected and
> calculates the pointer once and keeps it in a register.
As we leave gimple the code looks like:

  MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1;
  MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1;

One might argue that DOM or FRE should have created a common
subexpression for the address arithmetic here.  Even so it's not bad.

CSE doesn't do its job though.  THere's clearly a REG_EQUAL note which
should have allowed it to at least cleanup the redundant multiplication
for the address calculation.

I recommend filing a bug report.

Note that the H8 port is on the list of ports that are be deprecated in
gcc-10 unless someone steps forward to take care of some significant
maintenance tasks.   Deprecation in gcc-10 would mean the port would be
removed in gcc-11 unless someone steps up to take care of those
maintenance tasks.

Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-29 20:19 ` Jeff Law
@ 2019-10-30  0:34   ` Segher Boessenkool
  2019-10-30  9:07     ` David Brown
  2019-10-30 16:12   ` Mikael Tillenius
  1 sibling, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2019-10-30  0:34 UTC (permalink / raw)
  To: Jeff Law; +Cc: Mikael Tillenius, gcc-help

On Tue, Oct 29, 2019 at 02:19:25PM -0600, Jeff Law wrote:
> On 10/29/19 2:03 PM, Mikael Tillenius wrote:
> > I am using a cross compiler for Renesas H8S. In a few places it
> > generates really bad code. Given the following program:
> > 
> > struct s {
> > Â Â Â  char a, b;
> > Â Â Â  char c[11];
> > } x[2];
> > 
> > void test(int n)
> > {
> > Â Â Â  struct s *sp = &x[n];
> > 
> > Â Â Â  sp->a = 1;
> > Â Â Â  sp->b = 1;
> > }

> As we leave gimple the code looks like:
> 
>   MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1;
>   MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1;
> 
> One might argue that DOM or FRE should have created a common
> subexpression for the address arithmetic here.  Even so it's not bad.
> 
> CSE doesn't do its job though.  THere's clearly a REG_EQUAL note which
> should have allowed it to at least cleanup the redundant multiplication
> for the address calculation.

And on other targets it does do its job fine, say riscv32, or m68k -O1
(the -O1 to prevent the two stores from being optimised into one).

I haven't managed to find another target where multiplication by 13 is
done with a libcall though.  Maybe I should look harder.

> I recommend filing a bug report.

+1


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30  0:34   ` Segher Boessenkool
@ 2019-10-30  9:07     ` David Brown
  2019-10-30 16:32       ` Mikael Tillenius
                         ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: David Brown @ 2019-10-30  9:07 UTC (permalink / raw)
  To: Segher Boessenkool, Jeff Law; +Cc: Mikael Tillenius, gcc-help

On 30/10/2019 01:34, Segher Boessenkool wrote:
> On Tue, Oct 29, 2019 at 02:19:25PM -0600, Jeff Law wrote:
>> On 10/29/19 2:03 PM, Mikael Tillenius wrote:
>>> I am using a cross compiler for Renesas H8S. In a few places it
>>> generates really bad code. Given the following program:
>>>
>>> struct s {
>>> Â Â Â  char a, b;
>>> Â Â Â  char c[11];
>>> } x[2];
>>>
>>> void test(int n)
>>> {
>>> Â Â Â  struct s *sp = &x[n];
>>>
>>> Â Â Â  sp->a = 1;
>>> Â Â Â  sp->b = 1;
>>> }
> 
>> As we leave gimple the code looks like:
>>
>>   MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1;
>>   MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1;
>>
>> One might argue that DOM or FRE should have created a common
>> subexpression for the address arithmetic here.  Even so it's not bad.
>>
>> CSE doesn't do its job though.  THere's clearly a REG_EQUAL note which
>> should have allowed it to at least cleanup the redundant multiplication
>> for the address calculation.
> 
> And on other targets it does do its job fine, say riscv32, or m68k -O1
> (the -O1 to prevent the two stores from being optimised into one).
> 
> I haven't managed to find another target where multiplication by 13 is
> done with a libcall though.  Maybe I should look harder.
> 

I checked on the 8-bit AVR, which is (I think) the smallest and simplest
device targeted by gcc, and where only some devices have multiplication
instructions.  If you optimise for size (-Os), it uses a library call
__mulhi3 for the multiplication:

test:
        ldi r22,lo8(13)
        ldi r23,0
        rcall __mulhi3
        subi r24,lo8(-(x))
        sbci r25,hi8(-(x))
        ldi r18,lo8(1)
        mov r26,r24
        mov r27,r25
        st X,r18
        mov r30,r26
        mov r31,r27
        std Z+1,r18
        ret

With -O1 or -O2, it uses shifts and adds for the multiply:

test:
        mov r30,r24
        mov r31,r25
        lsl r30
        rol r31
        add r30,r24
        adc r31,r25
        lsl r30
        rol r31
        lsl r30
        rol r31
        add r24,r30
        adc r25,r31
        mov r30,r24
        mov r31,r25
        subi r30,lo8(-(x))
        sbci r31,hi8(-(x))
        ldi r24,lo8(1)
        st Z,r24
        std Z+1,r24
        ret

(I used an old version of avr-gcc here, version 5.4.0, just because it
is on the extremely useful <https://godbolt.org> online compiler site.
Things may have changed for later versions, but usually the AVR port is
fairly stable.)

One thing to note here is that with -O1, the compiler calculates "sp" in
the "Z" register, then stores 1 into [sp] and [sp+1].  With the
multiplication, sp is calculated in a non-pointer register pair and the
compiler generates sub-optimal code for storing in [sp] and [sp+1].  If
that is still the case for modern gcc, it could be filed as a missed
optimisation bug for the AVR backend.

But this also brings up another idea.  Is the OP using "-Os"
optimisation?  My experience (especially with AVR, msp430, and ARM
Cortex-M targets) is that "-Os" optimisation is often quite poor
compared to "-O2".  It can result in very significantly slower code for
a saving of a couple of bytes, and in some cases the code can be
significantly /bigger/ than with -O2.  I don't know whether this is a
backend issue, or a general problem with "-Os", but I no longer use or
recommend "-Os" even for tiny embedded systems.

So perhaps simply changing from "-Os" to "-O2" will fix the OP's problems.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30  9:07     ` David Brown
@ 2019-10-30 16:32       ` Mikael Tillenius
  2019-10-30 19:53       ` Mikael Tillenius
  2019-10-30 22:49       ` Segher Boessenkool
  2 siblings, 0 replies; 20+ messages in thread
From: Mikael Tillenius @ 2019-10-30 16:32 UTC (permalink / raw)
  To: gcc-help


On 30/10/2019 01:34, Segher Boessenkool wrote:
>> And on other targets it does do its job fine, say riscv32, or m68k -O1
>> (the -O1 to prevent the two stores from being optimised into one).
>>
>> I haven't managed to find another target where multiplication by 13 is
>> done with a libcall though.Â  Maybe I should look harder.
>>
Just to be clear. The call to __mulhi3 by itself is reasonable since the 
H8 has quite weak shift instructions. And no 32 bit by 32 bit 
multiplication instruction. It is the repeated re-calculation of the 
pointer that is the big problem. The code gets both big and slow.

On 10/30/19 10:06 AM, David Brown wrote:
> But this also brings up another idea.Â  Is the OP using "-Os"
> optimisation?Â  My experience (especially with AVR, msp430, and ARM
> Cortex-M targets) is that "-Os" optimisation is often quite poor
> compared to "-O2".Â  It can result in very significantly slower code for
> a saving of a couple of bytes, and in some cases the code can be
> significantly /bigger/ than with -O2.Â  I don't know whether this is a
> backend issue, or a general problem with "-Os", but I no longer use or
> recommend "-Os" even for tiny embedded systems.

Yes, I tried all of -O1, -O2 and -Os and probably a few more.

My experience with -OsÂ  on ARM Thumb2 is much more positive than your. 
"-Os -mslow-flash-data" gives small and quite fast code most of the time.

/Mikael

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30  9:07     ` David Brown
  2019-10-30 16:32       ` Mikael Tillenius
@ 2019-10-30 19:53       ` Mikael Tillenius
  2019-10-30 19:59         ` Jeff Law
  2019-10-30 22:49       ` Segher Boessenkool
  2 siblings, 1 reply; 20+ messages in thread
From: Mikael Tillenius @ 2019-10-30 19:53 UTC (permalink / raw)
  To: gcc-help

The AVR example was interesting since it generates reasonable code.

After a few experiments I strongly believe that it is the call to 
__mulhi3 that stops GCC from doing CSE on the pointer expressions. If I 
add a fake mulhi3 insn CSE seems to do its work and just generates the 
address calculation once. But of course the code will not work since the 
instruction is not present in the hardware. I see at least three 
possible solutions:

1) Teach CSE to work even for calls to __mulhi3

2) Add an insn that generates the H8S mulxu.w (16x16 => 32) instruction 
in this case. "int" is 16-bit and the size of the struct is also small 
in this case.

3) Do something similar to AVR. It seems to have an mulhi3 insn that 
expands to a call to __mulhi3 after CSE.

Now the only problem is that I have no idea how to do any of the 
above;-) Any hints?

/Mikael

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30 19:53       ` Mikael Tillenius
@ 2019-10-30 19:59         ` Jeff Law
  2019-10-31  1:01           ` Segher Boessenkool
  2019-11-01 13:16           ` Mikael Tillenius
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Law @ 2019-10-30 19:59 UTC (permalink / raw)
  To: Mikael Tillenius, gcc-help

On 10/30/19 1:53 PM, Mikael Tillenius wrote:
> The AVR example was interesting since it generates reasonable code.
> 
> After a few experiments I strongly believe that it is the call to
> __mulhi3 that stops GCC from doing CSE on the pointer expressions. If I
> add a fake mulhi3 insn CSE seems to do its work and just generates the
> address calculation once. But of course the code will not work since the
> instruction is not present in the hardware. I see at least three
> possible solutions:
> 
> 1) Teach CSE to work even for calls to __mulhi3
THe name shouldn't matter at all to CSE.  What's important to CSE is
interpreting the REG_EQUAL note in this case and finding the bounds of
the insn chain that computes the value in the REG_EQUAL note.

> 
> 2) Add an insn that generates the H8S mulxu.w (16x16 => 32) instruction
> in this case. "int" is 16-bit and the size of the struct is also small
> in this case.
Note that GCC should know how to generate mulxu.w for the H8/300H and
above variants.  Just quickly looking at the MD file I see:


> (define_expand "umulhisi3"
>   [(set (match_operand:SI 0 "register_operand" "")
>         (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
>                  ;; intentionally-mismatched modes
>                  (match_operand:HI 2 "reg_or_nibble_operand" "")))]
>   "TARGET_H8300H || TARGET_H8300S"
>   {
>     if (GET_MODE (operands[2]) != VOIDmode)
>       operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]);
>   })

Note the HImode on operand 2.  That seems wrong.  Hopefully it wasn't
something *I* did eons ago.  Anyway, that might be a place to start
looking.  I wouldn't be surprised if all the work we've done in the last
decade to tighten up our checking of predicates is somehow causing us to
not use umulhisi3 in this context.



Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30 19:59         ` Jeff Law
@ 2019-10-31  1:01           ` Segher Boessenkool
  2019-11-01 13:16           ` Mikael Tillenius
  1 sibling, 0 replies; 20+ messages in thread
From: Segher Boessenkool @ 2019-10-31  1:01 UTC (permalink / raw)
  To: Jeff Law; +Cc: Mikael Tillenius, gcc-help

On Wed, Oct 30, 2019 at 01:59:26PM -0600, Jeff Law wrote:
> On 10/30/19 1:53 PM, Mikael Tillenius wrote:
> > The AVR example was interesting since it generates reasonable code.
> > 
> > After a few experiments I strongly believe that it is the call to
> > __mulhi3 that stops GCC from doing CSE on the pointer expressions. If I
> > add a fake mulhi3 insn CSE seems to do its work and just generates the
> > address calculation once. But of course the code will not work since the
> > instruction is not present in the hardware. I see at least three
> > possible solutions:
> > 
> > 1) Teach CSE to work even for calls to __mulhi3
> THe name shouldn't matter at all to CSE.  What's important to CSE is
> interpreting the REG_EQUAL note in this case and finding the bounds of
> the insn chain that computes the value in the REG_EQUAL note.

Yeah, and the call insns *are* properly marked as "const" as well, so I
don't see what is going on (and the CSE dump file is content-less).

> > (define_expand "umulhisi3"
> >   [(set (match_operand:SI 0 "register_operand" "")
> >         (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
> >                  ;; intentionally-mismatched modes
> >                  (match_operand:HI 2 "reg_or_nibble_operand" "")))]
> >   "TARGET_H8300H || TARGET_H8300S"
> >   {
> >     if (GET_MODE (operands[2]) != VOIDmode)
> >       operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]);
> >   })
> 
> Note the HImode on operand 2.  That seems wrong.  Hopefully it wasn't
> something *I* did eons ago.  Anyway, that might be a place to start
> looking.  I wouldn't be surprised if all the work we've done in the last
> decade to tighten up our checking of predicates is somehow causing us to
> not use umulhisi3 in this context.

I get a __mulsi3 even, probably need more flags than just -ms :-)
(I used h8300-linux).


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30 19:59         ` Jeff Law
  2019-10-31  1:01           ` Segher Boessenkool
@ 2019-11-01 13:16           ` Mikael Tillenius
  2019-11-01 14:38             ` Jeff Law
  1 sibling, 1 reply; 20+ messages in thread
From: Mikael Tillenius @ 2019-11-01 13:16 UTC (permalink / raw)
  To: gcc-help


On 10/30/19 8:59 PM, Jeff Law wrote:
>> 1) Teach CSE to work even for calls to __mulhi3
> THe name shouldn't matter at all to CSE.  What's important to CSE is
> interpreting the REG_EQUAL note in this case and finding the bounds of
> the insn chain that computes the value in the REG_EQUAL note.
My guess would be that it is the *call* that is the problem, not exactly 
what it calls. But I really don't know what I am talking about. I never 
looked at the GCC sources until yesterday.
> Note that GCC should know how to generate mulxu.w for the H8/300H and
> above variants.  Just quickly looking at the MD file I see:
>
>
>> (define_expand "umulhisi3"
>>    [(set (match_operand:SI 0 "register_operand" "")
>>          (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" ""))
>>                   ;; intentionally-mismatched modes
>>                   (match_operand:HI 2 "reg_or_nibble_operand" "")))]
>>    "TARGET_H8300H || TARGET_H8300S"
>>    {
>>      if (GET_MODE (operands[2]) != VOIDmode)
>>        operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]);
>>    })
> Note the HImode on operand 2.  That seems wrong.  Hopefully it wasn't
> something *I* did eons ago.  Anyway, that might be a place to start
> looking.  I wouldn't be surprised if all the work we've done in the last
> decade to tighten up our checking of predicates is somehow causing us to
> not use umulhisi3 in this context.

I think the problem is the predicate "reg_or_nibble_operand". It only 
matches integers <= 15 and only on H8SX. So it cannot match 13 on my 
target (which is H8S) and cannot match sizeof(...) in general even on H8SX.

Also see my next message.

/Mikael


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-11-01 13:16           ` Mikael Tillenius
@ 2019-11-01 14:38             ` Jeff Law
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Law @ 2019-11-01 14:38 UTC (permalink / raw)
  To: Mikael Tillenius, gcc-help

On 11/1/19 7:16 AM, Mikael Tillenius wrote:
> 
> On 10/30/19 8:59 PM, Jeff Law wrote:
>>> 1) Teach CSE to work even for calls to __mulhi3
>> THe name shouldn't matter at all to CSE.  What's important to CSE is
>> interpreting the REG_EQUAL note in this case and finding the bounds of
>> the insn chain that computes the value in the REG_EQUAL note.
> My guess would be that it is the *call* that is the problem, not exactly
> what it calls. But I really don't know what I am talking about. I never
> looked at the GCC sources until yesterday.
There is certainly a problem in that CSE should know how to remove the
redundant call and failure to do so is likely a (low priority) bug.
But at least on the H8/300H and later processors the calls should be
totally unnecessary as we should be generating the multiply instruction.


>> Note that GCC should know how to generate mulxu.w for the H8/300H and
>> above variants.  Just quickly looking at the MD file I see:
>>
>>
>>> (define_expand "umulhisi3"
>>>    [(set (match_operand:SI 0 "register_operand" "")
>>>          (mult:SI (zero_extend:SI (match_operand:HI 1
>>> "register_operand" ""))
>>>                   ;; intentionally-mismatched modes
>>>                   (match_operand:HI 2 "reg_or_nibble_operand" "")))]
>>>    "TARGET_H8300H || TARGET_H8300S"
>>>    {
>>>      if (GET_MODE (operands[2]) != VOIDmode)
>>>        operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]);
>>>    })
>> Note the HImode on operand 2.  That seems wrong.  Hopefully it wasn't
>> something *I* did eons ago.  Anyway, that might be a place to start
>> looking.  I wouldn't be surprised if all the work we've done in the last
>> decade to tighten up our checking of predicates is somehow causing us to
>> not use umulhisi3 in this context.
> 
> I think the problem is the predicate "reg_or_nibble_operand". It only
> matches integers <= 15 and only on H8SX. So it cannot match 13 on my
> target (which is H8S) and cannot match sizeof(...) in general even on H8SX.
It's not the range of values, but the *modes* of those values.  ie, it's
perfectly fine to have a predicate that restricts the values to some
integer range.

The general structure of that expander is suspect because of the
difference in the structure of the two operands of the MULT rtx.  ie,
one is a (zero_extend:SI ...) and the other is a HImode
reg_or_nibble_operand.  I strongly suspect the generic expansion code is
claiming that umulhisi3 isn't supported for any cases because of this issue.

To further complicate matters GCC does not generally allow for
constructs such as ({zero,sign}_extend (const_int ...))) as those are
inherently constants and would be simplified by removing the extension.
 So it's actually fairly difficult to write that expander correctly.

Anyway, someone would need to debug the expander and recognition code.
It's not a high priority issue though.  But at least with the bug filed
it won't get lost.

Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30  9:07     ` David Brown
  2019-10-30 16:32       ` Mikael Tillenius
  2019-10-30 19:53       ` Mikael Tillenius
@ 2019-10-30 22:49       ` Segher Boessenkool
  2019-10-31  8:10         ` David Brown
  2019-11-03 12:24         ` Oleg Endo
  2 siblings, 2 replies; 20+ messages in thread
From: Segher Boessenkool @ 2019-10-30 22:49 UTC (permalink / raw)
  To: David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help

On Wed, Oct 30, 2019 at 10:06:57AM +0100, David Brown wrote:
> But this also brings up another idea.  Is the OP using "-Os"
> optimisation?  My experience (especially with AVR, msp430, and ARM
> Cortex-M targets) is that "-Os" optimisation is often quite poor
> compared to "-O2".  It can result in very significantly slower code for
> a saving of a couple of bytes, and in some cases the code can be
> significantly /bigger/ than with -O2.  I don't know whether this is a
> backend issue, or a general problem with "-Os", but I no longer use or
> recommend "-Os" even for tiny embedded systems.

I have done the same for many years now, on not so tiny systems as well:
if you want compact code, you use -O2, and tune both the code and the
compiler flags manually.  Some of the --params are very useful.  But do
not forget about tuning the code itself, that is where the biggest gains
are possible, always.  (If a better solution is hard to come up with,
try finding a better problem, instead!)

But if you have no time to waste, -Os usually give you smaller code than
-O2, even if some of the choices on the size <-> speed tradeoff are a bit
questionable (bug reports or patches welcome, of course!)  And sometimes
-Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after
all.  But if you see this often (for some specific target?), please file
a PR for that as well.


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30 22:49       ` Segher Boessenkool
@ 2019-10-31  8:10         ` David Brown
  2019-10-31 16:24           ` Richard Earnshaw (lists)
  2019-10-31 17:30           ` Segher Boessenkool
  2019-11-03 12:24         ` Oleg Endo
  1 sibling, 2 replies; 20+ messages in thread
From: David Brown @ 2019-10-31  8:10 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Jeff Law, Mikael Tillenius, gcc-help

On 30/10/2019 23:49, Segher Boessenkool wrote:
> On Wed, Oct 30, 2019 at 10:06:57AM +0100, David Brown wrote:
>> But this also brings up another idea.  Is the OP using "-Os"
>> optimisation?  My experience (especially with AVR, msp430, and ARM
>> Cortex-M targets) is that "-Os" optimisation is often quite poor
>> compared to "-O2".  It can result in very significantly slower code for
>> a saving of a couple of bytes, and in some cases the code can be
>> significantly /bigger/ than with -O2.  I don't know whether this is a
>> backend issue, or a general problem with "-Os", but I no longer use or
>> recommend "-Os" even for tiny embedded systems.
> 
> I have done the same for many years now, on not so tiny systems as well:
> if you want compact code, you use -O2, and tune both the code and the
> compiler flags manually.  Some of the --params are very useful.  But do
> not forget about tuning the code itself, that is where the biggest gains
> are possible, always.  (If a better solution is hard to come up with,
> try finding a better problem, instead!)
> 

Getting the smallest and fastest code from the compiler is always a bit
of an artform - tweaking parameters or options, tweaking code
arrangements, adding hints like "always_inline" attributes.  No compiler
can be expected to make the best choices at all times.  But gcc
generally does a fine job - it's rare that I need to do anything like
this with gcc, but sometimes it is fun :-)  (With some of the non-gcc
compilers I have used, it is a necessity, and that's less fun.)

I think there is a general attitude of "I'm compiling for a small
microcontroller with only 32 KB flash - so I must use flags to make the
code as small as possible".  In fact, if you have 32 KB flash then you
need to make your code at most 32 KB in size - code of 31 KB is just as
good as code of 17 KB.  As long as your code fits, any further reduction
in size is unnecessary.

> But if you have no time to waste, -Os usually give you smaller code than
> -O2, even if some of the choices on the size <-> speed tradeoff are a bit
> questionable (bug reports or patches welcome, of course!)  And sometimes
> -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after
> all.  But if you see this often (for some specific target?), please file
> a PR for that as well.
> 

I haven't reported any "-O2 smaller than -Os" bugs, as far as I can
remember, but have merely noted it as a pattern that I see sometimes.  I
will try to keep an eye out for when I see it again.

I have one bug report filed for an "-O2 slower than -Os" bug, the other
direction.  This was for ARM Cortex-M targets, which I filed at the "GNU
Arm Embedded Toolchain" tracker.  They are the main source of gcc
toolchains for small-system ARM embedded development - in my business,
it's important to have specific "toolchain releases" covering the
compiler, library, and other tools.  The folks running the project are
quite good at fixing things or punting them upstream to the main gcc
project, but like every other project bugs are reported faster than they
can be handled!

<https://bugs.launchpad.net/gcc-arm-embedded/+bug/1646883>

Do you have any reference for the heuristic parameters set by -Os ?  (I
am perfectly happy with "it's in this file in the gcc source code", if
that is the answer.)

David

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-31  8:10         ` David Brown
@ 2019-10-31 16:24           ` Richard Earnshaw (lists)
  2019-10-31 18:21             ` David Brown
  2019-10-31 17:30           ` Segher Boessenkool
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Earnshaw (lists) @ 2019-10-31 16:24 UTC (permalink / raw)
  To: David Brown, Segher Boessenkool; +Cc: Jeff Law, Mikael Tillenius, gcc-help

On 31/10/2019 08:10, David Brown wrote:
> I have one bug report filed for an "-O2 slower than -Os" bug, the other
> direction.  This was for ARM Cortex-M targets, which I filed at the "GNU
> Arm Embedded Toolchain" tracker.  They are the main source of gcc
> toolchains for small-system ARM embedded development - in my business,
> it's important to have specific "toolchain releases" covering the
> compiler, library, and other tools.  The folks running the project are
> quite good at fixing things or punting them upstream to the main gcc
> project, but like every other project bugs are reported faster than they
> can be handled!
> 
> <https://bugs.launchpad.net/gcc-arm-embedded/+bug/1646883>

I don't tend to look at launchpad, so I hadn't seen this.  But a simple 
fix is quite easy and will now be in gcc-10.

https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02234.html

If you encounter problems with the Arm embedded tools and it can easily 
be reproduced on a standard FSF build, then it's a good idea to report 
the bug in GCC's own bugzilla, even if you also report it elsewhere to 
the vendor of your toolchain.

GCC-10 will produce the following at -Os for your test1 case:

         ldr     r3, .L2
         movs    r2, #1
         str     r2, [r3, #2060]
         movs    r2, #2
         str     r2, [r3, #2064]
         movs    r2, #3
         str     r2, [r3, #2052]
         movs    r2, #4
         str     r2, [r3, #2076]
         movs    r2, #6
         str     r2, [r3, #2048]
         bx      lr

This isn't perfect: we could use a different base and then we'd be able 
to make use of the 16-bit variants of the STR instruction.  But it's 
about balance here - the code that tries to form the base address 
doesn't know whether it will be re-used and if so, what other offsets 
will be wanted, so we have to guess.  The more bits we put in the base 
address the lower the probability of finding CSEs (as we get more 
bases), but if we put fewer bits in the base address and more in the 
offset we'll be more likely to find CSEs, but less likely to be able to 
use the smaller instructions.

R.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-31 16:24           ` Richard Earnshaw (lists)
@ 2019-10-31 18:21             ` David Brown
  0 siblings, 0 replies; 20+ messages in thread
From: David Brown @ 2019-10-31 18:21 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Segher Boessenkool
  Cc: Jeff Law, Mikael Tillenius, gcc-help



On 31/10/2019 17:24, Richard Earnshaw (lists) wrote:
> On 31/10/2019 08:10, David Brown wrote:
>> I have one bug report filed for an "-O2 slower than -Os" bug, the other
>> direction.Â  This was for ARM Cortex-M targets, which I filed at the "GNU
>> Arm Embedded Toolchain" tracker.Â  They are the main source of gcc
>> toolchains for small-system ARM embedded development - in my business,
>> it's important to have specific "toolchain releases" covering the
>> compiler, library, and other tools.Â  The folks running the project are
>> quite good at fixing things or punting them upstream to the main gcc
>> project, but like every other project bugs are reported faster than they
>> can be handled!
>>
>> <https://bugs.launchpad.net/gcc-arm-embedded/+bug/1646883>
> 
> I don't tend to look at launchpad, so I hadn't seen this.Â  But a simple 
> fix is quite easy and will now be in gcc-10.
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02234.html
> 
> If you encounter problems with the Arm embedded tools and it can easily 
> be reproduced on a standard FSF build, then it's a good idea to report 
> the bug in GCC's own bugzilla, even if you also report it elsewhere to 
> the vendor of your toolchain.
> 
> GCC-10 will produce the following at -Os for your test1 case:
> 
>  Â Â Â Â Â Â Â  ldrÂ Â Â Â  r3, .L2
>  Â Â Â Â Â Â Â  movsÂ Â Â  r2, #1
>  Â Â Â Â Â Â Â  strÂ Â Â Â  r2, [r3, #2060]
>  Â Â Â Â Â Â Â  movsÂ Â Â  r2, #2
>  Â Â Â Â Â Â Â  strÂ Â Â Â  r2, [r3, #2064]
>  Â Â Â Â Â Â Â  movsÂ Â Â  r2, #3
>  Â Â Â Â Â Â Â  strÂ Â Â Â  r2, [r3, #2052]
>  Â Â Â Â Â Â Â  movsÂ Â Â  r2, #4
>  Â Â Â Â Â Â Â  strÂ Â Â Â  r2, [r3, #2076]
>  Â Â Â Â Â Â Â  movsÂ Â Â  r2, #6
>  Â Â Â Â Â Â Â  strÂ Â Â Â  r2, [r3, #2048]
>  Â Â Â Â Â Â Â  bxÂ Â Â Â Â  lr
> 
> This isn't perfect: we could use a different base and then we'd be able 
> to make use of the 16-bit variants of the STR instruction.Â  But it's 
> about balance here - the code that tries to form the base address 
> doesn't know whether it will be re-used and if so, what other offsets 
> will be wanted, so we have to guess.Â  The more bits we put in the base 
> address the lower the probability of finding CSEs (as we get more 
> bases), but if we put fewer bits in the base address and more in the 
> offset we'll be more likely to find CSEs, but less likely to be able to 
> use the smaller instructions.
> 
> R.

Thank you - that is a big improvement.  It is not always easy to guess 
what might be solvable with a simple patch, and what might be more 
effort.  I guess it is better to file the bug reports, and let you folks 
decide if it is something quickly fixable, or if it is more effort than 
can be reasonably justified.

I've added a link to your patch mail to the Launchpad bug, so that the 
powers that be on that toolchain project can mark it as resolved.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-31  8:10         ` David Brown
  2019-10-31 16:24           ` Richard Earnshaw (lists)
@ 2019-10-31 17:30           ` Segher Boessenkool
  2019-10-31 17:47             ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2019-10-31 17:30 UTC (permalink / raw)
  To: David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help

On Thu, Oct 31, 2019 at 09:10:34AM +0100, David Brown wrote:
> I think there is a general attitude of "I'm compiling for a small
> microcontroller with only 32 KB flash - so I must use flags to make the
> code as small as possible".  In fact, if you have 32 KB flash then you
> need to make your code at most 32 KB in size - code of 31 KB is just as
> good as code of 17 KB.  As long as your code fits, any further reduction
> in size is unnecessary.

Yeah, unless you're already at 31kB and now have to add a feature :-)

> I haven't reported any "-O2 smaller than -Os" bugs, as far as I can
> remember, but have merely noted it as a pattern that I see sometimes.  I
> will try to keep an eye out for when I see it again.

Thanks!

> Do you have any reference for the heuristic parameters set by -Os ?  (I
> am perfectly happy with "it's in this file in the gcc source code", if
> that is the answer.)

It's not just parameters, it's all over the code:

$ grep optimize_[a-z]*_for_ *.c | wc -l
219

$ grep -l optimize_[a-z]*_for_ *.c | wc -l
66

(and that does not include the target code -- there are about 300 more in
all targets combined).


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-31 17:30           ` Segher Boessenkool
@ 2019-10-31 17:47             ` Richard Earnshaw (lists)
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Earnshaw (lists) @ 2019-10-31 17:47 UTC (permalink / raw)
  To: Segher Boessenkool, David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help

On 31/10/2019 17:30, Segher Boessenkool wrote:
> On Thu, Oct 31, 2019 at 09:10:34AM +0100, David Brown wrote:
>> I think there is a general attitude of "I'm compiling for a small
>> microcontroller with only 32 KB flash - so I must use flags to make the
>> code as small as possible".  In fact, if you have 32 KB flash then you
>> need to make your code at most 32 KB in size - code of 31 KB is just as
>> good as code of 17 KB.  As long as your code fits, any further reduction
>> in size is unnecessary.
> 
> Yeah, unless you're already at 31kB and now have to add a feature :-)
> 
>> I haven't reported any "-O2 smaller than -Os" bugs, as far as I can
>> remember, but have merely noted it as a pattern that I see sometimes.  I
>> will try to keep an eye out for when I see it again.
> 
> Thanks!
> 
>> Do you have any reference for the heuristic parameters set by -Os ?  (I
>> am perfectly happy with "it's in this file in the gcc source code", if
>> that is the answer.)
> 
> It's not just parameters, it's all over the code:
> 
> $ grep optimize_[a-z]*_for_ *.c | wc -l
> 219
> 
> $ grep -l optimize_[a-z]*_for_ *.c | wc -l
> 66
> 
> (and that does not include the target code -- there are about 300 more in
> all targets combined).
> 
> 
> Segher
> 


And that's not to mention all the places like rtx_costs that turn that 
into a local function argument with a different name - often something 
like time_p, or size_p.

R.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30 22:49       ` Segher Boessenkool
  2019-10-31  8:10         ` David Brown
@ 2019-11-03 12:24         ` Oleg Endo
  2019-11-03 18:17           ` Segher Boessenkool
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Endo @ 2019-11-03 12:24 UTC (permalink / raw)
  To: Segher Boessenkool, David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help

On Wed, 2019-10-30 at 17:49 -0500, Segher Boessenkool wrote:
> 
> But if you have no time to waste, -Os usually give you smaller code than
> -O2, even if some of the choices on the size <-> speed tradeoff are a bit
> questionable (bug reports or patches welcome, of course!)  And sometimes
> -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after
> all.  But if you see this often (for some specific target?), please file
> a PR for that as well.
> 

I've got at least one MCU project which needs to fit into certain
amount of ROM.  As long as it stays within that limit, anything is
fine.  Each time I updated the compiler, starting from GCC 6, then 7
and finally on GCC 8,  there were issues with the code size -- the
compiled output just kept getting bigger and bigger.  At some point I
even had to roll my own GCC patch to reverse one of the changes to the
heuristics in some place.  But the need for that was only temporary.

How to report this kind of regression as a PR?  Obviously I can't
attach the full source code and build system and whatnot (which is not
open source to begin with).  And for the same reason, trying to extract
a reproducer also seems too time consuming and not feasible.

Any ideas, suggestions or strategies how to report these kinds of
problems?

Cheers,
Oleg

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-11-03 12:24         ` Oleg Endo
@ 2019-11-03 18:17           ` Segher Boessenkool
  0 siblings, 0 replies; 20+ messages in thread
From: Segher Boessenkool @ 2019-11-03 18:17 UTC (permalink / raw)
  To: Oleg Endo; +Cc: David Brown, Jeff Law, Mikael Tillenius, gcc-help

Hi!

On Sun, Nov 03, 2019 at 09:24:17PM +0900, Oleg Endo wrote:
> On Wed, 2019-10-30 at 17:49 -0500, Segher Boessenkool wrote:
> > But if you have no time to waste, -Os usually give you smaller code than
> > -O2, even if some of the choices on the size <-> speed tradeoff are a bit
> > questionable (bug reports or patches welcome, of course!)  And sometimes
> > -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after
> > all.  But if you see this often (for some specific target?), please file
> > a PR for that as well.
> 
> I've got at least one MCU project which needs to fit into certain
> amount of ROM.  As long as it stays within that limit, anything is
> fine.  Each time I updated the compiler, starting from GCC 6, then 7
> and finally on GCC 8,  there were issues with the code size -- the
> compiled output just kept getting bigger and bigger.  At some point I
> even had to roll my own GCC patch to reverse one of the changes to the
> heuristics in some place.  But the need for that was only temporary.
> 
> How to report this kind of regression as a PR?  Obviously I can't
> attach the full source code and build system and whatnot (which is not
> open source to begin with).  And for the same reason, trying to extract
> a reproducer also seems too time consuming and not feasible.
> 
> Any ideas, suggestions or strategies how to report these kinds of
> problems?

You can still report it, of course.  Someone might have an idea what
causes it, esp.  if you give enough info.  It also helps if you can at
least partially bisect it, find when the problem was introduced.

It also helps a lot if you know the area in the compiler that regressed,
or have a motivated guess for that anyway :-)


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-29 20:19 ` Jeff Law
  2019-10-30  0:34   ` Segher Boessenkool
@ 2019-10-30 16:12   ` Mikael Tillenius
  2019-11-01 13:22     ` Mikael Tillenius
  1 sibling, 1 reply; 20+ messages in thread
From: Mikael Tillenius @ 2019-10-30 16:12 UTC (permalink / raw)
  To: Jeff Law, gcc-help

Thanks, I have filed a bug report now.

/Mikael


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Non-optimal code generated for H8
  2019-10-30 16:12   ` Mikael Tillenius
@ 2019-11-01 13:22     ` Mikael Tillenius
  0 siblings, 0 replies; 20+ messages in thread
From: Mikael Tillenius @ 2019-11-01 13:22 UTC (permalink / raw)
  To: gcc-help

Ok, I have looked more into the problem. For now I have a decent work 
around in my code using "volatile" and an almost working patch for GCC. 
I am not sure if I have time to work more on the GCC patch right now.

But just to sum up, here is what I found so far. Given the code

struct s {
 Â Â Â  char a, b;
 Â Â Â  char c[11];
} x[2];

void test(int n)
{
 Â Â Â  struct s *sp = &x[n];

 Â Â Â  sp->a = 1;
 Â Â Â  sp->b = 1;
}

For plain H8/300 (no specific options) and H8S (-ms) GCC does not find a 
suitable multiplication instruction and instead generates a call to 
__mulhi3/__mulsi3. For some reason CSE is not able to do its work so the 
pointer value is recalculated for each access. H8300 has no 
multiplication instruction. H8S has only a 16x16 => 32 bit 
multiplication (mulxu.w).

 Â Â Â  stm.lÂ Â Â  er4-er5,@-er7
 Â Â Â  mov.wÂ Â Â  r0,r4
 Â Â Â  extu.lÂ Â Â  er4
 Â Â Â  sub.lÂ Â Â  er1,er1
 Â Â Â  add.bÂ Â Â  #13,r1l
 Â Â Â  mov.lÂ Â Â  er4,er0
 Â Â Â  jsrÂ Â Â  @___mulsi3
 Â Â Â  mov.bÂ Â Â  #1,r5l
 Â Â Â  mov.bÂ Â Â  r5l,@(_x,er0)
 Â Â Â  sub.lÂ Â Â  er1,er1
 Â Â Â  add.bÂ Â Â  #13,r1l
 Â Â Â  mov.lÂ Â Â  er4,er0
 Â Â Â  jsrÂ Â Â  @___mulsi3
 Â Â Â  mov.bÂ Â Â  r5l,@(_x+1,er0)
 Â Â Â  ldm.lÂ Â Â  @er7+,er4-er5
 Â Â Â  rts

For H8/300H in 16-bit mode (-mh -mn) and H8SX (-mx) GCC finds the 
appropriate instructions and CSE does its work. Only one mulxu.w/mulu.l 
is generated and the pointer value is reused.

If I add a suitable multiplication insn that generates the call to 
__mulsi3 for H8S GCC will use it and CSE will work. "b" and "e" are new 
constraints that require r0 and r1. The problem is that this insn 
collides with the already exsiting insn for H8SX.

(define_insn "mulsi3"
 Â  [(set (match_operand:SI 0 "register_operand" "=b")
 Â Â Â Â Â Â Â  (mult:SI (match_operand:SI 1 "register_operand" "%0")
 Â Â Â  Â Â Â  Â (match_operand:SI 2 "register_operand" "e")))]
 Â  "TARGET_H8300S"
 Â  "jsr\\t@___mulsi3"
 Â  [(set_attr "length" "2")
 Â Â  (set_attr "cc" "set_zn")])

---

 Â Â Â  extu.lÂ Â Â  er0
 Â Â Â  sub.lÂ Â Â  er1,er1
 Â Â Â  add.bÂ Â Â  #13,r1l
 Â Â Â  jsrÂ Â Â  @___mulsi3
 Â Â Â  add.lÂ Â Â  #_x,er0
 Â Â Â  mov.bÂ Â Â  #1,r2l
 Â Â Â  mov.bÂ Â Â  r2l,@er0
 Â Â Â  mov.bÂ Â Â  r2l,@(1,er0)
 Â Â Â  rts

If I also add the following I get almost perfect code. "small_operand" 
is a new predicate that requires a 16-bit value.

(define_insn "*mulhisi3"
 Â  [(set (match_operand:SI 0 "register_operand" "=r")
 Â Â Â  (mult:SI (sign_extend:SI (match_operand:HI 1 "register_operand" "%0"))
 Â Â Â  Â Â Â  Â (match_operand:SI 2 "small_operand" "")))]
 Â  "TARGET_H8300H || TARGET_H8300S"
 Â  "mov.w\\t%T2,%e0\;mulxs.w\\t%e0,%S0"
 Â  [(set_attr "length" "2")
 Â Â  (set_attr "cc" "none_0hit")])

---

 Â Â Â  mov.wÂ Â Â  #13,e0
 Â Â Â  mulxs.wÂ Â Â  e0,er0
 Â Â Â  add.lÂ Â Â  #_x,er0
 Â Â Â  mov.bÂ Â Â  #1,r2l
 Â Â Â  mov.bÂ Â Â  r2l,@er0
 Â Â Â  mov.bÂ Â Â  r2l,@(1,er0)
 Â Â Â  rts

Thanks for the help.

/Mikael

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-11-03 18:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-29 20:03 Non-optimal code generated for H8 Mikael Tillenius
2019-10-29 20:19 ` Jeff Law
2019-10-30  0:34   ` Segher Boessenkool
2019-10-30  9:07     ` David Brown
2019-10-30 16:32       ` Mikael Tillenius
2019-10-30 19:53       ` Mikael Tillenius
2019-10-30 19:59         ` Jeff Law
2019-10-31  1:01           ` Segher Boessenkool
2019-11-01 13:16           ` Mikael Tillenius
2019-11-01 14:38             ` Jeff Law
2019-10-30 22:49       ` Segher Boessenkool
2019-10-31  8:10         ` David Brown
2019-10-31 16:24           ` Richard Earnshaw (lists)
2019-10-31 18:21             ` David Brown
2019-10-31 17:30           ` Segher Boessenkool
2019-10-31 17:47             ` Richard Earnshaw (lists)
2019-11-03 12:24         ` Oleg Endo
2019-11-03 18:17           ` Segher Boessenkool
2019-10-30 16:12   ` Mikael Tillenius
2019-11-01 13:22     ` Mikael Tillenius

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).