* Non-optimal code generated for H8 @ 2019-10-29 20:03 Mikael Tillenius 2019-10-29 20:19 ` Jeff Law 0 siblings, 1 reply; 20+ messages in thread From: Mikael Tillenius @ 2019-10-29 20:03 UTC (permalink / raw) To: gcc-help Hi I am using a cross compiler for Renesas H8S. In a few places it generates really bad code. Given the following program: struct s { Â Â Â char a, b; Â Â Â char c[11]; } x[2]; void test(int n) { Â Â Â struct s *sp = &x[n]; Â Â Â sp->a = 1; Â Â Â sp->b = 1; } I would expect that the pointer "sp" is calculated once and reused to access the fields "a" and "b". But instead the pointer is recalculated for each access. This generates a lot of extra code, including calls to __mulhi3. I have tested with gcc 8.2 and 9.2 and with different optimization levels (-O1, -O2, -Os) all with the same result. With -O0 "sp" is only calculated once and kept as a variable on the stack but the rest of the code is not as good as it could be. The best work around seems to be to declare "sp" as volatile: "struct s *volatile sp = &x[n];". Then "sp" is only calculated once and kept on the stack and the surrounding code can be optimized. So my question is: where should I start looking for a fix to this? The other targets I tried (ARM, x86, x86_64) behave as expected and calculates the pointer once and keeps it in a register. /Mikael ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-29 20:03 Non-optimal code generated for H8 Mikael Tillenius @ 2019-10-29 20:19 ` Jeff Law 2019-10-30 0:34 ` Segher Boessenkool 2019-10-30 16:12 ` Mikael Tillenius 0 siblings, 2 replies; 20+ messages in thread From: Jeff Law @ 2019-10-29 20:19 UTC (permalink / raw) To: Mikael Tillenius, gcc-help On 10/29/19 2:03 PM, Mikael Tillenius wrote: > Hi > > I am using a cross compiler for Renesas H8S. In a few places it > generates really bad code. Given the following program: > > struct s { > char a, b; > char c[11]; > } x[2]; > > void test(int n) > { > struct s *sp = &x[n]; > > sp->a = 1; > sp->b = 1; > } > > I would expect that the pointer "sp" is calculated once and reused to > access the fields "a" and "b". But instead the pointer is recalculated > for each access. This generates a lot of extra code, including calls to > __mulhi3. I have tested with gcc 8.2 and 9.2 and with different > optimization levels (-O1, -O2, -Os) all with the same result. With -O0 > "sp" is only calculated once and kept as a variable on the stack but the > rest of the code is not as good as it could be. The best work around > seems to be to declare "sp" as volatile: "struct s *volatile sp = > &x[n];". Then "sp" is only calculated once and kept on the stack and the > surrounding code can be optimized. > > So my question is: where should I start looking for a fix to this? The > other targets I tried (ARM, x86, x86_64) behave as expected and > calculates the pointer once and keeps it in a register. As we leave gimple the code looks like: MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1; MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1; One might argue that DOM or FRE should have created a common subexpression for the address arithmetic here. Even so it's not bad. CSE doesn't do its job though. THere's clearly a REG_EQUAL note which should have allowed it to at least cleanup the redundant multiplication for the address calculation. I recommend filing a bug report. Note that the H8 port is on the list of ports that are be deprecated in gcc-10 unless someone steps forward to take care of some significant maintenance tasks. Deprecation in gcc-10 would mean the port would be removed in gcc-11 unless someone steps up to take care of those maintenance tasks. Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-29 20:19 ` Jeff Law @ 2019-10-30 0:34 ` Segher Boessenkool 2019-10-30 9:07 ` David Brown 2019-10-30 16:12 ` Mikael Tillenius 1 sibling, 1 reply; 20+ messages in thread From: Segher Boessenkool @ 2019-10-30 0:34 UTC (permalink / raw) To: Jeff Law; +Cc: Mikael Tillenius, gcc-help On Tue, Oct 29, 2019 at 02:19:25PM -0600, Jeff Law wrote: > On 10/29/19 2:03 PM, Mikael Tillenius wrote: > > I am using a cross compiler for Renesas H8S. In a few places it > > generates really bad code. Given the following program: > > > > struct s { > > Â Â Â char a, b; > > Â Â Â char c[11]; > > } x[2]; > > > > void test(int n) > > { > > Â Â Â struct s *sp = &x[n]; > > > > Â Â Â sp->a = 1; > > Â Â Â sp->b = 1; > > } > As we leave gimple the code looks like: > > MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1; > MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1; > > One might argue that DOM or FRE should have created a common > subexpression for the address arithmetic here. Even so it's not bad. > > CSE doesn't do its job though. THere's clearly a REG_EQUAL note which > should have allowed it to at least cleanup the redundant multiplication > for the address calculation. And on other targets it does do its job fine, say riscv32, or m68k -O1 (the -O1 to prevent the two stores from being optimised into one). I haven't managed to find another target where multiplication by 13 is done with a libcall though. Maybe I should look harder. > I recommend filing a bug report. +1 Segher ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 0:34 ` Segher Boessenkool @ 2019-10-30 9:07 ` David Brown 2019-10-30 16:32 ` Mikael Tillenius ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: David Brown @ 2019-10-30 9:07 UTC (permalink / raw) To: Segher Boessenkool, Jeff Law; +Cc: Mikael Tillenius, gcc-help On 30/10/2019 01:34, Segher Boessenkool wrote: > On Tue, Oct 29, 2019 at 02:19:25PM -0600, Jeff Law wrote: >> On 10/29/19 2:03 PM, Mikael Tillenius wrote: >>> I am using a cross compiler for Renesas H8S. In a few places it >>> generates really bad code. Given the following program: >>> >>> struct s { >>> Â Â Â char a, b; >>> Â Â Â char c[11]; >>> } x[2]; >>> >>> void test(int n) >>> { >>> Â Â Â struct s *sp = &x[n]; >>> >>> Â Â Â sp->a = 1; >>> Â Â Â sp->b = 1; >>> } > >> As we leave gimple the code looks like: >> >> MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1; >> MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1; >> >> One might argue that DOM or FRE should have created a common >> subexpression for the address arithmetic here. Even so it's not bad. >> >> CSE doesn't do its job though. THere's clearly a REG_EQUAL note which >> should have allowed it to at least cleanup the redundant multiplication >> for the address calculation. > > And on other targets it does do its job fine, say riscv32, or m68k -O1 > (the -O1 to prevent the two stores from being optimised into one). > > I haven't managed to find another target where multiplication by 13 is > done with a libcall though. Maybe I should look harder. > I checked on the 8-bit AVR, which is (I think) the smallest and simplest device targeted by gcc, and where only some devices have multiplication instructions. If you optimise for size (-Os), it uses a library call __mulhi3 for the multiplication: test: ldi r22,lo8(13) ldi r23,0 rcall __mulhi3 subi r24,lo8(-(x)) sbci r25,hi8(-(x)) ldi r18,lo8(1) mov r26,r24 mov r27,r25 st X,r18 mov r30,r26 mov r31,r27 std Z+1,r18 ret With -O1 or -O2, it uses shifts and adds for the multiply: test: mov r30,r24 mov r31,r25 lsl r30 rol r31 add r30,r24 adc r31,r25 lsl r30 rol r31 lsl r30 rol r31 add r24,r30 adc r25,r31 mov r30,r24 mov r31,r25 subi r30,lo8(-(x)) sbci r31,hi8(-(x)) ldi r24,lo8(1) st Z,r24 std Z+1,r24 ret (I used an old version of avr-gcc here, version 5.4.0, just because it is on the extremely useful <https://godbolt.org> online compiler site. Things may have changed for later versions, but usually the AVR port is fairly stable.) One thing to note here is that with -O1, the compiler calculates "sp" in the "Z" register, then stores 1 into [sp] and [sp+1]. With the multiplication, sp is calculated in a non-pointer register pair and the compiler generates sub-optimal code for storing in [sp] and [sp+1]. If that is still the case for modern gcc, it could be filed as a missed optimisation bug for the AVR backend. But this also brings up another idea. Is the OP using "-Os" optimisation? My experience (especially with AVR, msp430, and ARM Cortex-M targets) is that "-Os" optimisation is often quite poor compared to "-O2". It can result in very significantly slower code for a saving of a couple of bytes, and in some cases the code can be significantly /bigger/ than with -O2. I don't know whether this is a backend issue, or a general problem with "-Os", but I no longer use or recommend "-Os" even for tiny embedded systems. So perhaps simply changing from "-Os" to "-O2" will fix the OP's problems. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 9:07 ` David Brown @ 2019-10-30 16:32 ` Mikael Tillenius 2019-10-30 19:53 ` Mikael Tillenius 2019-10-30 22:49 ` Segher Boessenkool 2 siblings, 0 replies; 20+ messages in thread From: Mikael Tillenius @ 2019-10-30 16:32 UTC (permalink / raw) To: gcc-help On 30/10/2019 01:34, Segher Boessenkool wrote: >> And on other targets it does do its job fine, say riscv32, or m68k -O1 >> (the -O1 to prevent the two stores from being optimised into one). >> >> I haven't managed to find another target where multiplication by 13 is >> done with a libcall though. Maybe I should look harder. >> Just to be clear. The call to __mulhi3 by itself is reasonable since the H8 has quite weak shift instructions. And no 32 bit by 32 bit multiplication instruction. It is the repeated re-calculation of the pointer that is the big problem. The code gets both big and slow. On 10/30/19 10:06 AM, David Brown wrote: > But this also brings up another idea. Is the OP using "-Os" > optimisation? My experience (especially with AVR, msp430, and ARM > Cortex-M targets) is that "-Os" optimisation is often quite poor > compared to "-O2". It can result in very significantly slower code for > a saving of a couple of bytes, and in some cases the code can be > significantly /bigger/ than with -O2. I don't know whether this is a > backend issue, or a general problem with "-Os", but I no longer use or > recommend "-Os" even for tiny embedded systems. Yes, I tried all of -O1, -O2 and -Os and probably a few more. My experience with -Os on ARM Thumb2 is much more positive than your. "-Os -mslow-flash-data" gives small and quite fast code most of the time. /Mikael ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 9:07 ` David Brown 2019-10-30 16:32 ` Mikael Tillenius @ 2019-10-30 19:53 ` Mikael Tillenius 2019-10-30 19:59 ` Jeff Law 2019-10-30 22:49 ` Segher Boessenkool 2 siblings, 1 reply; 20+ messages in thread From: Mikael Tillenius @ 2019-10-30 19:53 UTC (permalink / raw) To: gcc-help The AVR example was interesting since it generates reasonable code. After a few experiments I strongly believe that it is the call to __mulhi3 that stops GCC from doing CSE on the pointer expressions. If I add a fake mulhi3 insn CSE seems to do its work and just generates the address calculation once. But of course the code will not work since the instruction is not present in the hardware. I see at least three possible solutions: 1) Teach CSE to work even for calls to __mulhi3 2) Add an insn that generates the H8S mulxu.w (16x16 => 32) instruction in this case. "int" is 16-bit and the size of the struct is also small in this case. 3) Do something similar to AVR. It seems to have an mulhi3 insn that expands to a call to __mulhi3 after CSE. Now the only problem is that I have no idea how to do any of the above;-) Any hints? /Mikael ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 19:53 ` Mikael Tillenius @ 2019-10-30 19:59 ` Jeff Law 2019-10-31 1:01 ` Segher Boessenkool 2019-11-01 13:16 ` Mikael Tillenius 0 siblings, 2 replies; 20+ messages in thread From: Jeff Law @ 2019-10-30 19:59 UTC (permalink / raw) To: Mikael Tillenius, gcc-help On 10/30/19 1:53 PM, Mikael Tillenius wrote: > The AVR example was interesting since it generates reasonable code. > > After a few experiments I strongly believe that it is the call to > __mulhi3 that stops GCC from doing CSE on the pointer expressions. If I > add a fake mulhi3 insn CSE seems to do its work and just generates the > address calculation once. But of course the code will not work since the > instruction is not present in the hardware. I see at least three > possible solutions: > > 1) Teach CSE to work even for calls to __mulhi3 THe name shouldn't matter at all to CSE. What's important to CSE is interpreting the REG_EQUAL note in this case and finding the bounds of the insn chain that computes the value in the REG_EQUAL note. > > 2) Add an insn that generates the H8S mulxu.w (16x16 => 32) instruction > in this case. "int" is 16-bit and the size of the struct is also small > in this case. Note that GCC should know how to generate mulxu.w for the H8/300H and above variants. Just quickly looking at the MD file I see: > (define_expand "umulhisi3" > [(set (match_operand:SI 0 "register_operand" "") > (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" "")) > ;; intentionally-mismatched modes > (match_operand:HI 2 "reg_or_nibble_operand" "")))] > "TARGET_H8300H || TARGET_H8300S" > { > if (GET_MODE (operands[2]) != VOIDmode) > operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]); > }) Note the HImode on operand 2. That seems wrong. Hopefully it wasn't something *I* did eons ago. Anyway, that might be a place to start looking. I wouldn't be surprised if all the work we've done in the last decade to tighten up our checking of predicates is somehow causing us to not use umulhisi3 in this context. Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 19:59 ` Jeff Law @ 2019-10-31 1:01 ` Segher Boessenkool 2019-11-01 13:16 ` Mikael Tillenius 1 sibling, 0 replies; 20+ messages in thread From: Segher Boessenkool @ 2019-10-31 1:01 UTC (permalink / raw) To: Jeff Law; +Cc: Mikael Tillenius, gcc-help On Wed, Oct 30, 2019 at 01:59:26PM -0600, Jeff Law wrote: > On 10/30/19 1:53 PM, Mikael Tillenius wrote: > > The AVR example was interesting since it generates reasonable code. > > > > After a few experiments I strongly believe that it is the call to > > __mulhi3 that stops GCC from doing CSE on the pointer expressions. If I > > add a fake mulhi3 insn CSE seems to do its work and just generates the > > address calculation once. But of course the code will not work since the > > instruction is not present in the hardware. I see at least three > > possible solutions: > > > > 1) Teach CSE to work even for calls to __mulhi3 > THe name shouldn't matter at all to CSE. What's important to CSE is > interpreting the REG_EQUAL note in this case and finding the bounds of > the insn chain that computes the value in the REG_EQUAL note. Yeah, and the call insns *are* properly marked as "const" as well, so I don't see what is going on (and the CSE dump file is content-less). > > (define_expand "umulhisi3" > > [(set (match_operand:SI 0 "register_operand" "") > > (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" "")) > > ;; intentionally-mismatched modes > > (match_operand:HI 2 "reg_or_nibble_operand" "")))] > > "TARGET_H8300H || TARGET_H8300S" > > { > > if (GET_MODE (operands[2]) != VOIDmode) > > operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]); > > }) > > Note the HImode on operand 2. That seems wrong. Hopefully it wasn't > something *I* did eons ago. Anyway, that might be a place to start > looking. I wouldn't be surprised if all the work we've done in the last > decade to tighten up our checking of predicates is somehow causing us to > not use umulhisi3 in this context. I get a __mulsi3 even, probably need more flags than just -ms :-) (I used h8300-linux). Segher ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 19:59 ` Jeff Law 2019-10-31 1:01 ` Segher Boessenkool @ 2019-11-01 13:16 ` Mikael Tillenius 2019-11-01 14:38 ` Jeff Law 1 sibling, 1 reply; 20+ messages in thread From: Mikael Tillenius @ 2019-11-01 13:16 UTC (permalink / raw) To: gcc-help On 10/30/19 8:59 PM, Jeff Law wrote: >> 1) Teach CSE to work even for calls to __mulhi3 > THe name shouldn't matter at all to CSE. What's important to CSE is > interpreting the REG_EQUAL note in this case and finding the bounds of > the insn chain that computes the value in the REG_EQUAL note. My guess would be that it is the *call* that is the problem, not exactly what it calls. But I really don't know what I am talking about. I never looked at the GCC sources until yesterday. > Note that GCC should know how to generate mulxu.w for the H8/300H and > above variants. Just quickly looking at the MD file I see: > > >> (define_expand "umulhisi3" >> [(set (match_operand:SI 0 "register_operand" "") >> (mult:SI (zero_extend:SI (match_operand:HI 1 "register_operand" "")) >> ;; intentionally-mismatched modes >> (match_operand:HI 2 "reg_or_nibble_operand" "")))] >> "TARGET_H8300H || TARGET_H8300S" >> { >> if (GET_MODE (operands[2]) != VOIDmode) >> operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]); >> }) > Note the HImode on operand 2. That seems wrong. Hopefully it wasn't > something *I* did eons ago. Anyway, that might be a place to start > looking. I wouldn't be surprised if all the work we've done in the last > decade to tighten up our checking of predicates is somehow causing us to > not use umulhisi3 in this context. I think the problem is the predicate "reg_or_nibble_operand". It only matches integers <= 15 and only on H8SX. So it cannot match 13 on my target (which is H8S) and cannot match sizeof(...) in general even on H8SX. Also see my next message. /Mikael ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-11-01 13:16 ` Mikael Tillenius @ 2019-11-01 14:38 ` Jeff Law 0 siblings, 0 replies; 20+ messages in thread From: Jeff Law @ 2019-11-01 14:38 UTC (permalink / raw) To: Mikael Tillenius, gcc-help On 11/1/19 7:16 AM, Mikael Tillenius wrote: > > On 10/30/19 8:59 PM, Jeff Law wrote: >>> 1) Teach CSE to work even for calls to __mulhi3 >> THe name shouldn't matter at all to CSE. What's important to CSE is >> interpreting the REG_EQUAL note in this case and finding the bounds of >> the insn chain that computes the value in the REG_EQUAL note. > My guess would be that it is the *call* that is the problem, not exactly > what it calls. But I really don't know what I am talking about. I never > looked at the GCC sources until yesterday. There is certainly a problem in that CSE should know how to remove the redundant call and failure to do so is likely a (low priority) bug. But at least on the H8/300H and later processors the calls should be totally unnecessary as we should be generating the multiply instruction. >> Note that GCC should know how to generate mulxu.w for the H8/300H and >> above variants. Just quickly looking at the MD file I see: >> >> >>> (define_expand "umulhisi3" >>> [(set (match_operand:SI 0 "register_operand" "") >>> (mult:SI (zero_extend:SI (match_operand:HI 1 >>> "register_operand" "")) >>> ;; intentionally-mismatched modes >>> (match_operand:HI 2 "reg_or_nibble_operand" "")))] >>> "TARGET_H8300H || TARGET_H8300S" >>> { >>> if (GET_MODE (operands[2]) != VOIDmode) >>> operands[2] = gen_rtx_ZERO_EXTEND (SImode, operands[2]); >>> }) >> Note the HImode on operand 2. That seems wrong. Hopefully it wasn't >> something *I* did eons ago. Anyway, that might be a place to start >> looking. I wouldn't be surprised if all the work we've done in the last >> decade to tighten up our checking of predicates is somehow causing us to >> not use umulhisi3 in this context. > > I think the problem is the predicate "reg_or_nibble_operand". It only > matches integers <= 15 and only on H8SX. So it cannot match 13 on my > target (which is H8S) and cannot match sizeof(...) in general even on H8SX. It's not the range of values, but the *modes* of those values. ie, it's perfectly fine to have a predicate that restricts the values to some integer range. The general structure of that expander is suspect because of the difference in the structure of the two operands of the MULT rtx. ie, one is a (zero_extend:SI ...) and the other is a HImode reg_or_nibble_operand. I strongly suspect the generic expansion code is claiming that umulhisi3 isn't supported for any cases because of this issue. To further complicate matters GCC does not generally allow for constructs such as ({zero,sign}_extend (const_int ...))) as those are inherently constants and would be simplified by removing the extension. So it's actually fairly difficult to write that expander correctly. Anyway, someone would need to debug the expander and recognition code. It's not a high priority issue though. But at least with the bug filed it won't get lost. Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 9:07 ` David Brown 2019-10-30 16:32 ` Mikael Tillenius 2019-10-30 19:53 ` Mikael Tillenius @ 2019-10-30 22:49 ` Segher Boessenkool 2019-10-31 8:10 ` David Brown 2019-11-03 12:24 ` Oleg Endo 2 siblings, 2 replies; 20+ messages in thread From: Segher Boessenkool @ 2019-10-30 22:49 UTC (permalink / raw) To: David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help On Wed, Oct 30, 2019 at 10:06:57AM +0100, David Brown wrote: > But this also brings up another idea. Is the OP using "-Os" > optimisation? My experience (especially with AVR, msp430, and ARM > Cortex-M targets) is that "-Os" optimisation is often quite poor > compared to "-O2". It can result in very significantly slower code for > a saving of a couple of bytes, and in some cases the code can be > significantly /bigger/ than with -O2. I don't know whether this is a > backend issue, or a general problem with "-Os", but I no longer use or > recommend "-Os" even for tiny embedded systems. I have done the same for many years now, on not so tiny systems as well: if you want compact code, you use -O2, and tune both the code and the compiler flags manually. Some of the --params are very useful. But do not forget about tuning the code itself, that is where the biggest gains are possible, always. (If a better solution is hard to come up with, try finding a better problem, instead!) But if you have no time to waste, -Os usually give you smaller code than -O2, even if some of the choices on the size <-> speed tradeoff are a bit questionable (bug reports or patches welcome, of course!) And sometimes -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after all. But if you see this often (for some specific target?), please file a PR for that as well. Segher ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 22:49 ` Segher Boessenkool @ 2019-10-31 8:10 ` David Brown 2019-10-31 16:24 ` Richard Earnshaw (lists) 2019-10-31 17:30 ` Segher Boessenkool 2019-11-03 12:24 ` Oleg Endo 1 sibling, 2 replies; 20+ messages in thread From: David Brown @ 2019-10-31 8:10 UTC (permalink / raw) To: Segher Boessenkool; +Cc: Jeff Law, Mikael Tillenius, gcc-help On 30/10/2019 23:49, Segher Boessenkool wrote: > On Wed, Oct 30, 2019 at 10:06:57AM +0100, David Brown wrote: >> But this also brings up another idea. Is the OP using "-Os" >> optimisation? My experience (especially with AVR, msp430, and ARM >> Cortex-M targets) is that "-Os" optimisation is often quite poor >> compared to "-O2". It can result in very significantly slower code for >> a saving of a couple of bytes, and in some cases the code can be >> significantly /bigger/ than with -O2. I don't know whether this is a >> backend issue, or a general problem with "-Os", but I no longer use or >> recommend "-Os" even for tiny embedded systems. > > I have done the same for many years now, on not so tiny systems as well: > if you want compact code, you use -O2, and tune both the code and the > compiler flags manually. Some of the --params are very useful. But do > not forget about tuning the code itself, that is where the biggest gains > are possible, always. (If a better solution is hard to come up with, > try finding a better problem, instead!) > Getting the smallest and fastest code from the compiler is always a bit of an artform - tweaking parameters or options, tweaking code arrangements, adding hints like "always_inline" attributes. No compiler can be expected to make the best choices at all times. But gcc generally does a fine job - it's rare that I need to do anything like this with gcc, but sometimes it is fun :-) (With some of the non-gcc compilers I have used, it is a necessity, and that's less fun.) I think there is a general attitude of "I'm compiling for a small microcontroller with only 32 KB flash - so I must use flags to make the code as small as possible". In fact, if you have 32 KB flash then you need to make your code at most 32 KB in size - code of 31 KB is just as good as code of 17 KB. As long as your code fits, any further reduction in size is unnecessary. > But if you have no time to waste, -Os usually give you smaller code than > -O2, even if some of the choices on the size <-> speed tradeoff are a bit > questionable (bug reports or patches welcome, of course!) And sometimes > -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after > all. But if you see this often (for some specific target?), please file > a PR for that as well. > I haven't reported any "-O2 smaller than -Os" bugs, as far as I can remember, but have merely noted it as a pattern that I see sometimes. I will try to keep an eye out for when I see it again. I have one bug report filed for an "-O2 slower than -Os" bug, the other direction. This was for ARM Cortex-M targets, which I filed at the "GNU Arm Embedded Toolchain" tracker. They are the main source of gcc toolchains for small-system ARM embedded development - in my business, it's important to have specific "toolchain releases" covering the compiler, library, and other tools. The folks running the project are quite good at fixing things or punting them upstream to the main gcc project, but like every other project bugs are reported faster than they can be handled! <https://bugs.launchpad.net/gcc-arm-embedded/+bug/1646883> Do you have any reference for the heuristic parameters set by -Os ? (I am perfectly happy with "it's in this file in the gcc source code", if that is the answer.) David ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-31 8:10 ` David Brown @ 2019-10-31 16:24 ` Richard Earnshaw (lists) 2019-10-31 18:21 ` David Brown 2019-10-31 17:30 ` Segher Boessenkool 1 sibling, 1 reply; 20+ messages in thread From: Richard Earnshaw (lists) @ 2019-10-31 16:24 UTC (permalink / raw) To: David Brown, Segher Boessenkool; +Cc: Jeff Law, Mikael Tillenius, gcc-help On 31/10/2019 08:10, David Brown wrote: > I have one bug report filed for an "-O2 slower than -Os" bug, the other > direction. This was for ARM Cortex-M targets, which I filed at the "GNU > Arm Embedded Toolchain" tracker. They are the main source of gcc > toolchains for small-system ARM embedded development - in my business, > it's important to have specific "toolchain releases" covering the > compiler, library, and other tools. The folks running the project are > quite good at fixing things or punting them upstream to the main gcc > project, but like every other project bugs are reported faster than they > can be handled! > > <https://bugs.launchpad.net/gcc-arm-embedded/+bug/1646883> I don't tend to look at launchpad, so I hadn't seen this. But a simple fix is quite easy and will now be in gcc-10. https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02234.html If you encounter problems with the Arm embedded tools and it can easily be reproduced on a standard FSF build, then it's a good idea to report the bug in GCC's own bugzilla, even if you also report it elsewhere to the vendor of your toolchain. GCC-10 will produce the following at -Os for your test1 case: ldr r3, .L2 movs r2, #1 str r2, [r3, #2060] movs r2, #2 str r2, [r3, #2064] movs r2, #3 str r2, [r3, #2052] movs r2, #4 str r2, [r3, #2076] movs r2, #6 str r2, [r3, #2048] bx lr This isn't perfect: we could use a different base and then we'd be able to make use of the 16-bit variants of the STR instruction. But it's about balance here - the code that tries to form the base address doesn't know whether it will be re-used and if so, what other offsets will be wanted, so we have to guess. The more bits we put in the base address the lower the probability of finding CSEs (as we get more bases), but if we put fewer bits in the base address and more in the offset we'll be more likely to find CSEs, but less likely to be able to use the smaller instructions. R. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-31 16:24 ` Richard Earnshaw (lists) @ 2019-10-31 18:21 ` David Brown 0 siblings, 0 replies; 20+ messages in thread From: David Brown @ 2019-10-31 18:21 UTC (permalink / raw) To: Richard Earnshaw (lists), Segher Boessenkool Cc: Jeff Law, Mikael Tillenius, gcc-help On 31/10/2019 17:24, Richard Earnshaw (lists) wrote: > On 31/10/2019 08:10, David Brown wrote: >> I have one bug report filed for an "-O2 slower than -Os" bug, the other >> direction. This was for ARM Cortex-M targets, which I filed at the "GNU >> Arm Embedded Toolchain" tracker. They are the main source of gcc >> toolchains for small-system ARM embedded development - in my business, >> it's important to have specific "toolchain releases" covering the >> compiler, library, and other tools. The folks running the project are >> quite good at fixing things or punting them upstream to the main gcc >> project, but like every other project bugs are reported faster than they >> can be handled! >> >> <https://bugs.launchpad.net/gcc-arm-embedded/+bug/1646883> > > I don't tend to look at launchpad, so I hadn't seen this. But a simple > fix is quite easy and will now be in gcc-10. > > https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02234.html > > If you encounter problems with the Arm embedded tools and it can easily > be reproduced on a standard FSF build, then it's a good idea to report > the bug in GCC's own bugzilla, even if you also report it elsewhere to > the vendor of your toolchain. > > GCC-10 will produce the following at -Os for your test1 case: > >        ldr    r3, .L2 >        movs   r2, #1 >        str    r2, [r3, #2060] >        movs   r2, #2 >        str    r2, [r3, #2064] >        movs   r2, #3 >        str    r2, [r3, #2052] >        movs   r2, #4 >        str    r2, [r3, #2076] >        movs   r2, #6 >        str    r2, [r3, #2048] >        bx     lr > > This isn't perfect: we could use a different base and then we'd be able > to make use of the 16-bit variants of the STR instruction. But it's > about balance here - the code that tries to form the base address > doesn't know whether it will be re-used and if so, what other offsets > will be wanted, so we have to guess. The more bits we put in the base > address the lower the probability of finding CSEs (as we get more > bases), but if we put fewer bits in the base address and more in the > offset we'll be more likely to find CSEs, but less likely to be able to > use the smaller instructions. > > R. Thank you - that is a big improvement. It is not always easy to guess what might be solvable with a simple patch, and what might be more effort. I guess it is better to file the bug reports, and let you folks decide if it is something quickly fixable, or if it is more effort than can be reasonably justified. I've added a link to your patch mail to the Launchpad bug, so that the powers that be on that toolchain project can mark it as resolved. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-31 8:10 ` David Brown 2019-10-31 16:24 ` Richard Earnshaw (lists) @ 2019-10-31 17:30 ` Segher Boessenkool 2019-10-31 17:47 ` Richard Earnshaw (lists) 1 sibling, 1 reply; 20+ messages in thread From: Segher Boessenkool @ 2019-10-31 17:30 UTC (permalink / raw) To: David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help On Thu, Oct 31, 2019 at 09:10:34AM +0100, David Brown wrote: > I think there is a general attitude of "I'm compiling for a small > microcontroller with only 32 KB flash - so I must use flags to make the > code as small as possible". In fact, if you have 32 KB flash then you > need to make your code at most 32 KB in size - code of 31 KB is just as > good as code of 17 KB. As long as your code fits, any further reduction > in size is unnecessary. Yeah, unless you're already at 31kB and now have to add a feature :-) > I haven't reported any "-O2 smaller than -Os" bugs, as far as I can > remember, but have merely noted it as a pattern that I see sometimes. I > will try to keep an eye out for when I see it again. Thanks! > Do you have any reference for the heuristic parameters set by -Os ? (I > am perfectly happy with "it's in this file in the gcc source code", if > that is the answer.) It's not just parameters, it's all over the code: $ grep optimize_[a-z]*_for_ *.c | wc -l 219 $ grep -l optimize_[a-z]*_for_ *.c | wc -l 66 (and that does not include the target code -- there are about 300 more in all targets combined). Segher ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-31 17:30 ` Segher Boessenkool @ 2019-10-31 17:47 ` Richard Earnshaw (lists) 0 siblings, 0 replies; 20+ messages in thread From: Richard Earnshaw (lists) @ 2019-10-31 17:47 UTC (permalink / raw) To: Segher Boessenkool, David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help On 31/10/2019 17:30, Segher Boessenkool wrote: > On Thu, Oct 31, 2019 at 09:10:34AM +0100, David Brown wrote: >> I think there is a general attitude of "I'm compiling for a small >> microcontroller with only 32 KB flash - so I must use flags to make the >> code as small as possible". In fact, if you have 32 KB flash then you >> need to make your code at most 32 KB in size - code of 31 KB is just as >> good as code of 17 KB. As long as your code fits, any further reduction >> in size is unnecessary. > > Yeah, unless you're already at 31kB and now have to add a feature :-) > >> I haven't reported any "-O2 smaller than -Os" bugs, as far as I can >> remember, but have merely noted it as a pattern that I see sometimes. I >> will try to keep an eye out for when I see it again. > > Thanks! > >> Do you have any reference for the heuristic parameters set by -Os ? (I >> am perfectly happy with "it's in this file in the gcc source code", if >> that is the answer.) > > It's not just parameters, it's all over the code: > > $ grep optimize_[a-z]*_for_ *.c | wc -l > 219 > > $ grep -l optimize_[a-z]*_for_ *.c | wc -l > 66 > > (and that does not include the target code -- there are about 300 more in > all targets combined). > > > Segher > And that's not to mention all the places like rtx_costs that turn that into a local function argument with a different name - often something like time_p, or size_p. R. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 22:49 ` Segher Boessenkool 2019-10-31 8:10 ` David Brown @ 2019-11-03 12:24 ` Oleg Endo 2019-11-03 18:17 ` Segher Boessenkool 1 sibling, 1 reply; 20+ messages in thread From: Oleg Endo @ 2019-11-03 12:24 UTC (permalink / raw) To: Segher Boessenkool, David Brown; +Cc: Jeff Law, Mikael Tillenius, gcc-help On Wed, 2019-10-30 at 17:49 -0500, Segher Boessenkool wrote: > > But if you have no time to waste, -Os usually give you smaller code than > -O2, even if some of the choices on the size <-> speed tradeoff are a bit > questionable (bug reports or patches welcome, of course!) And sometimes > -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after > all. But if you see this often (for some specific target?), please file > a PR for that as well. > I've got at least one MCU project which needs to fit into certain amount of ROM. As long as it stays within that limit, anything is fine. Each time I updated the compiler, starting from GCC 6, then 7 and finally on GCC 8, there were issues with the code size -- the compiled output just kept getting bigger and bigger. At some point I even had to roll my own GCC patch to reverse one of the changes to the heuristics in some place. But the need for that was only temporary. How to report this kind of regression as a PR? Obviously I can't attach the full source code and build system and whatnot (which is not open source to begin with). And for the same reason, trying to extract a reproducer also seems too time consuming and not feasible. Any ideas, suggestions or strategies how to report these kinds of problems? Cheers, Oleg ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-11-03 12:24 ` Oleg Endo @ 2019-11-03 18:17 ` Segher Boessenkool 0 siblings, 0 replies; 20+ messages in thread From: Segher Boessenkool @ 2019-11-03 18:17 UTC (permalink / raw) To: Oleg Endo; +Cc: David Brown, Jeff Law, Mikael Tillenius, gcc-help Hi! On Sun, Nov 03, 2019 at 09:24:17PM +0900, Oleg Endo wrote: > On Wed, 2019-10-30 at 17:49 -0500, Segher Boessenkool wrote: > > But if you have no time to waste, -Os usually give you smaller code than > > -O2, even if some of the choices on the size <-> speed tradeoff are a bit > > questionable (bug reports or patches welcome, of course!) And sometimes > > -Os is unavoidably a bit bigger than -O2, it just sets *heuristics* after > > all. But if you see this often (for some specific target?), please file > > a PR for that as well. > > I've got at least one MCU project which needs to fit into certain > amount of ROM. As long as it stays within that limit, anything is > fine. Each time I updated the compiler, starting from GCC 6, then 7 > and finally on GCC 8, there were issues with the code size -- the > compiled output just kept getting bigger and bigger. At some point I > even had to roll my own GCC patch to reverse one of the changes to the > heuristics in some place. But the need for that was only temporary. > > How to report this kind of regression as a PR? Obviously I can't > attach the full source code and build system and whatnot (which is not > open source to begin with). And for the same reason, trying to extract > a reproducer also seems too time consuming and not feasible. > > Any ideas, suggestions or strategies how to report these kinds of > problems? You can still report it, of course. Someone might have an idea what causes it, esp. if you give enough info. It also helps if you can at least partially bisect it, find when the problem was introduced. It also helps a lot if you know the area in the compiler that regressed, or have a motivated guess for that anyway :-) Segher ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-29 20:19 ` Jeff Law 2019-10-30 0:34 ` Segher Boessenkool @ 2019-10-30 16:12 ` Mikael Tillenius 2019-11-01 13:22 ` Mikael Tillenius 1 sibling, 1 reply; 20+ messages in thread From: Mikael Tillenius @ 2019-10-30 16:12 UTC (permalink / raw) To: Jeff Law, gcc-help Thanks, I have filed a bug report now. /Mikael ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Non-optimal code generated for H8 2019-10-30 16:12 ` Mikael Tillenius @ 2019-11-01 13:22 ` Mikael Tillenius 0 siblings, 0 replies; 20+ messages in thread From: Mikael Tillenius @ 2019-11-01 13:22 UTC (permalink / raw) To: gcc-help Ok, I have looked more into the problem. For now I have a decent work around in my code using "volatile" and an almost working patch for GCC. I am not sure if I have time to work more on the GCC patch right now. But just to sum up, here is what I found so far. Given the code struct s {    char a, b;    char c[11]; } x[2]; void test(int n) {    struct s *sp = &x[n];    sp->a = 1;    sp->b = 1; } For plain H8/300 (no specific options) and H8S (-ms) GCC does not find a suitable multiplication instruction and instead generates a call to __mulhi3/__mulsi3. For some reason CSE is not able to do its work so the pointer value is recalculated for each access. H8300 has no multiplication instruction. H8S has only a 16x16 => 32 bit multiplication (mulxu.w).    stm.l   er4-er5,@-er7    mov.w   r0,r4    extu.l   er4    sub.l   er1,er1    add.b   #13,r1l    mov.l   er4,er0    jsr   @___mulsi3    mov.b   #1,r5l    mov.b   r5l,@(_x,er0)    sub.l   er1,er1    add.b   #13,r1l    mov.l   er4,er0    jsr   @___mulsi3    mov.b   r5l,@(_x+1,er0)    ldm.l   @er7+,er4-er5    rts For H8/300H in 16-bit mode (-mh -mn) and H8SX (-mx) GCC finds the appropriate instructions and CSE does its work. Only one mulxu.w/mulu.l is generated and the pointer value is reused. If I add a suitable multiplication insn that generates the call to __mulsi3 for H8S GCC will use it and CSE will work. "b" and "e" are new constraints that require r0 and r1. The problem is that this insn collides with the already exsiting insn for H8SX. (define_insn "mulsi3"  [(set (match_operand:SI 0 "register_operand" "=b")        (mult:SI (match_operand:SI 1 "register_operand" "%0")        (match_operand:SI 2 "register_operand" "e")))]  "TARGET_H8300S"  "jsr\\t@___mulsi3"  [(set_attr "length" "2")   (set_attr "cc" "set_zn")]) ---    extu.l   er0    sub.l   er1,er1    add.b   #13,r1l    jsr   @___mulsi3    add.l   #_x,er0    mov.b   #1,r2l    mov.b   r2l,@er0    mov.b   r2l,@(1,er0)    rts If I also add the following I get almost perfect code. "small_operand" is a new predicate that requires a 16-bit value. (define_insn "*mulhisi3"  [(set (match_operand:SI 0 "register_operand" "=r")    (mult:SI (sign_extend:SI (match_operand:HI 1 "register_operand" "%0"))        (match_operand:SI 2 "small_operand" "")))]  "TARGET_H8300H || TARGET_H8300S"  "mov.w\\t%T2,%e0\;mulxs.w\\t%e0,%S0"  [(set_attr "length" "2")   (set_attr "cc" "none_0hit")]) ---    mov.w   #13,e0    mulxs.w   e0,er0    add.l   #_x,er0    mov.b   #1,r2l    mov.b   r2l,@er0    mov.b   r2l,@(1,er0)    rts Thanks for the help. /Mikael ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2019-11-03 18:17 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-10-29 20:03 Non-optimal code generated for H8 Mikael Tillenius 2019-10-29 20:19 ` Jeff Law 2019-10-30 0:34 ` Segher Boessenkool 2019-10-30 9:07 ` David Brown 2019-10-30 16:32 ` Mikael Tillenius 2019-10-30 19:53 ` Mikael Tillenius 2019-10-30 19:59 ` Jeff Law 2019-10-31 1:01 ` Segher Boessenkool 2019-11-01 13:16 ` Mikael Tillenius 2019-11-01 14:38 ` Jeff Law 2019-10-30 22:49 ` Segher Boessenkool 2019-10-31 8:10 ` David Brown 2019-10-31 16:24 ` Richard Earnshaw (lists) 2019-10-31 18:21 ` David Brown 2019-10-31 17:30 ` Segher Boessenkool 2019-10-31 17:47 ` Richard Earnshaw (lists) 2019-11-03 12:24 ` Oleg Endo 2019-11-03 18:17 ` Segher Boessenkool 2019-10-30 16:12 ` Mikael Tillenius 2019-11-01 13:22 ` Mikael Tillenius
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).