On 12.10.22 19:57, Jeff Law wrote:
>
> On 10/4/22 11:06, Alexander Binzberger via Gcc-patches wrote:
>> Hi,
>> recently I used some arduino uno for a project and realized some areas
>> which do not output optimal asm code. Especially around shifts and 
>> function
>> calls.
>> With this as motivation and hacktoberfest I started patching things.
>> Since patch files do not provide a good overview and I hope for a
>> "hacktoberfest-accepted" label on the PR on github I also opened it 
>> there:
>> https://github.com/gcc-mirror/gcc/pull/73
>>
>> This patch improves shifts with const right hand operand. While 8bit and
>> 16bit shifts where mostly fine 24bit and 32bit where not handled well.
>>
>> Testing
>> I checked output with a local installation of compiler explorer in 
>> asm and
>> a tiny unit test comparing shifts with mul/div by 2.
>> I however did not write any testcases in gcc for it.
>>
>> Target
>> This patch is only targeting atmel avr family of chips.
>>
>> Changelog
>> improved const shifts for AVR targets
>
> It would be helpful if you could show the before/after code for the 
> cases you're changing.  Extra credit if you include cycles & size 
> information for those cases.  That would help someone like me who 
> knows GCC well, but isn't particularly well versed in the AVR target 
> evaluate the overarching goal of the patch (ie, better code).

about the avr family targets:

* consider every branch instruction = 1/2 cycles

* consider every 2byte/word instruction (besides move word if available) 
= 2 cycles

* consider multiplication (if available) = 2 cycles

* consider every load (beside load immediate "ldi" 1cylce) = 2cycles (+1 
for prog mem)

* pop and jump mostly = 2 cycles

* call is basically = 2-4 cycles

* ret is about =  4/5 cycles

* consider every instruction (bit/bit-test, most compare, arithmetic, 
logic, some other) = 1 cycle

* division does not exist

or as a summary for this patch: branches and such are 2 cycles the rest 
is 1 cycle

note that shifts are 1bit per cycle and the instructions are at least 
mostly byte based.

also note that operations using immediate do only work with the upper 
half of registers.


a description for the code before my change and what changed:

* shifts on 8bit (beside arithmetic shifts right) were optimized and 
always unrolled (only aligned with the rest of the code without actual 
change)

* arithmetic shift 8bit and 16bit shifts were mostly optimized and 
mostly unrolled - depending on registers and Os (I added the missing 
cases there)

* 24bit and 32bit shifts were basically not optimized at all and never 
unrolled (I added those cases and aligned the optimizer logic with the 
others. They also reuse the other shift code since they may reduce to 
those cases after a move for bigger shifts.)

* out_shift_with_cnt provides a fallback implementation as a loop over 
shifts which may get unrolled. in case of Os to about inner_len + 3,4 or 
5 and in other cases of optimizer e.g. O2 it gets unrolled if size is 
smaller 10. see max_len (basically unchanged)

* did not touch non const cases in this patch but may in a future patch 
for O2 and O3

note that in case of Os the smaller code is picked which is the loop at 
least in some cases but other optimizer cases profit a lot.

also note that it is debatable if Os needs to be that strict with size 
since the compute overhead of the loop is high with 5 per loop 
iteration/cycle- so per bit shift. A lot more cases could be covered 
with +1 or +2 more instructions.


about plen:

If plen is NULL the asm code gets returned.

If plen is a pointer the code does count the instruction count which I 
guess is used (or could be used) as a rough estimate of cycles as well 
as byte code size.

Some of the functions named this len. The 24bit functions mainly named 
this plen and used it like it is now in all functions. This is mostly a 
readability improvement.

I am not sure how this works together with the optimizer or the rest.

To my understanding however the functions may get called once by the 
optimizer with a length given, then to output code and potentially again 
with a len given over avr_adjust_length to return the size.

I may be wrong about this part but as far as I can tell I did not change 
the way it operates.


size and cycles summary:

The asm instruction count is used as size and cycle estimate. This gets 
close to the real deal for the shifts since the instructions are all 1 
cylce anyway and similar in byte code size.

8bit gets always optimized and unrolled to get max performance and less 
code size (beside shift of 6 with lower half registers used which is the 
worst case with +1 instruction).

16bit, 24bit and 32bit gets unrolled depending on optimizer setting - 
and registers used (see out_shift_with_cnt:max_len). So 16bit gets 
mostly optimized and unrolled in Os (see comments for plen/max_len) and 
always in O2 and such (max_len=10). Shift optimization and unroll for 
24bit and 32bit is mostly only relevant when not optimizing for size.

I think the move (for shifts bigger than a multiple of 8 optimization) 
should always be done if possible since this is a tiny space overhead 
but a big performance gain. Also bigger shifts on bigger variables may 
not be common any way.

>
> Changes should include a ChangeLog which indicates what changed. If 
> you look at git log you will see examples of what a ChangeLog should 
> look like.

are you fine with something like the following? one line for all 
functions then, same text?

gcc/ChangeLog:

* config/avr/avr.cc (ashlqi3_out): improved const shifts optimization

>
> The is large enough that you need either a  copyright assignment or 
> DCO certification.
>
> See this page for details:
>
> https://gcc.gnu.org/contribute.html

Signed-off-by: Alexander Binzberger <alexander.binzberger@gmail.com> 
<mailto:alexander.binzberger@gmail.com>

Is this enough or did I miss something?

>
>
>>
>> Patch
>> -----
>> diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
>> index 4ed390e4cf9..c7b70812d5c 100644
>> --- a/gcc/config/avr/avr.cc
>> +++ b/gcc/config/avr/avr.cc
>> @@ -6043,9 +6043,6 @@ out_shift_with_cnt (const char *templ, rtx_insn
>> *insn, rtx operands[],
>>     op[2] = operands[2];
>>     op[3] = operands[3];
>>
>> -  if (plen)
>> -    *plen = 0;
>> -
>
> Doesn't this leave *plen uninitialized for the case where the shift 
> count is held in memory or a register or is an out of range constant?  
> Is this really safe?

To my understanding this is basically how the code was working before - 
especially in the 24bit functions. It checks for NULL in avr_asm_len 
function which is why I use that now everywhere. This is why this looks 
fine to me but more eyes on this is a good idea ;)

Since I am not sure how this plen gets used in other parts of the code 
or by the optimizer I wonder if it should get set to 0 at all at this 
place. I guess this must happen somewhere up in the call tree anyway and 
should maybe not get set there. This however is consistent with the rest 
of the code in this file as far as I can tell.

I would be very happy if you could have a second look at this topic.

>
>
>
>>     if (CONST_INT_P (operands[2]))
>>       {
>>         /* Operand 3 is a scratch register if this is a
>> @@ -6150,96 +6147,68 @@ out_shift_with_cnt (const char *templ, rtx_insn
>> *insn, rtx operands[],
>>   /* 8bit shift left ((char)x << i)   */
>>
>>   const char *
>> -ashlqi3_out (rtx_insn *insn, rtx operands[], int *len)
>> +ashlqi3_out (rtx_insn *insn, rtx operands[], int *plen)
>>   {
>>     if (CONST_INT_P (operands[2]))
>>       {
>> -      int k;
>> -
>> -      if (!len)
>> - len = &k;
>> -
>
> Isn't this wrong for the call to ashlqi3_out from avr.md?  In that 
> call len/plen will be zero, which we then pass down.  So the question 
> is why did you remove this code?
About half of the code was written with plen and avr_asm_len the other 
parts like this. The plen variation was the more flexible one and made 
some changes down the road more easy so I made it use this one consistently.
>
>
> The patch as-is is relatively large and can easily be broken down into 
> more manageable chunks.  I would suggest a patch for each mode.  ie, 
> one which changes QImode shifts, another for HImode shifts, another 
> for PSImode shifts. etc.  It may seem like more work, but by breaking 
> it down reviewers can take action on each patch individually.  So for 
> example its relatively easy to work through the QImode changes and 
> those could go in fairly quick while the PSImode changes will require 
> considerably more time to review.

Due to the plen change (needed for some readability improvements, more 
consistent code but also more simple changes) and other minor 
readability improvements I decided to go with one patch for all const cases.

Are you sure you want this extra work and overhead?

Would you prefer them as a set of patches/mails (with patch 1/x) or 
single patch mails?

Maybe as github pull requests with multiple commits?

I am motivated to do what fits you best.

I will fix the spacing after your answer to this.

>
>
>>         switch (INTVAL (operands[2]))
>>    {
>>    default:
>>     if (INTVAL (operands[2]) < 8)
>>       break;
>>
>> -  *len = 1;
>> -  return "clr %0";
>> -
>> - case 1:
>> -  *len = 1;
>> -  return "lsl %0";
>> -
>> - case 2:
>> -  *len = 2;
>> -  return ("lsl %0" CR_TAB
>> -  "lsl %0");
>> -
>> - case 3:
>> -  *len = 3;
>> -  return ("lsl %0" CR_TAB
>> -  "lsl %0" CR_TAB
>> -  "lsl %0");
>> +    return avr_asm_len ("clr %0", operands, plen, 1);
>
> You've probably got a whitespace problem here.  I think the return 
> should line up in the came column as the IF statement.
yes, this tab=8 and is mixed with spaces thing confused me and the 
editor I think. Will fix this with the next iteration, see above comment.
> Conceptually this seems reasonable as cases 1, 2 and 3 can be 
> trivially handled by out_shift_with_cnt.  Tough routing more code 
> through out_shift_with_cnt means the comment might need to change 
> since we're routing more cases through it that were trivially handled 
> in ashlqi3_out before.

This is one of the readability improvements. There is no optimization 
for cases 1-3 I know of and out_shift_with_cnt will unroll them anyway 
in those cases resulting in the same asm code.

Maybe I should add a comment there.

>
>
>>
>>    case 4:
>>     if (test_hard_reg_class (LD_REGS, operands[0]))
>>       {
>> -      *len = 2;
>> -      return ("swap %0" CR_TAB
>> -      "andi %0,0xf0");
>> +        return avr_asm_len ("swap %0" CR_TAB
>> +          "andi %0,0xf0", operands, plen, 2);
> More indention problems here.  THe return should line up two spaces 
> inside the open curly brace.  Otherwise this case seems reasonable 
> since it's generating the same code as before.
see plen change explanation.
>>       }
>> -  *len = 4;
>> -  return ("lsl %0" CR_TAB
>> +    return avr_asm_len ("lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>> -  "lsl %0");
>> +      "lsl %0", operands, plen, 4);
>
> Gratuitous indentation changes.  Please don't do that unless you're 
> fixing cases where the indentation is wrong according to GNU/project 
> standards.
Will fix this with next iteration, see above comment.
>
>
>>
>>    case 5:
>>     if (test_hard_reg_class (LD_REGS, operands[0]))
>>       {
>> -      *len = 3;
>> -      return ("swap %0" CR_TAB
>> +        return avr_asm_len ("swap %0" CR_TAB
>>         "lsl %0"  CR_TAB
>> -      "andi %0,0xe0");
>> +          "andi %0,0xe0", operands, plen, 3);
>
> It looks like you're introducing gratuitous indentation changes here.  
> Please don't do that.  Otherwise this looks sensible as again, it 
> doesn't change the generated code.
>
>
>
>>       }
>> -  *len = 5;
>> -  return ("lsl %0" CR_TAB
>> +    return avr_asm_len ("lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>> -  "lsl %0");
>> +      "lsl %0", operands, plen, 5);
>
> Similarly.
>
>
>>
>>    case 6:
>>     if (test_hard_reg_class (LD_REGS, operands[0]))
>>       {
>> -      *len = 4;
>> -      return ("swap %0" CR_TAB
>> +        return avr_asm_len ("swap %0" CR_TAB
>>         "lsl %0"  CR_TAB
>>         "lsl %0"  CR_TAB
>> -      "andi %0,0xc0");
>> +          "andi %0,0xc0", operands, plen, 4);
>>       }
>> -  *len = 6;
>> -  return ("lsl %0" CR_TAB
>> +    return avr_asm_len ("lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>>     "lsl %0" CR_TAB
>> -  "lsl %0");
>> +      "lsl %0", operands, plen, 6);
>>
>>    case 7:
>> -  *len = 3;
>> -  return ("ror %0" CR_TAB
>> +    return avr_asm_len ("ror %0" CR_TAB
>>     "clr %0" CR_TAB
>> -  "ror %0");
>> +      "ror %0", operands, plen, 3);
>>    }
>
> Similarly for these cases.
>
>
> I don't have the time right now to go through the other modes. But I 
> would suggest you take the feedback above and apply the same concepts 
> to the changes for the other modes.
Thanks for your time!
>
>
> Note for other reviewers, without an assignment or DCO cert, this 
> can't go forward.
see sign-off above. is this fine? something else needed?
>
>
> jeff
>