[Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value
@ 2020-07-02 12:45 zhongyunde at tom dot com
  2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-07-02 12:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

            Bug ID: 96031
           Summary: suboptimal codegen for store low 16-bits value
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following code, as instruction strh only store the low 16-bits value,
so the 'and     w2, w2, 65535 ' is redundant.
test base on the ARM64 gcc 8.2 on https://gcc.godbolt.org/, so get complicated
assemble.

typedef unsigned int UINT32;
typedef unsigned short UINT16;


UINT16 array[12];

void foo (UINT32 len, UINT32 step)              
{
    UINT32 index = 1;

    for (index = 1 ; index < len; index++ )
        {
            array[index] = index * step;
        }
}

// the assemble of kernel loop body --------------------------
        b       .L4         //
.L6:
        add     x3, x3, 2 // ivtmp.6, ivtmp.6,
.L4:
        strh    w2, [x4, 2]     // ivtmp.4, MEM[base: _2, offset: 2B]
        add     w2, w1, w2        // tmp105, _12, ivtmp.4
        and     w2, w2, 65535     // ivtmp.4, tmp105 ????
        cmp     x3, x0    // ivtmp.6, _23
        mov     x4, x3    // ivtmp.6, ivtmp.6
        bne     .L6             //,

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
  2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
@ 2020-07-06 11:01 ` zhongyunde at tom dot com
  2020-07-19 14:32 ` amker at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-07-06 11:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #1 from zhongyunde at tom dot com <zhongyunde at tom dot com> ---
this may can be enhance by ivopts.
If the case adjusted as following, then the 'and     w2, w2, 65535 ' will
disappear.


typedef unsigned int UINT32;
typedef unsigned short UINT16;


UINT16 array[12];

void foo (UINT32 len, UINT32 step)              
{
    UINT32 index = 0;
    UINT32 sum = 0;
    for (index = 0; index < len; index++ )
        {  
            array[index] = sum;
            sum += step;
        }
}

// the assemble of kernel loop body --------------------------
.L9:
        add     x2, x2, 2 // ivtmp.6, ivtmp.6,
.L3:
        strh    w3, [x4]        // sum, MEM[base: _12, offset: 0B]
        cmp     x2, x0    // ivtmp.6, _22
        add     w3, w3, w1        // sum, sum, step
        mov     x4, x2    // ivtmp.6, ivtmp.6
        bne     .L9             //,

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
  2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
  2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
@ 2020-07-19 14:32 ` amker at gcc dot gnu.org
  2020-07-20 12:25 ` zhongyunde at tom dot com
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: amker at gcc dot gnu.org @ 2020-07-19 14:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #2 from bin cheng <amker at gcc dot gnu.org> ---
Interesting case, I see two issues in generated asm.  One is the unnecessary
bitwise and, the other is allocating different registers for induction variable
and the base address.  However, looks like neither issue is caused by ivopts. 
Check the dump:

431   <bb 4> [local count: 105119324]:
432   _12 = (short unsigned int) step_8(D);
433   ivtmp.10_11 = (unsigned long) &array;
434   _18 = len_7(D) + 4294967294;
435   _19 = (unsigned long) _18;
436   _20 = _19 * 2;
437   _21 = (unsigned long) &array;
438   _22 = _21 + 2;
439   _23 = _20 + _22;
440
441   <bb 5> [local count: 955630224]:
442   # ivtmp.8_15 = PHI <_12(4), ivtmp.8_5(6)>
443   # ivtmp.10_16 = PHI <ivtmp.10_11(4), ivtmp.10_4(6)>
444   _3 = ivtmp.8_15;
445   _2 = (void *) ivtmp.10_16;
446   MEM[base: _2, offset: 2B] = _3;
447   ivtmp.8_5 = ivtmp.8_15 + _12;
448   ivtmp.10_4 = ivtmp.10_16 + 2;
449   if (ivtmp.10_4 != _23)
450     goto <bb 6>; [89.00%]
451   else
452     goto <bb 8>; [11.00%]
453
454   <bb 8> [local count: 105119324]:
455   goto <bb 3>; [100.00%]
456
457   <bb 6> [local count: 850510900]:
458   goto <bb 5>; [100.00%]

As far as I can tell, it's optimal.

The register allocation issue is introduced by rtl PRE, apparently we should
not save the add 2 instruction in the last iteration with a false dependence
which is more harmful.

As for ivopt, I can see a minor improvement by replacing != exit condition with
<=, thus saving add 2 instruction computing _22, which happens to "disable" the
wrong PRE transformation.

Ah, I see it's already classified as rtl-optimization.

Thanks

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
  2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
  2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
  2020-07-19 14:32 ` amker at gcc dot gnu.org
@ 2020-07-20 12:25 ` zhongyunde at tom dot com
  2020-08-26  4:04 ` zhongyunde at tom dot com
  2021-08-22  9:00 ` pinskia at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-07-20 12:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #3 from zhongyunde at tom dot com <zhongyunde at tom dot com> ---
I find there is some different between the two cases during in ivopts.

For the 2nd case, a UINT32 type iv sum is choosed
     <bb 3> [local count: 955630224]:
  # sum_15 = PHI <0(5), sum_9(6)>
  # ivtmp.10_17 = PHI <ivtmp.10_3(5), ivtmp.10_4(6)>
  _2 = (short unsigned int) sum_15;
  _1 = _2;
  _11 = (void *) ivtmp.10_17;
  MEM[base: _11, offset: 0B] = _1;
  sum_9 = step_8(D) + sum_15;
  ivtmp.10_4 = ivtmp.10_17 + 2;
  if (ivtmp.10_4 != _22)
    goto <bb 6>; [89.00%]

For the 1st case, a 'short unsigned int type' ivtmp.8 is choosed as your dump
showed, and there is no UINT32 type candidate with Step step.

typedef unsigned int UINT32;
typedef unsigned short UINT16;

UINT16 array[12];

void foo (UINT32 len, UINT32 step)              
{
    UINT32 index = 0;
    UINT32 sum = 0;
    for (index = 0; index < len; index++ )
        {  
            sum = index * step;
            array[index] = sum;
        }
}

I tried to add a UINT32 type temporary sum as above case （the 3rd case）, then
modify the gcc to add an UINT32 type candidate variable and adjust the cost to
choose the Candidate variable （do the similar things as the 2nd case in ivopt）,
then we can also optimize the 'and     w2, w2, 65535' insn.
But above method is not conformed to the implementation method of ivopt, may be
we need extend an UINT32 candidate variable base 'on short unsigned int' IV
struct ?

===== the change of gcc to add UINT32 type candidate variable
==================
@@ -3389,7 +3389,7 @@ add_iv_candidate_for_bivs (struct ivopts_data *data)
   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
     {
       iv = ver_info (data, i)->iv;
-      if (iv && iv->biv_p && !integer_zerop (iv->step))
+      if (iv && !integer_zerop (iv->step))
        add_iv_candidate_for_biv (data, iv);
     }
 }

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
  2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
                   ` (2 preceding siblings ...)
  2020-07-20 12:25 ` zhongyunde at tom dot com
@ 2020-08-26  4:04 ` zhongyunde at tom dot com
  2021-08-22  9:00 ` pinskia at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-08-26  4:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #4 from zhongyunde at tom dot com <zhongyunde at tom dot com> ---

> As for ivopt, I can see a minor improvement by replacing != exit condition
> with <=, thus saving add 2 instruction computing _22, which happens to
> "disable" the wrong PRE transformation.
> 
  I take a look at the function may_eliminate_iv, now iv_elimination_compare
will only return EQ_EXPR or NE_EXPR, so do you mean to do some extend for this
case?

5411   *bound = fold_convert (TREE_TYPE (cand->iv->base),
5412                          aff_combination_to_tree (&bnd));
5413   *comp = iv_elimination_compare (data, use);
5414

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
  2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
                   ` (3 preceding siblings ...)
  2020-08-26  4:04 ` zhongyunde at tom dot com
@ 2021-08-22  9:00 ` pinskia at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-22  9:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=57231
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-08-22
     Ever confirmed|0                           |1

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed. There might be a few others like this too.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-08-22  9:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
2020-07-19 14:32 ` amker at gcc dot gnu.org
2020-07-20 12:25 ` zhongyunde at tom dot com
2020-08-26  4:04 ` zhongyunde at tom dot com
2021-08-22  9:00 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).