[Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function
@ 2023-12-09  4:16 xry111 at gcc dot gnu.org
  2023-12-09  4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
                   ` (22 more replies)
  0 siblings, 23 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  4:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

            Bug ID: 112935
           Summary: [14 Regression] Performance regression in Coremarks
                    crcu8 function
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: xry111 at gcc dot gnu.org
  Target Milestone: ---

typedef __UINT8_TYPE__ ee_u8;
typedef __UINT16_TYPE__ ee_u16;

ee_u16 crcu8(ee_u8 data, ee_u16 crc) {
  ee_u8 i = 0, x16 = 0, carry = 0;

  for (i = 0; i < 8; i++) {
    x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
    data >>= 1;

    if (x16 == 1) {
      crc ^= 0x4002;
      carry = 1;
    } else
      carry = 0;
    crc >>= 1;
    if (carry)
      crc |= 0x8000;
    else
      crc &= 0x7fff;
  }
  return crc;
}

With GCC 13.2.0 -O2, on LoongArch we get:

.L2:
        xor     $r12,$r4,$r14
        andi    $r12,$r12,1
        sub.w   $r12,$r0,$r12
        srli.w  $r4,$r4,1
        and     $r12,$r12,$r15
        addi.w  $r13,$r13,-1
        xor     $r12,$r12,$r4
        bstrpick.w      $r13,$r13,7,0
        srli.d  $r14,$r14,1
        bstrpick.w      $r4,$r12,15,0
        bnez    $r13,.L2

With GCC 14.0.0 -O2:

.L2:
        xor     $r12,$r4,$r14
        andi    $r12,$r12,1
        mul.w   $r12,$r12,$r15
        srli.w  $r4,$r4,1
        addi.w  $r13,$r13,-1
        bstrpick.w      $r13,$r13,7,0
        srli.d  $r14,$r14,1
        xor     $r12,$r12,$r4
        bstrpick.w      $r4,$r12,15,0
        bnez    $r13,.L2

mul.w is slower than sub.w + and.

I'm now setting components to tree-optimization because the difference already
exists in 254t.optimized vs 263t.optimized.  But maybe the tree optimizer is
doing things correctly and we should just add a target-specific optimization.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
@ 2023-12-09  4:20 ` xry111 at gcc dot gnu.org
  2023-12-09  4:20 ` xry111 at gcc dot gnu.org
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  4:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #1 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Note that for 

int t(int x, _Bool y)
{
        return x * y;
}

even GCC 13 is generating the sub-optimal mul.w instruction.  So perhaps this
is just a target issue after all...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
  2023-12-09  4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
@ 2023-12-09  4:20 ` xry111 at gcc dot gnu.org
  2023-12-09  4:22 ` pinskia at gcc dot gnu.org
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  4:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Xi Ruoyao <xry111 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2023-12-09
     Ever confirmed|0                           |1

--- Comment #2 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Self-confirming as this report is actually from Jia Jie.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
  2023-12-09  4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
  2023-12-09  4:20 ` xry111 at gcc dot gnu.org
@ 2023-12-09  4:22 ` pinskia at gcc dot gnu.org
  2023-12-09  4:35 ` [Bug target/112935] " pinskia at gcc dot gnu.org
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  4:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-12-09  4:22 ` pinskia at gcc dot gnu.org
@ 2023-12-09  4:35 ` pinskia at gcc dot gnu.org
  2023-12-09  4:49 ` pinskia at gcc dot gnu.org
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  4:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |missed-optimization
             Target|loongarch64-*-*             |loongson
          Component|tree-optimization           |target

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Either r14-1655-g52c92fb3f40050 or r14-1654-g7ceed7e3e29c33 causes the
difference in the gimple level.

BUT
`a * boolean` is Canonical form and expr.cc expands it as `a & -boolean`:
```
      /* Expand X*Y as X&-Y when Y must be zero or one.  */
...
              bool speed = optimize_insn_for_speed_p ();
              int cost = add_cost (speed, mode) + neg_cost (speed, mode);
              struct algorithm algorithm;
              enum mult_variant variant;
              if (CONST_INT_P (op1)
                  ? !choose_mult_variant (mode, INTVAL (op1),
                                          &algorithm, &variant, cost)
                  : cost < mul_cost (speed, mode))
                {
                  target = bit0_p ? expand_and (mode, negate_rtx (mode, op0),
                                                op1, target)
                                  : expand_and (mode, op0,
                                                negate_rtx (mode, op1),
                                                target);
                  return REDUCE_BIT_FIELD (target);
                }
```

Which means the cost model for loongson is wrong here.
Though it is comparing the cost of doing + and - to a * here though it should
be & and - but most target's + and & have a similar cost.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-12-09  4:35 ` [Bug target/112935] " pinskia at gcc dot gnu.org
@ 2023-12-09  4:49 ` pinskia at gcc dot gnu.org
  2023-12-09  6:56 ` xry111 at gcc dot gnu.org
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  4:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
/* Default RTX cost initializer.  */
...
    int_mult_si (COSTS_N_INSNS (1)),
    int_mult_di (COSTS_N_INSNS (1)),


That seems wrong.
I suspect you will get other improvements when you touch this.

E.g.
```
int f(int t)
{
        return t * 17;
}
```
Should really be:
shift followed by an add.
But currently is just a mult.

What is interesting is I think -Os cost is the opposite from the -O2 cost ...

That is -Os produces the better code generation due to the cost for mult being
set to 4:
/* RTX costs to use when optimizing for size.  */
...
    .int_mult_si_ (4)
    .int_mult_di_ (4)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-12-09  4:49 ` pinskia at gcc dot gnu.org
@ 2023-12-09  6:56 ` xry111 at gcc dot gnu.org
  2023-12-09  7:02 ` xry111 at gcc dot gnu.org
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  6:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #5 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #4)
> /* Default RTX cost initializer.  */
> ...
>     int_mult_si (COSTS_N_INSNS (1)),
>     int_mult_di (COSTS_N_INSNS (1)),
> 
> 
> That seems wrong.
> I suspect you will get other improvements when you touch this.
> 
> E.g.
> ```
> int f(int t)
> {
>         return t * 17;
> }
> ```
> Should really be:
> shift followed by an add.
> But currently is just a mult.
> 
> What is interesting is I think -Os cost is the opposite from the -O2 cost ...
> 
> That is -Os produces the better code generation due to the cost for mult
> being set to 4:
> /* RTX costs to use when optimizing for size.  */
> ...
>     .int_mult_si_ (4)
>     .int_mult_di_ (4)

4 is just COSTS_N_INSNS(1), so in -Os we are making all instructions cost the
same.  This should be correct because in -Os we should minimize the number of
the instructions.  In loongarch_rtx_costs though we have:

    case MULT: 
      if (float_mode_p)
        *total = loongarch_fp_mult_cost (mode);
      else if (mode == DImode && !TARGET_64BIT)
        *total = (speed
                  ? loongarch_cost->int_mult_si * 3 + 6 
                  : COSTS_N_INSNS (7)); 
      else if (!speed)
        *total = COSTS_N_INSNS (1) + 1;
      else if (mode == DImode)
        *total = loongarch_cost->int_mult_di;
      else  
        *total = loongarch_cost->int_mult_si;
      return false;

so we still slightly penalty multiplication.  To me we should code
COSTS_N_INSNS (1) + 1 into loongarch_rtx_cost_optimize_size instead of special
casing it in loongarch_rtx_costs.

For the default value (used when -O2) I'll do some micro-benchmark...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-12-09  6:56 ` xry111 at gcc dot gnu.org
@ 2023-12-09  7:02 ` xry111 at gcc dot gnu.org
  2023-12-09  7:05 ` pinskia at gcc dot gnu.org
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  7:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #6 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
On a LA664 it seems a mul.w instruction costs 4 times a "simple" instruction
like add.w/sub.w/and, and div.w costs 5 times.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-12-09  7:02 ` xry111 at gcc dot gnu.org
@ 2023-12-09  7:05 ` pinskia at gcc dot gnu.org
  2023-12-09  7:48 ` xry111 at gcc dot gnu.org
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  7:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #5)
> 
> so we still slightly penalty multiplication.  To me we should code
> COSTS_N_INSNS (1) + 1 into loongarch_rtx_cost_optimize_size instead of
> special casing it in loongarch_rtx_costs.

Oh yes slightly penalty is definitely not going make a huge difference if the
cost of an mult instruction is worse than an and and an neg.

> 
> For the default value (used when -O2) I'll do some micro-benchmark...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2023-12-09  7:05 ` pinskia at gcc dot gnu.org
@ 2023-12-09  7:48 ` xry111 at gcc dot gnu.org
  2023-12-09  7:56 ` xry111 at gcc dot gnu.org
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  7:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #8 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #7)
> (In reply to Xi Ruoyao from comment #5)
> > 
> > so we still slightly penalty multiplication.  To me we should code
> > COSTS_N_INSNS (1) + 1 into loongarch_rtx_cost_optimize_size instead of
> > special casing it in loongarch_rtx_costs.
> 
> Oh yes slightly penalty is definitely not going make a huge difference if
> the cost of an mult instruction is worse than an and and an neg.
> 
> > 
> > For the default value (used when -O2) I'll do some micro-benchmark...

I've changed it to

/* Default RTX cost initializer.  */
loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
  : fp_add (COSTS_N_INSNS (5)),
    fp_mult_sf (COSTS_N_INSNS (5)),
    fp_mult_df (COSTS_N_INSNS (5)),
    fp_div_sf (COSTS_N_INSNS (8)),
    fp_div_df (COSTS_N_INSNS (8)),
    int_mult_si (COSTS_N_INSNS (4)),
    int_mult_di (COSTS_N_INSNS (4)),
    int_div_si (COSTS_N_INSNS (5)),
    int_div_di (COSTS_N_INSNS (5)),
    branch_cost (6),
    memory_latency (4) {}

based on micro-benchmark results.  This fixes the int * _Bool case and int * 17
case.  But for the original test case I still get a multiplication instruction.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2023-12-09  7:48 ` xry111 at gcc dot gnu.org
@ 2023-12-09  7:56 ` xry111 at gcc dot gnu.org
  2023-12-09  8:58 ` xry111 at gcc dot gnu.org
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  7:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #9 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
And in fact the optimal code for

int t(int x, _Bool y)
{
        return x * y;
}

should be

    maskeqz $r4,$r4,$r5
    jr  $r1 

like

int t(int x, _Bool y)
{
        return y ? x : 0;
}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2023-12-09  7:56 ` xry111 at gcc dot gnu.org
@ 2023-12-09  8:58 ` xry111 at gcc dot gnu.org
  2023-12-09  8:58 ` xry111 at gcc dot gnu.org
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  8:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #10 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
For the original test case we get:

  x16_27 = _26 & 1;
  data_28 = data_15 >> 1;
  _29 = crc_18 >> 1;
  _21 = (short unsigned int) x16_27;
  _13 = _21 * 40961;

and tree_nonzero_bits fails to _21 is either 0 or 1, for some reason.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2023-12-09  8:58 ` xry111 at gcc dot gnu.org
@ 2023-12-09  8:58 ` xry111 at gcc dot gnu.org
  2023-12-09  9:06 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  8:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #11 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
> and tree_nonzero_bits fails to _21 is either 0 or 1, for some reason.
                                ^ report

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2023-12-09  8:58 ` xry111 at gcc dot gnu.org
@ 2023-12-09  9:06 ` xry111 at gcc dot gnu.org
  2023-12-09  9:07 ` [Bug middle-end/112935] " pinskia at gcc dot gnu.org
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  9:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Xi Ruoyao <xry111 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |tree-optimization
             Target|loongarch64-*-*             |

--- Comment #12 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
This also happens for x86_64, with 13.2:

.L2:
        movl    %edi, %edx
        shrb    %dil
        xorl    %eax, %edx
        shrw    %ax
        andl    $1, %edx
        negl    %edx
        andw    $-24575, %dx
        xorl    %edx, %eax
        subb    $1, %cl
        jne     .L2

With trunk:

.L2:
        mov     ecx, edi
        shr     dil
        xor     ecx, eax
        shr     ax
        and     ecx, 1
        lea     edx, [rcx+rcx*4]
        sal     edx, 13
        add     edx, ecx
        xor     eax, edx
        sub     sil, 1
        jne     .L2

So this seems not only a target issue.  I'll move it back to tree-optimization
and open LoongArch cost model issue as a new ticket.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2023-12-09  9:06 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
@ 2023-12-09  9:07 ` pinskia at gcc dot gnu.org
  2023-12-09  9:14 ` xry111 at gcc dot gnu.org
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  9:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |loongarch64-*-*
          Component|tree-optimization           |middle-end

--- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #11)
> > and tree_nonzero_bits fails to _21 is either 0 or 1, for some reason.
>                                 ^ report

Oh I see that on aarch64 too ...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2023-12-09  9:07 ` [Bug middle-end/112935] " pinskia at gcc dot gnu.org
@ 2023-12-09  9:14 ` xry111 at gcc dot gnu.org
  2023-12-09  9:16 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  9:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #14 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
LoongArch cost model issue is now PR112936.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2023-12-09  9:14 ` xry111 at gcc dot gnu.org
@ 2023-12-09  9:16 ` pinskia at gcc dot gnu.org
  2023-12-09  9:21 ` xry111 at gcc dot gnu.org
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  9:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I see what is happening and kinda of see why it is not there ...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2023-12-09  9:16 ` pinskia at gcc dot gnu.org
@ 2023-12-09  9:21 ` xry111 at gcc dot gnu.org
  2023-12-09  9:24 ` pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09  9:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #16 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
BTW is it possible to get the value range info of (x) and combine (and y, (neg
x)) back to maskeqz in LoongArch backend?  It will further improve the
performance.  Or maybe expr.cc should invoke a target hook to see if (mul x, y)
should be expanded to (if_then_else x, y, 0) instead when x is in {0, 1}.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2023-12-09  9:21 ` xry111 at gcc dot gnu.org
@ 2023-12-09  9:24 ` pinskia at gcc dot gnu.org
  2023-12-09  9:25 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  9:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |pinskia at gcc dot gnu.org

--- Comment #17 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So it was r14-1655-g52c92fb3f40050 which caused the issue.

It is a correct fix except we sometimes insert a convert and that new ssa name
does not have any global range info on it.
e.g.
_21 = (short unsigned int) x16_27;

So instead of:
          bool bit0_p = tree_nonzero_bits (treeop0) == 1;
          bool bit1_p = tree_nonzero_bits (treeop1) == 1;

We should use gimple_zero_one_valued_p


bool
gimple_zero_one_valued_p (tree t, tree (*valueize)(tree) ATTRIBUTE_UNUSED)


Which will look through one extra convert.


Let me fix this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2023-12-09  9:24 ` pinskia at gcc dot gnu.org
@ 2023-12-09  9:25 ` pinskia at gcc dot gnu.org
  2023-12-09 21:43 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09  9:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=111126

--- Comment #18 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #16)
> BTW is it possible to get the value range info of (x) and combine (and y,
> (neg x)) back to maskeqz in LoongArch backend?  It will further improve the
> performance.  Or maybe expr.cc should invoke a target hook to see if (mul x,
> y) should be expanded to (if_then_else x, y, 0) instead when x is in {0, 1}.

That would be PR 111126 because RISCV backend has the same issue. I will be
fixing that but most likely for GCC 15 as I am going to fix the regression
here.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2023-12-09  9:25 ` pinskia at gcc dot gnu.org
@ 2023-12-09 21:43 ` pinskia at gcc dot gnu.org
  2023-12-10  9:22 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 21:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #19 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This patch gets us back to using `&-` rather than shifts/adds for x86_64:
```
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 6da51f2aca2..4686cacd22f 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -10209,8 +10209,9 @@ expand_expr_real_2 (sepops ops, rtx target,
machine_mode tmode,
       /* Expand X*Y as X&-Y when Y must be zero or one.  */
       if (SCALAR_INT_MODE_P (mode))
        {
-         bool bit0_p = tree_nonzero_bits (treeop0) == 1;
-         bool bit1_p = tree_nonzero_bits (treeop1) == 1;
+         bool gimple_zero_one_valued_p (tree, tree (*)(tree));
+         bool bit0_p = gimple_zero_one_valued_p (treeop0, nullptr);
+         bool bit1_p = gimple_zero_one_valued_p (treeop1, nullptr);

          /* Expand X*Y as X&Y when both X and Y must be zero or one.  */
          if (bit0_p && bit1_p)
```

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2023-12-09 21:43 ` pinskia at gcc dot gnu.org
@ 2023-12-10  9:22 ` pinskia at gcc dot gnu.org
  2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
  2023-12-11 15:57 ` pinskia at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-10  9:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://gcc.gnu.org/piperma
                   |                            |il/gcc-patches/2023-Decembe
                   |                            |r/640030.html
           Keywords|                            |patch

--- Comment #20 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Patch submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640030.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2023-12-10  9:22 ` pinskia at gcc dot gnu.org
@ 2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
  2023-12-11 15:57 ` pinskia at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-12-11 15:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

--- Comment #21 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:

https://gcc.gnu.org/g:acbfb8b9495b802e414e6ab94b810ef7b0c8aa1d

commit r14-6418-gacbfb8b9495b802e414e6ab94b810ef7b0c8aa1d
Author: Andrew Pinski <quic_apinski@quicinc.com>
Date:   Sat Dec 9 13:43:23 2023 -0800

    expr: catch more `a*bool` while expanding  [PR 112935]

    After r14-1655-g52c92fb3f40050 (and the other commits
    which touch zero_one_valued_p), we end up with a with
    `bool * a` but where the bool is an SSA name that might not
    have non-zero bits set on it (to 0x1) even though it
    does the non-zero bits would be 0x1.
    The case of coremarks, it is only phiopt4 which adds the new
    ssa name and nothing afterwards updates the nonzero bits on it.
    This fixes the regression by using gimple_zero_one_valued_p
    rather than tree_nonzero_bits to match the cases where the
    SSA_NAME didn't have the non-zero bits set.
    gimple_zero_one_valued_p handles one level of cast and also
    and an `&`.

    Bootstrapped and tested on x86_64-linux-gnu.

    gcc/ChangeLog:

            PR middle-end/112935
            * expr.cc (expand_expr_real_2): Use
            gimple_zero_one_valued_p instead of tree_nonzero_bits
            to find boolean defined expressions.

    Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
  2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
@ 2023-12-11 15:57 ` pinskia at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-11 15:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #22 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Fixed.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-12-11 15:57 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-09  4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
2023-12-09  4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
2023-12-09  4:20 ` xry111 at gcc dot gnu.org
2023-12-09  4:22 ` pinskia at gcc dot gnu.org
2023-12-09  4:35 ` [Bug target/112935] " pinskia at gcc dot gnu.org
2023-12-09  4:49 ` pinskia at gcc dot gnu.org
2023-12-09  6:56 ` xry111 at gcc dot gnu.org
2023-12-09  7:02 ` xry111 at gcc dot gnu.org
2023-12-09  7:05 ` pinskia at gcc dot gnu.org
2023-12-09  7:48 ` xry111 at gcc dot gnu.org
2023-12-09  7:56 ` xry111 at gcc dot gnu.org
2023-12-09  8:58 ` xry111 at gcc dot gnu.org
2023-12-09  8:58 ` xry111 at gcc dot gnu.org
2023-12-09  9:06 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
2023-12-09  9:07 ` [Bug middle-end/112935] " pinskia at gcc dot gnu.org
2023-12-09  9:14 ` xry111 at gcc dot gnu.org
2023-12-09  9:16 ` pinskia at gcc dot gnu.org
2023-12-09  9:21 ` xry111 at gcc dot gnu.org
2023-12-09  9:24 ` pinskia at gcc dot gnu.org
2023-12-09  9:25 ` pinskia at gcc dot gnu.org
2023-12-09 21:43 ` pinskia at gcc dot gnu.org
2023-12-10  9:22 ` pinskia at gcc dot gnu.org
2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
2023-12-11 15:57 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).