public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function
@ 2023-12-09 4:16 xry111 at gcc dot gnu.org
2023-12-09 4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
` (22 more replies)
0 siblings, 23 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 4:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Bug ID: 112935
Summary: [14 Regression] Performance regression in Coremarks
crcu8 function
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: xry111 at gcc dot gnu.org
Target Milestone: ---
typedef __UINT8_TYPE__ ee_u8;
typedef __UINT16_TYPE__ ee_u16;
ee_u16 crcu8(ee_u8 data, ee_u16 crc) {
ee_u8 i = 0, x16 = 0, carry = 0;
for (i = 0; i < 8; i++) {
x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
data >>= 1;
if (x16 == 1) {
crc ^= 0x4002;
carry = 1;
} else
carry = 0;
crc >>= 1;
if (carry)
crc |= 0x8000;
else
crc &= 0x7fff;
}
return crc;
}
With GCC 13.2.0 -O2, on LoongArch we get:
.L2:
xor $r12,$r4,$r14
andi $r12,$r12,1
sub.w $r12,$r0,$r12
srli.w $r4,$r4,1
and $r12,$r12,$r15
addi.w $r13,$r13,-1
xor $r12,$r12,$r4
bstrpick.w $r13,$r13,7,0
srli.d $r14,$r14,1
bstrpick.w $r4,$r12,15,0
bnez $r13,.L2
With GCC 14.0.0 -O2:
.L2:
xor $r12,$r4,$r14
andi $r12,$r12,1
mul.w $r12,$r12,$r15
srli.w $r4,$r4,1
addi.w $r13,$r13,-1
bstrpick.w $r13,$r13,7,0
srli.d $r14,$r14,1
xor $r12,$r12,$r4
bstrpick.w $r4,$r12,15,0
bnez $r13,.L2
mul.w is slower than sub.w + and.
I'm now setting components to tree-optimization because the difference already
exists in 254t.optimized vs 263t.optimized. But maybe the tree optimizer is
doing things correctly and we should just add a target-specific optimization.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
@ 2023-12-09 4:20 ` xry111 at gcc dot gnu.org
2023-12-09 4:20 ` xry111 at gcc dot gnu.org
` (21 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 4:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #1 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Note that for
int t(int x, _Bool y)
{
return x * y;
}
even GCC 13 is generating the sub-optimal mul.w instruction. So perhaps this
is just a target issue after all...
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
2023-12-09 4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
@ 2023-12-09 4:20 ` xry111 at gcc dot gnu.org
2023-12-09 4:22 ` pinskia at gcc dot gnu.org
` (20 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 4:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Xi Ruoyao <xry111 at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2023-12-09
Ever confirmed|0 |1
--- Comment #2 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Self-confirming as this report is actually from Jia Jie.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
2023-12-09 4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
2023-12-09 4:20 ` xry111 at gcc dot gnu.org
@ 2023-12-09 4:22 ` pinskia at gcc dot gnu.org
2023-12-09 4:35 ` [Bug target/112935] " pinskia at gcc dot gnu.org
` (19 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 4:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |14.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (2 preceding siblings ...)
2023-12-09 4:22 ` pinskia at gcc dot gnu.org
@ 2023-12-09 4:35 ` pinskia at gcc dot gnu.org
2023-12-09 4:49 ` pinskia at gcc dot gnu.org
` (18 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 4:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords|needs-bisection |missed-optimization
Target|loongarch64-*-* |loongson
Component|tree-optimization |target
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Either r14-1655-g52c92fb3f40050 or r14-1654-g7ceed7e3e29c33 causes the
difference in the gimple level.
BUT
`a * boolean` is Canonical form and expr.cc expands it as `a & -boolean`:
```
/* Expand X*Y as X&-Y when Y must be zero or one. */
...
bool speed = optimize_insn_for_speed_p ();
int cost = add_cost (speed, mode) + neg_cost (speed, mode);
struct algorithm algorithm;
enum mult_variant variant;
if (CONST_INT_P (op1)
? !choose_mult_variant (mode, INTVAL (op1),
&algorithm, &variant, cost)
: cost < mul_cost (speed, mode))
{
target = bit0_p ? expand_and (mode, negate_rtx (mode, op0),
op1, target)
: expand_and (mode, op0,
negate_rtx (mode, op1),
target);
return REDUCE_BIT_FIELD (target);
}
```
Which means the cost model for loongson is wrong here.
Though it is comparing the cost of doing + and - to a * here though it should
be & and - but most target's + and & have a similar cost.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (3 preceding siblings ...)
2023-12-09 4:35 ` [Bug target/112935] " pinskia at gcc dot gnu.org
@ 2023-12-09 4:49 ` pinskia at gcc dot gnu.org
2023-12-09 6:56 ` xry111 at gcc dot gnu.org
` (17 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 4:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
/* Default RTX cost initializer. */
...
int_mult_si (COSTS_N_INSNS (1)),
int_mult_di (COSTS_N_INSNS (1)),
That seems wrong.
I suspect you will get other improvements when you touch this.
E.g.
```
int f(int t)
{
return t * 17;
}
```
Should really be:
shift followed by an add.
But currently is just a mult.
What is interesting is I think -Os cost is the opposite from the -O2 cost ...
That is -Os produces the better code generation due to the cost for mult being
set to 4:
/* RTX costs to use when optimizing for size. */
...
.int_mult_si_ (4)
.int_mult_di_ (4)
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (4 preceding siblings ...)
2023-12-09 4:49 ` pinskia at gcc dot gnu.org
@ 2023-12-09 6:56 ` xry111 at gcc dot gnu.org
2023-12-09 7:02 ` xry111 at gcc dot gnu.org
` (16 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 6:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #5 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #4)
> /* Default RTX cost initializer. */
> ...
> int_mult_si (COSTS_N_INSNS (1)),
> int_mult_di (COSTS_N_INSNS (1)),
>
>
> That seems wrong.
> I suspect you will get other improvements when you touch this.
>
> E.g.
> ```
> int f(int t)
> {
> return t * 17;
> }
> ```
> Should really be:
> shift followed by an add.
> But currently is just a mult.
>
> What is interesting is I think -Os cost is the opposite from the -O2 cost ...
>
> That is -Os produces the better code generation due to the cost for mult
> being set to 4:
> /* RTX costs to use when optimizing for size. */
> ...
> .int_mult_si_ (4)
> .int_mult_di_ (4)
4 is just COSTS_N_INSNS(1), so in -Os we are making all instructions cost the
same. This should be correct because in -Os we should minimize the number of
the instructions. In loongarch_rtx_costs though we have:
case MULT:
if (float_mode_p)
*total = loongarch_fp_mult_cost (mode);
else if (mode == DImode && !TARGET_64BIT)
*total = (speed
? loongarch_cost->int_mult_si * 3 + 6
: COSTS_N_INSNS (7));
else if (!speed)
*total = COSTS_N_INSNS (1) + 1;
else if (mode == DImode)
*total = loongarch_cost->int_mult_di;
else
*total = loongarch_cost->int_mult_si;
return false;
so we still slightly penalty multiplication. To me we should code
COSTS_N_INSNS (1) + 1 into loongarch_rtx_cost_optimize_size instead of special
casing it in loongarch_rtx_costs.
For the default value (used when -O2) I'll do some micro-benchmark...
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (5 preceding siblings ...)
2023-12-09 6:56 ` xry111 at gcc dot gnu.org
@ 2023-12-09 7:02 ` xry111 at gcc dot gnu.org
2023-12-09 7:05 ` pinskia at gcc dot gnu.org
` (15 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 7:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #6 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
On a LA664 it seems a mul.w instruction costs 4 times a "simple" instruction
like add.w/sub.w/and, and div.w costs 5 times.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (6 preceding siblings ...)
2023-12-09 7:02 ` xry111 at gcc dot gnu.org
@ 2023-12-09 7:05 ` pinskia at gcc dot gnu.org
2023-12-09 7:48 ` xry111 at gcc dot gnu.org
` (14 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #5)
>
> so we still slightly penalty multiplication. To me we should code
> COSTS_N_INSNS (1) + 1 into loongarch_rtx_cost_optimize_size instead of
> special casing it in loongarch_rtx_costs.
Oh yes slightly penalty is definitely not going make a huge difference if the
cost of an mult instruction is worse than an and and an neg.
>
> For the default value (used when -O2) I'll do some micro-benchmark...
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (7 preceding siblings ...)
2023-12-09 7:05 ` pinskia at gcc dot gnu.org
@ 2023-12-09 7:48 ` xry111 at gcc dot gnu.org
2023-12-09 7:56 ` xry111 at gcc dot gnu.org
` (13 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 7:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #8 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #7)
> (In reply to Xi Ruoyao from comment #5)
> >
> > so we still slightly penalty multiplication. To me we should code
> > COSTS_N_INSNS (1) + 1 into loongarch_rtx_cost_optimize_size instead of
> > special casing it in loongarch_rtx_costs.
>
> Oh yes slightly penalty is definitely not going make a huge difference if
> the cost of an mult instruction is worse than an and and an neg.
>
> >
> > For the default value (used when -O2) I'll do some micro-benchmark...
I've changed it to
/* Default RTX cost initializer. */
loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
: fp_add (COSTS_N_INSNS (5)),
fp_mult_sf (COSTS_N_INSNS (5)),
fp_mult_df (COSTS_N_INSNS (5)),
fp_div_sf (COSTS_N_INSNS (8)),
fp_div_df (COSTS_N_INSNS (8)),
int_mult_si (COSTS_N_INSNS (4)),
int_mult_di (COSTS_N_INSNS (4)),
int_div_si (COSTS_N_INSNS (5)),
int_div_di (COSTS_N_INSNS (5)),
branch_cost (6),
memory_latency (4) {}
based on micro-benchmark results. This fixes the int * _Bool case and int * 17
case. But for the original test case I still get a multiplication instruction.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (8 preceding siblings ...)
2023-12-09 7:48 ` xry111 at gcc dot gnu.org
@ 2023-12-09 7:56 ` xry111 at gcc dot gnu.org
2023-12-09 8:58 ` xry111 at gcc dot gnu.org
` (12 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 7:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #9 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
And in fact the optimal code for
int t(int x, _Bool y)
{
return x * y;
}
should be
maskeqz $r4,$r4,$r5
jr $r1
like
int t(int x, _Bool y)
{
return y ? x : 0;
}
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (9 preceding siblings ...)
2023-12-09 7:56 ` xry111 at gcc dot gnu.org
@ 2023-12-09 8:58 ` xry111 at gcc dot gnu.org
2023-12-09 8:58 ` xry111 at gcc dot gnu.org
` (11 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 8:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #10 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
For the original test case we get:
x16_27 = _26 & 1;
data_28 = data_15 >> 1;
_29 = crc_18 >> 1;
_21 = (short unsigned int) x16_27;
_13 = _21 * 40961;
and tree_nonzero_bits fails to _21 is either 0 or 1, for some reason.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug target/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (10 preceding siblings ...)
2023-12-09 8:58 ` xry111 at gcc dot gnu.org
@ 2023-12-09 8:58 ` xry111 at gcc dot gnu.org
2023-12-09 9:06 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
` (10 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 8:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #11 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
> and tree_nonzero_bits fails to _21 is either 0 or 1, for some reason.
^ report
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug tree-optimization/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (11 preceding siblings ...)
2023-12-09 8:58 ` xry111 at gcc dot gnu.org
@ 2023-12-09 9:06 ` xry111 at gcc dot gnu.org
2023-12-09 9:07 ` [Bug middle-end/112935] " pinskia at gcc dot gnu.org
` (9 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 9:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Xi Ruoyao <xry111 at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|target |tree-optimization
Target|loongarch64-*-* |
--- Comment #12 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
This also happens for x86_64, with 13.2:
.L2:
movl %edi, %edx
shrb %dil
xorl %eax, %edx
shrw %ax
andl $1, %edx
negl %edx
andw $-24575, %dx
xorl %edx, %eax
subb $1, %cl
jne .L2
With trunk:
.L2:
mov ecx, edi
shr dil
xor ecx, eax
shr ax
and ecx, 1
lea edx, [rcx+rcx*4]
sal edx, 13
add edx, ecx
xor eax, edx
sub sil, 1
jne .L2
So this seems not only a target issue. I'll move it back to tree-optimization
and open LoongArch cost model issue as a new ticket.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (12 preceding siblings ...)
2023-12-09 9:06 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
@ 2023-12-09 9:07 ` pinskia at gcc dot gnu.org
2023-12-09 9:14 ` xry111 at gcc dot gnu.org
` (8 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 9:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target| |loongarch64-*-*
Component|tree-optimization |middle-end
--- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #11)
> > and tree_nonzero_bits fails to _21 is either 0 or 1, for some reason.
> ^ report
Oh I see that on aarch64 too ...
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (13 preceding siblings ...)
2023-12-09 9:07 ` [Bug middle-end/112935] " pinskia at gcc dot gnu.org
@ 2023-12-09 9:14 ` xry111 at gcc dot gnu.org
2023-12-09 9:16 ` pinskia at gcc dot gnu.org
` (7 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 9:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #14 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
LoongArch cost model issue is now PR112936.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (14 preceding siblings ...)
2023-12-09 9:14 ` xry111 at gcc dot gnu.org
@ 2023-12-09 9:16 ` pinskia at gcc dot gnu.org
2023-12-09 9:21 ` xry111 at gcc dot gnu.org
` (6 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 9:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I see what is happening and kinda of see why it is not there ...
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (15 preceding siblings ...)
2023-12-09 9:16 ` pinskia at gcc dot gnu.org
@ 2023-12-09 9:21 ` xry111 at gcc dot gnu.org
2023-12-09 9:24 ` pinskia at gcc dot gnu.org
` (5 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-12-09 9:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #16 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
BTW is it possible to get the value range info of (x) and combine (and y, (neg
x)) back to maskeqz in LoongArch backend? It will further improve the
performance. Or maybe expr.cc should invoke a target hook to see if (mul x, y)
should be expanded to (if_then_else x, y, 0) instead when x is in {0, 1}.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (16 preceding siblings ...)
2023-12-09 9:21 ` xry111 at gcc dot gnu.org
@ 2023-12-09 9:24 ` pinskia at gcc dot gnu.org
2023-12-09 9:25 ` pinskia at gcc dot gnu.org
` (4 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 9:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org
--- Comment #17 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So it was r14-1655-g52c92fb3f40050 which caused the issue.
It is a correct fix except we sometimes insert a convert and that new ssa name
does not have any global range info on it.
e.g.
_21 = (short unsigned int) x16_27;
So instead of:
bool bit0_p = tree_nonzero_bits (treeop0) == 1;
bool bit1_p = tree_nonzero_bits (treeop1) == 1;
We should use gimple_zero_one_valued_p
bool
gimple_zero_one_valued_p (tree t, tree (*valueize)(tree) ATTRIBUTE_UNUSED)
Which will look through one extra convert.
Let me fix this.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (17 preceding siblings ...)
2023-12-09 9:24 ` pinskia at gcc dot gnu.org
@ 2023-12-09 9:25 ` pinskia at gcc dot gnu.org
2023-12-09 21:43 ` pinskia at gcc dot gnu.org
` (3 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 9:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://gcc.gnu.org/bugzill
| |a/show_bug.cgi?id=111126
--- Comment #18 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #16)
> BTW is it possible to get the value range info of (x) and combine (and y,
> (neg x)) back to maskeqz in LoongArch backend? It will further improve the
> performance. Or maybe expr.cc should invoke a target hook to see if (mul x,
> y) should be expanded to (if_then_else x, y, 0) instead when x is in {0, 1}.
That would be PR 111126 because RISCV backend has the same issue. I will be
fixing that but most likely for GCC 15 as I am going to fix the regression
here.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (18 preceding siblings ...)
2023-12-09 9:25 ` pinskia at gcc dot gnu.org
@ 2023-12-09 21:43 ` pinskia at gcc dot gnu.org
2023-12-10 9:22 ` pinskia at gcc dot gnu.org
` (2 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-09 21:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #19 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This patch gets us back to using `&-` rather than shifts/adds for x86_64:
```
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 6da51f2aca2..4686cacd22f 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -10209,8 +10209,9 @@ expand_expr_real_2 (sepops ops, rtx target,
machine_mode tmode,
/* Expand X*Y as X&-Y when Y must be zero or one. */
if (SCALAR_INT_MODE_P (mode))
{
- bool bit0_p = tree_nonzero_bits (treeop0) == 1;
- bool bit1_p = tree_nonzero_bits (treeop1) == 1;
+ bool gimple_zero_one_valued_p (tree, tree (*)(tree));
+ bool bit0_p = gimple_zero_one_valued_p (treeop0, nullptr);
+ bool bit1_p = gimple_zero_one_valued_p (treeop1, nullptr);
/* Expand X*Y as X&Y when both X and Y must be zero or one. */
if (bit0_p && bit1_p)
```
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (19 preceding siblings ...)
2023-12-09 21:43 ` pinskia at gcc dot gnu.org
@ 2023-12-10 9:22 ` pinskia at gcc dot gnu.org
2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
2023-12-11 15:57 ` pinskia at gcc dot gnu.org
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-10 9:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
URL| |https://gcc.gnu.org/piperma
| |il/gcc-patches/2023-Decembe
| |r/640030.html
Keywords| |patch
--- Comment #20 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Patch submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640030.html
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (20 preceding siblings ...)
2023-12-10 9:22 ` pinskia at gcc dot gnu.org
@ 2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
2023-12-11 15:57 ` pinskia at gcc dot gnu.org
22 siblings, 0 replies; 24+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-12-11 15:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
--- Comment #21 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:
https://gcc.gnu.org/g:acbfb8b9495b802e414e6ab94b810ef7b0c8aa1d
commit r14-6418-gacbfb8b9495b802e414e6ab94b810ef7b0c8aa1d
Author: Andrew Pinski <quic_apinski@quicinc.com>
Date: Sat Dec 9 13:43:23 2023 -0800
expr: catch more `a*bool` while expanding [PR 112935]
After r14-1655-g52c92fb3f40050 (and the other commits
which touch zero_one_valued_p), we end up with a with
`bool * a` but where the bool is an SSA name that might not
have non-zero bits set on it (to 0x1) even though it
does the non-zero bits would be 0x1.
The case of coremarks, it is only phiopt4 which adds the new
ssa name and nothing afterwards updates the nonzero bits on it.
This fixes the regression by using gimple_zero_one_valued_p
rather than tree_nonzero_bits to match the cases where the
SSA_NAME didn't have the non-zero bits set.
gimple_zero_one_valued_p handles one level of cast and also
and an `&`.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
PR middle-end/112935
* expr.cc (expand_expr_real_2): Use
gimple_zero_one_valued_p instead of tree_nonzero_bits
to find boolean defined expressions.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Bug middle-end/112935] [14 Regression] Performance regression in Coremarks crcu8 function
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
` (21 preceding siblings ...)
2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
@ 2023-12-11 15:57 ` pinskia at gcc dot gnu.org
22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-11 15:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112935
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #22 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Fixed.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2023-12-11 15:57 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-09 4:16 [Bug tree-optimization/112935] New: [14 Regression] Performance regression in Coremarks crcu8 function xry111 at gcc dot gnu.org
2023-12-09 4:20 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
2023-12-09 4:20 ` xry111 at gcc dot gnu.org
2023-12-09 4:22 ` pinskia at gcc dot gnu.org
2023-12-09 4:35 ` [Bug target/112935] " pinskia at gcc dot gnu.org
2023-12-09 4:49 ` pinskia at gcc dot gnu.org
2023-12-09 6:56 ` xry111 at gcc dot gnu.org
2023-12-09 7:02 ` xry111 at gcc dot gnu.org
2023-12-09 7:05 ` pinskia at gcc dot gnu.org
2023-12-09 7:48 ` xry111 at gcc dot gnu.org
2023-12-09 7:56 ` xry111 at gcc dot gnu.org
2023-12-09 8:58 ` xry111 at gcc dot gnu.org
2023-12-09 8:58 ` xry111 at gcc dot gnu.org
2023-12-09 9:06 ` [Bug tree-optimization/112935] " xry111 at gcc dot gnu.org
2023-12-09 9:07 ` [Bug middle-end/112935] " pinskia at gcc dot gnu.org
2023-12-09 9:14 ` xry111 at gcc dot gnu.org
2023-12-09 9:16 ` pinskia at gcc dot gnu.org
2023-12-09 9:21 ` xry111 at gcc dot gnu.org
2023-12-09 9:24 ` pinskia at gcc dot gnu.org
2023-12-09 9:25 ` pinskia at gcc dot gnu.org
2023-12-09 21:43 ` pinskia at gcc dot gnu.org
2023-12-10 9:22 ` pinskia at gcc dot gnu.org
2023-12-11 15:56 ` cvs-commit at gcc dot gnu.org
2023-12-11 15:57 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).