public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64
@ 2023-07-04 17:26 moncef.mechri at gmail dot com
2023-07-04 17:32 ` [Bug target/110551] [11/12/13/14 " pinskia at gcc dot gnu.org
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: moncef.mechri at gmail dot com @ 2023-07-04 17:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
Bug ID: 110551
Summary: [11 / 12 / 13 /14 regression] Suboptimal codegen for
128 bits multiplication on x86_64
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: moncef.mechri at gmail dot com
Target Milestone: ---
https://godbolt.org/z/3hdondY6n
Codegen for the code shared above (which is a mixing step in boost.Unordered
when a non-avalanching hash function is being used [1] ) regressed since GCC
11. I believe there are 2 regressions:
Regression 1:
A redundant move is introduced:
movabs rcx, -7046029254386353131
mov rax, rcx
The regression seems to be present at all optimization levels above -O0
(including -Os and -Og).
Possibly a duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94804
Regression 2
When using -march=haswell or newer, GCC >= 11 emits mulx. The resulting code is
longer (by 1 instruction) with no clear benefit to my untrained eyes. It looks
to me like the code generated by GCC 10 is optimal, even for haswell and newer.
I am reporting both issues in the same bug report because they seem related
enough. Let me know if you want me to split them into 2 bug reports instead.
[1]
https://github.com/boostorg/unordered/blob/9a7d1d336aaa73ad8e5f7c07bdb81b2e793f8d93/include/boost/unordered/detail/mulx.hpp#L111
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 regression] Suboptimal codegen for 128 bits multiplication on x86_64
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
@ 2023-07-04 17:32 ` pinskia at gcc dot gnu.org
2023-07-04 17:47 ` [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply pinskia at gcc dot gnu.org
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-04 17:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 55468
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55468&action=edit
testcase
Please next time attach (which you can do paste in the box) or paste inline the
testcase rather than just link to godbolt .
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
2023-07-04 17:32 ` [Bug target/110551] [11/12/13/14 " pinskia at gcc dot gnu.org
@ 2023-07-04 17:47 ` pinskia at gcc dot gnu.org
2023-07-04 18:12 ` moncef.mechri at gmail dot com
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-04 17:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Keywords| |ra
Status|UNCONFIRMED |NEW
Last reconfirmed| |2023-07-04
Known to fail| |12.1.0, 14.0, 5.1.0, 7.1.0
Summary|[11/12/13/14 regression] |[11/12/13/14 Regression] an
|Suboptimal codegen for 128 |extra mov when doing 128bit
|bits multiplication on |multiply
|x86_64 |
Known to work| |4.9.4
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
It is an older regression though.
```
#include <stdint.h>
void mulx64(uint64_t *x, uint64_t *t)
{
__uint128_t r = (__uint128_t)*x * 0x9E3779B97F4A7C15ull;
*t = (uint64_t)r ^ (uint64_t)( r >> 64 );
}
```
It is just an extra mov.
Also the mulx should have allowed the register allocator to do better but it
was worse ...
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
2023-07-04 17:32 ` [Bug target/110551] [11/12/13/14 " pinskia at gcc dot gnu.org
2023-07-04 17:47 ` [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply pinskia at gcc dot gnu.org
@ 2023-07-04 18:12 ` moncef.mechri at gmail dot com
2023-07-05 7:06 ` rguenth at gcc dot gnu.org
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: moncef.mechri at gmail dot com @ 2023-07-04 18:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #3 from Moncef Mechri <moncef.mechri at gmail dot com> ---
> Please next time attach (which you can do paste in the box) or paste inline
> the testcase rather than just link to godbolt .
Noted. Apologies.
> It is an older regression though.
> ```
> #include <stdint.h>
>
> void mulx64(uint64_t *x, uint64_t *t)
> {
> __uint128_t r = (__uint128_t)*x * 0x9E3779B97F4A7C15ull;
> *t = (uint64_t)r ^ (uint64_t)( r >> 64 );
> }
> ```
>
> It is just an extra mov.
>
> Also the mulx should have allowed the register allocator to do better but it
> was worse ...
It is true that with this new test case, all GCC versions (including GCC 10)
seem to suffer from both issues reported in the original post.
But the original test case only exhibits suboptimal codegen for GCC >= 11, as
shown in the godbolt link shared above.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (2 preceding siblings ...)
2023-07-04 18:12 ` moncef.mechri at gmail dot com
@ 2023-07-05 7:06 ` rguenth at gcc dot gnu.org
2023-10-18 19:11 ` roger at nextmovesoftware dot com
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-05 7:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |11.5
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (3 preceding siblings ...)
2023-07-05 7:06 ` rguenth at gcc dot gnu.org
@ 2023-10-18 19:11 ` roger at nextmovesoftware dot com
2023-10-27 9:05 ` cvs-commit at gcc dot gnu.org
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: roger at nextmovesoftware dot com @ 2023-10-18 19:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
Roger Sayle <roger at nextmovesoftware dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |roger at nextmovesoftware dot com
--- Comment #4 from Roger Sayle <roger at nextmovesoftware dot com> ---
Patch proposed:
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633337.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (4 preceding siblings ...)
2023-10-18 19:11 ` roger at nextmovesoftware dot com
@ 2023-10-27 9:05 ` cvs-commit at gcc dot gnu.org
2023-10-29 18:01 ` moncef.mechri at gmail dot com
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-10-27 9:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:89e5d902fc55ad375f149f25a84c516ad360a606
commit r14-4968-g89e5d902fc55ad375f149f25a84c516ad360a606
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Fri Oct 27 10:03:53 2023 +0100
PR target/110551: Fix reg allocation for widening multiplications on x86.
This patch contains clean-ups of the widening multiplication patterns in
i386.md, and provides variants of the existing highpart multiplication
peephole2 transformations (that tidy up register allocation after
reload), and thereby fixes PR target/110551, which is a superfluous
move instruction.
For the new test case, compiled on x86_64 with -O2.
Before:
mulx64: movabsq $-7046029254386353131, %rcx
movq %rcx, %rax
mulq %rdi
xorq %rdx, %rax
ret
After:
mulx64: movabsq $-7046029254386353131, %rax
mulq %rdi
xorq %rdx, %rax
ret
The clean-ups are (i) that operand 1 is consistently made register_operand
and operand 2 becomes nonimmediate_operand, so that predicates match the
constraints, (ii) the representation of the BMI2 mulx instruction is
updated to use the new umul_highpart RTX, and (iii) because operands
0 and 1 have different modes in widening multiplications, "a" is a more
appropriate constraint than "0" (which avoids spills/reloads containing
SUBREGs). The new peephole2 transformations are based upon those at
around line 9951 of i386.md, that begins with the comment
;; Highpart multiplication peephole2s to tweak register allocation.
;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi
2023-10-27 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/110551
* config/i386/i386.md (<u>mul<mode><dwi>3): Make operands 1 and
2 take "regiser_operand" and "nonimmediate_operand" respectively.
(<u>mulqihi3): Likewise.
(*bmi2_umul<mode><dwi>3_1): Operand 2 needs to be register_operand
matching the %d constraint. Use umul_highpart RTX to represent
the highpart multiplication.
(*umul<mode><dwi>3_1): Operand 2 should use regiser_operand
predicate, and "a" rather than "0" as operands 0 and 2 have
different modes.
(define_split): For mul to mulx conversion, use the new
umul_highpart RTX representation.
(*mul<mode><dwi>3_1): Operand 1 should be register_operand
and the constraint %a as operands 0 and 1 have different modes.
(*<u>mulqihi3_1): Operand 1 should be register_operand matching
the constraint %0.
(define_peephole2): Providing widening multiplication variants
of the peephole2s that tweak highpart multiplication register
allocation.
gcc/testsuite/ChangeLog
PR target/110551
* gcc.target/i386/pr110551.c: New test case.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (5 preceding siblings ...)
2023-10-27 9:05 ` cvs-commit at gcc dot gnu.org
@ 2023-10-29 18:01 ` moncef.mechri at gmail dot com
2023-11-01 10:06 ` ubizjak at gmail dot com
` (3 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: moncef.mechri at gmail dot com @ 2023-10-29 18:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #6 from Moncef Mechri <moncef.mechri at gmail dot com> ---
I confirm the extra mov disappears thanks to Roger's patch.
However, the codegen still seems suboptimal to me when using -march=haswell or
newer, even with Roger's patch:
uint64_t mulx64(uint64_t x)
{
__uint128_t r = (__uint128_t)x * 0x9E3779B97F4A7C15ull;
return (uint64_t)r ^ (uint64_t)( r >> 64 );
}
With -O2:
mulx64(unsigned long):
movabs rax, -7046029254386353131
mul rdi
xor rax, rdx
ret
With -O2 -march=haswell
mulx64(unsigned long):
movabs rdx, -7046029254386353131
mulx rdi, rsi, rdi
mov rax, rdi
xor rax, rsi
ret
So it looks like there is still one extra mov, since I think the optimal
codegen using mulx should be:
mulx64(unsigned long):
movabs rdx, -7046029254386353131
mulx rax, rsi, rdi
xor rax, rsi
ret
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (6 preceding siblings ...)
2023-10-29 18:01 ` moncef.mechri at gmail dot com
@ 2023-11-01 10:06 ` ubizjak at gmail dot com
2023-11-01 22:35 ` cvs-commit at gcc dot gnu.org
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: ubizjak at gmail dot com @ 2023-11-01 10:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #7 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to CVS Commits from comment #5)
> The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
>
> https://gcc.gnu.org/g:89e5d902fc55ad375f149f25a84c516ad360a606
>
> commit r14-4968-g89e5d902fc55ad375f149f25a84c516ad360a606
> Author: Roger Sayle <roger@nextmovesoftware.com>
> Date: Fri Oct 27 10:03:53 2023 +0100
Looks like the patch regressed -march=cascadelake.
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (7 preceding siblings ...)
2023-11-01 10:06 ` ubizjak at gmail dot com
@ 2023-11-01 22:35 ` cvs-commit at gcc dot gnu.org
2023-11-06 19:24 ` moncef.mechri at gmail dot com
2023-11-12 15:48 ` [Bug target/110551] [11/12/13 " roger at nextmovesoftware dot com
10 siblings, 0 replies; 12+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-01 22:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:80b1a371008c31982d35cff9b85ca6affd3ac949
commit r14-5063-g80b1a371008c31982d35cff9b85ca6affd3ac949
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Wed Nov 1 22:33:45 2023 +0000
PR target/110551: Tweak mulx register allocation using peephole2.
This patch is a follow-up to my previous PR target/110551 patch, this
time to address the additional move after mulx, seen on TARGET_BMI2
architectures (such as -march=haswell). The complication here is
that the flexible multiple-set mulx instruction is introduced into
RTL after reload, by split2, and therefore can't benefit from register
preferencing. This results in RTL like the following:
(insn 32 31 17 2 (parallel [
(set (reg:DI 4 si [orig:101 r ] [101])
(mult:DI (reg:DI 1 dx [109])
(reg:DI 5 di [109])))
(set (reg:DI 5 di [ r+8 ])
(umul_highpart:DI (reg:DI 1 dx [109])
(reg:DI 5 di [109])))
]) "pr110551-2.c":8:17 -1
(nil))
(insn 17 32 9 2 (set (reg:DI 0 ax [107])
(reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
(expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
(nil)))
Here insn 32, the mulx instruction, places its results in si and di,
and then immediately after decides to move di to ax, with di now dead.
This can be trivially cleaned up by a peephole2. I've added an
additional constraint that the two SET_DESTs can't be the same
register to avoid confusing the middle-end, but this has well-defined
behaviour on x86_64/BMI2, encoding a umul_highpart.
For the new test case, compiled on x86_64 with -O2 -march=haswell:
Before:
mulx64: movabsq $-7046029254386353131, %rdx
mulx %rdi, %rsi, %rdi
movq %rdi, %rax
xorq %rsi, %rax
ret
After:
mulx64: movabsq $-7046029254386353131, %rdx
mulx %rdi, %rsi, %rax
xorq %rsi, %rax
ret
2023-11-01 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/110551
* config/i386/i386.md (*bmi2_umul<mode><dwi>3_1): Tidy condition
as operands[2] with predicate register_operand must be !MEM_P.
(peephole2): Optimize a mulx followed by a register-to-register
move, to place result in the correct destination if possible.
gcc/testsuite/ChangeLog
PR target/110551
* gcc.target/i386/pr110551-2.c: New test case.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (8 preceding siblings ...)
2023-11-01 22:35 ` cvs-commit at gcc dot gnu.org
@ 2023-11-06 19:24 ` moncef.mechri at gmail dot com
2023-11-12 15:48 ` [Bug target/110551] [11/12/13 " roger at nextmovesoftware dot com
10 siblings, 0 replies; 12+ messages in thread
From: moncef.mechri at gmail dot com @ 2023-11-06 19:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
--- Comment #9 from Moncef Mechri <moncef.mechri at gmail dot com> ---
With Roger's latest patch, codegen looks good with -O2 and -O2 -march=haswell.
Thanks!
I think this can be marked as resolved?
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug target/110551] [11/12/13 Regression] an extra mov when doing 128bit multiply
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
` (9 preceding siblings ...)
2023-11-06 19:24 ` moncef.mechri at gmail dot com
@ 2023-11-12 15:48 ` roger at nextmovesoftware dot com
10 siblings, 0 replies; 12+ messages in thread
From: roger at nextmovesoftware dot com @ 2023-11-12 15:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551
Roger Sayle <roger at nextmovesoftware dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.5 |14.0
Resolution|--- |FIXED
Summary|[11/12/13/14 Regression] an |[11/12/13 Regression] an
|extra mov when doing 128bit |extra mov when doing 128bit
|multiply |multiply
Status|NEW |RESOLVED
--- Comment #10 from Roger Sayle <roger at nextmovesoftware dot com> ---
This should now be fixed on mainline.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-11-12 15:48 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-04 17:26 [Bug rtl-optimization/110551] New: [11 / 12 / 13 /14 regression] Suboptimal codegen for 128 bits multiplication on x86_64 moncef.mechri at gmail dot com
2023-07-04 17:32 ` [Bug target/110551] [11/12/13/14 " pinskia at gcc dot gnu.org
2023-07-04 17:47 ` [Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply pinskia at gcc dot gnu.org
2023-07-04 18:12 ` moncef.mechri at gmail dot com
2023-07-05 7:06 ` rguenth at gcc dot gnu.org
2023-10-18 19:11 ` roger at nextmovesoftware dot com
2023-10-27 9:05 ` cvs-commit at gcc dot gnu.org
2023-10-29 18:01 ` moncef.mechri at gmail dot com
2023-11-01 10:06 ` ubizjak at gmail dot com
2023-11-01 22:35 ` cvs-commit at gcc dot gnu.org
2023-11-06 19:24 ` moncef.mechri at gmail dot com
2023-11-12 15:48 ` [Bug target/110551] [11/12/13 " roger at nextmovesoftware dot com
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).