public inbox for gcc-bugs@sourceware.org
* [Bug c++/115749] New: Missed BMI2 optimization on x86-64
@ 2024-07-02 9:31 kim.walisch at gmail dot com
2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
` (15 more replies)
0 siblings, 16 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 9:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Bug ID: 115749
Summary: Missed BMI2 optimization on x86-64
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kim.walisch at gmail dot com
Target Milestone: ---
Hi,
I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC (all versions) produces noticeably slower code than Clang. I
was able to determine that the performance issue was caused by GCC not using
the mulx instruction from BMI2, even when compiling with -mbmi2. Clang, on the
other hand, used the mulx instruction, producing a shorter and faster assembly
sequence. For this particular code sequence Clang used up to 30% fewer
instructions than GCC.
Here is a minimal C/C++ code snippet that reproduces the issue:
extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}
GCC trunk produces the following 15-instruction assembly sequence (without
mulx) when compiled using -O3 -mbmi2:
func(unsigned long):
movabs rcx, -8608480567731124087
mov rax, rdi
mul rcx
mov rdi, rdx
shr rdi, 7
mov rax, rdi
mul rcx
shr rdx, 7
mov rax, rdx
sal rax, 4
sub rax, rdx
sal rax, 4
sub rdi, rax
mov rax, QWORD PTR array[0+rdi*8]
ret
Clang trunk produces the following shorter and faster 12-instruction assembly
sequence (with mulx) when compiled using -O3 -mbmi2:
func(unsigned long): # @func(unsigned long)
movabs rax, -8608480567731124087
mov rdx, rdi
mulx rdx, rdx, rax
shr rdx, 7
movabs rax, 153722867280912931
mulx rax, rax, rax
shr eax
imul eax, eax, 240
sub edx, eax
mov rax, qword ptr [rip + array@GOTPCREL]
mov rax, qword ptr [rax + 8*rdx]
ret
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug c++/115749] Missed BMI2 optimization on x86-64
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
@ 2024-07-02 11:36 ` kim.walisch at gmail dot com
2024-07-02 11:44 ` [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs pinskia at gcc dot gnu.org
` (14 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 11:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #1 from kim.walisch at gmail dot com ---
I played a bit more with my C/C++ code snippet and managed to simplify it
further. The GCC performance issue seems to be mostly caused by GCC producing
worse assembly than Clang for the integer modulo by a constant on x86-64 CPUs:
unsigned long func(unsigned long x)
{
    return x % 240;
}
GCC trunk produces the following 11-instruction assembly sequence (without
mulx) when compiled using -O3 -mbmi2:
func:
movabs rax, -8608480567731124087
mul rdi
mov rax, rdx
shr rax, 7
mov rdx, rax
sal rdx, 4
sub rdx, rax
mov rax, rdi
sal rdx, 4
sub rax, rdx
ret
Clang trunk produces the following shorter and faster 8-instruction assembly
sequence (with mulx) when compiled using -O3 -mbmi2:
func:
mov rax, rdi
movabs rcx, -8608480567731124087
mov rdx, rdi
mulx rcx, rcx, rcx
shr rcx, 7
imul rcx, rcx, 240
sub rax, rcx
ret
In my first post one can see that Clang uses mulx for both the integer division
by a constant and the integer modulo by a constant, while GCC does not use
mulx. However, for the integer division by a constant GCC uses the same number
of instructions as Clang (even without mulx), whereas for the integer modulo
by a constant GCC uses up to 30% more instructions and is noticeably slower.
Please note that Clang's assembly for the integer modulo by a constant on
x86-64 CPUs is also shorter (8 instructions) than GCC's when compiling without
-mbmi2, e.g. with just -O3:
func:
movabs rcx, -8608480567731124087
mov rax, rdi
mul rcx
shr rdx, 7
imul rax, rdx, 240
sub rdi, rax
mov rax, rdi
ret
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
@ 2024-07-02 11:44 ` pinskia at gcc dot gnu.org
2024-07-02 11:59 ` kim.walisch at gmail dot com
` (13 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 11:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target| |X86_64
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This seems like a tuning issue, in that GCC thinks the shifts and adds are
faster than mulx.
What happens if you do -march=native?
Does it use mulx?
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
2024-07-02 11:44 ` [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs pinskia at gcc dot gnu.org
@ 2024-07-02 11:59 ` kim.walisch at gmail dot com
2024-07-02 12:17 ` kim.walisch at gmail dot com
` (12 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 11:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #3 from kim.walisch at gmail dot com ---
(In reply to Andrew Pinski from comment #2)
> This seems like a tuning issue, in that GCC thinks the shifts and adds are
> faster than mulx.
>
> What happens if you do -march=native?
>
> Does it use mulx?
I tried g++-14 with both -march=native and -march=x86-64-v4 (on a 12th Gen
Intel Core i5-12600K, which supports BMI2 and AVX2), but GCC always produces
the same 11-instruction assembly sequence without mulx for the integer modulo
by a constant.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (2 preceding siblings ...)
2024-07-02 11:59 ` kim.walisch at gmail dot com
@ 2024-07-02 12:17 ` kim.walisch at gmail dot com
2024-07-02 17:21 ` pinskia at gcc dot gnu.org
` (11 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 12:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #4 from kim.walisch at gmail dot com ---
One possible explanation for why GCC's current integer division by a constant
assembly sequence was chosen back in the day (I guess one or two decades ago)
is that it uses only 1 mul instruction, whereas Clang uses 2 mul instructions.
Historically, multiplication instructions were slower than add, sub and shift
instructions on nearly all CPU architectures, so it made sense to avoid mul
instructions whenever possible. However, in the past decade this performance
gap has narrowed, and it is now more important to avoid the long instruction
dependency chains that GCC's current integer modulo by a constant assembly
sequence suffers from.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (3 preceding siblings ...)
2024-07-02 12:17 ` kim.walisch at gmail dot com
@ 2024-07-02 17:21 ` pinskia at gcc dot gnu.org
2024-07-02 17:58 ` pinskia at gcc dot gnu.org
` (10 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 17:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For the original testcase, the imul vs shift can be reduced down to just:
```
unsigned long func(unsigned long x)
{
    return x * 240;
}
```
GCC produces:
```
movq %rdi, %rax
salq $4, %rax
subq %rdi, %rax
salq $4, %rax
```
vs:
```
imulq $240, %rdi, %rax
```
-mtune=skylake produces the imul.
The mulx issue can be reduced down to just:
```
unsigned long func(unsigned long x)
{
    __uint128_t t = x;
    return (t * 123) >> 64;
}

__uint128_t func1(unsigned long x, unsigned long y)
{
    __uint128_t t = x;
    return (t * 123);
}
```
It seems to happen only with constants.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (4 preceding siblings ...)
2024-07-02 17:21 ` pinskia at gcc dot gnu.org
@ 2024-07-02 17:58 ` pinskia at gcc dot gnu.org
2024-07-02 18:14 ` pinskia at gcc dot gnu.org
` (9 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 17:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Depends on| |115755
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> The mulx issue can be reduced down to just:
> ```
> unsigned long func(unsigned long x)
> {
> __uint128_t t = x;
> return (t * 123)>>64;
> }
>
>
> __uint128_t func1(unsigned long x, unsigned long y)
> {
> __uint128_t t = x;
> return (t * 123);
> }
>
> ```
>
> It only happens with constants it seems.
I split that out to PR 115755.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115755
[Bug 115755] mulx (with -mbmi2) does not show up with constant multiply
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (5 preceding siblings ...)
2024-07-02 17:58 ` pinskia at gcc dot gnu.org
@ 2024-07-02 18:14 ` pinskia at gcc dot gnu.org
2024-07-02 18:41 ` pinskia at gcc dot gnu.org
` (8 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 18:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2024-07-02
Depends on| |115756
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> For the original testcase, the imul vs shift can be reduced down to just:
> ```
> unsigned long func(unsigned long x)
> {
> return x * 240;
> }
>
> ```
>
> GCC produces:
> ```
> movq %rdi, %rax
> salq $4, %rax
> subq %rdi, %rax
> salq $4, %rax
> ```
> vs:
> ```
> imulq $240, %rdi, %rax
> ```
>
> -mtune=skylake produces the imul.
I split out the tuning issue for imul to PR 115756 .
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756
[Bug 115756] default tuning for x86_64 produces shifts for `*240`
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (6 preceding siblings ...)
2024-07-02 18:14 ` pinskia at gcc dot gnu.org
@ 2024-07-02 18:41 ` pinskia at gcc dot gnu.org
2024-07-03 17:29 ` kim.walisch at gmail dot com
` (7 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 18:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to kim.walisch from comment #4)
> One possible explanation for why GCC's current integer division by a
> constant assembly sequence was chosen back in the day (I guess one or two
> decades ago) is that GCC's current assembly sequence uses only 1 mul
> instruction whereas Clang uses 2 mul instructions.
>
> Historically, multiplication instructions used to be slower than add, sub
> and shift instructions on nearly all CPU architectures and so it made sense
> to avoid mul instructions whenever possible. However in the past decade this
> performance gap has narrowed and now it is more important to avoid long
> instruction dependency chains which GCC's current integer modulo by a
> constant assembly sequence suffers from.
Note the way GCC has handled mult->shift/add (and even the divide/mod
expansion) is via a target-independent part which queries the target for the
costs of specific instructions (mult, shift, add, etc.). So if the target does
not model the costs correctly, you get the less efficient sequence. This is
why I said it was a cost model issue and why PR 115756 asks for changing the
default (generic) output.
So yes, the cost might be based on the older cores and not have been retuned
since. Anyway, the middle-end is doing the correct thing based on what the
target is giving it.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (7 preceding siblings ...)
2024-07-02 18:41 ` pinskia at gcc dot gnu.org
@ 2024-07-03 17:29 ` kim.walisch at gmail dot com
2024-07-04 1:17 ` liuhongt at gcc dot gnu.org
` (6 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-03 17:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #9 from kim.walisch at gmail dot com ---
Here I am providing some benchmark results to back up my claim that the
integer modulo by a constant algorithm with 2 multiplication instructions
(which is the default in both Clang and MSVC) is faster than GCC's current
default algorithm, which uses only 1 multiplication but more shifts, adds and
subs.
Here is a simple C program that can be used to benchmark both algorithms using
GCC. It computes an integer modulo by a constant in a loop (10^10 iterations).
While this benchmark program may not be ideal for simulating real-world use
cases, it is better than having no benchmark at all.
-----------------------------------------------------------------------
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/time.h>

__attribute__((noinline))
uint64_t benchmark(uint64_t x, uint64_t inc, uint64_t iters) {
    uint64_t sum = 0;
    #pragma GCC unroll 1
    for (uint64_t i = 0; i < iters; i++, x += inc)
        sum += x % 240;
    return sum;
}

int main() {
    struct timeval start, end;
    long seconds, useconds;
    double mtime;

    // Start timer
    gettimeofday(&start, NULL);
    uint64_t result = benchmark(0, 11, 10000000000ull);
    // End timer
    gettimeofday(&end, NULL);

    // Calculate the elapsed time in milliseconds
    seconds = end.tv_sec - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = ((seconds) * 1000 + useconds / 1000.0);

    printf("Result = %" PRIu64 "\n", result);
    printf("The loop took %f milliseconds to execute.\n", mtime);
    return 0;
}
-----------------------------------------------------------------------
This GCC command produces the assembly sequence with only 1 multiplication
instruction:
gcc -O3 bench.c -o bench1
This GCC command produces the assembly sequence with 2 multiplication
instructions:
gcc -O3 -mtune=skylake bench.c -o bench2
And here are the benchmark results:
CPU: AMD EPYC 9R14 (AMD Zen4 from 2023), Compiler: GCC 14 on Ubuntu 24.04.
$ ./bench1
Result = 1194999999360
The loop took 8385.613000 milliseconds to execute.
$ ./bench2
Result = 1194999999360
The loop took 5898.967000 milliseconds to execute.
The algorithm with 2 multiplications is 30% faster.
CPU: Intel i5-12600K (big.LITTLE), Performance CPU core, Compiler: GCC 13.2
(MinGW-w64)
$ ./bench1.exe
Result = 1194999999360
The loop took 5633.360000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 4369.167000 milliseconds to execute.
The algorithm with 2 multiplications is 23% faster.
CPU: Intel i5-12600K (big.LITTLE), Efficiency CPU core, Compiler: GCC 13.2
(MinGW-w64)
$ ./bench1.exe
Result = 1194999999360
The loop took 10788.097000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 9453.191000 milliseconds to execute.
The algorithm with 2 multiplications is 12% faster.
One of the comments in PR 115756 was: "I'd lean towards shift+add because for
example Intel E-cores have a slow imul." However, my benchmarks suggest that
even on Intel Efficiency CPU cores the algorithm with 2 multiplication
instructions is faster. (I used the Process Lasso tool on Windows 11 to force
the benchmark to run on an Efficiency CPU core.)
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (8 preceding siblings ...)
2024-07-03 17:29 ` kim.walisch at gmail dot com
@ 2024-07-04 1:17 ` liuhongt at gcc dot gnu.org
2024-07-16 8:18 ` lingling.kong7 at gmail dot com
` (5 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-07-04 1:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |haochen.jiang at intel dot com,
| |liuhongt at gcc dot gnu.org
--- Comment #10 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> One of the comments in PR 115756 was "I'd lean towards shift+add because for
> example Intel E-cores have a slow imul.". However, my benchmarks suggest
> that even on Intel Efficiency CPU cores the algorithm with 2 multiplication
> instructions is faster. (I used the Process Lasso tool on Windows 11 to
> force the benchmark to be run on an Efficiency CPU core).
@haochen, could you try the benchmark on our Sierra Forest machine?
I'm ok with adjusting rtx_cost of imulq from COST_N_INSNS (4) to
COST_N_INSNS (3) if the performance test looks ok.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (9 preceding siblings ...)
2024-07-04 1:17 ` liuhongt at gcc dot gnu.org
@ 2024-07-16 8:18 ` lingling.kong7 at gmail dot com
2024-07-16 21:29 ` roger at nextmovesoftware dot com
` (4 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: lingling.kong7 at gmail dot com @ 2024-07-16 8:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #11 from kong lingling <lingling.kong7 at gmail dot com> ---
After adjusting rtx_cost of imulq from COST_N_INSNS (4) to COST_N_INSNS (3), I
tested the benchmark on a Sierra Forest machine based on GCC trunk, and the
algorithm with 2 multiplications is 2% faster. The Spec2017 performance
improvement is around 0.2% (1 copy, -march=native -Ofast -funroll-loops -flto
/ -mtune=generic -O2 -march=x86-64-v3).
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (10 preceding siblings ...)
2024-07-16 8:18 ` lingling.kong7 at gmail dot com
@ 2024-07-16 21:29 ` roger at nextmovesoftware dot com
2024-07-25 1:45 ` cvs-commit at gcc dot gnu.org
` (3 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-07-16 21:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #12 from Roger Sayle <roger at nextmovesoftware dot com> ---
I owe Kim an apology. It does appear that modern x86_64 processors perform
(many) multiplications faster than the latencies given in the Intel/AMD/Agner
Fog documentation.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (11 preceding siblings ...)
2024-07-16 21:29 ` roger at nextmovesoftware dot com
@ 2024-07-25 1:45 ` cvs-commit at gcc dot gnu.org
2024-08-15 5:11 ` liuhongt at gcc dot gnu.org
` (2 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-07-25 1:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #13 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kong Lingling <konglin1@gcc.gnu.org>:
https://gcc.gnu.org/g:bc00de070f0b9a25f68ffddbefe516543a44bd23
commit r15-2295-gbc00de070f0b9a25f68ffddbefe516543a44bd23
Author: Lingling Kong <lingling.kong@intel.com>
Date: Thu Jul 25 09:42:06 2024 +0800
i386: Adjust rtx cost for imulq and imulw [PR115749]
gcc/ChangeLog:
PR target/115749
* config/i386/x86-tune-costs.h (struct processor_costs):
Adjust rtx_cost of imulq and imulw for COST_N_INSNS (4)
to COST_N_INSNS (3).
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr115749.c: New test.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (12 preceding siblings ...)
2024-07-25 1:45 ` cvs-commit at gcc dot gnu.org
@ 2024-08-15 5:11 ` liuhongt at gcc dot gnu.org
2024-08-15 5:35 ` liuhongt at gcc dot gnu.org
2024-08-16 5:00 ` sjames at gcc dot gnu.org
15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-15 5:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Bug 115749 depends on bug 115756, which changed state.
Bug 115756 Summary: default tuning for x86_64 produces shifts for `*240`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |RESOLVED
Resolution|--- |FIXED
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (13 preceding siblings ...)
2024-08-15 5:11 ` liuhongt at gcc dot gnu.org
@ 2024-08-15 5:35 ` liuhongt at gcc dot gnu.org
2024-08-16 5:00 ` sjames at gcc dot gnu.org
15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-15 5:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Status|NEW |RESOLVED
--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Fixed in GCC 15.
* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
2024-07-02 9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
` (14 preceding siblings ...)
2024-08-15 5:35 ` liuhongt at gcc dot gnu.org
@ 2024-08-16 5:00 ` sjames at gcc dot gnu.org
15 siblings, 0 replies; 17+ messages in thread
From: sjames at gcc dot gnu.org @ 2024-08-16 5:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Sam James <sjames at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |15.0