* [Bug c++/115749] New: Missed BMI2 optimization on x86-64
@ 2024-07-02  9:31 kim.walisch at gmail dot com
  2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02  9:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

            Bug ID: 115749
           Summary: Missed BMI2 optimization on x86-64
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC (all versions I tested) produces noticeably slower code than
Clang. I found that the performance issue was caused by GCC not using the
mulx instruction from BMI2, even when compiling with -mbmi2. Clang, on the
other hand, uses mulx and produces a shorter and faster assembly sequence.
For this particular code sequence, Clang uses up to 30% fewer instructions
than GCC.

Here is a minimal C/C++ code snippet that reproduces the issue:


extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}



GCC trunk produces the following 15-instruction assembly sequence (without
mulx) when compiled with -O3 -mbmi2:

func(unsigned long):
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        mov     rdi, rdx
        shr     rdi, 7
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        mov     rax, rdx
        sal     rax, 4
        sub     rax, rdx
        sal     rax, 4
        sub     rdi, rax
        mov     rax, QWORD PTR array[0+rdi*8]
        ret


Clang trunk produces the following shorter and faster 12-instruction assembly
sequence (with mulx) when compiled with -O3 -mbmi2:

func(unsigned long):                               # @func(unsigned long)
        movabs  rax, -8608480567731124087
        mov     rdx, rdi
        mulx    rdx, rdx, rax
        shr     rdx, 7
        movabs  rax, 153722867280912931
        mulx    rax, rax, rax
        shr     eax
        imul    eax, eax, 240
        sub     edx, eax
        mov     rax, qword ptr [rip + array@GOTPCREL]
        mov     rax, qword ptr [rax + 8*rdx]
        ret


* [Bug c++/115749] Missed BMI2 optimization on x86-64
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
@ 2024-07-02 11:36 ` kim.walisch at gmail dot com
  2024-07-02 11:44 ` [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs pinskia at gcc dot gnu.org
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 11:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #1 from kim.walisch at gmail dot com ---
I played a bit more with my C/C++ code snippet and managed to simplify it
further. The GCC performance issue seems to be mostly caused by GCC producing
worse assembly than Clang for integer modulo by a constant on x86-64 CPUs:

unsigned long func(unsigned long x)
{
    return x % 240;
}


GCC trunk produces the following 11-instruction assembly sequence (without
mulx) when compiled with -O3 -mbmi2:

func:
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 7
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 4
        sub     rax, rdx
        ret

Clang trunk produces the following shorter and faster 8-instruction assembly
sequence (with mulx) when compiled with -O3 -mbmi2:

func:
        mov     rax, rdi
        movabs  rcx, -8608480567731124087
        mov     rdx, rdi
        mulx    rcx, rcx, rcx
        shr     rcx, 7
        imul    rcx, rcx, 240
        sub     rax, rcx
        ret

My first post shows that Clang uses mulx for both the integer division by a
constant and the integer modulo by a constant, while GCC uses neither. For
the integer division by a constant, GCC uses the same number of instructions
as Clang (even without mulx), but for the integer modulo by a constant, GCC
uses up to 30% more instructions and is noticeably slower.
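
Both compilers lower the modulo through the same identity; the difference
above lies entirely in the multiply-back step. A minimal C sketch
(illustrative only, not either compiler's internals):


#include <stdint.h>

/* x % 240 is computed as x - (x / 240) * 240. The x / 240 part (a
   multiply-high by a magic constant plus a shift) is the same in both
   compilers' output; they differ only in how (x / 240) * 240 is formed:
   Clang emits a single imul, GCC a shift/sub chain. */
uint64_t mod240(uint64_t x)
{
    uint64_t q = x / 240;   /* identical magic-number lowering in both */
    return x - q * 240;     /* the multiply-back step where they differ */
}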

Please note that Clang's assembly is also shorter (8 instructions) than
GCC's for the integer modulo by a constant on x86-64 CPUs when compiling
without -mbmi2, e.g. with just -O3:

func:
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        imul    rax, rdx, 240
        sub     rdi, rax
        mov     rax, rdi
        ret


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
  2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
@ 2024-07-02 11:44 ` pinskia at gcc dot gnu.org
  2024-07-02 11:59 ` kim.walisch at gmail dot com
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 11:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |X86_64

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This seems like a tuning issue, in that GCC thinks the shifts and other
instructions are faster than mulx.

What happens if you use -march=native?

Does it use mulx?


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
  2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
  2024-07-02 11:44 ` [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs pinskia at gcc dot gnu.org
@ 2024-07-02 11:59 ` kim.walisch at gmail dot com
  2024-07-02 12:17 ` kim.walisch at gmail dot com
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 11:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #3 from kim.walisch at gmail dot com ---
(In reply to Andrew Pinski from comment #2)
> This seems like a tuning issue, in that GCC thinks the shifts and other
> instructions are faster than mulx.
> 
> What happens if you use -march=native?
> 
> Does it use mulx?

I tried g++-14 with both -march=native and -march=x86-64-v4 (on a 12th Gen
Intel Core i5-12600K, which supports BMI2 and AVX2), but GCC always produces
the same 11-instruction assembly sequence without mulx for the integer modulo
by a constant.


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (2 preceding siblings ...)
  2024-07-02 11:59 ` kim.walisch at gmail dot com
@ 2024-07-02 12:17 ` kim.walisch at gmail dot com
  2024-07-02 17:21 ` pinskia at gcc dot gnu.org
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02 12:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #4 from kim.walisch at gmail dot com ---
One possible explanation for why GCC's current integer division by a constant
assembly sequence was chosen back in the day (I guess one or two decades ago)
is that GCC's current assembly sequence uses only 1 mul instruction whereas
Clang uses 2 mul instructions.

Historically, multiplication instructions were slower than add, sub and
shift instructions on nearly all CPU architectures, so it made sense to
avoid mul instructions whenever possible. Over the past decade, however,
this performance gap has narrowed, and it is now more important to avoid the
long instruction dependency chains that GCC's current integer modulo by a
constant assembly sequence suffers from.
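
To make the trade-off concrete, here is a minimal C sketch of the two
lowerings. The mulhi64() helper is illustrative (it stands in for the high
64 bits of the 64x64->128-bit multiply), and the magic constant is the one
visible in the assembly dumps above (0x8888888888888889 is
-8608480567731124087 interpreted as a signed value):


#include <stdint.h>

static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

/* GCC's current sequence: one multiply, then a shift/sub chain that
   builds q * 240 as ((q << 4) - q) << 4, i.e. q * 15 * 16. Each step
   depends on the previous one, forming a long dependency chain. */
uint64_t mod240_one_mul(uint64_t x)
{
    uint64_t q = mulhi64(x, 0x8888888888888889ULL) >> 7;  /* x / 240 */
    uint64_t t = ((q << 4) - q) << 4;                     /* q * 240 */
    return x - t;
}

/* Clang's sequence: a second multiply computes q * 240 with one imul,
   shortening the dependency chain at the cost of an extra mul. */
uint64_t mod240_two_mul(uint64_t x)
{
    uint64_t q = mulhi64(x, 0x8888888888888889ULL) >> 7;  /* x / 240 */
    return x - q * 240;
}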


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (3 preceding siblings ...)
  2024-07-02 12:17 ` kim.walisch at gmail dot com
@ 2024-07-02 17:21 ` pinskia at gcc dot gnu.org
  2024-07-02 17:58 ` pinskia at gcc dot gnu.org
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 17:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For the original testcase, the imul vs shift can be reduced down to just:
```
unsigned long func(unsigned long x)
{
  return x * 240;
}

```

GCC produces:
```
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdi, %rax
        salq    $4, %rax
```
vs:
```
        imulq   $240, %rdi, %rax
```

-mtune=skylake  produces the imul.

The mulx issue can be reduced down to just:
```
unsigned long func(unsigned long x)
{
  __uint128_t t = x;
  return (t * 123)>>64;
}


__uint128_t func1(unsigned long x, unsigned long y)
{
  __uint128_t t = x;
  return (t * 123);
}

```

It seems to happen only with constants.


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (4 preceding siblings ...)
  2024-07-02 17:21 ` pinskia at gcc dot gnu.org
@ 2024-07-02 17:58 ` pinskia at gcc dot gnu.org
  2024-07-02 18:14 ` pinskia at gcc dot gnu.org
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 17:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |115755

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> The mulx issue can be reduced down to just:
> ```
> unsigned long func(unsigned long x)
> {
>   __uint128_t t = x;
>   return (t * 123)>>64;
> }
> 
> 
> __uint128_t func1(unsigned long x, unsigned long y)
> {
>   __uint128_t t = x;
>   return (t * 123);
> }
> 
> ```
> 
> It seems to happen only with constants.

I split that out to PR 115755.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115755
[Bug 115755] mulx (with -mbmi2) does not show up with constant multiply


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (5 preceding siblings ...)
  2024-07-02 17:58 ` pinskia at gcc dot gnu.org
@ 2024-07-02 18:14 ` pinskia at gcc dot gnu.org
  2024-07-02 18:41 ` pinskia at gcc dot gnu.org
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 18:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-07-02
         Depends on|                            |115756
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> For the original testcase, the imul vs shift can be reduced down to just:
> ```
> unsigned long func(unsigned long x)
> {
>   return x * 240;
> }
> 
> ```
> 
> GCC produces:
> ```
>         movq    %rdi, %rax
>         salq    $4, %rax
>         subq    %rdi, %rax
>         salq    $4, %rax
> ```
> vs:
> ```
>         imulq   $240, %rdi, %rax
> ```
> 
> -mtune=skylake  produces the imul.

I split out the imul tuning issue to PR 115756.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756
[Bug 115756] default tuning for x86_64 produces shifts for `*240`


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (6 preceding siblings ...)
  2024-07-02 18:14 ` pinskia at gcc dot gnu.org
@ 2024-07-02 18:41 ` pinskia at gcc dot gnu.org
  2024-07-03 17:29 ` kim.walisch at gmail dot com
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-07-02 18:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to kim.walisch from comment #4)
> One possible explanation for why GCC's current integer division by a
> constant assembly sequence was chosen back in the day (I guess one or two
> decades ago) is that GCC's current assembly sequence uses only 1 mul
> instruction whereas Clang uses 2 mul instructions.
> 
> Historically, multiplication instructions were slower than add, sub and
> shift instructions on nearly all CPU architectures, so it made sense to
> avoid mul instructions whenever possible. Over the past decade, however,
> this performance gap has narrowed, and it is now more important to avoid
> the long instruction dependency chains that GCC's current integer modulo
> by a constant assembly sequence suffers from.

Note that the way GCC handles mult->shift/add (and even the divide/mod
expansion) is via a target-independent part which queries the target for the
costs of specific instructions (mult, shift, add, etc.). So if the target
does not model a cost correctly, you get the less efficient sequence. This
is why I said it was a cost model issue and why PR 115756 asks for a change
to the default (generic) output.
So yes, the costs might be based on older cores and not have been retuned
since. Anyway, the middle-end is doing the correct thing based on what the
target is telling it.
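
A rough sketch of the comparison involved (illustrative pseudologic only; the
real logic is the synth_mult/choose_mult_variant machinery in expmed.cc, fed
by the target's rtx cost hook):
```
/* Illustrative only, not GCC's actual code: the expander synthesizes a
   shift/add/sub sequence for a multiply-by-constant and keeps it only if
   its summed target-reported cost beats the reported multiply cost.  */
struct insn_costs { int mult; int shift; int add; };

int use_synth_sequence (struct insn_costs c, int n_shifts, int n_adds)
{
  int seq_cost = n_shifts * c.shift + n_adds * c.add;
  /* If the target overstates c.mult (e.g. costs tuned for much older
     cores), the slower shift/sub chain wins this comparison.  */
  return seq_cost < c.mult;
}
```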


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (7 preceding siblings ...)
  2024-07-02 18:41 ` pinskia at gcc dot gnu.org
@ 2024-07-03 17:29 ` kim.walisch at gmail dot com
  2024-07-04  1:17 ` liuhongt at gcc dot gnu.org
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-03 17:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #9 from kim.walisch at gmail dot com ---
Here are some benchmark results to back up my claim that switching to the
integer modulo by a constant algorithm with 2 multiplication instructions
(the default in both Clang and MSVC) is faster than GCC's current default
algorithm, which uses only 1 multiplication but more shifts, adds and subs.

Below is a simple C program that can be used to benchmark both algorithms
using GCC. It computes integer modulo by a constant in a loop (10^10
iterations). While this benchmark may not be ideal for simulating real-world
use cases, it is better than having no benchmark at all.

-----------------------------------------------------------------------

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/time.h>

__attribute__((noinline))
uint64_t benchmark(uint64_t x, uint64_t inc, uint64_t iters) {
    uint64_t sum = 0;

    #pragma GCC unroll 1
    for (uint64_t i = 0; i < iters; i++, x += inc)
        sum += x % 240;

    return sum;
}

int main() {
    struct timeval start, end;
    long seconds, useconds;
    double mtime;

    // Start timer
    gettimeofday(&start, NULL);

    uint64_t result = benchmark(0, 11, 10000000000ull);

    // End timer
    gettimeofday(&end, NULL);

    // Calculate the elapsed time in milliseconds
    seconds  = end.tv_sec  - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;

    mtime = ((seconds) * 1000 + useconds / 1000.0);

    printf("Result = %" PRIu64 "\n", result);
    printf("The loop took %f milliseconds to execute.\n", mtime);

    return 0;
}

-----------------------------------------------------------------------


This GCC command produces the assembly sequence with only 1 multiplication
instruction:

gcc -O3 bench.c -o bench1

This GCC command produces the assembly sequence with 2 multiplication
instructions:

gcc -O3 -mtune=skylake bench.c -o bench2


And here are the benchmark results:


CPU: AMD EPYC 9R14 (AMD Zen4 from 2023), Compiler: GCC 14 on Ubuntu 24.04.

$ ./bench1
Result = 1194999999360
The loop took 8385.613000 milliseconds to execute.
$ ./bench2
Result = 1194999999360
The loop took 5898.967000 milliseconds to execute.

The algorithm with 2 multiplications is 30% faster.


CPU: Intel i5-12600K (big.LITTLE), Performance CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 1194999999360
The loop took 5633.360000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 4369.167000 milliseconds to execute.

The algorithm with 2 multiplications is 23% faster.


CPU: Intel i5-12600K (big.LITTLE), Efficiency CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 1194999999360
The loop took 10788.097000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 9453.191000 milliseconds to execute.

The algorithm with 2 multiplications is 12% faster.


One of the comments in PR 115756 was "I'd lean towards shift+add because for
example Intel E-cores have a slow imul." However, my benchmarks suggest that
even on Intel Efficiency cores the algorithm with 2 multiplication
instructions is faster. (I used the Process Lasso tool on Windows 11 to
force the benchmark to run on an Efficiency core.)


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (8 preceding siblings ...)
  2024-07-03 17:29 ` kim.walisch at gmail dot com
@ 2024-07-04  1:17 ` liuhongt at gcc dot gnu.org
  2024-07-16  8:18 ` lingling.kong7 at gmail dot com
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-07-04  1:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |haochen.jiang at intel dot com,
                   |                            |liuhongt at gcc dot gnu.org

--- Comment #10 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> One of the comments in PR 115756 was "I'd lean towards shift+add because
> for example Intel E-cores have a slow imul." However, my benchmarks
> suggest that even on Intel Efficiency cores the algorithm with 2
> multiplication instructions is faster. (I used the Process Lasso tool on
> Windows 11 to force the benchmark to run on an Efficiency core.)

@haochen, could you try the benchmark on our Sierra Forest machine?
I'm OK with adjusting the rtx_cost of imulq from COSTS_N_INSNS (4) to
COSTS_N_INSNS (3) if the performance test looks OK.
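
For reference, a sketch of what the adjustment means numerically, using the
cost-scaling macro from gcc/rtl.h (the actual cost tables live in
config/i386/x86-tune-costs.h):

/* From gcc/rtl.h: rtx costs are expressed in units of one quarter of a
   single fast instruction, so this change lowers the modeled cost of
   imulq from COSTS_N_INSNS (4) == 16 to COSTS_N_INSNS (3) == 12. */
#define COSTS_N_INSNS(N) ((N) * 4)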


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (9 preceding siblings ...)
  2024-07-04  1:17 ` liuhongt at gcc dot gnu.org
@ 2024-07-16  8:18 ` lingling.kong7 at gmail dot com
  2024-07-16 21:29 ` roger at nextmovesoftware dot com
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: lingling.kong7 at gmail dot com @ 2024-07-16  8:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #11 from kong lingling <lingling.kong7 at gmail dot com> ---
After adjusting the rtx_cost of imulq from COSTS_N_INSNS (4) to
COSTS_N_INSNS (3), I tested the benchmark on a Sierra Forest machine with
GCC trunk; the algorithm with 2 multiplications is 2% faster. The SPEC2017
performance improvement is around 0.2% (1 copy, -march=native -Ofast
-funroll-loops -flto / -mtune=generic -O2 -march=x86-64-v3).


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (10 preceding siblings ...)
  2024-07-16  8:18 ` lingling.kong7 at gmail dot com
@ 2024-07-16 21:29 ` roger at nextmovesoftware dot com
  2024-07-25  1:45 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-07-16 21:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #12 from Roger Sayle <roger at nextmovesoftware dot com> ---
I owe Kim an apology.  It does appear that modern x86_64 processors perform
(many) multiplications faster than the latencies given in the Intel/AMD/Agner
Fog documentation.


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (11 preceding siblings ...)
  2024-07-16 21:29 ` roger at nextmovesoftware dot com
@ 2024-07-25  1:45 ` cvs-commit at gcc dot gnu.org
  2024-08-15  5:11 ` liuhongt at gcc dot gnu.org
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-07-25  1:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #13 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kong Lingling <konglin1@gcc.gnu.org>:

https://gcc.gnu.org/g:bc00de070f0b9a25f68ffddbefe516543a44bd23

commit r15-2295-gbc00de070f0b9a25f68ffddbefe516543a44bd23
Author: Lingling Kong <lingling.kong@intel.com>
Date:   Thu Jul 25 09:42:06 2024 +0800

    i386: Adjust rtx cost for imulq and imulw [PR115749]

    gcc/ChangeLog:

            PR target/115749
            * config/i386/x86-tune-costs.h (struct processor_costs):
            Adjust rtx_cost of imulq and imulw for COST_N_INSNS (4)
            to COST_N_INSNS (3).

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr115749.c: New test.


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (12 preceding siblings ...)
  2024-07-25  1:45 ` cvs-commit at gcc dot gnu.org
@ 2024-08-15  5:11 ` liuhongt at gcc dot gnu.org
  2024-08-15  5:35 ` liuhongt at gcc dot gnu.org
  2024-08-16  5:00 ` sjames at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-15  5:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Bug 115749 depends on bug 115756, which changed state.

Bug 115756 Summary: default tuning for x86_64 produces shifts for `*240`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (13 preceding siblings ...)
  2024-08-15  5:11 ` liuhongt at gcc dot gnu.org
@ 2024-08-15  5:35 ` liuhongt at gcc dot gnu.org
  2024-08-16  5:00 ` sjames at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-15  5:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Fixed in GCC 15.


* [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
  2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
                   ` (14 preceding siblings ...)
  2024-08-15  5:35 ` liuhongt at gcc dot gnu.org
@ 2024-08-16  5:00 ` sjames at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: sjames at gcc dot gnu.org @ 2024-08-16  5:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Sam James <sjames at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |15.0

