public inbox for gcc-bugs@sourceware.org
* [Bug c++/115749] New: Missed BMI2 optimization on x86-64
@ 2024-07-02  9:31 kim.walisch at gmail dot com
  2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: kim.walisch at gmail dot com @ 2024-07-02  9:31 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

            Bug ID: 115749
           Summary: Missed BMI2 optimization on x86-64
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC (all versions) produces noticeably slower code than Clang. The
cause is that GCC does not use the mulx instruction from BMI2, even when
compiling with -mbmi2. Clang, on the other hand, uses mulx and produces a
shorter and faster assembly sequence. For this particular code sequence Clang
used up to 30% fewer instructions than GCC.

Here is a minimal C/C++ code snippet that reproduces the issue:


extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}



GCC trunk produces the following 15-instruction assembly sequence (without
mulx) when compiled with -O3 -mbmi2:

func(unsigned long):
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        mov     rdi, rdx
        shr     rdi, 7
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        mov     rax, rdx
        sal     rax, 4
        sub     rax, rdx
        sal     rax, 4
        sub     rdi, rax
        mov     rax, QWORD PTR array[0+rdi*8]
        ret


Clang trunk produces the following shorter and faster 12-instruction assembly
sequence (with mulx) when compiled with -O3 -mbmi2:

func(unsigned long):                               # @func(unsigned long)
        movabs  rax, -8608480567731124087
        mov     rdx, rdi
        mulx    rdx, rdx, rax
        shr     rdx, 7
        movabs  rax, 153722867280912931
        mulx    rax, rax, rax
        shr     eax
        imul    eax, eax, 240
        sub     edx, eax
        mov     rax, qword ptr [rip + array@GOTPCREL]
        mov     rax, qword ptr [rax + 8*rdx]
        ret
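
For readers who want to check the arithmetic behind the two listings, here is
a small sketch in C++ that models them. It is not taken from either compiler;
the function names (mulhi64, div240, mod240_gcc_style, mod240_clang_style,
spot_check) are made up for illustration, and unsigned __int128 is a GCC/Clang
extension. The constant 0x8888888888888889 is the unsigned 64-bit value of
-8608480567731124087 from the listings.

```cpp
#include <cassert>
#include <cstdint>

// High 64 bits of the 128-bit product a * b, i.e. the value that mul/mulx
// leave in their "high" destination register.
constexpr std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b)
{
    return static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(a) * b) >> 64);
}

// x / 240 as in both listings: multiply-high by the constant, then shr 7.
constexpr std::uint64_t div240(std::uint64_t x)
{
    return mulhi64(x, 0x8888888888888889ULL) >> 7;
}

// (x / 240) % 240 modeled on the GCC listing: a second division with the
// same constant, then index - 240 * q, where 240 * q is formed with
// shifts and subtracts as ((q << 4) - q) << 4.
constexpr std::uint64_t mod240_gcc_style(std::uint64_t x)
{
    std::uint64_t index = div240(x);
    std::uint64_t q = div240(index);
    return index - (((q << 4) - q) << 4);
}

// (x / 240) % 240 modeled on the Clang listing: the second division uses
// the constant 153722867280912931 and a shift by 1, followed by imul/sub.
// The listing does these last steps in 32-bit registers; this sketch uses
// 64-bit arithmetic, which should give the same value since it is < 240.
constexpr std::uint64_t mod240_clang_style(std::uint64_t x)
{
    std::uint64_t index = div240(x);
    std::uint64_t q = mulhi64(index, 153722867280912931ULL) >> 1;
    return index - q * 240;
}

// Runtime spot check against the compiler's own / and % for a few inputs.
inline void spot_check()
{
    const std::uint64_t samples[] = {0, 239, 240, 57839, 57840,
                                     1ULL << 40, UINT64_MAX};
    for (std::uint64_t x : samples) {
        assert(div240(x) == x / 240);
        assert(mod240_gcc_style(x) == (x / 240) % 240);
        assert(mod240_clang_style(x) == (x / 240) % 240);
    }
}
```

Calling spot_check() from a test program compares both variants with plain
x / 240 and (x / 240) % 240. Both variants compute the same index; the
difference in the listings is in how the second multiply and the
multiplication by 240 are encoded.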


Thread overview: 17+ messages
2024-07-02  9:31 [Bug c++/115749] New: Missed BMI2 optimization on x86-64 kim.walisch at gmail dot com
2024-07-02 11:36 ` [Bug c++/115749] " kim.walisch at gmail dot com
2024-07-02 11:44 ` [Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs pinskia at gcc dot gnu.org
2024-07-02 11:59 ` kim.walisch at gmail dot com
2024-07-02 12:17 ` kim.walisch at gmail dot com
2024-07-02 17:21 ` pinskia at gcc dot gnu.org
2024-07-02 17:58 ` pinskia at gcc dot gnu.org
2024-07-02 18:14 ` pinskia at gcc dot gnu.org
2024-07-02 18:41 ` pinskia at gcc dot gnu.org
2024-07-03 17:29 ` kim.walisch at gmail dot com
2024-07-04  1:17 ` liuhongt at gcc dot gnu.org
2024-07-16  8:18 ` lingling.kong7 at gmail dot com
2024-07-16 21:29 ` roger at nextmovesoftware dot com
2024-07-25  1:45 ` cvs-commit at gcc dot gnu.org
2024-08-15  5:11 ` liuhongt at gcc dot gnu.org
2024-08-15  5:35 ` liuhongt at gcc dot gnu.org
2024-08-16  5:00 ` sjames at gcc dot gnu.org
