[Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64
@ 2024-01-23 13:37 accelerator0099 at gmail dot com
  2024-01-23 15:06 ` [Bug target/113560] " rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: accelerator0099 at gmail dot com @ 2024-01-23 13:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

            Bug ID: 113560
           Summary: Strange code generated when optimizing a
                    multiplication on x86_64
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: accelerator0099 at gmail dot com
  Target Milestone: ---

Code:
#include <immintrin.h>
auto f(char *buf, unsigned long long in) noexcept
{
    unsigned long long hi{};
    auto lo{_mulx_u64(in, 0x2af31dc462ull, &hi)};
    lo = _mulx_u64(lo, 100, &hi);
    __builtin_memcpy(buf + 2, &hi, 2);
    return buf + 10;
}
auto g(char *buf, unsigned long long in) noexcept
{
    unsigned long long hi{};
    _mulx_u64(in, 100, &hi);
    __builtin_memcpy(buf + 2, &hi, 2);
    return buf + 10;
}

Compile with:
-Ofast -std=c++23 -march=znver4

GCC 13.2 and truck generate:
f(char*, unsigned long long):
        movabs  rdx, 184467440738
        mov     rax, rdi
        mulx    r9, r8, rsi
        xor     r9d, r9d
        mov     rsi, r8
        mov     rdi, r9
        add     rsi, r8
        shld    rdi, r8, 1
        add     rsi, r8
        adc     rdi, r9
        shld    rdi, rsi, 3
        sal     rsi, 3
        add     rsi, r8
        adc     rdi, r9
        add     rax, 10
        shld    rdi, rsi, 2
        mov     WORD PTR [rax-8], di
        ret
g(char*, unsigned long long):
        mov     eax, 100
        mul     rsi
        lea     rax, [rdi+10]
        mov     WORD PTR [rdi+2], dx
        ret

GCC 12 generates:
f(char*, unsigned long long):
        movabs  rdx, 184467440738
        mov     rax, rsi
        imul    rax, rdx
        mov     edx, 100
        mulx    rdx, rax, rax
        lea     rax, [rdi+10]
        mov     WORD PTR [rdi+2], dx
        ret
g(char*, unsigned long long):
        mov     eax, 100
        mul     rsi
        lea     rax, [rdi+10]
        mov     WORD PTR [rdi+2], dx
        ret

Clang:
f(char*, unsigned long long):
unsigned long long)
        movabs  rdx, 184467440738
        mov     eax, 100
        imul    rdx, rsi
        mulx    rax, rax, rax
        mov     word ptr [rdi + 2], ax
        lea     rax, [rdi + 10]
        ret
g(char*, unsigned long long):
unsigned long long)
        mov     eax, 100
        mov     rdx, rsi
        mulx    rax, rax, rax
        mov     word ptr [rdi + 2], ax
        lea     rax, [rdi + 10]
        ret

See also:
https://gcc.godbolt.org/z/df7Gr1MKo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
@ 2024-01-23 15:06 ` rguenth at gcc dot gnu.org
  2024-01-24  8:32 ` roger at nextmovesoftware dot com
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-23 15:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
          Component|c++                         |target
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-01-23
             Target|                            |x86_64-*-*

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC thinks the multiplication bu the constant is cheaper this way - are you
sure otherwise?

I see g using a highpart multiply while f uses a widening multiply.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
  2024-01-23 15:06 ` [Bug target/113560] " rguenth at gcc dot gnu.org
@ 2024-01-24  8:32 ` roger at nextmovesoftware dot com
  2024-01-24  9:03 ` amonakov at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-01-24  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
The costs look sane, and I'd expect the synth_mult generated sequence to be
faster, though it would be good to get some microbenchmarking.
A reduced test case is:
__int128 foo(__int128 x) { return x*100; }
The x86 backend thinks that a 128-bit (TImode) multiplication would take 14
cycles, so instead generates:
x2 = x+x        2 cycles
x3 = x2+x       2 cycles
x24 = x<<3      2 cycles
x25 = x24+x     2 cycles
x100 = x<<2     2 cycles
which is a total of 10 cycles, and predicted to be faster than the generic
implementation (requiring 2 IMULQ, 1 MULQ and 2 ADDQ) for
__int128 bar(__int128 x, int y) { return x*y; }

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
  2024-01-23 15:06 ` [Bug target/113560] " rguenth at gcc dot gnu.org
  2024-01-24  8:32 ` roger at nextmovesoftware dot com
@ 2024-01-24  9:03 ` amonakov at gcc dot gnu.org
  2024-01-24  9:04 ` accelerator0099 at gmail dot com
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: amonakov at gcc dot gnu.org @ 2024-01-24  9:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Roger Sayle from comment #2)
> The costs look sane, and I'd expect the synth_mult generated sequence to be
> faster, though it would be good to get some microbenchmarking.
> A reduced test case is:
> __int128 foo(__int128 x) { return x*100; }

This is not an equivalent testcase, mulx is a widening multiply from 64-bit
source operands. It has latency 3 or 4 on most implementations. Costing it as a
synthesized general 128-bit multiplication is wrong.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
                   ` (2 preceding siblings ...)
  2024-01-24  9:03 ` amonakov at gcc dot gnu.org
@ 2024-01-24  9:04 ` accelerator0099 at gmail dot com
  2024-01-24  9:11 ` accelerator0099 at gmail dot com
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: accelerator0099 at gmail dot com @ 2024-01-24  9:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

--- Comment #4 from accelerator0099 at gmail dot com ---
Well, I hope gcc will just generate mulx instruction on arch with BMI2. Let's
look at the AMD64 Architecture Programmer’s Manual Volume 3:
Computes the unsigned product of the specified source operand and the implicit
source operand rDX.
Writes the upper half of the product to the first destination and the lower
half to the second.
So, just a mulx can do this. And according to the manual, it only costs 3 or 4
circles to excute a mulx.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
                   ` (3 preceding siblings ...)
  2024-01-24  9:04 ` accelerator0099 at gmail dot com
@ 2024-01-24  9:11 ` accelerator0099 at gmail dot com
  2024-01-24 10:53 ` roger at nextmovesoftware dot com
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: accelerator0099 at gmail dot com @ 2024-01-24  9:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

--- Comment #5 from accelerator0099 at gmail dot com ---
If we are using an arch without BMI2, we can use single MUL instruction
instead. Here is the description of MUL reg64/mem64.
Multiplies a 64-bit register or memory operand by the contents of the RAX
register and stores the result in the RDX:RAX register.
It stores the result in RDX:RAX, putting the high-order bits of the product in
RDX.
And on zen4 arch, it costs 3 or 4 circles to do this.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
                   ` (4 preceding siblings ...)
  2024-01-24  9:11 ` accelerator0099 at gmail dot com
@ 2024-01-24 10:53 ` roger at nextmovesoftware dot com
  2024-01-28  9:17 ` roger at nextmovesoftware dot com
  2024-02-01  6:12 ` cvs-commit at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-01-24 10:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

--- Comment #6 from Roger Sayle <roger at nextmovesoftware dot com> ---
In the .optimized dump, we have:
  __int128 unsigned __res;
  __int128 unsigned _12;
  ...
  __res_11 = in_2(D) w* 184467440738;
  _12 = __res_11 & 18446744073709551615;
  __res_7 = _12 * 100;

So the first multiplication is a widening multiplication and expanded using
mulx, but the second multiplication is a full width TImode multiplication,
which is why it has the same RTL expansion as "x * 100".  This is looking like
a tree-level issue and (perhaps) not a target-specific problem.

In fact, it looks like this operation is actually a highpart_multiplication as
only the highpart of the result is required (which should still generate mulx,
but  has a different representation at the tree-level).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
                   ` (5 preceding siblings ...)
  2024-01-24 10:53 ` roger at nextmovesoftware dot com
@ 2024-01-28  9:17 ` roger at nextmovesoftware dot com
  2024-02-01  6:12 ` cvs-commit at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-01-28  9:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |roger at nextmovesoftware dot com

--- Comment #7 from Roger Sayle <roger at nextmovesoftware dot com> ---
I'm bootstrapping and regression testing a patch.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
  2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
                   ` (6 preceding siblings ...)
  2024-01-28  9:17 ` roger at nextmovesoftware dot com
@ 2024-02-01  6:12 ` cvs-commit at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-02-01  6:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

--- Comment #8 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:2f14c0dbb789852947cb58fdf7d3162413f053fa

commit r14-8680-g2f14c0dbb789852947cb58fdf7d3162413f053fa
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Thu Feb 1 06:10:42 2024 +0000

    PR target/113560: Enhance is_widening_mult_rhs_p.

    This patch resolves PR113560, a code quality regression from GCC12
    affecting x86_64, by enhancing the middle-end's tree-ssa-math-opts.cc
    to recognize more instances of widening multiplications.

    The widening multiplication perception code identifies cases like:

            _1 = (unsigned __int128) x;
            __res = _1 * 100;

    but in the reported test case, the original input looks like:

            _1 = (unsigned long long) x;
            _2 = (unsigned __int128) _1;
            __res = _2 * 100;

    which gets optimized by constant folding during tree-ssa to:

            _2 = x & 18446744073709551615;  // x & 0xffffffffffffffff
            __res = _2 * 100;

    where the BIT_AND_EXPR hides (has consumed) the extension operation.
    This reveals the more general deficiency (missed optimization
    opportunity) in widening multiplication perception that additionally
    both

    __int128 foo(__int128 x, __int128 y) {
      return (x & 1000) * (y & 1000)
    }

    and

    unsigned __int128 bar(unsigned __int128 x, unsigned __int128) {
      return (x >> 80) * (y >> 80);
    }

    should be recognized as widening multiplications.  Hence rather than
    test explicitly for BIT_AND_EXPR (as in the first version of this patch)
    the more general solution is to make use of range information, as
    provided by tree_non_zero_bits.

    As a demonstration of the observed improvements, function foo above
    currently with -O2 compiles on x86_64 to:

    foo:    movq    %rdi, %rsi
            movq    %rdx, %r8
            xorl    %edi, %edi
            xorl    %r9d, %r9d
            andl    $1000, %esi
            andl    $1000, %r8d
            movq    %rdi, %rcx
            movq    %r9, %rdx
            imulq   %rsi, %rdx
            movq    %rsi, %rax
            imulq   %r8, %rcx
            addq    %rdx, %rcx
            mulq    %r8
            addq    %rdx, %rcx
            movq    %rcx, %rdx
            ret

    with this patch, GCC recognizes the *w and instead generates:

    foo:    movq    %rdi, %rsi
            movq    %rdx, %r8
            andl    $1000, %esi
            andl    $1000, %r8d
            movq    %rsi, %rax
            imulq   %r8
            ret

    which is perhaps easier to understand at the tree-level where

    __int128 foo (__int128 x, __int128 y)
    {
      __int128 _1;
      __int128 _2;
      __int128 _5;

      <bb 2> [local count: 1073741824]:
      _1 = x_3(D) & 1000;
      _2 = y_4(D) & 1000;
      _5 = _1 * _2;
      return _5;

    }

    gets transformed to:

    __int128 foo (__int128 x, __int128 y)
    {
      __int128 _1;
      __int128 _2;
      __int128 _5;
      signed long _7;
      signed long _8;

      <bb 2> [local count: 1073741824]:
      _1 = x_3(D) & 1000;
      _2 = y_4(D) & 1000;
      _7 = (signed long) _1;
      _8 = (signed long) _2;
      _5 = _7 w* _8;
      return _5;
    }

    2023-02-01  Roger Sayle  <roger@nextmovesoftware.com>
                Richard Biener  <rguenther@suse.de>

    gcc/ChangeLog
            PR target/113560
            * tree-ssa-math-opts.cc (is_widening_mult_rhs_p): Use range
            information via tree_non_zero_bits to check if this operand
            is suitably extended for a widening (or highpart) multiplication.
            (convert_mult_to_widen): Insert explicit casts if the RHS or LHS
            isn't already of the claimed type.

    gcc/testsuite/ChangeLog
            PR target/113560
            * g++.target/i386/pr113560.C: New test case.
            * gcc.target/i386/pr113560.c: Likewise.
            * gcc.dg/pr87954.c: Update test case.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-02-01  6:12 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-23 13:37 [Bug c++/113560] New: Strange code generated when optimizing a multiplication on x86_64 accelerator0099 at gmail dot com
2024-01-23 15:06 ` [Bug target/113560] " rguenth at gcc dot gnu.org
2024-01-24  8:32 ` roger at nextmovesoftware dot com
2024-01-24  9:03 ` amonakov at gcc dot gnu.org
2024-01-24  9:04 ` accelerator0099 at gmail dot com
2024-01-24  9:11 ` accelerator0099 at gmail dot com
2024-01-24 10:53 ` roger at nextmovesoftware dot com
2024-01-28  9:17 ` roger at nextmovesoftware dot com
2024-02-01  6:12 ` cvs-commit at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).