public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments
@ 2020-11-08 19:53 tkoenig at gcc dot gnu.org
  2020-11-09  6:29 ` [Bug rtl-optimization/97756] " tkoenig at gcc dot gnu.org
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-11-08 19:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

            Bug ID: 97756
           Summary: Inefficient handling of 128-bit arguments
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

This is an offshoot from PR 97459.

The code

#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)
#define MASK60 ((1ul << 60) - 1)

typedef __uint128_t mytype;

void
div_rem_13_v2 (mytype n, mytype *div, unsigned int *rem)
{
  const mytype magic = TWO_64 * 14189803133622732012u
                       + 5675921253449092805u * ONE;
  unsigned long a, b, c;
  unsigned int r;

  a = n & MASK60;
  b = (n >> 60);
  b = b & MASK60;
  c = (n >> 120);
  r = (a+b+c) % 13;
  n = n - r;
  *div = n * magic;
  *rem = r;
}

when compiled on x86_64 on Zen with -O3 -march=native, shows quite a bit of
register shuffling at the beginning:

   0:   49 89 f0                mov    %rsi,%r8
   3:   48 89 fe                mov    %rdi,%rsi
   6:   49 89 d1                mov    %rdx,%r9
   9:   48 ba ff ff ff ff ff    movabs $0xfffffffffffffff,%rdx
  10:   ff ff 0f 
  13:   4c 89 c7                mov    %r8,%rdi
  16:   48 89 f0                mov    %rsi,%rax
  19:   49 89 c8                mov    %rcx,%r8
  1c:   48 89 f1                mov    %rsi,%rcx
  1f:   49 89 fa                mov    %rdi,%r10
  22:   48 0f ac f8 3c          shrd   $0x3c,%rdi,%rax
  27:   48 21 d1                and    %rdx,%rcx
  2a:   41 56                   push   %r14
  2c:   49 c1 ea 38             shr    $0x38,%r10
  30:   48 21 d0                and    %rdx,%rax
  33:   53                      push   %rbx
  34:   48 bb c5 4e ec c4 4e    movabs $0x4ec4ec4ec4ec4ec5,%rbx
  3b:   ec c4 4e 
  3e:   4c 01 d1                add    %r10,%rcx
  41:   45 31 db                xor    %r11d,%r11d
  44:   48 01 c1                add    %rax,%rcx
  47:   48 89 c8                mov    %rcx,%rax
  4a:   48 f7 e3                mul    %rbx
  4d:   48 c1 ea 02             shr    $0x2,%rdx
  51:   48 8d 04 52             lea    (%rdx,%rdx,2),%rax
  55:   48 8d 04 82             lea    (%rdx,%rax,4),%rax
  59:   48 89 ca                mov    %rcx,%rdx
  5c:   48 b9 ec c4 4e ec c4    movabs $0xc4ec4ec4ec4ec4ec,%rcx
  63:   4e ec c4 
  66:   48 29 c2                sub    %rax,%rdx
  69:   48 29 d6                sub    %rdx,%rsi
  6c:   49 89 d6                mov    %rdx,%r14
  6f:   4c 19 df                sbb    %r11,%rdi
  72:   48 0f af ce             imul   %rsi,%rcx
  76:   48 89 f2                mov    %rsi,%rdx
  79:   48 89 f8                mov    %rdi,%rax
  7c:   c4 e2 cb f6 fb          mulx   %rbx,%rsi,%rdi
  81:   48 0f af c3             imul   %rbx,%rax
  85:   49 89 31                mov    %rsi,(%r9)
  88:   48 01 c8                add    %rcx,%rax
  8b:   48 01 c7                add    %rax,%rdi
  8e:   49 89 79 08             mov    %rdi,0x8(%r9)
  92:   45 89 30                mov    %r14d,(%r8)
  95:   5b                      pop    %rbx
  96:   41 5e                   pop    %r14
  98:   c3                      retq
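
For readers who want to check the arithmetic of the test case itself: it relies on 2^60 being congruent to 1 modulo 13, so the three chunks of n can be summed before reducing, and on `magic` being the multiplicative inverse of 13 modulo 2^128, so that multiplying the exact multiple n - r by it yields the quotient. The following is only a sketch under the assumption of a compiler that provides `__uint128_t`; the names `magic13`, `div_rem_13` and `check_div_rem_13` are invented for illustration and do not appear in the report.

```c
#include <assert.h>
#include <stdint.h>

typedef __uint128_t u128;

#define ONE    ((u128) 1)
#define TWO_64 (ONE << 64)
#define MASK60 ((1ull << 60) - 1)

/* The constant called `magic` in the report: the inverse of 13 mod 2^128. */
static u128 magic13(void)
{
  return TWO_64 * 14189803133622732012ull + 5675921253449092805ull * ONE;
}

/* Same arithmetic as div_rem_13_v2 in the report, with fixed-width types. */
static void div_rem_13(u128 n, u128 *div, unsigned int *rem)
{
  uint64_t a = (uint64_t) (n & MASK60);        /* bits 0..59 */
  uint64_t b = (uint64_t) (n >> 60) & MASK60;  /* bits 60..119 */
  uint64_t c = (uint64_t) (n >> 120);          /* bits 120..127 */
  /* 2^60 == 1 (mod 13), so n == a + b + c (mod 13). */
  unsigned int r = (unsigned int) ((a + b + c) % 13);

  *div = (n - r) * magic13();  /* n - r is an exact multiple of 13 */
  *rem = r;
}

/* Hypothetical checker: compare against the compiler's own / and %. */
static void check_div_rem_13(u128 n)
{
  u128 q;
  unsigned int r;

  div_rem_13(n, &q, &r);
  assert(q == n / 13);
  assert(r == (unsigned int) (n % 13));
}
```

Calling `check_div_rem_13` from a small driver on a few values (0, 13, all-ones) is enough to confirm that the shortcut agrees with ordinary 128-bit division.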

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug rtl-optimization/97756] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
@ 2020-11-09  6:29 ` tkoenig at gcc dot gnu.org
  2020-12-25 11:35 ` tkoenig at gcc dot gnu.org
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-11-09  6:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #1 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Actually, it was on a Ryzen 1700 (for the -march=native).

I'm at odds with architecture names...


* [Bug rtl-optimization/97756] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
  2020-11-09  6:29 ` [Bug rtl-optimization/97756] " tkoenig at gcc dot gnu.org
@ 2020-12-25 11:35 ` tkoenig at gcc dot gnu.org
  2021-03-10 12:55 ` ppalka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-12-25 11:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98438

--- Comment #2 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Might be related to / dup of PR 98438.


* [Bug rtl-optimization/97756] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
  2020-11-09  6:29 ` [Bug rtl-optimization/97756] " tkoenig at gcc dot gnu.org
  2020-12-25 11:35 ` tkoenig at gcc dot gnu.org
@ 2021-03-10 12:55 ` ppalka at gcc dot gnu.org
  2021-04-28  4:58 ` pinskia at gcc dot gnu.org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ppalka at gcc dot gnu.org @ 2021-03-10 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Patrick Palka <ppalka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ppalka at gcc dot gnu.org

--- Comment #3 from Patrick Palka <ppalka at gcc dot gnu.org> ---
Perhaps related to this PR: On x86_64, the following basic wrapper around
int128 addition

  __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }

gets compiled (with -O3, -O2 or -Os) to the seemingly suboptimal

        movq    %rdi, %r9
        movq    %rdx, %rax
        movq    %rsi, %r8
        movq    %rcx, %rdx
        addq    %r9, %rax
        adcq    %r8, %rdx
        ret

Clang does:

        movq    %rdi, %rax
        addq    %rdx, %rax
        adcq    %rcx, %rsi
        movq    %rsi, %rdx
        retq
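
As a reading aid for the two listings: both implement the 128-bit sum as a 64-bit add of the low words followed by an add-with-carry of the high words, and differ only in how many register copies surround that pair. A minimal C sketch of the same decomposition follows; the struct `u128_pair` and the function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation of a 128-bit value as two 64-bit words. */
typedef struct { uint64_t lo, hi; } u128_pair;

/* Add by halves, mirroring the add/adc instruction pair. */
static u128_pair add128_by_halves(u128_pair x, u128_pair y)
{
  u128_pair r;
  uint64_t carry;

  r.lo = x.lo + y.lo;          /* addq: low words, may wrap around */
  carry = r.lo < x.lo;         /* 1 if the low-word addition wrapped */
  r.hi = x.hi + y.hi + carry;  /* adcq: high words plus carry */
  return r;
}

/* Check the decomposition against the compiler's native __uint128_t add. */
static void check_add128(uint64_t xlo, uint64_t xhi, uint64_t ylo, uint64_t yhi)
{
  __uint128_t x = ((__uint128_t) xhi << 64) | xlo;
  __uint128_t y = ((__uint128_t) yhi << 64) | ylo;
  __uint128_t s = x + y;
  u128_pair r = add128_by_halves((u128_pair) { xlo, xhi },
                                 (u128_pair) { ylo, yhi });

  assert(r.lo == (uint64_t) s);
  assert(r.hi == (uint64_t) (s >> 64));
}
```

Only two arithmetic instructions are needed for the sum itself, so every additional `movq` in the output is pure register shuffling.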


* [Bug rtl-optimization/97756] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-03-10 12:55 ` ppalka at gcc dot gnu.org
@ 2021-04-28  4:58 ` pinskia at gcc dot gnu.org
  2021-04-28  7:20 ` [Bug rtl-optimization/97756] [9/10/11/12 Regression] " jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-28  4:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dushistov at mail dot ru

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 100301 has been marked as a duplicate of this bug. ***


* [Bug rtl-optimization/97756] [9/10/11/12 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-04-28  4:58 ` pinskia at gcc dot gnu.org
@ 2021-04-28  7:20 ` jakub at gcc dot gnu.org
  2021-06-01  8:18 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-04-28  7:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|enhancement                 |normal
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
            Summary|Inefficient handling of     |[9/10/11/12 Regression]
                   |128-bit arguments           |Inefficient handling of
                   |                            |128-bit arguments
   Last reconfirmed|                            |2021-04-28
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
           Priority|P3                          |P2
   Target Milestone|---                         |9.4

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
On the
__uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
__uint128_t g(__uint128_t x, __uint128_t y) { return y + x; }
testcase with -O2 this regressed with
r9-6788-g0d2a576a1417b8d4526d369fef1d87cee2c49f99


* [Bug rtl-optimization/97756] [9/10/11/12 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-04-28  7:20 ` [Bug rtl-optimization/97756] [9/10/11/12 Regression] " jakub at gcc dot gnu.org
@ 2021-06-01  8:18 ` rguenth at gcc dot gnu.org
  2021-08-30  5:18 ` crazylht at gmail dot com
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-06-01  8:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.4                         |9.5

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.


* [Bug rtl-optimization/97756] [9/10/11/12 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-06-01  8:18 ` rguenth at gcc dot gnu.org
@ 2021-08-30  5:18 ` crazylht at gmail dot com
  2022-05-27  9:43 ` [Bug rtl-optimization/97756] [10/11/12/13 " rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-08-30  5:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Patrick Palka from comment #3)
> Perhaps related to this PR: On x86_64, the following basic wrapper around
> int128 addition
> 
>   __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
> 
> gets compiled (with -O3, -O2 or -Os) to the seemingly suboptimal
> 
>         movq    %rdi, %r9
>         movq    %rdx, %rax
>         movq    %rsi, %r8
>         movq    %rcx, %rdx
>         addq    %r9, %rax
>         adcq    %r8, %rdx
>         ret
> 
> Clang does:
> 
>         movq    %rdi, %rax
>         addq    %rdx, %rax
>         adcq    %rcx, %rsi
>         movq    %rsi, %rdx
>         retq

Removing addti3/ashlti3 from i386.md also helps here.


* [Bug rtl-optimization/97756] [10/11/12/13 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-08-30  5:18 ` crazylht at gmail dot com
@ 2022-05-27  9:43 ` rguenth at gcc dot gnu.org
  2022-06-28 10:42 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-27  9:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.5                         |10.4

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9 branch is being closed.


* [Bug rtl-optimization/97756] [10/11/12/13 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2022-05-27  9:43 ` [Bug rtl-optimization/97756] [10/11/12/13 " rguenth at gcc dot gnu.org
@ 2022-06-28 10:42 ` jakub at gcc dot gnu.org
  2023-07-07 10:38 ` [Bug rtl-optimization/97756] [11/12/13/14 " rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-28 10:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.4                        |10.5

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2022-06-28 10:42 ` jakub at gcc dot gnu.org
@ 2023-07-07 10:38 ` rguenth at gcc dot gnu.org
  2023-07-13 21:09 ` pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.5                        |11.5

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2023-07-07 10:38 ` [Bug rtl-optimization/97756] [11/12/13/14 " rguenth at gcc dot gnu.org
@ 2023-07-13 21:09 ` pinskia at gcc dot gnu.org
  2023-07-16 11:48 ` tkoenig at gcc dot gnu.org
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-13 21:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This seems to be improved on trunk ...


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2023-07-13 21:09 ` pinskia at gcc dot gnu.org
@ 2023-07-16 11:48 ` tkoenig at gcc dot gnu.org
  2023-11-07 18:38 ` tkoenig at gcc dot gnu.org
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-07-16 11:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #12 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #11)
> This seems to be improved on trunk ...

gcc is down to 37 instructions now for the original test case with -O3.
icc, which appears to be best, has 33, see https://godbolt.org/z/461jeozs9 .


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2023-07-16 11:48 ` tkoenig at gcc dot gnu.org
@ 2023-11-07 18:38 ` tkoenig at gcc dot gnu.org
  2023-11-13  9:06 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-11-07 18:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #13 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Patrick Palka from comment #3)
> Perhaps related to this PR: On x86_64, the following basic wrapper around
> int128 addition
> 
>   __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
> 
> gets compiled (with -O3, -O2 or -Os) to the seemingly suboptimal
> 
>         movq    %rdi, %r9
>         movq    %rdx, %rax
>         movq    %rsi, %r8
>         movq    %rcx, %rdx
>         addq    %r9, %rax
>         adcq    %r8, %rdx
>         ret
> 
> Clang does:
> 
>         movq    %rdi, %rax
>         addq    %rdx, %rax
>         adcq    %rcx, %rsi
>         movq    %rsi, %rdx
>         retq

With current trunk, this is now

        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret

so it looks OK.

The original test case regressed a bit; it is now 39 instructions.


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2023-11-07 18:38 ` tkoenig at gcc dot gnu.org
@ 2023-11-13  9:06 ` cvs-commit at gcc dot gnu.org
  2023-11-13 17:52 ` tkoenig at gcc dot gnu.org
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-13  9:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #14 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:0a140730c970870a5125beb1114f6c01679a040e

commit r14-5385-g0a140730c970870a5125beb1114f6c01679a040e
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon Nov 13 09:05:16 2023 +0000

    i386: Improve reg pressure of double word right shift then truncate.

    This patch improves register pressure during reload, inspired by PR 97756.
    Normally, a double-word right-shift by a constant produces a double-word
    result, the highpart of which is dead when followed by a truncation.
    The dead code calculating the high part gets cleaned up post-reload, so
    the issue isn't normally visible, except for the increased register
    pressure during reload, sometimes leading to odd register assignments.
    Providing a post-reload splitter, which clobbers a single wordmode
    result register instead of a doubleword result register, helps (a bit).

    An example demonstrating this effect is:

    unsigned long foo (__uint128_t n)
    {
      unsigned long a = n & MASK60;
      unsigned long b = (n >> 60);
      b = b & MASK60;
      unsigned long c = (n >> 120);
      return a+b+c;
    }

    which currently with -O2 generates (13 instructions):
    foo:    movabsq $1152921504606846975, %rcx
            xchgq   %rdi, %rsi
            movq    %rsi, %rax
            shrdq   $60, %rdi, %rax
            movq    %rax, %rdx
            movq    %rsi, %rax
            movq    %rdi, %rsi
            andq    %rcx, %rax
            shrq    $56, %rsi
            andq    %rcx, %rdx
            addq    %rsi, %rax
            addq    %rdx, %rax
            ret

    with this patch, we generate one less mov (12 instructions):
    foo:    movabsq $1152921504606846975, %rcx
            xchgq   %rdi, %rsi
            movq    %rdi, %rdx
            movq    %rsi, %rax
            movq    %rdi, %rsi
            shrdq   $60, %rdi, %rdx
            andq    %rcx, %rax
            shrq    $56, %rsi
            addq    %rsi, %rax
            andq    %rcx, %rdx
            addq    %rdx, %rax
            ret

    The significant difference is easier to see via diff:
    <       shrdq   $60, %rdi, %rax
    <       movq    %rax, %rdx
    ---
    >       shrdq   $60, %rdi, %rdx

    Admittedly a single "mov" isn't much of a saving on modern architectures,
    but as demonstrated by the PR, people still track the number of them.

    2023-11-13  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            * config/i386/i386.md (<insn><dwi>3_doubleword_lowpart): New
            define_insn_and_split to optimize register usage of doubleword
            right shifts followed by truncation.
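
The point about the dead high part can be seen at the source level too: every value in `foo` fits in one 64-bit word, so the function can be written using only the two input words, with the `n >> 60` term needing bits from both. The sketch below restates `foo` with fixed-width types and adds a word-level equivalent; `foo_by_words` and `check_foo` are hypothetical names used only here, and the comments mapping lines to `shrdq`/`shrq` are a reading of the listings above rather than a statement about the RTL.

```c
#include <assert.h>
#include <stdint.h>

#define MASK60 ((1ull << 60) - 1)

/* foo from the commit message, written with fixed-width types. */
static uint64_t foo(__uint128_t n)
{
  uint64_t a = (uint64_t) (n & MASK60);
  uint64_t b = (uint64_t) (n >> 60);
  b = b & MASK60;
  uint64_t c = (uint64_t) (n >> 120);
  return a + b + c;
}

/* Hypothetical word-level equivalent: lo and hi are the two halves of n. */
static uint64_t foo_by_words(uint64_t lo, uint64_t hi)
{
  uint64_t a = lo & MASK60;
  /* Low word of (n >> 60): top 4 bits of lo joined with low bits of hi,
     which is what a single shrdq $60 produces. */
  uint64_t b = ((lo >> 60) | (hi << 4)) & MASK60;
  /* n >> 120 only looks at the top 8 bits of hi (shrq $56). */
  uint64_t c = hi >> 56;
  return a + b + c;
}

/* Check that both formulations agree for one pair of halves. */
static void check_foo(uint64_t lo, uint64_t hi)
{
  __uint128_t n = ((__uint128_t) hi << 64) | lo;
  assert(foo(n) == foo_by_words(lo, hi));
}
```

No 128-bit intermediate has to be kept live in the second version, which is the register-pressure effect the new splitter aims for.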


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2023-11-13  9:06 ` cvs-commit at gcc dot gnu.org
@ 2023-11-13 17:52 ` tkoenig at gcc dot gnu.org
  2023-11-14 12:20 ` cvs-commit at gcc dot gnu.org
  2024-04-26 12:54 ` [Bug rtl-optimization/97756] [11/12/13 " roger at nextmovesoftware dot com
  16 siblings, 0 replies; 18+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-11-13 17:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #15 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to CVS Commits from comment #14)

>     Admittedly a single "mov" isn't much of a saving on modern architectures,
>     but as demonstrated by the PR, people still track the number of them.

Thanks :-)


* [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2023-11-13 17:52 ` tkoenig at gcc dot gnu.org
@ 2023-11-14 12:20 ` cvs-commit at gcc dot gnu.org
  2024-04-26 12:54 ` [Bug rtl-optimization/97756] [11/12/13 " roger at nextmovesoftware dot com
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-14 12:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #16 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:aad65285a1c681feb9fc5b041c86d841b24c3d2a

commit r14-5442-gaad65285a1c681feb9fc5b041c86d841b24c3d2a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 14 13:19:48 2023 +0100

    i386: Fix up <insn><dwi>3_doubleword_lowpart [PR112523]

    On Sun, Nov 12, 2023 at 09:03:42PM -0000, Roger Sayle wrote:
    > This patch improves register pressure during reload, inspired by PR 97756.
    > Normally, a double-word right-shift by a constant produces a double-word
    > result, the highpart of which is dead when followed by a truncation.
    > The dead code calculating the high part gets cleaned up post-reload, so
    > the issue isn't normally visible, except for the increased register
    > pressure during reload, sometimes leading to odd register assignments.
    > Providing a post-reload splitter, which clobbers a single wordmode
    > result register instead of a doubleword result register, helps (a bit).

    Unfortunately this broke bootstrap on i686-linux, broke all ACATS tests
    on x86_64-linux as well as miscompiled e.g. __floattisf in libgcc there
    as well.

    The bug is that shrd{l,q} instruction expects the low part of the input
    to be the same register as the output, rather than the high part as the
    patch implemented.
      split_double_mode (<DWI>mode, &operands[1], 1, &operands[1],
                         &operands[3]);
    sets operands[1] to the lo_half and operands[3] to the hi_half, so if
    operands[0] is not the same register as operands[1] (rather than [3]) after
    RA, we should during splitting move operands[1] into operands[0].

    Your testcase:
    > #define MASK60 ((1ul << 60) - 1)
    > unsigned long foo (__uint128_t n)
    > {
    >   unsigned long a = n & MASK60;
    >   unsigned long b = (n >> 60);
    >   b = b & MASK60;
    >   unsigned long c = (n >> 120);
    >   return a+b+c;
    > }

    still has the same number of instructions.

    Bootstrapped/regtested on x86_64-linux (where it e.g. turns
                    === acats Summary ===
    -# of unexpected failures       2328
    +# of expected passes           2328
    +# of unexpected failures       0
    and fixes gcc.dg/torture/fp-int-convert-*timode.c FAILs as well)
    and i686-linux (where it previously didn't bootstrap, but compared to
    Friday evening's bootstrap the testresults are ok).

    2023-11-14  Jakub Jelinek  <jakub@redhat.com>

            PR target/112523
            PR ada/112514
            * config/i386/i386.md (<insn><dwi>3_doubleword_lowpart): Move
            operands[1] aka low part of input rather than operands[3] aka high
            part of input to output if not the same register.
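
The operand constraint described above can be modelled in C. In AT&T syntax, `shrdq $count, src, dst` shifts dst right and fills the vacated high bits from the low bits of src, so dst has to start out holding the low half of the double-word input. The sketch below assumes 0 < count < 64; `shrd64` and the two wrappers are hypothetical names, and the "swapped" variant is only a simplified model of seeding the destination with the wrong half, not a transcription of the original patch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of shrdq for 0 < count < 64: dst is shifted right and its top
   bits are filled from the low bits of src. */
static uint64_t shrd64(uint64_t dst, uint64_t src, unsigned int count)
{
  return (dst >> count) | (src << (64 - count));
}

/* Destination seeded with the low half: yields the low word of n >> count. */
static uint64_t lowpart_shift(uint64_t lo, uint64_t hi, unsigned int count)
{
  return shrd64(lo, hi, count);
}

/* Simplified model of the mistake: destination seeded with the high half. */
static uint64_t lowpart_shift_swapped(uint64_t lo, uint64_t hi,
                                      unsigned int count)
{
  (void) lo;  /* the low half never reaches the result in this variant */
  return shrd64(hi, hi, count);
}

/* Check the correct variant against a native 128-bit shift and truncate. */
static void check_lowpart_shift(uint64_t lo, uint64_t hi, unsigned int count)
{
  __uint128_t n = ((__uint128_t) hi << 64) | lo;
  assert(lowpart_shift(lo, hi, count) == (uint64_t) (n >> count));
}
```

With lo = 0x0123456789abcdef, hi = 0xfedcba9876543210 and a count of 60, the two variants differ in the low four bits, which is the kind of silent miscompilation the follow-up commit repairs.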


* [Bug rtl-optimization/97756] [11/12/13 Regression] Inefficient handling of 128-bit arguments
  2020-11-08 19:53 [Bug rtl-optimization/97756] New: Inefficient handling of 128-bit arguments tkoenig at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2023-11-14 12:20 ` cvs-commit at gcc dot gnu.org
@ 2024-04-26 12:54 ` roger at nextmovesoftware dot com
  16 siblings, 0 replies; 18+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-04-26 12:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |14.0
            Summary|[11/12/13/14/15 Regression] |[11/12/13 Regression]
                   |Inefficient handling of     |Inefficient handling of
                   |128-bit arguments           |128-bit arguments

--- Comment #17 from Roger Sayle <roger at nextmovesoftware dot com> ---
I believe this issue is now fixed on mainline (i.e. for both GCC 14 and GCC
15).
Firstly, many thanks to Jakub for correcting the error in my patch. We now
generate optimal code sequences for the code in comments #3 and #5, and
generate fewer instructions than shown in the original description.

The final remaining issue is that with -O3 GCC still uses more instructions
than clang and icc (see Thomas' remarks in comments #12 and #13).  The good
news is that this is intentional: compiling with -Os (to optimize for size)
generates the same number of instructions as clang and icc [in fact, using icc
-Os generates larger code!?].  So when optimizing for performance, GCC is
taking the opportunity to use more (cheap) instructions to execute faster (or
that's the theory).

