public inbox for gcc-bugs@sourceware.org
* [Bug c/43644]  New: __uint128_t missed optimizations.
@ 2010-04-04 23:58 svfuerst at gmail dot com
  2010-04-05 10:03 ` [Bug target/43644] " rguenth at gcc dot gnu dot org
  0 siblings, 1 reply; 7+ messages in thread
From: svfuerst at gmail dot com @ 2010-04-04 23:58 UTC (permalink / raw)
  To: gcc-bugs

__uint128_t foo1(__uint128_t x, __uint128_t y)
{
        return x + y;
}

   0x0000000000000520 <+0>:     mov    %rdx,%rax
   0x0000000000000523 <+3>:     mov    %rcx,%rdx
   0x0000000000000526 <+6>:     push   %rbx
   0x0000000000000527 <+7>:     add    %rdi,%rax
   0x000000000000052a <+10>:    adc    %rsi,%rdx
   0x000000000000052d <+13>:    pop    %rbx
   0x000000000000052e <+14>:    retq

%rbx isn't used, yet is saved and restored.

__uint128_t foo2(__uint128_t x, unsigned long long y)
{
        return x + y;
}
   0x0000000000000550 <+0>:     mov    %rdx,%rax
   0x0000000000000553 <+3>:     push   %rbx
   0x0000000000000554 <+4>:     xor    %edx,%edx
   0x0000000000000556 <+6>:     mov    %rsi,%rbx
   0x0000000000000559 <+9>:     add    %rdi,%rax
   0x000000000000055c <+12>:    adc    %rbx,%rdx
   0x000000000000055f <+15>:    pop    %rbx
   0x0000000000000560 <+16>:    retq

%rbx is used, but doesn't need to be. %rcx is call-clobbered and unused
here, so it can be used instead, saving a push-pop pair.

__uint128_t foo3(unsigned long long x, __uint128_t y)
{
        return x + y;
}

   0x0000000000000580 <+0>:     mov    %rdi,%rax
   0x0000000000000583 <+3>:     push   %rbx
   0x0000000000000584 <+4>:     mov    %rdx,%rbx
   0x0000000000000587 <+7>:     xor    %edx,%edx
   0x0000000000000589 <+9>:     add    %rsi,%rax
   0x000000000000058c <+12>:    adc    %rbx,%rdx
   0x000000000000058f <+15>:    pop    %rbx
   0x0000000000000590 <+16>:    retq

This has the same problems as the previous two functions. In addition,
%rdx already holds the high half of y, so it can be used in place as the
high half of the result, avoiding one of the mov instructions.  That is, the
function could be optimized to:

mov    %rdi,%rax
xor    %ecx,%ecx
add    %rsi,%rax
adc    %rcx,%rdx
retq


-- 
           Summary: __uint128_t missed optimizations.
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: svfuerst at gmail dot com
 GCC build triplet: x86_64-linux
  GCC host triplet: x86_64-linux
GCC target triplet: x86_64-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/43644] __uint128_t missed optimizations.
  2010-04-04 23:58 [Bug c/43644] New: __uint128_t missed optimizations svfuerst at gmail dot com
@ 2010-04-05 10:03 ` rguenth at gcc dot gnu dot org
  0 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2010-04-05 10:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from rguenth at gcc dot gnu dot org  2010-04-05 10:03 -------
Confirmed.  There may be (a) dup(s) for this bug.  The issue seems to be that
the RA (register allocator) doesn't pessimize the use of callee-saved regs.
Does it?

In example foo1, cprop-hardreg and DCE get rid of the %rbx use, but only
after the prologue/epilogue have already been generated, so the save and
restore of %rbx remain.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at redhat dot com
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization, ra
   Last reconfirmed|0000-00-00 00:00:00         |2010-04-05 10:03:09
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644


* [Bug target/43644] __uint128_t missed optimizations.
       [not found] <bug-43644-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2023-12-31 21:39 ` cvs-commit at gcc dot gnu.org
@ 2024-04-26 12:59 ` roger at nextmovesoftware dot com
  4 siblings, 0 replies; 7+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-04-26 12:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
                 CC|                            |roger at nextmovesoftware dot com
             Status|NEW                         |RESOLVED
   Target Milestone|---                         |14.0

--- Comment #6 from Roger Sayle <roger at nextmovesoftware dot com> ---
This is now fixed on mainline (for GCC 14 and GCC 15).

* [Bug target/43644] __uint128_t missed optimizations.
       [not found] <bug-43644-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2023-08-01  8:21 ` jbeulich at suse dot com
@ 2023-12-31 21:39 ` cvs-commit at gcc dot gnu.org
  2024-04-26 12:59 ` roger at nextmovesoftware dot com
  4 siblings, 0 replies; 7+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-12-31 21:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644

--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:79e1b23b91477b29deccf2cae92a7e8dd816c54a

commit r14-6874-g79e1b23b91477b29deccf2cae92a7e8dd816c54a
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Sun Dec 31 21:37:24 2023 +0000

    i386: Tweak define_insn_and_split to fix FAIL of gcc.target/i386/pr43644-2.c

    This patch resolves the failure of pr43644-2.c in the testsuite, a
    code-quality test I added back in July that started failing because the
    code GCC generates for 128-bit values (and their parameter passing) has
    been in flux.

    The function:

    unsigned __int128 foo(unsigned __int128 x, unsigned long long y) {
      return x+y;
    }

    currently generates:

    foo:    movq    %rdx, %rcx
            movq    %rdi, %rax
            movq    %rsi, %rdx
            addq    %rcx, %rax
            adcq    $0, %rdx
            ret

    and with this patch, we now generate:

    foo:    movq    %rdi, %rax
            addq    %rdx, %rax
            movq    %rsi, %rdx
            adcq    $0, %rdx
            ret

    which is optimal.

    2023-12-31  Uros Bizjak  <ubizjak@gmail.com>
                Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            PR target/43644
            * config/i386/i386.md (*add<dwi>3_doubleword_concat_zext): Tweak
            order of instructions after split, to minimize number of moves.

    gcc/testsuite/ChangeLog
            PR target/43644
            * gcc.target/i386/pr43644-2.c: Expect 2 movq instructions.
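The ChangeLog entry above refers to a code-quality test that counts movq instructions in the generated assembly. The sketch below shows how such a test could be written with the GCC testsuite's DejaGnu directives (dg-do, dg-options, and dg-final with scan-assembler-times). It is a hypothetical illustration, not the committed contents of pr43644-2.c:

```c
/* Hypothetical sketch of a scan-assembler code-quality test.  */
/* { dg-do compile { target int128 } } */
/* { dg-options "-O2" } */

unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
{
  return x + y;
}

/* The optimal sequence above contains exactly two movq.  */
/* { dg-final { scan-assembler-times "movq\t" 2 } } */
```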

* [Bug target/43644] __uint128_t missed optimizations.
       [not found] <bug-43644-4@http.gcc.gnu.org/bugzilla/>
  2023-05-07  6:57 ` cvs-commit at gcc dot gnu.org
  2023-07-07 19:41 ` cvs-commit at gcc dot gnu.org
@ 2023-08-01  8:21 ` jbeulich at suse dot com
  2023-12-31 21:39 ` cvs-commit at gcc dot gnu.org
  2024-04-26 12:59 ` roger at nextmovesoftware dot com
  4 siblings, 0 replies; 7+ messages in thread
From: jbeulich at suse dot com @ 2023-08-01  8:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644

jbeulich at suse dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jbeulich at suse dot com

--- Comment #4 from jbeulich at suse dot com ---
I don't know what's different about my build, but I'm seeing the new
pr43644-2.c test failing, with this code generated:

foo:    movq    %rdx, %rcx
        movq    %rdi, %rax
        movq    %rsi, %rdx
        addq    %rcx, %rax
        adcq    $0, %rdx
        ret

* [Bug target/43644] __uint128_t missed optimizations.
       [not found] <bug-43644-4@http.gcc.gnu.org/bugzilla/>
  2023-05-07  6:57 ` cvs-commit at gcc dot gnu.org
@ 2023-07-07 19:41 ` cvs-commit at gcc dot gnu.org
  2023-08-01  8:21 ` jbeulich at suse dot com
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-07 19:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644

--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:bdf2737cda53a83332db1a1a021653447b05a7e7

commit r14-2386-gbdf2737cda53a83332db1a1a021653447b05a7e7
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Jul 7 20:39:58 2023 +0100

    i386: Improve __int128 argument passing (in ix86_expand_move).

    Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
    result in surprising code.  Consider the example below (from PR 43644):

    unsigned __int128 foo(unsigned __int128 x, unsigned long long y) {
      return x+y;
    }

    which currently results in 6 consecutive movq instructions:

    foo:    movq    %rsi, %rax
            movq    %rdi, %rsi
            movq    %rdx, %rcx
            movq    %rax, %rdi
            movq    %rsi, %rax
            movq    %rdi, %rdx
            addq    %rcx, %rax
            adcq    $0, %rdx
            ret

    The underlying issue is that during RTL expansion, we generate the
    following initial RTL for the x argument:

    (insn 4 3 5 2 (set (reg:TI 85)
            (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
         (nil))
    (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
            (reg:DI 87)) "pr43644-2.c":5:1 -1
         (nil))
    (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
            (reg:TI 85)) "pr43644-2.c":5:1 -1
         (nil))

    which by combine/reload becomes

    (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
            (const_int 0 [0])) "pr43644-2.c":5:1 -1
         (nil))
    (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
            (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
         (expr_list:REG_DEAD (reg:DI 93)
            (nil)))
    (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
            (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
         (expr_list:REG_DEAD (reg:DI 94)
            (nil)))

    where the heavy use of SUBREG SET_DESTs creates challenges for both
    combine and register allocation.

    The improvement proposed here is to avoid these problematic SUBREGs
    by adding (two) special cases to ix86_expand_move.  For insn 4, which
    sets a TImode destination from a paradoxical SUBREG to assign the
    lowpart, we can use an explicit zero extension (zero_extendditi2 was
    added in July 2022).  For insn 5, which sets the highpart of a TImode
    register, we can use the *insvti_highpart_1 instruction (added in
    May 2023, after being approved for stage 1 in January).  This allows
    combine to work its magic, merging these insns into a *concatditi3
    and from there into other optimized forms.

    So for the test case above, we now generate only a single movq:

    foo:    movq    %rdx, %rax
            xorl    %edx, %edx
            addq    %rdi, %rax
            adcq    %rsi, %rdx
            ret

    But there is a little bad news.  This patch causes two (minor)
    missed-optimization regressions on x86_64: gcc.target/i386/pr82580.c
    and gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
    no longer generating adcq $0, but instead using xorl.  For the other
    FAIL, register allocation now has more freedom and is (arbitrarily)
    choosing a register assignment that doesn't match what the test is
    expecting.  These issues are easier to explain and fix once this patch
    is in the tree.

    The good news is that this approach fixes a number of long-standing
    issues, which still need to be checked in bugzilla, including PR
    target/110533, which was just opened/reported earlier this week.

    2023-07-07  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            PR target/43644
            PR target/110533
            * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
            TImode destinations from paradoxical SUBREGs (setting the lowpart)
            into explicit zero extensions.  Use *insvti_highpart_1 instruction
            to set the highpart of a TImode destination.

    gcc/testsuite/ChangeLog
            PR target/43644
            PR target/110533
            * gcc.target/i386/pr110533.c: New test case.
            * gcc.target/i386/pr43644-2.c: Likewise.

* [Bug target/43644] __uint128_t missed optimizations.
       [not found] <bug-43644-4@http.gcc.gnu.org/bugzilla/>
@ 2023-05-07  6:57 ` cvs-commit at gcc dot gnu.org
  2023-07-07 19:41 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-05-07  6:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644

--- Comment #2 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:d8a6945c6ea22efa4d5e42fe1922d2b27953c8cd

commit r14-554-gd8a6945c6ea22efa4d5e42fe1922d2b27953c8cd
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Sun May 7 07:52:15 2023 +0100

    Don't call emit_clobber in lower-subreg.cc's resolve_simple_move.

    Following up on posts/reviews by Segher and Uros, there's some question
    over why the middle-end's lower subreg pass emits a clobber (of a
    multi-word register) into the instruction stream before emitting the
    sequence of moves of the word-sized parts.  This clobber interferes
    with (LRA) register allocation, preventing the multi-word pseudo from
    remaining in the same hard registers.  This patch eliminates this
    (presumably superfluous) clobber and thereby improves register allocation.

    A concrete example of the observed improvement is PR target/43644.
    For the test case:
    __int128 foo(__int128 x, __int128 y) { return x+y; }

    on x86_64-pc-linux-gnu, gcc -O2 currently generates:

    foo:    movq    %rsi, %rax
            movq    %rdi, %r8
            movq    %rax, %rdi
            movq    %rdx, %rax
            movq    %rcx, %rdx
            addq    %r8, %rax
            adcq    %rdi, %rdx
            ret

    with this patch, we now generate the much improved:

    foo:    movq    %rdx, %rax
            movq    %rcx, %rdx
            addq    %rdi, %rax
            adcq    %rsi, %rdx
            ret

    2023-05-07  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            PR target/43644
            * lower-subreg.cc (resolve_simple_move): Don't emit a clobber
            immediately before moving a multi-word register by parts.

    gcc/testsuite/ChangeLog
            PR target/43644
            * gcc.target/i386/pr43644.c: New test case.

end of thread, other threads:[~2024-04-26 12:59 UTC | newest]

Thread overview: 7+ messages
2010-04-04 23:58 [Bug c/43644] New: __uint128_t missed optimizations svfuerst at gmail dot com
2010-04-05 10:03 ` [Bug target/43644] " rguenth at gcc dot gnu dot org
     [not found] <bug-43644-4@http.gcc.gnu.org/bugzilla/>
2023-05-07  6:57 ` cvs-commit at gcc dot gnu.org
2023-07-07 19:41 ` cvs-commit at gcc dot gnu.org
2023-08-01  8:21 ` jbeulich at suse dot com
2023-12-31 21:39 ` cvs-commit at gcc dot gnu.org
2024-04-26 12:59 ` roger at nextmovesoftware dot com
