public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily
@ 2022-05-13 16:43 peter at cordes dot ca
  2022-05-13 18:01 ` [Bug tree-optimization/105596] " peter at cordes dot ca
  2022-05-16  7:24 ` rguenth at gcc dot gnu.org
  0 siblings, 2 replies; 3+ messages in thread
From: peter at cordes dot ca @ 2022-05-13 16:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

            Bug ID: 105596
           Summary: Loop counter widened to 128-bit unnecessarily
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

For  total *= i  with a u128 total and a u32 loop counter, GCC pessimizes the
loop by widening i, doing a full 128x128 => 128-bit multiply, and doing a
128-bit increment and compare.

Using  uint64_t i  to make the counter a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}
// 0! = 0  isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3  (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10           # total = n  zext into  R11:R10
        cmpl    $2, %edi
        jbe     .L7                 # if n<=2 return r11:r10
        movl    $2, %esi            # i = 2  in  RDI:RSI
        xorl    %edi, %edi
.L9:                              # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8              # copy original n to destroy later
        imulq   %r10, %rdx          # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11          # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11          # last partial product

        addq    $1, %rsi            # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8           #  r8 = n^i
        movq    %rdi, %rcx          # useless copy, we're already destroying r8
        orq     %r8, %rcx           # hi(i^n) | lo(i^n)
        jne     .L9               # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work, GCC isn't even doing that work very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific.  It also compiles similarly for
AArch64 and MIPS64.  For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read.  RV64 has three
multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.


This anti-optimization goes back to GCC4.6.  With GCC4.5 and earlier, the above
C compiles to a tight loop with the expected mul reg + imul reg,reg and 1
register loop counter: https://godbolt.org/z/6KheaqTx4  (using __uint128_t,
since unsigned __int128 wasn't supported on GCC4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register.  It's still just incrementing and comparing a 32-bit loop
counter, merely widening it for the 128x128-bit multiply recipe.  GCC 4.4
optimizes away the parts that are useless because the high 64 bits of (u128)i
are zero.


-----

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;  // 0! = 0 isn't mathematically correct
}


-----

When the loop condition is possibly always-true, GCC can't prove the loop is
finite, and as usual it can't widen the loop counter.  In this case, that's a
good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;   // loop now includes the final i == n
    for (unsigned i=2 ; i <= n ; i++)  // potentially infinite loop defeats this pessimization
        total *= i;
    return total;  // fun fact:  0! = 1  is mathematically correct
}


fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx           # i = 2   in    ECX
        movl    $1, %eax           # total = 1  in RDX:RAX
        xorl    %edx, %edx
.L3:                             #do{
        movl    %ecx, %esi            # copy i instead of just incrementing it later :/

        movq    %rdx, %r8           # save high half of total
        addl    $1, %ecx              # i++
        imulq   %rsi, %r8           # lo x hi cross product
        mulq    %rsi                # lo x lo widening
        addq    %r8, %rdx           # 128x64-bit multiply

        cmpl    %ecx, %edi
        jnb     .L3               # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily
  2022-05-13 16:43 [Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily peter at cordes dot ca
@ 2022-05-13 18:01 ` peter at cordes dot ca
  2022-05-16  7:24 ` rguenth at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: peter at cordes dot ca @ 2022-05-13 18:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
https://godbolt.org/z/aoG55T5Yq
gcc -O3 -m32 has the same problem with  unsigned long long total  and unsigned i.

Pretty much identical instruction sequences in the loop for all 3 versions,
e.g. doing add/adc to increment i (plus a bit of spilling).
fact_gcc_handhold still compiles without the unnecessary widening.

Perhaps this should be retitled to say the counter is widened to a
"2-register type".

I don't know how easily this occurs in real-world loops with 64-bit and 32-bit
integers on 32-bit machines, but that case is probably the bigger concern in
terms of clock cycles wasted worldwide.


* [Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily
  2022-05-13 16:43 [Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily peter at cordes dot ca
  2022-05-13 18:01 ` [Bug tree-optimization/105596] " peter at cordes dot ca
@ 2022-05-16  7:24 ` rguenth at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-16  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-05-16
      Known to fail|                            |12.1.0, 4.8.5, 7.5.0
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Target|                            |x86_64-*-*

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed - it's IVOPTs deciding the promotion is beneficial.  It costs both
the original and the final IV set the same and then (likely) breaks the tie on
fewer IVs being used.

One would need to carefully look at where the cost estimates are off.

With -fno-ivopts we get

.L3:
        movl    %ecx, %esi
        movq    %rdx, %r8
        addl    $1, %ecx
        imulq   %rsi, %r8
        mulq    %rsi
        addq    %r8, %rdx
        cmpl    %edi, %ecx
        jb      .L3

