From: "peter at cordes dot ca"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily
Date: Fri, 13 May 2022 16:43:03 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

            Bug ID: 105596
           Summary: Loop counter widened to 128-bit unnecessarily
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by
widening i and doing a full 128x128 => 128-bit multiply, and having to do a
128-bit increment and compare.
Using uint64_t i to make it a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i = 2 ; i < n ; i++)
        total *= i;
    return total;   // 0! = 0 isn't mathematically correct, but that's not the point
}

https://godbolt.org/z/W4MW9b6T3 (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10          # total = n zext into R11:R10
        cmpl    $2, %edi
        jbe     .L7                # if n<=2 return r11:r10
        movl    $2, %esi           # i = 2 in RDI:RSI
        xorl    %edi, %edi
.L9:                               # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8           # copy original n to destroy later
        imulq   %r10, %rdx         # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11         # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11         # last partial product
        addq    $1, %rsi           # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8          # r8 = n^i
        movq    %rdi, %rcx         # useless copy, we're already destroying r8
        orq     %r8, %rcx          # hi(i^n) | lo(i^n)
        jne     .L9                # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, it's not even doing it very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific. It also compiles similarly for
AArch64 and MIPS64. For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read. RV64 has three
multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.

This anti-optimization goes back to GCC4.6.
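For reference, a sketch of what that untested 32-bit analogue could look like (the function name fact32 is mine, not from the report; whether GCC widens i here the same way is exactly what's untested):

```c
#include <stdint.h>

/* Hypothetical 32-bit analogue of fact(): uint64_t total with a
 * uint32_t loop counter.  If the same pessimization applied, the
 * counter would get widened to 64-bit for a full 64x64 multiply. */
uint64_t fact32(uint32_t n){
    uint64_t total = n;
    for (uint32_t i = 2; i < n; i++)
        total *= i;
    return total;   // same caveat: 0! = 0 isn't mathematically correct
}
```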
With GCC4.5 and earlier, the above C compiles to a tight loop with the
expected mul reg + imul reg,reg and a 1-register loop counter:
https://godbolt.org/z/6KheaqTx4 (using __uint128_t, since unsigned __int128
wasn't supported on GCC4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register. It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe. GCC4.4 optimizes
away the parts that are useless given that the high 64 bits of (u128)i are zero.

-----

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i = n-1 ; i > 1 ; i--)
        total *= i;
    return total;   // 0! = 0 isn't mathematically correct
}

-----

When the loop condition is possibly always-true, GCC can't prove the loop is
finite, and as usual can't widen the loop counter. In this case, that's a
good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;          // loop does do final n
    for (unsigned i = 2 ; i <= n ; i++)   // potentially infinite loop defeats this pessimization
        total *= i;
    return total;   // fun fact: 0! = 1 is mathematically correct
}

fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx           # i = 2 in ECX
        movl    $1, %eax           # total = 1 in RDX:RAX
        xorl    %edx, %edx
.L3:                               # do{
        movl    %ecx, %esi         # copy i instead of just incrementing it later :/
        movq    %rdx, %r8          # save high half of total
        addl    $1, %ecx           # i++
        imulq   %rsi, %r8          # lo x hi cross product
        mulq    %rsi               # lo x lo widening
        addq    %r8, %rdx          # 128x64-bit multiply
        cmpl    %ecx, %edi
        jnb     .L3                # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.
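The 128x64 partial-product recipe in that last asm (one widening mul for lo(total) x i, one imul for hi(total) x i added into the high half) can be sketched in C. The u128 struct and mul_128x64 name are illustrative, not anything GCC emits; the builtin __int128 multiply is only used here as a reference to check the two-partial-product decomposition against:

```c
#include <stdint.h>

/* Illustrative split of a 128-bit value into explicit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* 128x64 multiply as the asm does it:
 *   mulq  : widening lo x lo product (low 64 + carry-out into high half)
 *   imulq : hi x i cross product, added into the high half          */
static u128 mul_128x64(u128 total, uint64_t i){
    unsigned __int128 wide = (unsigned __int128)total.lo * i;  // mulq
    u128 r;
    r.lo = (uint64_t)wide;
    r.hi = (uint64_t)(wide >> 64) + total.hi * i;              // imulq + addq
    return r;
}
```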