From: "peter at cordes dot ca"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily
Date: Fri, 13 May 2022 16:43:03 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

            Bug ID: 105596
           Summary: Loop counter widened to 128-bit unnecessarily
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by
widening i and doing a full 128x128 => 128-bit multiply, and having to do a
128-bit increment and compare.
Using uint64_t i to make it a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i = 2 ; i < n ; i++)
        total *= i;
    return total;   // 0! = 0 isn't mathematically correct, but that's not the point
}

https://godbolt.org/z/W4MW9b6T3 (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10          # total = n zext into R11:R10
        cmpl    $2, %edi
        jbe     .L7                # if n<=2 return r11:r10
        movl    $2, %esi           # i = 2 in RDI:RSI
        xorl    %edi, %edi
.L9:                               # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8           # copy original n to destroy later
        imulq   %r10, %rdx         # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11         # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11         # last partial product
        addq    $1, %rsi           # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8          # r8 = n^i
        movq    %rdi, %rcx         # useless copy, we're already destroying r8
        orq     %r8, %rcx          # hi(i^n) | lo(i^n)
        jne     .L9                # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, it's not even doing it very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific. It also compiles similarly for
AArch64 and MIPS64. For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read. RV64 has three
multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.

This anti-optimization goes back to GCC4.6.
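For reference, a sketch of what that untested 32-bit analogue could look like (the function name fact32 is mine, not from the report; whether GCC widens i here the same way is exactly what's untested):

```c
#include <stdint.h>

/* Hypothetical 32-bit analogue of fact(): uint64_t total with a
 * uint32_t loop counter.  If the same pessimization applied, the
 * counter would get widened to 64-bit for a full 64x64 multiply. */
uint64_t fact32(uint32_t n){
    uint64_t total = n;
    for (uint32_t i = 2; i < n; i++)
        total *= i;
    return total;   // same caveat: 0! = 0 isn't mathematically correct
}
```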
With GCC4.5 and earlier, the above C compiles to a tight loop with the
expected mul reg + imul reg,reg and a 1-register loop counter:
https://godbolt.org/z/6KheaqTx4 (using __uint128_t, since unsigned __int128
wasn't supported on GCC4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register. It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe. GCC4.4 optimizes
away the parts that are useless given that the high 64 bits of (u128)i are zero.

-----

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i = n-1 ; i > 1 ; i--)
        total *= i;
    return total;   // 0! = 0 isn't mathematically correct
}

-----

When the loop condition is possibly always-true, GCC can't prove the loop is
finite, and as usual can't widen the loop counter. In this case, that's a
good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;          // loop does do final n
    for (unsigned i = 2 ; i <= n ; i++)   // potentially infinite loop defeats this pessimization
        total *= i;
    return total;   // fun fact: 0! = 1 is mathematically correct
}

fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx           # i = 2 in ECX
        movl    $1, %eax           # total = 1 in RDX:RAX
        xorl    %edx, %edx
.L3:                               # do{
        movl    %ecx, %esi         # copy i instead of just incrementing it later :/
        movq    %rdx, %r8          # save high half of total
        addl    $1, %ecx           # i++
        imulq   %rsi, %r8          # lo x hi cross product
        mulq    %rsi               # lo x lo widening
        addq    %r8, %rdx          # 128x64-bit multiply
        cmpl    %ecx, %edi
        jnb     .L3                # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.
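The 128x64 partial-product recipe in that last asm (one widening mul for lo(total) x i, one imul for hi(total) x i added into the high half) can be sketched in C. The u128 struct and mul_128x64 name are illustrative, not anything GCC emits; the builtin __int128 multiply is only used here as a reference to check the two-partial-product decomposition against:

```c
#include <stdint.h>

/* Illustrative split of a 128-bit value into explicit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* 128x64 multiply as the asm does it:
 *   mulq  : widening lo x lo product (low 64 + carry-out into high half)
 *   imulq : hi x i cross product, added into the high half          */
static u128 mul_128x64(u128 total, uint64_t i){
    unsigned __int128 wide = (unsigned __int128)total.lo * i;  // mulq
    u128 r;
    r.lo = (uint64_t)wide;
    r.hi = (uint64_t)(wide >> 64) + total.hi * i;              // imulq + addq
    return r;
}
```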