[Bug rtl-optimization/94945] New: Missed optimization: Carry chain not recognized in manually unrolled loop

public inbox for gcc-bugs@sourceware.org

* [Bug rtl-optimization/94945] New: Missed optimization: Carry chain not recognized in manually unrolled loop
@ 2020-05-04 16:44 madhur4127 at gmail dot com
  2020-05-22 11:27 ` [Bug tree-optimization/94945] " madhur4127 at gmail dot com
  2021-08-20  1:22 ` pinskia at gcc dot gnu.org
  0 siblings, 2 replies; 3+ messages in thread
From: madhur4127 at gmail dot com @ 2020-05-04 16:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94945

            Bug ID: 94945
           Summary: Missed optimization: Carry chain not recognized in
                    manually unrolled loop
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: madhur4127 at gmail dot com
  Target Milestone: ---

Context: Big integer addition using ADC (_addcarry_u64).

See Godbolt link: https://godbolt.org/z/rMxe6W

Example:
Suppose the case of big integer addition:

#include <stdint.h>
#include <immintrin.h>  // _addcarry_u64

// pa, pb: pointers to big integers A and B
// n: number of 64-bit limbs in A and B (the loop assumes a multiple of 4)
// pr: pointer to the result

void add(const uint64_t * __restrict__ pa, const uint64_t * __restrict__ pb,
         uint64_t * __restrict__ pr, unsigned n) {
    unsigned char carry = 0;
    unsigned i;
    for (i = 0; i < n; i += 4) {
        carry = _addcarry_u64(carry, pa[i+0], pb[i+0], &pr[i+0]);
        carry = _addcarry_u64(carry, pa[i+1], pb[i+1], &pr[i+1]);
        carry = _addcarry_u64(carry, pa[i+2], pb[i+2], &pr[i+2]);
        carry = _addcarry_u64(carry, pa[i+3], pb[i+3], &pr[i+3]);
    }
}


Without loop unrolling, GCC saves the carry flag at the end of each iteration
and restores it at the start of the next one. Even when the loop is unrolled
by hand, GCC does not recognize that the carry flag propagates from one
_addcarry_u64 call to the next (Clang does). Instead it saves the carry with
setc and regenerates it with add after every step, as shown here (two of the
four unrolled steps shown):

        mov     ecx, eax  # i, i
        add     r9b, -1   # carry,
        mov     rdx, QWORD PTR [rdi+rcx*8]        # tmp132, *_6
        adc     rdx, QWORD PTR [rsi+rcx*8]        # tmp132, *_4
        mov     QWORD PTR [r8+rcx*8], rdx #* pr, tmp132
        setc    r9b     #, _48   <--------SAVING CARRY

        lea     ecx, [rax+1]      # tmp134,
        mov     rdx, QWORD PTR [rdi+rcx*8]        # tmp140, *_15
        add     r9b, -1   # _48, <--------SETTING CARRY
        adc     rdx, QWORD PTR [rsi+rcx*8]        # tmp140, *_13
        mov     QWORD PTR [r8+rcx*8], rdx #* pr, tmp140
        setc    r9b     #, _47
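
For readers unfamiliar with the idiom marked "SETTING CARRY" above: the saved
carry byte is 0 or 1, and an 8-bit add of -1 (0xFF) to it carries out of bit 7
exactly when the byte is 1, which puts the saved value back into CF. A minimal
C model of that step follows; the function name cf_after_add_minus1 is
hypothetical and only illustrates the arithmetic, it is not code GCC emits.

```c
/* Model of the carry flag left by `add r9b, -1`, i.e. an 8-bit add of 0xFF.
 * `saved` stands for the byte written earlier by `setc` (0 or 1).
 * The name is made up for this illustration. */
unsigned cf_after_add_minus1(unsigned char saved)
{
    /* Do the add in a wider type; bit 8 is the carry out of the 8-bit add. */
    return ((unsigned)saved + 0xFFu) >> 8;
}
```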


Another optimization:
Unroll the loop automatically (without requiring manual unrolling) and
propagate the carry through the unrolled body without saving and restoring it
between steps.
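
As a sketch of the input this second request has in mind, here is the same
addition written as a plain rolled loop with one _addcarry_u64 call per
iteration. The name add_rolled, the returned final carry, and the use of
unsigned long long (the pointer type the intrinsic is declared with in GCC's
headers) are choices made for this illustration, not part of the original
test case.

```c
#include <immintrin.h>  /* _addcarry_u64 (x86-64 only) */
#include <stddef.h>

/* Rolled form: one add-with-carry per iteration, no manual unrolling.
 * Returns the carry out of the most significant limb. */
unsigned char add_rolled(const unsigned long long *pa,
                         const unsigned long long *pb,
                         unsigned long long *pr, size_t n)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++)
        carry = _addcarry_u64(carry, pa[i], pb[i], &pr[i]);
    return carry;
}
```

Ideally the compiler would unroll this loop itself and keep the carry in CF
across the unrolled steps.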

Side note:
Is this the fastest way to add two big integers, treating hand-written
assembly as a last resort?
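
For comparison only, one portable formulation avoids the x86 intrinsic and
uses the GCC/Clang builtin __builtin_add_overflow instead. This is a sketch
under assumptions: the name add_portable is hypothetical, and nothing in this
report establishes whether any compiler turns it into an adc chain.

```c
#include <stddef.h>

/* Portable carry propagation using __builtin_add_overflow (GCC/Clang).
 * Returns the carry out of the most significant limb. */
unsigned char add_portable(const unsigned long long *pa,
                           const unsigned long long *pb,
                           unsigned long long *pr, size_t n)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long long sum;
        /* c1: carry out of a + b; c2: carry out of adding the incoming carry.
         * At most one of the two can be set for a given limb. */
        unsigned char c1 = __builtin_add_overflow(pa[i], pb[i], &sum);
        unsigned char c2 = __builtin_add_overflow(sum, (unsigned long long)carry, &pr[i]);
        carry = c1 | c2;
    }
    return carry;
}
```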

* [Bug tree-optimization/94945] Missed optimization: Carry chain not recognized in manually unrolled loop
From: madhur4127 at gmail dot com @ 2020-05-22 11:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94945

Madhur Chauhan <madhur4127 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|10.0                        |10.1.1
          Component|rtl-optimization            |tree-optimization

--- Comment #1 from Madhur Chauhan <madhur4127 at gmail dot com> ---
Is the scope of this optimization too narrow?

* [Bug tree-optimization/94945] Missed optimization: Carry chain not recognized in manually unrolled loop
From: pinskia at gcc dot gnu.org @ 2021-08-20  1:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94945

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=94913,
                   |                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=97387
   Target Milestone|---                         |11.0

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC 11+ can produce:
.L3:
        movl    %eax, %r10d
        addb    $-1, %r9b
        leal    1(%rax), %r9d
        movq    (%rdi,%r10,8), %rdx
        adcq    (%rsi,%r10,8), %rdx
        movq    %rdx, (%r8,%r10,8)
        movq    (%rdi,%r9,8), %rdx
        leal    3(%rax), %r10d
        adcq    (%rsi,%r9,8), %rdx
        movq    %rdx, (%r8,%r9,8)
        leal    2(%rax), %r9d
        movq    (%rdi,%r9,8), %rdx
        adcq    (%rsi,%r9,8), %rdx
        movq    %rdx, (%r8,%r9,8)
        movq    (%rdi,%r10,8), %rdx
        adcq    (%rsi,%r10,8), %rdx
        setc    %r9b
        addl    $4, %eax
        movq    %rdx, (%r8,%r10,8)
        cmpl    %eax, %ecx
        ja      .L3


With a single _addcarry_u64 call per iteration (no manual unrolling) and
-funroll-loops, GCC 11 gives this for the inner loop:

.L12:
        movq    (%r8,%r10), %r11
        addb    $-1, %dl
        adcq    (%rdi,%r10), %r11
        movq    %r11, (%rsi,%r10)
        movq    8(%r8,%r10), %r9
        adcq    8(%rdi,%r10), %r9
        movq    %r9, 8(%rsi,%r10)
        movq    16(%r8,%r10), %rax
        adcq    16(%rdi,%r10), %rax
        movq    %rax, 16(%rsi,%r10)
        movq    24(%r8,%r10), %rdx
        adcq    24(%rdi,%r10), %rdx
        movq    %rdx, 24(%rsi,%r10)
        movq    32(%r8,%r10), %r11
        adcq    32(%rdi,%r10), %r11
        movq    %r11, 32(%rsi,%r10)
        movq    40(%r8,%r10), %r9
        adcq    40(%rdi,%r10), %r9
        movq    %r9, 40(%rsi,%r10)
        movq    48(%r8,%r10), %rax
        adcq    48(%rdi,%r10), %rax
        movq    %rax, 48(%rsi,%r10)
        movq    56(%r8,%r10), %r11
        adcq    56(%rdi,%r10), %r11
        movq    %r11, 56(%rsi,%r10)
        setc    %dl
        addq    $64, %r10
        cmpq    %rcx, %r10
        jne     .L12

So this is fixed in GCC 11.

Note: it was most likely fixed by r11-145 (which was pushed around the time you
filed the bug) and r11-3882 (later last year).
