public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/86352] setc/movzx introduced into loop to provide a constant 0 value for a later rep stos
From: pinskia at gcc dot gnu.org @ 2021-08-29  1:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2021-08-29
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that GCC now does too good a job here: it removes the zeroing of the
return value because it is not used; it actually removes the return value
entirely :).

Here is a new testcase which does not cause the removal of the zeroing:
    using u64 = unsigned long long;

    struct Bucket {
        u64 mLeaves[16] = {};
    };

    struct BucketMap {
        u64 acquire() noexcept {
            while (true) {
                u64 map = mData;

                u64 index = (map & 1) ? 1 : 0;
                auto mask = u64(1) << index;

                auto previous =
                    __atomic_fetch_or(&mData, mask, __ATOMIC_SEQ_CST);
                if ((previous & mask) == 0) {
                    return index;
                }
            }
        }

        __attribute__((noinline)) Bucket acquireBucket() noexcept {
            acquire();
            return Bucket();
        }

        volatile u64 mData = 1;
    };

    int main() {
        BucketMap map;
        Bucket t = map.acquireBucket();
        return t.mLeaves[0];
    }

With the trunk we get:
BucketMap::acquireBucket():
.LFB1:
        .cfi_startproc
        movq    %rdi, %r8
        movq    %rsi, %rcx
        .p2align 4,,10
        .p2align 3
.L2:
        movq    (%rsi), %rdx
        xorl    %eax, %eax
        andl    $1, %edx
        lock btsq       %rdx, (%rcx)
        setc    %al
        jc      .L2
        movq    %r8, %rdi
        movl    $16, %ecx
        rep stosq
        movq    %r8, %rax
        ret

So the setc is useless overall.
It is still there because it does not become useless until after combine,
whereas the RTL dce pass runs right before combine.

Trying 14, 17 -> 18:
   14: r93:DI=flags:CCC==0
      REG_DEAD flags:CCC
   17: flags:CCZ=cmp(r93:DI,0)
   18: pc={(flags:CCZ!=0)?L16:pc}
      REG_DEAD flags:CCZ
      REG_BR_PROB 955630228
Failed to match this instruction:
(parallel [
        (set (pc)
            (if_then_else (eq (reg:CCC 17 flags)
                    (const_int 0 [0]))
                (label_ref:DI 16)
                (pc)))
        (set (reg:DI 93)
            (eq:DI (reg:CCC 17 flags)
                (const_int 0 [0])))
    ])
Failed to match this instruction:
(parallel [
        (set (pc)
            (if_then_else (eq (reg:CCC 17 flags)
                    (const_int 0 [0]))
                (label_ref:DI 16)
                (pc)))
        (set (reg:DI 93)
            (eq:DI (reg:CCC 17 flags)
                (const_int 0 [0])))
    ])
Successfully matched this instruction:
(set (reg:DI 93)
    (eq:DI (reg:CCC 17 flags)
        (const_int 0 [0])))
Successfully matched this instruction:
(set (pc)
    (if_then_else (eq (reg:CCC 17 flags)
            (const_int 0 [0]))
        (label_ref:DI 16)
        (pc)))
allowing combination of insns 14, 17 and 18
original costs 4 + 4 + 12 = 20
replacement costs 4 + 12 = 16
deferring deletion of insn with uid = 14.
modifying insn i2    17: r93:DI=flags:CCC==0
deferring rescan insn with uid = 17.
modifying insn i3    18: pc={(flags:CCC==0)?L16:pc}
      REG_BR_PROB 955630228
      REG_DEAD flags:CCZ
deferring rescan insn with uid = 18.

Reg 93 was not REG_DEAD after the if statement because cse and/or forwprop
(and maybe even gcse) came along and decided that r93 should be reused as
the 0 outside of the loop.
Maybe if forwprop could do better with set/cmp/if in the first place, this
might not have happened ...
This is just some good optimizations leading to bad code, with too many
interactions to count here.
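
To make the reuse described above a bit more concrete, here is a rough
source-level analogue of what the RTL ends up expressing. It is only a sketch:
`acquire_and_zero` and `carry` are hypothetical names that do not appear in
the testcase, and the real transformation happens on RTL registers, not on
C++ variables. The point is that the flag computed inside the loop is known to
be 0 on exit, so it gets reused as the fill value instead of a fresh constant 0.

```cpp
#include <cassert>
#include <cstring>

using u64 = unsigned long long;

struct Bucket {
    u64 mLeaves[16];
};

// Hypothetical analogue of the code after cse/forwprop: `carry` plays the
// role of r93 (the setc result) and is reused as the zero for the fill.
inline void acquire_and_zero(volatile u64 &data, Bucket &out) noexcept {
    u64 carry;
    do {
        u64 index = data & 1;
        u64 mask = u64(1) << index;
        // Same GCC/Clang builtin as in the testcase above.
        u64 previous = __atomic_fetch_or(&data, mask, __ATOMIC_SEQ_CST);
        carry = (previous & mask) != 0 ? 1 : 0;  // corresponds to setc
    } while (carry != 0);                        // corresponds to jc .L2

    // On loop exit the flag is necessarily 0, so it can stand in for the
    // constant that the zeroing needs. This keeps `carry` live in the loop.
    assert(carry == 0);
    std::memset(&out, static_cast<int>(carry), sizeof out);
}
```

Written this way it is easier to see why the flag computation cannot simply
be deleted from the loop: it has a use after the loop, even though that use
could equally well be a plain constant.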

Also, as I said, GCC now optimizes the original testcase better anyway
(better than LLVM, even).


* [Bug rtl-optimization/86352] setc/movzx introduced into loop to provide a constant 0 value for a later rep stos
From: pinskia at gcc dot gnu.org @ 2021-08-29  1:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
As for the memset issue with vectors, with -march=skylake on the trunk we get:

BucketMap::acquireBucket():
        movq    %rdi, %rax
        movq    %rsi, %rcx
.L2:
        movq    (%rsi), %rdx
        andl    $1, %edx
        lock btsq       %rdx, (%rcx)
        jc      .L2
        vpxor   %xmm15, %xmm15, %xmm15
        vmovdqu %ymm15, (%rax)
        vmovdqu %ymm15, 32(%rax)
        vmovdqu %ymm15, 64(%rax)
        vmovdqu %ymm15, 96(%rax)
        ret

I think this is close to the best code. There are two extra moves, which are
due to the way atomics are represented, but otherwise it is still decent code.

That is much better than what LLVM can do:
BucketMap::acquireBucket():         # @BucketMap::acquireBucket()
        movq    %rdi, %r8
        movl    $1, %ecx
.LBB1_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_2 Depth 2
        movq    (%rsi), %rax
        andb    $1, %al
        shlxq   %rax, %rcx, %rdx
        movq    (%rsi), %rax
        .p2align        4, 0x90
.LBB1_2:                                #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movq    %rax, %rdi
        orq     %rdx, %rdi
        lock            cmpxchgq        %rdi, (%rsi)
        jne     .LBB1_2
# %bb.3:                                #   in Loop: Header=BB1_1 Depth=1
        testl   %eax, %edx
        jne     .LBB1_1
# %bb.4:
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 96(%r8)
        vmovups %ymm0, 64(%r8)
        vmovups %ymm0, 32(%r8)
        vmovups %ymm0, (%r8)
        movq    %r8, %rax
        vzeroupper
        ret

There are many things badly wrong with the above LLVM code: a compare-exchange
loop instead of btsq, and an extra vzeroupper, which is not needed as the upper
parts of the ymm registers are already zero.
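
For reference, the two shapes being compared can be written out at the source
level roughly as follows. This is a sketch with hypothetical function names
(`test_and_set_bit`, `test_and_set_bit_cas`), using std::atomic rather than
the builtin from the testcase. The first form only inspects the bit that was
set, which is the pattern a compiler can lower to a single lock bts; the second
spells out the compare-exchange loop. Which instructions a given compiler
version actually emits for either form is not guaranteed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sets bit `bit` and reports whether it was already set. Only the affected
// bit of the old value is used, so no full old value has to be produced.
inline bool test_and_set_bit(std::atomic<std::uint64_t> &word, unsigned bit) {
    assert(bit < 64);
    const std::uint64_t mask = std::uint64_t{1} << bit;
    return (word.fetch_or(mask, std::memory_order_seq_cst) & mask) != 0;
}

// Same result, written as an explicit compare-exchange retry loop.
inline bool test_and_set_bit_cas(std::atomic<std::uint64_t> &word,
                                 unsigned bit) {
    assert(bit < 64);
    const std::uint64_t mask = std::uint64_t{1} << bit;
    std::uint64_t old = word.load(std::memory_order_relaxed);
    while (!word.compare_exchange_weak(old, old | mask,
                                       std::memory_order_seq_cst)) {
        // On failure `old` is refreshed with the current value; retry.
    }
    return (old & mask) != 0;
}
```

Both functions return the same answer for the same starting state; the
difference is only in how much work the loop form does under contention.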
