public inbox for gcc-bugs@sourceware.org
* [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop
@ 2021-12-16 16:25 thiago at kde dot org
  2021-12-17  3:26 ` [Bug target/103750] " crazylht at gmail dot com
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: thiago at kde dot org @ 2021-12-16 16:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

            Bug ID: 103750
           Summary: [i386] GCC schedules KMOV instructions that destroys
                    performance in loop
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org
  Target Milestone: ---

Testcase:

#include <immintrin.h>

const char16_t *qustrchr(char16_t *n, char16_t *e, char16_t c) noexcept
{
    __m256i mch256 = _mm256_set1_epi16(c);
    for ( ; n < e; n += 32) {
        __m256i data1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n));
        __m256i data2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n) + 1);
        __mmask16 mask1 = _mm256_cmpeq_epu16_mask(data1, mch256);
        __mmask16 mask2 = _mm256_cmpeq_epu16_mask(data2, mch256);
        if (_kortestz_mask16_u8(mask1, mask2))
            continue;

        unsigned idx = _tzcnt_u32(mask1);
        if (mask1 == 0) {
            idx = __tzcnt_u16(mask2);
            n += 16;
        }
        return n + idx;
    }
    return e;
}

This produces the following assembly:

        vmovdqu16       (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k0
        vpcmpuw $0, %ymm0, %ymm2, %k1
        kmovw   %k0, %edx
        kmovw   %k1, %eax
        kortestw        %k1, %k0
        je      .L10

Those two KMOVW instructions aren't required for the check that follows.
They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
can't be dispatched until those two have executed, thus introducing a 2-cycle
delay in this loop.

Clang generates:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpcmpeqw        (%rdi), %ymm0, %k0
        vpcmpeqw        32(%rdi), %ymm0, %k1
        kortestw        %k0, %k1
        jne     .LBB0_3

ICC inserts one KMOVW, but not the other.

Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M

LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
code is 3 cycles/loop.

LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles per
loop, because it can run in parallel with the second load, given that the loads
are ports 2 and 3.
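
For what it's worth, restructuring the source so the masks are consumed only
on the loop-exit path is a possible workaround (untested sketch; whether GCC
actually sinks the mask-to-integer conversions out of the loop may vary by
version):

// same #include <immintrin.h> as the testcase above
const char16_t *qustrchr2(char16_t *n, char16_t *e, char16_t c) noexcept
{
    __m256i mch256 = _mm256_set1_epi16(c);
    __mmask16 mask1 = 0, mask2 = 0;
    bool found = false;
    for ( ; n < e; n += 32) {
        __m256i data1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n));
        __m256i data2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n) + 1);
        mask1 = _mm256_cmpeq_epu16_mask(data1, mch256);
        mask2 = _mm256_cmpeq_epu16_mask(data2, mch256);
        if (!_kortestz_mask16_u8(mask1, mask2)) {   // some lane matched
            found = true;
            break;
        }
    }
    if (!found)
        return e;
    unsigned idx = _tzcnt_u32(mask1);
    if (mask1 == 0) {       // the match was in the second vector
        idx = _tzcnt_u32(mask2);
        n += 16;
    }
    return n + idx;
}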

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
@ 2021-12-17  3:26 ` crazylht at gmail dot com
  2021-12-17  5:05 ` crazylht at gmail dot com
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-17  3:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
The kmovw here is a zero_extend, and at the GIMPLE level it's not redundant in
the loop.

  _31 = MEM[(const __m256i_u * {ref-all})n_5];
  _30 = MEM[(const __m256i_u * {ref-all})n_5 + 32B];
  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);
  _25 = (unsigned char) _15;
  if (_25 != 0)

but at the RTL level _28/_29 are propagated into the kortest and become
partially redundant, yet they fail to be sunk or eliminated.

(insn 29 27 30 3 (set (reg:SI 83 [ _2 ])
        (zero_extend:SI (reg:HI 111))) "test.c":24:32 147 {*zero_extendhisi2}
     (nil))
(insn 30 29 31 3 (set (reg:SI 116 [ _29 ])
        (zero_extend:SI (reg:HI 106))) "test.c":24:32 147 {*zero_extendhisi2}
     (nil))

(insn 58 56 60 8 (parallel [
            (set (reg:HI 120)
                (unspec:HI [
                        (subreg:HI (reg:SI 83 [ _2 ]) 0)
                    ] UNSPEC_TZCNT))
            (clobber (reg:CC 17 flags))

(insn 52 51 55 7 (parallel [
            (set (reg/v:SI 88 [ idx ])
                (ctz:SI (reg:SI 116 [ _29 ])))
            (clobber (reg:CC 17 flags))

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
  2021-12-17  3:26 ` [Bug target/103750] " crazylht at gmail dot com
@ 2021-12-17  5:05 ` crazylht at gmail dot com
  2021-12-17  6:14 ` crazylht at gmail dot com
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-17  5:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
Failed here

  /* Allow propagations into a loop only for reg-to-reg copies, since
     replacing one register by another shouldn't increase the cost.  */
  struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
  struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
  if ((reg_prop_only || def_loop != use_loop)
      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
    return false;

But def_loop is an inner loop of use_loop; it should be OK to propagate
from the inner loop to the outer loop.
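
Relaxing the check to also accept a def in a loop nested inside the use's
loop would look roughly like this (a sketch only, using the existing
flow_loop_nested_p helper; the fix actually committed is referenced in
comment #17):

  /* Sketch: allow propagation when def_loop is use_loop or is nested
     inside it; flow_loop_nested_p (outer, inner) tests the nesting, and
     a NULL loop_father acts as the outermost "loop".  */
  struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
  struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
  if ((reg_prop_only
       || (def_loop != use_loop
           && !flow_loop_nested_p (use_loop, def_loop)))
      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
    return false;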

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
  2021-12-17  3:26 ` [Bug target/103750] " crazylht at gmail dot com
  2021-12-17  5:05 ` crazylht at gmail dot com
@ 2021-12-17  6:14 ` crazylht at gmail dot com
  2021-12-17  6:33 ` crazylht at gmail dot com
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-17  6:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #2)
> Failed here
> 
>   /* Allow propagations into a loop only for reg-to-reg copies, since
>      replacing one register by another shouldn't increase the cost.  */
>   struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
>   struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
>   if ((reg_prop_only || def_loop != use_loop)
>       && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
>     return false;
> 
> But def_loop is an inner loop of use_loop; it should be OK to propagate
> from the inner loop to the outer loop.

I guess def_loop != use_loop is used here on the assumption that those
"invariants" should have been sunk into the same loop.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (2 preceding siblings ...)
  2021-12-17  6:14 ` crazylht at gmail dot com
@ 2021-12-17  6:33 ` crazylht at gmail dot com
  2021-12-17 12:52 ` thiago at kde dot org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-17  6:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > Failed here
> > 
> >   /* Allow propagations into a loop only for reg-to-reg copies, since
> >      replacing one register by another shouldn't increase the cost.  */
> >   struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
> >   struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
> >   if ((reg_prop_only || def_loop != use_loop)
> >       && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
> >     return false;
> > 
> > But def_loop is an inner loop of use_loop; it should be OK to propagate
> > from the inner loop to the outer loop.
> 
> I guess def_loop != use_loop is used here on the assumption that those
> "invariants" should have been sunk into the same loop.

There seems to be no pass at the RTL level that does this kind of sinking.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (3 preceding siblings ...)
  2021-12-17  6:33 ` crazylht at gmail dot com
@ 2021-12-17 12:52 ` thiago at kde dot org
  2021-12-17 13:48 ` thiago at kde dot org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: thiago at kde dot org @ 2021-12-17 12:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #5 from Thiago Macieira <thiago at kde dot org> ---
Maybe this is running afoul of GCC's assumption that a simple
register-register move is free? I've seen it save a constant in an opmask
register, but kmov{d,q} is not free the way mov{l,q} is.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (4 preceding siblings ...)
  2021-12-17 12:52 ` thiago at kde dot org
@ 2021-12-17 13:48 ` thiago at kde dot org
  2021-12-17 13:51 ` thiago at kde dot org
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: thiago at kde dot org @ 2021-12-17 13:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #6 from Thiago Macieira <thiago at kde dot org> ---
It got worse. Now I'm seeing:

.L807:
        vmovdqu16       (%rsi), %ymm2
        vmovdqu16       32(%rsi), %ymm3
        vpcmpuw $6, %ymm0, %ymm2, %k2
        vpcmpuw $6, %ymm0, %ymm3, %k3
        kmovw   %k2, %eax
        kmovw   %k3, %edx
        kmovd   %eax, %k4
        kmovd   %edx, %k5
        kortestd        %k5, %k4
        je      .L814

Code snippet:

        auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U) {
            if constexpr (sizeof(Char) == 1) {
                __m256i mval = _mm256_set1_epi8(maxval);
                __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
                return _mm256_cmpgt_epu8_mask(data, mval);
            } else if constexpr (sizeof(Char) == 2) {
                __m256i mval = _mm256_set1_epi16(maxval);
                __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
                return _mm256_cmpgt_epu16_mask(data, mval);
            } else if constexpr (sizeof(Char) == 4) {
                __m256i mval = _mm256_set1_epi32(maxval);
                __m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
                return _mm256_cmpgt_epu32_mask(data, mval);
            }
        };
/*...*/
            auto mask1 = loadAndCompare(n1);
            auto mask2 = loadAndCompare(n2);

I can make a compilable version if you need me to.
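
Roughly, a self-contained version would look like this (untested sketch; the
wrapper name hasAboveMax and its signature are placeholders, compiled with
something like -O2 -march=icelake-client):

#include <immintrin.h>

template <typename Char>
bool hasAboveMax(const Char *n1, const Char *n2, Char maxval)
{
    auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U) {
        if constexpr (sizeof(Char) == 1) {
            __m256i mval = _mm256_set1_epi8(maxval);
            __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
            return _mm256_cmpgt_epu8_mask(data, mval);
        } else if constexpr (sizeof(Char) == 2) {
            __m256i mval = _mm256_set1_epi16(maxval);
            __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
            return _mm256_cmpgt_epu16_mask(data, mval);
        } else {
            __m256i mval = _mm256_set1_epi32(maxval);
            __m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
            return _mm256_cmpgt_epu32_mask(data, mval);
        }
    };
    auto mask1 = loadAndCompare(n1);
    auto mask2 = loadAndCompare(n2);
    // kortest width matched to the mask width (see comment #8 below)
    if constexpr (sizeof(Char) == 1)
        return !_kortestz_mask32_u8(mask1, mask2);
    else if constexpr (sizeof(Char) == 2)
        return !_kortestz_mask16_u8(mask1, mask2);
    else
        return !_kortestz_mask8_u8(mask1, mask2);
}

template bool hasAboveMax(const char16_t *, const char16_t *, char16_t);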

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (5 preceding siblings ...)
  2021-12-17 13:48 ` thiago at kde dot org
@ 2021-12-17 13:51 ` thiago at kde dot org
  2021-12-17 14:05 ` thiago at kde dot org
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: thiago at kde dot org @ 2021-12-17 13:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #7 from Thiago Macieira <thiago at kde dot org> ---
I should add that the same is not happening for Char == char, where the
returned type is an __mmask32 (unsigned):

        vmovdqu8        (%rsi), %ymm2
        vmovdqu8        32(%rsi), %ymm3
        vpcmpub $6, %ymm0, %ymm2, %k0
        vpcmpub $6, %ymm0, %ymm3, %k1
        kortestd        %k1, %k0
        je      .L792

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (6 preceding siblings ...)
  2021-12-17 13:51 ` thiago at kde dot org
@ 2021-12-17 14:05 ` thiago at kde dot org
  2021-12-17 15:34 ` ubizjak at gmail dot com
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: thiago at kde dot org @ 2021-12-17 14:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #8 from Thiago Macieira <thiago at kde dot org> ---
Update again: it looks like the issue was the next line, which I didn't
paste; it was performing _kortestz_mask32_u8 on an __mmask16. The type
mismatch was causing this problem.

If I use the correct _kortestz_maskXX_u8, I'm getting:

        vmovdqu8        (%rsi), %ymm2
        vmovdqu8        32(%rsi), %ymm3
        vpcmpub $6, %ymm0, %ymm2, %k0
        vpcmpub $6, %ymm0, %ymm3, %k1
        kortestd        %k1, %k0
        je      .L794


        vmovdqu16       (%rsi), %ymm2
        vmovdqu16       32(%rsi), %ymm3
        vpcmpuw $6, %ymm0, %ymm2, %k0
        vpcmpuw $6, %ymm0, %ymm3, %k1
        kortestw        %k1, %k0
        je      .L807

So it looks like GCC is not completely wrong, but it could be more lenient
(Clang is). You can lower the severity of this issue.
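
For reference, the width-matched pairings that avoid the extra conversions
(a small sketch; the function names are made up):

#include <immintrin.h>

// Match the kortest width to the mask width produced by the compare:
// __mmask32 from epi8 compares, __mmask16 from epi16, __mmask8 from epi32.
bool noneSet8(__mmask32 a, __mmask32 b)  { return _kortestz_mask32_u8(a, b); }
bool noneSet16(__mmask16 a, __mmask16 b) { return _kortestz_mask16_u8(a, b); }
bool noneSet32(__mmask8 a, __mmask8 b)   { return _kortestz_mask8_u8(a, b); }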

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (7 preceding siblings ...)
  2021-12-17 14:05 ` thiago at kde dot org
@ 2021-12-17 15:34 ` ubizjak at gmail dot com
  2021-12-20  2:28 ` crazylht at gmail dot com
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-17 15:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #9 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> This produces the following assembly:
> 
>         vmovdqu16       (%rdi), %ymm1
>         vmovdqu16       32(%rdi), %ymm2
>         vpcmpuw $0, %ymm0, %ymm1, %k0
>         vpcmpuw $0, %ymm0, %ymm2, %k1
>         kmovw   %k0, %edx
>         kmovw   %k1, %eax
>         kortestw        %k1, %k0
>         je      .L10
> 
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> can't be dispatched until those two have executed, thus introducing a
> 2-cycle delay in this loop.

These are not NOP moves but zero-extensions.

        vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
        vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
        vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
        vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
        kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
        kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
        kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi

since for some reason tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);


> Clang generates:
> 
> .LBB0_2:                                # =>This Inner Loop Header: Depth=1
>         vpcmpeqw        (%rdi), %ymm0, %k0
>         vpcmpeqw        32(%rdi), %ymm0, %k1
>         kortestw        %k0, %k1
>         jne     .LBB0_3
> 
> ICC inserts one KMOVW, but not the other.
> 
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
> 
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
> code is 3 cycles/loop.
> 
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles
> per loop, because it can run in parallel with the second load, given that
> the loads are ports 2 and 3.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (8 preceding siblings ...)
  2021-12-17 15:34 ` ubizjak at gmail dot com
@ 2021-12-20  2:28 ` crazylht at gmail dot com
  2021-12-20  3:40 ` crazylht at gmail dot com
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-20  2:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to Thiago Macieira from comment #0)
> > Testcase:
> ...
> > This produces the following assembly:
> > 
> >         vmovdqu16       (%rdi), %ymm1
> >         vmovdqu16       32(%rdi), %ymm2
> >         vpcmpuw $0, %ymm0, %ymm1, %k0
> >         vpcmpuw $0, %ymm0, %ymm2, %k1
> >         kmovw   %k0, %edx
> >         kmovw   %k1, %eax
> >         kortestw        %k1, %k0
> >         je      .L10
> > 
> > Those two KMOVW instructions aren't required for the check that follows.
> > They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> > can't be dispatched until those two have executed, thus introducing a
> > 2-cycle delay in this loop.
> 
> These are not NOP moves but zero-extensions.
> 
>         vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
>         vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
>         vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
>         vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
>         kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
>         kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
>         kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi
> 
> since for some reason tree optimizers give us:
> 
>   _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
>   _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
>   _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
>   _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
>   _2 = (int) _27;
>   _3 = (int) _29;
>   _15 = __builtin_ia32_kortestzhi (_3, _2);
> 
> 

Is there any way to avoid the zero-extension for

>   _2 = (int) _27;
>   _3 = (int) _29;

since __builtin_ia32_kortestzhi is defined to accept two short parameters?
The ABI also doesn't require clearing the upper bits.

For example, for this testcase:
int
__attribute__((noipa))
foo (short a)
{
    return a;
}

int
foo1 (short a)
{
    return foo (a);
}


_Z3foos:
        movswl  %di, %eax
        ret
_Z4foo1s:
        movswl  %di, %edi
        jmp     _Z3foos


The movswl in foo1 seems redundant: foo sign-extends %di itself, so it does
not rely on the caller having extended the argument.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (9 preceding siblings ...)
  2021-12-20  2:28 ` crazylht at gmail dot com
@ 2021-12-20  3:40 ` crazylht at gmail dot com
  2021-12-20  3:46 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-20  3:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Thiago Macieira from comment #6)
> It got worse. Now I'm seeing:
> 
> .L807:
>         vmovdqu16       (%rsi), %ymm2
>         vmovdqu16       32(%rsi), %ymm3
>         vpcmpuw $6, %ymm0, %ymm2, %k2
>         vpcmpuw $6, %ymm0, %ymm3, %k3
>         kmovw   %k2, %eax
>         kmovw   %k3, %edx
>         kmovd   %eax, %k4
>         kmovd   %edx, %k5
>         kortestd        %k5, %k4
>         je      .L814
> 
(define_insn "*zero_extend<mode>si2"
  [(set (match_operand:SI 0 "register_operand" "=r,*r,*k")
        (zero_extend:SI
          (match_operand:SWI12 1 "nonimmediate_operand" "<r>m,*k,*km")))]
  "!(TARGET_ZERO_EXTEND_WITH_AND && optimize_function_for_speed_p (cfun))"

zero_extendhisi is supported with a k alternative, so it should be optimized to

         vmovdqu16       (%rsi), %ymm2
         vmovdqu16       32(%rsi), %ymm3
         vpcmpuw $6, %ymm0, %ymm2, %k2
         vpcmpuw $6, %ymm0, %ymm3, %k3
         kmovw   %k2, %k4
         kmovw   %k3, %k5
         kortestd        %k5, %k4

And considering that vpcmpuw implicitly zero-extends k2/k3, it can be further
optimized to

         vmovdqu16       (%rsi), %ymm2
         vmovdqu16       32(%rsi), %ymm3
         vpcmpuw $6, %ymm0, %ymm2, %k2
         vpcmpuw $6, %ymm0, %ymm3, %k3
         kortestd        %k3, %k2

> Code snippet:
> 
>         auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U) {
>             if constexpr (sizeof(Char) == 1) {
>                 __m256i mval = _mm256_set1_epi8(maxval);
>                 __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
>                 return _mm256_cmpgt_epu8_mask(data, mval);
>             } else if constexpr (sizeof(Char) == 2) {
>                 __m256i mval = _mm256_set1_epi16(maxval);
>                 __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
>                 return _mm256_cmpgt_epu16_mask(data, mval);
>             } else if constexpr (sizeof(Char) == 4) {
>                 __m256i mval = _mm256_set1_epi32(maxval);
>                 __m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
>                 return _mm256_cmpgt_epu32_mask(data, mval);
>             }
>         };
> /*...*/
>             auto mask1 = loadAndCompare(n1);
>             auto mask2 = loadAndCompare(n2);
> 
> I can make a compilable version if you need me to.

Yes, please.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (10 preceding siblings ...)
  2021-12-20  3:40 ` crazylht at gmail dot com
@ 2021-12-20  3:46 ` crazylht at gmail dot com
  2021-12-20  5:27 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-20  3:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #12 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Thiago Macieira from comment #6)
> > It got worse. Now I'm seeing:
> > 
> > .L807:
> >         vmovdqu16       (%rsi), %ymm2
> >         vmovdqu16       32(%rsi), %ymm3
> >         vpcmpuw $6, %ymm0, %ymm2, %k2
> >         vpcmpuw $6, %ymm0, %ymm3, %k3
> >         kmovw   %k2, %eax
> >         kmovw   %k3, %edx
> >         kmovd   %eax, %k4
> >         kmovd   %edx, %k5
I guess there are other uses of %eax/%edx below; that's why the RA didn't
choose the k alternative.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (11 preceding siblings ...)
  2021-12-20  3:46 ` crazylht at gmail dot com
@ 2021-12-20  5:27 ` crazylht at gmail dot com
  2021-12-20  7:56 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-20  5:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52031&action=edit
untested patch.

The attached patch can optimize #c0 to

        vmovdqu (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k1
        vpcmpuw $0, %ymm0, %ymm2, %k0
        kmovw   %k1, %k2
        kortestw        %k0, %k1
        je      .L10


and #c6 to

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k3
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kmovw   %k3, %k1
        kmovw   %k0, %k2
        kortestd        %k2, %k1
        je      .L10


It should be much better than the original version, but still a little
suboptimal: the first kmovw should be sunk to the exit edge, and the latter
two kmovws should be eliminated.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (12 preceding siblings ...)
  2021-12-20  5:27 ` crazylht at gmail dot com
@ 2021-12-20  7:56 ` crazylht at gmail dot com
  2021-12-21  1:41 ` crazylht at gmail dot com
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-20  7:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #52031|0                           |1
        is obsolete|                            |

--- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52032
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52032&action=edit
update patch

Updated the patch; now GCC can generate optimal code.

For #c0:

.L4:
        vmovdqu (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k1
        vpcmpuw $0, %ymm0, %ymm2, %k0
        kortestw        %k0, %k1
        je      .L10
        kortestw        %k1, %k1
        je      .L5
        kmovd   %k1, %eax



For #c6:

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k3
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kortestd        %k0, %k3
        je      .L10
        kortestw        %k3, %k3
        je      .L5

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (13 preceding siblings ...)
  2021-12-20  7:56 ` crazylht at gmail dot com
@ 2021-12-21  1:41 ` crazylht at gmail dot com
  2021-12-23  5:43 ` cvs-commit at gcc dot gnu.org
  2022-01-07 15:09 ` cvs-commit at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: crazylht at gmail dot com @ 2021-12-21  1:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #15 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #14)
> Created attachment 52032 [details]
> update patch
> 
> Updated the patch; now GCC can generate optimal code.
> 

The current fix adds define_insn_and_split patterns for three things:
1. Combine vpcmpuw and zero_extend into vpcmpuw.
2. Canonicalize the vpcmpuw pattern so CSE can replace a duplicate vpcmpuw
with just a kmov.
3. Use DImode as the dest of the zero_extend so cprop_hardreg can eliminate
the redundant kmov.

But the sink issue still exists: for the testcase in PR103774 there is a
memory_operand in the vpcmpuw, combine fails due to the cost increase, and
the redundant kmov remains in the loop.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (14 preceding siblings ...)
  2021-12-21  1:41 ` crazylht at gmail dot com
@ 2021-12-23  5:43 ` cvs-commit at gcc dot gnu.org
  2022-01-07 15:09 ` cvs-commit at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-12-23  5:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #16 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:1a7ce8570997eb1596c803443d20687b43fa2e47

commit r12-6103-g1a7ce8570997eb1596c803443d20687b43fa2e47
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Dec 22 16:48:54 2021 +0800

    Combine vpcmpuw + zero_extend to vpcmpuw.

    vcmp{ps,ph,pd} and vpcmp{,u}{b,w,d,q} implicitly clear the upper bits
    of dest.

    gcc/ChangeLog:

            PR target/103750
            * config/i386/sse.md
            (*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
            New pre_reload define_insn_and_split.
            (*<avx512>_cmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
            Ditto.
            (*<avx512>_ucmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
            Ditto.
            (*<avx512>_ucmp<VI48_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
            Ditto.
            (*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
            Ditto.
            (*<avx512>_cmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
            Ditto.
            (*<avx512>_ucmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
            Ditto.
            (*<avx512>_ucmp<VI48_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
            Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/avx512bw-pr103750-1.c: New test.
            * gcc.target/i386/avx512bw-pr103750-2.c: New test.
            * gcc.target/i386/avx512f-pr103750-1.c: New test.
            * gcc.target/i386/avx512f-pr103750-2.c: New test.
            * gcc.target/i386/avx512fp16-pr103750-1.c: New test.
            * gcc.target/i386/avx512fp16-pr103750-2.c: New test.

* [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
  2021-12-16 16:25 [Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop thiago at kde dot org
                   ` (15 preceding siblings ...)
  2021-12-23  5:43 ` cvs-commit at gcc dot gnu.org
@ 2022-01-07 15:09 ` cvs-commit at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-07 15:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #17 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:e7a7dbb5ca5dd69689f1a462ba7620180acfe8b0

commit r12-6342-ge7a7dbb5ca5dd69689f1a462ba7620180acfe8b0
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Dec 20 11:13:38 2021 +0800

    Allow propagations from inner loop to outer loop.

    NULL is considered as an outer loop of any other loop.

    gcc/ChangeLog:

            PR rtl-optimization/103750
            * fwprop.c (forward_propagate_into): Allow propagations from
            inner loop to outer loop.

    gcc/testsuite/ChangeLog:

            * g++.target/i386/pr103750-fwprop-1.C: New test.
