public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/103252] New: questionable codegen with kmovd
@ 2021-11-15 14:58 jason at zx2c4 dot com
  2021-11-15 15:18 ` [Bug c/103252] " jason at zx2c4 dot com
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: jason at zx2c4 dot com @ 2021-11-15 14:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

            Bug ID: 103252
           Summary: questionable codegen with kmovd
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jason at zx2c4 dot com
  Target Milestone: ---

Created attachment 51798
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51798&action=edit
preprocessed source

gcc 11.2 with -march=native is generating avx512 code that seems less efficient
than the non avx512 case, for inet_aton_end in glibc 2.34.

Non-avx512 (-march=skylake):

mov     ecx, ds:(__libc_tsd_CTYPE_B_tpoff - _GLOBAL_OFFSET_TABLE_)[ebx]
mov     ecx, gs:[ecx]

Avx512:

kmovd   k0, ds:(__libc_tsd_CTYPE_B_tpoff - _GLOBAL_OFFSET_TABLE_)[ebx]
kmovd   edx, k0
kmovd   k0, dword ptr gs:[edx]
kmovd   edx, k0

Command line is:

thinkpad /var/tmp/portage/sys-libs/glibc-2.34-r2/work/glibc-2.34/resolv #
x86_64-pc-linux-gnu-gcc -m32 -march=native -pipe -O2 -Wl,-O1 -Wl,--as-needed
inet_addr.c -c -std=gnu11 -fgnu89-inline  -march=native -pipe -O2 -Wall
-Wwrite-strings -Wundef -fmerge-all-constants -frounding-math
-fstack-protector-strong -fno-common -Wstrict-prototypes -Wold-style-definition
-fmath-errno    -Wa,-mtune=i686   -ftls-model=initial-exec   -U_FORTIFY_SOURCE 
 -I../include
-I/var/tmp/portage/sys-libs/glibc-2.34-r2/work/build-x86-x86_64-pc-linux-gnu-nptl/resolv

-I/var/tmp/portage/sys-libs/glibc-2.34-r2/work/build-x86-x86_64-pc-linux-gnu-nptl
 -I../sysdeps/unix/sysv/linux/i386/i686  -I../sysdeps/i386/i686/nptl 
-I../sysdeps/unix/sysv/linux/i386  -I../sysdeps/unix/sysv/linux/x86/include
-I../sysdeps/unix/sysv/linux/x86  -I../sysdeps/x86/nptl  -I../sysdeps/i386/nptl
 -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux 
-I../sysdeps/nptl  -I../sysdeps/pthread  -I../sysdeps/gnu 
-I../sysdeps/unix/inet  -I../sysdeps/unix/sysv  -I../sysdeps/unix/i386 
-I../sysdeps/unix  -I../sysdeps/posix  -I../sysdeps/i386/i686/fpu/multiarch 
-I../sysdeps/i386/i686/fpu  -I../sysdeps/i386/i686/multiarch 
-I../sysdeps/i386/i686  -I../sysdeps/i386/fpu  -I../sysdeps/x86/fpu 
-I../sysdeps/i386  -I../sysdeps/x86/include -I../sysdeps/x86 
-I../sysdeps/wordsize-32  -I../sysdeps/ieee754/float128 
-I../sysdeps/ieee754/ldbl-96/include -I../sysdeps/ieee754/ldbl-96 
-I../sysdeps/ieee754/dbl-64  -I../sysdeps/ieee754/flt-32  -I../sysdeps/ieee754 
-I../sysdeps/generic  -I.. -I../libio -I. -nostdinc -isystem
/usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/include -isystem
/usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/include-fixed -isystem /usr/include
-D_LIBC_REENTRANT -include
/var/tmp/portage/sys-libs/glibc-2.34-r2/work/build-x86-x86_64-pc-linux-gnu-nptl/libc-modules.h
-DMODULE_NAME=libc -include ../include/libc-symbols.h  -DPIC    
-DTOP_NAMESPACE=glibc -o
/var/tmp/portage/sys-libs/glibc-2.34-r2/work/build-x86-x86_64-pc-linux-gnu-nptl/resolv/inet_addr.o
-MD -MP -MF
/var/tmp/portage/sys-libs/glibc-2.34-r2/work/build-x86-x86_64-pc-linux-gnu-nptl/resolv/inet_addr.o.dt
-MT
/var/tmp/portage/sys-libs/glibc-2.34-r2/work/build-x86-x86_64-pc-linux-gnu-nptl/resolv/inet_addr.o

And the preprocessed -E source is attached. This was noticed during
investigating https://www.sourceware.org/bugzilla/show_bug.cgi?id=28595 , which
seems likely to be a binutils or glibc bug instead. But nonetheless, the code
gen here seems surprising.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug c/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
@ 2021-11-15 15:18 ` jason at zx2c4 dot com
  2021-11-15 17:10 ` jason at zx2c4 dot com
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jason at zx2c4 dot com @ 2021-11-15 15:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #1 from Jason A. Donenfeld <jason at zx2c4 dot com> ---
My march=native should expand to:

-march=tigerlake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2
-mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes
-mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mno-avx512er
-mno-avx512pf -mavx512vbmi -mavx512ifma -mno-avx5124vnniw -mno-avx5124fmaps
-mavx512vpopcntdq -mavx512vbmi2 -mgfni -mvpclmulqdq -mavx512vnni -mavx512bitalg
-mno-avx512bf16 -mavx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote
-mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr
-mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mmovdir64b -mmovdiri -mno-mwaitx
-mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd
-mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk
-mvaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves
-mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl
-mno-widekl -mno-avxvnni --param "l1-cache-size=48" --param
"l1-cache-line-size=64" --param "l2-cache-size=24576" -mtune=tigerlake

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug c/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
  2021-11-15 15:18 ` [Bug c/103252] " jason at zx2c4 dot com
@ 2021-11-15 17:10 ` jason at zx2c4 dot com
  2021-11-15 17:52 ` [Bug target/103252] " jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jason at zx2c4 dot com @ 2021-11-15 17:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #2 from Jason A. Donenfeld <jason at zx2c4 dot com> ---
Here's a more minimal test case: https://gcc.godbolt.org/z/15hnsb6of
And even smaller: https://gcc.godbolt.org/z/KG63ErzEr

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
  2021-11-15 15:18 ` [Bug c/103252] " jason at zx2c4 dot com
  2021-11-15 17:10 ` jason at zx2c4 dot com
@ 2021-11-15 17:52 ` jakub at gcc dot gnu.org
  2021-11-15 17:59 ` pinskia at gcc dot gnu.org
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-11-15 17:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
On https://gcc.godbolt.org/z/KG63ErzEr I don't see anything wrong, ia32 has
just a few GPRs and all of them are heavily used in the loop, so if the %k?
registers aren't slower than memory, it seems just fine to me.
For https://gcc.godbolt.org/z/15hnsb6of trunk emits
        kmovd   %ecx, %k0
        movl    __libc_tsd_CTYPE_B@gotntpoff(%ebx), %ecx
        kmovd   %k0, %eax
        movl    %gs:(%ecx), %ecx
        testb   $32, 1(%ecx,%eax,2)
where it is unclear to me why if RA made the unfortunate decision here to use
%k0 doesn't e.g. postreload fix that up, it has:
(insn 221 94 95 17 (set (reg:SI 68 k0 [orig:130 c ] [130])
        (reg:SI 2 cx [orig:130 c ] [130])) "pr103252.c":81:68 77
{*movsi_internal}
     (nil))
... a few insns that don't touch %eax nor %k0
(insn 222 98 197 17 (set (reg:SI 0 ax [orig:130 c ] [130])
        (reg:SI 68 k0 [orig:130 c ] [130])) "pr103252.c":81:43 77
{*movsi_internal}
     (nil))
so why does it the unnecessary hop through another register when it could have
movl %ecx, %eax instead of the first kmovd and nothing instead of the second. 
Those 2 are the only %k0 uses.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (2 preceding siblings ...)
  2021-11-15 17:52 ` [Bug target/103252] " jakub at gcc dot gnu.org
@ 2021-11-15 17:59 ` pinskia at gcc dot gnu.org
  2021-11-15 18:07 ` jason at zx2c4 dot com
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-11-15 17:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Jason A. Donenfeld from comment #2)
> Here's a more minimal test case: https://gcc.godbolt.org/z/15hnsb6of

        kmovd   k0, ecx
        mov     ecx, DWORD PTR __libc_tsd_CTYPE_B@gotntpoff[ebx]
        kmovd   eax, k0
        mov     ecx, DWORD PTR gs:[ecx]
        test    BYTE PTR 1[ecx+eax*2], 32

vs:

        mov     ecx, DWORD PTR __libc_tsd_CTYPE_B@gotntpoff[ebx]
        mov     ecx, DWORD PTR gs:[ecx]
        test    BYTE PTR 1[ecx+edx*2], 32

Maybe it could have directly used eax in the first instruction but other than
that, I don't see any issue with this code generation. 

Before RA:
(insn 162 161 163 15 (set (reg:SI 130 [ c ])
        (sign_extend:SI (reg/v:QI 104 [ c ]))) "/app/example.c":81:68 156
{extendqisi2}
     (expr_list:REG_DEAD (reg/v:QI 104 [ c ])
        (nil)))
(insn 163 162 273 15 (set (reg:SI 132)
        (mem/u/c:SI (plus:SI (reg:SI 82)
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("__libc_tsd_CTYPE_B") [flags 0x60] 
<var_decl 0x7f34b0cb4c60 __libc_tsd_CTYPE_B>)
                        ] UNSPEC_GOTNTPOFF))) [7  S4 A8]))
"/app/example.c":81:67 77 {*movsi_internal}
     (expr_list:REG_EQUIV (mem/u/c:SI (plus:SI (reg:SI 82)
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("__libc_tsd_CTYPE_B") [flags 0x60] 
<var_decl 0x7f34b0cb4c60 __libc_tsd_CTYPE_B>)
                        ] UNSPEC_GOTNTPOFF))) [7  S4 A8])
        (nil)))
(insn 273 163 165 15 (set (reg/f:SI 131 [ __libc_tsd_CTYPE_B ])
        (mem/f/c:SI (reg:SI 132) [4 __libc_tsd_CTYPE_B+0 S4 A32 AS2]))
"/app/example.c":81:67 77 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 132)
        (nil)))
(note 165 273 166 15 NOTE_INSN_DELETED)
(note 166 165 274 15 NOTE_INSN_DELETED)
(insn 274 166 168 15 (set (reg:CCZ 17 flags)
        (compare:CCZ (and:QI (mem:QI (plus:SI (plus:SI (mult:SI (reg:SI 130 [ c
])
                                (const_int 2 [0x2]))
                            (reg/f:SI 131 [ __libc_tsd_CTYPE_B ]))
                        (const_int 1 [0x1])) [5 *_12+1 S1 A8])
                (const_int 32 [0x20]))
            (const_int 0 [0]))) "/app/example.c":81:43 502 {*testqi_1_maybe_si}
     (expr_list:REG_DEAD (reg/f:SI 131 [ __libc_tsd_CTYPE_B ])
        (expr_list:REG_DEAD (reg:SI 130 [ c ])
            (nil))))

After reload:

(insn 162 161 298 16 (set (reg:SI 2 cx [orig:130 c ] [130])
        (sign_extend:SI (reg/v:QI 2 cx [orig:104 c ] [104])))
"/app/example.c":81:68 156 {extendqisi2}
     (nil))
(insn 298 162 163 16 (set (reg:SI 68 k0 [orig:130 c ] [130])
        (reg:SI 2 cx [orig:130 c ] [130])) "/app/example.c":81:68 77
{*movsi_internal}
     (nil))
(insn 163 298 273 16 (set (reg:SI 2 cx [132])
        (mem/u/c:SI (plus:SI (reg:SI 3 bx [82])
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("__libc_tsd_CTYPE_B") [flags 0x60] 
<var_decl 0x7fd752e61c60 __libc_tsd_CTYPE_B>)
                        ] UNSPEC_GOTNTPOFF))) [7  S4 A8]))
"/app/example.c":81:67 77 {*movsi_internal}
     (expr_list:REG_EQUIV (mem/u/c:SI (plus:SI (reg:SI 3 bx [82])
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("__libc_tsd_CTYPE_B") [flags 0x60] 
<var_decl 0x7fd752e61c60 __libc_tsd_CTYPE_B>)
                        ] UNSPEC_GOTNTPOFF))) [7  S4 A8])
        (nil)))
(insn 273 163 165 16 (set (reg/f:SI 2 cx [orig:131 __libc_tsd_CTYPE_B ] [131])
        (mem/f/c:SI (reg:SI 2 cx [132]) [4 __libc_tsd_CTYPE_B+0 S4 A32 AS2]))
"/app/example.c":81:67 77 {*movsi_internal}
     (nil))
(note 165 273 166 16 NOTE_INSN_DELETED)
(note 166 165 299 16 NOTE_INSN_DELETED)
(insn 299 166 274 16 (set (reg:SI 0 ax [orig:130 c ] [130])
        (reg:SI 68 k0 [orig:130 c ] [130])) "/app/example.c":81:43 77
{*movsi_internal}
     (nil))
(insn 274 299 168 16 (set (reg:CCZ 17 flags)
        (compare:CCZ (and:QI (mem:QI (plus:SI (plus:SI (mult:SI (reg:SI 0 ax
[orig:130 c ] [130])
                                (const_int 2 [0x2]))
                            (reg/f:SI 2 cx [orig:131 __libc_tsd_CTYPE_B ]
[131]))
                        (const_int 1 [0x1])) [5 *_12+1 S1 A8])
                (const_int 32 [0x20]))
            (const_int 0 [0]))) "/app/example.c":81:43 502 {*testqi_1_maybe_si}
     (nil))

This is odd but not too bad really.  It is definitely a register allocator
issue but how bad is questionable. Register allocator will definitely do
different choices if there are more register available which is what is
happening in the AVX case.


> And even smaller: https://gcc.godbolt.org/z/KG63ErzEr

This one is fine/ok as GCC is using k0 as a spill register rather than spilling
to memory. 32bit x86 has limited registers and all. There is nothing odd about
this one even.

        kmovd   ecx, k0
vs
        mov     ecx, DWORD PTR 4[esp]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (3 preceding siblings ...)
  2021-11-15 17:59 ` pinskia at gcc dot gnu.org
@ 2021-11-15 18:07 ` jason at zx2c4 dot com
  2021-11-16  1:20 ` crazylht at gmail dot com
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jason at zx2c4 dot com @ 2021-11-15 18:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #5 from Jason A. Donenfeld <jason at zx2c4 dot com> ---
> This one is fine/ok as GCC is using k0 as a spill register rather than spilling to memory. 32bit x86 has limited registers and all. There is nothing odd about this one even.

Right, okay, I see what's happening there. I suppose it's a point of debate as
to whether using k0 is actually faster than having the frontend optimize away
the stack access or whether it misses that and there's a memory latency penalty
for spilling. Presumably the risk of penalty is too high, I guess.

For the original example, though, it doesn't seem to even be saving a spill.
The non-k0 code is clearly better than the k0 code. I don't know much about how
the allocator works and interacts with various passes of the optimizer, but I
wonder if spilling to a mask register should have a higher weight than spilling
to a gpr?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (4 preceding siblings ...)
  2021-11-15 18:07 ` jason at zx2c4 dot com
@ 2021-11-16  1:20 ` crazylht at gmail dot com
  2021-11-16  9:29 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-11-16  1:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
> For the original example, though, it doesn't seem to even be saving a spill.
> The non-k0 code is clearly better than the k0 code. I don't know much about
> how the allocator works and interacts with various passes of the optimizer,
> but I wonder if spilling to a mask register should have a higher weight than
> spilling to a gpr?
In GCC internal cost table, spill to a mask register is 2.5 times expensive
than gpr, a little bit cheaper than spill to memory.

And yes sometimes spill to memory could be cheaper if there is an opportunity
for store forward optimisation, but it depends on the workroads and capacity of
store buffer.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (5 preceding siblings ...)
  2021-11-16  1:20 ` crazylht at gmail dot com
@ 2021-11-16  9:29 ` rguenth at gcc dot gnu.org
  2021-11-16 11:51 ` jason at zx2c4 dot com
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-11-16  9:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-11-16

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (6 preceding siblings ...)
  2021-11-16  9:29 ` rguenth at gcc dot gnu.org
@ 2021-11-16 11:51 ` jason at zx2c4 dot com
  2021-11-16 11:56 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jason at zx2c4 dot com @ 2021-11-16 11:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #7 from Jason A. Donenfeld <jason at zx2c4 dot com> ---
The strange thing in this case is that the non-avx512 codegen _doesn't_ spill
to memory. It just uses the gprs that are around. So it seems like that,
somehow, the mere existence of the mask registers causes the register allocator
to be lazier than usual, resulting in this situation, where the effects combine
to produce suboptimal code.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (7 preceding siblings ...)
  2021-11-16 11:51 ` jason at zx2c4 dot com
@ 2021-11-16 11:56 ` jakub at gcc dot gnu.org
  2021-11-16 12:06 ` jason at zx2c4 dot com
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-11-16 11:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Definitely not lazier.  When the mask registers are available for use, RA
considers them and when spilling to those is cheaper than to memory, it spills
to them and not memory.  Where cheaper is determined by the gcc cost model,
which tries to approximately model the hw, but on real hw whether it is
beneficial or not may depend on various other details that perhaps aren't even
considered.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (8 preceding siblings ...)
  2021-11-16 11:56 ` jakub at gcc dot gnu.org
@ 2021-11-16 12:06 ` jason at zx2c4 dot com
  2021-11-18  7:09 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jason at zx2c4 dot com @ 2021-11-16 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #9 from Jason A. Donenfeld <jason at zx2c4 dot com> ---
>  When the mask registers are available for use, RA considers them and when spilling to those is cheaper than to memory, it spills to them and not memory.

Yes, this is the thing I don't get. When you compare the codegen for avx512 vs
non-avx512, the non-avx512 doesn't spill at all there. So this isn't "spill to
memory" vs "spill to mask register". This is "don't spill" vs "spill to mask
register". And the latter seems clearly worse.


-------------

(As an aside, Agner reports a certain "fast forwarding" on Tigerlake+, with
zero latency write-to-read for certain addresses, with stack computations being
easy ones. Looking at 'MOV r32,[m32]+MOV [m32],r32' here: 
http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel00506E3_Skylake2_InstLatX64.txt
http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel00606A6_ICX_InstLatX64.txt
http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel00806C1_TigerLake3_InstLatX64.txt
http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel0090672_AlderLake_BC_AVX512_InstLatX64.txt
you can see the huge reduction on Tigerlake and Alderlake. This is totally
irrelevant and immaterial for the purposes of this bug report, but I idly
wonder if at some point there'll be a slightly different cost model for this
between Icelake and Tigerlake+.)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (9 preceding siblings ...)
  2021-11-16 12:06 ` jason at zx2c4 dot com
@ 2021-11-18  7:09 ` crazylht at gmail dot com
  2021-11-18  8:26 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-11-18  7:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Jason A. Donenfeld from comment #2)
> > Here's a more minimal test case: https://gcc.godbolt.org/z/15hnsb6of
> 
>         kmovd   k0, ecx
>         mov     ecx, DWORD PTR __libc_tsd_CTYPE_B@gotntpoff[ebx]
>         kmovd   eax, k0
>         mov     ecx, DWORD PTR gs:[ecx]
>         test    BYTE PTR 1[ecx+eax*2], 32
> 
> vs:
> 
>         mov     ecx, DWORD PTR __libc_tsd_CTYPE_B@gotntpoff[ebx]
>         mov     ecx, DWORD PTR gs:[ecx]
>         test    BYTE PTR 1[ecx+edx*2], 32
> 

Why cprop_hardreg can't handle this?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (10 preceding siblings ...)
  2021-11-18  7:09 ` crazylht at gmail dot com
@ 2021-11-18  8:26 ` crazylht at gmail dot com
  2021-11-19  2:42 ` crazylht at gmail dot com
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-11-18  8:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
> Why cprop_hardreg can't handle this?

cprop_hardreg only prop hard register, not memory.

(insn 86 85 227 15 (set (reg:SI 68 k0 [132])
        (mem/u/c:SI (plus:SI (reg:SI 3 bx [82])
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("__libc_tsd_CTYPE_B") [flags 0x60] 
<var_decl 0x7f6668cf2b40 __libc_tsd_CTYPE_B>)
                        ] UNSPEC_GOTNTPOFF))) [7  S4 A8])) "est.c":81:67 75
{*movsi_internal}
     (expr_list:REG_EQUIV (mem/u/c:SI (plus:SI (reg:SI 3 bx [82])
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("__libc_tsd_CTYPE_B") [flags 0x60] 
<var_decl 0x7f6668cf2b40 __libc_tsd_CTYPE_B>)
                        ] UNSPEC_GOTNTPOFF))) [7  S4 A8])
        (nil)))
(insn 227 86 202 15 (set (reg:SI 2 cx [132])
        (reg:SI 68 k0 [132])) test.c":81:67 75 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 68 k0 [132])
        (nil)))
(insn 202 227 88 15 (set (reg/f:SI 68 k0 [orig:131 __libc_tsd_CTYPE_B ] [131])
        (mem/f/c:SI (reg:SI 2 cx [132]) [4 __libc_tsd_CTYPE_B+0 S4 A32 AS2]))
"test.c":81:67 75 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 2 cx [132])
        (nil)))
(note 88 202 89 15 NOTE_INSN_DELETED)
(note 89 88 228 15 NOTE_INSN_DELETED)
(insn 228 89 203 15 (set (reg/f:SI 2 cx [orig:131 __libc_tsd_CTYPE_B ] [131])
        (reg/f:SI 68 k0 [orig:131 __libc_tsd_CTYPE_B ] [131])) "test.c":81:43
75 {*movsi_internal}
     (expr_list:REG_DEAD (reg/f:SI 68 k0 [orig:131 __libc_tsd_CTYPE_B ] [131])
        (nil)))

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (11 preceding siblings ...)
  2021-11-18  8:26 ` crazylht at gmail dot com
@ 2021-11-19  2:42 ` crazylht at gmail dot com
  2021-11-19  6:27 ` crazylht at gmail dot com
  2021-11-24  5:47 ` crazylht at gmail dot com
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-11-19  2:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #12 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Jason A. Donenfeld from comment #9)
> >  When the mask registers are available for use, RA considers them and when spilling to those is cheaper than to memory, it spills to them and not memory.
> 
> Yes, this is the thing I don't get. When you compare the codegen for avx512
> vs non-avx512, the non-avx512 doesn't spill at all there. So this isn't
> "spill to memory" vs "spill to mask register". This is "don't spill" vs
> "spill to mask register". And the latter seems clearly worse.

for non-avx512, Due to the small number of registers available, and the short
live range of r132, r132 is first
Pushing a18(r132,l0) (cost 70) ---- (allocate as mem first)
 and then finally found there're available register
Popping a18(r132,l0) -- assign reg 2. --------- (allocate as register when
there're available register)

for avx512, due to enough number of registers available, r132 is finally
assigned as alternative class. for while picture avx512 has less mem allocated.

avx2:
Disposition:
   26:r82  l1     3    1:r82  l0     3   36:r89  l1     2    2:r89  l0     2
   13:r97  l0     5   37:r101 l1     1    3:r101 l0     4   27:r103 l1   mem
    4:r103 l0   mem   38:r105 l1     0    5:r105 l0     0   28:r108 l1     6
    6:r108 l0     6    0:r112 l0     0   29:r113 l1     5    7:r113 l0     5
   30:r114 l1   mem    8:r114 l0   mem   31:r115 l1   mem    9:r115 l0   mem
   22:r118 l0     0   21:r119 l0     0   40:r128 l1     0   39:r129 l1     0
   17:r130 l0     1   16:r131 l0     2   18:r132 l0     2   15:r136 l0     1
   12:r139 l0     0   32:r142 l1   mem   10:r142 l0   mem   33:r143 l1     4
   20:r143 l0   mem   34:r144 l1   mem   11:r144 l0   mem   35:r145 l1   mem
   19:r145 l0   mem   25:r146 l0     0   24:r147 l0     1   23:r148 l0     2
   41:r149 l1     0   14:r150 l0     0

avx512:
Disposition:
   26:r82  l1     3    1:r82  l0     3   36:r89  l1     1    2:r89  l0     2
   13:r97  l0     4   37:r101 l1     2    3:r101 l0     1   27:r103 l1   mem
    4:r103 l0   mem   38:r105 l1     0    5:r105 l0     0   28:r108 l1     6
    6:r108 l0     6    0:r112 l0     0   29:r113 l1     4    7:r113 l0     4
   30:r114 l1   mem    8:r114 l0   mem   31:r115 l1   mem    9:r115 l0   mem
   22:r118 l0     0   21:r119 l0     0   40:r128 l1     0   39:r129 l1     0
   17:r130 l0     2   16:r131 l0    68   18:r132 l0    68   15:r136 l0     2
   12:r139 l0     0   32:r142 l1   mem   10:r142 l0   mem   33:r143 l1   mem
   20:r143 l0   mem   34:r144 l1     5   11:r144 l0     5   35:r145 l1   mem
   19:r145 l0   mem   25:r146 l0     0   24:r147 l0     1   23:r148 l0     2
   41:r149 l1     0   14:r150 l0     0



So for short live range reg, we may lose opportunity to allocate best regclass,
maybe add peephole2 to handle those cases instead of tune RA.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (12 preceding siblings ...)
  2021-11-19  2:42 ` crazylht at gmail dot com
@ 2021-11-19  6:27 ` crazylht at gmail dot com
  2021-11-24  5:47 ` crazylht at gmail dot com
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-11-19  6:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---

> 
> So for short live range reg, we may lose opportunity to allocate best
> regclass, maybe add peephole2 to handle those cases instead of tune RA.
No, r132 is also used as addr, but currently lra only add cost of movement from
 mask to gpr, but we could possibly run out of gpr which means there will be an
extra spill, and this is not counted by record_address_regs.

modified   gcc/ira-costs.c
@@ -1226,7 +1226,7 @@ record_address_regs (machine_mode mode, addr_space_t as,
rtx x,
        struct costs *pp;
        int *pp_costs;
        enum reg_class i;
-       int k, regno, add_cost;
+       int k, regno, add_cost, potential_spill_cost;
        cost_classes_t cost_classes_ptr;
        enum reg_class *cost_classes;
        move_table *move_in_cost;
@@ -1239,6 +1239,7 @@ record_address_regs (machine_mode mode, addr_space_t as,
rtx x,
          ALLOCNO_BAD_SPILL_P (ira_curr_regno_allocno_map[regno]) = true;
        pp = COSTS (costs, COST_INDEX (regno));
        add_cost = (ira_memory_move_cost[Pmode][rclass][1] * scale) / 2;
+       potential_spill_cost = add_cost / 5;
        if (INT_MAX - add_cost < pp->mem_cost)
          pp->mem_cost = INT_MAX;
        else
@@ -1252,6 +1253,10 @@ record_address_regs (machine_mode mode, addr_space_t as,
rtx x,
          {
            i = cost_classes[k];
            add_cost = (move_in_cost[i][rclass] * scale) / 2;
+           /* If we run out of rclass regs, there could be an extra spill,
+              Let's say 20% possibility.  */
+           if (!ira_class_subset_p[i][rclass])
+             add_cost += potential_spill_cost;
            if (INT_MAX - add_cost < pp_costs[k])
              pp_costs[k] = INT_MAX;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/103252] questionable codegen with kmovd
  2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
                   ` (13 preceding siblings ...)
  2021-11-19  6:27 ` crazylht at gmail dot com
@ 2021-11-24  5:47 ` crazylht at gmail dot com
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-11-24  5:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252

--- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #13)
> > 
> > So for short live range reg, we may lose opportunity to allocate best
> > regclass, maybe add peephole2 to handle those cases instead of tune RA.
> No, r132 is also used as addr, but currently lra only add cost of movement
> from  mask to gpr, but we could possibly run out of gpr which means there
> will be an extra spill, and this is not counted by record_address_regs.
> 
> modified   gcc/ira-costs.c
> @@ -1226,7 +1226,7 @@ record_address_regs (machine_mode mode, addr_space_t
> as, rtx x,
>  	struct costs *pp;
>  	int *pp_costs;
>  	enum reg_class i;
> -	int k, regno, add_cost;
> +	int k, regno, add_cost, potential_spill_cost;
>  	cost_classes_t cost_classes_ptr;
>  	enum reg_class *cost_classes;
>  	move_table *move_in_cost;
> @@ -1239,6 +1239,7 @@ record_address_regs (machine_mode mode, addr_space_t
> as, rtx x,
>  	  ALLOCNO_BAD_SPILL_P (ira_curr_regno_allocno_map[regno]) = true;
>  	pp = COSTS (costs, COST_INDEX (regno));
>  	add_cost = (ira_memory_move_cost[Pmode][rclass][1] * scale) / 2;
> +	potential_spill_cost = add_cost / 5;
>  	if (INT_MAX - add_cost < pp->mem_cost)
>  	  pp->mem_cost = INT_MAX;
>  	else
> @@ -1252,6 +1253,10 @@ record_address_regs (machine_mode mode, addr_space_t
> as, rtx x,
>  	  {
>  	    i = cost_classes[k];
>  	    add_cost = (move_in_cost[i][rclass] * scale) / 2;
> +	    /* If we run out of rclass regs, there could be an extra spill,
> +	       Let's say 20% possibility.  */
> +	    if (!ira_class_subset_p[i][rclass])
> +	      add_cost += potential_spill_cost;
>  	    if (INT_MAX - add_cost < pp_costs[k])
>  	      pp_costs[k] = INT_MAX;

Increase cost will lose some spill to mask opportunity like testcase
https://gcc.godbolt.org/z/KG63ErzEr

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-11-24  5:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-15 14:58 [Bug c/103252] New: questionable codegen with kmovd jason at zx2c4 dot com
2021-11-15 15:18 ` [Bug c/103252] " jason at zx2c4 dot com
2021-11-15 17:10 ` jason at zx2c4 dot com
2021-11-15 17:52 ` [Bug target/103252] " jakub at gcc dot gnu.org
2021-11-15 17:59 ` pinskia at gcc dot gnu.org
2021-11-15 18:07 ` jason at zx2c4 dot com
2021-11-16  1:20 ` crazylht at gmail dot com
2021-11-16  9:29 ` rguenth at gcc dot gnu.org
2021-11-16 11:51 ` jason at zx2c4 dot com
2021-11-16 11:56 ` jakub at gcc dot gnu.org
2021-11-16 12:06 ` jason at zx2c4 dot com
2021-11-18  7:09 ` crazylht at gmail dot com
2021-11-18  8:26 ` crazylht at gmail dot com
2021-11-19  2:42 ` crazylht at gmail dot com
2021-11-19  6:27 ` crazylht at gmail dot com
2021-11-24  5:47 ` crazylht at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).