public inbox for gcc-bugs@sourceware.org
* [Bug target/104610] New: memcmp () == 0 can be optimized better for avx512f
@ 2022-02-21 11:00 pinskia at gcc dot gnu.org
2022-02-21 11:01 ` [Bug target/104610] " pinskia at gcc dot gnu.org
` (23 more replies)
0 siblings, 24 replies; 25+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-02-21 11:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
Bug ID: 104610
Summary: memcmp () == 0 can be optimized better for avx512f
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-linux-gnu
Take:
bool f(char *a)
{
char t[] = "0123456789012345678901234567890";
return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}
----- CUT ----
GCC does this via branches and compares but clang/LLVM does:
vmovdqu (%rdi), %ymm0
vpxor .LCPI0_0(%rip), %ymm0, %ymm0
vptest %ymm0, %ymm0
sete %al
vzeroupper
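The clang sequence above XORs the two 32-byte blocks and tests whether the result is all zero. As a rough scalar illustration of the same idea (a sketch only, not code from GCC or LLVM; the helper name equal32 is made up here), the comparison can be written without a branch per word:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not GCC or LLVM code: returns true when the
   32 bytes at a and b are identical, without branching per word.  */
static bool equal32(const void *a, const void *b)
{
    uint64_t x[4], y[4];

    memcpy(x, a, sizeof x);   /* unaligned-safe loads */
    memcpy(y, b, sizeof y);

    /* XOR gives zero only for matching words; OR accumulates any
       difference, playing the role of vpxor followed by vptest.  */
    uint64_t diff = (x[0] ^ y[0]) | (x[1] ^ y[1])
                  | (x[2] ^ y[2]) | (x[3] ^ y[3]);
    return diff == 0;
}
```

A compiler is free to turn the four 64-bit operations into one 256-bit XOR plus a test when the target allows it; the sketch only shows the data flow.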
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: pinskia at gcc dot gnu.org @ 2022-02-21 11:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note even without avx512f, LLVM does:
movdqu (%rdi), %xmm0
movdqu 16(%rdi), %xmm1
pcmpeqb .LCPI0_0(%rip), %xmm1
pcmpeqb .LCPI0_1(%rip), %xmm0
pand %xmm1, %xmm0
pmovmskb %xmm0, %eax
cmpl $65535, %eax # imm = 0xFFFF
sete %al
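The SSE2 sequence above takes a different route: it compares bytes for equality, combines the two halves with AND, and checks that all 16 mask bits are set. A portable scalar stand-in for that data flow might look like the following. This is a sketch under assumptions: the helper names are invented for illustration, and the AND is applied to the extracted masks rather than to the vectors, which gives the same result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scalar stand-in for pcmpeqb + pmovmskb: bit i of the result is set
   when byte i of the two 16-byte chunks matches.  */
static uint32_t byte_eq_mask16(const unsigned char *p, const unsigned char *q)
{
    uint32_t mask = 0;

    for (int i = 0; i < 16; i++)
        mask |= (uint32_t)(p[i] == q[i]) << i;
    return mask;
}

/* Hypothetical helper: 32-byte equality via two masks combined with AND,
   then compared against 0xFFFF as in the cmpl $65535 above.  */
static bool equal32_masks(const void *a, const void *b)
{
    const unsigned char *p = a, *q = b;
    uint32_t lo = byte_eq_mask16(p, q);
    uint32_t hi = byte_eq_mask16(p + 16, q + 16);

    return (lo & hi) == 0xFFFF;
}
```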
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-22 4:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
In GIMPLE, there is:
_1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
_2 = _1 == 0;
_6 = (int) _2;
So this is about generating vectorized code for __builtin_memcmp_eq.
I guess we can start with sizes that are a multiple of 16 bytes?
Also, I saw that when the size is 9, LLVM generates
f(char*): # @f(char*)
movabs rcx, 3979270244072042800
xor rcx, qword ptr [rdi]
movzx edx, byte ptr [rdi + 8]
xor eax, eax
or rdx, rcx
setne al
ret
while GCC generates
f(char*):
movabsq $3979270244072042800, %rax
cmpq %rax, (%rdi)
je .L5
.L2:
movl $1, %eax
ret
.L5:
cmpb $0, 8(%rdi)
jne .L2
xorl %eax, %eax
ret
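The difference between the two listings is that LLVM folds both pieces (8 bytes plus 1 byte) into a single value before testing it, while GCC branches after the first piece. A minimal C sketch of the branch-free form, assuming a 9-byte comparison split as 8 + 1 (the helper name equal9 is hypothetical and the sketch compares two buffers rather than a buffer against an immediate):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: branch-free equality of exactly 9 bytes,
   split as one 8-byte piece plus one trailing byte.  */
static bool equal9(const void *a, const void *b)
{
    const unsigned char *p = a, *q = b;
    uint64_t x, y;

    memcpy(&x, p, 8);   /* first piece: bytes 0 .. 7 */
    memcpy(&y, q, 8);

    /* OR the differences of both pieces, then test once.  */
    uint64_t diff = (x ^ y) | (uint64_t)(p[8] ^ q[8]);
    return diff == 0;
}
```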
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-22 5:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #2)
> in Gimple, there're
>
> _1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
> _2 = _1 == 0;
> _6 = (int) _2;
>
>
> So it's related to codegen optimization with vectorized codes for
> __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
>
There's no optab or target_hook for backend to participate in optimization of
Participation in optimization.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-22 5:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > in Gimple, there're
> >
> > _1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
> > _2 = _1 == 0;
> > _6 = (int) _2;
> >
> >
> > So it's related to codegen optimization with vectorized codes for
> > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> >
> There's no optab or target_hook for backend to participate in optimization
> of Participation in optimization.
Typo: the last "optimization" should be compare_by_pieces.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-22 6:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #4)
> (In reply to Hongtao.liu from comment #3)
> > (In reply to Hongtao.liu from comment #2)
> > > in Gimple, there're
> > >
> > > _1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
> > > _2 = _1 == 0;
> > > _6 = (int) _2;
> > >
> > >
> > > So it's related to codegen optimization with vectorized codes for
> > > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> > >
> > There's no optab or target_hook for backend to participate in optimization
But there's a cbranch_optab check in can_compare_p, and i386 supports
V8SI/V4DI/V4SI/V2DI there, but not OI/TI. Should we add support for them?
(define_expand "cbranch<mode>4"
  [(set (reg:CC FLAGS_REG)
        (compare:CC (match_operand:VI48_AVX 1 "register_operand")
                    (match_operand:VI48_AVX 2 "nonimmediate_operand")))
   (set (pc) (if_then_else
               (match_operator 0 "bt_comparison_operator"
                [(reg:CC FLAGS_REG) (const_int 0)])
               (label_ref (match_operand 3))
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-22 6:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Hongtao.liu from comment #4)
> > (In reply to Hongtao.liu from comment #3)
> > > (In reply to Hongtao.liu from comment #2)
> > > > in Gimple, there're
> > > >
> > > > _1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
> > > > _2 = _1 == 0;
> > > > _6 = (int) _2;
> > > >
> > > >
> > > > So it's related to codegen optimization with vectorized codes for
> > > > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> > > >
> > > There's no optab or target_hook for backend to participate in optimization
> But there's cbranch_optab check in can_compare_p, and i386 supports
> V8SI/V4DI/V4SI/V2DI, but not for OI/TI, adding support for them?
>
> (define_expand "cbranch<mode>4"
>   [(set (reg:CC FLAGS_REG)
>         (compare:CC (match_operand:VI48_AVX 1 "register_operand")
>                     (match_operand:VI48_AVX 2 "nonimmediate_operand")))
>    (set (pc) (if_then_else
>                (match_operator 0 "bt_comparison_operator"
>                 [(reg:CC FLAGS_REG) (const_int 0)])
>                (label_ref (match_operand 3))
After supporting cbranchoi4, gcc generates
_Z1fPc:
.LFB0:
.cfi_startproc
vmovdqa .LC1(%rip), %ymm0
vpxor (%rdi), %ymm0, %ymm0
vptest %ymm0, %ymm0
sete %al
vzeroupper
which is as optimal as what clang/LLVM generates.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-23 2:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #6)
> (In reply to Hongtao.liu from comment #5)
> > (In reply to Hongtao.liu from comment #4)
> > > (In reply to Hongtao.liu from comment #3)
> > > > (In reply to Hongtao.liu from comment #2)
> > > > > in Gimple, there're
> > > > >
> > > > > _1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
> > > > > _2 = _1 == 0;
> > > > > _6 = (int) _2;
> > > > >
> > > > >
> > > > > So it's related to codegen optimization with vectorized codes for
> > > > > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> > > > >
> > > > There's no optab or target_hook for backend to participate in optimization
> > But there's cbranch_optab check in can_compare_p, and i386 supports
> > V8SI/V4DI/V4SI/V2DI, but not for OI/TI, adding support for them?
> >
> > (define_expand "cbranch<mode>4"
> >   [(set (reg:CC FLAGS_REG)
> >         (compare:CC (match_operand:VI48_AVX 1 "register_operand")
> >                     (match_operand:VI48_AVX 2 "nonimmediate_operand")))
> >    (set (pc) (if_then_else
> >                (match_operator 0 "bt_comparison_operator"
> >                 [(reg:CC FLAGS_REG) (const_int 0)])
> >                (label_ref (match_operand 3))
>
> After supporting cbranchoi4, gcc generates
>
> _Z1fPc:
> .LFB0:
> .cfi_startproc
> vmovdqa .LC1(%rip), %ymm0
> vpxor (%rdi), %ymm0, %ymm0
> vptest %ymm0, %ymm0
> sete %al
> vzeroupper
>
> which is optimal as clang/llvm does.
Also extend cbranchti4 to use ptest when TARGET_SSE4_1 is set and the code
is EQ or NE, so GCC generates
movdqu (%rdi), %xmm0
movdqa .LC1(%rip), %xmm1
pxor %xmm1, %xmm0
ptest %xmm0, %xmm0
sete %al
for
bool f128(char *a)
{
char t[] = "012345678901234";
return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}
the original codegen is
movabsq $14692989455579448, %rax
xorq 8(%rdi), %rax
movabsq $3978425819141910832, %rdx
xorq (%rdi), %rdx
orq %rdx, %rax
sete %al
ret
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-23 6:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52495
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52495&action=edit
untested patch.
The patch exposes one potential issue related to dse (or to
ix86_gen_scratch_sse_rtx usage). In dse1, it tries to replace a load insn with
an equivalent value, but the newly inserted insns (insn 45, insn 44, insn 46)
set xmm31 and dse is not aware of that. xmm31 is live and is used by insn 10,
which comes right after the newly added insns, so this breaks the data flow.
For the i386 part, I think maybe we shouldn't use ix86_gen_scratch_sse_rtx in
ix86_expand_vector_move, which is called by emit_move_insn and used in many
pre-reload passes; it may break the data flow if other explicit hard
registers are used.
Dump before vs. after dse:
+(insn 45 8 44 2 (set (reg:DI 91)
+ (const_int 4855531112742205610 [0x43624fd242db38aa]))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 80 {*movdi_internal}
+ (nil))
+(insn 44 45 46 2 (set (reg:V4DI 67 xmm31)
+ (vec_duplicate:V4DI (reg:DI 91)))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 7768
{*avx512vl_vec_dup_gprv4di}
+ (expr_list:REG_DEAD (reg:DI 91)
+ (nil)))
+(insn 46 44 10 2 (set (reg:OI 90)
+ (reg:OI 67 xmm31))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 78
{*movoi_internal_avx}
+ (expr_list:REG_EQUAL (const_wide_int
0x43624fd242db38aa43624fd242db38aa43624fd242db38aa43624fd242db38aa)
(nil)))
(insn 10 46 13 2 (set (mem/j/c:V16SF (plus:DI (reg/f:DI 19 frame)
(const_int -128 [0xffffffffffffff80])) [4 bd.x+0 S64 A512])
(reg:V16SF 67 xmm31))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 1707
{movv16sf_internal}
(expr_list:REG_DEAD (reg:V16SF 67 xmm31)
(nil)))
(insn 13 10 14 2 (set (reg:OI 86 [ MEM <char[1:64]> [(void *)&bd] ])
- (mem/c:OI (plus:DI (reg/f:DI 19 frame)
- (const_int -128 [0xffffffffffffff80])) [0 MEM <char[1:64]>
[(void *)&bd]+0 S32 A512]))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":49:7 78
{*movoi_internal_avx}
- (nil))
+ (reg:OI 90)) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":49:7
78 {*movoi_internal_avx}
+ (expr_list:REG_DEAD (reg:OI 90)
+ (nil)))
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: hjl.tools at gmail dot com @ 2022-02-23 21:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
H.J. Lu <hjl.tools at gmail dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-02-23
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
--- Comment #9 from H.J. Lu <hjl.tools at gmail dot com> ---
ix86_gen_scratch_sse_rtx was added to prevent combine from changing
store of vector registers with constant value to store of constant
value. You can change ix86_gen_scratch_sse_rtx to return a pseudo
register and watch the regressions in GCC testsuite. If we can fix
these regressions, ix86_gen_scratch_sse_rtx isn't needed.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-02-24 5:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to H.J. Lu from comment #9)
> ix86_gen_scratch_sse_rtx was added to prevent combine from changing
> store of vector registers with constant value to store of constant
> value. You can change ix86_gen_scratch_sse_rtx to return a pseudo
> register and watch the regressions in GCC testsuite. If we can fix
> these regressions, ix86_gen_scratch_sse_rtx isn't needed.
It regresses. I'm thinking of adding a peephole2 to split mov mem into
mov + movd + shufd, which can prevent the regressions for pr100865; for
vzeroupper, I don't have a good way to avoid those regressions.
gcc.target/i386/pr100865-11b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-12b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-8a.c scan-assembler-times (?:vpbroadcastd|vpshufd)[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr100865-8b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-8c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr100865-9b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-9c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr82941-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82942-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-3.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-5.c scan-assembler-not vzeroupper
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: hjl.tools at gmail dot com @ 2022-02-24 6:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #11 from H.J. Lu <hjl.tools at gmail dot com> ---
Don't worry about vzeroupper.
It's ok to have vzeroupper.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: hjl.tools at gmail dot com @ 2022-02-26 20:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
H.J. Lu <hjl.tools at gmail dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |104704
--- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Hongtao.liu from comment #8)
> Created attachment 52495 [details]
> untested patch.
>
> With the patch, it exposes one potential issue related to dse(or
> ix86_gen_scratch_sse_rtx usage). in dse1, it try to replace load insn with
> equivalent value, but the inserted new insns(insn 45, insn 44, insn 46) will
> set xmm31, but dse is not aware of that, and xmm31 is alive and will be used
> by insn 10 which is exactly after new added insns, and it breaks data flow.
>
I opened PR 104704.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104704
[Bug 104704] [12 Regression] ix86_gen_scratch_sse_rtx doesn't work with
explicit XMM7/XMM15/XMM31 usage
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: hjl.tools at gmail dot com @ 2022-02-27 19:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Hongtao.liu from comment #8)
> Created attachment 52495 [details]
> untested patch.
I see these regressions with -m32:
FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O0
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O1
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O2
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O3 -g
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -Os
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O0
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O1
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -Og -g
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: hjl.tools at gmail dot com @ 2022-02-27 19:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #14 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to H.J. Lu from comment #13)
> (In reply to Hongtao.liu from comment #8)
> > Created attachment 52495 [details]
> > untested patch.
>
> I see these regressions with -m32:
>
> FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O0
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O1
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O2
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O3 -g
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -Os
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O0
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O1
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -Og -g
-m64 regression:
FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: hjl.tools at gmail dot com @ 2022-03-04 3:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
Bug 104610 depends on bug 104704, which changed state.
Bug 104704 Summary: [12 Regression] ix86_gen_scratch_sse_rtx doesn't work with explicit XMM7/XMM15/XMM31 usage
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104704
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-03-28 5:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #15 from Hongtao.liu <crazylht at gmail dot com> ---
Could someone help mark this bug as blocking PR105073? The patch is ready and
waiting for GCC 13.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-03-28 5:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
Hongtao.liu <crazylht at gmail dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #52495|0                           |1
        is obsolete|                            |
--- Comment #16 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52692
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52692&action=edit
Patch pending for GCC13
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: cvs-commit at gcc dot gnu.org @ 2022-05-18 2:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #17 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:850a13d754497faae91afabc6958780f1d63a574
commit r13-580-g850a13d754497faae91afabc6958780f1d63a574
Author: liuhongt <hongtao.liu@intel.com>
Date: Tue Mar 1 13:41:52 2022 +0800
Expand __builtin_memcmp_eq with ptest for OImode.
gcc/ChangeLog:
PR target/104610
* config/i386/i386-expand.cc (ix86_expand_branch): Use ptest
for OImode when code is EQ or NE.
* config/i386/i386.md (cbranchoi4): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr104610.c: New test.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-05-18 2:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #18 from Hongtao.liu <crazylht at gmail dot com> ---
Fixed in GCC13.
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2022-06-16 7:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #19 from Hongtao.liu <crazylht at gmail dot com> ---
I wonder whether targetm.overlap_op_by_pieces_p would help here.
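For context, the idea behind overlapping by-pieces operations can be sketched in plain C. Assuming the hook allows the last piece to overlap the previous one instead of falling back to a smaller access, a 9-byte comparison could use two 8-byte loads rather than an 8-byte and a 1-byte load. The helper below is hypothetical and only illustrates that access pattern; it is not what GCC emits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: equality of n bytes for 8 <= n <= 16 using two
   8-byte pieces; the second piece starts at n - 8, so it overlaps the
   first whenever n < 16.  */
static bool equal_8_to_16(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    uint64_t x0, y0, x1, y1;

    memcpy(&x0, p, 8);          /* bytes 0 .. 7 */
    memcpy(&y0, q, 8);
    memcpy(&x1, p + n - 8, 8);  /* bytes n-8 .. n-1 */
    memcpy(&y1, q + n - 8, 8);

    return ((x0 ^ y0) | (x1 ^ y1)) == 0;
}
```

Bytes covered twice are simply compared twice, which does not change the result of an equality test.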
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: cvs-commit at gcc dot gnu.org @ 2023-06-28 10:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #20 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:4afbebcdc5780d28e52b7d65643e462c7c3882ce
commit r14-2159-g4afbebcdc5780d28e52b7d65643e462c7c3882ce
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Wed Jun 28 11:11:34 2023 +0100
i386: Add cbranchti4 pattern to i386.md (for -m32 compare_by_pieces).
This patch fixes some very odd (unanticipated) code generation by
compare_by_pieces with -m32 -mavx, since the recent addition of the
cbranchoi4 pattern. The issue is that cbranchoi4 is available with
TARGET_AVX, but cbranchti4 is currently conditional on TARGET_64BIT
which results in the odd behaviour (thanks to OPTAB_WIDEN) that with
-m32 -mavx, compare_by_pieces ends up (inefficiently) widening 128-bit
comparisons to 256-bits before performing PTEST.
This patch fixes this by providing a cbranchti4 pattern that's available
with either TARGET_64BIT or TARGET_SSE4_1.
For the test case below (again from PR 104610):
int foo(char *a)
{
static const char t[] = "0123456789012345678901234567890";
return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}
GCC with -m32 -O2 -mavx currently produces the bonkers:
foo: pushl %ebp
movl %esp, %ebp
andl $-32, %esp
subl $64, %esp
movl 8(%ebp), %eax
vmovdqa .LC0, %xmm4
movl $0, 48(%esp)
vmovdqu (%eax), %xmm2
movl $0, 52(%esp)
movl $0, 56(%esp)
movl $0, 60(%esp)
movl $0, 16(%esp)
movl $0, 20(%esp)
movl $0, 24(%esp)
movl $0, 28(%esp)
vmovdqa %xmm2, 32(%esp)
vmovdqa %xmm4, (%esp)
vmovdqa (%esp), %ymm5
vpxor 32(%esp), %ymm5, %ymm0
vptest %ymm0, %ymm0
jne .L2
vmovdqu 16(%eax), %xmm7
movl $0, 48(%esp)
movl $0, 52(%esp)
vmovdqa %xmm7, 32(%esp)
vmovdqa .LC1, %xmm7
movl $0, 56(%esp)
movl $0, 60(%esp)
movl $0, 16(%esp)
movl $0, 20(%esp)
movl $0, 24(%esp)
movl $0, 28(%esp)
vmovdqa %xmm7, (%esp)
vmovdqa (%esp), %ymm1
vpxor 32(%esp), %ymm1, %ymm0
vptest %ymm0, %ymm0
je .L6
.L2: movl $1, %eax
xorl $1, %eax
vzeroupper
leave
ret
.L6: xorl %eax, %eax
xorl $1, %eax
vzeroupper
leave
ret
with this patch, we now generate the (slightly) more sensible:
foo: vmovdqa .LC0, %xmm0
movl 4(%esp), %eax
vpxor (%eax), %xmm0, %xmm0
vptest %xmm0, %xmm0
jne .L2
vmovdqa .LC1, %xmm0
vpxor 16(%eax), %xmm0, %xmm0
vptest %xmm0, %xmm0
je .L5
.L2: movl $1, %eax
xorl $1, %eax
ret
.L5: xorl %eax, %eax
xorl $1, %eax
ret
2023-06-28 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_branch): Also use ptest
for TImode comparisons on 32-bit architectures.
* config/i386/i386.md (cbranch<mode>4): Change from SDWIM to
SWIM1248x to exclude/avoid TImode being conditional on -m64.
(cbranchti4): New define_expand for TImode on both TARGET_64BIT
and/or with TARGET_SSE4_1.
* config/i386/predicates.md (ix86_timode_comparison_operator):
New predicate that depends upon TARGET_64BIT.
(ix86_timode_comparison_operand): Likewise.
gcc/testsuite/ChangeLog
* gcc.target/i386/pieces-memcmp-2.c: New test case.
^ permalink raw reply [flat|nested] 25+ messages in thread
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: ubizjak at gmail dot com @ 2023-06-28 11:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #21 from Uroš Bizjak <ubizjak at gmail dot com> ---
Just before the patch from Comment #20, the compiler creates (-O2 -mavx):
--cut here--
vmovdqa .LC1(%rip), %xmm0
vmovdqa %xmm0, -24(%rsp)
vmovdqu (%rdi), %xmm0
vpxor .LC0(%rip), %xmm0, %xmm0
vptest %xmm0, %xmm0
je .L5
.L2:
movl $1, %eax
testl %eax, %eax
sete %al
ret
.L5:
vmovdqu 16(%rdi), %xmm0
vpxor -24(%rsp), %xmm0, %xmm0
vptest %xmm0, %xmm0
jne .L2
xorl %eax, %eax
testl %eax, %eax
sete %al
ret
--cut here--
Please note the creative way of returning 0 and 1 ... :
movl $1, %eax
testl %eax, %eax
sete %al
ret
Even the new code (from Comment #20) is unnecessarily convoluted:
.L2: movl $1, %eax
xorl $1, %eax
ret
.L5: xorl %eax, %eax
xorl $1, %eax
ret
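To spell out why these exit blocks are redundant, the sketch below models each one in C: a constant is loaded and then inverted with XOR, so the result is itself a constant that could be returned directly. The function names are hypothetical and only mirror the labels in the listing above.

```c
/* Model of .L2: movl $1, %eax ; xorl $1, %eax */
int exit_L2(void)
{
    int eax = 1;
    eax ^= 1;
    return eax;
}

/* Model of .L5: xorl %eax, %eax ; xorl $1, %eax */
int exit_L5(void)
{
    int eax = 0;
    eax ^= 1;
    return eax;
}

/* What the same blocks would return once the constants are folded. */
int exit_L2_folded(void) { return 0; }
int exit_L5_folded(void) { return 1; }
```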
^ permalink raw reply [flat|nested] 25+ messages in thread
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: crazylht at gmail dot com @ 2023-10-10 7:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
For a 64-byte memory comparison:
int compare (const char* s1, const char* s2)
{
return __builtin_memcmp (s1, s2, 64) == 0;
}
we're generating:
vmovdqu (%rsi), %ymm0
vpxorq (%rdi), %ymm0, %ymm0
vptest %ymm0, %ymm0
jne .L2
vmovdqu 32(%rsi), %ymm0
vpxorq 32(%rdi), %ymm0, %ymm0
vptest %ymm0, %ymm0
je .L5
.L2:
movl $1, %eax
xorl $1, %eax
vzeroupper
ret
An alternative is to use vpcmpeq + kortest and check the carry flag:
vmovdqu64 (%rsi), %zmm0
xorl %eax, %eax
vpcmpeqd (%rdi), %zmm0, %k0
kortestw %k0, %k0
setc %al
vzeroupper
Not sure if it's better or not.
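For readers unfamiliar with the mask instructions, here is a portable C model of the idea, assuming the usual semantics: vpcmpeqd sets one mask bit per equal 32-bit lane, and kortestw sets CF only when the OR of its two mask operands is all ones. The names cmpeq_mask16, kortest_carry and compare_model are hypothetical; this is a sketch of the semantics, not of the generated code.

```c
#include <stdint.h>
#include <string.h>

/* Model of vpcmpeqd on a 512-bit vector: bit i of the result is set when
   32-bit lane i of the two 64-byte buffers is equal. */
uint16_t cmpeq_mask16(const void *s1, const void *s2)
{
    uint32_t a[16], b[16];
    uint16_t k = 0;
    memcpy(a, s1, sizeof a);
    memcpy(b, s2, sizeof b);
    for (int i = 0; i < 16; i++)
        if (a[i] == b[i])
            k |= (uint16_t)(1u << i);
    return k;
}

/* Model of kortestw followed by setc: the carry flag is set only when the
   OR of the two 16-bit masks is all ones. */
int kortest_carry(uint16_t k1, uint16_t k2)
{
    return (uint16_t)(k1 | k2) == 0xFFFF;
}

/* Model of the whole sequence: returns 1 iff all 64 bytes are equal. */
int compare_model(const char *s1, const char *s2)
{
    uint16_t k = cmpeq_mask16(s1, s2);
    return kortest_carry(k, k);
}
```

A single differing byte clears exactly one mask bit, so the OR is no longer all ones and the modelled carry flag comes out as 0.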
^ permalink raw reply [flat|nested] 25+ messages in thread
* [Bug target/104610] memcmp () == 0 can be optimized better for avx512f
From: cvs-commit at gcc dot gnu.org @ 2023-10-30 3:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
--- Comment #23 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:8c40b72036c967fbb1d1150515cf70aec382f0a2
commit r14-5002-g8c40b72036c967fbb1d1150515cf70aec382f0a2
Author: liuhongt <hongtao.liu@intel.com>
Date: Mon Oct 9 15:07:54 2023 +0800
Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.
When the two vectors are equal, the kmask is all ones and kortest sets CF;
otherwise CF is cleared.
So the CF bit can be used to check the result of the comparison.
Before:
vmovdqu (%rsi), %ymm0
vpxorq (%rdi), %ymm0, %ymm0
vptest %ymm0, %ymm0
jne .L2
vmovdqu 32(%rsi), %ymm0
vpxorq 32(%rdi), %ymm0, %ymm0
vptest %ymm0, %ymm0
je .L5
.L2:
movl $1, %eax
xorl $1, %eax
vzeroupper
ret
After:
vmovdqu64 (%rsi), %zmm0
xorl %eax, %eax
vpcmpeqd (%rdi), %zmm0, %k0
kortestw %k0, %k0
setc %al
vzeroupper
ret
gcc/ChangeLog:
PR target/104610
* config/i386/i386-expand.cc (ix86_expand_branch): Handle
512-bit vector with vpcmpeq + kortest.
* config/i386/i386.md (cbranchxi4): New expander.
* config/i386/sse.md (cbranch<mode>4): Extend to V16SImode
and V8DImode.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr104610-2.c: New test.
^ permalink raw reply [flat|nested] 25+ messages in thread
Thread overview: 25+ messages
2022-02-21 11:00 [Bug target/104610] New: memcmp () == 0 can be optimized better for avx512f pinskia at gcc dot gnu.org
2022-02-21 11:01 ` [Bug target/104610] " pinskia at gcc dot gnu.org
2022-02-22 4:03 ` crazylht at gmail dot com
2022-02-22 5:54 ` crazylht at gmail dot com
2022-02-22 5:55 ` crazylht at gmail dot com
2022-02-22 6:18 ` crazylht at gmail dot com
2022-02-22 6:32 ` crazylht at gmail dot com
2022-02-23 2:15 ` crazylht at gmail dot com
2022-02-23 6:18 ` crazylht at gmail dot com
2022-02-23 21:04 ` hjl.tools at gmail dot com
2022-02-24 5:33 ` crazylht at gmail dot com
2022-02-24 6:00 ` hjl.tools at gmail dot com
2022-02-26 20:26 ` hjl.tools at gmail dot com
2022-02-27 19:02 ` hjl.tools at gmail dot com
2022-02-27 19:29 ` hjl.tools at gmail dot com
2022-03-04 3:03 ` hjl.tools at gmail dot com
2022-03-28 5:05 ` crazylht at gmail dot com
2022-03-28 5:27 ` crazylht at gmail dot com
2022-05-18 2:47 ` cvs-commit at gcc dot gnu.org
2022-05-18 2:49 ` crazylht at gmail dot com
2022-06-16 7:41 ` crazylht at gmail dot com
2023-06-28 10:12 ` cvs-commit at gcc dot gnu.org
2023-06-28 11:05 ` ubizjak at gmail dot com
2023-10-10 7:37 ` crazylht at gmail dot com
2023-10-30 3:10 ` cvs-commit at gcc dot gnu.org