References: <20231027054736.3529877-1-hongtao.liu@intel.com>
From: Hongtao Liu
Date: Fri, 27 Oct 2023 15:21:55 +0800
Subject: Re: [PATCH] Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.
To: Richard Biener
Cc: liuhongt, gcc-patches@gcc.gnu.org, ubizjak@gmail.com

On Fri, Oct 27, 2023 at 2:49 PM Richard Biener wrote:
>
> > On 27.10.2023 at 07:50, liuhongt wrote:
> >
> > When 2 vectors are equal, kmask is allones and kortest will set CF,
> > else CF will be cleared.
> >
> > So CF bit can be used to check for the result of the comparison.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> Is that also profitable for 256bit aka AVX10?
Yes, it's also available for both 128-bit and 256-bit vectors with
AVX10, and it's better from a performance perspective.
AVX10: vpcmp + kortest
vs
AVX2: vpxor + vptest

vptest is more expensive than vpcmp + kortest.

> Is there a jump on carry in case the result feeds control flow rather than a value and is using ktest better then (does combine figure this out?)
There are JC and JNC. There are many pattern matches for ptest that
combine can't automatically adjust to kortest; the backend needs to
transform them manually.
That's why my patch only handles 64-byte (512-bit) vectors (to avoid
regressing those pattern matches).
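To make the flag behaviour above concrete, here is a minimal scalar sketch in C of what vpcmpeqd + kortestw compute for a 64-byte compare. It only models the semantics, it is not what GCC emits, and the names model_vpcmpeqd, model_kortestw, model_memcmpeq64 and struct flags are hypothetical, made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of "vpcmpeqd zmm, zmm, k":
   one mask bit per 32-bit lane, set when the lanes are equal.  */
static uint16_t model_vpcmpeqd (const void *a, const void *b)
{
  uint16_t k = 0;
  for (int lane = 0; lane < 16; lane++)
    {
      uint32_t x, y;
      memcpy (&x, (const unsigned char *) a + 4 * lane, sizeof x);
      memcpy (&y, (const unsigned char *) b + 4 * lane, sizeof y);
      if (x == y)
        k |= (uint16_t) (1u << lane);
    }
  return k;
}

struct flags { int cf, zf; };

/* Hypothetical scalar model of "kortestw k1, k2":
   ZF is set when the OR is all zeros, CF when it is all ones.  */
static struct flags model_kortestw (uint16_t k1, uint16_t k2)
{
  uint16_t r = (uint16_t) (k1 | k2);
  struct flags f = { r == 0xFFFF, r == 0 };
  return f;
}

/* memcmp (a, b, 64) == 0 through the model: the value "setc" would read.  */
static int model_memcmpeq64 (const void *a, const void *b)
{
  uint16_t k = model_vpcmpeqd (a, b);
  return model_kortestw (k, k).cf;
}
```

With two identical 64-byte buffers the mask is 0xFFFF and CF is set; flipping any byte clears the corresponding lane bit and CF is cleared, which is why the 512-bit path tests CF (CCCmode) while the ptest path tests ZF (CCZmode).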
>
> > Before:
> > vmovdqu (%rsi), %ymm0
> > vpxorq (%rdi), %ymm0, %ymm0
> > vptest %ymm0, %ymm0
> > jne .L2
> > vmovdqu 32(%rsi), %ymm0
> > vpxorq 32(%rdi), %ymm0, %ymm0
> > vptest %ymm0, %ymm0
> > je .L5
> > .L2:
> > movl $1, %eax
> > xorl $1, %eax
> > vzeroupper
> > ret
> >
> > After:
> > vmovdqu64 (%rsi), %zmm0
> > xorl %eax, %eax
> > vpcmpeqd (%rdi), %zmm0, %k0
> > kortestw %k0, %k0
> > setc %al
> > vzeroupper
> > ret
> >
> > gcc/ChangeLog:
> >
> > PR target/104610
> > * config/i386/i386-expand.cc (ix86_expand_branch): Handle
> > 512-bit vector with vpcmpeq + kortest.
> > * config/i386/i386.md (cbranchxi4): New expander.
> > * config/i386/sse.md: (cbranch4): Extend to V16SImode
> > and V8DImode.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr104610-2.c: New test.
> > ---
> > gcc/config/i386/i386-expand.cc | 55 +++++++++++++++-------
> > gcc/config/i386/i386.md | 16 +++++++
> > gcc/config/i386/sse.md | 36 +++++++++++---
> > gcc/testsuite/gcc.target/i386/pr104610-2.c | 14 ++++++
> > 4 files changed, 99 insertions(+), 22 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.target/i386/pr104610-2.c
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 1eae9d7c78c..c664cb61e80 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -2411,30 +2411,53 @@ ix86_expand_branch (enum rtx_code code, rtx op0, rtx op1, rtx label)
> > rtx tmp;
> >
> > /* Handle special case - vector comparsion with boolean result, transform
> > - it using ptest instruction. */
> > + it using ptest instruction or vpcmpeq + kortest. */
> > if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> > || (mode == TImode && !TARGET_64BIT)
> > - || mode == OImode)
> > + || mode == OImode
> > + || GET_MODE_SIZE (mode) == 64)
> > {
> > - rtx flag = gen_rtx_REG (CCZmode, FLAGS_REG);
> > - machine_mode p_mode = GET_MODE_SIZE (mode) == 32 ? V4DImode : V2DImode;
> > + unsigned msize = GET_MODE_SIZE (mode);
> > + machine_mode p_mode
> > + = msize == 64 ? V16SImode : msize == 32 ? V4DImode : V2DImode;
> > + /* kortest set CF when result is 0xFFFF (op0 == op1). */
> > + rtx flag = gen_rtx_REG (msize == 64 ? CCCmode : CCZmode, FLAGS_REG);
> >
> > gcc_assert (code == EQ || code == NE);
> >
> > - if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> > + /* Using vpcmpeq zmm zmm k + kortest for 512-bit vectors. */
> > + if (msize == 64)
> > {
> > - op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> > - op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> > - mode = p_mode;
> > + if (mode != V16SImode)
> > + {
> > + op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> > + op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> > + }
> > +
> > + tmp = gen_reg_rtx (HImode);
> > + emit_insn (gen_avx512f_cmpv16si3 (tmp, op0, op1, GEN_INT (0)));
> > + emit_insn (gen_kortesthi_ccc (tmp, tmp));
> > + }
> > + /* Using ptest for 128/256-bit vectors. */
> > + else
> > + {
> > + if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> > + {
> > + op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> > + op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> > + mode = p_mode;
> > + }
> > +
> > + /* Generate XOR since we can't check that one operand is zero
> > + vector. */
> > + tmp = gen_reg_rtx (mode);
> > + emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
> > + tmp = gen_lowpart (p_mode, tmp);
> > + emit_insn (gen_rtx_SET (gen_rtx_REG (CCZmode, FLAGS_REG),
> > + gen_rtx_UNSPEC (CCZmode,
> > + gen_rtvec (2, tmp, tmp),
> > + UNSPEC_PTEST)));
> > }
> > - /* Generate XOR since we can't check that one operand is zero vector. */
> > - tmp = gen_reg_rtx (mode);
> > - emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
> > - tmp = gen_lowpart (p_mode, tmp);
> > - emit_insn (gen_rtx_SET (gen_rtx_REG (CCZmode, FLAGS_REG),
> > - gen_rtx_UNSPEC (CCZmode,
> > - gen_rtvec (2, tmp, tmp),
> > - UNSPEC_PTEST)));
> > tmp = gen_rtx_fmt_ee (code, VOIDmode, flag, const0_rtx);
> > tmp = gen_rtx_IF_THEN_ELSE (VOIDmode, tmp,
> > gen_rtx_LABEL_REF (VOIDmode, label),
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index abaf2f311e8..51d8d0c3b97 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -1442,6 +1442,22 @@ (define_expand "cbranchoi4"
> > DONE;
> > })
> >
> > +(define_expand "cbranchxi4"
> > + [(set (reg:CC FLAGS_REG)
> > + (compare:CC (match_operand:XI 1 "nonimmediate_operand")
> > + (match_operand:XI 2 "nonimmediate_operand")))
> > + (set (pc) (if_then_else
> > + (match_operator 0 "bt_comparison_operator"
> > + [(reg:CC FLAGS_REG) (const_int 0)])
> > + (label_ref (match_operand 3))
> > + (pc)))]
> > + "TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256"
> > +{
> > + ix86_expand_branch (GET_CODE (operands[0]),
> > + operands[1], operands[2], operands[3]);
> > + DONE;
> > +})
> > +
> > (define_expand "cstore4"
> > [(set (reg:CC FLAGS_REG)
> > (compare:CC (match_operand:SDWIM 2 "nonimmediate_operand")
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index c988935d4df..88fb1154699 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -2175,9 +2175,9 @@ (define_insn "ktest"
> > (set_attr "type" "msklog")
> > (set_attr "prefix" "vex")])
> >
> > -(define_insn "kortest"
> > - [(set (reg:CC FLAGS_REG)
> > - (unspec:CC
> > +(define_insn "*kortest"
> > + [(set (reg FLAGS_REG)
> > + (unspec
> > [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand" "k")
> > (match_operand:SWI1248_AVX512BWDQ 1 "register_operand" "k")]
> > UNSPEC_KORTEST))]
> > @@ -2187,6 +2187,30 @@ (define_insn "kortest"
> > (set_attr "type" "msklog")
> > (set_attr "prefix" "vex")])
> >
> > +(define_insn "kortest_ccc"
> > + [(set (reg:CCC FLAGS_REG)
> > + (unspec:CCC
> > + [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand")
> > + (match_operand:SWI1248_AVX512BWDQ 1 "register_operand")]
> > + UNSPEC_KORTEST))]
> > + "TARGET_AVX512F")
> > +
> > +(define_insn "kortest_ccz"
> > + [(set (reg:CCZ FLAGS_REG)
> > + (unspec:CCZ
> > + [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand")
> > + (match_operand:SWI1248_AVX512BWDQ 1 "register_operand")]
> > + UNSPEC_KORTEST))]
> > + "TARGET_AVX512F")
> > +
> > +(define_expand "kortest"
> > + [(set (reg:CC FLAGS_REG)
> > + (unspec:CC
> > + [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand")
> > + (match_operand:SWI1248_AVX512BWDQ 1 "register_operand")]
> > + UNSPEC_KORTEST))]
> > + "TARGET_AVX512F")
> > +
> > (define_insn "kunpckhi"
> > [(set (match_operand:HI 0 "register_operand" "=k")
> > (ior:HI
> > @@ -27840,14 +27864,14 @@ (define_insn "_store_mask"
> >
> > (define_expand "cbranch4"
> > [(set (reg:CC FLAGS_REG)
> > - (compare:CC (match_operand:VI48_AVX 1 "register_operand")
> > - (match_operand:VI48_AVX 2 "nonimmediate_operand")))
> > + (compare:CC (match_operand:VI48_AVX_AVX512F 1 "register_operand")
> > + (match_operand:VI48_AVX_AVX512F 2 "nonimmediate_operand")))
> > (set (pc) (if_then_else
> > (match_operator 0 "bt_comparison_operator"
> > [(reg:CC FLAGS_REG) (const_int 0)])
> > (label_ref (match_operand 3))
> > (pc)))]
> > - "TARGET_SSE4_1"
> > + "TARGET_SSE4_1 && ( != 64 || !TARGET_PREFER_AVX256)"
> > {
> > ix86_expand_branch (GET_CODE (operands[0]),
> > operands[1], operands[2], operands[3]);
> > diff --git a/gcc/testsuite/gcc.target/i386/pr104610-2.c b/gcc/testsuite/gcc.target/i386/pr104610-2.c
> > new file mode 100644
> > index 00000000000..999ef926a18
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr104610-2.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx512f -O2 -mtune=generic" } */
> > +/* { dg-final { scan-assembler-times {(?n)vpcmpeq.*zmm} 2 } } */
> > +/* { dg-final { scan-assembler-times {(?n)kortest.*k[0-7]} 2 } } */
> > +
> > +int compare (const char* s1, const char* s2)
> > +{
> > + return __builtin_memcmp (s1, s2, 64) == 0;
> > +}
> > +
> > +int compare1 (const char* s1, const char* s2)
> > +{
> > + return __builtin_memcmp (s1, s2, 64) != 0;
> > +}
> > --
> > 2.31.1
> >

-- 
BR,
Hongtao
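
The new test is compile-only (it just scans the assembly). As a rough sketch of how the two functions could also be checked at run time, something like the following could be used. check_at is a hypothetical helper invented here and is not part of the patch or the GCC testsuite; compare and compare1 are copied from pr104610-2.c.

```c
#include <string.h>

/* Same bodies as the functions in pr104610-2.c.  */
int compare (const char *s1, const char *s2)
{
  return __builtin_memcmp (s1, s2, 64) == 0;
}

int compare1 (const char *s1, const char *s2)
{
  return __builtin_memcmp (s1, s2, 64) != 0;
}

/* Hypothetical helper: build two 64-byte buffers that differ only at
   byte POS (or are identical when POS < 0) and return 1 when compare
   and compare1 both report the expected result.  */
int check_at (int pos)
{
  char a[64], b[64];
  memset (a, 'x', sizeof a);
  memcpy (b, a, sizeof b);
  if (pos >= 0)
    b[pos] = 'y';
  int expect_equal = pos < 0;
  return compare (a, b) == expect_equal
         && compare1 (a, b) == !expect_equal;
}
```

Calling check_at with -1, 0 and 63 covers the equal case and a difference in the first and last lane. When built with -mavx512f -O2, both functions should go through the vpcmpeq + kortest sequence, but running that binary needs AVX512F hardware.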