From: Richard Biener
Subject: Re: [PATCH] Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.
Date: Fri, 27 Oct 2023 08:49:30 +0200
References: <20231027054736.3529877-1-hongtao.liu@intel.com>
In-Reply-To: <20231027054736.3529877-1-hongtao.liu@intel.com>
To: liuhongt
Cc: gcc-patches@gcc.gnu.org, ubizjak@gmail.com

> On 27.10.2023 at 07:50, liuhongt wrote:
>
> When 2 vectors are equal, kmask is allones and kortest will set CF,
> else CF will be cleared.
>
> So CF bit can be used to check for the result of the comparison.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

Is that also profitable for 256-bit, aka AVX10? Is there a jump on carry in
case the result feeds control flow rather than a value, and is using ktest
better then (does combine figure this out)?

> Before:
>         vmovdqu (%rsi), %ymm0
>         vpxorq  (%rdi), %ymm0, %ymm0
>         vptest  %ymm0, %ymm0
>         jne     .L2
>         vmovdqu 32(%rsi), %ymm0
>         vpxorq  32(%rdi), %ymm0, %ymm0
>         vptest  %ymm0, %ymm0
>         je      .L5
> .L2:
>         movl    $1, %eax
>         xorl    $1, %eax
>         vzeroupper
>         ret
>
> After:
>         vmovdqu64 (%rsi), %zmm0
>         xorl    %eax, %eax
>         vpcmpeqd (%rdi), %zmm0, %k0
>         kortestw %k0, %k0
>         setc    %al
>         vzeroupper
>         ret
>
> gcc/ChangeLog:
>
>     PR target/104610
>     * config/i386/i386-expand.cc (ix86_expand_branch): Handle
>     512-bit vector with vpcmpeq + kortest.
>     * config/i386/i386.md (cbranchxi4): New expander.
>     * config/i386/sse.md: (cbranch4): Extend to V16SImode
>     and V8DImode.
>
> gcc/testsuite/ChangeLog:
>
>     * gcc.target/i386/pr104610-2.c: New test.
> ---
> gcc/config/i386/i386-expand.cc             | 55 +++++++++++++++-------
> gcc/config/i386/i386.md                    | 16 +++++++
> gcc/config/i386/sse.md                     | 36 +++++++++++---
> gcc/testsuite/gcc.target/i386/pr104610-2.c | 14 ++++++
> 4 files changed, 99 insertions(+), 22 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/i386/pr104610-2.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 1eae9d7c78c..c664cb61e80 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2411,30 +2411,53 @@ ix86_expand_branch (enum rtx_code code, rtx op0, rtx op1, rtx label)
>    rtx tmp;
>
>    /* Handle special case - vector comparsion with boolean result, transform
> -     it using ptest instruction.  */
> +     it using ptest instruction or vpcmpeq + kortest.  */
>    if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
>        || (mode == TImode && !TARGET_64BIT)
> -      || mode == OImode)
> +      || mode == OImode
> +      || GET_MODE_SIZE (mode) == 64)
>      {
> -      rtx flag = gen_rtx_REG (CCZmode, FLAGS_REG);
> -      machine_mode p_mode = GET_MODE_SIZE (mode) == 32 ? V4DImode : V2DImode;
> +      unsigned msize = GET_MODE_SIZE (mode);
> +      machine_mode p_mode
> +        = msize == 64 ? V16SImode : msize == 32 ? V4DImode : V2DImode;
> +      /* kortest set CF when result is 0xFFFF (op0 == op1).  */
> +      rtx flag = gen_rtx_REG (msize == 64 ? CCCmode : CCZmode, FLAGS_REG);
>
>        gcc_assert (code == EQ || code == NE);
>
> -      if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> +      /* Using vpcmpeq zmm zmm k + kortest for 512-bit vectors.  */
> +      if (msize == 64)
>          {
> -          op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> -          op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> -          mode = p_mode;
> +          if (mode != V16SImode)
> +            {
> +              op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> +              op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> +            }
> +
> +          tmp = gen_reg_rtx (HImode);
> +          emit_insn (gen_avx512f_cmpv16si3 (tmp, op0, op1, GEN_INT (0)));
> +          emit_insn (gen_kortesthi_ccc (tmp, tmp));
> +        }
> +      /* Using ptest for 128/256-bit vectors.  */
> +      else
> +        {
> +          if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> +            {
> +              op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> +              op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> +              mode = p_mode;
> +            }
> +
> +          /* Generate XOR since we can't check that one operand is zero
> +             vector.  */
> +          tmp = gen_reg_rtx (mode);
> +          emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
> +          tmp = gen_lowpart (p_mode, tmp);
> +          emit_insn (gen_rtx_SET (gen_rtx_REG (CCZmode, FLAGS_REG),
> +                                  gen_rtx_UNSPEC (CCZmode,
> +                                                  gen_rtvec (2, tmp, tmp),
> +                                                  UNSPEC_PTEST)));
>          }
> -      /* Generate XOR since we can't check that one operand is zero vector.  */
> -      tmp = gen_reg_rtx (mode);
> -      emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
> -      tmp = gen_lowpart (p_mode, tmp);
> -      emit_insn (gen_rtx_SET (gen_rtx_REG (CCZmode, FLAGS_REG),
> -                              gen_rtx_UNSPEC (CCZmode,
> -                                              gen_rtvec (2, tmp, tmp),
> -                                              UNSPEC_PTEST)));
>        tmp = gen_rtx_fmt_ee (code, VOIDmode, flag, const0_rtx);
>        tmp = gen_rtx_IF_THEN_ELSE (VOIDmode, tmp,
>                                    gen_rtx_LABEL_REF (VOIDmode, label),
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index abaf2f311e8..51d8d0c3b97 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -1442,6 +1442,22 @@ (define_expand "cbranchoi4"
>    DONE;
>  })
>
> +(define_expand "cbranchxi4"
> +  [(set (reg:CC FLAGS_REG)
> +        (compare:CC (match_operand:XI 1 "nonimmediate_operand")
> +                    (match_operand:XI 2 "nonimmediate_operand")))
> +   (set (pc) (if_then_else
> +               (match_operator 0 "bt_comparison_operator"
> +                [(reg:CC FLAGS_REG) (const_int 0)])
> +               (label_ref (match_operand 3))
> +               (pc)))]
> +  "TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256"
> +{
> +  ix86_expand_branch (GET_CODE (operands[0]),
> +                      operands[1], operands[2], operands[3]);
> +  DONE;
> +})
> +
>  (define_expand "cstore4"
>    [(set (reg:CC FLAGS_REG)
>          (compare:CC (match_operand:SDWIM 2 "nonimmediate_operand")
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index c988935d4df..88fb1154699 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2175,9 +2175,9 @@ (define_insn "ktest"
>     (set_attr "type" "msklog")
>     (set_attr "prefix" "vex")])
>
> -(define_insn "kortest"
> -  [(set (reg:CC FLAGS_REG)
> -        (unspec:CC
> +(define_insn "*kortest"
> +  [(set (reg FLAGS_REG)
> +        (unspec
>            [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand" "k")
>             (match_operand:SWI1248_AVX512BWDQ 1 "register_operand" "k")]
>            UNSPEC_KORTEST))]
> @@ -2187,6 +2187,30 @@ (define_insn "kortest"
>     (set_attr "type" "msklog")
>     (set_attr "prefix" "vex")])
>
> +(define_insn "kortest_ccc"
> +  [(set (reg:CCC FLAGS_REG)
> +        (unspec:CCC
> +          [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand")
> +           (match_operand:SWI1248_AVX512BWDQ 1 "register_operand")]
> +          UNSPEC_KORTEST))]
> +  "TARGET_AVX512F")
> +
> +(define_insn "kortest_ccz"
> +  [(set (reg:CCZ FLAGS_REG)
> +        (unspec:CCZ
> +          [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand")
> +           (match_operand:SWI1248_AVX512BWDQ 1 "register_operand")]
> +          UNSPEC_KORTEST))]
> +  "TARGET_AVX512F")
> +
> +(define_expand "kortest"
> +  [(set (reg:CC FLAGS_REG)
> +        (unspec:CC
> +          [(match_operand:SWI1248_AVX512BWDQ 0 "register_operand")
> +           (match_operand:SWI1248_AVX512BWDQ 1 "register_operand")]
> +          UNSPEC_KORTEST))]
> +  "TARGET_AVX512F")
> +
>  (define_insn "kunpckhi"
>    [(set (match_operand:HI 0 "register_operand" "=k")
>          (ior:HI
> @@ -27840,14 +27864,14 @@ (define_insn "_store_mask"
>
>  (define_expand "cbranch4"
>    [(set (reg:CC FLAGS_REG)
> -        (compare:CC (match_operand:VI48_AVX 1 "register_operand")
> -                    (match_operand:VI48_AVX 2 "nonimmediate_operand")))
> +        (compare:CC (match_operand:VI48_AVX_AVX512F 1 "register_operand")
> +                    (match_operand:VI48_AVX_AVX512F 2 "nonimmediate_operand")))
>     (set (pc) (if_then_else
>                 (match_operator 0 "bt_comparison_operator"
>                  [(reg:CC FLAGS_REG) (const_int 0)])
>                 (label_ref (match_operand 3))
>                 (pc)))]
> -  "TARGET_SSE4_1"
> +  "TARGET_SSE4_1 && ( != 64 || !TARGET_PREFER_AVX256)"
>  {
>    ix86_expand_branch (GET_CODE (operands[0]),
>                        operands[1], operands[2], operands[3]);
> diff --git a/gcc/testsuite/gcc.target/i386/pr104610-2.c b/gcc/testsuite/gcc.target/i386/pr104610-2.c
> new file mode 100644
> index 00000000000..999ef926a18
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr104610-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512f -O2 -mtune=generic" } */
> +/* { dg-final { scan-assembler-times {(?n)vpcmpeq.*zmm} 2 } } */
> +/* { dg-final { scan-assembler-times {(?n)kortest.*k[0-7]} 2 } } */
> +
> +int compare (const char* s1, const char* s2)
> +{
> +  return __builtin_memcmp (s1, s2, 64) == 0;
> +}
> +
> +int compare1 (const char* s1, const char* s2)
> +{
> +  return __builtin_memcmp (s1, s2, 64) != 0;
> +}
> --
> 2.31.1
>
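
For readers without AVX-512 hardware at hand, the CF reasoning in the patch description can be modelled in portable C. This is only a rough sketch, not code from the patch: the helper names (model_vpcmpeqd, model_kortestw_cf, model_memcmpeq64) are made up for illustration, and the model assumes the 512-bit compare yields one mask bit per 32-bit lane, with kortestw setting CF when the OR of its two mask operands is 0xFFFF.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of vpcmpeqd on a 512-bit vector: sixteen 32-bit
   lane compares, bit i of the result is set when lane i is equal.  */
static uint16_t
model_vpcmpeqd (const void *a, const void *b)
{
  uint32_t x[16], y[16];
  uint16_t mask = 0;

  memcpy (x, a, sizeof x);
  memcpy (y, b, sizeof y);
  for (int i = 0; i < 16; i++)
    if (x[i] == y[i])
      mask |= (uint16_t) (1u << i);
  return mask;
}

/* Hypothetical model of the CF output of kortestw: CF is set when the
   OR of the two 16-bit masks is all ones.  */
static int
model_kortestw_cf (uint16_t k1, uint16_t k2)
{
  return (uint16_t) (k1 | k2) == 0xFFFF;
}

/* Models the "After" sequence: vpcmpeqd + kortestw %k0, %k0 + setc.
   Returns 1 when the two 64-byte blocks are equal, else 0.  */
static int
model_memcmpeq64 (const void *a, const void *b)
{
  uint16_t k = model_vpcmpeqd (a, b);
  return model_kortestw_cf (k, k);
}
```

Calling model_memcmpeq64 on two identical 64-byte buffers gives 1, and flipping any single byte in one of them gives 0, which matches what __builtin_memcmp (s1, s2, 64) == 0 is expected to return in the new test.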