public inbox for gcc-patches@gcc.gnu.org
From: Tamar Christina <Tamar.Christina@arm.com>
To: Philipp Tomsich <philipp.tomsich@vrull.eu>
Cc: Richard Biener <richard.guenther@gmail.com>,
	mtsamis <manolis.tsamis@vrull.eu>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
	"jiangning.liu@amperecomputing.com"
	<jiangning.liu@amperecomputing.com>
Subject: RE: [PATCH] Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases.
Date: Tue, 22 Nov 2022 10:57:22 +0000	[thread overview]
Message-ID: <VI1PR08MB5325DC2D3809669C0EA6AB19FF0D9@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <CAAeLtUC9URmq-+_9wuSanrrbf=r6tiB-pftr=Df83y6BuSSAHg@mail.gmail.com>

Hi,

> -----Original Message-----
> From: Philipp Tomsich <philipp.tomsich@vrull.eu>
> Sent: Tuesday, November 22, 2022 10:35 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; mtsamis
> <manolis.tsamis@vrull.eu>; GCC Patches <gcc-patches@gcc.gnu.org>;
> jiangning.liu@amperecomputing.com
> Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and +
> multiply to vector compare in some cases.
> 
> Richard & Tamar,
> 
> On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com>
> wrote:
> >
> > > -----Original Message-----
> > > From: Gcc-patches <gcc-patches-
> > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > Biener via Gcc-patches
> > > Sent: Friday, August 26, 2022 10:08 AM
> > > To: mtsamis <manolis.tsamis@vrull.eu>
> > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>;
> > > jiangning.liu@amperecomputing.com; Philipp Tomsich
> > > <philipp.tomsich@vrull.eu>
> > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise
> > > and + multiply to vector compare in some cases.
> > >
> > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote:
> > > >
> > > > When using SWAR (SIMD in a register) techniques, a comparison
> > > > operation within such a register can be made by using a
> > > > combination of shifts, bitwise AND, and multiplication.  If code
> > > > using this scheme is vectorized, then there is potential to replace
> > > > all these operations with a single vector comparison, by
> > > > reinterpreting the vector types to match the width of the SWAR
> > > > register.
> > > >
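> > > > To make the trick concrete, consider two 16-bit lanes packed in a
> > > > 32-bit register and an arbitrary example value a = 0x80000001
> > > > (this walkthrough is illustrative, not part of the patch):
> > > >
> > > >         a >> 15            = 0x00010000   (lane sign bits land on bits 16 and 0)
> > > >         .. & 0x00010001    = 0x00010000   (keep only those bits)
> > > >         .. * 0x0000ffff    = 0xffff0000   (fan each bit out across its lane)
> > > >
> > > > Each lane whose sign bit was set becomes all-ones and every other
> > > > lane becomes zero, i.e. a lane-wise signed "< 0" comparison.
> > > >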
> > > > For example, for the test function packed_cmp_16_32, the original
> > > generated code is:
> > > >
> > > >         ldr     q0, [x0]
> > > >         add     w1, w1, 1
> > > >         ushr    v0.4s, v0.4s, 15
> > > >         and     v0.16b, v0.16b, v2.16b
> > > >         shl     v1.4s, v0.4s, 16
> > > >         sub     v0.4s, v1.4s, v0.4s
> > > >         str     q0, [x0], 16
> > > >         cmp     w2, w1
> > > >         bhi     .L20
> > > >
> > > > with this pattern the above can be optimized to:
> > > >
> > > >         ldr     q0, [x0]
> > > >         add     w1, w1, 1
> > > >         cmlt    v0.8h, v0.8h, #0
> > > >         str     q0, [x0], 16
> > > >         cmp     w2, w1
> > > >         bhi     .L20
> > > >
> > > > The effect is similar for x86-64.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
> > > >
> > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> > > > ---
> > > >  gcc/match.pd                                  | 57 +++++++++++++++
> > > >  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72 +++++++++++++++++++
> > > >  2 files changed, 129 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 8bbc0dbd5cd..5c768a94846 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > >      (view_convert (bit_and:itype (view_convert @0)
> > > >                                  (ne @1 { build_zero_cst (type); })))))))
> > > >
> > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can
> > > > +   be constructed with a particular combination of shift, bitwise and,
> > > > +   and multiplication by constants.  If that code is vectorized we can
> > > > +   convert this pattern into a more efficient vector comparison.
> > > > +*/
> > > > +(simplify
> > > > + (mult (bit_and (rshift @0 @1) @2) @3)
> > >
> > > You should restrict the pattern a bit more; below you use
> > > uniform_integer_cst_p and also require a vector type, thus
> > >
> > >   (simplify
> > >    (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) VECTOR_CST@3)
> > >
> > >
> > > > + (with {
> > > > +   tree op_type = TREE_TYPE (@0);
> > >
> > > that's the same as 'type' which is already available.
> > >
> > > > +   tree rshift_cst = NULL_TREE;
> > > > +   tree bit_and_cst = NULL_TREE;
> > > > +   tree mult_cst = NULL_TREE;
> > > > +  }
> > > > +  /* Make sure we're working with vectors and uniform vector
> > > > +     constants.  */
> > > > +  (if (VECTOR_TYPE_P (op_type)
> > > > +       && (rshift_cst = uniform_integer_cst_p (@1))
> > > > +       && (bit_and_cst = uniform_integer_cst_p (@2))
> > > > +       && (mult_cst = uniform_integer_cst_p (@3)))
> > > > +   /* Compute what constants would be needed for this to represent a
> > > > +      packed comparison based on the shift amount denoted by RSHIFT_CST.  */
> > > > +   (with {
> > > > +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> > > > +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant ();
> > >
> > > you need to check that this isn't a VLA vector operation.
> >
> > Seems like this pattern should be applicable to VLA as well, no?
> > So could we not keep vec_nelts as a poly and just use exact_div
> > below in the division?  The pattern is only valid if vec_elem_bits
> > is a multiple of cmp_bits_i anyway.  The build_vector_* routines
> > should then do the right thing.
> 
> Seems like we never agreed on what should go into the next version.
> Am I right in assuming that applicability to VLA is ok and that we should
> primarily focus on addressing the below comments for v2?

I don't think there was a disagreement here per se.  Richard's point was that in its
current form the patch is unsafe for VLA, as to_constant () would fail.  That is,
if the idea was not to support VLA, then this needs a check, or it should use the
checking version of to_constant.

I was saying that we don't need to make this ignore VLA; the pattern should
be applicable to VLA as well.  That is to say, properly handling VLA would address
Richard's concern as well.
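
Concretely, the poly-friendly version of the size computations would look
roughly like this (an untested sketch only, reusing the variable names from
the patch):

  HOST_WIDE_INT vec_elem_bits = vector_element_bits (type);
  /* Keep the element count as a poly_uint64 instead of forcing it to a
     compile-time constant; to_constant () would fail for VLA vectors.  */
  poly_uint64 vec_nelts = TYPE_VECTOR_SUBPARTS (type);
  poly_uint64 vec_bits = vec_elem_bits * vec_nelts;
  ...
  /* cmp_bits_i must divide vec_elem_bits (checked separately), so this
     division is exact for fixed-length and VLA vectors alike.  */
  poly_uint64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i);
  tree vector_type = build_vector_type (bool_type, vector_type_nelts);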

Cheers,
Tamar

> 
> Cheers,
> Philipp.
> 
> > >
> > > > +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> > > > +
> > > > +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> > > > +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> > > > +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> > >
> > > and that the rshift_cst and others actually fit an uhwi.
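> > >
> > > (As written, the tree_to_uhwi calls in the with-block execute before
> > > the tree_fits_uhwi_p tests further down, so those tests come too late
> > > to protect the conversions.)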
> > >
> > > > +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> > > > +
> > > > +     mult_i = tree_to_uhwi (mult_cst);
> > > > +     bit_and_i = tree_to_uhwi (bit_and_cst);
> > > > +     target_bit_and_i = 0;
> > > > +
> > > > +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> > > > +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> > >
> > > it would be nice to have a comment on what this actually does ...
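> > >
> > > (If I'm reading it right, the loop sets the low bit of each
> > > cmp_bits_i-wide lane, e.g. 0x01010101 for cmp_bits_i == 8 in a 32-bit
> > > element, which is then required to match the bit_and constant.)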
> > >
> > > > +    }
> > > > +    (if ((exact_log2 (cmp_bits_i)) >= 0
> > > > +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> > > > +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> > > > +        && tree_fits_uhwi_p (rshift_cst)
> > > > +        && tree_fits_uhwi_p (mult_cst)
> > > > +        && tree_fits_uhwi_p (bit_and_cst)
> > > > +        && target_mult_i == mult_i
> > > > +        && target_bit_and_i == bit_and_i)
> > > > +     /* Compute the vector shape for the comparison and check if the
> > > > +       target is able to expand the comparison with that type.  */
> > > > +     (with {
> > > > +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> > > > +       int vector_type_nelts = vec_bits / cmp_bits_i;
> > > > +       tree vector_type = build_vector_type (bool_type, vector_type_nelts);
> > >
> > > why do you build a bool vector type here and then ...
> > >
> > > > +       tree zeros = build_zero_cst (vector_type);
> > > > +       tree mask_type = truth_type_for (vector_type);
> > >
> > > ... its truth type?  Note both might not be actually supported by
> > > the target and thus receive BLKmode or an integer mode.  The latter
> > > is a problem for expand_vec_cmp_expr_p as that might pick up a pattern
> > > not suitable here.  Also note that truth_type_for can result in a mask
> > > mode, aka QImode with AVX512 or some VnBImode on other archs - those
> > > are not OK to be simply view_converted back to op_type.  In general a
> > > vector compare operation yields a mask and you can convert that to a
> > > -1/0 value using a vec_cond_expr.  I think we have a pattern that can
> > > then properly simplify the case where this can be expressed as a
> > > view_convert, but of course you then also need to check for
> > > vec_cond_expr support.
> > >
> > > I would suggest you make 'vector_type' an integer element type (that
> > > also properly specifies the sign of the comparison!) and check you
> > > end up with a vector mode and the mode of the mask_type agrees with
> > > that if you don't want to go the vec_cond_expr route.
> > >
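> > > Untested sketch of the vec_cond_expr route, just to illustrate (with
> > > int_type standing for the integer-element vector type suggested above
> > > and zeros its zero constant; the expand_vec_cond_expr_p arguments may
> > > need adjusting):
> > >
> > >   (if (expand_vec_cmp_expr_p (int_type, mask_type, LT_EXPR)
> > >        && expand_vec_cond_expr_p (int_type, int_type, LT_EXPR))
> > >    (view_convert:op_type
> > >     (vec_cond (lt:mask_type (view_convert:int_type @0) { zeros; })
> > >               { build_minus_one_cst (int_type); }
> > >               { build_zero_cst (int_type); })))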
> > >
> > > > +      }
> > > > +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> > > > +       (view_convert:op_type (lt:mask_type (view_convert:vector_type @0)
> > > > +                                          { zeros; })))))))))
> > > > +
> > > >  (for cmp (gt ge lt le)
> > > >       outp (convert convert negate negate)
> > > >       outn (negate negate convert convert)
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > new file mode 100644
> > > > index 00000000000..26f9ad9ef28
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > @@ -0,0 +1,72 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -ftree-vectorize" } */
> > > > +
> > > > +typedef unsigned char uint8_t;
> > > > +typedef unsigned short uint16_t;
> > > > +typedef unsigned int uint32_t;
> > > > +
> > > > +/* 8-bit SWAR tests.  */
> > > > +
> > > > +static uint8_t packed_cmp_8_8(uint8_t a)
> > > > +{
> > > > +  return ((a >> 7) & 0x1U) * 0xffU;
> > > > +}
> > > > +
> > > > +/* 16-bit SWAR tests.  */
> > > > +
> > > > +static uint16_t packed_cmp_8_16(uint16_t a)
> > > > +{
> > > > +  return ((a >> 7) & 0x101U) * 0xffU;
> > > > +}
> > > > +
> > > > +static uint16_t packed_cmp_16_16(uint16_t a)
> > > > +{
> > > > +  return ((a >> 15) & 0x1U) * 0xffffU;
> > > > +}
> > > > +
> > > > +/* 32-bit SWAR tests.  */
> > > > +
> > > > +static uint32_t packed_cmp_8_32(uint32_t a)
> > > > +{
> > > > +  return ((a >> 7) & 0x1010101U) * 0xffU;
> > > > +}
> > > > +
> > > > +static uint32_t packed_cmp_16_32(uint32_t a)
> > > > +{
> > > > +  return ((a >> 15) & 0x10001U) * 0xffffU;
> > > > +}
> > > > +
> > > > +static uint32_t packed_cmp_32_32(uint32_t a)
> > > > +{
> > > > +  return ((a >> 31) & 0x1U) * 0xffffffffU;
> > > > +}
> > > > +
> > > > +/* Driver function to test the vectorized code generated for the
> > > > +   different packed_cmp variants.  */
> > > > +
> > > > +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> > > > +  void vectorized_cmp_##FUNC(T* a, int n)      \
> > > > +  {                                            \
> > > > +    n = (n / 32) * 32;                         \
> > > > +    for(int i = 0; i < n; i += 4)              \
> > > > +    {                                          \
> > > > +      a[i + 0] = FUNC(a[i + 0]);               \
> > > > +      a[i + 1] = FUNC(a[i + 1]);               \
> > > > +      a[i + 2] = FUNC(a[i + 2]);               \
> > > > +      a[i + 3] = FUNC(a[i + 3]);               \
> > > > +    }                                          \
> > > > +  }
> > > > +
> > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> > > > +
> > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> > > > +
> > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> > > > +
> > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> > > > --
> > > > 2.34.1
> > > >

Thread overview: 8+ messages
2022-08-13  9:58 mtsamis
2022-08-18  1:18 ` Hans-Peter Nilsson
2022-08-18  2:08 ` Andrew Pinski
2022-08-26  9:08 ` Richard Biener
2022-08-26 13:29   ` Tamar Christina
2022-11-22 10:34     ` Philipp Tomsich
2022-11-22 10:57       ` Tamar Christina [this message]
2022-11-29 10:11         ` Manolis Tsamis
