From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=ZbTK=36=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lj1-x232.google.com (mail-lj1-x232.google.com [IPv6:2a00:1450:4864:20::232])
	by sourceware.org (Postfix) with ESMTPS id 62A7D3858D28
	for <gcc-patches@gcc.gnu.org>; Wed, 30 Nov 2022 07:44:09 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 62A7D3858D28
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x232.google.com with SMTP id z24so19940465ljn.4
        for <gcc-patches@gcc.gnu.org>; Tue, 29 Nov 2022 23:44:09 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=H1snxqkopTuwedybImBKXUtQ85bPCu/EwjaihhybPok=;
        b=KRw+AbydzztWQVDDOpp/xBd6z74ddNgCARopPRKZr/uDyIDYqqros0MI660Kl4m4N3
         PDOP0/pW43Y5xzdv+ki/vuKi/OJ3B119s+1bkaJZoYrpmH2ldLU9z9Gkb7CMWcoCbI9o
         k5vKERlOtWOxIGeji4IEt9DIONR80ZesJ0XK0l/PKKKOVVmaOevJSHd+Uu5loeOJygF2
         bDfw/9RijKmYBVmhA+9XxN96EHPI2m2uQHXc4MRJFSyw+fMersC8/aQOodIqT+YVlCrn
         K/PDgHTItWleqWVd0oUN8SSih+w5Gi+O4NqO/juvsEha9/FDMNiWGGNg5n5h1CDNlKrW
         Valw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=H1snxqkopTuwedybImBKXUtQ85bPCu/EwjaihhybPok=;
        b=r8ZKm2/uYTT8VChF3v9D7CLAFrvGfGcJDYKSvUfNGmHhJGHMPLs5cK1qu4v/w++Um2
         U8WaINJNDJXQndR6w/K/utmsT3h13ukgY9ZCjtiHSOgOQCT/nd/OSU5qn945Zipw9abK
         ckLM62JEiB7PBmwndluLe3Up/Rv8SivrlJL0RbI1ajBkYzIEPyB6uZgPfa/sbZZJoLc3
         59xbgepCsuCdYuKp2ngOvK7w6t5KjKXm6izoD8T//F+bzCGYU38yiRGnzdWDSrxxQfV1
         b4SVf5SIjsW0wqLrQv0zk8+UiDN2H5ASw11iYB10vn2xuizWuTKqkv6yeX2bRYJ5aHk0
         v5Sg==
X-Gm-Message-State: ANoB5pmy1PeaMJinJucpEyrJvzhbBuVqInCSU6qVfiF4t3oFNU4ceba0
	YAY3M6cHjshjg2RhZkzOnBJIsCCSCuiddHkp7I4=
X-Google-Smtp-Source: AA0mqf6Gym+KzSvUbxSKuv8Ct9QK6fc60bYuFRgZ7O8wmz33P880V4BfR+MqEE43FPB1U84IABeOFycgURHyh2snlsg=
X-Received: by 2002:a2e:b5b3:0:b0:279:86ec:a686 with SMTP id
 f19-20020a2eb5b3000000b0027986eca686mr9287224ljn.399.1669794247614; Tue, 29
 Nov 2022 23:44:07 -0800 (PST)
MIME-Version: 1.0
References: <20221129100446.3875697-1-manolis.tsamis@vrull.eu>
In-Reply-To: <20221129100446.3875697-1-manolis.tsamis@vrull.eu>
From: Richard Biener <richard.guenther@gmail.com>
Date: Wed, 30 Nov 2022 08:43:55 +0100
Message-ID: <CAFiYyc2zi9twchGSqKjvK0OYydA_gKbuane3UorYQ0X_gHr1GQ@mail.gmail.com>
Subject: Re: [PATCH v2] Add pattern to convert vector shift + bitwise and +
 multiply to vector compare in some cases.
To: Manolis Tsamis <manolis.tsamis@vrull.eu>
Cc: gcc-patches@gcc.gnu.org, Philipp Tomsich <philipp.tomsich@vrull.eu>, 
	Tamar Christina <Tamar.Christina@arm.com>, jiangning.liu@amperecomputing.com, 
	Christoph Muellner <christoph.muellner@vrull.eu>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote:
>
> When using SWAR (SIMD in a register) techniques a comparison operation within
> such a register can be made by using a combination of shifts, bitwise and and
> multiplication. If code using this scheme is vectorized then there is potential
> to replace all these operations with a single vector comparison, by reinterpreting
> the vector types to match the width of the SWAR register.
>
> For example, for the test function packed_cmp_16_32, the original generated code is:
>
>         ldr     q0, [x0]
>         add     w1, w1, 1
>         ushr    v0.4s, v0.4s, 15
>         and     v0.16b, v0.16b, v2.16b
>         shl     v1.4s, v0.4s, 16
>         sub     v0.4s, v1.4s, v0.4s
>         str     q0, [x0], 16
>         cmp     w2, w1
>         bhi     .L20
>
> with this pattern the above can be optimized to:
>
>         ldr     q0, [x0]
>         add     w1, w1, 1
>         cmlt    v0.8h, v0.8h, #0
>         str     q0, [x0], 16
>         cmp     w2, w1
>         bhi     .L20
>
> The effect is similar for x86-64.
>
> Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
>
> gcc/ChangeLog:
>
>         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
>
> ---
>
> Changes in v2:
>         - Changed pattern to use vec_cond_expr.
>         - Changed pattern to work with VLA vector.
>         - Added more checks and comments.
>
>  gcc/match.pd                                  | 60 ++++++++++++++++
>  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72 +++++++++++++++++++
>  2 files changed, 132 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 67a0a682f31..05e7fc79ba8 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>      (view_convert (bit_and:itype (view_convert @0)
>                                  (ne @1 { build_zero_cst (type); })))))))
>
> +/* In SWAR (SIMD in a register) code a signed comparison of packed data can
> +   be constructed with a particular combination of shift, bitwise and,
> +   and multiplication by constants.  If that code is vectorized we can
> +   convert this pattern into a more efficient vector comparison.  */
> +(simplify
> + (mult (bit_and (rshift @0 uniform_integer_cst_p@1)
> +           uniform_integer_cst_p@2)
> +    uniform_integer_cst_p@3)

Please use VECTOR_CST in the match instead of uniform_integer_cst_p
and instead ...

> + (with {
> +   tree rshift_cst = uniform_integer_cst_p (@1);
> +   tree bit_and_cst = uniform_integer_cst_p (@2);
> +   tree mult_cst = uniform_integer_cst_p (@3);
> +  }
> +  /* Make sure we're working with vectors and uniform vector constants.  */
> +  (if (VECTOR_TYPE_P (type)

... test for non-NULL *_cst here where you can use uniform_vector_p instead
of uniform_integer_cst_p.  You can elide the VECTOR_TYPE_P check then
and instead do INTEGRAL_TYPE_P (TREE_TYPE (type)).

> +       && tree_fits_uhwi_p (rshift_cst)
> +       && tree_fits_uhwi_p (mult_cst)
> +       && tree_fits_uhwi_p (bit_and_cst))
> +   /* Compute what constants would be needed for this to represent a packed
> +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> +   (with {
> +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (type);
> +     poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type);
> +     poly_int64 vec_bits = vec_elem_bits * vec_nelts;
> +
> +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> +
> +     mult_i = tree_to_uhwi (mult_cst);
> +     bit_and_i = tree_to_uhwi (bit_and_cst);
> +     target_bit_and_i = 0;
> +
> +     /* The bit pattern in BIT_AND_I should be a mask for the least
> +        significant bit of each packed element that is CMP_BITS wide.  */
> +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> +    }
> +    (if ((exact_log2 (cmp_bits_i)) >= 0
> +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> +        && multiple_p (vec_bits, cmp_bits_i)
> +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> +        && target_mult_i == mult_i
> +        && target_bit_and_i == bit_and_i)
> +     /* Compute the vector shape for the comparison and check if the target is
> +       able to expand the comparison with that type.  */
> +     (with {
> +       /* We're doing a signed comparison.  */
> +       tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0);
> +       poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i);
> +       tree vector_cmp_type = build_vector_type (cmp_type, vector_type_nelts);
> +       tree zeros = build_zero_cst (vector_cmp_type);
> +       tree ones = build_all_ones_cst (vector_cmp_type);
> +      }
> +      (if (expand_vec_cmp_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR))
> +       (view_convert:type (vec_cond (lt (view_convert:vector_cmp_type @0)
> +                                    { zeros; })
> +                          { ones; } { zeros; })))))))))

You are testing whether we can expand the comparison but not whether we can
expand the vector condition here?  The set of combinations supported are
tricky, I think your test is conservatively OK but it might fail to match AVX512
and will fail targets with masks (SVE, GCN?).  I think adding

  || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR)

would fix that (but there will be extra cost in producing the final
result vector
from the comparison mask).

I do wonder if, as this is targeted at vectorization, this shouldn't
be a vectorizer pattern instead of a post-processing transform?  That
way we would
get the costing in the vectorizer correct and not rely on integer
vector shift or
multiplication and also get the cost of producing the result vector correct?  I
can't fully decipher the trick though, but assume that vector_cmp_type
always has
smaller elements than 'type'.

Thanks,
Richard.

> +
>  (for cmp (gt ge lt le)
>       outp (convert convert negate negate)
>       outn (negate negate convert convert)
> diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> new file mode 100644
> index 00000000000..26f9ad9ef28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> @@ -0,0 +1,72 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize" } */
> +
> +typedef unsigned char uint8_t;
> +typedef unsigned short uint16_t;
> +typedef unsigned int uint32_t;
> +
> +/* 8-bit SWAR tests.  */
> +
> +static uint8_t packed_cmp_8_8(uint8_t a)
> +{
> +  return ((a >> 7) & 0x1U) * 0xffU;
> +}
> +
> +/* 16-bit SWAR tests.  */
> +
> +static uint16_t packed_cmp_8_16(uint16_t a)
> +{
> +  return ((a >> 7) & 0x101U) * 0xffU;
> +}
> +
> +static uint16_t packed_cmp_16_16(uint16_t a)
> +{
> +  return ((a >> 15) & 0x1U) * 0xffffU;
> +}
> +
> +/* 32-bit SWAR tests.  */
> +
> +static uint32_t packed_cmp_8_32(uint32_t a)
> +{
> +  return ((a >> 7) & 0x1010101U) * 0xffU;
> +}
> +
> +static uint32_t packed_cmp_16_32(uint32_t a)
> +{
> +  return ((a >> 15) & 0x10001U) * 0xffffU;
> +}
> +
> +static uint32_t packed_cmp_32_32(uint32_t a)
> +{
> +  return ((a >> 31) & 0x1U) * 0xffffffffU;
> +}
> +
> +/* Driver function to test the vectorized code generated for the different
> +   packed_cmp variants.  */
> +
> +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> +  void vectorized_cmp_##FUNC(T* a, int n)      \
> +  {                                            \
> +    n = (n / 32) * 32;                         \
> +    for(int i = 0; i < n; i += 4)              \
> +    {                                          \
> +      a[i + 0] = FUNC(a[i + 0]);               \
> +      a[i + 1] = FUNC(a[i + 1]);               \
> +      a[i + 2] = FUNC(a[i + 2]);               \
> +      a[i + 3] = FUNC(a[i + 3]);               \
> +    }                                          \
> +  }
> +
> +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> +
> +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> +
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> +
> +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> --
> 2.34.1
>