From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.guenther@gmail.com>
Received: from mail-ed1-x533.google.com (mail-ed1-x533.google.com [IPv6:2a00:1450:4864:20::533])
	by sourceware.org (Postfix) with ESMTPS id 53B92385782C
	for <gcc-patches@gcc.gnu.org>; Tue,  8 Nov 2022 14:37:50 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 53B92385782C
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-ed1-x533.google.com with SMTP id u24so22716387edd.13
        for <gcc-patches@gcc.gnu.org>; Tue, 08 Nov 2022 06:37:50 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=68AWLD1jCimdYJ7DyuKzWD/UnoDwKvfv0S+4MU5vt0g=;
        b=G/s5LWqEvrDT7EawmEt3wns03N54/QWBQ2b/NIR0ViHCnQEWjTELtqvpBUIHOcWOiU
         yFM9sIetLtm3+OM3xivn8SuDzNdS0gjArG4x5ySt2iuEeu2goj99tNvunD9h+Nme1r/i
         zM17vUO07YkaUbg7R9++0tehYRtommmU8fFXoBP8nkfIeQ1JkoXdySDuuKNo7pvxec6o
         LIBr3sqCU5txOLORNKAXfYQD9YD2oq+smRc30Oefc+7rHYyuDM99IRpXPNkScWmAQR4t
         QtiSzFQxdBvCZ8TASrJJPQX3QthRkxkD8T8WoY4K9v8yfSpBZAzeLImto4XBexoZN115
         BEMw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=68AWLD1jCimdYJ7DyuKzWD/UnoDwKvfv0S+4MU5vt0g=;
        b=ogHQMXj/vcKUIJ7WkapjgBOEmx7BcD/NPZNd/DPm+xIGPy8mIoxr00y2mz7UQMEVLZ
         kgOiMTBM6EukE6KLDvyEvf4mXlOcTAOjnCDwIIcaua0zrUOH0fpfHS6hPm/4PG6YVXBE
         iExLDmErmcEoL3P272EIOLU9d2X6UHmqGxOtAxHGHlOu8bDfJluUrp004bMto7MCP+dh
         FUFZIsGbrFR6S9VCzYZwyTIBu2wbgP0qfsegc9EatJFSOLMrCsDr2wyp0EUMndNOpPMw
         Stb1/VMii4fDI5EwQ0atr1dICqnzxEOfp1BTrP8EuJ8zSUHhaiFYLOm1YVrmAoEPIPMs
         DweQ==
X-Gm-Message-State: ACrzQf0VYHEKMZhW7G8ChhiO7nXZgmG8IcKYNcUdYAZtvrwQGAmtrLfs
	luIt1Ovf5RCdJRP1ML3dOEthG9GCr1WCBBpVO10=
X-Google-Smtp-Source: AMsMyM7REeNM68CpUt+1ciz0XqX9qwuaFKXX3eHg37BHrPLFE6M1sEyBEqCYE7ZzW3Jj2Fs05JnKjuEw8LEwecgGYPo=
X-Received: by 2002:a05:6402:3457:b0:463:2017:ae64 with SMTP id
 l23-20020a056402345700b004632017ae64mr52347512edc.218.1667918268781; Tue, 08
 Nov 2022 06:37:48 -0800 (PST)
MIME-Version: 1.0
References: <20221104000432.15254-1-hongyu.wang@intel.com> <CAAgBjM=uZ_+057TkSd9wHXfs3D1770GfRsbeR7QfCWpfo3o8FQ@mail.gmail.com>
In-Reply-To: <CAAgBjM=uZ_+057TkSd9wHXfs3D1770GfRsbeR7QfCWpfo3o8FQ@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 8 Nov 2022 15:37:36 +0100
Message-ID: <CAFiYyc3cckQfGUtnwua2s=XQnz-fovLc=vuHOFjweV1uUS_dow@mail.gmail.com>
Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and
 operation [PR98167]
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>, 
	Richard Sandiford <richard.sandiford@arm.com>
Cc: Hongyu Wang <hongyu.wang@intel.com>, hongtao.liu@intel.com, gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-8.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > This is a follow-up patch for PR98167
> >
> > The sequence
> >      c1 = VEC_PERM_EXPR (a, a, mask)
> >      c2 = VEC_PERM_EXPR (b, b, mask)
> >      c3 = c1 op c2
> > can be optimized to
> >      c = a op b
> >      c3 = VEC_PERM_EXPR (c, c, mask)
> > for all integer vector operation, and float operation with
> > full permutation.
> >
> > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/98167
> >         * match.pd: New perm + vector op patterns for int and fp vector.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/98167
> >         * gcc.target/i386/pr98167.c: New test.
> > ---
> >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> >  2 files changed, 93 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 194ba8f5188..b85ad34f609 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -8189,3 +8189,52 @@ and,
> >   (bit_and (negate @0) integer_onep@1)
> >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> >    (bit_and @0 @1)))
> > +
> > +/* Optimize
> > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > +   c3 = c1 op c2
> > +   -->
> > +   c = a op b
> > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > +   For all integer non-div operations.  */
> > +(for op (plus minus mult bit_and bit_ior bit_xor
> > +        lshift rshift)
> > + (simplify
> > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> Just wondering, why should mask be CST here ?
> I guess the transform should work as long as mask is same for both
> vectors even if it's
> not constant ?

Yes, please change accordingly (and maybe push separately).
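
To illustrate why the mask does not need to be constant, here is a standalone
sketch (plain C++, not GCC internals) that models VEC_PERM_EXPR (v, v, mask)
as a lane gather and compares both sides of the transform for a runtime-valued
mask.  The names permute, lanewise, transform_holds and
check_transform_examples are made up for this example, and reducing indices
modulo the lane count is an assumption of the model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::size_t N = 4;
using Vec = std::array<std::uint32_t, N>;

// Model of VEC_PERM_EXPR (v, v, mask): lane i takes v[mask[i] % N].
// (Assumption: with identical inputs, indices simply wrap modulo N.)
Vec permute(const Vec &v, const Vec &mask) {
  Vec r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = v[mask[i] % N];
  return r;
}

// Apply a binary operation lane by lane.
template <typename Op>
Vec lanewise(const Vec &a, const Vec &b, Op op) {
  Vec r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = op(a[i], b[i]);
  return r;
}

// Compare permute(a) op permute(b) against permute(a op b).
template <typename Op>
bool transform_holds(const Vec &a, const Vec &b, const Vec &mask, Op op) {
  const Vec before = lanewise(permute(a, mask), permute(b, mask), op);
  const Vec after = permute(lanewise(a, b, op), mask);
  return before == after;
}

// Usage example: the mask repeats a lane and has an out-of-range index.
void check_transform_examples() {
  const Vec a{1, 2, 3, 4}, b{10, 20, 30, 40};
  const Vec mask{3, 3, 0, 5};  // 5 wraps to lane 1 in this model
  assert(transform_holds(a, b, mask, std::plus<std::uint32_t>{}));
  assert(transform_holds(a, b, mask, std::bit_xor<std::uint32_t>{}));
}
```

Nothing in the comparison depends on the mask being known at compile time,
only on both permutes using the same mask.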

> > +
> > +/* Similar for float arithmetic when permutation constant covers
> > +   all vector elements.  */
> > +(for op (plus minus mult)
> > + (simplify
> > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > +     (with
> > +      {
> > +       tree perm_cst = @2;
> > +       vec_perm_builder builder;
> > +       bool full_perm_p = false;
> > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > +         {
> > +           /* Create a vec_perm_indices for the integer vector.  */
> > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> If this transform is meant only for VLS vectors, I guess you should
> bail out if TYPE_VECTOR_SUBPARTS is not constant,
> otherwise it will crash for VLA vectors.

I suppose it's difficult to create a VLA permute that covers all elements
and is not trivial.  But indeed, add TYPE_VECTOR_SUBPARTS (type).is_constant ()
to the VECTOR_FLOAT_TYPE_P guard.

>
> Thanks,
> Prathamesh
> > +           vec_perm_indices sel (builder, 1, nelts);
> > +
> > +           /* Check if perm indices covers all vector elements.  */
> > +           int count = 0, i, j;
> > +           for (i = 0; i < nelts; i++)
> > +             for (j = 0; j < nelts; j++)

Meh, that's quadratic!  I suggest checking .encoding ().encoded_full_vector_p ()
(as said, I can't think of a non-full encoding that isn't trivial
but covers all elements) and then simply .qsort () the vector_builder
(it derives from vec<>) so the scan is O(n log n).

Maybe Richard has a better idea here though.
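
Roughly what I have in mind, as a standalone sketch: std::vector and std::sort
stand in for the vector_builder and .qsort (), and covers_all_lanes and
check_full_perm_examples are made-up names.  It assumes the indices were
already reduced to the range of a single input vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if `sel` selects every lane of an N-lane vector
// (N == sel.size ()).  Taken by value so the caller's copy stays unsorted.
// Cost is O(n log n) for the sort plus a linear scan.
bool covers_all_lanes(std::vector<std::size_t> sel) {
  std::sort(sel.begin(), sel.end());
  // After sorting, a full permutation is exactly 0, 1, ..., N-1.
  for (std::size_t i = 0; i < sel.size(); ++i)
    if (sel[i] != i)
      return false;
  return true;
}

// Usage example with the 4-lane mask from the testcase and a
// mask that duplicates lanes.
void check_full_perm_examples() {
  assert(covers_all_lanes({2, 3, 1, 0}));
  assert(!covers_all_lanes({0, 0, 1, 1}));
}
```

Since there are exactly N indices, covering all N lanes means each index
appears once, so comparing the sorted vector against 0 .. N-1 is enough.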

Otherwise looks OK, though these kinds of (* (op ..) (op ..)) patterns always
explode the match decision tree; ideally we'd have a way to match those with
(op ..) (op ..) first to be able to share more of the matching code.  That
said, match.pd is a less than ideal place for these (but mostly because of
the way we code generate *-match.cc).

Richard.

> > +               {
> > +                 if (sel[j].to_constant () == i)
> > +                   {
> > +                     count++;
> > +                     break;
> > +                   }
> > +               }
> > +           full_perm_p = count == nelts;
> > +         }
> > +       }
> > +       (if (full_perm_p)
> > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > new file mode 100644
> > index 00000000000..40e0ac11332
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > @@ -0,0 +1,44 @@
> > +/* PR target/98167 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2" } */
> > +
> > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > +
> > +#define VEC_PERM_4 \
> > +  2, 3, 1, 0
> > +#define VEC_PERM_8 \
> > +  4, 5, 6, 7, 3, 2, 1, 0
> > +#define VEC_PERM_16 \
> > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > +
> > +#define TYPE_PERM_OP(type, size, op, name) \
> > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > +                                             v##size##s##type b) \
> > +  { \
> > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > +                                                  VEC_PERM_##size); \
> > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > +                                                  VEC_PERM_##size); \
> > +    return a1 op b1; \
> > +  }
> > +
> > +#define INT_PERMS(op, name) \
> > +  TYPE_PERM_OP (int, 4, op, name) \
> > +
> > +#define FP_PERMS(op, name) \
> > +  TYPE_PERM_OP (float, 4, op, name) \
> > +
> > +INT_PERMS (+, add)
> > +INT_PERMS (-, sub)
> > +INT_PERMS (*, mul)
> > +INT_PERMS (|, ior)
> > +INT_PERMS (^, xor)
> > +INT_PERMS (&, and)
> > +INT_PERMS (<<, shl)
> > +INT_PERMS (>>, shr)
> > +FP_PERMS (+, add)
> > +FP_PERMS (-, sub)
> > +FP_PERMS (*, mul)
> > +
> > --
> > 2.18.1
> >