From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=VCQ8=D7=linaro.org=prathamesh.kulkarni@sourceware.org>
Received: from mail-wr1-x42f.google.com (mail-wr1-x42f.google.com [IPv6:2a00:1450:4864:20::42f])
	by sourceware.org (Postfix) with ESMTPS id 03027385828D
	for <gcc-patches@gcc.gnu.org>; Mon, 14 Aug 2023 14:58:53 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 03027385828D
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org
Received: by mail-wr1-x42f.google.com with SMTP id ffacd0b85a97d-318015ade49so3987652f8f.0
        for <gcc-patches@gcc.gnu.org>; Mon, 14 Aug 2023 07:58:52 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linaro.org; s=google; t=1692025131; x=1692629931;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=UfcNq/a1NEH4SfKrTdJu9nbVTgzSSjJimGvwevhSxBI=;
        b=dAhiFiGopNaBIOz6BbdJxUzDtwCxaW7gJXbw/nhgZ+YG1wRpArkjpJIVjyof/4hqRm
         megg/uBZe+ivuMqEZKWzJw+wdmYf8FyVc0Va2vEko+0Ma/0p/gVOyGmPg2Zc9fDFAoMn
         gy50U18XtEW5e6tO7tscPcWVbHKzOhAgqEJEowgojN+VK2yPkY/BDFx7YaNea/MyFBAl
         xw+cHpYLbLY/M5dQOl7FSAVxKrRWt6Z1QKmjyo1hKZ2Rno8GGdw/ZCrjkxiEhSoOXsyc
         0tG3PvOb4RZ15lXUloYmO2bkTAXPnntgSH7FkC7XrezLoKqyFThvN7dQhAFXyunCMyUr
         xqpA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1692025131; x=1692629931;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=UfcNq/a1NEH4SfKrTdJu9nbVTgzSSjJimGvwevhSxBI=;
        b=OcwLkfO41/t1a91Du3+5crXisxQv8GblD9qYaEARrbumSt2PxmHjU3/jRN9TGQuvjR
         lB3jnPDjRWL3I1h01xQJTDuu42tENigOk6p6sZJrp6dQezNyiqzkdnmEZ23JKsDQVCjD
         cmU6T9YQwBeHWz/6J3jLIb/1n4uSUVh9VC1+EqzTgwj2eIikKQ5U5nvveDZn9c1yGmrf
         pGCQ3pLlkMOFuSx5wvvvErYXUU4xwLNdpt0yQhigkA8j2wSfDqaAVZETYOu40Cjmy/iE
         1XB+N2fjlx4JFHpSeEyPohIF3Io2llZ+1KtURurwvaOgbPnriXKtzDfYCfa1BFR9a44x
         a9iw==
X-Gm-Message-State: AOJu0Yx3CuZ7ej4PhUb2LdwwCf6/ftWIpPJXkNhdaM+YjASGTTfjZATK
	dE3RQgE71HAdeHgdp2BbtSMUvN+gSjGI8wFoXfwFiA==
X-Google-Smtp-Source: AGHT+IHmK8eQW18aLbu7hf7ihTdW/QeJRw64P+yi8XTq+XbSyKYprS9HO1YNHz2vArFxOaSaL5PnVczlqBnstuhmCCo=
X-Received: by 2002:adf:e648:0:b0:319:76ce:32ba with SMTP id
 b8-20020adfe648000000b0031976ce32bamr2436488wrn.21.1692025131271; Mon, 14 Aug
 2023 07:58:51 -0700 (PDT)
MIME-Version: 1.0
References: <20230728070552.50C1413276@imap2.suse-dmz.suse.de>
 <320f94c3-8805-ec64-dd4b-0454c8ecce14@gmail.com> <nycvar.YFH.7.77.849.2308020846180.12935@jbgna.fhfr.qr>
 <CAAgBjMmFJFSSVQHJjNnpyRXh4S330_g6T4ZMSfQQP-oL8UJSfw@mail.gmail.com>
 <nycvar.YFH.7.77.849.2308031214200.12935@jbgna.fhfr.qr> <nycvar.YFH.7.77.849.2308031215440.12935@jbgna.fhfr.qr>
 <nycvar.YFH.7.77.849.2308031217390.12935@jbgna.fhfr.qr> <CAAgBjMkZC4XhNwdEFV8PBExQBPUQXuGNqX22iGaHhkNuc7LfCQ@mail.gmail.com>
 <CAFiYyc0OaYqEULUiytCSAmHE-7eJPevCf4xHj3KugmY0PqAAHg@mail.gmail.com>
In-Reply-To: <CAFiYyc0OaYqEULUiytCSAmHE-7eJPevCf4xHj3KugmY0PqAAHg@mail.gmail.com>
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Date: Mon, 14 Aug 2023 20:28:15 +0530
Message-ID: <CAAgBjMkyPNAhaELEbgi6MBCRx_Ccajtwct6nJpTt4TdwzD8NsQ@mail.gmail.com>
Subject: Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics
 for vectorization
To: Richard Biener <richard.guenther@gmail.com>
Cc: Richard Biener <rguenther@suse.de>, Jeff Law <jeffreyalaw@gmail.com>, gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-3.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wr=
ote:
>
> On Mon, Aug 7, 2023 at 2:05=E2=80=AFAM Prathamesh Kulkarni via Gcc-patche=
s
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > >
> > > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > > >
> > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > > > >
> > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > > > The following delays sinking of loads within the same inn=
ermost
> > > > > > > > > loop when it was unconditional before.  That's a not unco=
mmon
> > > > > > > > > issue preventing vectorization when masked loads are not =
available.
> > > > > > > > >
> > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > > > >
> > > > > > > > > I have a followup patch improving sinking that without th=
is would
> > > > > > > > > cause more of the problematic sinking - now that we have =
a second
> > > > > > > > > sink pass after loop opts this looks like a reasonable ap=
proach?
> > > > > > > > >
> > > > > > > > > OK?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Richard.
> > > > > > > > >
> > > > > > > > >  PR tree-optimization/92335
> > > > > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > > > > >  in innermost loops to conditional executed places.
> > > > > > > > >
> > > > > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > > > >  expect predictive commoning to happen instead of sinking=
.
> > > > > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > > > I think it's reasonable -- there's probably going to be cas=
es where it's not
> > > > > > > > great, but more often than not I think it's going to be a r=
easonable
> > > > > > > > heuristic.
> > > > > > > >
> > > > > > > > If there is undesirable fallout, better to find it over the=
 coming months than
> > > > > > > > next spring.  So I'd suggest we go forward now to give more=
 time to find any
> > > > > > > > pathological cases (if they exist).
> > > > > > >
> > > > > > > Agreed, I've pushed this now.
> > > > > > Hi Richard,
> > > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651=
c4d6f6a0),
> > > > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VEC=
TORIZED"
> > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > > >
> > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { t=
arget {
> > > > > > ! vect_fold_extract_last } } } } */
> > > > > >
> > > > > > With your commit, condition_reduction in pr65947-7.c gets vecto=
rized
> > > > > > regardless of vect_fold_extract_last,
> > > > > > which gates the above test (which is an improvement, because th=
e
> > > > > > function didn't get vectorized before the commit).
> > > > > >
> > > > > > The attached patch thus removes the gating on vect_fold_extract=
_last,
> > > > > > and the test passes again.
> > > > > > OK to commit ?
> > > > >
> > > > > OK.
> > > >
> > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > > > critical target condition is missing.  Can you figure out which?
> > >
> > > I see
> > >
> > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > note:   vect_is_simple_use: operand last_19 =3D PHI <last_8(7), 108(1=
5)>,
> > > type of def: reduction
> > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > note:   vect_is_simple_use: vectype vector(4) int
> > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > missed:   multiple types in double reduction or condition reduction o=
r
> > > fold-left reduction.
> > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> > > missed:   not vectorized: relevant phi not supported: last_19 =3D PHI
> > > <last_8(7), 108(15)>
> > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > missed:  bad operation or unsupported loop bound.
> > Hi Richard,
> > Looking at the aarch64 vect dump, it seems the loop in
> > condition_reduction gets vectorized with V4HI mode
> > while fails for other modes in vectorizable_condition:
> >
> >   if ((double_reduc || reduction_type !=3D TREE_CODE_REDUCTION)
> >       && ncopies > 1)
> >     {
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "multiple types in double reduction or conditi=
on "
> >                          "reduction or fold-left reduction.\n");
> >       return false;
> >     }
> >
> > From the dump:
> > foo.c:9:21: note:   =3D=3D=3D vect_analyze_loop_operations =3D=3D=3D
> > foo.c:9:21: note:   examining phi: last_19 =3D PHI <last_8(7), 108(15)>
> > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > def: internal
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> > foo.c:9:21: note:   vect_is_simple_use: operand last_19 =3D PHI
> > <last_8(7), 108(15)>, type of def: reduction
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> >
> > For V8HI, VF =3D 8, and vectype_in =3D vector(4) int.
> > Thus ncopies =3D VF / length(vectype_in) =3D 2, which is greater than 1=
,
> > and thus fails:
> > foo.c:9:21: missed:   multiple types in double reduction or condition
> > reduction or fold-left reduction.
> > foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> > last_19 =3D PHI <last_8(7), 108(15)>
> > While for V4HI, VF =3D 4 and thus ncopies =3D 1, so it succeeds.
> >
> > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> > If I "force" the vectorizer to use V4HI mode, we get the following dump=
:
> > foo.c:9:21: note:   =3D=3D=3D vect_analyze_loop_operations =3D=3D=3D
> > foo.c:9:21: note:   examining phi: last_19 =3D PHI <last_8(7), 108(15)>
> > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > def: internal
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > foo.c:9:21: note:   vect_is_simple_use: operand last_19 =3D PHI
> > <last_8(7), 108(15)>, type of def: reduction
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > foo.c:9:21: missed:   multiple types in double reduction or condition
> > reduction or fold-left reduction.
> >
> > Not sure tho if this is the only reason for the test to fail to
> > vectorize on the target.
> > Will investigate in more details next week.
>
> The odd thing is that you say
>
>   for (int i =3D 0; i < N; i++)
>     {
>       aval =3D a[i];
>       if (b[i] < min_v)
>         last =3D aval;
>     }
>
> fails to vectorize but
>
>   for (int i =3D 0; i < N; i++)
>     {
>       if (b[i] < min_v)
>         last =3D a[i];
>     }
>
> succeeds?  The IL difference should be irrelevant for the reduction
Hi Richard,
Sorry for late response.
No this case containing a conditional load doesn't vectorize on aarch64 eit=
her:
foo2.c:9:21: note:  =3D=3D=3D analyze_loop_nest =3D=3D=3D
foo2.c:9:21: note:   =3D=3D=3D vect_analyze_loop_form =3D=3D=3D
foo2.c:9:21: missed:   not vectorized: control flow in loop.
foo2.c:9:21: missed:  bad loop form.

For this test:
for (int i =3D 0; i < N; i++)
    {
      aval =3D a[i];
      if (b[i] < min_v)
        last =3D aval;
    }

IIUC sink pass made the load conditional preventing vectorization
(similar to above),
but your PR92335 fix delays the sinking of load before loop opts, and
thus gets vectorized.
Till vect pass, the dumps are similar for x86 and aarch64.
> vectorization:
>
>   <bb 3> [local count: 1049367889]:
>   # last_19 =3D PHI <last_8(7), 108(15)>
>   # i_21 =3D PHI <i_17(7), 0(15)>
>   # ivtmp_18 =3D PHI <ivtmp_10(7), 43(15)>
>   _1 =3D (long unsigned int) i_21;
>   _2 =3D _1 * 2;
>   _3 =3D a_12(D) + _2;
>   aval_13 =3D *_3;
>   _5 =3D _1 * 4;
>   _6 =3D b_14(D) + _5;
>   _7 =3D *_6;
>   last_16 =3D (int) aval_13;
>   _9 =3D _7 < min_v_15(D);
>   last_8 =3D _9 ? last_16 : last_19;
>   i_17 =3D i_21 + 1;
>   ivtmp_10 =3D ivtmp_18 - 1;
>   if (ivtmp_10 !=3D 0)
>     goto <bb 7>; [97.68%]
>
> vs
>
>   <bb 3> [local count: 1049367889]:
>   # last_19 =3D PHI <last_9(7), 108(15)>
>   # i_21 =3D PHI <i_17(7), 0(15)>
>   # ivtmp_11 =3D PHI <ivtmp_10(7), 43(15)>
>   _1 =3D (long unsigned int) i_21;
>   _2 =3D _1 * 4;
>   _3 =3D b_13(D) + _2;
>   _4 =3D *_3;
>   _5 =3D _4 < min_v_14(D);
>   _6 =3D _1 * 2;
>   _38 =3D _37 + _6;
>   _7 =3D (short int *) _38;
>   _8 =3D .MASK_LOAD (_7, 16B, _5);
>   last_16 =3D (int) _8;
>   last_9 =3D _5 ? last_16 : last_19;
>   i_17 =3D i_21 + 1;
>   ivtmp_10 =3D ivtmp_11 - 1;
>   if (ivtmp_10 !=3D 0)
>     goto <bb 7>; [97.68%]
>
> maybe since the "mask" is used twice with the .MASK_LOAD
> we are not actually looking at the def (the comparison) and it's
> the comparison which would introduce the "multiple types"?
>
> That is, I wonder why not sinking the load, avoiding a conditional
> load, makes a difference to vectorizing the condition/extract last reduct=
ion.
IIUC, the issue is that the vector type used for reduction seems to be
different than
the vector type used for determining VF, and it passes the above check
in vectoriable_reduction,
only if VF matches the length of the vector type used for reduction.

For V4HI mode on aarch64, it sets VF =3D 4 which matches with the length of=
 vector
type used for reduction (vector (4) int):
foo.c:9:21: note:   examining phi: last_19 =3D PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 =3D PHI
<last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
>
> It doesn't seem to make a difference for x86.  That said, the "fix" is
> probably sticking the correct target on the dump-check, it seems
> that vect_fold_extract_last is no longer correct here.
Um sorry, I did go thru various checks in target-supports.exp, but not
sure which one will be appropriate for this case,
and am stuck here :/ Could you please suggest how to proceed ?

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Richard.