From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=01ZB=DY=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lj1-x234.google.com (mail-lj1-x234.google.com [IPv6:2a00:1450:4864:20::234])
	by sourceware.org (Postfix) with ESMTPS id 2C4D83858D28
	for <gcc-patches@gcc.gnu.org>; Mon,  7 Aug 2023 07:49:34 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 2C4D83858D28
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x234.google.com with SMTP id 38308e7fff4ca-2b9cf2b1309so51330361fa.0
        for <gcc-patches@gcc.gnu.org>; Mon, 07 Aug 2023 00:49:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1691394572; x=1691999372;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=jwu1RAdJfA8HKRQldN73hsKAlT5lUnJh9XJ79iGO3/0=;
        b=P3wYZfowk4a4dF3l+szvRyXMzmpdO9+RL3aFSomCWxQ8GfM+y6qm8AUU5TI2weuV8A
         0euneX3TuZKeRJGnY7oBtb5lfcISxkFX5nBiCDarCzVNWub44/qg04qJ19ErhIo1vpPe
         miodxSDptC/PvWNqP6wnRTO9gYNVyUCsMxilJzVYhXMhMZMre5tnY/LBtpvQyU3J4lU3
         c82K8qpT3VkE8REE+qRYb+VpgVPNhvb2EF74l0xA1ZpJRfNAoib4mfWTsqr0vdabpqtD
         5vRA7KlBpxFW8qpAnRYp5lSB+ocT4f3RWLMbgEcLUMDETwO4k+1xvWzwmtzWkqKUsJYC
         aBPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1691394572; x=1691999372;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=jwu1RAdJfA8HKRQldN73hsKAlT5lUnJh9XJ79iGO3/0=;
        b=QBzpOOdStFyz6IkbI5Uo7BPnBAMyarVcMFRjt4/7zmNlWIJCVd2h/jCPrm7zJuYnr4
         SJ/N/xS/auyiO64DWiOUsN63pYxEHQJ1dbudAEG7A2aFpIVTw9Uuflrkuw6PItTMRr63
         AEJwpIuL0g21m36mYvceObVsGBrmbbbe7N/Kc6ddHO4kSzI4iyKYikF8gvgFMC1RvzWm
         /IxWoGFXZodiSPLpTNu26Ag+B5tH6UJDhoRiuZ2DX/d4rXZTZM41S49vNCtdLO416yDM
         osfhOZAbbj6j+VenTH7WqlyP/NWlUCp7mG5NQOGtWxjpwFTaAX+EHEueQ90lr84HLJQY
         90Vg==
X-Gm-Message-State: AOJu0YyLjZwSvjpusMmpo6TkDDCZnuhJIvQbxsGfuU0x+w0eOxv+K826
	Mjn1gJmZPOFUIN+0v9Iqmpq/CKEa6W/ShAU4E+M=
X-Google-Smtp-Source: AGHT+IG8pFbzyBZEs46zlbvBPuwWv67n2ngzR57eTtiK0I2AGgmsFZEufs4JPPjh9VMmHnoTKlDuvHcAWzGUXs0a9DI=
X-Received: by 2002:a05:651c:1255:b0:2b6:9e1d:cd0f with SMTP id
 h21-20020a05651c125500b002b69e1dcd0fmr2141458ljh.12.1691394572142; Mon, 07
 Aug 2023 00:49:32 -0700 (PDT)
MIME-Version: 1.0
References: <20230728070552.50C1413276@imap2.suse-dmz.suse.de>
 <320f94c3-8805-ec64-dd4b-0454c8ecce14@gmail.com> <nycvar.YFH.7.77.849.2308020846180.12935@jbgna.fhfr.qr>
 <CAAgBjMmFJFSSVQHJjNnpyRXh4S330_g6T4ZMSfQQP-oL8UJSfw@mail.gmail.com>
 <nycvar.YFH.7.77.849.2308031214200.12935@jbgna.fhfr.qr> <nycvar.YFH.7.77.849.2308031215440.12935@jbgna.fhfr.qr>
 <nycvar.YFH.7.77.849.2308031217390.12935@jbgna.fhfr.qr> <CAAgBjMkZC4XhNwdEFV8PBExQBPUQXuGNqX22iGaHhkNuc7LfCQ@mail.gmail.com>
In-Reply-To: <CAAgBjMkZC4XhNwdEFV8PBExQBPUQXuGNqX22iGaHhkNuc7LfCQ@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 7 Aug 2023 09:48:28 +0200
Message-ID: <CAFiYyc0OaYqEULUiytCSAmHE-7eJPevCf4xHj3KugmY0PqAAHg@mail.gmail.com>
Subject: Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics
 for vectorization
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Cc: Richard Biener <rguenther@suse.de>, Jeff Law <jeffreyalaw@gmail.com>, gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-1.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Mon, Aug 7, 2023 at 2:05=E2=80=AFAM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote:
> >
> > On Thu, 3 Aug 2023, Richard Biener wrote:
> >
> > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > >
> > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > > >
> > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > > The following delays sinking of loads within the same inner=
most
> > > > > > > > loop when it was unconditional before.  That's a not uncomm=
on
> > > > > > > > issue preventing vectorization when masked loads are not av=
ailable.
> > > > > > > >
> > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > > >
> > > > > > > > I have a followup patch improving sinking that without this=
 would
> > > > > > > > cause more of the problematic sinking - now that we have a =
second
> > > > > > > > sink pass after loop opts this looks like a reasonable appr=
oach?
> > > > > > > >
> > > > > > > > OK?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > >  PR tree-optimization/92335
> > > > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > > > >  in innermost loops to conditional executed places.
> > > > > > > >
> > > > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > > >  expect predictive commoning to happen instead of sinking.
> > > > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > > I think it's reasonable -- there's probably going to be cases=
 where it's not
> > > > > > > great, but more often than not I think it's going to be a rea=
sonable
> > > > > > > heuristic.
> > > > > > >
> > > > > > > If there is undesirable fallout, better to find it over the c=
oming months than
> > > > > > > next spring.  So I'd suggest we go forward now to give more t=
ime to find any
> > > > > > > pathological cases (if they exist).
> > > > > >
> > > > > > Agreed, I've pushed this now.
> > > > > Hi Richard,
> > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4=
d6f6a0),
> > > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTO=
RIZED"
> > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > >
> > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { tar=
get {
> > > > > ! vect_fold_extract_last } } } } */
> > > > >
> > > > > With your commit, condition_reduction in pr65947-7.c gets vectori=
zed
> > > > > regardless of vect_fold_extract_last,
> > > > > which gates the above test (which is an improvement, because the
> > > > > function didn't get vectorized before the commit).
> > > > >
> > > > > The attached patch thus removes the gating on vect_fold_extract_l=
ast,
> > > > > and the test passes again.
> > > > > OK to commit ?
> > > >
> > > > OK.
> > >
> > > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > > critical target condition is missing.  Can you figure out which?
> >
> > I see
> >
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > note:   vect_is_simple_use: operand last_19 =3D PHI <last_8(7), 108(15)=
>,
> > type of def: reduction
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > note:   vect_is_simple_use: vectype vector(4) int
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > missed:   multiple types in double reduction or condition reduction or
> > fold-left reduction.
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> > missed:   not vectorized: relevant phi not supported: last_19 =3D PHI
> > <last_8(7), 108(15)>
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > missed:  bad operation or unsupported loop bound.
> Hi Richard,
> Looking at the aarch64 vect dump, it seems the loop in
> condition_reduction gets vectorized with V4HI mode
> while fails for other modes in vectorizable_condition:
>
>   if ((double_reduc || reduction_type !=3D TREE_CODE_REDUCTION)
>       && ncopies > 1)
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "multiple types in double reduction or condition=
 "
>                          "reduction or fold-left reduction.\n");
>       return false;
>     }
>
> From the dump:
> foo.c:9:21: note:   =3D=3D=3D vect_analyze_loop_operations =3D=3D=3D
> foo.c:9:21: note:   examining phi: last_19 =3D PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 =3D PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
>
> For V8HI, VF =3D 8, and vectype_in =3D vector(4) int.
> Thus ncopies =3D VF / length(vectype_in) =3D 2, which is greater than 1,
> and thus fails:
> foo.c:9:21: missed:   multiple types in double reduction or condition
> reduction or fold-left reduction.
> foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> last_19 =3D PHI <last_8(7), 108(15)>
> While for V4HI, VF =3D 4 and thus ncopies =3D 1, so it succeeds.
>
> For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> If I "force" the vectorizer to use V4HI mode, we get the following dump:
> foo.c:9:21: note:   =3D=3D=3D vect_analyze_loop_operations =3D=3D=3D
> foo.c:9:21: note:   examining phi: last_19 =3D PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 =3D PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: missed:   multiple types in double reduction or condition
> reduction or fold-left reduction.
>
> Not sure tho if this is the only reason for the test to fail to
> vectorize on the target.
> Will investigate in more details next week.

The odd thing is that you say

  for (int i =3D 0; i < N; i++)
    {
      aval =3D a[i];
      if (b[i] < min_v)
        last =3D aval;
    }

fails to vectorize but

  for (int i =3D 0; i < N; i++)
    {
      if (b[i] < min_v)
        last =3D a[i];
    }

succeeds?  The IL difference should be irrelevant for the reduction
vectorization:

  <bb 3> [local count: 1049367889]:
  # last_19 =3D PHI <last_8(7), 108(15)>
  # i_21 =3D PHI <i_17(7), 0(15)>
  # ivtmp_18 =3D PHI <ivtmp_10(7), 43(15)>
  _1 =3D (long unsigned int) i_21;
  _2 =3D _1 * 2;
  _3 =3D a_12(D) + _2;
  aval_13 =3D *_3;
  _5 =3D _1 * 4;
  _6 =3D b_14(D) + _5;
  _7 =3D *_6;
  last_16 =3D (int) aval_13;
  _9 =3D _7 < min_v_15(D);
  last_8 =3D _9 ? last_16 : last_19;
  i_17 =3D i_21 + 1;
  ivtmp_10 =3D ivtmp_18 - 1;
  if (ivtmp_10 !=3D 0)
    goto <bb 7>; [97.68%]

vs

  <bb 3> [local count: 1049367889]:
  # last_19 =3D PHI <last_9(7), 108(15)>
  # i_21 =3D PHI <i_17(7), 0(15)>
  # ivtmp_11 =3D PHI <ivtmp_10(7), 43(15)>
  _1 =3D (long unsigned int) i_21;
  _2 =3D _1 * 4;
  _3 =3D b_13(D) + _2;
  _4 =3D *_3;
  _5 =3D _4 < min_v_14(D);
  _6 =3D _1 * 2;
  _38 =3D _37 + _6;
  _7 =3D (short int *) _38;
  _8 =3D .MASK_LOAD (_7, 16B, _5);
  last_16 =3D (int) _8;
  last_9 =3D _5 ? last_16 : last_19;
  i_17 =3D i_21 + 1;
  ivtmp_10 =3D ivtmp_11 - 1;
  if (ivtmp_10 !=3D 0)
    goto <bb 7>; [97.68%]

maybe since the "mask" is used twice with the .MASK_LOAD
we are not actually looking at the def (the comparison) and it's
the comparison which would introduce the "multiple types"?

That is, I wonder why not sinking the load, avoiding a conditional
load, makes a difference to vectorizing the condition/extract last reductio=
n.

It doesn't seem to make a difference for x86.  That said, the "fix" is
probably sticking the correct target on the dump-check, it seems
that vect_fold_extract_last is no longer correct here.

Richard.

> Thanks,
> Prathamesh
> >
> > Richard.