From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.guenther@gmail.com>
Received: from mail-ej1-x62d.google.com (mail-ej1-x62d.google.com
 [IPv6:2a00:1450:4864:20::62d])
 by sourceware.org (Postfix) with ESMTPS id 90AAF3855024
 for <gcc@gcc.gnu.org>; Fri,  2 Jul 2021 11:29:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 90AAF3855024
Received: by mail-ej1-x62d.google.com with SMTP id bu12so15704831ejb.0
 for <gcc@gcc.gnu.org>; Fri, 02 Jul 2021 04:29:06 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=aMnqsY+EGWJDRNI+iEXdGEzsrS+VU5v2oc8wqk068qk=;
 b=Xp0wgABZgqv5Wzlu4i5QuWWhe5uStnD68onKPj43/tyXFWpA3jCaugE+vrVWD6J1qI
 1zbvf185YQxgMi/N5obkB7ChxfVzUrLVjFxLMlKRaBSvYKpR4t7MEspD4UOoHEjlOREH
 lcjaRq+Iw7eQcyAtRYKdkaEApt7DE36es9tis2+KTaUQZx7Ze5LFsuMLilF7UGbYFoHJ
 S5OJ/znxmDqBAsJtIQuIHWJ3EkFEJgneqcYbwnqow/VatcFESBKYBK2H7IioUzISg3KD
 4Xay4C00heMYxQI1SgVwIfkHjhtRM+zgM0iG8foLk7WkCd9ZqESqtaTmK0kb8iwjldnF
 S6tg==
X-Gm-Message-State: AOAM530lm4zS+0EWJodMBnz4cmmMQ+j82nX/9kDymax0yVV1shP0vJ8k
 rXWIaNj7mBUJJpAZ+5c+BuW9k+zOnCzzYsmorpA=
X-Google-Smtp-Source: ABdhPJxJTFOvfTHJd4OnJ9Hb19wpl42F6hRopAMMFgR7489KOVIppkYPdPTAdK7mC+XgRJR5hZnerdwJh6wPGAovhc0=
X-Received: by 2002:a17:907:9812:: with SMTP id
 ji18mr4962057ejc.138.1625225345573; 
 Fri, 02 Jul 2021 04:29:05 -0700 (PDT)
MIME-Version: 1.0
References: <1338ef7b-57f4-a376-5827-c85392ed53a8@linux.ibm.com>
 <CAFiYyc15i7ErH6K+Cptq4Z+23r3iqLW6pGstQvZLix6KnjWi5g@mail.gmail.com>
 <0fd24c58-bcd4-ce7d-d986-bee82d2b7ff5@linux.ibm.com>
In-Reply-To: <0fd24c58-bcd4-ce7d-d986-bee82d2b7ff5@linux.ibm.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Fri, 2 Jul 2021 13:28:54 +0200
Message-ID: <CAFiYyc0=ZejPUTPYQKp+5Xrn7oBYvYzf2CFdwtD0foqzs9X0tQ@mail.gmail.com>
Subject: Re: Question on tree LIM
To: "Kewen.Lin" <linkw@linux.ibm.com>
Cc: GCC Development <gcc@gcc.gnu.org>, 
 "Andre Vieira (lists)" <Andre.SimoesDiasVieira@arm.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-3.0 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, KAM_SHORT,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc mailing list <gcc.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <mailto:gcc-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Jul 2021 11:29:08 -0000

On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richard,
>
> on 2021/7/2 4:07 PM, Richard Biener wrote:
> > On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
> >>
> >> Hi,
> >>
> >> I am investigating a degradation in SPEC2017 exchange2_r: with
> >> loop vectorization on at -O2, it degrades by 6%.  After some
> >> isolation, I found it isn't directly caused by vectorization itself
> >> but is exposed by it: some statements for the vectorization
> >> condition checks are hoisted out and they increase the register
> >> pressure, which finally results in more spilling than before.  If I
> >> simply disable tree lim4, the gap becomes smaller (just 40%+ of
> >> the original); if I further disable rtl lim, it drops to 30% of
> >> the original.  This seems to indicate there is some room for
> >> improvement in both LIMs.
> >>
> >> From a quick scan of tree LIM, I noticed that there seems to be no
> >> consideration of register pressure; it looks intentional?  I am
> >> wondering what the design philosophy behind it is.  Is it because
> >> it's hard to model register pressure well here?  If so, it seems to
> >> put the burden onto the later RA, which needs to have good
> >> rematerialization support.
> >
> > Yes, it is "intentional" in that doing any kind of prioritization based
> > on register pressure is hard on the GIMPLE level since most
> > high-level transforms try to expose followup transforms which you'd
> > somehow have to anticipate.  Note that LIM's "cost model" (if you can
> > call it such...) is too simplistic to be a good base to decide which
> > 10 of the 20 candidates you want to move (and I've repeatedly pondered
> > removing it completely).
> >
>
> Thanks for the explanation!  Do you really want to remove it completely
> rather than just improve it with a better one?  :-\

;)  For example, the LIM cost model makes it not hoist an invariant (int)x,
but then PRE, which detects invariant motion opportunities as partial
redundancies, happily does (because PRE has no cost model at all - heh).
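
To make the (int)x case concrete, here is a minimal sketch in C (not the
Fortran at hand; the function names are made up for illustration).  The
first function has an invariant conversion inside the loop, the second
shows the same loop with the conversion hoisted by hand.  Which pass, if
any, performs this for a given GCC version and set of flags is not
something this sketch claims.

```c
#include <stddef.h>

/* Original form: the conversion (long) x is written inside the loop
   although x never changes there.  */
long
sum_scaled (const long *a, size_t n, int x)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * (long) x;
  return sum;
}

/* Hand-hoisted form: the conversion is computed once before the loop.
   The price is that lx stays live across the whole loop.  */
long
sum_scaled_hoisted (const long *a, size_t n, int x)
{
  long lx = (long) x;
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * lx;
  return sum;
}
```

Both functions compute the same value; the only difference is where the
conversion lives and how long its result occupies a register.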

> Here are some PRs (PR96825, PR98782) related to exchange2_r which
> seem to suffer from high register pressure and bad spilling.  Not sure
> whether they are also somehow related to the pressure from LIM, but
> the trigger is commit
> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
> frequency; maybe it's worth revisiting the idea of considering
> BB frequency in the LIM cost model:
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Note most "problems", and those which are harder to undo, stem from
LIM's store-motion, which increases register pressure inside loops by
adding loop-carried dependences.  The BB frequency might be a way
to order candidates when we have a way to set a better cap on the
number of refs to move.  Note the current "cost" model is rather a
benefit model and causes us to not move cheap things (like the above
conversion) because it seems not worth the trouble.
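
A minimal sketch of what store-motion does, written by hand in C (the
function names are illustrative, and restrict is used here only to state
the no-alias assumption that makes the transform valid):

```c
#include <stddef.h>

/* Before store motion: *acc is loaded and stored in every iteration.  */
void
accumulate (long *restrict acc, const long *restrict a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    *acc += a[i];
}

/* After store motion, done by hand: the value is kept in tmp for the
   whole loop and stored once on exit.  tmp is now loop-carried (in SSA
   form it needs a PHI in the loop header) and holds a register for the
   duration of the loop.  */
void
accumulate_sm (long *restrict acc, const long *restrict a, size_t n)
{
  long tmp = *acc;
  for (size_t i = 0; i < n; i++)
    tmp += a[i];
  *acc = tmp;
}
```

Going back from the second form to the first is the "undo" that is hard:
it means re-introducing memory accesses into the loop rather than just
recomputing a value.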

Note a very simple way would be to have a --param specifying a
maximum number of refs to move (but note there are several
LIM/store-motion passes so any such static limit would have
surprising effects).  For store-motion I considered a hard limit on
the number of loop-carried dependences (PHIs), counting both
existing and added ones (to avoid the surprise).
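
A very rough sketch of such a hard limit, in C.  Everything here is
hypothetical: neither the constant nor the function exists in GCC, and
the value 8 is arbitrary.  It only shows the counting rule (existing
plus added PHIs against one cap).

```c
/* Hypothetical cap on loop-carried dependences (PHIs) per loop.
   Not an existing GCC --param; the value is arbitrary.  */
enum { MAX_LOOP_CARRIED = 8 };

/* Return how many of the WANTED store-motion candidates may still be
   applied to a loop that already has EXISTING_PHIS loop-carried
   dependences.  Existing and added PHIs count against the same cap.  */
unsigned
sm_candidates_allowed (unsigned existing_phis, unsigned wanted)
{
  if (existing_phis >= MAX_LOOP_CARRIED)
    return 0;
  unsigned room = MAX_LOOP_CARRIED - existing_phis;
  return wanted < room ? wanted : room;
}
```

Because the existing PHIs are included, a later LIM pass running on the
same loop sees the PHIs added by an earlier one and does not get a fresh
budget.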

Note how such limits or other cost models should consider inner and
outer loop behavior remains to be determined - at least LIM works
at the level of whole loop nests and there's a rough idea of dependent
transforms but simply gathering candidates and stripping some isn't
going to work without major surgery in that area I think.

> > As to putting the burden on RA - yes, that's one possibility.  The other
> > possibility is to use the register-pressure aware scheduler, though not
> > sure if that will ever move things into loop bodies.
> >
>
> Brand new idea!  IIUC it requires a global scheduler; I'm not sure how well
> GCC's global scheduler performs.  Generally speaking, the register-pressure
> aware scheduler will prefer the insn which has more deads (for that
> intensive regclass); for this problem the modeling seems a bit different:
> it has to care about total interference numbers between two "equivalent"
> blocks (src/dest).  Not sure if it's easier to do than rematerialization.

No idea either, but as said above, undoing store-motion is harder than
scheduling or RA remat.

> >> btw, the example loop is at line 1150 from src exchange2.fppized.f90
> >>
> >>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
> >>
> >> The extra hoisted statements after the vectorization on this loop
> >> (cheap cost model btw) are:
> >>
> >>     _686 = (integer(kind=8)) rnext_679;
> >>     _1111 = (sizetype) _19;
> >>     _1112 = _1111 * 12;
> >>     _1927 = _1112 + 12;
> >>   * _1895 = _1927 - _2650;
> >>     _1113 = (unsigned long) rnext_679;
> >>   * niters.6220_1128 = 10 - _1113;
> >>   * _1021 = 9 - _1113;
> >>   * bnd.6221_940 = niters.6220_1128 >> 2;
> >>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
> >>     _144 = niters_vector_mult_vf.6222_939 + _1113;
> >>     tmp.6223_934 = (integer(kind=8)) _144;
> >>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
> >>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
> >>
> >> PS: * marks the statements that have a long live interval.
> >
> > Note for the vectorizer generated conditions there's quite some room for
> > improvements to reduce the amount of semi-redundant computations.  I've
> > pointed out some to Andre, in particular suggesting to maintain a single
> > "remaining scalar iterations" IV across all the checks to avoid keeping
> > 'niters' live and doing all the above masking & shifting repeatedly before
> > the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
> > he got with that idea.
> >
>
> Great, it definitely helps to mitigate this problem.  Thanks for the information.
>
>
> BR,
> Kewen
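
For reference, the masking & shifting in the hoisted statements quoted
above can be mirrored in plain C.  This is only a sketch: the struct and
function names are made up, and the vectorization factor of 4 is
inferred from the >> 2 and from the mask, which is ~3 as a 64-bit
unsigned value.

```c
#include <stdint.h>

/* Mirrors three of the hoisted GIMPLE statements.  */
struct vect_counts
{
  uint64_t niters;          /* niters.6220_1128 = 10 - _1113 */
  uint64_t bnd;             /* bnd.6221_940 = niters >> 2 */
  uint64_t niters_mult_vf;  /* niters & 18446744073709551612 */
};

/* RNEXT is the lower bound of the section rnext:9, so the scalar loop
   runs 10 - rnext times.  Assumes rnext <= 9.  */
struct vect_counts
compute_counts (uint64_t rnext)
{
  struct vect_counts c;
  c.niters = 10 - rnext;
  c.bnd = c.niters >> 2;    /* number of vector iterations */
  /* Scalar iterations covered by the vector loop: niters rounded down
     to a multiple of 4.  */
  c.niters_mult_vf = c.niters & UINT64_C (18446744073709551612);
  return c;
}
```

All three values derive from the same rnext, which is why keeping them
all live across the loop nest is semi-redundant.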