Date: Fri, 7 Jul 2023 16:10:39 +0200
From: Jan Hubicka
To: Tamar Christina
Cc: Richard Biener, "gcc-patches@gcc.gnu.org", nd, "jlaw@ventanamicro.com"
Subject: Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

> Hi Both,
>
> Thanks for all the reviews/patches so far 😊
>
> > > Looks good, but I wonder what we can do to at least make the multiple
> > > exit case behave reasonably?  The vectorizer keeps track of a
> > > "canonical" exit, would it be possible to pass in the main exit edge
> > > and use that instead of single_exit (), would other exits then behave
> > > somewhat reasonably or would we totally screw things up here?  That is,
> > > the "canonical" exit would be the counting exit while the other exits
> > > are on data-driven conditions and thus wouldn't change probability
> > > when we reduce the number of iterations(?)
> >
> > I can add a canonical_exit parameter and make the function direct flow
> > to it if possible.  However, overall I think the fixup depends on what
> > transformation led to the change.
> >
> > Assuming that the vectorizer did no prologues and epilogues and we
> > vectorized with factor N, then I think the update could be done more
> > specifically as follows.
>
> If it helps: this patch series addresses multiple exits by forcing a
> scalar epilogue, so all non-canonical exits would have been redirected to
> that scalar epilogue and the remaining scalar iteration count will be at
> most VF.

It looks like the profile update after vectorization needs quite some TLC.
My student Ondrej Kubanek also implemented loop histogram profiling, which
gives a better idea of how commonly prologues/epilogues are needed, and it
would also be nice to handle that.
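For intuition (made-up numbers; VF = 4 is just an example, not taken from
the patch): a loop whose body executes N times per entry has a latch
probability of (N - 1)/N, so the 98.99% in the dump quoted below
corresponds to N ~ 99, while a scalar epilogue after vectorizing with
factor VF can execute at most VF - 1 iterations.  A standalone sketch of
that arithmetic:

  /* Standalone illustration, not GCC code: the latch (back edge)
     probability of a loop whose body executes N times per entry is
     (N - 1) / N.  */
  #include <stdio.h>

  static double latch_prob (double n) { return (n - 1.0) / n; }

  int main (void)
  {
    printf ("stale profile (N = 99):           %.2f%%\n",
	    100.0 * latch_prob (99.0));   /* ~98.99%, as in the dump */
    printf ("capped epilogue (N = VF - 1 = 3): %.2f%%\n",
	    100.0 * latch_prob (3.0));    /* at most ~66.67% */
    return 0;
  }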
> > ;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
> > ;;  prev block 9, next block 13, flags: (NEW, VISITED)
> > ;;  pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> > ;;  succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)
> >
> > ;; basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
> > ;;  prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> > ;;  pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
> > ;;              12 [always]  count:10737416 (estimated locally) (FALLTHRU)
> >   # i_30 = PHI
> >   # ivtmp_32 = PHI
> >   _33 = a[i_30];
> >   _34 = _33 + 1;
> >   a[i_30] = _34;
> >   i_36 = i_30 + 1;
> >   ivtmp_37 = ivtmp_32 - 1;
> >   if (ivtmp_37 != 0)
> >     goto ; [98.99%]
> >   else
> >     goto ; [1.01%]

Actually it seems that the scalar epilogue loop keeps its original profile
(predicted to iterate 99 times), which is quite wrong.

Looking at the statistics for yesterday's patch, on tramp3d we got an 86%
reduction in cumulative profile mismatches after the whole optimization
pipeline.  More interestingly, however, the overall time estimate dropped
by 18%, so it seems that the profile adjustments done by cunroll affect
the profile a lot.  I think the fact that the iteration counts of
epilogues are not capped is one of the main problems.

We seem to call scale_loop_profile 3 times:

  scale_loop_profile (loop, prob_vector, -1);

This seems to account for the probability that control flow is redirected
to the prolog/epilog later.  So it only scales down the profile but is not
responsible for setting the iteration bound.

  scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);

This handles the prolog and sets its bound.

  scale_loop_profile (epilog, prob_epilog, -1);

This scales the epilog but does not set any bound at all.  I think the
information is available since we update the loop_info datastructures; a
rough sketch of what passing that bound could look like follows below.

Honza
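Only to illustrate the idea, not the actual patch: assuming the constant
vectorization factor is at hand as vf (a placeholder name, not the real
variable used in vect_do_peeling), the epilogue's trip count is bounded by
vf - 1, so the third call could pass that bound instead of -1:

  /* Hypothetical sketch: cap the epilogue's expected iteration count.
     vf stands for the (assumed constant) vectorization factor; real code
     would also have to handle the case where the VF is not a compile-time
     constant.  */
  scale_loop_profile (epilog, prob_epilog, vf - 1);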