From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-433798-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 62291 invoked by alias); 11 Aug 2016 14:27:05 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 62241 invoked by uid 89); 11 Aug 2016 14:27:04 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.1 required=5.0 tests=AWL,BAYES_50,KAM_LAZY_DOMAIN_SECURITY,RP_MATCHES_RCVD autolearn=no version=3.3.2 spammy=wash, decisions, FSM, fsm
X-HELO: nikam.ms.mff.cuni.cz
Received: from nikam.ms.mff.cuni.cz (HELO nikam.ms.mff.cuni.cz) (195.113.20.16) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 11 Aug 2016 14:27:03 +0000
Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202)	id C0041544C01; Thu, 11 Aug 2016 16:27:00 +0200 (CEST)
Date: Thu, 11 Aug 2016 14:27:00 -0000
From: Jan Hubicka <hubicka@ucw.cz>
To: Richard Biener <rguenther@suse.de>
Cc: Jan Hubicka <hubicka@ucw.cz>, gcc-patches@gcc.gnu.org, law@redhat.com
Subject: Re: Early jump threading
Message-ID: <20160811142700.GA57561@kam.mff.cuni.cz>
References: <20160811140235.GA68714@kam.mff.cuni.cz> <alpine.LSU.2.11.1608111604411.26629@t29.fhfr.qr>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.LSU.2.11.1608111604411.26629@t29.fhfr.qr>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-SW-Source: 2016-08/txt/msg00923.txt.bz2

> On Thu, 11 Aug 2016, Jan Hubicka wrote:
> 
> > Hi,
> > this patch adds early jump threading pass.  Jump threading is one of most
> > common cases where estimated profile becomes corrupted, because the branches
> > are predicted independently beforehand. This patch simply adds early mode to
> > jump threader that does not permit code growth and thus only win/win threads
> > are done before profile is constructed.
> > 
> > Naturally this also improves IPA decisions because code size/time is estimated
> > more acurately.
> > 
> > It is not very cool to add another pass and the jump threader is already
> > run 5 times. I think incrementally we could drop one of late threaders at least.
> > I tried to measure compile time effects but they are in wash. Moreover the patch
> > pays for itself in cc1plus code size:
> > 
> > Before patch to tweak size estimates: 27779964  
> > Current mainline:                     27748900  
> > With patch applied:                   27716173  
> > 
> > So despite adding few functions the code size effect is positive which I think
> > is quite nice.
> > 
> > Given the fact that jump threading seems quite lightweit, I wonder why it is
> > guarded by flag_expensive_optimizations? Is it expensive in terms of code
> > size?
> > 
> > The effectivity of individual threading passes is as follows (for tramp3d)
> > 
> >       mainline                              with patch
> > pass  thread count     profile mismatches   thread count    profile mismatch
> > early                                       525
> > 1     1853             1900                 316             337
> > 2     4                812                  4               288
> > 3     24               1450                 32              947
> > 4     76               1457                 75              975
> > 
> > So at least tramp3d data seems to suggest that we can drop the second occurence
> > of jump threading and that most of the job thread1 does can be done by the
> > size restricted early version (the lower absolute counts are caused by the
> > fact that threadable paths gets duplicated by the inliner). 50% drop in
> > profile distortion is not bad. I wonder why basically every threaded paths seems
> > to introduce a mismatch.
> > 
> > The patch distorts testusite somewhat, in most cases we only want to update
> > dump files or disable early threading:
> > 
> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> > +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
> > 
> > This testcase is the now probably unnecesary heuristics (it may still be
> > relevant in cases we do not thread because of size bounds but its effectivity
> > may not be worth the maintenance cost):
> > 
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > 
> > If the patch seems acceptable, I will do the updates. One option why I did
> > not do that is that it seems to be now posisble to pass parameters to passes
> > from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
> > consistent with way we handle other early passes.
> 
> I wonder why you choose to put the FSM threader early which only does
> backward threading(?!).  I'd expect forward threading to be more
> profitable (though we don't have a separate threader for that and
> would need VRP or DOM - but it seems we will get an early VRP anyway).

On tramp3d all VRP passes threads together 730 branches, all DOM passes 393, so
FSM threading (with 1957 branches) is the most effective one. Perhaps eventually
early VRP can also do bit of work.

I am not 100% sure from where "backward" is comming from. I guess is means that
analysis goes backward from conditionals to definitions: it looks for
conditional driven by a PHI statement that has a constant value on some paths
and duplicates for those. This seems cheap and rather effective way of getting
good part of the threading oppurtunities (most expensive part is probably
identifying and walking paths that will not be threaded at the end).

BTW I wonder if the same analysis can't be done for other instructions where constant
operands are very profitable, like division or multiplication.

Honza