From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <aldyh@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [216.205.24.124])
 by sourceware.org (Postfix) with ESMTP id A3681398502E
 for <gcc@gcc.gnu.org>; Wed,  9 Jun 2021 11:48:58 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A3681398502E
Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com
 [209.85.221.71]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-565-nkFYjTLON8CakqGyNr5x4Q-1; Wed, 09 Jun 2021 07:48:57 -0400
X-MC-Unique: nkFYjTLON8CakqGyNr5x4Q-1
Received: by mail-wr1-f71.google.com with SMTP id
 h10-20020a5d688a0000b0290119c2ce2499so5710590wru.19
 for <gcc@gcc.gnu.org>; Wed, 09 Jun 2021 04:48:57 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:subject:to:cc:message-id:date:user-agent
 :mime-version:content-language:content-transfer-encoding;
 bh=nI1uWTUzjnLV+ZiDvWJMwPHPRSZisnsi+BLTJBfNFso=;
 b=sw7sScmoBaXz5eqlrry9ydfLDcgg2eMplt4wdi7Jtu66OsjNQsHWNhqSFBxSj/5/vT
 Qgezmg2Jxiaow/PD7awkfhXIEVYflz4841N7ZkGqcOBmFx/W+dDBtmqBbeKyGtq7Z1rt
 6TlqW7ZpNvZHHvWlGvoOoJ+EajYODxzS0V8KLBY6KqixYuzgZ4Rn8MI65hV0uAOnYUjK
 JBZPlvKHAlr/f0rNNtPAPWQ6+qRrl1jtxeWENEju8Hl8DrskRBYYkAl3AwzHwYnv0NMl
 6adtJJaJN+aSzmyooIEccugVJFs5/H9dnA8faeJ8U2w/2Z4xkycbBkV9kRJGOR1phTnZ
 Yzcg==
X-Gm-Message-State: AOAM5318/f4KKatMm1+lhjosPVfxuibCVGiU6MaSo/yBK1f92Tc5orI5
 lXp6NAkfzS8Qd5QLXmadXT0WrlfB+H+kTBjGgxfKr6dJ7RmNhjs70HCeAKCqSlAKXCuihMNx0U7
 n3UhBJ8mI9MYmRpEZChpEwnYT38ixZEQxIypHNzPg1YMl+EhOr5T6eC8=
X-Received: by 2002:a05:600c:4f0b:: with SMTP id
 l11mr9205733wmq.126.1623239335827; 
 Wed, 09 Jun 2021 04:48:55 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJwmfV0U1UgXu/efyn9hq9Z9cpBfA+5gj0RbxY99yuRTniScnZKQmxABzNTeZyJCO3aB6rSghA==
X-Received: by 2002:a05:600c:4f0b:: with SMTP id
 l11mr9205703wmq.126.1623239335477; 
 Wed, 09 Jun 2021 04:48:55 -0700 (PDT)
Received: from abulafia.quesejoda.com ([95.169.237.215])
 by smtp.gmail.com with ESMTPSA id x20sm16252925wmc.39.2021.06.09.04.48.54
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Wed, 09 Jun 2021 04:48:55 -0700 (PDT)
From: Aldy Hernandez <aldyh@redhat.com>
Subject: replacing the backwards threader and more
To: Jeff Law <jeffreyalaw@gmail.com>
Cc: Andrew MacLeod <amacleod@redhat.com>, GCC Mailing List <gcc@gcc.gnu.org>
Message-ID: <07775b9d-b8eb-48cb-57ef-9cc278d38967@redhat.com>
Date: Wed, 9 Jun 2021 13:48:54 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.8.1
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-3.7 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF,
 RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4,
 RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=no autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc mailing list <gcc.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <mailto:gcc-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 09 Jun 2021 11:49:00 -0000

Hi Jeff.  Hi folks.

What started as a foray into severing the old (forward) threader's 
dependency on evrp turned into a rewrite of the backwards threader 
code.  I'd like to discuss the possibility of replacing the current 
backwards threader with a new one that gets far more threads and can 
potentially subsume all threaders in the future.

I won't include code here, as it will just detract from the high level 
discussion.  But if it helps, I could post what I have, which just needs 
some cleanups and porting to the latest trunk changes Andrew has made.

Currently the backwards threader works by traversing DEF chains through 
PHIs leading to possible paths that start in a constant.  When such a 
path is found, it is checked to see if it is profitable, and if so, the 
constant path is threaded.  The current implementation is rather limited 
since backwards paths must end in a constant.  For example, the 
backwards threader can't get any of the tests in 
gcc.dg/tree-ssa/ssa-thread-14.c:

   if (a && b)
     foo ();
   if (!b && c)
     bar ();

etc.
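
As an illustration of what threading this example would mean, here is a
hand-written C++ sketch (not compiler output).  The trace-appending foo
and bar, and the names original, threaded and versions_agree, are made up
for the comparison, and the sketch assumes b is not modified by foo.  Once
the first test of b has gone one way, the second test of b is already
decided, so the threaded version jumps straight past it:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for foo () and bar (): they append to a trace
// so the two versions below can be compared.
static void foo (std::string &trace) { trace += 'f'; }
static void bar (std::string &trace) { trace += 'b'; }

// The source as written: b is tested twice.
std::string original (bool a, bool b, bool c)
{
  std::string trace;
  if (a && b)
    foo (trace);
  if (!b && c)
    bar (trace);
  return trace;
}

// Hand-threaded version: once we know which way the first test of b
// went, the second test of b is decided and can be skipped.
std::string threaded (bool a, bool b, bool c)
{
  std::string trace;
  if (a)
    {
      if (b)
        {
          foo (trace);   // b is true, so !b && c is known false:
          return trace;  // jump past the second conditional.
        }
      // b is false, so !b is known true: only c needs testing.
      if (c)
        bar (trace);
      return trace;
    }
  // Nothing was learned about b on this path; test as written.
  if (!b && c)
    bar (trace);
  return trace;
}

// Returns true when both versions agree on all eight inputs.
bool versions_agree ()
{
  for (int i = 0; i < 8; ++i)
    {
      bool a = i & 1, b = i & 2, c = i & 4;
      if (original (a, b, c) != threaded (a, b, c))
        return false;
    }
  return true;
}
```

The knowledge used here comes from the outcome of an earlier conditional
rather than from a constant reaching a PHI.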

After my refactoring patches to the threading code, it is now possible 
to drop in an alternate implementation that shares the profitability 
code (is this path profitable?), the jump registry, and the actual jump 
threading code.  I have leveraged this to write a ranger-based threader 
that gets every single thread the current code gets, plus 90-130% more.

Here are the details from the branch, which should be very similar to 
trunk.  I'm presenting the branch numbers because they contain Andrew's 
upcoming relational query which significantly juices up the results.

New threader:
   ethread:             65043  (+3.06%)
   dom:                 32450  (-13.3%)
   backwards threader:  72482  (+89.6%)
   vrp:                 40532  (-30.7%)
   Total threaded:     210507  (+6.70%)

This means that the new code gets 89.6% more jump threading 
opportunities than the code I want to replace.  In doing so, it reduces 
the number of threading opportunities left for DOM by 13.3% and for the 
VRP jump threader by 30.7%.  The total improvement across all jump 
threading opportunities in the compiler is 6.70%.

However, these are pessimistic numbers...

I have noticed that some of the threading opportunities that DOM and VRP 
now get are not because they're smarter, but because they're picking up 
opportunities that the new code exposes.  I experimented with running an 
iterative threader, and then seeing what VRP and DOM could actually get.  
This is too expensive to do in real life, but it at least shows what 
the effect of the new code is on DOM/VRP's abilities:

Iterative threader:
   ethread:             65043  (+3.06%)
   dom:                 31170  (-16.7%)
   thread:              86717  (+127%)
   vrp:                 33851  (-42.2%)
   Total threaded:     216781  (+9.90%)

This means that the new code not only gets 127% more cases, but it 
reduces the DOM and VRP opportunities considerably (16.7% and 42.2% 
respectively).   The end result is that we have the possibility of 
getting almost 10% more jump threading opportunities in the entire 
compilation run.

(Note that the new code gets even more opportunities, but I'm only 
reporting the profitable ones that made it all the way through to the 
threader backend, and actually eliminated a branch.)

The overall compilation hit from this work is currently 1.38% as 
measured by callgrind.  We should be able to reduce this a bit, plus we 
could get some of that back if we can replace the DOM and VRP threaders 
(future work).

My proposed implementation should be able to get any threading 
opportunity, and will get more as range-ops and ranger improve.

I can go into the details if necessary, but the gist of it is that we 
leverage the import facility in the ranger to only look up paths that 
have a direct repercussion in the conditional being threaded, thus 
reducing the search space.  This enhanced path discovery, plus an engine 
to resolve conditionals based on knowledge from a CFG path, is all that 
is needed to register new paths.  There is no limit to how far back we 
look, though in practice, we stop looking once a path is too expensive 
to continue the search in a given direction.
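
A rough sketch of the cut-off part of that search, in C++ on a toy CFG.
Everything here (toy_cfg, the per-block cost numbers, walk_back,
candidate_paths, make_diamond) is hypothetical and not GCC code, and it
leaves out the import filtering entirely.  It only shows walking
predecessors backwards from the block holding the conditional and
abandoning a direction once the accumulated cost exceeds a budget:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy CFG: preds[b] lists the predecessors of block b and cost[b] is a
// made-up size estimate for copying block b.
struct toy_cfg
{
  std::vector<std::vector<int> > preds;
  std::vector<int> cost;
};

// Depth-first walk over predecessors.  PATH is kept in reverse order,
// so path[0] is the exit block.
static void walk_back (const toy_cfg &cfg, int bb, int spent, int budget,
                       std::vector<int> &path, std::vector<bool> &on_path,
                       std::vector<std::vector<int> > &found)
{
  spent += cfg.cost[bb];
  if (spent > budget)
    return;                     // Too expensive: stop in this direction.
  path.push_back (bb);
  on_path[bb] = true;
  if (path.size () > 1)
    found.push_back (path);     // Every affordable prefix is a candidate.
  for (std::size_t i = 0; i < cfg.preds[bb].size (); ++i)
    {
      int pred = cfg.preds[bb][i];
      if (!on_path[pred])       // Do not loop through a block twice.
        walk_back (cfg, pred, spent, budget, path, on_path, found);
    }
  on_path[bb] = false;
  path.pop_back ();
}

// All candidate paths ending in EXIT_BB whose total cost fits BUDGET.
std::vector<std::vector<int> >
candidate_paths (const toy_cfg &cfg, int exit_bb, int budget)
{
  std::vector<std::vector<int> > found;
  std::vector<int> path;
  std::vector<bool> on_path (cfg.preds.size (), false);
  walk_back (cfg, exit_bb, 0, budget, path, on_path, found);
  return found;
}

// Example CFG: a diamond 0 -> {1, 2} -> 3 where block 2 is expensive.
toy_cfg make_diamond ()
{
  toy_cfg cfg;
  cfg.preds.resize (4);
  cfg.preds[1].push_back (0);
  cfg.preds[2].push_back (0);
  cfg.preds[3].push_back (1);
  cfg.preds[3].push_back (2);
  cfg.cost.assign (4, 1);
  cfg.cost[2] = 5;
  return cfg;
}
```

On the diamond from make_diamond, candidate_paths (cfg, 3, 3) keeps the
two paths through block 1 and drops everything through the expensive
block 2; with a budget of 10 all four paths come back.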

The solver API is simple:

// This class is a thread path solver.  Given a set of BBs indicating
// a path through the CFG, range_in_path() will return the range
// of an SSA name as if the BBs in the path had been executed in
// order.
//
// Note that the blocks are in reverse order, thus the exit block is
// path[0].

class thread_solver : gori_compute
{

public:
   thread_solver (gimple_ranger &ranger);
   virtual ~thread_solver ();
   void set_path (const vec<basic_block> *, const bitmap_head *imports);
   void range_in_path (irange &, tree name);
   void range_in_path (irange &, gimple *);
...
};

Basically, as we're discovering paths, we ask the solver what the value 
of the final conditional in a BB is in a given path.  If it resolves, we 
register the path.
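
To make the "ask the solver, register if it resolves" step concrete, here
is a toy C++ model.  It is not the thread_solver class above and uses no
GCC types: interval, path_block, branch_result and resolve_less_than are
hypothetical names, a single integer x stands in for an SSA name, and each
block on the path carries whatever the edge into it says about x.  As in
the real API, the path is stored in reverse order with the exit block
first:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Toy closed interval [lo, hi]; a stand-in for a real range class.
struct interval
{
  int lo;
  int hi;
};

// One block on a candidate path, plus what the edge taken into it
// tells us about x.
struct path_block
{
  int index;
  interval known;
};

enum class branch_result { unknown, taken, not_taken };

// Range of x as if every block in PATH had been executed.  PATH is in
// reverse order (path[0] is the exit block); intersection does not
// depend on the order.
interval range_in_path (const std::vector<path_block> &path)
{
  interval r = { INT_MIN, INT_MAX };
  for (const path_block &bb : path)
    {
      r.lo = std::max (r.lo, bb.known.lo);
      r.hi = std::min (r.hi, bb.known.hi);
    }
  return r;
}

// Try to resolve the final conditional "x < bound" along PATH.
branch_result resolve_less_than (const std::vector<path_block> &path,
                                 int bound)
{
  interval r = range_in_path (path);
  if (r.lo > r.hi)
    return branch_result::unknown;    // Infeasible path: nothing to do.
  if (r.hi < bound)
    return branch_result::taken;      // Every possible x is below bound.
  if (r.lo >= bound)
    return branch_result::not_taken;  // No possible x is below bound.
  return branch_result::unknown;      // Range straddles bound.
}
```

For a path whose edges establish x <= 9 and x >= 0,
resolve_less_than (path, 10) returns branch_result::taken, so the path
would be registered; with a bound of 5 it returns branch_result::unknown
and the path is dropped.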

A follow-up project would be to analyze what DOM/VRP are actually 
getting that we don't, because in theory with an enhanced ranger, we 
should be able to get everything they do (minus some float stuff, and 
some CSE things DOM does).  However, IMO, this is good enough to at 
least replace the current backwards threading code.

My suggestion would be to keep both implementations, defaulting to the 
ranger-based one and running the old code immediately afterwards, 
trapping if the old code finds any threading opportunities the new code 
missed.  After a few weeks, we could kill the old code.

Thoughts?

Aldy

p.s. BTW, ranger-based is technically a misnomer.  It's gori based.  We 
don't need the entire ranger caching ability here.  I'm only using it to 
get the imports for the interesting conditionals, since those are static.