From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rguenther@suse.de>
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29])
 by sourceware.org (Postfix) with ESMTPS id 717E83858D1E
 for <gcc-patches@gcc.gnu.org>; Wed, 17 Aug 2022 07:53:59 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 717E83858D1E
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
 by smtp-out2.suse.de (Postfix) with ESMTP id 58AE61FAD7;
 Wed, 17 Aug 2022 07:53:58 +0000 (UTC)
Received: from wotan.suse.de (wotan.suse.de [10.160.0.1])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by relay2.suse.de (Postfix) with ESMTPS id 512482C174;
 Wed, 17 Aug 2022 07:53:58 +0000 (UTC)
Date: Wed, 17 Aug 2022 07:53:58 +0000 (UTC)
From: Richard Biener <rguenther@suse.de>
To: Aldy Hernandez <aldyh@redhat.com>
cc: gcc-patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Refactor back_threader_profitability
In-Reply-To: <CAGm3qMWV3RoEKA3vRChHunqLp+_GjW28Rd052ELSe8ZzyHeRFQ@mail.gmail.com>
Message-ID: <nycvar.YFH.7.77.849.2208170746280.13569@jbgna.fhfr.qr>
References: <81970.122081610045900133@us-mta-26.us.mimecast.lan>
 <CAGm3qMWV3RoEKA3vRChHunqLp+_GjW28Rd052ELSe8ZzyHeRFQ@mail.gmail.com>
User-Agent: Alpine 2.22 (LSU 394 2020-01-19)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 17 Aug 2022 07:54:00 -0000

On Wed, 17 Aug 2022, Aldy Hernandez wrote:

> I just have a few high level comments.
> 
> On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > The following refactors profitable_path_p in the backward threader,
> > splitting out parts that can be computed once the exit block is known,
> > parts that contiguously update and that can be checked allowing
> > for the path to be later identified as FSM with larger limits,
> > possibly_profitable_path_p, and final checks done when the whole
> > path is known, profitable_path_p.
> 
> I thought we were removing references to FSM, as they were leftovers
> from some previous incarnation.  For that matter, I don't think I ever
> understood what they are, so if we're gonna keep them, could you
> comment what makes FSM threads different from other threads?

I don't know exactly what 'FSM' stands for but the FSM threader
specifically tried to cover

for (;;)
 {
   switch (state)
    {
    case 1:
      /* sth */
      state = 3;
      break;
    ...
    case 3:
      ...
    }
}

so optimizing state machine transitions.  That is, these are all
multiway (switch or goto, but goto support has been removed with the
ranger rewrite it seems) and the thread path going through the
loop backedge.  This is why FSM has different size limits because
it was thought those threadings are especially worthwhile and
they tend to be larger.  To make discovery cheaper a TODO item
would be to skip to the loop backedge when we reach the regular
thread limit and only continue the real search there (and
pick the "smallest" ways through any diamonds on the way).

> In your possibly_profitable_path_p function, could you document a bit
> better what's the difference between profitable_path_p and
> possibly_profitable_path_p?

Sure.

> >
> > I've removed the back_threader_profitability instance from the
> > back_threader class and instead instantiate it once per path
> > discovery.  I've kept the size compute non-incremental to simplify
> > the patch and not worry about unwinding.
> >
> > There's key changes to previous behavior - namely we apply
> > the param_max_jump_thread_duplication_stmts early only when
> > we know the path cannot become an FSM one (multiway + thread through
> > latch) but make sure to elide the path query when we we didn't
> > yet discover that but are over this limit.  Similarly the
> > speed limit is now used even when we did not yet discover a
> > hot BB on the path.  Basically the idea is to only stop path
> > discovery when we know the path will never become profitable
> > but avoid the expensive path range query when we know it's
> > currently not.
> >
> > I've done a few cleanups, merging functions, on the way.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Statistics show an overall slight increase in threading but
> > looking at different files theres noise up and down.  That's
> > somewhat expected since we now are applying the "more correct"
> > limits in the end.  Unless I made big mistakes of course.
> >
> > The next thing cost-wise would be to raise the backwards
> > threading limit to the limit of DOM so we don't get
> > artificial high counts for that.
> 
> The DOM threader has limits?  I thought most of those limits were just
> due to the fact that it couldn't determine long enough paths?  Either
> way, I like that we're merging the necessary forward threader bits
> here, in preparation for its demise ;-).

Both use param_max_jump_thread_duplication_stmts, but the backwards
threader applies this limit to non-FSM threads with

  /* The generic copier used by the backthreader does not re-use an
     existing threading path to reduce code duplication.  So for that
     case, drastically reduce the number of statements we are allowed
     to copy.  */
  if (!(threaded_through_latch && threaded_multiway_branch)
      && (n_insns * param_fsm_scale_path_stmts
          >= param_max_jump_thread_duplication_stmts))

and param_fsm_scale_path_stmts happens to be two.  I have no idea
why we apply this scaling, the scaling is otherwise used in

  /* We avoid creating irreducible inner loops unless we thread through
     a multiway branch, in which case we have deemed it worth losing
     other loop optimizations later.

     We also consider it worth creating an irreducible inner loop if
     the number of copied statement is low relative to the length of
     the path -- in that case there's little the traditional loop
     optimizer would have done anyway, so an irreducible loop is not
     so bad.  */
  if (!threaded_multiway_branch
      && creates_irreducible_loop
      && *creates_irreducible_loop
      && (n_insns * (unsigned) param_fsm_scale_path_stmts
          > (m_path.length () *
             (unsigned) param_fsm_scale_path_blocks)))

so my idea is to drop the scaling from applying the
param_max_jump_thread_duplication_stmts limit which raises the
effective default limit from 15 / 2 to 15, just like what DOM
applies (where DOM also estimates some optimization on the path,
reducing the stmt count).

> Looks good.

Thanks - I'll adjust the commentary and push.

Richard.