From: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
Date: Mon, 13 Nov 2023 19:22:13 +0200
Subject: Re: [PATCH] aarch64: New RTL optimization pass avoid-store-forwarding.
To: Manos Anagnostakis, Richard Sandiford
Cc: Philipp Tomsich, Christoph Muellner, Kyrylo Tkachov, Manolis Tsamis, gcc-patches@gcc.gnu.org

Hi Richard,

thank you for reviewing the patch.

On Sat, Nov 11, 2023 at 6:57 PM Richard Sandiford wrote:
> Thanks for the patch.
>
> Manos Anagnostakis writes:
> > This is an RTL pass that detects store forwarding from stores to larger
> > loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for
> > some benchmarks, through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldp  d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldr  d31, [sp, #312]
> >   ldr  d17, [sp, #320]
>
> For this particular case, it would be good to have a spill forwarding
> pass that inserts:
>
>   mov d17, d5
>
> before the fmul (possibly with further clean-up beyond that).
>
> But I realise the patch handles the general case where that isn't
> possible.  The patch can also afford to be approximate.
>
> Some of the insn recognition code will probably need to be updated for
> Alex's pass that forms LDP/STP, since that's going to have writeback
> support.  We'll also need to arrange a pass ordering that works for both
> passes.
>
> The patch generally looks good, and it's nicely compact.  Some specific
> comments below:
>
> > Currently, the pass is disabled by default on all architectures and
> > enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on
> > ampere1/ampere1a, or other architectures as well, without needing to
> > be turned on by this option.
> >
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> > 	* alias.cc (memrefs_conflict_p): Expose static function.
> > 	* alias.h (memrefs_conflict_p): Expose static function.
> > 	* config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> > 	* config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> > 	* config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
> > 	* config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> > 	(aarch64-store-forwarding-threshold): New param.
> > 	* config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
> > 	* doc/invoke.texi: Document new option and new param.
> > 	* config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> > 	* gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> > 	* gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis
> > Co-Authored-By: Manolis Tsamis
> > Co-Authored-By: Philipp Tomsich
> > ---
> >  gcc/alias.cc                                  |   2 +-
> >  gcc/alias.h                                   |   1 +
> >  gcc/config.gcc                                |   1 +
> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >  .../aarch64/aarch64-store-forwarding.cc       | 347 ++++++++++++++++++
> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >  gcc/doc/invoke.texi                           |  12 +-
> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >  12 files changed, 481 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/alias.cc b/gcc/alias.cc
> > index 86d8f7104ad..303683f85e3 100644
> > --- a/gcc/alias.cc
> > +++ b/gcc/alias.cc
> > @@ -2488,7 +2488,7 @@ offset_overlap_p (poly_int64 c, poly_int64 xsize, poly_int64 ysize)
> >     one for X + non-constant and Y + non-constant when X and Y are equal.
> >     If that is fixed the TBAA hack for union type-punning can be removed.  */
> >
> > -static int
> > +int
> >  memrefs_conflict_p (poly_int64 xsize, rtx x, poly_int64 ysize, rtx y,
> >  		    poly_int64 c)
> >  {
> > diff --git a/gcc/alias.h b/gcc/alias.h
> > index ab06ac9055f..49836f7d808 100644
> > --- a/gcc/alias.h
> > +++ b/gcc/alias.h
> > @@ -41,6 +41,7 @@ bool alias_ptr_types_compatible_p (tree, tree);
> >  int compare_base_decls (tree, tree);
> >  bool refs_same_for_tbaa_p (tree, tree);
> >  bool mems_same_for_tbaa_p (rtx, rtx);
> > +int memrefs_conflict_p (poly_int64, rtx, poly_int64, rtx, poly_int64);
> >
> >  /* This alias set can be used to force a memory to conflict with all
> >     other memories, creating a barrier across which no memory reference
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index ba6d63e33ac..ae50d36004f 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >  	cxx_target_objs="aarch64-c.o"
> >  	d_target_objs="aarch64-d.o"
> >  	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > +	extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >  	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >  	target_has_targetm_common=yes
> >  	;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > index 6ace797b738..fa79e8adca8 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 60a55f4bc19..5a7e68f22e5 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1049,6 +1049,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> >  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..30148fb4435
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,347 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2015-2023 Free Software Foundation, Inc.
>
> Just checking the copyright: does the file include code from other
> files?  The line above is OK if so, otherwise I guess it should
> start at 2023.
>
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_VECTOR
> > +#define INCLUDE_MAP
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to larger
> > +   loads (load pairs).  For example, it can transform cases like
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > +
> > +   to
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldr  d31, [sp, #312]
> > +   ldr  d17, [sp, #320]
> > +
> > +   Design: The pass follows a straightforward design.  It starts by
> > +   initializing the alias analysis and the cselib.  Both of these are
> > +   used to find stores and larger loads with overlapping addresses, which
> > +   are candidates for store forwarding optimizations.  It then scans at
> > +   basic block level to find stores that forward to larger loads and
> > +   handles them accordingly as described in the above example.  Finally,
> > +   the alias analysis and the cselib library are closed.  */
> > +
> > +typedef std::vector<std::pair<rtx_insn *, rtx>> vec_rtx_pair;
> > +typedef std::map<rtx, uint32_t> map_rtx_dist;
> > +
> > +/* Statistics counters.  */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default.  */
> > +static rtx dummy;
> > +static bool is_store (rtx expr, rtx &op_0 = dummy);
> > +static bool is_load (rtx expr, rtx &op_1 = dummy);
> > +
> > +/* Return true if EXPR is a store; otherwise false.  OP_0 will contain
> > +   the MEM operand of the store.  */
>
> Probably worth saying "Return true if SET expression EXPR...".
>
> > +static bool
> > +is_store (rtx expr, rtx &op_0)
> > +{
> > +  op_0 = XEXP (expr, 0);
>
> Using SET_DEST would be clearer.
>
> > +
> > +  if (GET_CODE (op_0) == ZERO_EXTEND
> > +      || GET_CODE (op_0) == SIGN_EXTEND)
> > +    op_0 = XEXP (op_0, 0);
> > +
> > +  return MEM_P (op_0);
> > +}
> > +
> > +/* Return true if EXPR is a load; otherwise false.  OP_1 will contain
> > +   the MEM operand of the load.  */
>
> Same as above.
>
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > +  op_1 = XEXP (expr, 1);
>
> SET_SRC.
>
> > +
> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > +    op_1 = XEXP (op_1, 0);
> > +
> > +  return MEM_P (op_1);
> > +}
> > +
> > +/* Check the maximum instruction distance threshold and, if crossed,
> > +   remove the store from STORE_EXPRS.  DISTANCES contains the instruction
> > +   distances.  */
> > +
> > +void
> > +update_store_load_distances (vec_rtx_pair &store_exprs,
> > +			     map_rtx_dist &distances)
> > +{
> > +  for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> > +    {
> > +      (distances[(*iter).second])++;
> > +      iter = distances[(*iter).second]
> > +	       > (uint32_t) aarch64_store_forwarding_threshold_param
> > +	     ? store_exprs.erase (iter) : iter + 1;
>
> What are the guidelines for picking a good param value?  It looks
> like this is called once for each instruction, without taking issue
> rate into account, so I assume the param value is supposed to account
> for that instead.

The param value specifies a threshold of RTL instructions that follow
each store.  We needed a way to count the instruction distance between
a store and a load in order to distinguish store forwardings that do
and do not have an impact on performance.  A store that forwards to a
"very far" load pair should have no impact.  We have chosen this method
for simplicity, despite the limitations and possible inaccuracies that
exist.  If you happen to know a more accurate way to track instruction
distance, I would gladly use it instead.

> > +    }
> > +}
> > +
> > +/* Return true if STORE_MEM is forwarding to LOAD_MEM; otherwise false.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem, rtx load_mem)
> > +{
> > +  gcc_checking_assert (MEM_P (store_mem) && MEM_P (load_mem));
> > +
> > +  rtx store_mem_addr = canon_rtx (get_addr (XEXP (store_mem, 0)));
> > +  rtx load_mem_addr = canon_rtx (get_addr (XEXP (load_mem, 0)));
> > +
> > +  return (memrefs_conflict_p (GET_MODE_SIZE (GET_MODE (store_mem)),
> > +			      store_mem_addr,
> > +			      GET_MODE_SIZE (GET_MODE (load_mem)),
> > +			      load_mem_addr, 0) == 1
> > +	  || rtx_equal_for_cselib_1 (store_mem_addr,
> > +				     load_mem_addr,
> > +				     GET_MODE (store_mem), 0));
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding to
> > +   it; otherwise false.  STORE_EXPRS contains the stores.  */
> > +
> > +static bool
> > +is_small_store_to_large_load (vec_rtx_pair &store_exprs, rtx_insn *insn)
> > +{
> > +  unsigned int load_count = 0;
> > +  bool forwarding = false;
> > +  rtx expr = PATTERN (insn);
> > +
> > +  if (GET_CODE (expr) != PARALLEL
> > +      || XVECLEN (expr, 0) != 2)
> > +    return false;
> > +
> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > +    {
> > +      rtx op_1;
> > +      rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > +      if (GET_CODE (out_exp) != SET)
> > +	continue;
> > +
> > +      if (!is_load (out_exp, op_1))
> > +	{
> > +	  load_count = 0;
> > +	  continue;
> > +	}
> > +
> > +      load_count++;
> > +
> > +      for (std::pair<rtx_insn *, rtx> str : store_exprs)
> > +	{
> > +	  if (!is_forwarding (XEXP (str.second, 0), op_1))
> > +	    continue;
> > +
> > +	  if (dump_file)
> > +	    {
> > +	      fprintf (dump_file,
> > +		       "Store forwarding to PARALLEL with loads:\n");
> > +	      fprintf (dump_file, "  From: ");
> > +	      print_rtl_single (dump_file, str.first);
> > +	      fprintf (dump_file, "  To: ");
> > +	      print_rtl_single (dump_file, insn);
> > +	    }
> > +
> > +	  forwarding = true;
> > +	}
> > +    }
> > +
> > +  if (load_count == 2)
> > +    stats_ldp_count++;
> > +
> > +  return load_count == 2 && forwarding;
> > +}
> > +
> > +/* If the memory register in EXPR is overwritten between the store and
> > +   the larger load, we need to avoid handling that case and thus we
> > +   remove the store from STORE_EXPRS.  */
> > +
> > +static void check_memory_reg_overwrite (rtx expr, vec_rtx_pair &store_exprs)
>
> Formatting nit: new line after "void".
>
> Could you explain in more detail why this is needed?  Doesn't cselib
> handle the definition itself?

Actually this is supposed to catch cases where a limitation might cause
cselib to return false while memrefs_conflict_p returns true in the
is_forwarding function.  In this case, memrefs_conflict_p will miss a
register overwrite that happened between the store and the load pair and
will report a false positive.  I'll re-evaluate to make sure this is
indeed needed, or whether we can do this by just utilizing cselib.

> > +{
> > +  if (REG_P (XEXP (expr, 0)))
> > +    for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> > +      iter = reg_mentioned_p (XEXP (expr, 0), XEXP ((*iter).second, 0))
> > +	     ? store_exprs.erase (iter) : iter + 1;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base source
> > +   address to load from is overwritten in the first load.  INSN should be
> > +   the PARALLEL of the load pair.  */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > +  rtx expr = PATTERN (insn);
> > +
> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> > +
> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > +  /* The base address was overwritten in the first load.  */
> > +  if (reg_mentioned_p (XEXP (load_0, 0), XEXP (load_1, 1)))
>
> SET_DEST and SET_SRC here.
>
> > +    return;
> > +
> > +  emit_insn_after (load_1, insn);
> > +  emit_insn_after (load_0, insn);
> > +  remove_insn (insn);
>
> It's probably safer to insert before the instruction, since that copes
> correctly with non-call exceptions (where a load could raise an
> exception).  It's unlikely to matter as things stand, but LDPs could be
> formed earlier in future.
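Good point.  To make sure I follow, an untested sketch of what I
understand you are suggesting (keeping the surrounding break_ldp logic
as is):

  /* Emit the two single loads ahead of the LDP so that, with
     -fnon-call-exceptions, a trapping load is not moved past the
     point where the original LDP could have trapped, then delete
     the LDP itself.  */
  emit_insn_before (load_0, insn);
  emit_insn_before (load_1, insn);
  remove_insn (insn);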
> > +
> > +  stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > +  rtx_insn *insn, *next;
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, cfun)
> > +    {
> > +      vec_rtx_pair store_exprs;
> > +      map_rtx_dist store_cnt_byte_dist;
> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> > +	{
> > +	  next = NEXT_INSN (insn);
> > +
> > +	  /* If we cross a CALL_P insn, clear the vectors, because the
> > +	     small-store-to-large-load is unlikely to cause a performance
> > +	     difference.  */
> > +	  if (CALL_P (insn))
> > +	    {
> > +	      store_exprs.clear ();
> > +	      store_cnt_byte_dist.clear ();
> > +	    }
> > +
> > +	  if (!NONJUMP_INSN_P (insn))
> > +	    continue;
> > +
> > +	  cselib_process_insn (insn);
> > +
> > +	  rtx expr = PATTERN (insn);
> > +
> > +	  if (GET_CODE (expr) == SET)
>
> Would be good to use single_set here, so that we pick up sets that are
> parallel with clobbers.
>
> > +	    {
> > +	      /* If a store is encountered, push it to the store_exprs
> > +		 vector to check it later.  */
> > +	      if (is_store (expr))
> > +		{
> > +		  store_exprs.push_back ({insn, expr});
> > +
> > +		  /* The store insn shouldn't have been encountered
> > +		     before.  */
> > +		  gcc_checking_assert (store_cnt_byte_dist.find (expr)
> > +				       == store_cnt_byte_dist.end ());
> > +		  store_cnt_byte_dist.insert ({ expr, 0 });
> > +		  stats_store_count++;
> > +		}
> > +
> > +	      check_memory_reg_overwrite (expr, store_exprs);
>
> Depending on the answer to the question above, we probably ought to
> use note_stores to iterate over all stores for check_memory_reg_overwrite.
> Or, alternatively:
>
>   vec_rtx_properties properties;
>   properties.add_insn (insn, false);
>   for (rtx_obj_reference ref : properties.refs ())
>     if (ref.is_write () && ref.is_reg ())
>       check_memory_reg_overwrite ...

Thanks for the pointer.  I may use it in v2, depending on the outcome
of the re-evaluation.

> > +	    }
> > +
> > +	  /* Check for small-store-to-large-load.  */
> > +	  if (is_small_store_to_large_load (store_exprs, insn))
> > +	    {
> > +	      stats_ssll_count++;
> > +	      break_ldp (insn);
> > +	    }
> > +
> > +	  update_store_load_distances (store_exprs, store_cnt_byte_dist);
> > +	}
> > +    }
>
> It looks like the loop carries information across BB boundaries.
> Is that deliberate, to try to detect forwarding in EBBs?  If so,
> I'm not sure whether BBs are guaranteed to be in fallthrough order.
>
> The information probably ought to be reset after non-fallthru.

I am not sure I understood exactly what you said here, but the vectors
are local to each BB iteration, so no information is carried across BBs.
If I am missing something, please let me know.

> Thanks,
> Richard
>
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > +  init_alias_analysis ();
> > +  cselib_init (0);
> > +  scan_and_transform_bb_level ();
> > +  end_alias_analysis ();
> > +  cselib_finish ();
> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > +			    stats_store_count);
> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > +			    stats_ldp_count);
> > +  statistics_counter_event (cfun,
> > +			    "Number of forwarding cases identified: ",
> > +			    stats_ssll_count);
> > +  statistics_counter_event (cfun, "Number of transformed cases: ",
> > +			    stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  0 /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return aarch64_flag_avoid_store_forwarding && optimize >= 2;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      execute_avoid_store_forwarding ();
> > +      return 0;
> > +    }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance.  */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..a9e944a272f 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >  Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > +Avoid store forwarding to load pairs.
> > +
> >  -param=aarch64-sve-compare-costs=
> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +aarch64-avoid-store-forwarding.
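For reference, given the option and param definitions above, the pass
would be exercised with something like the following (illustrative
invocation only; 20 is simply the Init default from the .opt entry):

  gcc -O2 -mavoid-store-forwarding \
      --param=aarch64-store-forwarding-threshold=20 test.c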
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index a9a244ab6d6..7639b50358d 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >  	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >  		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > +	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +		$(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 2caa7ceec40..ca30f52109a 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -792,7 +792,7 @@ Objective-C and Objective-C++ Dialects}.
> >  -moverride=@var{string}  -mverbose-cost-dump
> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> >  @emph{Adapteva Epiphany Options}
> >  @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> > @@ -16583,6 +16583,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >  source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +aarch64-avoid-store-forwarding.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > @@ -20558,6 +20563,11 @@ Generate code which uses only the general-purpose registers.  This will prevent
> >  the compiler from using floating-point and Advanced SIMD registers but will not
> >  impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +@option{-mavoid-store-forwarding} at levels @option{-O2}, @option{-O3}.
> > +
> >  @opindex mlittle-endian
> >  @item -mlittle-endian
> >  Generate little-endian code.  This is the default when GCC is configured for an
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > +	TYPE r, y; \
> > +	st_arr[0] = i; \
> > +	ld_arr[0] = dummy; \
> > +	r = st_arr_2[0]; \
> > +	y = st_arr_2[1]; \
> > +	return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +	TYPE r, y; \
> > +	st_arr[0] = i; \
> > +	ld_arr[0] = dummy; \
> > +	r = st_arr[10]; \
> > +	y = st_arr[11]; \
> > +	return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +	TYPE r, y; \
> > +	st_arr[0] = i; \
> > +	ld_arr[0] = dummy; \
> > +	r = st_arr[0]; \
> > +	y = st_arr[1]; \
> > +	return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.40.1

As for the rest of your points, it should be trivial to address them.

--
Manos Anagnostakis | Compiler Engineer
E: manos.anagnostakis@vrull.eu
VRULL GmbH | Beatrixgasse 32, 1030 Vienna
W: www.vrull.eu | LinkedIn