From: Manos Anagnostakis
Date: Mon, 4 Dec 2023 21:43:29 +0200
Subject: Re: [PATCH v4] aarch64: New RTL optimization pass avoid-store-forwarding.
To: Manos Anagnostakis, gcc-patches@gcc.gnu.org, Philipp Tomsich, Manolis Tsamis, Richard Sandiford

On Mon, 4 Dec 2023 at 21:22, Richard Sandiford <richard.sandiford@arm.com> wrote:

> Manos Anagnostakis writes:
> > This is an RTL pass that detects store forwarding from stores to larger
> > loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for
> > some benchmarks, through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldp  d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldr  d31, [sp, #312]
> >   ldr  d17, [sp, #320]
> >
> > Currently, the pass is disabled by default on all architectures and
> > enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on
> > ampere1/ampere1a, or other architectures as well, without needing to be
> > turned on by this option.
> >
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> >       * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >       * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> >       * config/aarch64/aarch64-protos.h
> >       (make_pass_avoid_store_forwarding): Declare.
> >       * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> >       (aarch64-store-forwarding-threshold): New param.
> >       * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
> >       * doc/invoke.texi: Document new option and new param.
> >       * config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >       * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >       * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis
> > Co-Authored-By: Manolis Tsamis
> > Co-Authored-By: Philipp Tomsich
> > ---
> > Changes in v4:
> >       - I had problems making cselib_subst_to_values work correctly,
> >         so I used cselib_lookup to implement the exact same behaviour and
> >         record the store value at the time we iterate over it.
> >       - Removed the store/load_mem_addr check from is_forwarding as
> >         unnecessary.
> >       - The pass is called on all optimization levels right now.
> >       - The threshold check should remain as it is, as we only care about
> >         the front element of the list.  The comment above the check
> >         explains why a single if is enough.
>
> I still think this is structurally better as a while.  There's no reason
> in principle why we wouldn't want to record the stores in:
>
>   stp x0, x1, [x4, #8]
>   ldp x0, x1, [x4, #0]
>   ldp x2, x3, [x4, #16]
>
> and then the two stores should have the same distance value.
> I realise we don't do that yet, but still.
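For reference, the s/if/while/ change under discussion would look roughly
like this against the expiry check in scan_and_transform_bb_level (a sketch
only; the names are the ones used in the patch below):

      while (!store_exprs.empty ()
	     && (insn_cnt - store_exprs.front ().insn_cnt
		 > (unsigned int) aarch64_store_forwarding_threshold_param))
	store_exprs.pop_front ();

With a while loop, several recorded stores that end up past the threshold at
the same insn would all be dropped in one go, rather than one per scanned insn.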
Ah, you mean forwarding from stp.  I was a bit confused about what you meant
the previous time.  This was not initially meant for this patch, but I think
it wouldn't take long to implement that before pushing this.  It is your
call, of course, whether I should include it.

> > - The documentation changes requested.
> > - Adjusted a comment.
> >
> >  gcc/config.gcc                                |   1 +
> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >  .../aarch64/aarch64-store-forwarding.cc       | 321 ++++++++++++++++++
> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >  gcc/doc/invoke.texi                           |  11 +-
> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >  10 files changed, 452 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index 748430194f3..2ee3b61c4fa 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >       cxx_target_objs="aarch64-c.o"
> >       d_target_objs="aarch64-d.o"
> >       extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > +     extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >       target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >       target_has_targetm_common=yes
> >       ;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > index 6ace797b738..fa79e8adca8 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index d2718cc87b3..7d9dfa06af9 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1050,6 +1050,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> >  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..ae3cbe519cd
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,321 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_LIST
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to larger
> > +   loads (load pairs).  For example, it can transform cases like
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > +
> > +   to
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldr  d31, [sp, #312]
> > +   ldr  d17, [sp, #320]
> > +
> > +   Design: The pass follows a straightforward design.  It starts by
> > +   initializing the alias analysis and the cselib.  Both of these are used
> > +   to find stores and larger loads with overlapping addresses, which are
> > +   candidates for store forwarding optimizations.  It then scans at basic
> > +   block level to find stores that forward to larger loads and handles them
> > +   accordingly as described in the above example.  Finally, the alias
> > +   analysis and the cselib library are closed.  */
> > +
> > +typedef struct
> > +{
> > +  rtx_insn *store_insn;
> > +  rtx store_mem_addr;
> > +  unsigned int insn_cnt;
> > +} store_info;
> > +
> > +typedef std::list<store_info> list_store_info;
> > +
> > +/* Statistics counters.  */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default.  */
> > +static rtx dummy;
> > +static bool is_load (rtx expr, rtx &op_1 = dummy);
> > +
> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> > +
> > +static bool
> > +is_store (rtx expr)
> > +{
> > +  return MEM_P (SET_DEST (expr));
> > +}
> > +
> > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> > +   contain the MEM operand of the load.  */
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > +  op_1 = SET_SRC (expr);
> > +
> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > +    op_1 = XEXP (op_1, 0);
> > +
> > +  return MEM_P (op_1);
> > +}
> > +
> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> > +   STORE_MEM_ADDR.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> > +{
> > +  /* Sometimes we do not have the proper value.  */
> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > +    return false;
> > +
> > +  gcc_checking_assert (MEM_P (load_mem));
> > +
> > +  rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
> > +  machine_mode load_mem_mode = GET_MODE (load_mem);
> > +  load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
> > +                                 load_mem_mode)->val_rtx;
>
> Like I said in the previous review, it shouldn't be necessary to do any
> manual lookup on the load address.  rtx_equal_for_cselib_1 does the
> lookup itself.  Does that not work?

I thought you meant only that the if check was redundant here, which it was.
I'll reply on whether cselib can handle the load all by itself.

Thanks for the review!

Manos.

> The patch is OK with the four lines above deleted, if that works,
> and with s/if/while/.  But please reply if that combination doesn't work.
>
> Thanks,
> Richard
>
> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> > +                                 load_mem_addr,
> > +                                 store_mem_mode, 0);
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> > +   otherwise false.  STORE_EXPRS contains the stores.  */
> > +
> > +static bool
> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> > +{
> > +  unsigned int load_count = 0;
> > +  bool forwarding = false;
> > +  rtx expr = PATTERN (insn);
> > +
> > +  if (GET_CODE (expr) != PARALLEL
> > +      || XVECLEN (expr, 0) != 2)
> > +    return false;
> > +
> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > +    {
> > +      rtx op_1;
> > +      rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > +      if (GET_CODE (out_exp) != SET)
> > +       continue;
> > +
> > +      if (!is_load (out_exp, op_1))
> > +       continue;
> > +
> > +      load_count++;
> > +
> > +      for (store_info str : store_exprs)
> > +       {
> > +         rtx store_insn = str.store_insn;
> > +
> > +         if (!is_forwarding (str.store_mem_addr, op_1,
> > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> > +           continue;
> > +
> > +         if (dump_file)
> > +           {
> > +             fprintf (dump_file,
> > +                      "Store forwarding to PARALLEL with loads:\n");
> > +             fprintf (dump_file, "  From: ");
> > +             print_rtl_single (dump_file, store_insn);
> > +             fprintf (dump_file, "  To: ");
> > +             print_rtl_single (dump_file, insn);
> > +           }
> > +
> > +         forwarding = true;
> > +       }
> > +    }
> > +
> > +  if (load_count == 2)
> > +    stats_ldp_count++;
> > +
> > +  return load_count == 2 && forwarding;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base source
> > +   address to load from is overwritten in the first load.  INSN should be the
> > +   PARALLEL of the load pair.  */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > +  rtx expr = PATTERN (insn);
> > +
> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> > +
> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > +  /* The base address was overwritten in the first load.  */
> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> > +    return;
> > +
> > +  emit_insn_before (load_0, insn);
> > +  emit_insn_before (load_1, insn);
> > +  remove_insn (insn);
> > +
> > +  stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > +  rtx_insn *insn, *next;
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, cfun)
> > +    {
> > +      list_store_info store_exprs;
> > +      unsigned int insn_cnt = 0;
> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> > +       {
> > +         next = NEXT_INSN (insn);
> > +
> > +         /* If we cross a CALL_P insn, clear the list, because the
> > +            small-store-to-large-load is unlikely to cause a performance
> > +            difference.  */
> > +         if (CALL_P (insn))
> > +           store_exprs.clear ();
> > +
> > +         if (!NONJUMP_INSN_P (insn))
> > +           continue;
> > +
> > +         cselib_process_insn (insn);
> > +
> > +         rtx expr = single_set (insn);
> > +
> > +         /* If a store is encountered, append it to the store_exprs list
> > +            to check it later.  */
> > +         if (expr && is_store (expr))
> > +           {
> > +             rtx store_mem = SET_DEST (expr);
> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> > +             store_mem_addr = cselib_lookup (store_mem_addr,
> > +                                             store_mem_mode, 1,
> > +                                             store_mem_mode)->val_rtx;
> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt++ });
> > +             stats_store_count++;
> > +           }
> > +
> > +         /* Check for small-store-to-large-load.  */
> > +         if (is_small_store_to_large_load (store_exprs, insn))
> > +           {
> > +             stats_ssll_count++;
> > +             break_ldp (insn);
> > +           }
> > +
> > +         /* Pop the first store from the list if its distance crosses the
> > +            maximum accepted threshold.  The list contains unique values
> > +            sorted in ascending order, meaning that only one distance can
> > +            be off at a time.  */
> > +         if (!store_exprs.empty ()
> > +             && (insn_cnt - store_exprs.front ().insn_cnt
> > +                 > (unsigned int) aarch64_store_forwarding_threshold_param))
> > +           store_exprs.pop_front ();
> > +       }
> > +    }
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > +  init_alias_analysis ();
> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> > +  scan_and_transform_bb_level ();
> > +  end_alias_analysis ();
> > +  cselib_finish ();
> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > +                           stats_store_count);
> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > +                           stats_ldp_count);
> > +  statistics_counter_event (cfun,
> > +                           "Number of forwarding cases identified: ",
> > +                           stats_ssll_count);
> > +  statistics_counter_event (cfun, "Number of transformed cases: ",
> > +                           stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  0 /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return aarch64_flag_avoid_store_forwarding;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      execute_avoid_store_forwarding ();
> > +      return 0;
> > +    }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance.  */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..e4498d53b46 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >  Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > +Avoid store forwarding to load pairs.
> > +
> >  -param=aarch64-sve-compare-costs=
> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair for this to be
> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index a9a244ab6d6..7639b50358d 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >               $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > +    $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +            $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 2b51ff304f6..39dbc04207e 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -798,7 +798,7 @@ Objective-C and Objective-C++ Dialects}.
> >  -moverride=@var{string}  -mverbose-cost-dump
> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> >  @emph{Adapteva Epiphany Options}
> >  @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> > @@ -16738,6 +16738,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >  source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +@option{-mavoid-store-forwarding}.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > @@ -20763,6 +20768,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
> >  the compiler from using floating-point and Advanced SIMD registers but will not
> >  impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +
> >  @opindex mlittle-endian
> >  @item -mlittle-endian
> >  Generate little-endian code.  This is the default when GCC is
> >  configured for an
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > +  TYPE r, y; \
> > +  st_arr[0] = i; \
> > +  ld_arr[0] = dummy; \
> > +  r = st_arr_2[0]; \
> > +  y = st_arr_2[1]; \
> > +  return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +  TYPE r, y; \
> > +  st_arr[0] = i; \
> > +  ld_arr[0] = dummy; \
> > +  r = st_arr[10]; \
> > +  y = st_arr[11]; \
> > +  return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +  TYPE r, y; \
> > +  st_arr[0] = i; \
> > +  ld_arr[0] = dummy; \
> > +  r = st_arr[0]; \
> > +  y = st_arr[1]; \
> > +  return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.41.0
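P.S. Illustrative only, not taken from the patch: with the series applied,
the new behaviour would be exercised by an invocation along the lines of

  gcc -O2 -mcpu=generic -mavoid-store-forwarding \
      --param=aarch64-store-forwarding-threshold=20 -S test.c

where the option and param names are the ones added above, 20 is the
documented default for the threshold, and test.c stands in for any of the
new testcases.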