From: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
Date: Mon, 13 Nov 2023 19:22:13 +0200
Subject: Re: [PATCH] aarch64: New RTL optimization pass avoid-store-forwarding.
To: Manos Anagnostakis, Richard Sandiford
Cc: Philipp Tomsich, Christoph Muellner, Kyrylo Tkachov, Manolis Tsamis, gcc-patches@gcc.gnu.org

Hi Richard,

thank you for reviewing the patch.

On Sat, Nov 11, 2023 at 6:57 PM Richard Sandiford wrote:
> Thanks for the patch.
>
> Manos Anagnostakis writes:
> > This is an RTL pass that detects store forwarding from stores to larger
> > loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for
> > some benchmarks, through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldp  d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldr  d31, [sp, #312]
> >   ldr  d17, [sp, #320]
>
> For this particular case, it would be good to have a spill forwarding
> pass that inserts:
>
>   mov d17, d5
>
> before the fmul (possibly with further clean-up beyond that).
>
> But I realise the patch handles the general case where that isn't
> possible.  The patch can also afford to be approximate.
>
> Some of the insn recognition code will probably need to be updated for
> Alex's pass that forms LDP/STP, since that's going to have writeback
> support.  We'll also need to arrange a pass ordering that works for both
> passes.
>
> The patch generally looks good, and it's nicely compact.  Some specific
> comments below:
>
> > Currently, the pass is disabled by default on all architectures and
> > enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on
> > ampere1/ampere1a, or other architectures as well, without needing to
> > be turned on by this option.
> >
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> > 	* alias.cc (memrefs_conflict_p): Expose static function.
> > 	* alias.h (memrefs_conflict_p): Expose static function.
> > 	* config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> > 	* config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> > 	* config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
> > 	* config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> > 	(aarch64-store-forwarding-threshold): New param.
> > 	* config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
> > 	* doc/invoke.texi: Document new option and new param.
> > 	* config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> > 	* gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> > 	* gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis
> > Co-Authored-By: Manolis Tsamis
> > Co-Authored-By: Philipp Tomsich
> > ---
> >  gcc/alias.cc                                  |   2 +-
> >  gcc/alias.h                                   |   1 +
> >  gcc/config.gcc                                |   1 +
> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >  .../aarch64/aarch64-store-forwarding.cc       | 347 ++++++++++++++++++
> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >  gcc/doc/invoke.texi                           |  12 +-
> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >  12 files changed, 481 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/alias.cc b/gcc/alias.cc
> > index 86d8f7104ad..303683f85e3 100644
> > --- a/gcc/alias.cc
> > +++ b/gcc/alias.cc
> > @@ -2488,7 +2488,7 @@ offset_overlap_p (poly_int64 c, poly_int64 xsize, poly_int64 ysize)
> >     one for X + non-constant and Y + non-constant when X and Y are equal.
> >     If that is fixed the TBAA hack for union type-punning can be removed.  */
> >
> > -static int
> > +int
> >  memrefs_conflict_p (poly_int64 xsize, rtx x, poly_int64 ysize, rtx y,
> >  		    poly_int64 c)
> >  {
> > diff --git a/gcc/alias.h b/gcc/alias.h
> > index ab06ac9055f..49836f7d808 100644
> > --- a/gcc/alias.h
> > +++ b/gcc/alias.h
> > @@ -41,6 +41,7 @@ bool alias_ptr_types_compatible_p (tree, tree);
> >  int compare_base_decls (tree, tree);
> >  bool refs_same_for_tbaa_p (tree, tree);
> >  bool mems_same_for_tbaa_p (rtx, rtx);
> > +int memrefs_conflict_p (poly_int64, rtx, poly_int64, rtx, poly_int64);
> >
> >  /* This alias set can be used to force a memory to conflict with all
> >     other memories, creating a barrier across which no memory reference
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index ba6d63e33ac..ae50d36004f 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >  	cxx_target_objs="aarch64-c.o"
> >  	d_target_objs="aarch64-d.o"
> >  	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > +	extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >  	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >  	target_has_targetm_common=yes
> >  	;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > index 6ace797b738..fa79e8adca8 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 60a55f4bc19..5a7e68f22e5 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1049,6 +1049,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> >  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..30148fb4435
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,347 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2015-2023 Free Software Foundation, Inc.
>
> Just checking the copyright: does the file include code from other
> files?  The line above is OK if so, otherwise I guess it should
> start at 2023.
>
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_VECTOR
> > +#define INCLUDE_MAP
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to larger
> > +   loads (load pairs).  For example, it can transform cases like
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > +
> > +   to
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldr  d31, [sp, #312]
> > +   ldr  d17, [sp, #320]
> > +
> > +   Design: The pass follows a straightforward design.  It starts by
> > +   initializing the alias analysis and the cselib.  Both of these are
> > +   used to find stores and larger loads with overlapping addresses, which
> > +   are candidates for store forwarding optimizations.  It then scans at
> > +   basic block level to find stores that forward to larger loads and
> > +   handles them accordingly as described in the above example.  Finally,
> > +   the alias analysis and the cselib library are closed.  */
> > +
> > +typedef std::vector<std::pair<rtx_insn *, rtx>> vec_rtx_pair;
> > +typedef std::map<rtx, uint32_t> map_rtx_dist;
> > +
> > +/* Statistics counters.  */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default.  */
> > +static rtx dummy;
> > +static bool is_store (rtx expr, rtx &op_0 = dummy);
> > +static bool is_load (rtx expr, rtx &op_1 = dummy);
> > +
> > +/* Return true if EXPR is a store; otherwise false.  OP_0 will contain
> > +   the MEM operand of the store.  */
>
> Probably worth saying "Return true if SET expression EXPR...".
>
> > +static bool
> > +is_store (rtx expr, rtx &op_0)
> > +{
> > +  op_0 = XEXP (expr, 0);
>
> Using SET_DEST would be clearer.
>
> > +
> > +  if (GET_CODE (op_0) == ZERO_EXTEND
> > +      || GET_CODE (op_0) == SIGN_EXTEND)
> > +    op_0 = XEXP (op_0, 0);
> > +
> > +  return MEM_P (op_0);
> > +}
> > +
> > +/* Return true if EXPR is a load; otherwise false.  OP_1 will contain
> > +   the MEM operand of the load.  */
>
> Same as above.
>
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > +  op_1 = XEXP (expr, 1);
>
> SET_SRC.
>
> > +
> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > +    op_1 = XEXP (op_1, 0);
> > +
> > +  return MEM_P (op_1);
> > +}
> > +
> > +/* Check the maximum instruction distance threshold and, if crossed,
> > +   remove the store from STORE_EXPRS.  DISTANCES contains the instruction
> > +   distances.  */
> > +
> > +void
> > +update_store_load_distances (vec_rtx_pair &store_exprs,
> > +			     map_rtx_dist &distances)
> > +{
> > +  for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> > +    {
> > +      (distances[(*iter).second])++;
> > +      iter = distances[(*iter).second]
> > +	       > (uint32_t) aarch64_store_forwarding_threshold_param
> > +	     ? store_exprs.erase (iter) : iter + 1;
>
> What are the guidelines for picking a good param value?  It looks
> like this is called once for each instruction, without taking issue
> rate into account, so I assume the param value is supposed to account
> for that instead.

The param value specifies a threshold of RTL instructions that follow
each store.  We needed a way to count the instruction distance between
a store and a load in order to distinguish store forwardings that do
and do not have an impact on performance.  A store that forwards to a
"very far" load pair should have no impact.  We have chosen this method
for simplicity, despite the limitations and possible inaccuracies that
exist.  If you happen to know a more accurate way to track instruction
distance, I would gladly use it instead.

> > +    }
> > +}
> > +
> > +/* Return true if STORE_MEM is forwarding to LOAD_MEM; otherwise false.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem, rtx load_mem)
> > +{
> > +  gcc_checking_assert (MEM_P (store_mem) && MEM_P (load_mem));
> > +
> > +  rtx store_mem_addr = canon_rtx (get_addr (XEXP (store_mem, 0)));
> > +  rtx load_mem_addr = canon_rtx (get_addr (XEXP (load_mem, 0)));
> > +
> > +  return (memrefs_conflict_p (GET_MODE_SIZE (GET_MODE (store_mem)),
> > +			      store_mem_addr,
> > +			      GET_MODE_SIZE (GET_MODE (load_mem)),
> > +			      load_mem_addr, 0) == 1
> > +	  || rtx_equal_for_cselib_1 (store_mem_addr,
> > +				     load_mem_addr,
> > +				     GET_MODE (store_mem), 0));
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding to
> > +   it; otherwise false.  STORE_EXPRS contains the stores.  */
> > +
> > +static bool
> > +is_small_store_to_large_load (vec_rtx_pair &store_exprs, rtx_insn *insn)
> > +{
> > +  unsigned int load_count = 0;
> > +  bool forwarding = false;
> > +  rtx expr = PATTERN (insn);
> > +
> > +  if (GET_CODE (expr) != PARALLEL
> > +      || XVECLEN (expr, 0) != 2)
> > +    return false;
> > +
> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > +    {
> > +      rtx op_1;
> > +      rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > +      if (GET_CODE (out_exp) != SET)
> > +	continue;
> > +
> > +      if (!is_load (out_exp, op_1))
> > +	{
> > +	  load_count = 0;
> > +	  continue;
> > +	}
> > +
> > +      load_count++;
> > +
> > +      for (std::pair<rtx_insn *, rtx> str : store_exprs)
> > +	{
> > +	  if (!is_forwarding (XEXP (str.second, 0), op_1))
> > +	    continue;
> > +
> > +	  if (dump_file)
> > +	    {
> > +	      fprintf (dump_file,
> > +		       "Store forwarding to PARALLEL with loads:\n");
> > +	      fprintf (dump_file, "  From: ");
> > +	      print_rtl_single (dump_file, str.first);
> > +	      fprintf (dump_file, "  To: ");
> > +	      print_rtl_single (dump_file, insn);
> > +	    }
> > +
> > +	  forwarding = true;
> > +	}
> > +    }
> > +
> > +  if (load_count == 2)
> > +    stats_ldp_count++;
> > +
> > +  return load_count == 2 && forwarding;
> > +}
> > +
> > +/* If the memory register in EXPR is overwritten between the store and
> > +   the larger load, we need to avoid handling that case and thus we
> > +   remove the store from STORE_EXPRS.  */
> > +
> > +static void check_memory_reg_overwrite (rtx expr, vec_rtx_pair &store_exprs)
>
> Formatting nit: new line after "void".
>
> Could you explain in more detail why this is needed?  Doesn't cselib
> handle the definition itself?

Actually this is supposed to catch cases where a limitation might cause
cselib to return false while memrefs_conflict_p returns true in the
is_forwarding function.  In this case, memrefs_conflict_p will miss a
register overwrite that happened between the store and the load pair and
will report a false positive.  I'll re-evaluate to make sure this is
indeed needed, or whether we can do this by just utilizing cselib.

> > +{
> > +  if (REG_P (XEXP (expr, 0)))
> > +    for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> > +      iter = reg_mentioned_p (XEXP (expr, 0), XEXP ((*iter).second, 0))
> > +	     ? store_exprs.erase (iter) : iter + 1;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base source
> > +   address to load from is overwritten in the first load.  INSN should be
> > +   the PARALLEL of the load pair.  */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > +  rtx expr = PATTERN (insn);
> > +
> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> > +
> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > +  /* The base address was overwritten in the first load.  */
> > +  if (reg_mentioned_p (XEXP (load_0, 0), XEXP (load_1, 1)))
>
> SET_DEST and SET_SRC here.
>
> > +    return;
> > +
> > +  emit_insn_after (load_1, insn);
> > +  emit_insn_after (load_0, insn);
> > +  remove_insn (insn);
>
> It's probably safer to insert before the instruction, since that copes
> correctly with non-call exceptions (where a load could raise an
> exception).  It's unlikely to matter as things stand, but LDPs could be
> formed earlier in future.
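Good point.  To make sure I follow, an untested sketch of what I
understand you are suggesting (keeping the surrounding break_ldp logic
as is):

  /* Emit the two single loads ahead of the LDP so that, with
     -fnon-call-exceptions, a trapping load is not moved past the
     point where the original LDP could have trapped, then delete
     the LDP itself.  */
  emit_insn_before (load_0, insn);
  emit_insn_before (load_1, insn);
  remove_insn (insn);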
> > +
> > +  stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > +  rtx_insn *insn, *next;
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, cfun)
> > +    {
> > +      vec_rtx_pair store_exprs;
> > +      map_rtx_dist store_cnt_byte_dist;
> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> > +	{
> > +	  next = NEXT_INSN (insn);
> > +
> > +	  /* If we cross a CALL_P insn, clear the vectors, because the
> > +	     small-store-to-large-load is unlikely to cause a performance
> > +	     difference.  */
> > +	  if (CALL_P (insn))
> > +	    {
> > +	      store_exprs.clear ();
> > +	      store_cnt_byte_dist.clear ();
> > +	    }
> > +
> > +	  if (!NONJUMP_INSN_P (insn))
> > +	    continue;
> > +
> > +	  cselib_process_insn (insn);
> > +
> > +	  rtx expr = PATTERN (insn);
> > +
> > +	  if (GET_CODE (expr) == SET)
>
> Would be good to use single_set here, so that we pick up sets that are
> parallel with clobbers.
>
> > +	    {
> > +	      /* If a store is encountered, push it to the store_exprs
> > +		 vector to check it later.  */
> > +	      if (is_store (expr))
> > +		{
> > +		  store_exprs.push_back ({insn, expr});
> > +
> > +		  /* The store insn shouldn't have been encountered
> > +		     before.  */
> > +		  gcc_checking_assert (store_cnt_byte_dist.find (expr)
> > +				       == store_cnt_byte_dist.end ());
> > +		  store_cnt_byte_dist.insert ({ expr, 0 });
> > +		  stats_store_count++;
> > +		}
> > +
> > +	      check_memory_reg_overwrite (expr, store_exprs);
>
> Depending on the answer to the question above, we probably ought to
> use note_stores to iterate over all stores for check_memory_reg_overwrite.
> Or, alternatively:
>
>   vec_rtx_properties properties;
>   properties.add_insn (insn, false);
>   for (rtx_obj_reference ref : properties.refs ())
>     if (ref.is_write () && ref.is_reg ())
>       check_memory_reg_overwrite ...

Thanks for the pointer.  I may use it in v2, depending on the outcome
of the re-evaluation.

> > +	    }
> > +
> > +	  /* Check for small-store-to-large-load.  */
> > +	  if (is_small_store_to_large_load (store_exprs, insn))
> > +	    {
> > +	      stats_ssll_count++;
> > +	      break_ldp (insn);
> > +	    }
> > +
> > +	  update_store_load_distances (store_exprs, store_cnt_byte_dist);
> > +	}
> > +    }
>
> It looks like the loop carries information across BB boundaries.
> Is that deliberate, to try to detect forwarding in EBBs?  If so,
> I'm not sure whether BBs are guaranteed to be in fallthrough order.
>
> The information probably ought to be reset after non-fallthru.

I am not sure I understood exactly what you said here, but the vectors
are local to each BB iteration, so no information is carried across BBs.
If I am missing something, please let me know.

> Thanks,
> Richard
>
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > +  init_alias_analysis ();
> > +  cselib_init (0);
> > +  scan_and_transform_bb_level ();
> > +  end_alias_analysis ();
> > +  cselib_finish ();
> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > +			    stats_store_count);
> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > +			    stats_ldp_count);
> > +  statistics_counter_event (cfun,
> > +			    "Number of forwarding cases identified: ",
> > +			    stats_ssll_count);
> > +  statistics_counter_event (cfun, "Number of transformed cases: ",
> > +			    stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  0 /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return aarch64_flag_avoid_store_forwarding && optimize >= 2;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      execute_avoid_store_forwarding ();
> > +      return 0;
> > +    }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance.  */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..a9e944a272f 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >  Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > +Avoid store forwarding to load pairs.
> > +
> >  -param=aarch64-sve-compare-costs=
> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +aarch64-avoid-store-forwarding.
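For reference, given the option and param definitions above, the pass
would be exercised with something like the following (illustrative
invocation only; 20 is simply the Init default from the .opt entry):

  gcc -O2 -mavoid-store-forwarding \
      --param=aarch64-store-forwarding-threshold=20 test.c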
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index a9a244ab6d6..7639b50358d 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >  	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >  		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > +	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +		$(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 2caa7ceec40..ca30f52109a 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -792,7 +792,7 @@ Objective-C and Objective-C++ Dialects}.
> >  -moverride=@var{string}  -mverbose-cost-dump
> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> >  @emph{Adapteva Epiphany Options}
> >  @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> > @@ -16583,6 +16583,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >  source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +aarch64-avoid-store-forwarding.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > @@ -20558,6 +20563,11 @@ Generate code which uses only the general-purpose registers.  This will prevent
> >  the compiler from using floating-point and Advanced SIMD registers but will not
> >  impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +@option{-mavoid-store-forwarding} at levels @option{-O2}, @option{-O3}.
> > +
> >  @opindex mlittle-endian
> >  @item -mlittle-endian
> >  Generate little-endian code.  This is the default when GCC is configured for an
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > +	TYPE r, y; \
> > +	st_arr[0] = i; \
> > +	ld_arr[0] = dummy; \
> > +	r = st_arr_2[0]; \
> > +	y = st_arr_2[1]; \
> > +	return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +	TYPE r, y; \
> > +	st_arr[0] = i; \
> > +	ld_arr[0] = dummy; \
> > +	r = st_arr[10]; \
> > +	y = st_arr[11]; \
> > +	return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +	TYPE r, y; \
> > +	st_arr[0] = i; \
> > +	ld_arr[0] = dummy; \
> > +	r = st_arr[0]; \
> > +	y = st_arr[1]; \
> > +	return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.40.1

As for the rest of your points, it should be trivial to address them.

--
Manos Anagnostakis | Compiler Engineer
E: manos.anagnostakis@vrull.eu
VRULL GmbH | Beatrixgasse 32, 1030 Vienna
W: www.vrull.eu | LinkedIn