public inbox for gcc-patches@gcc.gnu.org
* [PATCH 1/2] Add new RTX instruction class FILLER_INSN
@ 2020-07-22 10:02 Andrea Corallo
  2020-07-22 10:09 ` [PATCH 2/2] Aarch64: Add branch diluter pass Andrea Corallo
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-07-22 10:02 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, Kyrylo Tkachov, Richard Earnshaw, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]

Hi all,

I'd like to submit the following two patches implementing a new
AArch64-specific back-end pass that helps optimize branch-dense code,
which can be a bottleneck for performance on some Arm cores.  This is
achieved by padding out the branch-dense sections of the instruction
stream with nops.

The original patch was already posted some time ago:

https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg200721.html

This follows up splitting as suggested in two patches, rebasing on
master and implementing the suggestions of the first code review.

This first patch adds a new RTX instruction class, FILLER_INSN, which
is whitelisted to allow placement of NOPs outside of a basic block.
This permits padding after unconditional branches, which is desirable
because the performance gained from diluting branches is then not paid
straight back through the execution of the padding nops.
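
As an illustration (hand-written AArch64 assembly, not actual output of
the pass), a filler nop emitted after an unconditional branch is never
executed, so it dilutes branch density at no runtime cost:

        cbz   w0, .L2    // conditional branch
        b     .L3        // unconditional branch: ends its basic block
        nop              // FILLER_INSN: sits outside any basic block
.L2:
        ret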

A new RTX class was deemed less invasive than modifying the behavior
of standard UNSPEC nops.

Patch 1/2 is a requirement for 2/2.  Please see the cover letter of
the latter for more details on the pass itself.

Regards

  Andrea

gcc/ChangeLog

2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
	    Carey Williams  <carey.williams@arm.com>

	* cfgbuild.c (inside_basic_block_p): Handle FILLER_INSN.
	* cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
	basic blocks.
	* coretypes.h: New rtx class.
	* emit-rtl.c (emit_filler_after): New function.
	* rtl.def (FILLER_INSN): New rtl define.
	* rtl.h (rtx_filler_insn): Define new structure.
	(FILLER_INSN_P): New macro.
	(is_a_helper <rtx_filler_insn *>::test): New test helper for
	rtx_filler_insn.
	(emit_filler_after): New extern.
	* target-insns.def: Add target insn definition.

[-- Attachment #2: 0001-Add-new-RTX-instruction-class-FILLER_INSN.patch --]
[-- Type: text/x-diff, Size: 6361 bytes --]

From 475bbb3984ed133b020b344eebc2d4d3bf8ce52f Mon Sep 17 00:00:00 2001
From: Andrea Corallo <andrea.corallo@arm.com>
Date: Thu, 16 Jul 2020 09:21:38 +0100
Subject: [PATCH 1/2] Add new RTX instruction class FILLER_INSN

gcc/ChangeLog

2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
	    Carey Williams  <carey.williams@arm.com>

	* cfgbuild.c (inside_basic_block_p): Handle FILLER_INSN.
	* cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
	basic blocks.
	* coretypes.h: New rtx class.
	* emit-rtl.c (emit_filler_after): New function.
	* rtl.def (FILLER_INSN): New rtl define.
	* rtl.h (rtx_filler_insn): Define new structure.
	(FILLER_INSN_P): New macro.
	(is_a_helper <rtx_filler_insn *>::test): New test helper for
	rtx_filler_insn.
	(emit_filler_after): New extern.
	* target-insns.def: Add target insn definition.
---
 gcc/cfgbuild.c       |  1 +
 gcc/cfgrtl.c         | 16 +++++++++++++++-
 gcc/coretypes.h      |  1 +
 gcc/emit-rtl.c       | 14 ++++++++++++++
 gcc/rtl.def          |  4 ++++
 gcc/rtl.h            | 23 +++++++++++++++++++++++
 gcc/target-insns.def |  1 +
 7 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/gcc/cfgbuild.c b/gcc/cfgbuild.c
index 478afa5fe91c..07cb06afba07 100644
--- a/gcc/cfgbuild.c
+++ b/gcc/cfgbuild.c
@@ -58,6 +58,7 @@ inside_basic_block_p (const rtx_insn *insn)
 
     case JUMP_TABLE_DATA:
     case BARRIER:
+    case FILLER_INSN:
     case NOTE:
       return false;
 
diff --git a/gcc/cfgrtl.c b/gcc/cfgrtl.c
index 827e84a44ddd..02139aaa268d 100644
--- a/gcc/cfgrtl.c
+++ b/gcc/cfgrtl.c
@@ -61,6 +61,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-pass.h"
 #include "print-rtl.h"
+#include "rtl-iter.h"
 
 /* Disable warnings about missing quoting in GCC diagnostics.  */
 #if __GNUC__ >= 10
@@ -3033,7 +3034,20 @@ rtl_verify_bb_layout (void)
 	      break;
 
 	    default:
-	      fatal_insn ("insn outside basic block", x);
+	      /* Allow nops after branches, via FILLER_INSN.  */
+	      bool fail = true;
+	      subrtx_iterator::array_type array;
+	      FOR_EACH_SUBRTX (iter, array, x, ALL)
+		{
+		  const_rtx subrtx = *iter;
+		  if (GET_CODE (subrtx) == FILLER_INSN)
+		    {
+		      fail = false;
+		      break;
+		    }
+		}
+	      if (fail)
+		fatal_insn ("insn outside basic block", x);
 	    }
 	}
 
diff --git a/gcc/coretypes.h b/gcc/coretypes.h
index 6b6cfcdf210d..5c6633a815c5 100644
--- a/gcc/coretypes.h
+++ b/gcc/coretypes.h
@@ -84,6 +84,7 @@ struct rtx_def;
     struct rtx_call_insn;       /* CALL_P (X) */
     struct rtx_jump_table_data; /* JUMP_TABLE_DATA_P (X) */
     struct rtx_barrier;         /* BARRIER_P (X) */
+    struct rtx_filler_insn;     /* FILLER_INSN_P (X) */
     struct rtx_code_label;      /* LABEL_P (X) */
     struct rtx_note;            /* NOTE_P (X) */
 
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index f9b0e9714d9e..76f25c011b2a 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -4746,6 +4746,20 @@ emit_barrier_after (rtx_insn *after)
   return insn;
 }
 
+/* Make an insn of code FILLER_INSN to pad out the instruction stream.
+   PATTERN should be from gen_filler_insn ().  AFTER will typically be
+   an unconditional branch at the end of a basic block.  */
+
+rtx_insn *
+emit_filler_after (rtx pattern, rtx_insn *after)
+{
+  rtx_insn *i = make_insn_raw (pattern);
+  add_insn_after_nobb (i, after);
+  return i;
+}
+
 /* Emit the label LABEL after the insn AFTER.  */
 
 rtx_insn *
diff --git a/gcc/rtl.def b/gcc/rtl.def
index 9754333eafba..0e0040eaa1cf 100644
--- a/gcc/rtl.def
+++ b/gcc/rtl.def
@@ -328,6 +328,10 @@ DEF_RTL_EXPR(RETURN, "return", "", RTX_EXTRA)
    conditional jumps.  */
 DEF_RTL_EXPR(SIMPLE_RETURN, "simple_return", "", RTX_EXTRA)
 
+/* Special filler type, used to pad the instruction stream.  */
+
+DEF_RTL_EXPR(FILLER_INSN, "filler_insn", "", RTX_INSN)
+
 /* Special for EH return from subroutine.  */
 
 DEF_RTL_EXPR(EH_RETURN, "eh_return", "", RTX_EXTRA)
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 0872cc408eb1..60abd609007f 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -674,6 +674,17 @@ struct GTY(()) rtx_barrier : public rtx_insn
      from rtl.def.  */
 };
 
+struct GTY(()) rtx_filler_insn : public rtx_insn
+{
+  /* No extra fields, but adds the invariant:
+       FILLER_INSN_P (X) aka (GET_CODE (X) == FILLER_INSN)
+     i.e. a marker that indicates the INSN stream should be padded.
+
+     This is an instance of:
+       DEF_RTL_EXPR(FILLER_INSN, "filler_insn", "", RTX_INSN)
+     from rtl.def.  */
+};
+
 struct GTY(()) rtx_code_label : public rtx_insn
 {
   /* No extra fields, but adds the invariant:
@@ -865,6 +876,9 @@ struct GTY(()) rtvec_def {
 /* Predicate yielding nonzero iff X is a barrier insn.  */
 #define BARRIER_P(X) (GET_CODE (X) == BARRIER)
 
+/* Predicate yielding nonzero iff X is a filler insn.  */
+#define FILLER_INSN_P(X) (GET_CODE (X) == FILLER_INSN)
+
 /* Predicate yielding nonzero iff X is a data for a jump table.  */
 #define JUMP_TABLE_DATA_P(INSN) (GET_CODE (INSN) == JUMP_TABLE_DATA)
 
@@ -970,6 +984,14 @@ is_a_helper <rtx_barrier *>::test (rtx rt)
   return BARRIER_P (rt);
 }
 
+template <>
+template <>
+inline bool
+is_a_helper <rtx_filler_insn *>::test (rtx rt)
+{
+  return FILLER_INSN_P (rt);
+}
+
 template <>
 template <>
 inline bool
@@ -3300,6 +3322,7 @@ extern rtx_insn *emit_debug_insn_after (rtx, rtx_insn *);
 extern rtx_insn *emit_debug_insn_after_noloc (rtx, rtx_insn *);
 extern rtx_insn *emit_debug_insn_after_setloc (rtx, rtx_insn *, location_t);
 extern rtx_barrier *emit_barrier_after (rtx_insn *);
+extern rtx_insn *emit_filler_after (rtx, rtx_insn *);
 extern rtx_insn *emit_label_after (rtx_insn *, rtx_insn *);
 extern rtx_note *emit_note_after (enum insn_note, rtx_insn *);
 extern rtx_insn *emit_insn (rtx);
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index 4d7eb92cf68c..dfdf0273edc1 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -94,6 +94,7 @@ DEF_TARGET_INSN (sibcall_epilogue, (void))
 DEF_TARGET_INSN (sibcall_value, (rtx x0, rtx x1, rtx opt2, rtx opt3,
 				 rtx opt4))
 DEF_TARGET_INSN (simple_return, (void))
+DEF_TARGET_INSN (filler_insn, (void))
 DEF_TARGET_INSN (split_stack_prologue, (void))
 DEF_TARGET_INSN (split_stack_space_check, (rtx x0, rtx x1))
 DEF_TARGET_INSN (stack_protect_combined_set, (rtx x0, rtx x1))
-- 
2.17.1



* [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 10:02 [PATCH 1/2] Add new RTX instruction class FILLER_INSN Andrea Corallo
@ 2020-07-22 10:09 ` Andrea Corallo
  2020-07-22 10:39   ` Andrew Pinski
  2020-07-24 22:09   ` Segher Boessenkool
  2020-07-22 12:24 ` [PATCH 1/2] Add new RTX instruction class FILLER_INSN Richard Biener
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-07-22 10:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, Kyrylo Tkachov, Richard Earnshaw, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 4219 bytes --]

Hi all,

This second patch implements the AArch64-specific back-end pass
'branch-dilution', controllable by the following command-line options:

-mbranch-dilution

--param=aarch64-branch-dilution-granularity={num}

--param=aarch64-branch-dilution-max-branches={num}

Some cores known to benefit from this pass have been given default
tuning values for their granularity and max-branches.  Each affected
core has a very specific granule size and associated max-branch limit.
This is a microarchitecture-specific optimization.  Typical usage is
-mbranch-dilution with a specified -mcpu.  Cores with a granularity
tuned to 0 are ignored.  Options are provided for experimentation.
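
For example, typical invocations might look like this (the file name is
illustrative):

  gcc -O2 -mcpu=neoverse-n1 -mbranch-dilution test.c

  gcc -O2 -mbranch-dilution \
      --param=aarch64-branch-dilution-granularity=8 \
      --param=aarch64-branch-dilution-max-branches=4 test.c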

Observed performance improvements on Neoverse N1 for SPEC CPU 2006 were
up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  The average code size
increase across the whole suite was ~0.4%.

* Algorithm and Heuristic

The pass takes a very simple 'sliding window' approach to the problem.
We crawl through each instruction (starting at the first branch) and
keep track of the number of branches within the current "granule" (or
window).  When this exceeds the max-branch value, the pass will dilute
the current granule, inserting nops to push out some of the branches.
The heuristic will favor unconditional branches (for performance
reasons), or branches that are between two other branches (in order to
decrease the likelihood of another dilution call being needed).
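
In rough pseudocode, the scan looks like this (a simplified sketch with
assumed helper names, not the actual implementation):

  insn = first_branch ();
  granule.add (insn);
  while ((insn = next_insn (insn)))
    {
      granule.add (insn);          /* The window slides forward.  */
      if (granule.branch_count () > max_branches)
        granule.dilute ();         /* Insert nops to push branches out.  */
    }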

Each branch type requires a different method of nop insertion due to
RTL/basic_block restrictions:

- Returning calls do not end a basic block so can be handled by
  emitting a generic nop.

- Unconditional branches must end a basic block, and nops cannot live
  outside of a basic block.  Hence the need for FILLER_INSN, which
  allows placement outside of a basic block and translates to a nop.

- For most conditional branches we take a simple approach and handle
  only the fallthru edge, which we do by inserting a "nop block" of
  nops on the fallthru edge and mapping that back to the original
  destination block (see the sketch after this list).

- asm gotos and pcsets are going to be tricky to analyze from a
  dilution perspective so are ignored at present.
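
To illustrate the conditional-branch case (hand-written AArch64
assembly; labels and surrounding instructions are illustrative only):

  Before:
        cmp   x0, x1
        b.ne  .L4
        add   x0, x0, 1      // fallthru block

  After:
        cmp   x0, x1
        b.ne  .L4
        nop                  // new "nop block" on the fallthru edge
        add   x0, x0, 1      // original fallthru block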

* Testing

The two patches have been tested together on top of current master on
aarch64-unknown-linux-gnu as follows:

- Successful compilation of 3 stage bootstrap with the
  pass forced on (for stage 2, 3)

- No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)

- No 'make check' regressions


Regards

  Andrea

gcc/ChangeLog

2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
	    Carey Williams  <carey.williams@arm.com>

	* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
	* config/aarch64/aarch64-branch-dilution.c: New file.
	* config/aarch64/aarch64-passes.def (branch-dilution): Register
	pass.
	* config/aarch64/aarch64-protos.h (struct tune_params): Declare
	tuning parameters bdilution_gsize and bdilution_maxb.
	(make_pass_branch_dilution): New declaration.
	* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
	(cortexa53_tunings, cortexa57_tunings, cortexa72_tunings)
	(cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings)
	(thunderx_tunings, tsv110_tunings, xgene1_tunings, emag_tunings)
	(qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings)
	(thunderx3t110_tunings, neoversen1_tunings): Provide default
	tunings for bdilution_gsize and bdilution_maxb.
	* config/aarch64/aarch64.md (filler_insn): Define new insn.
	* config/aarch64/aarch64.opt (-mbranch-dilution)
	(--param=aarch64-branch-dilution-granularity)
	(--param=aarch64-branch-dilution-max-branches): Add new options.
	* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
	for aarch64-branch-dilution.c.
	* doc/invoke.texi (-mbranch-dilution)
	(--param=aarch64-branch-dilution-granularity)
	(--param=aarch64-branch-dilution-max-branches): Document branch
	dilution options.

gcc/testsuite/ChangeLog

2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
	    Carey Williams  <carey.williams@arm.com>

	* gcc.target/aarch64/branch-dilution-off.c: New file.
	* gcc.target/aarch64/branch-dilution-on.c: New file.

[-- Attachment #2: 0002-Aarch64-Add-branch-diluter-pass.patch --]
[-- Type: text/x-diff, Size: 37485 bytes --]

From 386b3a3131d5f03a4c9fb8ee47b321009f17fab5 Mon Sep 17 00:00:00 2001
From: Andrea Corallo <andrea.corallo@arm.com>
Date: Thu, 16 Jul 2020 09:24:33 +0100
Subject: [PATCH 2/2] Aarch64: Add branch diluter pass

gcc/ChangeLog

2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
	    Carey Williams  <carey.williams@arm.com>

	* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
	* config/aarch64/aarch64-branch-dilution.c: New file.
	* config/aarch64/aarch64-passes.def (branch-dilution): Register
	pass.
	* config/aarch64/aarch64-protos.h (struct tune_params): Declare
	tuning parameters bdilution_gsize and bdilution_maxb.
	(make_pass_branch_dilution): New declaration.
	* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
	(cortexa53_tunings, cortexa57_tunings, cortexa72_tunings)
	(cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings)
	(thunderx_tunings, tsv110_tunings, xgene1_tunings, emag_tunings)
	(qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings)
	(thunderx3t110_tunings, neoversen1_tunings): Provide default
	tunings for bdilution_gsize and bdilution_maxb.
	* config/aarch64/aarch64.md (filler_insn): Define new insn.
	* config/aarch64/aarch64.opt (-mbranch-dilution)
	(--param=aarch64-branch-dilution-granularity)
	(--param=aarch64-branch-dilution-max-branches): Add new options.
	* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
	for aarch64-branch-dilution.c.
	* doc/invoke.texi (-mbranch-dilution)
	(--param=aarch64-branch-dilution-granularity)
	(--param=aarch64-branch-dilution-max-branches): Document branch
	dilution options.

gcc/testsuite/ChangeLog

2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
	    Carey Williams  <carey.williams@arm.com>

	* gcc.target/aarch64/branch-dilution-off.c: New file.
	* gcc.target/aarch64/branch-dilution-on.c: New file.
---
 gcc/config.gcc                                |   2 +-
 gcc/config/aarch64/aarch64-branch-dilution.c  | 648 ++++++++++++++++++
 gcc/config/aarch64/aarch64-passes.def         |   1 +
 gcc/config/aarch64/aarch64-protos.h           |   4 +
 gcc/config/aarch64/aarch64.c                  |  34 +
 gcc/config/aarch64/aarch64.md                 |   7 +
 gcc/config/aarch64/aarch64.opt                |  11 +
 gcc/config/aarch64/t-aarch64                  |   6 +
 gcc/doc/invoke.texi                           |  17 +-
 .../gcc.target/aarch64/branch-dilution-off.c  |  57 ++
 .../gcc.target/aarch64/branch-dilution-on.c   |  58 ++
 11 files changed, 843 insertions(+), 2 deletions(-)
 create mode 100644 gcc/config/aarch64/aarch64-branch-dilution.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/branch-dilution-on.c

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 30b51c3dc81e..0b6b26d973f2 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -321,7 +321,7 @@ aarch64*-*-*)
 	c_target_objs="aarch64-c.o"
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
-	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
+	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o aarch64-branch-dilution.o"
 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
 	target_has_targetm_common=yes
 	;;
diff --git a/gcc/config/aarch64/aarch64-branch-dilution.c b/gcc/config/aarch64/aarch64-branch-dilution.c
new file mode 100644
index 000000000000..3535384640fa
--- /dev/null
+++ b/gcc/config/aarch64/aarch64-branch-dilution.c
@@ -0,0 +1,648 @@
+/* Branch dilution optimization pass for AArch64.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   Contributed by Arm Ltd.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#define INCLUDE_LIST
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "df.h"
+#include "insn-config.h"
+#include "regs.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "recog.h"
+#include "cfganal.h"
+#include "insn-attr.h"
+#include "context.h"
+#include "tree-pass.h"
+#include "function-abi.h"
+#include "regrename.h"
+#include "aarch64-protos.h"
+#include "cfghooks.h"
+#include "cfgrtl.h"
+#include "cfgbuild.h"
+#include "errors.h"
+#include "diagnostic.h"
+
+namespace {
+
+unsigned max_branch = 0;
+unsigned granule_size = 0;
+
+/* Return true if INSN is a branch insn.  */
+
+static bool
+is_branch (rtx_insn *insn)
+{
+  if (insn == NULL)
+    return false;
+
+  return JUMP_P (insn) || CALL_P (insn) || ANY_RETURN_P (insn);
+}
+
+const pass_data
+pass_data_branch_dilution =
+  { RTL_PASS,      /* type.  */
+    "branch-dilution",   /* name.  */
+    OPTGROUP_NONE, /* optinfo_flags.  */
+    TV_NONE,       /* tv_id.  */
+    0,		 /* properties_required.  */
+    0,		 /* properties_provided.  */
+    0,		 /* properties_destroyed.  */
+    0,		 /* todo_flags_start.  */
+    0,		 /* todo_flags_finish.  */
+  };
+
+class pass_branch_dilution : public rtl_opt_pass
+{
+ public:
+  pass_branch_dilution (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_branch_dilution, ctxt)
+    {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return optimize && aarch64_bdilution;
+  }
+
+  virtual unsigned execute (function *);
+};
+
+/* Simple wrapper for RTX insns added to a granule.
+   It aids analysis and manipulation.  */
+struct insn_info
+{
+  insn_info (rtx_insn *);
+  rtx_insn *rtx;		/* underlying gcc rtx insn.  */
+  insn_info *next;		/* next insn in the granule.  */
+  insn_info *prev;		/* prev insn in the granule.  */
+  unsigned index;		/* current position in the granule.  */
+  bool is_branch;		/* denotes a branch insn.  */
+  bool is_unconditional;	/* denotes an unconditional branch.  */
+  bool is_nop;			/* denotes a nop insn.  */
+  bool ignore;			/* to ignore unsupported branch types.  */
+};
+
+insn_info::insn_info (rtx_insn *i)
+: rtx (i), next (NULL), prev (NULL), index (-1),
+  is_branch (false), is_unconditional (false), is_nop (false),
+  ignore (false)
+{}
+
+/* A 'sliding window'-like abstraction that represents
+   the current view of the instructions being considered
+   for branch dilution.  */
+class insn_granule
+{
+ public:
+  insn_granule ();
+  /* Debug method.  */
+  void dump (const char *desc = "General");
+  /* Attempt to dilute/pad granule with nops.  */
+  int dilute ();
+  /* Returns true if the granule needs diluting.  */
+  bool saturated ();
+  /* Utility functions for inserting instructions into
+     the granule.  */
+  void insert_insn_after (insn_info *, insn_info *);
+  void add_insn (insn_info *);
+ private:
+  void remove_oldest_insn ();
+  void remove_newest_insn ();
+  void update_indexes ();
+  /* Utility functions for handling special nop
+     insertion.  */
+  void insert_nop_block_after (insn_info *insn);
+  basic_block create_nop_block_after_insn (insn_info *, int);
+  /* Dilution heuristics.  */
+  insn_info *get_best_branch ();
+  int branch_heuristic (insn_info *);
+
+  /* Pointers to the first/last instructions in granule.  */
+  insn_info *m_first = NULL;
+  insn_info *m_last = NULL;
+
+  /* Current counts of each interesting insn type in the granule.  */
+  unsigned m_insn_count;
+  unsigned m_branch_count;
+  unsigned m_ubranch_count;
+};
+
+insn_granule::insn_granule ()
+: m_insn_count (0), m_branch_count (0), m_ubranch_count (0)
+{}
+
+/* Create a new basic block, populated with NOP_COUNT nops, after the
+   given instruction INSN.  */
+
+basic_block
+insn_granule::create_nop_block_after_insn (insn_info *insn, int nop_count)
+{
+  gcc_assert (nop_count > 0);
+  basic_block bb = BLOCK_FOR_INSN (insn->rtx);
+  rtx_insn *nop_ptr = NULL;
+  gcc_assert (is_branch (BB_END (bb)));
+  nop_ptr = emit_insn_after (gen_nop (), BB_END (bb));
+  insn_info *nop_insn = new insn_info (nop_ptr);
+  nop_insn->is_nop = true;
+  insert_insn_after (nop_insn, insn);
+  basic_block new_bb = create_basic_block (nop_ptr, NULL, bb);
+  set_block_for_insn (nop_ptr, new_bb);
+  for (int i = 0; i < (nop_count - 1); i++)
+    {
+      nop_ptr = emit_insn_after (gen_nop (), nop_ptr);
+      nop_insn = new insn_info (nop_ptr);
+      set_block_for_insn (nop_ptr, new_bb);
+      nop_insn->is_nop = true;
+      insert_insn_after (nop_insn, insn);
+    }
+  /* Fix up block endings.  */
+  BB_END (new_bb) = nop_ptr;
+  BB_END (bb) = insn->rtx;
+
+  return new_bb;
+}
+
+/* Dump a textual representation of the current state
+   of the insn_granule in the following format:
+
+   "===== GRANULE ====="
+   "====> {desc} <===="
+   "---> INS:?, BRA:? (?), FIRST: ?, LAST: ?"
+   "0. jump_insn  > NEXT = (?), PREV = (?) -- UID: ?"
+   "1. insn (nop)  > NEXT = (?), PREV = (?) -- UID: ?"
+   "2. insn  > NEXT = (?), PREV = (?) -- UID: ?"
+   "3. jump_insn  > NEXT = (?), PREV = (?) -- UID: ?"
+
+   Used only for debugging with fdump.
+*/
+
+void
+insn_granule::dump (const char *desc)
+{
+  insn_info *insn = m_first;
+  dump_printf (MSG_NOTE, "===== GRANULE =====\n====> %s <====\n", desc);
+  dump_printf (MSG_NOTE, "---> INS:%d, BRA:%d (%d), FIRST: %d, LAST: %d\n",
+	       m_insn_count, m_branch_count, m_ubranch_count,
+	       m_first->index, m_last->index);
+  while (insn)
+    {
+      dump_printf (MSG_NOTE,
+		   "%d. %s%s%s  > NEXT = (%d), PREV = (%d) -- UID: %d\n",
+		   insn->index, GET_RTX_NAME (GET_CODE (insn->rtx)),
+		   any_uncondjump_p (insn->rtx) ? " (ubranch)" : "",
+		   insn->is_nop ? " (nop)" : "",
+		   insn->next ? insn->next->index : -1,
+		   insn->prev ? insn->prev->index : -1, INSN_UID (insn->rtx));
+      insn = insn->next;
+    }
+}
+
+/* Simple heuristic used to favor certain types of branches to pad after,
+   e.g. we prefer unconditional branches or branches surrounded by other
+   branches.  */
+
+int
+insn_granule::branch_heuristic (insn_info *insn)
+{
+  int value = 0;
+  if (!is_branch (insn->rtx))
+    {
+      if (dump_file)
+	dump_printf (MSG_NOTE,
+		     "--> Ignoring insn: %d as it's not a branch\n",
+		     insn->index);
+      return -1;
+    }
+  if (insn->ignore)
+    {
+      if (dump_file)
+	dump_printf (MSG_NOTE,
+		     "--> Ignoring branch insn: %d (unsupported)\n",
+		     insn->index);
+      return -1;
+    }
+  if (insn == m_last)
+    {
+      if (dump_file)
+	dump_printf (MSG_NOTE,
+		     "--> Ignoring insn: %d as it's the last granule insn\n",
+		     insn->index);
+      return -1;
+    }
+  if (insn->is_unconditional)
+    value += 2;
+  if (is_branch (prev_real_nondebug_insn (insn->rtx))
+      && is_branch (next_real_nondebug_insn (insn->rtx)))
+    value++;
+
+  return value;
+}
+
+/* Iterate over the granule and test a heuristic against each insn.
+   Return the insn that seems most promising to pad after, or the
+   first branch if no other seems suitable.  */
+
+insn_info *
+insn_granule::get_best_branch ()
+{
+  insn_info *current_insn = m_first;
+  insn_info *best_insn = m_first;
+  int current_score = 0;
+  int best_score = 0;
+  /* Make sure we start with a branch.  */
+  while (current_insn && !current_insn->is_branch)
+    {
+      current_insn = current_insn->next;
+    }
+  best_insn = current_insn;
+  while (current_insn)
+    {
+      current_score = branch_heuristic (current_insn);
+      if (dump_file)
+	dump_printf (MSG_NOTE, "Evaluating insn %d (%s), score = %d\n",
+		     current_insn->index,
+		     GET_RTX_NAME (GET_CODE (current_insn->rtx)),
+		     current_score);
+      if (current_score > best_score)
+	{
+	  best_score = current_score;
+	  best_insn = current_insn;
+	}
+      current_insn = current_insn->next;
+    }
+  if (dump_file)
+    dump_printf (MSG_NOTE, "Returning best insn: %d %s.\n", best_insn->index,
+		 GET_RTX_NAME (GET_CODE (best_insn->rtx)));
+
+  return best_insn;
+}
+
+/* Insert the instruction into the granule, after the given instruction.  */
+
+void
+insn_granule::insert_insn_after (insn_info *new_insn, insn_info *current_insn)
+{
+  gcc_assert (current_insn != m_last);
+  if (dump_file)
+    dump_printf (MSG_NOTE, "Inserting nop after insn %d, at position %d\n",
+		 current_insn->index, current_insn->index + 1);
+  new_insn->index = current_insn->index + 1;
+
+  current_insn->next->prev = new_insn;
+  new_insn->next = current_insn->next;
+  current_insn->next = new_insn;
+  new_insn->prev = current_insn;
+
+  m_insn_count++;
+  update_indexes ();
+}
+
+/* Insert a new basic block containing a nop.  */
+
+void
+insn_granule::insert_nop_block_after (insn_info *insn)
+{
+  edge e;
+  basic_block bb = BLOCK_FOR_INSN (insn->rtx);
+  e = find_fallthru_edge (bb->succs);
+  if (e)
+    {
+      basic_block new_bb = create_nop_block_after_insn (insn, 1);
+      /* Wire up the edges and preserve the partition.  */
+      make_edge (new_bb, e->dest, EDGE_FALLTHRU);
+      BB_COPY_PARTITION (new_bb, bb);
+      redirect_edge_succ_nodup (e, new_bb);
+    }
+  else
+    {
+      /* Odd special case for exit functions, which have no fall-thru
+	 edge.  */
+      insn_info *nop_insn =
+	new insn_info (emit_filler_after (gen_filler_insn (), insn->rtx));
+      nop_insn->is_nop = true;
+      insert_insn_after (nop_insn, insn);
+    }
+}
+
+/* Analyse the granule's branches and insert a nop to dilute.  */
+
+int
+insn_granule::dilute ()
+{
+  if (dump_file)
+    dump_printf (MSG_NOTE, "> Starting dilution.\n");
+  insn_info *branch_insn = NULL;
+  branch_insn = get_best_branch ();
+  /* If the granule is saturated then branches should be available.
+     Inserting nops after the granule isn't going to help dilute it.  */
+  gcc_assert (branch_insn
+	      && (branch_insn->is_branch && (branch_insn != m_last)));
+  if (any_condjump_p (branch_insn->rtx))
+    {
+      /* We can insert a nop via a 'nop block'
+	 attached to the fallthru edge.  */
+      insert_nop_block_after (branch_insn);
+    }
+  else if (CALL_P (branch_insn->rtx))
+    {
+      /* Standard calls do not end the basic block,
+	 so a simple emit will suffice.  */
+      if (!control_flow_insn_p (branch_insn->rtx))
+	{
+	  insn_info *nop_insn = new insn_info
+	    (emit_insn_after (gen_nop (), branch_insn->rtx));
+	  nop_insn->is_nop = true;
+	  insert_insn_after (nop_insn, branch_insn);
+	}
+      else
+	/* Sibling calls and no-return calls do.  */
+	insert_nop_block_after (branch_insn);
+    }
+  else if (returnjump_p (branch_insn->rtx))
+    {
+      /* Return jumps must be followed by their barrier,
+	 so we emit the filler after that.  */
+      insn_info *nop_insn =
+	new insn_info (
+	  emit_filler_after (gen_filler_insn (),
+			     next_nonnote_nondebug_insn (branch_insn->rtx)));
+      nop_insn->is_nop = true;
+      insert_insn_after (nop_insn, branch_insn);
+    }
+  else if (any_uncondjump_p (branch_insn->rtx))
+    {
+      /* Any remaining unconditionals can be handled by a filler.  */
+      insn_info *nop_insn =
+	new insn_info (emit_filler_after (gen_filler_insn (),
+					  branch_insn->rtx));
+      nop_insn->is_nop = true;
+      insert_insn_after (nop_insn, branch_insn);
+    }
+  else if (pc_set (branch_insn->rtx) || JUMP_P (branch_insn->rtx))
+    {
+      /* TODO handle pc_set and asm _goto (s).
+	 For now we'll just pretend they're not branches.  */
+      branch_insn->is_branch = false;
+      m_branch_count--;
+      if (branch_insn->is_unconditional)
+	{
+	  branch_insn->is_unconditional = false;
+	  m_ubranch_count--;
+	}
+      branch_insn->ignore = true;
+    }
+  else
+    {
+      /* Unhandled branch type.  */
+      if (dump_file)
+	{
+	  dump_printf (MSG_NOTE, "Error: unhandled branch type:\n");
+	  print_rtl_single (dump_file, branch_insn->rtx);
+	}
+      gcc_unreachable ();
+    }
+
+  /* Trim the granule back to size after nop insertion.  */
+  int rollback = 0;
+  while (m_insn_count > granule_size)
+    {
+      remove_newest_insn ();
+      rollback++;
+    }
+  if (dump_file)
+    dump_printf (MSG_NOTE, "< End dilution.\n");
+
+  return rollback;
+}
+
+/* Return true if the granule needs diluting.  */
+
+bool
+insn_granule::saturated ()
+{
+  return m_branch_count > max_branch;
+}
+
+/* Update index in each insn_info in the current window.  */
+
+void
+insn_granule::update_indexes ()
+{
+  insn_info *i = m_first;
+  int n = 0;
+  i->index = n++;
+  while ((i = i->next))
+    {
+      i->index = n++;
+    }
+}
+
+/* Add an instruction to the granule.  */
+
+void
+insn_granule::add_insn (insn_info *insn)
+{
+  if (m_insn_count == 0)
+    m_first = m_last = insn;
+  else
+    {
+      if (m_insn_count >= granule_size)
+	remove_oldest_insn ();
+      m_last->next = insn;
+      insn->prev = m_last;
+      m_last = insn;
+    }
+  insn->index = m_insn_count++;
+  if (is_branch (insn->rtx))
+    {
+      m_branch_count++;
+      insn->is_branch = true;
+      if (any_uncondjump_p (insn->rtx))
+	{
+	  insn->is_unconditional = true;
+	  m_ubranch_count++;
+	}
+    }
+  update_indexes ();
+}
+
+/* Remove the last added instruction from the granule.  */
+
+void
+insn_granule::remove_newest_insn ()
+{
+  insn_info *to_delete = m_last;
+  if (dump_file)
+    dump_printf (MSG_NOTE, "to_delete: %d UID: %d\n", to_delete->index,
+	     INSN_UID (to_delete->rtx));
+  m_last = to_delete->prev;
+  m_last->next = NULL;
+  if (is_branch (to_delete->rtx) && !to_delete->ignore)
+    {
+      m_branch_count--;
+      if (any_uncondjump_p (to_delete->rtx))
+	m_ubranch_count--;
+    }
+  m_insn_count--;
+  delete to_delete;
+  update_indexes ();
+}
+
+/* Remove the first added instruction from the granule.  */
+
+void
+insn_granule::remove_oldest_insn ()
+{
+  insn_info *to_delete = m_first;
+  m_first = to_delete->next;
+  m_first->prev = NULL;
+  if (is_branch (to_delete->rtx) && !to_delete->ignore)
+    {
+      m_branch_count--;
+      if (any_uncondjump_p (to_delete->rtx))
+	m_ubranch_count--;
+    }
+  m_insn_count--;
+  delete to_delete;
+  update_indexes ();
+}
+
+/* Return the next instruction after START that is a "real" instruction,
+   i.e. not a barrier, code label, debug insn, etc.  */
+
+static rtx_insn *
+next_space_consuming_insn (rtx_insn *start)
+{
+  rtx_insn *insn = next_real_nondebug_insn (start);
+  while (insn && (recog_memoized (insn) < 0))
+    insn = next_real_nondebug_insn (insn);
+  return insn;
+}
+
+/* Find the next branch insn in the instruction stream after,
+   and excluding, START.  Return that insn, or NULL if none is
+   found.  */
+
+rtx_insn *
+find_next_branch (rtx_insn *start)
+{
+  rtx_insn *insn = next_space_consuming_insn (start);
+  while (insn)
+    {
+      if (is_branch (insn))
+	return insn;
+      insn = next_space_consuming_insn (insn);
+    }
+
+  return NULL;
+}
+
+/* Execute the branch dilution pass.  */
+
+unsigned int
+pass_branch_dilution::execute (ATTRIBUTE_UNUSED function *fun)
+{
+  granule_size = global_options_set.x_aarch64_bdilution_gsize
+    ? (unsigned)aarch64_bdilution_gsize
+    : aarch64_tune_params.bdilution_gsize;
+
+  max_branch = global_options_set.x_aarch64_bdilution_maxb
+    ? (unsigned)aarch64_bdilution_maxb
+    : aarch64_tune_params.bdilution_maxb;
+
+  if (dump_file)
+    dump_printf (MSG_NOTE, "BDILUTE OPTIONS: %d, %d\n", granule_size, max_branch);
+
+  /* Disabled.  */
+  if (granule_size < 1)
+    return 0;
+  /* Invalid.  */
+  if (max_branch < 1)
+    sorry ("branch dilution: max branch must be greater than zero");
+  else if (max_branch >= granule_size)
+    sorry ("branch dilution: max branches (%d) must be "
+	   "less than granule size (%d)", max_branch, granule_size);
+
+  if (dump_file)
+    dump_printf (MSG_NOTE, "branch-dilution start:\n");
+
+  /* Start scanning from the first occurring branch.  */
+  rtx_insn *curr_insn = find_next_branch (get_insns ());
+
+  if (!curr_insn)
+    return 0;
+
+  insn_granule granule;
+  granule.add_insn (new insn_info (curr_insn));
+
+  if (dump_file)
+    granule.dump ();
+
+  /* Iterate over the rest of the instruction stream.  */
+  while ((curr_insn = next_nonnote_nondebug_insn (curr_insn)))
+    {
+      if (!LABEL_P (curr_insn) && !BARRIER_P (curr_insn))
+	{
+	  granule.add_insn (new insn_info (curr_insn));
+	  if (dump_file)
+	    granule.dump ();
+	}
+      else
+	{
+	  /* Only check for saturation if changes have been made.  */
+	  continue;
+	}
+      if (granule.saturated ())
+	{
+	  if (dump_file)
+	    dump_printf (MSG_NOTE, "Granule is saturated.\n");
+	  int rollback = granule.dilute ();
+	  /* Now we need to re-scan any pushed out instructions.  */
+	  while (rollback > 0)
+	    {
+	      curr_insn = previous_insn (curr_insn);
+	      rollback--;
+	    }
+	  if (dump_file)
+	    granule.dump ();
+	}
+    }
+  if (dump_file)
+    dump_printf (MSG_NOTE, "branch-dilution end.\n");
+  return 0;
+}
+
+} // anon namespace
+
+/* Create the branch dilution pass.  */
+
+rtl_opt_pass *
+make_pass_branch_dilution (gcc::context *ctxt)
+{
+  return new pass_branch_dilution (ctxt);
+}
diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 223f78590568..9383858020af 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -21,4 +21,5 @@
 INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering);
 INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
+INSERT_PASS_BEFORE (pass_compute_alignments, 1, pass_branch_dilution);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 839f801a31b8..d02d2882df4c 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -267,6 +267,9 @@ struct tune_params
   int vec_reassoc_width;
   int min_div_recip_mul_sf;
   int min_div_recip_mul_df;
+  /* Branch dilution.  */
+  unsigned int bdilution_gsize;
+  unsigned int bdilution_maxb;
   /* Value for aarch64_case_values_threshold; or 0 for the default.  */
   unsigned int max_case_values;
 /* An enum specifying how to take into account CPU autoprefetch capabilities
@@ -761,6 +764,7 @@ rtl_opt_pass *make_pass_fma_steering (gcc::context *);
 rtl_opt_pass *make_pass_track_speculation (gcc::context *);
 rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
 rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
+rtl_opt_pass *make_pass_branch_dilution (gcc::context *ctxt);
 
 poly_uint64 aarch64_regmode_natural_size (machine_mode);
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 6ef2e397d393..23f98d471b32 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -888,6 +888,8 @@ static const struct tune_params generic_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -915,6 +917,8 @@ static const struct tune_params cortexa35_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -942,6 +946,8 @@ static const struct tune_params cortexa53_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -969,6 +975,8 @@ static const struct tune_params cortexa57_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
@@ -996,6 +1004,8 @@ static const struct tune_params cortexa72_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  4,	/* bdilution_gsize.  */
+  2,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -1023,6 +1033,8 @@ static const struct tune_params cortexa73_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -1051,6 +1063,8 @@ static const struct tune_params exynosm1_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   48,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
@@ -1077,6 +1091,8 @@ static const struct tune_params thunderxt88_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),	/* tune_flags.  */
@@ -1103,6 +1119,8 @@ static const struct tune_params thunderx_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
@@ -1131,6 +1149,8 @@ static const struct tune_params tsv110_tunings =
   1,    /* vec_reassoc_width.  */
   2,    /* min_div_recip_mul_sf.  */
   2,    /* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,    /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
@@ -1158,6 +1178,8 @@ static const struct tune_params xgene1_tunings =
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   17,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
   &xgene1_prefetch_tune
@@ -1183,6 +1205,8 @@ static const struct tune_params emag_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   17,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
@@ -1210,6 +1234,8 @@ static const struct tune_params qdf24xx_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
@@ -1239,6 +1265,8 @@ static const struct tune_params saphira_tunings =
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
@@ -1266,6 +1294,8 @@ static const struct tune_params thunderx2t99_tunings =
   2,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -1293,6 +1323,8 @@ static const struct tune_params thunderx3t110_tunings =
   2,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  0,	/* bdilution_gsize.  */
+  0,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
@@ -1319,6 +1351,8 @@ static const struct tune_params neoversen1_tunings =
   2,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
+  8,	/* bdilution_gsize.  */
+  4,	/* bdilution_maxb.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index d5ca1898c02e..1bee7df74048 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -899,6 +899,13 @@
    (set_attr "sls_length" "retbr")]
 )
 
+(define_insn "filler_insn"
+  [(filler_insn)]
+  ""
+  "nop"
+  [(set_attr "type" "no_insn")
+   (set_attr "length" "4")])
+
 (define_insn "*cb<optab><mode>1"
   [(set (pc) (if_then_else (EQL (match_operand:GPI 0 "register_operand" "r")
 				(const_int 0))
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 5170361fd5e5..cd2cb5715da4 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -164,6 +164,10 @@ msign-return-address=
 Target WarnRemoved RejectNegative Joined Enum(aarch64_ra_sign_scope_t) Var(aarch64_ra_sign_scope) Init(AARCH64_FUNCTION_NONE) Save
 Select return address signing scope.
 
+mbranch-dilution
+Target Report RejectNegative Var(aarch64_bdilution) Init(0) Save
+Run the branch dilution pass.
+
 Enum
 Name(aarch64_ra_sign_scope_t) Type(enum aarch64_function_type)
 Supported AArch64 return address signing scope (for use with -msign-return-address= option):
@@ -275,3 +279,10 @@ The number of Newton iterations for calculating the reciprocal for float type.
 Target Joined UInteger Var(aarch64_double_recp_precision) Init(2) IntegerRange(1, 5) Param
 The number of Newton iterations for calculating the reciprocal for double type.  The precision of division is proportional to this param when division approximation is enabled.  The default value is 2.
 
+-param=aarch64-branch-dilution-granularity=
+Target RejectNegative Joined UInteger Var(aarch64_bdilution_gsize) Init(0)
+Size of instruction stream granules (window).
+
+-param=aarch64-branch-dilution-max-branches=
+Target RejectNegative Joined UInteger Var(aarch64_bdilution_maxb) Init(0)
+Max number of branches that should appear within an instruction granule.
diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
index 11d20b7be140..23f4d417b246 100644
--- a/gcc/config/aarch64/t-aarch64
+++ b/gcc/config/aarch64/t-aarch64
@@ -158,6 +158,12 @@ aarch64-bti-insert.o: $(srcdir)/config/aarch64/aarch64-bti-insert.c \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/aarch64/aarch64-bti-insert.c
 
+aarch64-branch-dilution.o: $(srcdir)/config/aarch64/aarch64-branch-dilution.c \
+    $(CONFIG_H) $(SYSTEM_H) $(RTL_BASE_H) \
+    $(srcdir)/config/aarch64/aarch64-protos.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/aarch64/aarch64-branch-dilution.c
+
 comma=,
 MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
 MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 825fd669a757..739446a19e3b 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -705,7 +705,7 @@ Objective-C and Objective-C++ Dialects}.
 -moverride=@var{string}  -mverbose-cost-dump @gol
 -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg} @gol
 -mstack-protector-guard-offset=@var{offset} -mtrack-speculation @gol
--moutline-atomics }
+-moutline-atomics -mbranch-dilution }
 
 @emph{Adapteva Epiphany Options}
 @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs @gol
@@ -13604,6 +13604,16 @@ The number of Newton iterations for calculating the reciprocal for double type.
 The precision of division is propotional to this param when division
 approximation is enabled.  The default value is 2.
 
+@item aarch64-branch-dilution-granularity
+Specify the size of the granules (instruction windows) to be considered
+for branch dilution.  When omitted, the tuning for the specified @option{-mcpu}
+will be used.
+
+@item aarch64-branch-dilution-max-branches
+Specify the maximum number of branches a granule may contain before it is
+considered saturated, requiring branch dilution.  When omitted, the tuning
+for the specified @option{-mcpu} will be used.
+
 @end table
 
 @end table
@@ -17425,6 +17435,11 @@ functions.  The optional argument @samp{b-key} can be used to sign the functions
 with the B-key instead of the A-key.
 @samp{bti} turns on branch target identification mechanism.
 
+@item -mbranch-dilution
+@opindex mbranch-dilution
+Enable the branch dilution optimization pass, which can improve performance
+on code with high branch density.
+
 @item -mharden-sls=@var{opts}
 @opindex mharden-sls
 Enable compiler hardening against straight line speculation (SLS).
diff --git a/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
new file mode 100644
index 000000000000..46c771bc3889
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
@@ -0,0 +1,57 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
+/* { dg-final { scan-assembler-not  "\\s*b.*\n\\snop\n" } } */
+#include <stdlib.h>
+
+void
+branch (int* n)
+{
+  while ((*n % 2 != 0))
+    *n = *n * *n;
+}
+
+int
+main ()
+{
+  int *i = malloc (sizeof (int));
+  *i = 5;
+  while (*i < 1000) {
+    switch (*i) {
+      case 1:
+        *i += 1;
+        branch (i);
+        break;
+      case 2:
+        *i += 2;
+        branch (i);
+        break;
+      case 3:
+        *i += 3;
+        branch (i);
+        break;
+      case 4:
+        *i += 4;
+        branch (i);
+        break;
+      case 5:
+        *i += 5;
+        branch (i);
+        break;
+      case 6:
+        *i += 6;
+        branch (i);
+        break;
+      case 7:
+        *i += 7;
+        branch (i);
+        break;
+      case 8:
+        *i += 8;
+        branch (i);
+        break;
+      default:
+        branch (i);
+    }
+  }
+  return *i;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/branch-dilution-on.c b/gcc/testsuite/gcc.target/aarch64/branch-dilution-on.c
new file mode 100644
index 000000000000..25d5c60e38c3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-on.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -mbranch-dilution -mcpu=cortex-a72 --param case-values-threshold=50 -fdump-rtl-branch-dilution" } */
+/* { dg-final { scan-assembler  "\\s*b.*\n\\snop\n" } } */
+/* { dg-final { scan-rtl-dump "filler_insn" "branch-dilution"} } */
+#include <stdlib.h>
+
+void
+branch (int* n)
+{
+  while ((*n % 2 != 0))
+    *n = *n * *n;
+}
+
+int
+main ()
+{
+  int *i = malloc (sizeof (int));
+  *i = 5;
+  while (*i < 1000) {
+    switch (*i) {
+      case 1:
+        *i += 1;
+        branch (i);
+        break;
+      case 2:
+        *i += 2;
+        branch (i);
+        break;
+      case 3:
+        *i += 3;
+        branch (i);
+        break;
+      case 4:
+        *i += 4;
+        branch (i);
+        break;
+      case 5:
+        *i += 5;
+        branch (i);
+        break;
+      case 6:
+        *i += 6;
+        branch (i);
+        break;
+      case 7:
+        *i += 7;
+        branch (i);
+        break;
+      case 8:
+        *i += 8;
+        branch (i);
+        break;
+      default:
+        branch (i);
+    }
+  }
+  return *i;
+}
-- 
2.17.1



* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 10:09 ` [PATCH 2/2] Aarch64: Add branch diluter pass Andrea Corallo
@ 2020-07-22 10:39   ` Andrew Pinski
  2020-07-22 13:53     ` Andrea Corallo
  2020-07-24 22:09   ` Segher Boessenkool
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Pinski @ 2020-07-22 10:39 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: GCC Patches, nd, Richard Earnshaw

On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo <andrea.corallo@arm.com> wrote:
>
> Hi all,
>
> This second patch implements the AArch64-specific back-end pass
> 'branch-dilution', controllable by the following command-line options:
>
> -mbranch-dilution
>
> --param=aarch64-branch-dilution-granularity={num}
>
> --param=aarch64-branch-dilution-max-branches={num}
>
> Some cores known to benefit from this pass have been given default
> tuning values for their granularity and max-branches.  Each affected
> core has a very specific granule size and associated max-branch limit.
> This is a microarchitecture-specific optimization.  Typical usage is
> -mbranch-dilution with a specified -mcpu.  Cores with a granularity
> tuned to 0 are ignored.  Options are provided for experimentation.

Can you give a simple example of what this patch does?
Also your testcases seem too sensitive to other optimizations which
could happen.  E.g. the call to "branch (i)" could be pulled out of
the switch statement.  Or even the "*i += N;" could be moved to one
basic block and the switch would become just one if statement.

> Observed performance improvements on Neoverse N1 for SPEC CPU 2006 were
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  The average code size
> increase across the whole suite was ~0.4%.

Also does this improve any non-SPEC benchmarks or has it only been
benchmarked with SPEC?

A few comments about the patch itself:
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -321,7 +321,7 @@ aarch64*-*-*)
>  c_target_objs="aarch64-c.o"
>  cxx_target_objs="aarch64-c.o"
>  d_target_objs="aarch64-d.o"
> - extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
> + extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o aarch64-branch-dilution.o"
>  target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>  target_has_targetm_common=yes
>  ;;

I think it is time to change how extra_objs is done and split it
across a few lines; this should be done in a different patch, and it
will simplify future additions later on.
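
Something along these lines, for example (illustrative only):

	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o"
	extra_objs="$extra_objs cortex-a57-fma-steering.o aarch64-speculation.o"
	extra_objs="$extra_objs falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
	extra_objs="$extra_objs aarch64-branch-dilution.o"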

+unsigned max_branch = 0;
+unsigned granule_size = 0;

These two really should be part of the insn_granule class.  Having
global variables is not a good idea with respect to threading of GCC.

+      dump_printf (MSG_NOTE,
+    "%d. %s%s%s  > NEXT = (%d), PREV = (%d) -- UID: %d\n",
+    insn->index, GET_RTX_NAME (GET_CODE (insn->rtx)),
+    any_uncondjump_p (insn->rtx) ? " (ubranch)" : "",
+    insn->is_nop ? " (nop)" : "",
+    insn->next ? insn->next->index : -1,
+    insn->prev ? insn->prev->index : -1, INSN_UID (insn->rtx));

This part should really be a method of insn_info instead.

is_branch (insn->rtx)

Why not:
insn->is_branch ()

This simplifies the code and really shows what is being done.

+  if (is_branch (prev_real_nondebug_insn (insn->rtx))
+      && is_branch (next_real_nondebug_insn (insn->rtx)))

Maybe:
+  if (is_branch (insn->prev_real_nondebug_insn ())
+      && is_branch (insn->next_real_nondebug_insn ()))

+  while (current_insn && !current_insn->is_branch)
+    {
+      current_insn = current_insn->next;
+    }

Why not:
current_insn = current_insn->next_branch_insn ();

There are many more examples like the above where you can improve
things; that is, the way you define insn_info can be improved by
pushing some of the implementation back into the insn_info definition.
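
For instance, a minimal sketch of the kind of helper being suggested
(untested; only the relevant members are shown):

struct insn_info
{
  insn_info *next;
  bool is_branch;

  /* Return the next branch insn in the granule, or NULL.  */
  insn_info *next_branch_insn () const
  {
    insn_info *i = next;
    while (i && !i->is_branch)
      i = i->next;
    return i;
  }
};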

Thanks,
Andrew

>
> * Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window).  When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favor unconditional branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type requires a different method of nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by
>   emitting a generic nop.
>
> - Unconditional branches must end a basic block, and nops cannot live
>   outside of a basic block.  Hence the need for FILLER_INSN, which
>   allows placement outside of a basic block and translates to a nop.
>
> - For most conditional branches we take a simple approach and handle
>   only the fallthru edge, which we do by inserting a "nop block" of
>   nops on the fallthru edge and mapping that back to the original
>   destination block.
>
> - asm gotos and pcsets are going to be tricky to analyze from a
>   dilution perspective so are ignored at present.
>
> * Testing
>
> The two patches have been tested together on top of current master on
> aarch64-unknown-linux-gnu as follows:
>
> - Successful compilation of 3 stage bootstrap with the
>   pass forced on (for stage 2, 3)
>
> - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
>
> - No 'make check' regressions
>
>
> Regards
>
>   Andrea
>
> gcc/ChangeLog
>
> 2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
>             Carey Williams  <carey.williams@arm.com>
>
>         * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>         * config/aarch64/aarch64-branch-dilution.c: New file.
>         * config/aarch64/aarch64-passes.def (branch-dilution): Register
>         pass.
>         * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>         tuning parameters bdilution_gsize and bdilution_maxb.
>         (make_pass_branch_dilution): New declaration.
>         * config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
>         (cortexa53_tunings, cortexa57_tunings, cortexa72_tunings)
>         (cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings)
>         (thunderx_tunings, tsv110_tunings, xgene1_tunings)
>         (qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings)
>         (neoversen1_tunings): Provide default tunings for bdilution_gsize
>         and bdilution_maxb.
>         * config/aarch64/aarch64.md (filler_insn): Define new insn.
>         * config/aarch64/aarch64.opt (-mbranch-dilution)
>         (--param=aarch64-branch-dilution-granularity)
>         (--param=aarch64-branch-dilution-max-branches): Add new options.
>         * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
>         for aarch64-branch-dilution.c.
>         * doc/invoke.texi (-mbranch-dilution)
>         (--param=aarch64-branch-dilution-granularity)
>         (--param=aarch64-branch-dilution-max-branches): Document branch
>         dilution options.
>
> gcc/testsuite/ChangeLog
>
> 2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
>             Carey Williams  <carey.williams@arm.com>
>
>         * gcc.target/aarch64/branch-dilution-off.c: New file.
>         * gcc.target/aarch64/branch-dilution-on.c: New file.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-22 10:02 [PATCH 1/2] Add new RTX instruction class FILLER_INSN Andrea Corallo
  2020-07-22 10:09 ` [PATCH 2/2] Aarch64: Add branch diluter pass Andrea Corallo
@ 2020-07-22 12:24 ` Richard Biener
  2020-07-22 13:16   ` Richard Earnshaw (lists)
  2020-07-22 14:51   ` Andrea Corallo
  2020-07-22 18:41 ` Joseph Myers
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Richard Biener @ 2020-07-22 12:24 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: GCC Patches, nd, Richard Earnshaw

On Wed, Jul 22, 2020 at 12:03 PM Andrea Corallo <andrea.corallo@arm.com> wrote:
>
> Hi all,
>
> I'd like to submit the following two patches implementing a new AArch64
> specific back-end pass that helps optimize branch-dense code, which can
> be a bottleneck for performance on some Arm cores.  This is achieved by
> padding out the branch-dense sections of the instruction stream with
> nops.
>
> The original patch was already posted some time ago:
>
> https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg200721.html
>
> This follows up splitting as suggested in two patches, rebasing on
> master and implementing the suggestions of the first code review.
>
> This first patch implements the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block.  This is to allow padding after unconditional
> branches.  This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of
> nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.
>
> 1/2 is a requirement for 2/2.  Please see the cover letter of the
> latter for more details on the pass itself.

I wonder if such effect of instructions on the pipeline can be modeled
in the DFA and thus whether the scheduler could issue (always ready)
NOPs?

I also wonder whether such optimization is better suited for the assembler
which should know instruction lengths and alignment in a more precise
way and also would know whether extra nops make immediates too large
for pc relative things like short branches or section anchor accesses
(or whatever else)?

Richard.

> Regards
>
>   Andrea
>
> gcc/ChangeLog
>
> 2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
>             Carey Williams  <carey.williams@arm.com>
>
>         * cfgbuild.c (inside_basic_block_p): Handle FILLER_INSN.
>         * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
>         basic blocks.
>         * coretypes.h: New rtx class.
>         * emit-rtl.c (emit_filler_after): New function.
>         * rtl.def (FILLER_INSN): New rtl define.
>         * rtl.h (rtx_filler_insn): Define new structure.
>         (FILLER_INSN_P): New macro.
>         (is_a_helper <rtx_filler_insn *>::test): New test helper for
>         rtx_filler_insn.
>         (emit_filler_after): New extern.
>         * target-insns.def: Add target insn definition.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-22 12:24 ` [PATCH 1/2] Add new RTX instruction class FILLER_INSN Richard Biener
@ 2020-07-22 13:16   ` Richard Earnshaw (lists)
  2020-07-22 14:51   ` Andrea Corallo
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Earnshaw (lists) @ 2020-07-22 13:16 UTC (permalink / raw)
  To: Richard Biener, Andrea Corallo; +Cc: nd, GCC Patches

On 22/07/2020 13:24, Richard Biener via Gcc-patches wrote:
> On Wed, Jul 22, 2020 at 12:03 PM Andrea Corallo <andrea.corallo@arm.com> wrote:
>>
>> Hi all,
>>
>> I'd like to submit the following two patches implementing a new AArch64
>> specific back-end pass that helps optimize branch-dense code, which can
>> be a bottleneck for performance on some Arm cores.  This is achieved by
>> padding out the branch-dense sections of the instruction stream with
>> nops.
>>
>> The original patch was already posted some time ago:
>>
>> https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg200721.html
>>
>> This follows up splitting as suggested in two patches, rebasing on
>> master and implementing the suggestions of the first code review.
>>
>> This first patch implements the addition of a new RTX instruction class
>> FILLER_INSN, which has been white listed to allow placement of NOPs
>> outside of a basic block.  This is to allow padding after unconditional
>> branches.  This is favorable so that any performance gained from
>> diluting branches is not paid straight back via excessive eating of
>> nops.
>>
>> It was deemed that a new RTX class was less invasive than modifying
>> behavior in regards to standard UNSPEC nops.
>>
>> 1/2 is a requirement for 2/2.  Please see the cover letter of the
>> latter for more details on the pass itself.
> 
> I wonder if such effect of instructions on the pipeline can be modeled
> in the DFA and thus whether the scheduler could issue (always ready)
> NOPs?
> 
> I also wonder whether such optimization is better suited for the assembler
> which should know instruction lengths and alignment in a more precise
> way and also would know whether extra nops make immediates too large
> for pc relative things like short branches or section anchor accesses
> (or whatever else)?

No, the assembler should never spontaneously insert instructions.  That
breaks the branch range calculations that the compiler relies upon.

R.

> 
> Richard.
> 
>> Regards
>>
>>   Andrea
>>
>> gcc/ChangeLog
>>
>> 2020-07-17  Andrea Corallo  <andrea.corallo@arm.com>
>>             Carey Williams  <carey.williams@arm.com>
>>
>>         * cfgbuild.c (inside_basic_block_p): Handle FILLER_INSN.
>>         * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
>>         basic blocks.
>>         * coretypes.h: New rtx class.
>>         * emit-rtl.c (emit_filler_after): New function.
>>         * rtl.def (FILLER_INSN): New rtl define.
>>         * rtl.h (rtx_filler_insn): Define new structure.
>>         (FILLER_INSN_P): New macro.
>>         (is_a_helper <rtx_filler_insn *>::test): New test helper for
>>         rtx_filler_insn.
>>         (emit_filler_after): New extern.
>>         * target-insns.def: Add target insn definition.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 10:39   ` Andrew Pinski
@ 2020-07-22 13:53     ` Andrea Corallo
  2020-07-22 16:43       ` Segher Boessenkool
  0 siblings, 1 reply; 25+ messages in thread
From: Andrea Corallo @ 2020-07-22 13:53 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: GCC Patches, nd, Richard Earnshaw

Hi Andrew,

thanks for reviewing I'll work on your comments.  Just replying to the
high level questions.

Andrew Pinski <pinskia@gmail.com> writes:

> On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo <andrea.corallo@arm.com> wrote:
>>
>> Hi all,
>>
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the followings command line options:
>>
>> -mbranch-dilution
>>
>> --param=aarch64-branch-dilution-granularity={num}
>>
>> --param=aarch64-branch-dilution-max-branches={num}
>>
>> Some cores known to be able to benefit from this pass have been given
>> default tuning values for their granularity and max-branches.  Each
>> affected core has a very specific granule size and associated max-branch
>> limit.  This is a microarchitecture specific optimization.  Typical
>> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
>> granularity tuned to 0 will be ignored. Options are provided for
>> experimentation.
>
> Can you give a simple example of what this patch does?

Sure, this pass simply moves a sliding window over the insns trying to
make sure that we never have more than 'max_branch' branches for every
'granule_size' insns.

If too many branches are detected, nops are added where considered
less harmful to correct that.

There are obviously many scenarios where the compiler can generate
branch-dense pieces of code, but say we have the equivalent of:

====
.L389:
	bl	foo
	b	.L43
.L388:
	bl	foo
	b	.L42
.L387:
	bl	foo
	b	.L41
.L386:
	bl	foo
	b	.L40
====

Assuming granule size 4 and max branches 2 this will be transformed in
the equivalent of:

====
.L389:
	bl	foo
	b	.L43
        nop
        nop
.L388:
	bl	foo
	b	.L42
        nop
        nop
.L387:
	bl	foo
	b	.L41
        nop
        nop
.L386:
	bl	foo
	b	.L40
        nop
        nop
====
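
In rough C-like pseudocode, the window logic amounts to something like
this (a sketch reconstructed from the description above;
count_branches_in_window and dilute_granule are hypothetical helpers,
not the pass's actual API):

====
/* Sketch only.  Slide a window of granule_size insns along the
   stream; whenever it holds more than max_branch branches, insert
   nops to push the surplus branches into the next granule.  */
for (insn_info *head = first_branch; head; head = head->next)
  {
    unsigned branches = count_branches_in_window (head, granule_size);
    if (branches > max_branch)
      /* Favor padding after unconditional branches, or between two
         branches, as described above.  */
      dilute_granule (head, granule_size, branches - max_branch);
  }
====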

> Also your testcases seem too sensitive to other optimizations which
> could happen.  E.g. the call to "branch (i)" could be pulled out of
> the switch statement.  Or even the "*i += N;" could be moved to one
> Basic block and the switch becomes just one if statement.
>
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 where
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Also does this improve any non-SPEC benchmarks or has it only been
> benchmarked with SPEC?

So far I tried it only on SPEC 2006.  The transformation is not
benchmark specific though; other code may benefit from it.

Thanks

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-22 12:24 ` [PATCH 1/2] Add new RTX instruction class FILLER_INSN Richard Biener
  2020-07-22 13:16   ` Richard Earnshaw (lists)
@ 2020-07-22 14:51   ` Andrea Corallo
  1 sibling, 0 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-07-22 14:51 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, nd, Richard Earnshaw

Richard Biener <richard.guenther@gmail.com> writes:

> I wonder if such effect of instructions on the pipeline can be modeled
> in the DFA and thus whether the scheduler could issue (always ready)
> NOPs?

I might be wrong, but the DFA model should be reasoning in terms of
executed instructions given an execution path; on the contrary, this is
taking into account the 'footprint' of the branches of a program in
memory.  This is what some u-archs are sensitive to.

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 13:53     ` Andrea Corallo
@ 2020-07-22 16:43       ` Segher Boessenkool
  2020-07-22 19:45         ` Andrea Corallo
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2020-07-22 16:43 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: Andrew Pinski, Richard Earnshaw, nd, GCC Patches

Hi!

On Wed, Jul 22, 2020 at 03:53:34PM +0200, Andrea Corallo wrote:
> Andrew Pinski <pinskia@gmail.com> writes:
> > Can you give a simple example of what this patch does?
> 
> Sure, this pass simply moves a sliding window over the insns trying to
> make sure that we never have more than 'max_branch' branches for every
> 'granule_size' insns.
> 
> If too many branches are detected, nops are added where considered
> less harmful to correct that.

Should that actually be a sliding window, or should there actually just
not be more than N branches per aligned block of machine code?  Like,
per fetch group.

Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
even) then?  GCC has infrastructure for that, already.


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-22 10:02 [PATCH 1/2] Add new RTX instruction class FILLER_INSN Andrea Corallo
  2020-07-22 10:09 ` [PATCH 2/2] Aarch64: Add branch diluter pass Andrea Corallo
  2020-07-22 12:24 ` [PATCH 1/2] Add new RTX instruction class FILLER_INSN Richard Biener
@ 2020-07-22 18:41 ` Joseph Myers
  2020-07-24 21:18 ` Segher Boessenkool
  2020-08-19 16:51 ` Segher Boessenkool
  4 siblings, 0 replies; 25+ messages in thread
From: Joseph Myers @ 2020-07-22 18:41 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: gcc-patches, nd, Richard Earnshaw

New insn types should be documented in rtl.texi (I think in the "Insns" 
section).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 16:43       ` Segher Boessenkool
@ 2020-07-22 19:45         ` Andrea Corallo
  2020-07-23 22:47           ` Segher Boessenkool
  0 siblings, 1 reply; 25+ messages in thread
From: Andrea Corallo @ 2020-07-22 19:45 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Andrew Pinski, Richard Earnshaw, nd, GCC Patches

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> On Wed, Jul 22, 2020 at 03:53:34PM +0200, Andrea Corallo wrote:
>> Andrew Pinski <pinskia@gmail.com> writes:
>> > Can you give a simple example of what this patch does?
>>
>> Sure, this pass simply moves a sliding window over the insns trying to
>> make sure that we never have more than 'max_branch' branches for every
>> 'granule_size' insns.
>>
>> If too many branches are detected, nops are added where considered
>> less harmful to correct that.
>
> Should that actually be a sliding window, or should there actually just
> not be more than N branches per aligned block of machine code?  Like,
> per fetch group.
>
> Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
> even) then?  GCC has infrastructure for that, already.

Correct, it's a sliding window only because the real load address is not
known to the compiler and the algorithm is conservative.  I believe we
could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
least) the granule size, then we should be able to insert 'nop aligned
labels' precisely.

My main fear is that, given new cores tend to have big granules, code
size would blow up.  One advantage of the implemented algorithm is
that, even if slightly conservative, it impacts code size only where a
high branch density shows up.

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 19:45         ` Andrea Corallo
@ 2020-07-23 22:47           ` Segher Boessenkool
  2020-07-24  7:01             ` Andrea Corallo
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2020-07-23 22:47 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: Andrew Pinski, Richard Earnshaw, nd, GCC Patches

On Wed, Jul 22, 2020 at 09:45:08PM +0200, Andrea Corallo wrote:
> > Should that actually be a sliding window, or should there actually just
> > not be more than N branches per aligned block of machine code?  Like,
> > per fetch group.
> >
> > Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
> > even) then?  GCC has infrastructure for that, already.
> 
> Correct, it's a sliding window only because the real load address is not
> known to the compiler and the algorithm is conservative.  I believe we
> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
> least) the granule size, then we should be able to insert 'nop aligned
> labels' precisely.

Yeah, we have similar issues on Power...  Our "granule" (fetch group
size, in our terminology) is 32 typically, but we align functions to
just 16.  This is causing some problems, but aligning to bigger
boundaries isn't a very happy alternative either.  WIP...

(We don't have this exact same problem, because our non-ancient cores
can just predict *all* branches in the same cycle).

> My main fear is that, given new cores tend to have big granules, code
> size would blow up.  One advantage of the implemented algorithm is
> that, even if slightly conservative, it impacts code size only where a
> high branch density shows up.

What is "big granules" for you?


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-23 22:47           ` Segher Boessenkool
@ 2020-07-24  7:01             ` Andrea Corallo
  2020-07-24 11:53               ` Segher Boessenkool
  0 siblings, 1 reply; 25+ messages in thread
From: Andrea Corallo @ 2020-07-24  7:01 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Andrew Pinski, Richard Earnshaw, nd, GCC Patches

Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Wed, Jul 22, 2020 at 09:45:08PM +0200, Andrea Corallo wrote:
>> > Should that actually be a sliding window, or should there actually just
>> > not be more than N branches per aligned block of machine code?  Like,
>> > per fetch group.
>> >
>> > Can you not use ASM_OUTPUT_ALIGN_WITH_NOP (or ASM_OUTPUT_MAX_SKIP_ALIGN
>> > even) then?  GCC has infrastructure for that, already.
>> 
>> Correct, it's a sliding window only because the real load address is not
>> known to the compiler and the algorithm is conservative.  I believe we
>> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
>> least) the granule size, then we should be able to insert 'nop aligned
>> labels' precisely.
>
> Yeah, we have similar issues on Power...  Our "granule" (fetch group
> size, in our terminology) is 32 typically, but we align functions to
> just 16.  This is causing some problems, but aligning to bigger
> boundaries isn't a very happy alternative either.  WIP...

Interesting, I was expecting other CPUs to have a similar mechanism.

> (We don't have this exact same problem, because our non-ancient cores
> can just predict *all* branches in the same cycle).
>
>> My main fear is that, given new cores tend to have big granules, code
>> size would blow up.  One advantage of the implemented algorithm is
>> that, even if slightly conservative, it impacts code size only where a
>> high branch density shows up.
>
> What is "big granules" for you?

N1 is 8 instructions so 32 bytes as well, I guess this may grow further
(my speculation).

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-24  7:01             ` Andrea Corallo
@ 2020-07-24 11:53               ` Segher Boessenkool
  2020-07-24 13:21                 ` Andrea Corallo
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2020-07-24 11:53 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: Andrew Pinski, Richard Earnshaw, nd, GCC Patches

Hi!

On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> Correct, it's a sliding window only because the real load address is not
> >> known to the compiler and the algorithm is conservative.  I believe we
> >> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
> >> least) the granule size, then we should be able to insert 'nop aligned
> >> labels' precisely.
> >
> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
> > size, in our terminology) is 32 typically, but we align functions to
> > just 16.  This is causing some problems, but aligning to bigger
> > boundaries isn't a very happy alternative either.  WIP...
> 
> Interesting, I was expecting other CPUs to have a similar mechanism.

On old cpus (like the 970) there were at most two branch predictions per
cycle.  Nowadays, all branches are predicted; not sure when this changed,
it is pretty long ago already.

> > (We don't have this exact same problem, because our non-ancient cores
> > can just predict *all* branches in the same cycle).
> >
> >> My main fear is that, given new cores tend to have big granules, code
> >> size would blow up.  One advantage of the implemented algorithm is
> >> that, even if slightly conservative, it impacts code size only where a
> >> high branch density shows up.
> >
> > What is "big granules" for you?
> 
> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
> (my speculation).

It has to sooner rather than later, yeah.  Or the mechanism has to change
more radically.  Interesting times ahead, I guess :-)


About your patch itself.  The basic idea seems fine (I didn't look too
closely), but do you really need a new RTX class for this?  That is not
very appetising...


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-24 11:53               ` Segher Boessenkool
@ 2020-07-24 13:21                 ` Andrea Corallo
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-07-24 13:21 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Andrew Pinski, Richard Earnshaw, nd, GCC Patches

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> Correct, it's a sliding window only because the real load address is not
>> >> known to the compiler and the algorithm is conservative.  I believe we
>> >> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
>> >> least) the granule size, then we should be able to insert 'nop aligned
>> >> labels' precisely.
>> >
>> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
>> > size, in our terminology) is 32 typically, but we align functions to
>> > just 16.  This is causing some problems, but aligning to bigger
>> > boundaries isn't a very happy alternative either.  WIP...
>> 
>> Interesting, I was expecting other CPUs to have a similar mechanism.
>
> On old cpus (like the 970) there were at most two branch predictions per
> cycle.  Nowadays, all branches are predicted; not sure when this changed,
> it is pretty long ago already.
>
>> > (We don't have this exact same problem, because our non-ancient cores
>> > can just predict *all* branches in the same cycle).
>> >
>> >> My main fear is that, given new cores tend to have big granules, code
>> >> size would blow up.  One advantage of the implemented algorithm is
>> >> that, even if slightly conservative, it impacts code size only where a
>> >> high branch density shows up.
>> >
>> > What is "big granules" for you?
>> 
>> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
>> (my speculation).
>
> It has to sooner rather than later, yeah.  Or the mechanism has to change
> more radically.  Interesting times ahead, I guess :-)

Indeed :)

> About your patch itself.  The basic idea seems fine (I didn't look too
> closely), but do you really need a new RTX class for this?  That is not
> very appetising...

Agreed; OTOH I'm not sure about the other options on the table and
their impact.  The advantage of this one is that its impact is
relatively contained.

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-22 10:02 [PATCH 1/2] Add new RTX instruction class FILLER_INSN Andrea Corallo
                   ` (2 preceding siblings ...)
  2020-07-22 18:41 ` Joseph Myers
@ 2020-07-24 21:18 ` Segher Boessenkool
  2020-07-26 18:19   ` Eric Botcazou
                     ` (2 more replies)
  2020-08-19 16:51 ` Segher Boessenkool
  4 siblings, 3 replies; 25+ messages in thread
From: Segher Boessenkool @ 2020-07-24 21:18 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: gcc-patches, nd, Richard Earnshaw

Hi Andrea,

On Wed, Jul 22, 2020 at 12:02:33PM +0200, Andrea Corallo wrote:
> This first patch implements the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block.  This is to allow padding after unconditional
> branches.  This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of
> nops.

> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.

So I wonder if this cannot be done with some kind of NOTE, instead?

> +	      /* Allow nops after branches, via FILLER_INSN.  */
> +	      bool fail = true;
> +	      subrtx_iterator::array_type array;
> +	      FOR_EACH_SUBRTX (iter, array, x, ALL)
> +		{
> +		  const_rtx rtx = *iter;
> +		  if (GET_CODE (rtx) == FILLER_INSN)
> +		    {
> +		      fail = false;
> +		      break;
> +		    }
> +		}
> +	      if (fail)
> +		fatal_insn ("insn outside basic block", x);

This stops checking after finding a FILLER_INSN as first insn.  Is that
on purpose?

> +/* Make an insn of code FILLER_INSN to
> +   pad out the instruction stream.
> +   PATTERN should be from gen_filler_insn ().
> +   AFTER will typically be an unconditional
> +   branch at the end of a basic block.  */

Because it only allows one particular insn pattern, it should be pretty
easy to use a NOTE for this, or even a normal INSN where the outside-of-BB
test can then just look at its pattern.


As you see, I really do not like to have another RTX class, without very
well defined semantics even.  Not without first being shown no
alternatives are acceptable, anyway :-)


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-22 10:09 ` [PATCH 2/2] Aarch64: Add branch diluter pass Andrea Corallo
  2020-07-22 10:39   ` Andrew Pinski
@ 2020-07-24 22:09   ` Segher Boessenkool
  2020-07-28 18:55     ` Andrea Corallo
  1 sibling, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2020-07-24 22:09 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: gcc-patches, nd, Richard Earnshaw

Hi!

[ Btw, the mailing list archive will not show your attachments (just lets
me download them); naming the files *.txt probably works, but you can
also use a sane mime type (like, text/plain) ].

On Wed, Jul 22, 2020 at 12:09:08PM +0200, Andrea Corallo wrote:
> this second patch implements the AArch64 specific back-end pass
> 'branch-dilution' controllable by the followings command line options:
> 
> -mbranch-dilution
> 
> --param=aarch64-branch-dilution-granularity={num}
> 
> --param=aarch64-branch-dilution-max-branches={num}

That sounds like something that would be useful generically, even?
Targets that do not want to use it can just skip it (that probably should
be the default then), via setting the granularity to 0 for example.

> Observed performance improvements on Neoverse N1 SPEC CPU 2006 where
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
> for all the testsuite proved to be ~0.4%.

Can you share a geomean improvement as well?  Also, it sounds like
something around 0.4%, or is it more?

> - Unconditional branches must be the end of a basic block, and nops
>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>   which allows placement outside of a basic block

But the new rtx class is recognised in just one location, so that check
could easily recognise it by some other characteristic.

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
> +/* { dg-final { scan-assembler-not  "\\s*b.*\n\\snop\n" } } */

You can write this simpler as

/* { dg-final { scan-assembler-not {\s*b.*\n\snop\n} } } */

which shows a problem in this more clearly: . will match newlines as
well.  Also, starting a RE with (anything)* does nothing, "anything" is
allowed to match 0 times after all.  You probably meant the "b" should
start a mnemonic?

/* { dg-final { scan-assembler-not {(?n)\mb.*\n\snop\n} } } */

(\m is a zero-width start-of-word match, like \< in grep; (?n) means .
does not match newlines (if you know Perl, it turns /m on and /s off --
the opposite of the defaults for Tcl).

(or you could do [^\n]* or even just \S* , no (?n) needed then).


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-24 21:18 ` Segher Boessenkool
@ 2020-07-26 18:19   ` Eric Botcazou
  2020-07-28 19:29   ` Andrea Corallo
  2020-08-19  9:13   ` Andrea Corallo
  2 siblings, 0 replies; 25+ messages in thread
From: Eric Botcazou @ 2020-07-26 18:19 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, Andrea Corallo, Richard Earnshaw, nd

> As you see, I really do not like to have another RTX class, without very
> well defined semantics even.  Not without first being shown no
> alternatives are acceptable, anyway :-)

Seconded.

-- 
Eric Botcazou



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-24 22:09   ` Segher Boessenkool
@ 2020-07-28 18:55     ` Andrea Corallo
  2020-07-28 22:07       ` Segher Boessenkool
  0 siblings, 1 reply; 25+ messages in thread
From: Andrea Corallo @ 2020-07-28 18:55 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, nd, Richard Earnshaw

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!

Hi Segher,

> [ Btw, the mailing list archive will not show your attachments (just lets
> me download them); naming the files *.txt probably works, but you can
> also use a sane mime type (like, text/plain) ].

[ Sure can do it np, I'm just less sure if text/x-diff is such an
insane mime type for a mailing list software :) ]

> On Wed, Jul 22, 2020 at 12:09:08PM +0200, Andrea Corallo wrote:
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the followings command line options:
>> 
>> -mbranch-dilution
>> 
>> --param=aarch64-branch-dilution-granularity={num}
>> 
>> --param=aarch64-branch-dilution-max-branches={num}
>
> That sounds like something that would be useful generically, even?
> Targets that do not want to use it can just skip it (that probably should
> be the default then), via setting the granularity to 0 for example.
> 
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 where
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Can you share a geomean improvement as well?  Also, it sounds like
> something around 0.4%, or is it more?

After my first measurements, a colleague suggested a less noisy system
to benchmark on and a more reproducible methodology.  I repeated the
tests on N1 with the following results:

| Benchmark      | Est. Peak Rate ratio  |
|                |    diluted / baseline |
|----------------+-----------------------|
| 400.perlbench  |                1.018x |
| 401.bzip2      |                1.004x |
| 403.gcc        |                0.987x |
| 429.mcf        |                1.000x |
| 445.gobmk      |                0.998x |
| 456.hmmer      |                1.000x |
| 458.sjeng      |                1.008x |
| 462.libquantum |                1.014x |
| 464.h264ref    |                1.004x |
| 471.omnetpp    |                1.017x |
| 473.astar      |                1.007x |
| 483.xalancbmk  |                0.998x |

I was told xalancbmk tends to be very noisy, being memory bound, so
this explains the difference; not sure why sjeng looks less good.  The
overall ratio comparing spec rates is +~0.44%.
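
As a cross-check, the geometric mean of the twelve ratios in the table
works out to ~1.0045, consistent with that figure.  A throwaway snippet
to verify the arithmetic (not part of the patch):

====
#include <cmath>
#include <cstdio>

int
main ()
{
  /* Per-benchmark ratios from the table above.  */
  const double r[] = { 1.018, 1.004, 0.987, 1.000, 0.998, 1.000,
                       1.008, 1.014, 1.004, 1.017, 1.007, 0.998 };
  double log_sum = 0.0;
  for (double x : r)
    log_sum += std::log (x);
  /* Geometric mean = exp of the mean log; prints ~1.0045.  */
  std::printf ("%.4f\n", std::exp (log_sum / 12));
  return 0;
}
====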

>> - Unconditional branches must be the end of a basic block, and nops
>>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>>   which allows placement outside of a basic block
>
> But the new rtx class is recognised in just one location, so that check
> could easily recognise it by some other characteristic.

Makes sense; unless this changes in the future, it is probably not
worth taking into account now.

>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/branch-dilution-off.c
>> @@ -0,0 +1,57 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O1 -mcpu=cortex-a72 --param case-values-threshold=50" } */
>> +/* { dg-final { scan-assembler-not  "\\s*b.*\n\\snop\n" } } */
>
> You can write this simpler as
>
> /* { dg-final { scan-assembler-not {\s*b.*\n\snop\n} } } */
>
> which shows a problem in this more clearly: . will match newlines as
> well.  Also, starting a RE with (anything)* does nothing, "anything" is
> allowed to match 0 times after all.  You probably meant the "b" should
> start a mnemonic?
>
> /* { dg-final { scan-assembler-not {(?n)\mb.*\n\snop\n} } } */
>
> (\m is a zero-width start-of-word match, like \< in grep; (?n) means .
> does not match newlines (if you know Perl, it turns /m on and /s off --
> the opposite of the defaults for Tcl).
>
> (or you could do [^\n]* or even just \S* , no (?n) needed then).
>
>
> Segher

Thanks for the feedback!

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-24 21:18 ` Segher Boessenkool
  2020-07-26 18:19   ` Eric Botcazou
@ 2020-07-28 19:29   ` Andrea Corallo
  2020-08-19  9:13   ` Andrea Corallo
  2 siblings, 0 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-07-28 19:29 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, nd, Richard Earnshaw

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi Andrea,
>
> On Wed, Jul 22, 2020 at 12:02:33PM +0200, Andrea Corallo wrote:
>> This first patch implements the addition of a new RTX instruction class
>> FILLER_INSN, which has been white listed to allow placement of NOPs
>> outside of a basic block.  This is to allow padding after unconditional
>> branches.  This is favorable so that any performance gained from
>> diluting branches is not paid straight back via excessive eating of
>> nops.
>
>> It was deemed that a new RTX class was less invasive than modifying
>> behavior in regards to standard UNSPEC nops.
>
> So I wonder if this cannot be done with some kind of NOTE, instead?
>
>> +	      /* Allow nops after branches, via FILLER_INSN.  */
>> +	      bool fail = true;
>> +	      subrtx_iterator::array_type array;
>> +	      FOR_EACH_SUBRTX (iter, array, x, ALL)
>> +		{
>> +		  const_rtx rtx = *iter;
>> +		  if (GET_CODE (rtx) == FILLER_INSN)
>> +		    {
>> +		      fail = false;
>> +		      break;
>> +		    }
>> +		}
>> +	      if (fail)
>> +		fatal_insn ("insn outside basic block", x);
>
> This stops checking after finding a FILLER_INSN as first insn.  Is that
> on purpose?

I guess after one filler we'd expect to find only other fillers, but
yeah, I missed this while reviving the patch; agreed it should be
improved.

>> +/* Make an insn of code FILLER_INSN to
>> +   pad out the instruction stream.
>> +   PATTERN should be from gen_filler_insn ().
>> +   AFTER will typically be an unconditional
>> +   branch at the end of a basic block.  */
>
> Because it only allows one particular insn pattern, it should be pretty
>> easy to use a NOTE for this, or even a normal INSN where the outside-of-BB
> test can then just look at its pattern.
>
>
> As you see, I really do not like to have another RTX class, without very
> well defined semantics even.  Not without first being shown no
> alternatives are acceptable, anyway :-)

I see I see :)  I really don't have any strong opinion about this.  I'll
be happy to rework the patch in this direction if this is the outcome
suggested by the code review.

Thanks!

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] Aarch64: Add branch diluter pass
  2020-07-28 18:55     ` Andrea Corallo
@ 2020-07-28 22:07       ` Segher Boessenkool
  0 siblings, 0 replies; 25+ messages in thread
From: Segher Boessenkool @ 2020-07-28 22:07 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: gcc-patches, nd, Richard Earnshaw

Hi!

On Tue, Jul 28, 2020 at 07:55:58PM +0100, Andrea Corallo wrote:
> > [ Btw, the mailing list archive will not show your attachments (just lets
> > me download them); naming the files *.txt probably works, but you can
> > also use a sane mime type (like, text/plain) ].
> 
> [ Sure can do it np, I'm just less sure if text/x-diff is such an
> insane mime type for a mailing list software :) ]

Well, first and foremost, it does not *work* with our current mailing
list setup.  Also, anything x- only makes sense if you know the software
the receiver runs (and of course you don't, this is public email, with
hundreds or thousands of readers).

> > Can you share a geomean improvement as well?  Also, it sounds like
> > something around 0.4%, or is it more?
> 
> After my first measurements, a colleague suggested a less noisy system
> to benchmark on and a more reproducible methodology.  I repeated the
> tests on N1 with the following results:
> 
> | Benchmark      | Est. Peak Rate ratio  |
> |                |    diluted / baseline |
> |----------------+-----------------------|
> | 400.perlbench  |                1.018x |
> | 401.bzip2      |                1.004x |
> | 403.gcc        |                0.987x |
> | 429.mcf        |                1.000x |
> | 445.gobmk      |                0.998x |
> | 456.hmmer      |                1.000x |
> | 458.sjeng      |                1.008x |
> | 462.libquantum |                1.014x |
> | 464.h264ref    |                1.004x |
> | 471.omnetpp    |                1.017x |
> | 473.astar      |                1.007x |
> | 483.xalancbmk  |                0.998x |

Cool.  Well, gcc has a pretty big drop, what's up with that?  Everything
else looks just great :-)

> I was told xalancbmk tends to be very noisy, being memory bound, so
> this explains the difference; not sure why sjeng looks less good.

Sjeng is very jumpy code, so of course your patch will influence it a
lot.  No idea why it is less positive this run.

> The
> overall ratio comparing spec rates is +~0.44%.

Yup, a nice healthy improvement :-)


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-24 21:18 ` Segher Boessenkool
  2020-07-26 18:19   ` Eric Botcazou
  2020-07-28 19:29   ` Andrea Corallo
@ 2020-08-19  9:13   ` Andrea Corallo
  2020-08-19 10:52     ` Richard Sandiford
  2020-08-19 17:28     ` Segher Boessenkool
  2 siblings, 2 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-08-19  9:13 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, nd, Richard Earnshaw

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi Andrea,
>
> On Wed, Jul 22, 2020 at 12:02:33PM +0200, Andrea Corallo wrote:
>> This first patch implements the addition of a new RTX instruction class
>> FILLER_INSN, which has been white listed to allow placement of NOPs
>> outside of a basic block.  This is to allow padding after unconditional
>> branches.  This is favorable so that any performance gained from
>> diluting branches is not paid straight back via excessive eating of
>> nops.
>
>> It was deemed that a new RTX class was less invasive than modifying
>> behavior in regards to standard UNSPEC nops.
>
> So I wonder if this cannot be done with some kind of NOTE, instead?
>

Hi Segher,

I was having a look into reworking this using an insn note as (IIUC)
suggested.  The idea is appealing but looking into insn-notes.def I've
found the following comment:

"We are slowly removing the concept of insn-chain notes from the
compiler.  Adding new codes to this file is STRONGLY DISCOURAGED.
If you think you need one, look for other ways to express what you
mean, such as register notes or bits in the basic-block structure."

Would it still be justified in this case to proceed this way?  The
other option would be to add the information into the basic-block
structure or into struct rtx_jump_insn.

My GCC experience is far from sufficient for having a formed opinion on
this; I'd probably bet on struct rtx_jump_insn as the better option.

Any thoughts?

Thanks!

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-08-19  9:13   ` Andrea Corallo
@ 2020-08-19 10:52     ` Richard Sandiford
  2020-08-19 17:28     ` Segher Boessenkool
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Sandiford @ 2020-08-19 10:52 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: Segher Boessenkool, Richard Earnshaw, nd, gcc-patches

Andrea Corallo <andrea.corallo@arm.com> writes:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
>
>> Hi Andrea,
>>
>> On Wed, Jul 22, 2020 at 12:02:33PM +0200, Andrea Corallo wrote:
>>> This first patch implements the addition of a new RTX instruction class
>>> FILLER_INSN, which has been white listed to allow placement of NOPs
>>> outside of a basic block.  This is to allow padding after unconditional
>>> branches.  This is favorable so that any performance gained from
>>> diluting branches is not paid straight back via excessive eating of
>>> nops.
>>
>>> It was deemed that a new RTX class was less invasive than modifying
>>> behavior in regards to standard UNSPEC nops.
>>
>> So I wonder if this cannot be done with some kind of NOTE, instead?
>>
>
> Hi Segher,
>
> I was having a look into reworking this using an insn note as (IIUC)
> suggested.  The idea is appealing but looking into insn-notes.def I've
> found the following comment:
>
> "We are slowly removing the concept of insn-chain notes from the
> compiler.  Adding new codes to this file is STRONGLY DISCOURAGED.
> If you think you need one, look for other ways to express what you
> mean, such as register notes or bits in the basic-block structure."
>
> Would it still be justified in this case to proceed this way?  The
> other option would be to add the information into the basic-block
> structure or into struct rtx_jump_insn.
>
> My GCC experience is far from sufficient for having a formed opinion on
> this; I'd probably bet on struct rtx_jump_insn as the better option.

Adding it to the basic block structure wouldn't work because we need
this information to survive until asm output time, and the cfg doesn't
last that long.  (Would be nice if it did, but that's a whole new can
of worms.)

Using REG_NOTES on the jump might be OK.  I guess the note value could
be the length in bytes.  shorten_branches would then need to look for
these notes and add the associated length after adding the length of
the insn itself.  There would then need to be some hook that final.c
can call to emit nops of the given length.
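
A minimal sketch of that reg-note option, assuming a hypothetical new
note kind REG_FILLER_LEN (it would need an entry in reg-notes.def, and
the final.c hook is left out):

====
/* In the dilution pass: record the padding length in bytes on the
   jump itself.  REG_FILLER_LEN is hypothetical.  */
add_reg_note (jump, REG_FILLER_LEN, GEN_INT (n_nops * 4));

/* In shorten_branches: account for the padding when computing insn
   addresses.  */
if (rtx note = find_reg_note (insn, REG_FILLER_LEN, NULL_RTX))
  insn_current_address += INTVAL (XEXP (note, 0));
====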

I guess there's also the option of representing this in the same way
as a delayed branch sequence, which is to make the jump insn pattern:

  (sequence [(normal jump insn)
             (delayed insn 1)
             ...])

The members of the sequence are full insns, rather than just patterns.
For this use case, the delayed insns would all be nops.

However, not much is prepared to handle the sequence representation
before the normal pass_machine_reorg position.  (The main dbr pass
itself is pass_delay_slots, but some targets run dbr within
pass_machine_reorg instead.)  There again, it isn't worth doing
layout optimisations earlier than pass_machine_reorg anyway.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-07-22 10:02 [PATCH 1/2] Add new RTX instruction class FILLER_INSN Andrea Corallo
                   ` (3 preceding siblings ...)
  2020-07-24 21:18 ` Segher Boessenkool
@ 2020-08-19 16:51 ` Segher Boessenkool
  2020-08-19 17:47   ` Andrea Corallo
  4 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2020-08-19 16:51 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: gcc-patches, nd, Richard Earnshaw

[ Please don't post new patch series as replies to old ]

On Wed, Jul 22, 2020 at 12:02:33PM +0200, Andrea Corallo wrote:
> This first patch implements the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block.  This is to allow padding after unconditional
> branches.  This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of
> nops.
> 
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.

Deemed, by whom?  There are several people very against it, too.  You
need to modify only one simple behaviour (maybe in a handful of
places); making a new RTX class for that is excessive.

> 	* cfgbuild.c (inside_basic_block_p): Handle FILLER_INSN.
> 	* cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
> 	basic blocks.
> 	* coretypes.h: New rtx class.

coretypes.h is not a new RTX class?  :-)  Maybe:

	* coretypes.h (rtx_filler_insn): New struct.

> @@ -3033,7 +3034,20 @@ rtl_verify_bb_layout (void)
>  	      break;
>  
>  	    default:
> -	      fatal_insn ("insn outside basic block", x);
> +	      /* Allow nops after branches, via FILLER_INSN.  */
> +	      bool fail = true;
> +	      subrtx_iterator::array_type array;
> +	      FOR_EACH_SUBRTX (iter, array, x, ALL)
> +		{
> +		  const_rtx rtx = *iter;
> +		  if (GET_CODE (rtx) == FILLER_INSN)
> +		    {
> +		      fail = false;
> +		      break;
> +		    }
> +		}
> +	      if (fail)
> +		fatal_insn ("insn outside basic block", x);
>  	    }
>  	}

It wouldn't be hard to allow some existing RTL here.  Maybe something
with CODE_FOR_filler_nop or similar (after you recog () it).

It still allows anything after leading filler insns, btw; you could get
rid of the "fail" variable altogether and just call fatal_insn as soon
as you see some unexpected RTX code.
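
For instance, something like this (a sketch of the fail-fast version,
using the FILLER_INSN_P macro this patch adds):

====
default:
  /* Only filler insns may live outside a basic block; fail fast on
     anything else -- no "fail" flag, no sub-rtx walk.  */
  if (!FILLER_INSN_P (x))
    fatal_insn ("insn outside basic block", x);
====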

> +  rtx_insn* i = make_insn_raw (pattern);

rtx_insn *i = ...


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-08-19  9:13   ` Andrea Corallo
  2020-08-19 10:52     ` Richard Sandiford
@ 2020-08-19 17:28     ` Segher Boessenkool
  1 sibling, 0 replies; 25+ messages in thread
From: Segher Boessenkool @ 2020-08-19 17:28 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: gcc-patches, nd, Richard Earnshaw

On Wed, Aug 19, 2020 at 11:13:40AM +0200, Andrea Corallo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > So I wonder if this cannot be done with some kind of NOTE, instead?
> 
> I was having a look into reworking this using an insn note as (IIUC)
> suggested.  The idea is appealing but looking into insn-notes.def I've
> found the following comment:
> 
> "We are slowly removing the concept of insn-chain notes from the
> compiler.  Adding new codes to this file is STRONGLY DISCOURAGED.
> If you think you need one, look for other ways to express what you
> mean, such as register notes or bits in the basic-block structure."

That is from 2004.  Since then 9 note types have been removed, but 7
new types added.  (There are 18 insn note types now).

> Would it still be justified in this case to proceed this way?

Yes, it is a lesser evil imho.

> The other
> option would be to add the information into the basic-block or into
> struct rtx_jump_insn.

Or just look at the insn code, define a "filler-nop" insn, allow those
after BBs?  Any reason that wouldn't work?


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] Add new RTX instruction class FILLER_INSN
  2020-08-19 16:51 ` Segher Boessenkool
@ 2020-08-19 17:47   ` Andrea Corallo
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Corallo @ 2020-08-19 17:47 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, nd, Richard Earnshaw

Segher Boessenkool <segher@kernel.crashing.org> writes:

> [ Please don't post new patch series as replies to old ]
>
> On Wed, Jul 22, 2020 at 12:02:33PM +0200, Andrea Corallo wrote:
>> This first patch implements the addition of a new RTX instruction class
>> FILLER_INSN, which has been white listed to allow placement of NOPs
>> outside of a basic block.  This is to allow padding after unconditional
>> branches.  This is favorable so that any performance gained from
>> diluting branches is not paid straight back via excessive eating of
>> nops.
>> 
>> It was deemed that a new RTX class was less invasive than modifying
>> behavior in regards to standard UNSPEC nops.
>
> Deemed, by whom?  There are several people very against it, too.  You
> need to modify only one simple behaviour (maybe in a handful of places);
> making a new RTX class for that is excessive.

Hi Segher,

That's understood and agreed.  I haven't posted any new patch on this;
the quoted mail is an old one.  With my mail of today I just wanted to
discuss how to proceed this way.

Thanks for your feedback!

  Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-08-19 17:47 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-22 10:02 [PATCH 1/2] Add new RTX instruction class FILLER_INSN Andrea Corallo
2020-07-22 10:09 ` [PATCH 2/2] Aarch64: Add branch diluter pass Andrea Corallo
2020-07-22 10:39   ` Andrew Pinski
2020-07-22 13:53     ` Andrea Corallo
2020-07-22 16:43       ` Segher Boessenkool
2020-07-22 19:45         ` Andrea Corallo
2020-07-23 22:47           ` Segher Boessenkool
2020-07-24  7:01             ` Andrea Corallo
2020-07-24 11:53               ` Segher Boessenkool
2020-07-24 13:21                 ` Andrea Corallo
2020-07-24 22:09   ` Segher Boessenkool
2020-07-28 18:55     ` Andrea Corallo
2020-07-28 22:07       ` Segher Boessenkool
2020-07-22 12:24 ` [PATCH 1/2] Add new RTX instruction class FILLER_INSN Richard Biener
2020-07-22 13:16   ` Richard Earnshaw (lists)
2020-07-22 14:51   ` Andrea Corallo
2020-07-22 18:41 ` Joseph Myers
2020-07-24 21:18 ` Segher Boessenkool
2020-07-26 18:19   ` Eric Botcazou
2020-07-28 19:29   ` Andrea Corallo
2020-08-19  9:13   ` Andrea Corallo
2020-08-19 10:52     ` Richard Sandiford
2020-08-19 17:28     ` Segher Boessenkool
2020-08-19 16:51 ` Segher Boessenkool
2020-08-19 17:47   ` Andrea Corallo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).