public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH][AArch64] Separate shrink wrapping hooks implementation
@ 2016-11-10 14:26 Kyrill Tkachov
  2016-11-10 16:33 ` Segher Boessenkool
  2016-11-10 22:42 ` Andrew Pinski
  0 siblings, 2 replies; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-10 14:26 UTC (permalink / raw)
  To: GCC Patches
  Cc: Marcus Shawcroft, Richard Earnshaw, James Greenhalgh, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 4529 bytes --]

Hi all,

This patch implements the new separate shrink-wrapping hooks for aarch64.
In separate shrink wrapping (as I understand it) each register save/restore is treated as
a 'component' that can be performed independently of the other saves/restores in the
prologue/epilogue. A component can be moved out of the prologue/epilogue and performed
only in the basic blocks where it's actually needed, which lets us avoid saving and
restoring registers on execution paths where they are not needed.

In its most general form a 'component' can be any operation the prologue/epilogue performs,
for example a stack adjustment, but in this patch we only consider callee-saved register
save/restores as components.
The code is in many ways similar to the powerpc implementation of the hooks.
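As a minimal illustration of the idea (a made-up example, not code from the patch), consider
a function where a call, and therefore any callee-save traffic, happens on only one path:

```c
#include <assert.h>

/* Hypothetical helpers, for illustration only.  */
static int cheap_default (int x) { return x + 1; }
static int expensive (int x) { return x * x; }

/* On the early-return path no value needs to survive a call, so a
   compiler doing separate shrink wrapping can sink the save/restore of
   the callee-saved register holding 'x' into the second path only,
   instead of paying for it unconditionally in the prologue/epilogue.  */
int
foo (int x)
{
  if (x < 0)
    return cheap_default (x);   /* hot path: no callee-save traffic */
  return expensive (x) + x;     /* 'x' must live across the call here */
}
```

With classic whole-prologue shrink wrapping the save could only be skipped if *every*
register was unneeded on the early path; treating each register as its own component makes
the placement per-register and per-block.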

The hooks implemented are:
* TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS: Returns a bitmap with a bit set for each
register that should be considered a 'component', i.e. a register whose save/restore should
be separated from the prologue and epilogue and placed in the basic blocks where it's needed.
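In the patch a register qualifies as a component only if its stack slot is reachable with a
single direct load/store, which for a DImode access is the unsigned scaled 12-bit immediate
form. Outside of GCC, that check (offset_12bit_unsigned_scaled_p in the patch) amounts to
roughly:

```c
#include <stdbool.h>

/* Standalone sketch of the addressing-mode check: LDR/STR (immediate,
   unsigned offset) encodes offsets 0 .. 4095 scaled by the access size,
   so the offset must be non-negative, in range, and a multiple of the
   access size (8 bytes for DImode).  */
static bool
offset_12bit_unsigned_scaled_p (long long offset, int mode_size)
{
  return offset >= 0
	 && offset < 4096LL * mode_size
	 && offset % mode_size == 0;
}
```

Slots that fail this test would need extra address arithmetic at each use site, so the
patch simply leaves those registers to the normal prologue/epilogue.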

* TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB: Determines for a given basic block which 'component'
registers it needs. This is determined through dataflow: if a component register is in the
IN, GEN, or KILL set for the basic block it's considered needed and marked as such in the
bitmap.
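Outside of GCC's df machinery, that per-block decision can be sketched as a simple bitmask
computation (registers as bits; a toy model, not the actual sbitmap code):

```c
#include <stdint.h>

/* A component register is needed in a block if it appears in any of the
   block's IN, GEN, or KILL sets from the liveness problem; everything
   else is masked off by the set of separately wrapped components.  */
static uint64_t
components_for_bb (uint64_t in, uint64_t gen, uint64_t kill,
		   uint64_t separate_components)
{
  return (in | gen | kill) & separate_components;
}
```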

* TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS and TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS:
Given a bitmap of component registers, emit the save or restore code for them.

* TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS: Given a bitmap of component registers, record
in the backend that these registers are shrink-wrapped with this approach so that the normal
prologue and epilogue expansion code does not emit code for them. As on powerpc, this is done
by defining a bool array in machine_function where we record whether each register is
separately shrink-wrapped. The prologue and epilogue expansion code (through
aarch64_save_callee_saves and aarch64_restore_callee_saves) is updated to not emit
saves/restores for registers that are marked in that array.
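A toy sketch of that handshake (the names mirror the patch, but this is standalone
illustration code, not GCC internals):

```c
#include <stdbool.h>

#define NUM_REGS 64

/* Stand-in for the flag array the patch adds to machine_function.  */
static bool reg_is_wrapped_separately[NUM_REGS];

/* The hook: mark every register in COMPONENTS as handled separately.  */
static void
set_handled_components (const bool components[NUM_REGS])
{
  for (int regno = 0; regno < NUM_REGS; regno++)
    if (components[regno])
      reg_is_wrapped_separately[regno] = true;
}

/* Prologue expansion then skips any register flagged above.  */
static bool
prologue_should_save (int regno, bool saved_on_entry)
{
  return saved_on_entry && !reg_is_wrapped_separately[regno];
}
```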

Our prologue and epilogue code has a lot of intricate logic to perform stack adjustments
using the writeback forms of the load/store instructions. Separately shrink-wrapping the
registers marked for writeback (cfun->machine->frame.wb_candidate1 and
cfun->machine->frame.wb_candidate2) broke that codegen: I had to emit an explicit stack
adjustment instruction, which created ugly prologue/epilogue sequences. So this patch is
conservative and doesn't allow shrink-wrapping of the registers marked for writeback.
Maybe in the future we can relax this restriction (for example, allow wrapping one of the
two writeback registers if the writeback amount can be encoded in a single-register
writeback store/load), but given the development stage of GCC I thought I'd play it safe.

I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were some interesting swings.
458.sjeng     +1.45%
471.omnetpp   +2.19%
445.gobmk     -2.01%

On SPECFP:
453.povray    +7.00%

I'll be re-running the benchmarks with Segher's recent patch [1] to see if it fixes the
regression; if it does, I think this can go in.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00889.html

Bootstrapped and tested on aarch64-none-linux-gnu.

Thanks,
Kyrill

2016-11-10  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.h (machine_function): Add
     reg_is_wrapped_separately field.
     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
     rtx_insn *.
     (aarch64_save_callee_saves): Don't save registers that are wrapped
     separately.
     (aarch64_restore_callee_saves): Don't restore registers that are
     wrapped separately.
     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
     (aarch64_get_separate_components): New function.
     (aarch64_components_for_bb): Likewise.
     (aarch64_disqualify_components): Likewise.
     (aarch64_emit_prologue_components): Likewise.
     (aarch64_emit_epilogue_components): Likewise.
     (aarch64_set_handled_components): Likewise.
     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.

[-- Attachment #2: aarch64-sws.patch --]
[-- Type: text/x-patch, Size: 9821 bytes --]

commit 14c7a66d9f3a44ef40499e61ca9643c7dfbc6c82
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..5508333 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	      || regno == cfun->machine->frame.wb_candidate2))
 	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 	{
@@ -3224,6 +3232,169 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  /* Calls to alloca further extend the stack frame and it can be messy to
+     figure out the location of the stack slots for each register.
+     For now be conservative.  */
+  if (cfun->calls_alloca)
+    return NULL;
+
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+	   || bitmap_bit_p (gen, regno)
+	   || bitmap_bit_p (kill, regno)))
+	  bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+	rtx reg = gen_rtx_REG (Pmode, regno);
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+	rtx addr = plus_constant (Pmode, ptr_reg, offset);
+	rtx mem = gen_frame_mem (Pmode, addr);
+
+	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+	rtx reg = gen_rtx_REG (Pmode, regno);
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		     - cfun->machine->frame.hard_fp_offset;
+	rtx addr = plus_constant (Pmode, ptr_reg, offset);
+	rtx mem = gen_frame_mem (Pmode, addr);
+
+	RTX_FRAME_RELATED_P (emit_move_insn (reg, mem)) = 1;
+	add_reg_note (get_last_insn (), REG_CFA_RESTORE, reg);
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3944,29 +4115,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14452,6 +14600,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-10 14:26 [PATCH][AArch64] Separate shrink wrapping hooks implementation Kyrill Tkachov
@ 2016-11-10 16:33 ` Segher Boessenkool
  2016-11-10 16:45   ` Kyrill Tkachov
  2016-11-10 22:42 ` Andrew Pinski
  1 sibling, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-10 16:33 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

Hi!

Great to see this.  Just a few comments...

On Thu, Nov 10, 2016 at 02:25:47PM +0000, Kyrill Tkachov wrote:
> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> +
> +static sbitmap
> +aarch64_get_separate_components (void)
> +{
> +  /* Calls to alloca further extend the stack frame and it can be messy to
> +     figure out the location of the stack slots for each register.
> +     For now be conservative.  */
> +  if (cfun->calls_alloca)
> +    return NULL;

The generic code already disallows functions with alloca (in
try_shrink_wrapping_separate).

> +static void
> +aarch64_emit_prologue_components (sbitmap components)
> +{
> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
> +			     ? HARD_FRAME_POINTER_REGNUM
> +			     : STACK_POINTER_REGNUM);
> +
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> +    if (bitmap_bit_p (components, regno))
> +      {
> +	rtx reg = gen_rtx_REG (Pmode, regno);
> +	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +	if (!frame_pointer_needed)
> +	offset += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +	rtx addr = plus_constant (Pmode, ptr_reg, offset);
> +	rtx mem = gen_frame_mem (Pmode, addr);
> +
> +	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
> +      }
> +}

I think you should emit the CFI notes here directly, just like for the
epilogue components.


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-10 16:33 ` Segher Boessenkool
@ 2016-11-10 16:45   ` Kyrill Tkachov
  0 siblings, 0 replies; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-10 16:45 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

On 10/11/16 16:26, Segher Boessenkool wrote:
> Hi!

Hi,

>
> Great to see this.  Just a few comments...
>
> On Thu, Nov 10, 2016 at 02:25:47PM +0000, Kyrill Tkachov wrote:
>> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
>> +
>> +static sbitmap
>> +aarch64_get_separate_components (void)
>> +{
>> +  /* Calls to alloca further extend the stack frame and it can be messy to
>> +     figure out the location of the stack slots for each register.
>> +     For now be conservative.  */
>> +  if (cfun->calls_alloca)
>> +    return NULL;
> The generic code already disallows functions with alloca (in
> try_shrink_wrapping_separate).

Ok, I'll remove this.

>> +static void
>> +aarch64_emit_prologue_components (sbitmap components)
>> +{
>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>> +			     ? HARD_FRAME_POINTER_REGNUM
>> +			     : STACK_POINTER_REGNUM);
>> +
>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>> +    if (bitmap_bit_p (components, regno))
>> +      {
>> +	rtx reg = gen_rtx_REG (Pmode, regno);
>> +	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>> +	if (!frame_pointer_needed)
>> +	offset += cfun->machine->frame.frame_size
>> +		  - cfun->machine->frame.hard_fp_offset;
>> +	rtx addr = plus_constant (Pmode, ptr_reg, offset);
>> +	rtx mem = gen_frame_mem (Pmode, addr);
>> +
>> +	RTX_FRAME_RELATED_P (emit_move_insn (mem, reg)) = 1;
>> +      }
>> +}
> I think you should emit the CFI notes here directly, just like for the
> epilogue components.

The prologue code in expand_prologue doesn't attach any explicit notes,
so I didn't want to deviate from that. Looking at the powerpc implementation,
would that be a REG_CFA_OFFSET with the (SET (mem) (reg)) expression for saving
the reg?

Thanks,
Kyrill


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-10 14:26 [PATCH][AArch64] Separate shrink wrapping hooks implementation Kyrill Tkachov
  2016-11-10 16:33 ` Segher Boessenkool
@ 2016-11-10 22:42 ` Andrew Pinski
  2016-11-10 23:39   ` Segher Boessenkool
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Pinski @ 2016-11-10 22:42 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh, Segher Boessenkool

On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> [...]
>
> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
> some interesting swings.
> 458.sjeng     +1.45%
> 471.omnetpp   +2.19%
> 445.gobmk     -2.01%
>
> On SPECFP:
> 453.povray    +7.00%


Wow, this looks really good.  Thank you for implementing this.  If I
get some time I am going to try it out on other processors than A72
but I doubt I have time any time soon.

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-10 22:42 ` Andrew Pinski
@ 2016-11-10 23:39   ` Segher Boessenkool
  2016-11-11 10:18     ` Kyrill Tkachov
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-10 23:39 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh

On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
> > I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
> > some interesting swings.
> > 458.sjeng     +1.45%
> > 471.omnetpp   +2.19%
> > 445.gobmk     -2.01%
> >
> > On SPECFP:
> > 453.povray    +7.00%
> 
> 
> Wow, this looks really good.  Thank you for implementing this.  If I
> get some time I am going to try it out on other processors than A72
> but I doubt I have time any time soon.

I'd love to hear what causes the slowdown for gobmk as well, btw.


Segher

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-10 23:39   ` Segher Boessenkool
@ 2016-11-11 10:18     ` Kyrill Tkachov
  2016-11-11 15:31       ` Kyrill Tkachov
  0 siblings, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-11 10:18 UTC (permalink / raw)
  To: Segher Boessenkool, Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh


On 10/11/16 23:39, Segher Boessenkool wrote:
> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>> some interesting swings.
>>> 458.sjeng     +1.45%
>>> 471.omnetpp   +2.19%
>>> 445.gobmk     -2.01%
>>>
>>> On SPECFP:
>>> 453.povray    +7.00%
>>
>> Wow, this looks really good.  Thank you for implementing this.  If I
>> get some time I am going to try it out on other processors than A72
>> but I doubt I have time any time soon.
> I'd love to hear what causes the slowdown for gobmk as well, btw.

I haven't yet gotten a direct answer for that (through performance analysis tools)
but I have noticed that load/store pairs are not generated as aggressively as I hoped.
They are being merged by the sched fusion pass and the peepholes (which run after this)
but those still miss cases. I've hacked the SWS hooks to generate pairs explicitly, which
increases the number of pairs and helps code size to boot. It complicates the logic of
the hooks a bit, but not too much.
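For what it's worth, the pairing idea can be sketched as a greedy walk over the component
registers' slot offsets (a standalone toy, not the actual hook changes):

```c
#include <stddef.h>

#define UNITS_PER_WORD 8

/* Walk the sorted slot offsets of the component registers and fuse two
   consecutive registers into one LDP/STP whenever their slots are
   adjacent.  Returns the number of load/store instructions emitted.  */
static int
count_insns_with_pairing (const long long offsets[], size_t n)
{
  int insns = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (i + 1 < n && offsets[i + 1] == offsets[i] + UNITS_PER_WORD)
	i++;			/* second register rides in the pair */
      insns++;
    }
  return insns;
}
```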

I'll make those changes and re-benchmark, hopefully that
will help performance.

Thanks,
Kyrill


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-11 10:18     ` Kyrill Tkachov
@ 2016-11-11 15:31       ` Kyrill Tkachov
  2016-11-14 14:25         ` Kyrill Tkachov
  0 siblings, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-11 15:31 UTC (permalink / raw)
  To: Segher Boessenkool, Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 1665 bytes --]


On 11/11/16 10:17, Kyrill Tkachov wrote:
>
> On 10/11/16 23:39, Segher Boessenkool wrote:
>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>> some interesting swings.
>>>> 458.sjeng     +1.45%
>>>> 471.omnetpp   +2.19%
>>>> 445.gobmk     -2.01%
>>>>
>>>> On SPECFP:
>>>> 453.povray    +7.00%
>>>
>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>> get some time I am going to try it out on other processors than A72
>>> but I doubt I have time any time soon.
>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>
> I haven't yet gotten a direct answer for that (through performance analysis tools)
> but I have noticed that load/store pairs are not generated as aggressively as I hoped.
> They are being merged by the sched fusion pass and peepholes (which runs after this)
> but it still misses cases. I've hacked the SWS hooks to generate pairs explicitly and that
> increases the number of pairs and helps code size to boot. It complicates the logic of
> the hooks a bit but not too much.
>
> I'll make those changes and re-benchmark, hopefully that
> will help performance.
>

And here's a version that explicitly emits pairs. I've looked at the assembly codegen on
SPEC2006 and it generates quite a few more LDP/STP pairs than the original version.
I kicked off benchmarks over the weekend to see the effect.
Andrew, if you want to try it out (more benchmarking and testing always welcome), this is
the one to try.

Thanks,
Kyrill





[-- Attachment #2: aarch64-sws.patch --]
[-- Type: text/x-patch, Size: 13471 bytes --]

commit bedb71d6f6f772eed33ba35e93cc4104326675da
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..15b5bdf 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	      || regno == cfun->machine->frame.wb_candidate2))
 	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 	{
@@ -3224,6 +3232,273 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+	   || bitmap_bit_p (gen, regno)
+	   || bitmap_bit_p (kill, regno)))
+	  bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Return the next set bit in BMP from START onwards.  Return the total number
+   of bits in BMP if no set bit is found at or after START.  */
+
+static unsigned int
+aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
+{
+  unsigned int nbits = SBITMAP_SIZE (bmp);
+  if (start == nbits)
+    return start;
+
+  gcc_assert (start < nbits);
+  for (unsigned int i = start; i < nbits; i++)
+    if (bitmap_bit_p (bmp, i))
+      return i;
+
+  return nbits;
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      rtx set = gen_rtx_SET (mem, reg);
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to save after REGNO or the memory slot of the
+	 register is not suitable for a store pair.
+	 Emit a single save and exit.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || regno2 == total_bits)
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair.  */
+      if (GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+		!= GET_MODE_SIZE (DImode))
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be stored in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+      rtx set2 = gen_rtx_SET (mem2, reg2);
+
+      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, set);
+      add_reg_note (insn, REG_CFA_OFFSET, set2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to restore or the memory location is not suitable
+	 for a load pair.  Emit a single restore and exit.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || regno2 == total_bits)
+	{
+	  insn = emit_move_insn (reg, mem);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair.  Emit a single restore
+	 and continue from REGNO2.  */
+      if (GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+		!= GET_MODE_SIZE (DImode))
+	{
+	  insn = emit_move_insn (reg, mem);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be loaded in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+
+      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+      add_reg_note (insn, REG_CFA_RESTORE, reg2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3944,29 +4219,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14452,6 +14704,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif
 


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-11 15:31       ` Kyrill Tkachov
@ 2016-11-14 14:25         ` Kyrill Tkachov
  2016-11-17 14:22           ` Kyrill Tkachov
  2016-11-29 10:58           ` James Greenhalgh
  0 siblings, 2 replies; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-14 14:25 UTC (permalink / raw)
  To: Segher Boessenkool, Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 3282 bytes --]


On 11/11/16 15:31, Kyrill Tkachov wrote:
>
> On 11/11/16 10:17, Kyrill Tkachov wrote:
>>
>> On 10/11/16 23:39, Segher Boessenkool wrote:
>>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>>> some interesting swings.
>>>>> 458.sjeng     +1.45%
>>>>> 471.omnetpp   +2.19%
>>>>> 445.gobmk     -2.01%
>>>>>
>>>>> On SPECFP:
>>>>> 453.povray    +7.00%
>>>>
>>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>>> get some time I am going to try it out on other processors than A72
>>>> but I doubt I have time any time soon.
>>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>>
>> I haven't yet gotten a direct answer for that (through performance analysis tools)
>> but I have noticed that load/store pairs are not generated as aggressively as I hoped.
>> They are being merged by the sched fusion pass and peepholes (which runs after this)
>> but it still misses cases. I've hacked the SWS hooks to generate pairs explicitly and that
>> increases the number of pairs and helps code size to boot. It complicates the logic of
>> the hooks a bit but not too much.
>>
>> I'll make those changes and re-benchmark, hopefully that
>> will help performance.
>>
>
> And here's a version that explicitly emits pairs. I've looked at assembly codegen on SPEC2006
> and it generates quite a few more LDP/STP pairs than the original version.
> I kicked off benchmarks over the weekend to see the effect.
> Andrew, if you want to try it out (more benchmarking and testing always welcome) this is the
> one to try.
>

And I discovered over the weekend that gamess and wrf have validation errors.
This version runs correctly.
SPECINT results were fine though and there is even a small overall gain due to
sjeng and omnetpp. However, gobmk still has the regression.
I'll rerun SPECFP with this patch (it's really just a small bugfix over the previous version)
and get on with analysing gobmk.

Thanks,
Kyrill

2016-11-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.h (machine_function): Add
     reg_is_wrapped_separately field.
     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
     rtx_insn *.
     (aarch64_save_callee_saves): Don't save registers that are wrapped
     separately.
     (aarch64_restore_callee_saves): Don't restore registers that are
     wrapped separately.
     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
     (aarch64_get_separate_components): New function.
     (aarch64_get_next_set_bit): Likewise.
     (aarch64_components_for_bb): Likewise.
     (aarch64_disqualify_components): Likewise.
     (aarch64_emit_prologue_components): Likewise.
     (aarch64_emit_epilogue_components): Likewise.
     (aarch64_set_handled_components): Likewise.
     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.


[-- Attachment #2: aarch64-sws.patch --]
[-- Type: text/x-patch, Size: 13398 bytes --]

commit 06ac3c30d8aa38781ee9019e60a5fcf727b85231
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..2d33ef6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	      || regno == cfun->machine->frame.wb_candidate2))
 	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 	{
@@ -3224,6 +3232,272 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+	   || bitmap_bit_p (gen, regno)
+	   || bitmap_bit_p (kill, regno)))
+	  bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Return the next set bit in BMP from START onwards.  Return the total number
+   of bits in BMP if no set bit is found at or after START.  */
+
+static unsigned int
+aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
+{
+  unsigned int nbits = SBITMAP_SIZE (bmp);
+  if (start == nbits)
+    return start;
+
+  gcc_assert (start < nbits);
+  for (unsigned int i = start; i < nbits; i++)
+    if (bitmap_bit_p (bmp, i))
+      return i;
+
+  return nbits;
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      rtx set = gen_rtx_SET (mem, reg);
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to save after REGNO.
+	 Emit a single save and exit.  */
+      if (regno2 == total_bits)
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+		!= GET_MODE_SIZE (DImode))
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be stored in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+      rtx set2 = gen_rtx_SET (mem2, reg2);
+
+      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, set);
+      add_reg_note (insn, REG_CFA_OFFSET, set2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers after REGNO to restore.
+	 Emit a single restore and exit.  */
+      if (regno2 == total_bits)
+	{
+	  insn = emit_move_insn (reg, mem);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair or the offset doesn't fit
+	 for a load pair.  Emit a single restore and continue from REGNO2.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+		!= GET_MODE_SIZE (DImode))
+	{
+	  insn = emit_move_insn (reg, mem);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be loaded in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+
+      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+      add_reg_note (insn, REG_CFA_RESTORE, reg2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif
 


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-14 14:25         ` Kyrill Tkachov
@ 2016-11-17 14:22           ` Kyrill Tkachov
  2016-11-17 14:44             ` Segher Boessenkool
  2016-11-29 10:58           ` James Greenhalgh
  1 sibling, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-17 14:22 UTC (permalink / raw)
  To: Segher Boessenkool, Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh


On 14/11/16 14:25, Kyrill Tkachov wrote:
>
> On 11/11/16 15:31, Kyrill Tkachov wrote:
>>
>> On 11/11/16 10:17, Kyrill Tkachov wrote:
>>>
>>> On 10/11/16 23:39, Segher Boessenkool wrote:
>>>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>>>> some interesting swings.
>>>>>> 458.sjeng     +1.45%
>>>>>> 471.omnetpp   +2.19%
>>>>>> 445.gobmk     -2.01%
>>>>>>
>>>>>> On SPECFP:
>>>>>> 453.povray    +7.00%
>>>>>
>>>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>>>> get some time I am going to try it out on other processors than A72
>>>>> but I doubt I have time any time soon.
>>>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>>>
>>> I haven't yet gotten a direct answer for that (through performance analysis tools)
>>> but I have noticed that load/store pairs are not generated as aggressively as I hoped.
>>> They are being merged by the sched fusion pass and peepholes (which runs after this)
>>> but it still misses cases. I've hacked the SWS hooks to generate pairs explicitly and that
>>> increases the number of pairs and helps code size to boot. It complicates the logic of
>>> the hooks a bit but not too much.
>>>
>>> I'll make those changes and re-benchmark, hopefully that
>>> will help performance.
>>>
>>
>> And here's a version that explicitly emits pairs. I've looked at assembly codegen on SPEC2006
>> and it generates quite a few more LDP/STP pairs than the original version.
>> I kicked off benchmarks over the weekend to see the effect.
>> Andrew, if you want to try it out (more benchmarking and testing always welcome) this is the
>> one to try.
>>
>
> And I discovered over the weekend that gamess and wrf have validation errors.
> This version runs correctly.
> SPECINT results were fine though and there is even a small overall gain due to
> sjeng and omnetpp. However, gobmk still has the regression.
> I'll rerun SPECFP with this patch (it's really just a small bugfix over the previous version)
> and get on with analysing gobmk.
>

After looking at gobmk with performance counters, the slowdown looks like increased icache pressure:
I see a rise in instruction-cache misses.
This looks to me like an effect of the code size increase, though that increase is not large (0.4% with SWS).
Branch mispredicts also go up a bit, but not as much as the icache misses.
I don't think there's anything we can do here, or at least nothing this patch can do about it.
Overall there's a slight improvement in SPECINT even with the gobmk regression, and a slightly larger improvement
on SPECFP due to povray.

Segher, one curious artifact I spotted while looking at codegen differences in gobmk was a case where we fail
to emit load-pairs as effectively in the epilogue and its preceding basic block.
So before we had this epilogue:
.L43:
     ldp    x21, x22, [sp, 16]
     ldp    x23, x24, [sp, 32]
     ldp    x25, x26, [sp, 48]
     ldp    x27, x28, [sp, 64]
     ldr    x30, [sp, 80]
     ldp    x19, x20, [sp], 112
     ret

and I see this becoming (among numerous other changes in the function):

.L69:
     ldp    x21, x22, [sp, 16]
     ldr    x24, [sp, 40]
.L43:
     ldp    x25, x26, [sp, 48]
     ldp    x27, x28, [sp, 64]
     ldr    x23, [sp, 32]
     ldr    x30, [sp, 80]
     ldp    x19, x20, [sp], 112
     ret

So this is better in the cases where we jump straight into .L43 because we load fewer registers
but worse when we jump or fall through to .L69 because x23 and x24 are now restored using two loads
rather than a single load-pair. This hunk isn't critical to performance in gobmk though.

Given that there is an overall gain, is this ok for trunk?
https://gcc.gnu.org/ml/gcc-patches/2016-11/msg01352.html

Thanks,
Kyrill

>
> 2016-11-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.h (machine_function): Add
>     reg_is_wrapped_separately field.
>     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
>     rtx_insn *.
>     (aarch64_save_callee_saves): Don't save registers that are wrapped
>     separately.
>     (aarch64_restore_callee_saves): Don't restore registers that are
>     wrapped separately.
>     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
>     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
>     (aarch64_get_separate_components): New function.
>     (aarch64_get_next_set_bit): Likewise.
>     (aarch64_components_for_bb): Likewise.
>     (aarch64_disqualify_components): Likewise.
>     (aarch64_emit_prologue_components): Likewise.
>     (aarch64_emit_epilogue_components): Likewise.
>     (aarch64_set_handled_components): Likewise.
>     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
>     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
>     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 14:22           ` Kyrill Tkachov
@ 2016-11-17 14:44             ` Segher Boessenkool
  2016-11-17 14:55               ` Kyrill Tkachov
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-17 14:44 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh

Hi Kyrill,

On Thu, Nov 17, 2016 at 02:22:08PM +0000, Kyrill Tkachov wrote:
> >>>>>>I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there 
> >>>>>>were
> >>>>>>some interesting swings.
> >>>>>>458.sjeng     +1.45%
> >>>>>>471.omnetpp   +2.19%
> >>>>>>445.gobmk     -2.01%
> >>>>>>
> >>>>>>On SPECFP:
> >>>>>>453.povray    +7.00%

> After looking at the gobmk performance with performance counters it looks 
> like more icache pressure.
> I see an increase in misses.
> This looks to me like an effect of code size increase, though it is not 
> that large an increase (0.4% with SWS).

Right.  I don't see how to improve on this (but see below); ideas welcome :-)

> Branch mispredicts also go up a bit but not as much as icache misses.

I don't see that happening -- for some testcases we get unlucky and have
more branch predictor aliasing, and for some we have less, it's pretty
random.  Some testcases are really sensitive to this.

> I don't think there's anything we can do here, or at least that this patch 
> can do about it.
> Overall, there's a slight improvement in SPECINT, even with the gobmk 
> regression and a slightly larger improvement
> on SPECFP due to povray.

And that is for only the "normal" GPRs, not LR or FP yet, right?

> Segher, one curious artifact I spotted while looking at codegen differences 
> in gobmk was a case where we fail
> to emit load-pairs as effectively in the epilogue and its preceding basic 
> block.
> So before we had this epilogue:
> .L43:
>     ldp    x21, x22, [sp, 16]
>     ldp    x23, x24, [sp, 32]
>     ldp    x25, x26, [sp, 48]
>     ldp    x27, x28, [sp, 64]
>     ldr    x30, [sp, 80]
>     ldp    x19, x20, [sp], 112
>     ret
> 
> and I see this becoming (among numerous other changes in the function):
> 
> .L69:
>     ldp    x21, x22, [sp, 16]
>     ldr    x24, [sp, 40]
> .L43:
>     ldp    x25, x26, [sp, 48]
>     ldp    x27, x28, [sp, 64]
>     ldr    x23, [sp, 32]
>     ldr    x30, [sp, 80]
>     ldp    x19, x20, [sp], 112
>     ret
> 
> So this is better in the cases where we jump straight into .L43 because we 
> load fewer registers
> but worse when we jump to or fallthrough to .L69 because x23 and x24 are 
> now restored using two loads
> rather than a single load-pair. This hunk isn't critical to performance in 
> gobmk though.

Is loading/storing a pair as cheap as loading/storing a single register?
In that case you could shrink-wrap per pair of registers instead.


Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 14:44             ` Segher Boessenkool
@ 2016-11-17 14:55               ` Kyrill Tkachov
  2016-11-17 15:02                 ` Segher Boessenkool
  2016-11-17 16:50                 ` Kyrill Tkachov
  0 siblings, 2 replies; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-17 14:55 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh


On 17/11/16 14:44, Segher Boessenkool wrote:
> Hi Kyrill,
>
> On Thu, Nov 17, 2016 at 02:22:08PM +0000, Kyrill Tkachov wrote:
>>>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there
>>>>>>>> were
>>>>>>>> some interesting swings.
>>>>>>>> 458.sjeng     +1.45%
>>>>>>>> 471.omnetpp   +2.19%
>>>>>>>> 445.gobmk     -2.01%
>>>>>>>>
>>>>>>>> On SPECFP:
>>>>>>>> 453.povray    +7.00%
>> After looking at the gobmk performance with performance counters it looks
>> like more icache pressure.
>> I see an increase in misses.
>> This looks to me like an effect of code size increase, though it is not
>> that large an increase (0.4% with SWS).
> Right.  I don't see how to improve on this (but see below); ideas welcome :-)
>
>> Branch mispredicts also go up a bit but not as much as icache misses.
> I don't see that happening -- for some testcases we get unlucky and have
> more branch predictor aliasing, and for some we have less, it's pretty
> random.  Some testcases are really sensitive to this.

Right, I don't think it's the branch prediction at fault in this case,
rather the above icache stuff.

>
>> I don't think there's anything we can do here, or at least that this patch
>> can do about it.
>> Overall, there's a slight improvement in SPECINT, even with the gobmk
>> regression and a slightly larger improvement
>> on SPECFP due to povray.
> And that is for only the "normal" GPRs, not LR or FP yet, right?

This patch does implement FP registers wrapping as well but not LR.
Though I remember seeing the improvement even when only GPRs were wrapped
in an earlier version of the patch.

>> Segher, one curious artifact I spotted while looking at codegen differences
>> in gobmk was a case where we fail
>> to emit load-pairs as effectively in the epilogue and its preceeding basic
>> block.
>> So before we had this epilogue:
>> .L43:
>>      ldp    x21, x22, [sp, 16]
>>      ldp    x23, x24, [sp, 32]
>>      ldp    x25, x26, [sp, 48]
>>      ldp    x27, x28, [sp, 64]
>>      ldr    x30, [sp, 80]
>>      ldp    x19, x20, [sp], 112
>>      ret
>>
>> and I see this becoming (among numerous other changes in the function):
>>
>> .L69:
>>      ldp    x21, x22, [sp, 16]
>>      ldr    x24, [sp, 40]
>> .L43:
>>      ldp    x25, x26, [sp, 48]
>>      ldp    x27, x28, [sp, 64]
>>      ldr    x23, [sp, 32]
>>      ldr    x30, [sp, 80]
>>      ldp    x19, x20, [sp], 112
>>      ret
>>
>> So this is better in the cases where we jump straight into .L43 because we
>> load fewer registers
>> but worse when we jump to or fallthrough to .L69 because x23 and x24 are
>> now restored using two loads
>> rather than a single load-pair. This hunk isn't critical to performance in
>> gobmk though.
> Is loading/storing a pair as cheap as loading/storing a single register?
> In that case you could shrink-wrap per pair of registers instead.

I suppose it can vary by microarchitecture. For the purposes of codegen I'd say
it's more expensive than load/storing a single register (as there's more memory bandwidth required after all)
but cheaper than two separate loads/stores (alignment quirks notwithstanding).
Interesting idea. That could help with code size too. I'll try it out.
Thanks,
Kyrill

>
> Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 14:55               ` Kyrill Tkachov
@ 2016-11-17 15:02                 ` Segher Boessenkool
  2016-11-17 15:06                   ` Kyrill Tkachov
  2016-11-17 16:50                 ` Kyrill Tkachov
  1 sibling, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-17 15:02 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh

On Thu, Nov 17, 2016 at 02:55:20PM +0000, Kyrill Tkachov wrote:
> >>Overall, there's a slight improvement in SPECINT, even with the gobmk
> >>regression and a slightly larger improvement
> >>on SPECFP due to povray.
> >And that is for only the "normal" GPRs, not LR or FP yet, right?
> 
> This patch does implement FP registers wrapping as well but not LR.

I meant frame pointer, not floating point :-)


Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 15:02                 ` Segher Boessenkool
@ 2016-11-17 15:06                   ` Kyrill Tkachov
  0 siblings, 0 replies; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-17 15:06 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh


On 17/11/16 15:01, Segher Boessenkool wrote:
> On Thu, Nov 17, 2016 at 02:55:20PM +0000, Kyrill Tkachov wrote:
>>>> Overall, there's a slight improvement in SPECINT, even with the gobmk
>>>> regression and a slightly larger improvement
>>>> on SPECFP due to povray.
>>> And that is for only the "normal" GPRs, not LR or FP yet, right?
>> This patch does implement FP registers wrapping as well but not LR.
> I meant frame pointer, not floating point :-)

Ah :) HARD_FRAME_POINTER is wrapped with -fomit-frame-pointer but we don't
touch the stack pointer (SP) in any case.

Kyrill

>
> Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 14:55               ` Kyrill Tkachov
  2016-11-17 15:02                 ` Segher Boessenkool
@ 2016-11-17 16:50                 ` Kyrill Tkachov
  2016-11-17 17:46                   ` Segher Boessenkool
  1 sibling, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-17 16:50 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh


On 17/11/16 14:55, Kyrill Tkachov wrote:
>
> On 17/11/16 14:44, Segher Boessenkool wrote:
>> Hi Kyrill,
>>
>> On Thu, Nov 17, 2016 at 02:22:08PM +0000, Kyrill Tkachov wrote:
>>>>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there
>>>>>>>>> were
>>>>>>>>> some interesting swings.
>>>>>>>>> 458.sjeng     +1.45%
>>>>>>>>> 471.omnetpp   +2.19%
>>>>>>>>> 445.gobmk     -2.01%
>>>>>>>>>
>>>>>>>>> On SPECFP:
>>>>>>>>> 453.povray    +7.00%
>>> After looking at the gobmk performance with performance counters it looks
>>> like more icache pressure.
>>> I see an increase in misses.
>>> This looks to me like an effect of code size increase, though it is not
>>> that large an increase (0.4% with SWS).
>> Right.  I don't see how to improve on this (but see below); ideas welcome :-)
>>
>>> Branch mispredicts also go up a bit but not as much as icache misses.
>> I don't see that happening -- for some testcases we get unlucky and have
>> more branch predictor aliasing, and for some we have less, it's pretty
>> random.  Some testcases are really sensitive to this.
>
> Right, I don't think it's the branch prediction at fault in this case,
> rather the above icache stuff.
>
>>
>>> I don't think there's anything we can do here, or at least that this patch
>>> can do about it.
>>> Overall, there's a slight improvement in SPECINT, even with the gobmk
>>> regression and a slightly larger improvement
>>> on SPECFP due to povray.
>> And that is for only the "normal" GPRs, not LR or FP yet, right?
>
> This patch does implement FP registers wrapping as well but not LR.
> Though I remember seeing the improvement even when only GPRs were wrapped
> in an earlier version of the patch.
>
>>> Segher, one curious artifact I spotted while looking at codegen differences
>>> in gobmk was a case where we fail
>>> to emit load-pairs as effectively in the epilogue and its preceding basic
>>> block.
>>> So before we had this epilogue:
>>> .L43:
>>>      ldp    x21, x22, [sp, 16]
>>>      ldp    x23, x24, [sp, 32]
>>>      ldp    x25, x26, [sp, 48]
>>>      ldp    x27, x28, [sp, 64]
>>>      ldr    x30, [sp, 80]
>>>      ldp    x19, x20, [sp], 112
>>>      ret
>>>
>>> and I see this becoming (among numerous other changes in the function):
>>>
>>> .L69:
>>>      ldp    x21, x22, [sp, 16]
>>>      ldr    x24, [sp, 40]
>>> .L43:
>>>      ldp    x25, x26, [sp, 48]
>>>      ldp    x27, x28, [sp, 64]
>>>      ldr    x23, [sp, 32]
>>>      ldr    x30, [sp, 80]
>>>      ldp    x19, x20, [sp], 112
>>>      ret
>>>
>>> So this is better in the cases where we jump straight into .L43 because we
>>> load fewer registers
>>> but worse when we jump to or fallthrough to .L69 because x23 and x24 are
>>> now restored using two loads
>>> rather than a single load-pair. This hunk isn't critical to performance in
>>> gobmk though.
>> Is loading/storing a pair as cheap as loading/storing a single register?
>> In that case you could shrink-wrap per pair of registers instead.
>
> I suppose it can vary by microarchitecture. For the purposes of codegen I'd say
> it's more expensive than load/storing a single register (as there's more memory bandwidth required after all)
> but cheaper than two separate loads/stores (alignment quirks notwithstanding).
> Interesting idea. That could help with code size too. I'll try it out.

I'm encountering some difficulties implementing this idea.
I want to still keep the per-register structures across the hooks but basically restrict the number
of components in a basic block to an even number of FPRs and GPRs. I tried doing this in COMPONENTS_FOR_BB
but apparently this ended up not saving/restoring some of the registers at all because the components that were
"filtered out" that way still made their way to the bitmap passed into SET_HANDLED_COMPONENTS and so the normal
prologue/epilogue didn't end up saving and restoring them.

I don't want to do it in GET_SEPARATE_COMPONENTS as that doesn't see each basic block.
Is this something DISQUALIFY_COMPONENTS could be used for?

Thanks,
Kyrill

> Thanks,
> Kyrill
>
>>
>> Segher
>


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 16:50                 ` Kyrill Tkachov
@ 2016-11-17 17:46                   ` Segher Boessenkool
  2016-11-18  9:29                     ` Kyrill Tkachov
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-17 17:46 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh

On Thu, Nov 17, 2016 at 04:50:46PM +0000, Kyrill Tkachov wrote:
> >>Is loading/storing a pair as cheap as loading/storing a single register?
> >>In that case you could shrink-wrap per pair of registers instead.
> >
> >I suppose it can vary by microarchitecture. For the purposes of codegen 
> >I'd say
> >it's more expensive than load/storing a single register (as there's more 
> >memory bandwidth required after all)
> >but cheaper than two separate loads/stores (alignment quirks 
> >notwithstanding).
> >Interesting idea. That could help with code size too. I'll try it out.
> 
> I'm encountering some difficulties implementing this idea.
> I want to still keep the per-register structures across the hooks but 
> basically restrict the number
> of components in a basic block to an even number of FPRs and GPRs. I tried 
> doing this in COMPONENTS_FOR_BB

So your COMPONENTS_FOR_BB returns both components in a pair whenever one
of those is needed?  That should work afaics.

> but apparently this ended up not saving/restoring some of the registers at 
> all because the components that were
> "filtered out" that way still made their way to the bitmap passed into 
> SET_HANDLED_COMPONENTS and so the normal
> prologue/epilogue didn't end up saving and restoring them.

I am not sure what this means?  "filtered out"?


Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-17 17:46                   ` Segher Boessenkool
@ 2016-11-18  9:29                     ` Kyrill Tkachov
  2016-11-18 12:50                       ` Segher Boessenkool
  0 siblings, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-18  9:29 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh


On 17/11/16 17:45, Segher Boessenkool wrote:
> On Thu, Nov 17, 2016 at 04:50:46PM +0000, Kyrill Tkachov wrote:
>>>> Is loading/storing a pair as cheap as loading/storing a single register?
>>>> In that case you could shrink-wrap per pair of registers instead.
>>> I suppose it can vary by microarchitecture. For the purposes of codegen
>>> I'd say
>>> it's more expensive than load/storing a single register (as there's more
>>> memory bandwidth required after all)
>>> but cheaper than two separate loads/stores (alignment quirks
>>> notwithstanding).
>>> Interesting idea. That could help with code size too. I'll try it out.
>> I'm encountering some difficulties implementing this idea.
>> I want to still keep the per-register structures across the hooks but
>> basically restrict the number
>> of components in a basic block to an even number of FPRs and GPRs. I tried
>> doing this in COMPONENTS_FOR_BB
> So your COMPONENTS_FOR_BB returns both components in a pair whenever one
> of those is needed?  That should work afaics.
>

I mean I still want to have one component per register and since
emit_{prologue,epilogue}_components knows how to form pairs from the
components passed down to it I just need to restrict the number of
components in any particular basic block to an even number.
So say a function can wrap 5 registers: x22,x23,x24,x25,x26.
I want get_separate_components to return 5 components since in that hook
we don't know how these registers are distributed across each basic block.
components_for_bb has that information.
In components_for_bb I want to restrict the components for a basic block to
an even number, so if normally all 5 registers would be valid for wrapping
in that bb I'd only choose 4 so I could form 2 pairs. But selecting only 4
of the 5 registers, say only x22,x23,x24,x25 leads to x26 not being saved
or restored at all, even during the normal prologue and epilogue because
x26 was marked as a component in components_for_bb and therefore omitted from
the prologue and epilogue.
So I'm thinking x26 should be removed from the wrappable components of
a basic block by disqualify_components. I'm trying that approach now.
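To make the filtering idea concrete, here is a minimal standalone sketch, with plain uint64_t bitmasks standing in for GCC's sbitmap and trim_to_even as a hypothetical helper name: trim a basic block's wrappable set to an even count and report the dropped registers, which would then go through the disqualify_components hook so the normal prologue/epilogue still handles them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not GCC API: components modelled as bits in a
   mask.  Trim a basic block's wrappable set to an even number of
   registers so the emit hooks can always form stp/ldp pairs.  The
   registers written to *dropped would be fed to the
   disqualify_components hook so the standard prologue/epilogue still
   saves and restores them.  */
static uint64_t
trim_to_even (uint64_t bb_components, uint64_t *dropped)
{
  *dropped = 0;
  if (__builtin_popcountll (bb_components) % 2 != 0)
    {
      /* Drop the highest-numbered register to make the count even.  */
      int high = 63 - __builtin_clzll (bb_components);
      *dropped = 1ULL << high;
      bb_components &= ~*dropped;
    }
  return bb_components;
}
```

With x22-x26 (bits 22-26) wrappable in a block, this keeps x22-x25 for two pairs and reports x26 as dropped.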

Thanks,
Kyrill

>> but apparently this ended up not saving/restoring some of the registers at
>> all because the components that were
>> "filtered out" that way still made their way to the bitmap passed into
>> SET_HANDLED_COMPONENTS and so the normal
>> prologue/epilogue didn't end up saving and restoring them.
> I am not sure what this means?  "filtered out"?
>
>
> Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-18  9:29                     ` Kyrill Tkachov
@ 2016-11-18 12:50                       ` Segher Boessenkool
  2016-11-22 16:47                         ` Kyrill Tkachov
  0 siblings, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-18 12:50 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh

On Fri, Nov 18, 2016 at 09:29:13AM +0000, Kyrill Tkachov wrote:
> >So your COMPONENTS_FOR_BB returns both components in a pair whenever one
> >of those is needed?  That should work afaics.
> 
> I mean I still want to have one component per register and since
> emit_{prologue,epilogue}_components knows how to form pairs from the
> components passed down to it I just need to restrict the number of
> components in any particular basic block to an even number.
> So say a function can wrap 5 registers: x22,x23,x24,x25,x26.
> I want get_separate_components to return 5 components since in that hook
> we don't know how these registers are distributed across each basic block.
> components_for_bb has that information.
> In components_for_bb I want to restrict the components for a basic block to
> an even number, so if normally all 5 registers would be valid for wrapping
> in that bb I'd only choose 4 so I could form 2 pairs. But selecting only 4
> of the 5 registers, say only x22,x23,x24,x25 leads to x26 not being saved
> or restored at all, even during the normal prologue and epilogue because
> x26 was marked as a component in components_for_bb and therefore omitted 
> from
> the prologue and epilogue.
> So I'm thinking x26 should be removed from the wrappable components of
> a basic block by disqualify_components. I'm trying that approach now.

My suggestion was, in components_for_bb, whenever you mark x22 as needed
you also mark x23 as needed, and whenever you mark x23 as needed you also
mark x22.  I think this is a lot simpler?
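That scheme can be sketched in isolation like this, as a hedged model using a plain bitmask rather than GCC's sbitmap, and assuming for simplicity that pairs are formed from registers 2k and 2k+1 (the real aarch64 pairing follows frame-slot adjacency):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model, not GCC code: close a basic block's needed-register
   set over pairs, so that whenever one member of a pair is marked its
   partner is marked too and the pair can be saved/restored with a
   single stp/ldp.  Assumes pair (2k, 2k+1); the real pairing depends
   on frame-slot adjacency.  */
static uint64_t
close_over_pairs (uint64_t needed)
{
  uint64_t even_bits = needed & 0x5555555555555555ULL; /* regs 0,2,4,...  */
  uint64_t odd_bits  = needed & 0xAAAAAAAAAAAAAAAAULL; /* regs 1,3,5,...  */
  /* Marking one member of a pair marks the other.  */
  return needed | (even_bits << 1) | (odd_bits >> 1);
}
```

The trade-off is that the partner register gets saved/restored even in blocks that never touch it.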


Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-18 12:50                       ` Segher Boessenkool
@ 2016-11-22 16:47                         ` Kyrill Tkachov
  0 siblings, 0 replies; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-22 16:47 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh


On 18/11/16 12:50, Segher Boessenkool wrote:
> On Fri, Nov 18, 2016 at 09:29:13AM +0000, Kyrill Tkachov wrote:
>>> So your COMPONENTS_FOR_BB returns both components in a pair whenever one
>>> of those is needed?  That should work afaics.
>> I mean I still want to have one component per register and since
>> emit_{prologue,epilogue}_components knows how to form pairs from the
>> components passed down to it I just need to restrict the number of
>> components in any particular basic block to an even number.
>> So say a function can wrap 5 registers: x22,x23,x24,x25,x26.
>> I want get_separate_components to return 5 components since in that hook
>> we don't know how these registers are distributed across each basic block.
>> components_for_bb has that information.
>> In components_for_bb I want to restrict the components for a basic block to
>> an even number, so if normally all 5 registers would be valid for wrapping
>> in that bb I'd only choose 4 so I could form 2 pairs. But selecting only 4
>> of the 5 registers, say only x22,x23,x24,x25 leads to x26 not being saved
>> or restored at all, even during the normal prologue and epilogue because
>> x26 was marked as a component in components_for_bb and therefore omitted
>> from
>> the prologue and epilogue.
>> So I'm thinking x26 should be removed from the wrappable components of
>> a basic block by disqualify_components. I'm trying that approach now.
> My suggestion was, in components_for_bb, whenever you mark x22 as needed
> you also mark x23 as needed, and whenever you mark x23 as needed you also
> mark x22.  I think this is a lot simpler?

But then we'd have cases where we're saving and restoring x23
even when it's not necessary.
In any case, I tried it out and it didn't fix the gobmk issue, though it did reduce the code
size increase somewhat.

With the patch already posted at [1] the net result is still positive on
both SPECINT and SPECFP.

I also ran the numbers on a Cortex-A57. The changes are less pronounced
with SPECINT being neutral (gobmk shows only a 0.8% regression) and SPECFP
having a small improvement, due to povray improving by 2.9%.

Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg01352.html

>
> Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-14 14:25         ` Kyrill Tkachov
  2016-11-17 14:22           ` Kyrill Tkachov
@ 2016-11-29 10:58           ` James Greenhalgh
  2016-11-29 11:18             ` Kyrill Tkachov
  2016-11-29 20:29             ` Segher Boessenkool
  1 sibling, 2 replies; 25+ messages in thread
From: James Greenhalgh @ 2016-11-29 10:58 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Segher Boessenkool, Andrew Pinski, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw, nd

On Mon, Nov 14, 2016 at 02:25:28PM +0000, Kyrill Tkachov wrote:
> 
> On 11/11/16 15:31, Kyrill Tkachov wrote:
> >
> >On 11/11/16 10:17, Kyrill Tkachov wrote:
> >>
> >>On 10/11/16 23:39, Segher Boessenkool wrote:
> >>>On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
> >>>>On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
> >>>>>I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
> >>>>>some interesting swings.
> >>>>>458.sjeng     +1.45%
> >>>>>471.omnetpp   +2.19%
> >>>>>445.gobmk     -2.01%
> >>>>>
> >>>>>On SPECFP:
> >>>>>453.povray    +7.00%
> >>>>
> >>>>Wow, this looks really good.  Thank you for implementing this.  If I
> >>>>get some time I am going to try it out on other processors than A72
> >>>>but I doubt I have time any time soon.
> >>>I'd love to hear what causes the slowdown for gobmk as well, btw.
> >>
> >>I haven't yet gotten a direct answer for that (through performance analysis
> >>tools) but I have noticed that load/store pairs are not generated as
> >>aggressively as I hoped.  They are being merged by the sched fusion pass
> >>and peepholes (which runs after this) but it still misses cases. I've
> >>hacked the SWS hooks to generate pairs explicitly and that increases the
> >>number of pairs and helps code size to boot. It complicates the logic of
> >>the hooks a bit but not too much.
> >>
> >>I'll make those changes and re-benchmark, hopefully that
> >>will help performance.
> >>
> >
> >And here's a version that explicitly emits pairs. I've looked at assembly
> >codegen on SPEC2006 and it generates quite a few more LDP/STP pairs than the
> >original version.  I kicked off benchmarks over the weekend to see the
> >effect.  Andrew, if you want to try it out (more benchmarking and testing
> >always welcome) this is the one to try.
> >
> 
> And I discovered over the weekend that gamess and wrf have validation errors.
> This version runs correctly.  SPECINT results were fine though and there is
> even a small overall gain due to sjeng and omnetpp. However, gobmk still has
> the regression.  I'll rerun SPECFP with this patch (it's really just a small
> bugfix over the previous version) and get on with analysing gobmk.

I have some comments in line, most of which are about hardcoding the
maximum register number, but at a high level this looks good to me.

Thanks,
James

> 2016-11-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * config/aarch64/aarch64.h (machine_function): Add
>     reg_is_wrapped_separately field.
>     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
>     rtx_insn *.
>     (aarch64_save_callee_saves): Don't save registers that are wrapped
>     separately.
>     (aarch64_restore_callee_saves): Don't restore registers that are
>     wrapped separately.
>     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
>     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
>     (aarch64_get_separate_components): New function.
>     (aarch64_get_next_set_bit): Likewise.
>     (aarch64_components_for_bb): Likewise.
>     (aarch64_disqualify_components): Likewise.
>     (aarch64_emit_prologue_components): Likewise.
>     (aarch64_emit_epilogue_components): Likewise.
>     (aarch64_set_handled_components): Likewise.
>     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
>     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
>     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
> 

> commit 06ac3c30d8aa38781ee9019e60a5fcf727b85231
> Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> Date:   Tue Oct 11 09:25:54 2016 +0100
> 
>     [AArch64] Separate shrink wrapping hooks implementation
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 325e725..2d33ef6 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
>  
>  /* Emit an insn that's a simple single-set.  Both the operands must be
>     known to be valid.  */
> -inline static rtx
> +inline static rtx_insn *
>  emit_set_insn (rtx x, rtx y)
>  {
>    return emit_insn (gen_rtx_SET (x, y));
> @@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
>  	      || regno == cfun->machine->frame.wb_candidate2))
>  	continue;
>  
> +      if (cfun->machine->reg_is_wrapped_separately[regno])
> +       continue;
> +
>        reg = gen_rtx_REG (mode, regno);
>        offset = start_offset + cfun->machine->frame.reg_offset[regno];
>        mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
> @@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
>        regno2 = aarch64_next_callee_save (regno + 1, limit);
>  
>        if (regno2 <= limit
> +	  && !cfun->machine->reg_is_wrapped_separately[regno2]
>  	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
>  	      == cfun->machine->frame.reg_offset[regno2]))
>  
> @@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
>         regno <= limit;
>         regno = aarch64_next_callee_save (regno + 1, limit))
>      {
> +      if (cfun->machine->reg_is_wrapped_separately[regno])
> +       continue;
> +
>        rtx reg, mem;
>  
>        if (skip_wb
> @@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
>        regno2 = aarch64_next_callee_save (regno + 1, limit);
>  
>        if (regno2 <= limit
> +	  && !cfun->machine->reg_is_wrapped_separately[regno2]
>  	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
>  	      == cfun->machine->frame.reg_offset[regno2]))
>  	{
> @@ -3224,6 +3232,272 @@ aarch64_restore_callee_saves (machine_mode mode,
>      }
>  }
>  
> +static inline bool
> +offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
> +			       HOST_WIDE_INT offset)
> +{
> +  return offset >= -256 && offset < 256;

A follow-up changing this to use IN_RANGE would seem correct.
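For illustration, the suggested rewrite might look like the sketch below. `in_range_p` is a stand-in for GCC's IN_RANGE macro (defined in system.h with inclusive bounds, as far as I recall); the key point is that `offset >= -256 && offset < 256` is the inclusive range [-256, 255].

```c
#include <assert.h>

typedef long long HOST_WIDE_INT;

/* Stand-in for GCC's IN_RANGE macro, which takes inclusive bounds.  */
int in_range_p (HOST_WIDE_INT value, HOST_WIDE_INT lower, HOST_WIDE_INT upper)
{
  return value >= lower && value <= upper;
}

/* offset_9bit_signed_unscaled_p expressed with the range helper:
   "offset >= -256 && offset < 256" is the inclusive range [-256, 255].  */
int offset_9bit_signed_unscaled_p (HOST_WIDE_INT offset)
{
  return in_range_p (offset, -256, 255);
}
```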

> +}
> +
> +static inline bool
> +offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> +{
> +  return (offset >= 0
> +	  && offset < 4096 * GET_MODE_SIZE (mode)
> +	  && offset % GET_MODE_SIZE (mode) == 0);
> +}
> +
> +bool
> +aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> +{
> +  return (offset >= -64 * GET_MODE_SIZE (mode)
> +	  && offset < 64 * GET_MODE_SIZE (mode)
> +	  && offset % GET_MODE_SIZE (mode) == 0);

And certainly here, IN_RANGE would be neater.

> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> +
> +static sbitmap
> +aarch64_get_separate_components (void)
> +{
> +  aarch64_layout_frame ();
> +
> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> +  bitmap_clear (components);
> +
> +  /* The registers we need saved to the frame.  */
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> +    if (aarch64_register_saved_on_entry (regno))
> +      {
> +	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +	if (!frame_pointer_needed)
> +	  offset += cfun->machine->frame.frame_size
> +		    - cfun->machine->frame.hard_fp_offset;
> +	/* Check that we can access the stack slot of the register with one
> +	   direct load with no adjustments needed.  */
> +	if (offset_12bit_unsigned_scaled_p (DImode, offset))
> +	  bitmap_set_bit (components, regno);
> +      }
> +
> +  /* Don't mess with the hard frame pointer.  */
> +  if (frame_pointer_needed)
> +    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
> +
> +  unsigned reg1 = cfun->machine->frame.wb_candidate1;
> +  unsigned reg2 = cfun->machine->frame.wb_candidate2;
> +  /* If aarch64_layout_frame has chosen registers to store/restore with
> +     writeback don't interfere with them to avoid having to output explicit
> +     stack adjustment instructions.  */
> +  if (reg2 != INVALID_REGNUM)
> +    bitmap_clear_bit (components, reg2);
> +  if (reg1 != INVALID_REGNUM)
> +    bitmap_clear_bit (components, reg1);
> +
> +  bitmap_clear_bit (components, LR_REGNUM);
> +  bitmap_clear_bit (components, SP_REGNUM);
> +
> +  return components;
> +}
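The "one direct load with no adjustments" test above can be modelled stand-alone for the DImode case (8-byte slot): the LDR/STR unsigned-immediate form reaches offsets 0 .. 4095*8 in multiples of the slot size. The function name here is invented for the sketch; the real check is offset_12bit_unsigned_scaled_p.

```c
#include <assert.h>

typedef long long HOST_WIDE_INT;

/* For an 8-byte (DImode) stack slot: reachable by a single
   unsigned-scaled-immediate load/store, no address adjustment needed.  */
int dimode_slot_directly_addressable (HOST_WIDE_INT offset)
{
  return offset >= 0 && offset < 4096 * 8 && offset % 8 == 0;
}
```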
> +
> +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
> +
> +static sbitmap
> +aarch64_components_for_bb (basic_block bb)
> +{
> +  bitmap in = DF_LIVE_IN (bb);
> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> +
> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> +  bitmap_clear (components);
> +
> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)

The use of R0_REGNUM and V31_REGNUM scares me a little bit, as we're hardcoding
where the end of the register file is (does this, for example, fall apart
with the SVE work that was recently posted?). Something like a
LAST_HARDREG_NUM might work?

> +    if ((!call_used_regs[regno])
> +       && (bitmap_bit_p (in, regno)
> +	   || bitmap_bit_p (gen, regno)
> +	   || bitmap_bit_p (kill, regno)))
> +	  bitmap_set_bit (components, regno);
> +
> +  return components;
> +}
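The dataflow test in this hook can be modelled outside GCC with plain byte arrays standing in for the DF_LIVE bit sets (names here are invented for the sketch): a callee-saved register's component is needed in a block exactly when the register appears in the block's live-in, gen or kill set.

```c
#include <assert.h>

/* Model of aarch64_components_for_bb's per-register test: call_used,
   in, gen and kill are indexed by register number.  */
int reg_needed_in_bb (unsigned regno, const unsigned char *call_used,
                      const unsigned char *in, const unsigned char *gen,
                      const unsigned char *kill)
{
  return !call_used[regno] && (in[regno] || gen[regno] || kill[regno]);
}
```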
> +
> +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
> +   Nothing to do for aarch64.  */
> +
> +static void
> +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
> +{
> +}

Is there no default "do nothing" hook for this?

> +
> +/* Return the next set bit in BMP from START onwards.  Return the total number
> +   of bits in BMP if no set bit is found at or after START.  */
> +
> +static unsigned int
> +aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
> +{
> +  unsigned int nbits = SBITMAP_SIZE (bmp);
> +  if (start == nbits)
> +    return start;
> +
> +  gcc_assert (start < nbits);
> +  for (unsigned int i = start; i < nbits; i++)
> +    if (bitmap_bit_p (bmp, i))
> +      return i;
> +
> +  return nbits;
> +}
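The helper's contract is easy to check with a stand-alone model (a plain byte array instead of GCC's sbitmap; the name is invented for this sketch): return the first index at or after START whose bit is set, or the total bit count when none is.

```c
#include <assert.h>

/* Same contract as aarch64_get_next_set_bit above: first set bit at or
   after START, or NBITS if there is none.  */
unsigned next_set_bit (const unsigned char *bits, unsigned nbits,
                       unsigned start)
{
  for (unsigned i = start; i < nbits; i++)
    if (bits[i])
      return i;
  return nbits;
}
```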
> +
> +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
> +
> +static void
> +aarch64_emit_prologue_components (sbitmap components)
> +{
> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
> +			     ? HARD_FRAME_POINTER_REGNUM
> +			     : STACK_POINTER_REGNUM);
> +
> +  unsigned total_bits = SBITMAP_SIZE (components);

Would this be clearer if called last_regno?

> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
> +  rtx_insn *insn = NULL;
> +
> +  while (regno != total_bits)
> +    {
> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;

A comment about why this can be DFmode rather than some 128-bit mode might
be useful here.

> +      rtx reg = gen_rtx_REG (mode, regno);
> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +      if (!frame_pointer_needed)
> +	offset += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
> +      rtx mem = gen_frame_mem (mode, addr);
> +
> +      rtx set = gen_rtx_SET (mem, reg);
> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
> +      /* No more registers to save after REGNO.
> +	 Emit a single save and exit.  */
> +      if (regno2 == total_bits)
> +	{
> +	  insn = emit_insn (set);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
> +	  break;
> +	}
> +
> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
> +      /* The next register is not of the same class or its offset is not
> +	 mergeable with the current one into a pair.  */
> +      if (!satisfies_constraint_Ump (mem)
> +	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
> +	  || (offset2 - cfun->machine->frame.reg_offset[regno])
> +		!= GET_MODE_SIZE (DImode))
> +	{
> +	  insn = emit_insn (set);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
> +
> +	  regno = regno2;
> +	  continue;
> +	}
> +
> +      /* REGNO2 can be stored in a pair with REGNO.  */
> +      rtx reg2 = gen_rtx_REG (mode, regno2);
> +      if (!frame_pointer_needed)
> +	offset2 += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
> +      rtx mem2 = gen_frame_mem (mode, addr2);
> +      rtx set2 = gen_rtx_SET (mem2, reg2);
> +
> +      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> +      RTX_FRAME_RELATED_P (insn) = 1;
> +      add_reg_note (insn, REG_CFA_OFFSET, set);
> +      add_reg_note (insn, REG_CFA_OFFSET, set2);
> +
> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
> +    }
> +
> +}
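The pair-merging decision in the loop above can be sketched as a predicate. This is a simplification: the real code checks the Ump constraint on the first mem, which this stand-in approximates with the LDP/STP 7-bit signed scaled immediate range for 8-byte accesses (multiples of 8 in [-512, 504]); the same-class test mirrors `GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)`.

```c
#include <assert.h>

typedef long long HOST_WIDE_INT;

/* Two 8-byte saves can be fused into one store pair when both registers
   are in the same class (both GP or both FP) and the second slot sits
   exactly one DImode above the first, with the first offset inside the
   STP immediate range.  */
int can_store_as_pair (int first_is_gp, int second_is_gp,
                       HOST_WIDE_INT off1, HOST_WIDE_INT off2)
{
  return first_is_gp == second_is_gp
         && off2 - off1 == 8
         && off1 % 8 == 0
         && off1 >= -64 * 8
         && off1 < 64 * 8;
}
```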
> +
> +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
> +
> +static void
> +aarch64_emit_epilogue_components (sbitmap components)

Given the similarity of the logic, is there any way you can refactor this
with the prologue code above?

> +{
> +
> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
> +			     ? HARD_FRAME_POINTER_REGNUM
> +			     : STACK_POINTER_REGNUM);
> +  unsigned total_bits = SBITMAP_SIZE (components);
> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
> +  rtx_insn *insn = NULL;
> +
> +  while (regno != total_bits)
> +    {
> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
> +      rtx reg = gen_rtx_REG (mode, regno);
> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +      if (!frame_pointer_needed)
> +	offset += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
> +      rtx mem = gen_frame_mem (mode, addr);
> +
> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
> +      /* No more registers after REGNO to restore.
> +	 Emit a single restore and exit.  */
> +      if (regno2 == total_bits)
> +	{
> +	  insn = emit_move_insn (reg, mem);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_RESTORE, reg);
> +	  break;
> +	}
> +
> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
> +      /* The next register is not of the same class or its offset is not
> +	 mergeable with the current one into a pair or the offset doesn't fit
> +	 for a load pair.  Emit a single restore and continue from REGNO2.  */
> +      if (!satisfies_constraint_Ump (mem)
> +	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
> +	  || (offset2 - cfun->machine->frame.reg_offset[regno])
> +		!= GET_MODE_SIZE (DImode))
> +	{
> +	  insn = emit_move_insn (reg, mem);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_RESTORE, reg);
> +
> +	  regno = regno2;
> +	  continue;
> +	}
> +
> +      /* REGNO2 can be loaded in a pair with REGNO.  */
> +      rtx reg2 = gen_rtx_REG (mode, regno2);
> +      if (!frame_pointer_needed)
> +	offset2 += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
> +      rtx mem2 = gen_frame_mem (mode, addr2);
> +
> +      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> +      RTX_FRAME_RELATED_P (insn) = 1;
> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
> +      add_reg_note (insn, REG_CFA_RESTORE, reg2);
> +
> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
> +    }
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
> +
> +static void
> +aarch64_set_handled_components (sbitmap components)
> +{
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> +    if (bitmap_bit_p (components, regno))
> +      cfun->machine->reg_is_wrapped_separately[regno] = true;
> +}
> +
>  /* AArch64 stack frames generated by this compiler look like:
>  
>  	+-------------------------------+
> @@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
>    return false;
>  }
>  
> -bool
> -aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> -{
> -  return (offset >= -64 * GET_MODE_SIZE (mode)
> -	  && offset < 64 * GET_MODE_SIZE (mode)
> -	  && offset % GET_MODE_SIZE (mode) == 0);
> -}
> -
> -static inline bool
> -offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
> -			       HOST_WIDE_INT offset)
> -{
> -  return offset >= -256 && offset < 256;
> -}
> -
> -static inline bool
> -offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> -{
> -  return (offset >= 0
> -	  && offset < 4096 * GET_MODE_SIZE (mode)
> -	  && offset % GET_MODE_SIZE (mode) == 0);
> -}
> -
>  /* Return true if MODE is one of the modes for which we
>     support LDP/STP operations.  */
>  
> @@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
>  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
>    aarch64_first_cycle_multipass_dfa_lookahead_guard
>  
> +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
> +  aarch64_get_separate_components
> +
> +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
> +  aarch64_components_for_bb
> +
> +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
> +  aarch64_disqualify_components
> +
> +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> +  aarch64_emit_prologue_components
> +
> +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> +  aarch64_emit_epilogue_components
> +
> +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
> +  aarch64_set_handled_components
> +
>  #undef TARGET_TRAMPOLINE_INIT
>  #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
>  
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 584ff5c..fb89e5a 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
>  typedef struct GTY (()) machine_function
>  {
>    struct aarch64_frame frame;
> +  /* One entry for each GPR and FP register.  */
> +  bool reg_is_wrapped_separately[V31_REGNUM + 1];

Another hardcoded use of V31_REGNUM.

Thanks,
James

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-29 10:58           ` James Greenhalgh
@ 2016-11-29 11:18             ` Kyrill Tkachov
  2016-11-29 11:32               ` Kyrill Tkachov
  2016-11-29 20:29             ` Segher Boessenkool
  1 sibling, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-29 11:18 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Segher Boessenkool, Andrew Pinski, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw

Hi James,

On 29/11/16 10:57, James Greenhalgh wrote:
> On Mon, Nov 14, 2016 at 02:25:28PM +0000, Kyrill Tkachov wrote:
>> On 11/11/16 15:31, Kyrill Tkachov wrote:
>>> On 11/11/16 10:17, Kyrill Tkachov wrote:
>>>> On 10/11/16 23:39, Segher Boessenkool wrote:
>>>>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>>>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>>>>> some interesting swings.
>>>>>>> 458.sjeng     +1.45%
>>>>>>> 471.omnetpp   +2.19%
>>>>>>> 445.gobmk     -2.01%
>>>>>>>
>>>>>>> On SPECFP:
>>>>>>> 453.povray    +7.00%
>>>>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>>>>> get some time I am going to try it out on other processors than A72
>>>>>> but I doubt I have time any time soon.
>>>>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>>>> I haven't yet gotten a direct answer for that (through performance analysis
>>>> tools) but I have noticed that load/store pairs are not generated as
>>>> aggressively as I hoped.  They are being merged by the sched fusion pass
>>>> and peepholes (which runs after this) but it still misses cases. I've
>>>> hacked the SWS hooks to generate pairs explicitly and that increases the
>>>> number of pairs and helps code size to boot. It complicates the logic of
>>>> the hooks a bit but not too much.
>>>>
>>>> I'll make those changes and re-benchmark, hopefully that
>>>> will help performance.
>>>>
>>> And here's a version that explicitly emits pairs. I've looked at assembly
>>> codegen on SPEC2006 and it generates quite a few more LDP/STP pairs than the
>>> original version.  I kicked off benchmarks over the weekend to see the
>>> effect.  Andrew, if you want to try it out (more benchmarking and testing
>>> always welcome) this is the one to try.
>>>
>> And I discovered over the weekend that gamess and wrf have validation errors.
>> This version runs correctly.  SPECINT results were fine though and there is
>> even a small overall gain due to sjeng and omnetpp. However, gobmk still has
>> the regression.  I'll rerun SPECFP with this patch (it's really just a small
>> bugfix over the previous version) and get on with analysing gobmk.
> I have some comments in line, most of which are about hardcoding the
> maximum register number, but at a high level this looks good to me.

Thanks for having a look.
I'll respin with the comments addressed, and I have a couple of replies inline.

Kyrill

> Thanks,
> James
>
>

<snip>

> +
> +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
> +
> +static sbitmap
> +aarch64_components_for_bb (basic_block bb)
> +{
> +  bitmap in = DF_LIVE_IN (bb);
> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> +
> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> +  bitmap_clear (components);
> +
> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> The use of R0_REGNUM and V31_REGNUM scares me a little bit, as we're hardcoding
> where the end of the register file is (does this, for example, fall apart
> with the SVE work that was recently posted?). Something like a
> LAST_HARDREG_NUM might work?
>

I think you mean FIRST_PSEUDO_REGISTER. AFAICS the compiler uses
a loop from 0 to FIRST_PSEUDO_REGISTER to go through the hard registers
in various places in the midend.
I'll change it to use that, though if the way to save/restore such new registers becomes
different from the current approach (i.e. performing a DI/DFmode memory op), the code in these
hooks will have to be updated anyway to take it into account.

>> +    if ((!call_used_regs[regno])
>> +       && (bitmap_bit_p (in, regno)
>> +	   || bitmap_bit_p (gen, regno)
>> +	   || bitmap_bit_p (kill, regno)))
>> +	  bitmap_set_bit (components, regno);
>> +
>> +  return components;
>> +}
>> +
>> +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
>> +   Nothing to do for aarch64.  */
>> +
>> +static void
>> +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
>> +{
>> +}
> Is there no default "do nothing" hook for this?
>

I don't see one defined anywhere, and the documentation for TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
says that if it is defined, all other hooks in the group must be defined.

>> +
>> +/* Return the next set bit in BMP from START onwards.  Return the total number
>> +   of bits in BMP if no set bit is found at or after START.  */
>> +
>> +static unsigned int
>> +aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
>> +{
>> +  unsigned int nbits = SBITMAP_SIZE (bmp);
>> +  if (start == nbits)
>> +    return start;
>> +
>> +  gcc_assert (start < nbits);
>> +  for (unsigned int i = start; i < nbits; i++)
>> +    if (bitmap_bit_p (bmp, i))
>> +      return i;
>> +
>> +  return nbits;
>> +}
>> +
>> +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
>> +
>> +static void
>> +aarch64_emit_prologue_components (sbitmap components)
>> +{
>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>> +			     ? HARD_FRAME_POINTER_REGNUM
>> +			     : STACK_POINTER_REGNUM);
>> +
>> +  unsigned total_bits = SBITMAP_SIZE (components);
> Would this be clearer if called last_regno?
>
>> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
>> +  rtx_insn *insn = NULL;
>> +
>> +  while (regno != total_bits)
>> +    {
>> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
> A comment about why this can be DFmode rather than some 128-bit mode might
> be useful here.
>
>> +      rtx reg = gen_rtx_REG (mode, regno);
>> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>> +      if (!frame_pointer_needed)
>> +	offset += cfun->machine->frame.frame_size
>> +		  - cfun->machine->frame.hard_fp_offset;
>> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
>> +      rtx mem = gen_frame_mem (mode, addr);
>> +
>> +      rtx set = gen_rtx_SET (mem, reg);
>> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
>> +      /* No more registers to save after REGNO.
>> +	 Emit a single save and exit.  */
>> +      if (regno2 == total_bits)
>> +	{
>> +	  insn = emit_insn (set);
>> +	  RTX_FRAME_RELATED_P (insn) = 1;
>> +	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
>> +	  break;
>> +	}
>> +
>> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
>> +      /* The next register is not of the same class or its offset is not
>> +	 mergeable with the current one into a pair.  */
>> +      if (!satisfies_constraint_Ump (mem)
>> +	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
>> +	  || (offset2 - cfun->machine->frame.reg_offset[regno])
>> +		!= GET_MODE_SIZE (DImode))
>> +	{
>> +	  insn = emit_insn (set);
>> +	  RTX_FRAME_RELATED_P (insn) = 1;
>> +	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
>> +
>> +	  regno = regno2;
>> +	  continue;
>> +	}
>> +
>> +      /* REGNO2 can be stored in a pair with REGNO.  */
>> +      rtx reg2 = gen_rtx_REG (mode, regno2);
>> +      if (!frame_pointer_needed)
>> +	offset2 += cfun->machine->frame.frame_size
>> +		  - cfun->machine->frame.hard_fp_offset;
>> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
>> +      rtx mem2 = gen_frame_mem (mode, addr2);
>> +      rtx set2 = gen_rtx_SET (mem2, reg2);
>> +
>> +      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
>> +      RTX_FRAME_RELATED_P (insn) = 1;
>> +      add_reg_note (insn, REG_CFA_OFFSET, set);
>> +      add_reg_note (insn, REG_CFA_OFFSET, set2);
>> +
>> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
>> +    }
>> +
>> +}
>> +
>> +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
>> +
>> +static void
>> +aarch64_emit_epilogue_components (sbitmap components)
> Given the similarity of the logic, is there any way you can refactor this
> with the prologue code above?
>
>> +{
>> +
>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>> +			     ? HARD_FRAME_POINTER_REGNUM
>> +			     : STACK_POINTER_REGNUM);
>> +  unsigned total_bits = SBITMAP_SIZE (components);
>> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
>> +  rtx_insn *insn = NULL;
>> +
>> +  while (regno != total_bits)
>> +    {
>> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
>> +      rtx reg = gen_rtx_REG (mode, regno);
>> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>> +      if (!frame_pointer_needed)
>> +	offset += cfun->machine->frame.frame_size
>> +		  - cfun->machine->frame.hard_fp_offset;
>> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
>> +      rtx mem = gen_frame_mem (mode, addr);
>> +
>> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
>> +      /* No more registers after REGNO to restore.
>> +	 Emit a single restore and exit.  */
>> +      if (regno2 == total_bits)
>> +	{
>> +	  insn = emit_move_insn (reg, mem);
>> +	  RTX_FRAME_RELATED_P (insn) = 1;
>> +	  add_reg_note (insn, REG_CFA_RESTORE, reg);
>> +	  break;
>> +	}
>> +
>> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
>> +      /* The next register is not of the same class or its offset is not
>> +	 mergeable with the current one into a pair or the offset doesn't fit
>> +	 for a load pair.  Emit a single restore and continue from REGNO2.  */
>> +      if (!satisfies_constraint_Ump (mem)
>> +	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
>> +	  || (offset2 - cfun->machine->frame.reg_offset[regno])
>> +		!= GET_MODE_SIZE (DImode))
>> +	{
>> +	  insn = emit_move_insn (reg, mem);
>> +	  RTX_FRAME_RELATED_P (insn) = 1;
>> +	  add_reg_note (insn, REG_CFA_RESTORE, reg);
>> +
>> +	  regno = regno2;
>> +	  continue;
>> +	}
>> +
>> +      /* REGNO2 can be loaded in a pair with REGNO.  */
>> +      rtx reg2 = gen_rtx_REG (mode, regno2);
>> +      if (!frame_pointer_needed)
>> +	offset2 += cfun->machine->frame.frame_size
>> +		  - cfun->machine->frame.hard_fp_offset;
>> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
>> +      rtx mem2 = gen_frame_mem (mode, addr2);
>> +
>> +      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
>> +      RTX_FRAME_RELATED_P (insn) = 1;
>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>> +      add_reg_note (insn, REG_CFA_RESTORE, reg2);
>> +
>> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
>> +    }
>> +}
>> +
>> +/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
>> +
>> +static void
>> +aarch64_set_handled_components (sbitmap components)
>> +{
>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>> +    if (bitmap_bit_p (components, regno))
>> +      cfun->machine->reg_is_wrapped_separately[regno] = true;
>> +}
>> +
>>   /* AArch64 stack frames generated by this compiler look like:
>>   
>>   	+-------------------------------+
>> @@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
>>     return false;
>>   }
>>   
>> -bool
>> -aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
>> -{
>> -  return (offset >= -64 * GET_MODE_SIZE (mode)
>> -	  && offset < 64 * GET_MODE_SIZE (mode)
>> -	  && offset % GET_MODE_SIZE (mode) == 0);
>> -}
>> -
>> -static inline bool
>> -offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
>> -			       HOST_WIDE_INT offset)
>> -{
>> -  return offset >= -256 && offset < 256;
>> -}
>> -
>> -static inline bool
>> -offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
>> -{
>> -  return (offset >= 0
>> -	  && offset < 4096 * GET_MODE_SIZE (mode)
>> -	  && offset % GET_MODE_SIZE (mode) == 0);
>> -}
>> -
>>   /* Return true if MODE is one of the modes for which we
>>      support LDP/STP operations.  */
>>   
>> @@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
>>   #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
>>     aarch64_first_cycle_multipass_dfa_lookahead_guard
>>   
>> +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
>> +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
>> +  aarch64_get_separate_components
>> +
>> +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
>> +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
>> +  aarch64_components_for_bb
>> +
>> +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
>> +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
>> +  aarch64_disqualify_components
>> +
>> +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
>> +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
>> +  aarch64_emit_prologue_components
>> +
>> +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
>> +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
>> +  aarch64_emit_epilogue_components
>> +
>> +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
>> +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
>> +  aarch64_set_handled_components
>> +
>>   #undef TARGET_TRAMPOLINE_INIT
>>   #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
>>   
>> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>> index 584ff5c..fb89e5a 100644
>> --- a/gcc/config/aarch64/aarch64.h
>> +++ b/gcc/config/aarch64/aarch64.h
>> @@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
>>   typedef struct GTY (()) machine_function
>>   {
>>     struct aarch64_frame frame;
>> +  /* One entry for each GPR and FP register.  */
>> +  bool reg_is_wrapped_separately[V31_REGNUM + 1];
> Another hardcoded use of V31_REGNUM.
>
> Thanks,
> James
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-29 11:18             ` Kyrill Tkachov
@ 2016-11-29 11:32               ` Kyrill Tkachov
  2016-11-29 11:37                 ` James Greenhalgh
  0 siblings, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-29 11:32 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Segher Boessenkool, Andrew Pinski, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw


On 29/11/16 11:18, Kyrill Tkachov wrote:
> Hi James,
>
> On 29/11/16 10:57, James Greenhalgh wrote:
>> On Mon, Nov 14, 2016 at 02:25:28PM +0000, Kyrill Tkachov wrote:
>>> On 11/11/16 15:31, Kyrill Tkachov wrote:
>>>> On 11/11/16 10:17, Kyrill Tkachov wrote:
>>>>> On 10/11/16 23:39, Segher Boessenkool wrote:
>>>>>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>>>>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>>>>>> some interesting swings.
>>>>>>>> 458.sjeng     +1.45%
>>>>>>>> 471.omnetpp   +2.19%
>>>>>>>> 445.gobmk     -2.01%
>>>>>>>>
>>>>>>>> On SPECFP:
>>>>>>>> 453.povray    +7.00%
>>>>>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>>>>>> get some time I am going to try it out on other processors than A72
>>>>>>> but I doubt I have time any time soon.
>>>>>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>>>>> I haven't yet gotten a direct answer for that (through performance analysis
>>>>> tools) but I have noticed that load/store pairs are not generated as
>>>>> aggressively as I hoped.  They are being merged by the sched fusion pass
>>>>> and peepholes (which runs after this) but it still misses cases. I've
>>>>> hacked the SWS hooks to generate pairs explicitly and that increases the
>>>>> number of pairs and helps code size to boot. It complicates the logic of
>>>>> the hooks a bit but not too much.
>>>>>
>>>>> I'll make those changes and re-benchmark, hopefully that
>>>>> will help performance.
>>>>>
>>>> And here's a version that explicitly emits pairs. I've looked at assembly
>>>> codegen on SPEC2006 and it generates quite a few more LDP/STP pairs than the
>>>> original version.  I kicked off benchmarks over the weekend to see the
>>>> effect.  Andrew, if you want to try it out (more benchmarking and testing
>>>> always welcome) this is the one to try.
>>>>
>>> And I discovered over the weekend that gamess and wrf have validation errors.
>>> This version runs correctly.  SPECINT results were fine though and there is
>>> even a small overall gain due to sjeng and omnetpp. However, gobmk still has
>>> the regression.  I'll rerun SPECFP with this patch (it's really just a small
>>> bugfix over the previous version) and get on with analysing gobmk.
>> I have some comments in line, most of which are about hardcoding the
>> maximum register number, but at a high level this looks good to me.
>
> Thanks for having a look.
> I'll respin with the comments addressed, and I have a couple of replies inline.
>
> Kyrill
>
>> Thanks,
>> James
>>
>>
>
> <snip>
>
>> +
>> +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
>> +
>> +static sbitmap
>> +aarch64_components_for_bb (basic_block bb)
>> +{
>> +  bitmap in = DF_LIVE_IN (bb);
>> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
>> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
>> +
>> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
>> +  bitmap_clear (components);
>> +
>> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>> The use of R0_REGNUM and V31_REGNUM scares me a little bit, as we're hardcoding
>> where the end of the register file is (does this, for example, fall apart
>> with the SVE work that was recently posted?). Something like a
>> LAST_HARDREG_NUM might work?
>>
>
> I think you mean FIRST_PSEUDO_REGISTER. AFAICS the compiler uses
> a loop from 0 to FIRST_PSEUDO_REGISTER to go through the hard registers
> in various places in the midend.
> I'll change it to use that, though if the way to save/restore such new registers becomes
> different from the current approach (i.e. performing a DI/DFmode memory op), the code in these
> hooks will have to be updated anyway to take it into account.
>

And actually, trying to implement this blows up. The "hard" registers include CC_REGNUM, which we
definitely want to avoid 'saving/restoring'. We really just want to save the normal data
registers, so is it okay if I leave it as it is?
The prologue/epilogue code already uses V31_REGNUM, so it would need to change anyway if new
registers are added in the future.

Kyrill

>>> +    if ((!call_used_regs[regno])
>>> +       && (bitmap_bit_p (in, regno)
>>> +       || bitmap_bit_p (gen, regno)
>>> +       || bitmap_bit_p (kill, regno)))
>>> +      bitmap_set_bit (components, regno);
>>> +
>>> +  return components;
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
>>> +   Nothing to do for aarch64.  */
>>> +
>>> +static void
>>> +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
>>> +{
>>> +}
>> Is there no default "do nothing" hook for this?
>>
>
> I don't see one defined anywhere and the documentation for TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> says that if it is defined, all other hooks in the group must be defined.
>
>>> +
>>> +/* Return the next set bit in BMP from START onwards.  Return the total number
>>> +   of bits in BMP if no set bit is found at or after START. */
>>> +
>>> +static unsigned int
>>> +aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
>>> +{
>>> +  unsigned int nbits = SBITMAP_SIZE (bmp);
>>> +  if (start == nbits)
>>> +    return start;
>>> +
>>> +  gcc_assert (start < nbits);
>>> +  for (unsigned int i = start; i < nbits; i++)
>>> +    if (bitmap_bit_p (bmp, i))
>>> +      return i;
>>> +
>>> +  return nbits;
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
>>> +
>>> +static void
>>> +aarch64_emit_prologue_components (sbitmap components)
>>> +{
>>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>>> +                 ? HARD_FRAME_POINTER_REGNUM
>>> +                 : STACK_POINTER_REGNUM);
>>> +
>>> +  unsigned total_bits = SBITMAP_SIZE (components);
>> Would this be clearer called last_regno ?
>>
>>> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
>>> +  rtx_insn *insn = NULL;
>>> +
>>> +  while (regno != total_bits)
>>> +    {
>>> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
>> Comment about why this can be DFmode rather than some 128-bit mode might
>> be useful here.
>>
>>> +      rtx reg = gen_rtx_REG (mode, regno);
>>> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>>> +      if (!frame_pointer_needed)
>>> +    offset += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
>>> +      rtx mem = gen_frame_mem (mode, addr);
>>> +
>>> +      rtx set = gen_rtx_SET (mem, reg);
>>> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
>>> +      /* No more registers to save after REGNO.
>>> +     Emit a single save and exit.  */
>>> +      if (regno2 == total_bits)
>>> +    {
>>> +      insn = emit_insn (set);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
>>> +      break;
>>> +    }
>>> +
>>> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
>>> +      /* The next register is not of the same class or its offset is not
>>> +     mergeable with the current one into a pair.  */
>>> +      if (!satisfies_constraint_Ump (mem)
>>> +      || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
>>> +      || (offset2 - cfun->machine->frame.reg_offset[regno])
>>> +        != GET_MODE_SIZE (DImode))
>>> +    {
>>> +      insn = emit_insn (set);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
>>> +
>>> +      regno = regno2;
>>> +      continue;
>>> +    }
>>> +
>>> +      /* REGNO2 can be stored in a pair with REGNO.  */
>>> +      rtx reg2 = gen_rtx_REG (mode, regno2);
>>> +      if (!frame_pointer_needed)
>>> +    offset2 += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
>>> +      rtx mem2 = gen_frame_mem (mode, addr2);
>>> +      rtx set2 = gen_rtx_SET (mem2, reg2);
>>> +
>>> +      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_OFFSET, set);
>>> +      add_reg_note (insn, REG_CFA_OFFSET, set2);
>>> +
>>> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
>>> +    }
>>> +
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
>>> +
>>> +static void
>>> +aarch64_emit_epilogue_components (sbitmap components)
>> Given the similarity of the logic, is there any way you can refactor this
>> with the prologue code above?
>>
>>> +{
>>> +
>>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>>> +                 ? HARD_FRAME_POINTER_REGNUM
>>> +                 : STACK_POINTER_REGNUM);
>>> +  unsigned total_bits = SBITMAP_SIZE (components);
>>> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
>>> +  rtx_insn *insn = NULL;
>>> +
>>> +  while (regno != total_bits)
>>> +    {
>>> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
>>> +      rtx reg = gen_rtx_REG (mode, regno);
>>> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>>> +      if (!frame_pointer_needed)
>>> +    offset += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
>>> +      rtx mem = gen_frame_mem (mode, addr);
>>> +
>>> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
>>> +      /* No more registers after REGNO to restore.
>>> +     Emit a single restore and exit.  */
>>> +      if (regno2 == total_bits)
>>> +    {
>>> +      insn = emit_move_insn (reg, mem);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>>> +      break;
>>> +    }
>>> +
>>> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
>>> +      /* The next register is not of the same class or its offset is not
>>> +     mergeable with the current one into a pair or the offset doesn't fit
>>> +     for a load pair.  Emit a single restore and continue from REGNO2.  */
>>> +      if (!satisfies_constraint_Ump (mem)
>>> +      || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
>>> +      || (offset2 - cfun->machine->frame.reg_offset[regno])
>>> +        != GET_MODE_SIZE (DImode))
>>> +    {
>>> +      insn = emit_move_insn (reg, mem);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>>> +
>>> +      regno = regno2;
>>> +      continue;
>>> +    }
>>> +
>>> +      /* REGNO2 can be loaded in a pair with REGNO.  */
>>> +      rtx reg2 = gen_rtx_REG (mode, regno2);
>>> +      if (!frame_pointer_needed)
>>> +    offset2 += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
>>> +      rtx mem2 = gen_frame_mem (mode, addr2);
>>> +
>>> +      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg2);
>>> +
>>> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
>>> +    }
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
>>> +
>>> +static void
>>> +aarch64_set_handled_components (sbitmap components)
>>> +{
>>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>>> +    if (bitmap_bit_p (components, regno))
>>> +      cfun->machine->reg_is_wrapped_separately[regno] = true;
>>> +}
>>> +
>>>   /* AArch64 stack frames generated by this compiler look like:
>>>         +-------------------------------+
>>> @@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
>>>     return false;
>>>   }
>>>   -bool
>>> -aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
>>> -{
>>> -  return (offset >= -64 * GET_MODE_SIZE (mode)
>>> -      && offset < 64 * GET_MODE_SIZE (mode)
>>> -      && offset % GET_MODE_SIZE (mode) == 0);
>>> -}
>>> -
>>> -static inline bool
>>> -offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
>>> -                   HOST_WIDE_INT offset)
>>> -{
>>> -  return offset >= -256 && offset < 256;
>>> -}
>>> -
>>> -static inline bool
>>> -offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
>>> -{
>>> -  return (offset >= 0
>>> -      && offset < 4096 * GET_MODE_SIZE (mode)
>>> -      && offset % GET_MODE_SIZE (mode) == 0);
>>> -}
>>> -
>>>   /* Return true if MODE is one of the modes for which we
>>>      support LDP/STP operations.  */
>>>   @@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
>>>   #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
>>>     aarch64_first_cycle_multipass_dfa_lookahead_guard
>>>   +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
>>> +  aarch64_get_separate_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
>>> +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
>>> +  aarch64_components_for_bb
>>> +
>>> +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
>>> +  aarch64_disqualify_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
>>> +  aarch64_emit_prologue_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
>>> +  aarch64_emit_epilogue_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
>>> +  aarch64_set_handled_components
>>> +
>>>   #undef TARGET_TRAMPOLINE_INIT
>>>   #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
>>>   diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>>> index 584ff5c..fb89e5a 100644
>>> --- a/gcc/config/aarch64/aarch64.h
>>> +++ b/gcc/config/aarch64/aarch64.h
>>> @@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
>>>   typedef struct GTY (()) machine_function
>>>   {
>>>     struct aarch64_frame frame;
>>> +  /* One entry for each GPR and FP register.  */
>>> +  bool reg_is_wrapped_separately[V31_REGNUM + 1];
>> Another hardcoded use of V31_REGNUM.
>>
>> Thanks,
>> James
>>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-29 11:32               ` Kyrill Tkachov
@ 2016-11-29 11:37                 ` James Greenhalgh
  0 siblings, 0 replies; 25+ messages in thread
From: James Greenhalgh @ 2016-11-29 11:37 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Segher Boessenkool, Andrew Pinski, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw, nd

On Tue, Nov 29, 2016 at 11:32:41AM +0000, Kyrill Tkachov wrote:

<snip>

> >>+
> >>+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
> >>+
> >>+static sbitmap
> >>+aarch64_components_for_bb (basic_block bb)
> >>+{
> >>+  bitmap in = DF_LIVE_IN (bb);
> >>+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> >>+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> >>+
> >>+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> >>+  bitmap_clear (components);
> >>+
> >>+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> >>+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> >>The use of R0_REGNUM and V31_REGNUM scares me a little bit, as we're hardcoding
> >>where the end of the register file is (does this, for example, fall apart
> >>with the SVE work that was recently posted?). Something like a
> >>LAST_HARDREG_NUM might work?
> >>
> >
> >I think you mean FIRST_PSEUDO_REGISTER. AFAICS the compiler uses
> >a loop from 0 to FIRST_PSEUDO_REGISTER to go through the hard registers
> >in various places in the midend.
> >I'll change it to use that, though if the way to save/restore such new registers becomes
> >different from the current approach (i.e. performing a DI/DFmode memory op), the code in these
> >hooks will have to be updated anyway to take it into account.
> >
> 
> And actually trying to implement this blows up. The "hard" registers include
> CC_REGNUM, which we definitely want to avoid 'saving/restoring'. We really
> just want to save the normal data registers, so is it okay if I
> leave it as it is?  The prologue/epilogue code already uses V31_REGNUM, so it
> would need to change anyway if new registers are added in the future.

Well, you could always define a new constant in aarch64.md ?

   (LAST_SAVED_REGISTER  63)

Which would still save hardcoding V31_REGNUM everywhere.

And yes, the pro-/epilogue functions could also do with this fix
(preapproved).

James


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-29 10:58           ` James Greenhalgh
  2016-11-29 11:18             ` Kyrill Tkachov
@ 2016-11-29 20:29             ` Segher Boessenkool
  2016-11-30 14:08               ` Kyrill Tkachov
  1 sibling, 1 reply; 25+ messages in thread
From: Segher Boessenkool @ 2016-11-29 20:29 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Kyrill Tkachov, Andrew Pinski, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw, nd

Hi James, Kyrill,

On Tue, Nov 29, 2016 at 10:57:33AM +0000, James Greenhalgh wrote:
> > +static sbitmap
> > +aarch64_components_for_bb (basic_block bb)
> > +{
> > +  bitmap in = DF_LIVE_IN (bb);
> > +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> > +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> > +
> > +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> > +  bitmap_clear (components);
> > +
> > +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> > +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> 
> The use of R0_REGNUM and V31_REGNUM scares me a little bit, as we're hardcoding
> where the end of the register file is (does this, for example, fall apart
> with the SVE work that was recently posted?). Something like a
> LAST_HARDREG_NUM might work?

Components and registers aren't the same thing (you can have components
for things that aren't just a register save, e.g. the frame setup, stack
alignment, save of some non-GPR via a GPR, PIC register setup, etc.)
The loop here should really only cover the non-volatile registers, and
there should be some translation from register number to component number
(it of course is convenient to have a 1-1 translation for the GPRs and
floating point registers).  For rs6000 many things in the backend already
use non-symbolic numbers for the FPRs and GPRs, so that is easier there.

> > +static void
> > +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
> > +{
> > +}
> 
> Is there no default "do nothing" hook for this?

I can make the shrink-wrap code do nothing here if this hook isn't
defined, if you want?


Segher


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-29 20:29             ` Segher Boessenkool
@ 2016-11-30 14:08               ` Kyrill Tkachov
  2016-12-02 17:09                 ` James Greenhalgh
  0 siblings, 1 reply; 25+ messages in thread
From: Kyrill Tkachov @ 2016-11-30 14:08 UTC (permalink / raw)
  To: Segher Boessenkool, James Greenhalgh
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw

[-- Attachment #1: Type: text/plain, Size: 3497 bytes --]


On 29/11/16 20:29, Segher Boessenkool wrote:
> Hi James, Kyrill,
>
> On Tue, Nov 29, 2016 at 10:57:33AM +0000, James Greenhalgh wrote:
>>> +static sbitmap
>>> +aarch64_components_for_bb (basic_block bb)
>>> +{
>>> +  bitmap in = DF_LIVE_IN (bb);
>>> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
>>> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
>>> +
>>> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
>>> +  bitmap_clear (components);
>>> +
>>> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
>>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>> The use of R0_REGNUM and V31_REGNUM scares me a little bit, as we're hardcoding
>> where the end of the register file is (does this, for example, fall apart
>> with the SVE work that was recently posted?). Something like a
>> LAST_HARDREG_NUM might work?
> Components and registers aren't the same thing (you can have components
> for things that aren't just a register save, e.g. the frame setup, stack
> alignment, save of some non-GPR via a GPR, PIC register setup, etc.)
> The loop here should really only cover the non-volatile registers, and
> there should be some translation from register number to component number
> (it of course is convenient to have a 1-1 translation for the GPRs and
> floating point registers).  For rs6000 many things in the backend already
> use non-symbolic numbers for the FPRs and GPRs, so that is easier there.

Anyway, here's the patch with James's comments implemented.
I've introduced LAST_SAVED_REGNUM which is used to delimit the registers
considered for shrink-wrapping.

aarch64_process_components is introduced and used to implement
the emit_prologue_components and emit_epilogue_components functions in a single place.

Bootstrapped and tested on aarch64-none-linux-gnu.

Thanks,
Kyrill

2016-11-30  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.h (machine_function): Add
     reg_is_wrapped_separately field.
     * config/aarch64/aarch64.md (LAST_SAVED_REGNUM): Define new constant.
     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
     rtx_insn *.
     (aarch64_save_callee_saves): Don't save registers that are wrapped
     separately.
     (aarch64_restore_callee_saves): Don't restore registers that are
     wrapped separately.
     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
     (aarch64_get_separate_components): New function.
     (aarch64_get_next_set_bit): Likewise.
     (aarch64_components_for_bb): Likewise.
     (aarch64_disqualify_components): Likewise.
     (aarch64_emit_prologue_components): Likewise.
     (aarch64_emit_epilogue_components): Likewise.
     (aarch64_set_handled_components): Likewise.
     (aarch64_process_components): Likewise.
     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
>>> +static void
>>> +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
>>> +{
>>> +}
>> Is there no default "do nothing" hook for this?
> I can make the shrink-wrap code do nothing here if this hook isn't
> defined, if you want?

I don't mind either way.
If you do it I'll then remove the empty implementation in aarch64.


>
> Segher


[-- Attachment #2: aarch64-sws.patch --]
[-- Type: text/x-patch, Size: 12694 bytes --]

commit 194816281ec6da2620bb34c9278ed7edf8bcf0da
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 82bfe14..48e6e2c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	      || regno == cfun->machine->frame.wb_candidate2))
 	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+       continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
 	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
 	      == cfun->machine->frame.reg_offset[regno2]))
 	{
@@ -3224,6 +3232,245 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (LAST_SAVED_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = 0; regno <= LAST_SAVED_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (LAST_SAVED_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = 0; regno <= LAST_SAVED_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+	   || bitmap_bit_p (gen, regno)
+	   || bitmap_bit_p (kill, regno)))
+	  bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Return the next set bit in BMP from START onwards.  Return the total number
+   of bits in BMP if no set bit is found at or after START.  */
+
+static unsigned int
+aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
+{
+  unsigned int nbits = SBITMAP_SIZE (bmp);
+  if (start == nbits)
+    return start;
+
+  gcc_assert (start < nbits);
+  for (unsigned int i = start; i < nbits; i++)
+    if (bitmap_bit_p (bmp, i))
+      return i;
+
+  return nbits;
+}
+
+/* Do the work for aarch64_emit_prologue_components and
+   aarch64_emit_epilogue_components.  COMPONENTS is the bitmap of registers
+   to save/restore, PROLOGUE_P indicates whether to emit the prologue sequence
+   for these components or the epilogue sequence.  That is, it determines
+   whether we should emit stores or loads and what kind of CFA notes to attach
+   to the insns.  Otherwise the logic for the two sequences is very
+   similar.  */
+
+static void
+aarch64_process_components (sbitmap components, bool prologue_p)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  unsigned last_regno = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != last_regno)
+    {
+      /* AAPCS64 section 5.1.2 requires only the bottom 64 bits to be saved
+	 so DFmode for the vector registers is enough.  */
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      rtx set = prologue_p ? gen_rtx_SET (mem, reg) : gen_rtx_SET (reg, mem);
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to handle after REGNO.
+	 Emit a single save/restore and exit.  */
+      if (regno2 == last_regno)
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  if (prologue_p)
+	    add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+	  else
+	    add_reg_note (insn, REG_CFA_RESTORE, reg);
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+		!= GET_MODE_SIZE (mode))
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  if (prologue_p)
+	    add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+	  else
+	    add_reg_note (insn, REG_CFA_RESTORE, reg);
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be saved/restored in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+      rtx set2 = prologue_p ? gen_rtx_SET (mem2, reg2)
+			     : gen_rtx_SET (reg2, mem2);
+
+      if (prologue_p)
+	insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+      else
+	insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+
+      RTX_FRAME_RELATED_P (insn) = 1;
+      if (prologue_p)
+	{
+	  add_reg_note (insn, REG_CFA_OFFSET, set);
+	  add_reg_note (insn, REG_CFA_OFFSET, set2);
+	}
+      else
+	{
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+	  add_reg_note (insn, REG_CFA_RESTORE, reg2);
+	}
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  aarch64_process_components (components, true);
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+  aarch64_process_components (components, false);
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = 0; regno <= LAST_SAVED_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3982,29 +4229,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14573,6 +14797,30 @@ aarch64_libgcc_floating_mode_supported_p
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..c417569 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register that may be saved.  */
+  bool reg_is_wrapped_separately[LAST_SAVED_REGNUM + 1];
 } machine_function;
 #endif
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 3b67be0..6b4d0ba 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -59,6 +59,7 @@ (define_constants
     (V0_REGNUM		32)
     (V15_REGNUM		47)
     (V31_REGNUM		63)
+    (LAST_SAVED_REGNUM	63)
     (SFP_REGNUM		64)
     (AP_REGNUM		65)
     (CC_REGNUM		66)


* Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
  2016-11-30 14:08               ` Kyrill Tkachov
@ 2016-12-02 17:09                 ` James Greenhalgh
  0 siblings, 0 replies; 25+ messages in thread
From: James Greenhalgh @ 2016-12-02 17:09 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Segher Boessenkool, Andrew Pinski, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw, nd

On Wed, Nov 30, 2016 at 02:07:58PM +0000, Kyrill Tkachov wrote:
> 
> On 29/11/16 20:29, Segher Boessenkool wrote:
> >Hi James, Kyrill,
> >
> >On Tue, Nov 29, 2016 at 10:57:33AM +0000, James Greenhalgh wrote:
> >>>+static sbitmap
> >>>+aarch64_components_for_bb (basic_block bb)
> >>>+{
> >>>+  bitmap in = DF_LIVE_IN (bb);
> >>>+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> >>>+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> >>>+
> >>>+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> >>>+  bitmap_clear (components);
> >>>+
> >>>+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> >>>+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> >>The use of R0_REGNUM and V31_REGNUM scare me a little bit, as we're hardcoding
> >>where the end of the register file is (does this, for example, fall apart
> >>with the SVE work that was recently posted). Something like a
> >>LAST_HARDREG_NUM might work?
> >Components and registers aren't the same thing (you can have components
> >for things that aren't just a register save, e.g. the frame setup, stack
> >alignment, save of some non-GPR via a GPR, PIC register setup, etc.)
> >The loop here should really only cover the non-volatile registers, and
> >there should be some translation from register number to component number
> >(it of course is convenient to have a 1-1 translation for the GPRs and
> >floating point registers).  For rs6000 many things in the backend already
> >use non-symbolic numbers for the FPRs and GPRs, so that is easier there.
> 
> Anyway, here's the patch with James's comments implemented.
> I've introduced LAST_SAVED_REGNUM which is used to delimit the registers
> considered for shrink-wrapping.
> 
> aarch64_process_components is introduced and used to implement the
> emit_prologue_components and emit_epilogue_components functions in a single
> place.
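A standalone sketch of that refactoring idea (illustrative code, not the actual patch): one walk over the component set serves both hooks, with a flag choosing between save and restore.

```c
#include <assert.h>
#include <stdbool.h>

#define LAST_SAVED_REGNUM 63  /* mirrors the new .md constant */

/* Count the insns the unified routine would emit for a block's
   component set; prologue_p selects whether each one would be a
   store (save) or a load (restore).  */
static int
process_components (const bool *components, bool prologue_p)
{
  int emitted = 0;
  for (int regno = 0; regno <= LAST_SAVED_REGNUM; regno++)
    if (components[regno])
      emitted++;  /* a store if prologue_p, a load otherwise */
  return emitted;
}

static int
demo (bool prologue_p)
{
  bool components[LAST_SAVED_REGNUM + 1] = { false };
  components[19] = components[20] = components[21] = true;
  return process_components (components, prologue_p);
}
```

Save and restore walk the same set, so implementing both emit hooks through one routine keeps the two paths from drifting apart.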

OK.

> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Thanks,
> Kyrill
> 
> 2016-11-30  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * config/aarch64/aarch64.h (machine_function): Add
>     reg_is_wrapped_separately field.
>     * config/aarch64/aarch64.md (LAST_SAVED_REGNUM): Define new constant.
>     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
>     rtx_insn *.
>     (aarch64_save_callee_saves): Don't save registers that are wrapped
>     separately.
>     (aarch64_restore_callee_saves): Don't restore registers that are
>     wrapped separately.
>     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
>     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
>     (aarch64_get_separate_components): New function.
>     (aarch64_get_next_set_bit): Likewise.
>     (aarch64_components_for_bb): Likewise.
>     (aarch64_disqualify_components): Likewise.
>     (aarch64_emit_prologue_components): Likewise.
>     (aarch64_emit_epilogue_components): Likewise.
>     (aarch64_set_handled_components): Likewise.
>     (aarch64_process_components): Likewise.
>     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
>     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
>     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
> >>>+static void
> >>>+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
> >>>+{
> >>>+}
> >>Is there no default "do nothing" hook for this?
> >I can make the shrink-wrap code do nothing here if this hook isn't
> >defined, if you want?
> 
> I don't mind either way.
> If you do it I'll then remove the empty implementation in aarch64.

If there is no empty default, this is fine.

Thanks,
James



end of thread

Thread overview: 25+ messages
2016-11-10 14:26 [PATCH][AArch64] Separate shrink wrapping hooks implementation Kyrill Tkachov
2016-11-10 16:33 ` Segher Boessenkool
2016-11-10 16:45   ` Kyrill Tkachov
2016-11-10 22:42 ` Andrew Pinski
2016-11-10 23:39   ` Segher Boessenkool
2016-11-11 10:18     ` Kyrill Tkachov
2016-11-11 15:31       ` Kyrill Tkachov
2016-11-14 14:25         ` Kyrill Tkachov
2016-11-17 14:22           ` Kyrill Tkachov
2016-11-17 14:44             ` Segher Boessenkool
2016-11-17 14:55               ` Kyrill Tkachov
2016-11-17 15:02                 ` Segher Boessenkool
2016-11-17 15:06                   ` Kyrill Tkachov
2016-11-17 16:50                 ` Kyrill Tkachov
2016-11-17 17:46                   ` Segher Boessenkool
2016-11-18  9:29                     ` Kyrill Tkachov
2016-11-18 12:50                       ` Segher Boessenkool
2016-11-22 16:47                         ` Kyrill Tkachov
2016-11-29 10:58           ` James Greenhalgh
2016-11-29 11:18             ` Kyrill Tkachov
2016-11-29 11:32               ` Kyrill Tkachov
2016-11-29 11:37                 ` James Greenhalgh
2016-11-29 20:29             ` Segher Boessenkool
2016-11-30 14:08               ` Kyrill Tkachov
2016-12-02 17:09                 ` James Greenhalgh
