Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
@ 2018-02-15 14:20 Wilco Dijkstra
  2018-02-16  4:42 ` Siddhesh Poyarekar
  0 siblings, 1 reply; 4+ messages in thread
From: Wilco Dijkstra @ 2018-02-15 14:20 UTC (permalink / raw)
  To: siddhesh; +Cc: GCC Patches, nd

Hi Siddhesh,

I still don't like the idea of disabling a whole class of instructions in the md file.
It seems much better to adjust the costs here so that you get most of the
improvement now, and fine tune it once we can differentiate between
loads and stores.

Taking your example, adding -funroll-loops generates this for Falkor:

	ldr	q7, [x2, x18]
	add	x5, x18, 16
	add	x4, x1, x18
	add	x10, x18, 32
	add	x11, x1, x5
	add	x3, x18, 48
	add	x12, x1, x10
	add	x9, x18, 64
	add	x14, x1, x3
	add	x8, x18, 80
	add	x15, x1, x9
	add	x7, x18, 96
	add	x16, x1, x8
	str	q7, [x4]
	ldr	q16, [x2, x5]
	add	x6, x18, 112
	add	x17, x1, x7
	add	x18, x18, 128
	add	x5, x1, x6
	cmp	x18, x13
	str	q16, [x11]
	ldr	q17, [x2, x10]
	str	q17, [x12]
	ldr	q18, [x2, x3]
	str	q18, [x14]
	ldr	q19, [x2, x9]
	str	q19, [x15]
	ldr	q20, [x2, x8]
	str	q20, [x16]
	ldr	q21, [x2, x7]
	str	q21, [x17]
	ldr	q22, [x2, x6]
	str	q22, [x5]
	bne	.L25

If you adjust costs however you'd get this:

.L25:
	ldr	q7, [x14]
	add	x14, x14, 128
	add	x4, x4, 128
	str	q7, [x4, -128]
	ldr	q16, [x14, -112]
	str	q16, [x4, -112]
	ldr	q17, [x14, -96]
	str	q17, [x4, -96]
	ldr	q18, [x14, -80]
	str	q18, [x4, -80]
	ldr	q19, [x14, -64]
	str	q19, [x4, -64]
	ldr	q20, [x14, -48]
	str	q20, [x4, -48]
	ldr	q21, [x14, -32]
	str	q21, [x4, -32]
	ldr	q22, [x14, -16]
	cmp	x14, x9
	str	q22, [x4, -16]
	bne	.L25

So it seems to me using existing cost mechanisms is always preferable, even if you
currently can't differentiate between loads and stores.

Wilco

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
  2018-02-15 14:20 [PING][PATCH v3] Disable reg offset in quad-word store for Falkor Wilco Dijkstra
@ 2018-02-16  4:42 ` Siddhesh Poyarekar
  2018-02-19 12:58   ` Wilco Dijkstra
  0 siblings, 1 reply; 4+ messages in thread
From: Siddhesh Poyarekar @ 2018-02-16  4:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
> So it seems to me using existing cost mechanisms is always preferable, even if you
> currently can't differentiate between loads and stores.

Luis is working on address cost adjustments among other things, so I
guess the path of least resistance for gcc8 is to have those adjustments
go in and then figure out how much improvement this patch (or separating
loads and stores) would get on top of that.  Would that be acceptable?

Siddhesh

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
  2018-02-16  4:42 ` Siddhesh Poyarekar
@ 2018-02-19 12:58   ` Wilco Dijkstra
  0 siblings, 0 replies; 4+ messages in thread
From: Wilco Dijkstra @ 2018-02-19 12:58 UTC (permalink / raw)
  To: siddhesh; +Cc: GCC Patches, nd

Siddhesh Poyarekar wrote:
> On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
>> So it seems to me using existing cost mechanisms is always preferable, even if you
>> currently can't differentiate between loads and stores.
>
> Luis is working on address cost adjustments among other things, so I
> guess the path of least resistance for gcc8 is to have those adjustments
> go in and then figure out how much improvement this patch (or separating
> loads and stores) would get on top of that.  Would that be acceptable?

Yes adjusting costs is not an issue as that's clearly something we need to improve
anyway. Having numbers for both approaches would be useful, however I think it's still
best to go with the cost approach for GCC8 as that should get most of the gain.

Wilco

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
  2018-02-09  7:33 [PATCH " Siddhesh Poyarekar
@ 2018-02-15  9:52 ` Siddhesh Poyarekar
  0 siblings, 0 replies; 4+ messages in thread
From: Siddhesh Poyarekar @ 2018-02-15  9:52 UTC (permalink / raw)
  To: gcc-patches
  Cc: wilson, kugan.vivekanandarajah, james.greenhalgh, richard.earnshaw

Ping!

On Friday 09 February 2018 01:02 PM, Siddhesh Poyarekar wrote:
> Hi,
> 
> Here's v3 of the patch to disable register offset addressing mode for
> stores of 128-bit values on Falkor because they're very costly.
> Following Kyrill's suggestion, I compared the codegen for a53 and
> found that the codegen was quite different.  Jim's original patch is
> the most minimal compromise for this and is also a cleaner temporary
> fix before I attempt to split address costs into loads and stores for
> gcc9.
> 
> So v3 is essentially a very slightly cleaned up version of v1 again,
> this time with confirmation that there are no codegen changes in
> CPU2017 on non-falkor builds; only the codegen for -mcpu=falkor is
> different.
> 
> ----
> 
> On Falkor, because of an idiosyncracy of how the pipelines are
> designed, a quad-word store using a reg+reg addressing mode is almost
> twice as slow as an add followed by a quad-word store with a single
> reg addressing mode.  So we get better performance if we disallow
> addressing modes using register offsets with quad-word stores.  This
> is the most minimal change for gcc8, I will volunteer to make a more
> lasting change for gcc9 where I split the addressing mode costs into
> loads and stores wherever possible and needed.
> 
> This patch improves fpspeed by 0.17% and intspeed by 0.62% in CPU2017,
> with xalancbmk_s (3.84%) wrf_s (1.46%) and mcf_s (1.62%) being the
> biggest winners.  There were no regressions beyond 0.4%.
> 
> 2018-xx-xx  Jim Wilson  <jim.wilson@linaro.org>
>             Kugan Vivenakandarajah  <kugan.vivekanandarajah@linaro.org>
> 	    Siddhesh Poyarekar  <siddhesh@sourceware.org>
> 
> 	gcc/
> 	* config/aarch64/aarch64-protos.h (aarch64_movti_target_operand_p):
> 	New.
> 	* config/aarch64/aarch64-simd.md (aarch64_simd_mov<mode>): Use Utf.
> 	* config/aarch64/aarch64-tuning-flags.def
> 	(SLOW_REGOFFSET_QUADWORD_STORE): New.
> 	* config/aarch64/aarch64.c (qdf24xx_tunings): Add
> 	SLOW_REGOFFSET_QUADWORD_STORE to tuning flags.
> 	(aarch64_movti_target_operand_p): New.
> 	* config/aarch64/aarch64.md (movti_aarch64): Use Utf.
> 	(movtf_aarch64): Likewise.
> 	* config/aarch64/constraints.md (Utf): New.
> 
> 	gcc/testsuite
> 	* gcc.target/aarch64/pr82533.c: New test case.
> ---
>  gcc/config/aarch64/aarch64-protos.h         |  1 +
>  gcc/config/aarch64/aarch64-simd.md          |  4 ++--
>  gcc/config/aarch64/aarch64-tuning-flags.def |  4 ++++
>  gcc/config/aarch64/aarch64.c                | 14 +++++++++++++-
>  gcc/config/aarch64/aarch64.md               |  8 ++++----
>  gcc/config/aarch64/constraints.md           |  6 ++++++
>  gcc/testsuite/gcc.target/aarch64/pr82533.c  | 11 +++++++++++
>  7 files changed, 41 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr82533.c
> 
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index cda2895d28e..5a0323deb1e 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -433,6 +433,7 @@ bool aarch64_simd_mem_operand_p (rtx);
>  bool aarch64_sve_ld1r_operand_p (rtx);
>  bool aarch64_sve_ldr_operand_p (rtx);
>  bool aarch64_sve_struct_memory_operand_p (rtx);
> +bool aarch64_movti_target_operand_p (rtx);
>  rtx aarch64_simd_vect_par_cnst_half (machine_mode, int, bool);
>  rtx aarch64_tls_get_addr (void);
>  tree aarch64_fold_builtin (tree, int, tree *, bool);
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 3d1f6a01cb7..f7daac3e28d 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -131,9 +131,9 @@
>  
>  (define_insn "*aarch64_simd_mov<VQ:mode>"
>    [(set (match_operand:VQ 0 "nonimmediate_operand"
> -		"=w, Umq,  m,  w, ?r, ?w, ?r, w")
> +		"=w, Umq, Utf,  w, ?r, ?w, ?r, w")
>  	(match_operand:VQ 1 "general_operand"
> -		"m,  Dz, w,  w,  w,  r,  r, Dn"))]
> +		"m,  Dz,    w,  w,  w,  r,  r, Dn"))]
>    "TARGET_SIMD
>     && (register_operand (operands[0], <MODE>mode)
>         || aarch64_simd_reg_or_zero (operands[1], <MODE>mode))"
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ea9ead234cb..04baf5b6de6 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -41,4 +41,8 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
>     are not considered cheap.  */
>  AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>  
> +/* Don't use a register offset in a memory address for a quad-word store.  */
> +AARCH64_EXTRA_TUNING_OPTION ("slow_regoffset_quadword_store",
> +			     SLOW_REGOFFSET_QUADWORD_STORE)
> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 228fd1b908d..c0a05598415 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -875,7 +875,7 @@ static const struct tune_params qdf24xx_tunings =
>    2,	/* min_div_recip_mul_df.  */
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE),	/* tune_flags.  */
>    &qdf24xx_prefetch_tune
>  };
>  
> @@ -13634,6 +13634,18 @@ aarch64_sve_struct_memory_operand_p (rtx op)
>  	  && offset_4bit_signed_scaled_p (SVE_BYTE_MODE, last));
>  }
>  
> +/* Return TRUE if OP uses an efficient memory address for quad-word target.  */
> +bool
> +aarch64_movti_target_operand_p (rtx op)
> +{
> +  if (! optimize_size
> +      && (aarch64_tune_params.extra_tuning_flags
> +	  & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE))
> +    return MEM_P (op) && ! (GET_CODE (XEXP (op, 0)) == PLUS
> +			    && ! CONST_INT_P (XEXP (XEXP (op, 0), 1)));
> +  return MEM_P (op);
> +}
> +
>  /* Emit a register copy from operand to operand, taking care not to
>     early-clobber source registers in the process.
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 5a2a9309a3b..1e6edcf51f2 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1080,9 +1080,9 @@
>  
>  (define_insn "*movti_aarch64"
>    [(set (match_operand:TI 0
> -	 "nonimmediate_operand"  "=   r,w, r,w,r,m,m,w,m")
> +	 "nonimmediate_operand"  "=   r,w, r,w,r,m,m,w,Utf")
>  	(match_operand:TI 1
> -	 "aarch64_movti_operand" " rUti,r, w,w,m,r,Z,m,w"))]
> +	 "aarch64_movti_operand" " rUti,r, w,w,m,r,Z,m,  w"))]
>    "(register_operand (operands[0], TImode)
>      || aarch64_reg_or_zero (operands[1], TImode))"
>    "@
> @@ -1227,9 +1227,9 @@
>  
>  (define_insn "*movtf_aarch64"
>    [(set (match_operand:TF 0
> -	 "nonimmediate_operand" "=w,?&r,w ,?r,w,?w,w,m,?r,m ,m")
> +	 "nonimmediate_operand" "=w,?&r,w ,?r,w,?w,w,Utf,?r,m ,m")
>  	(match_operand:TF 1
> -	 "general_operand"      " w,?r, ?r,w ,Y,Y ,m,w,m ,?r,Y"))]
> +	 "general_operand"      " w,?r, ?r,w ,Y,Y ,m,w  ,m ,?r,Y"))]
>    "TARGET_FLOAT && (register_operand (operands[0], TFmode)
>      || aarch64_reg_or_fp_zero (operands[1], TFmode))"
>    "@
> diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
> index f052103e859..35aa62996ae 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -235,6 +235,12 @@
>    (and (match_code "mem")
>         (match_test "aarch64_sve_ldr_operand_p (op)")))
>  
> +(define_memory_constraint "Utf"
> +  "@internal
> +   An efficient memory address for a quad-word target operand."
> +  (and (match_code "mem")
> +       (match_test "aarch64_movti_target_operand_p (op)")))
> +
>  (define_memory_constraint "Utv"
>    "@internal
>     An address valid for loading/storing opaque structure
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr82533.c b/gcc/testsuite/gcc.target/aarch64/pr82533.c
> new file mode 100644
> index 00000000000..fa28ffac03a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr82533.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mcpu=falkor -O2 -ftree-vectorize" } */
> +
> +void
> +copy (int N, double *c, double *a)
> +{
> +  for (int i = 0; i < N; ++i)
> +    c[i] = a[i];
> +}
> +
> +/* { dg-final { scan-assembler-not "str\tq\[0-9\]+, \\\[x\[0-9\]+, x\[0-9\]+\\\]" } } */
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-02-19 12:58 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-15 14:20 [PING][PATCH v3] Disable reg offset in quad-word store for Falkor Wilco Dijkstra
2018-02-16  4:42 ` Siddhesh Poyarekar
2018-02-19 12:58   ` Wilco Dijkstra
  -- strict thread matches above, loose matches on Subject: below --
2018-02-09  7:33 [PATCH " Siddhesh Poyarekar
2018-02-15  9:52 ` [PING][PATCH " Siddhesh Poyarekar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).