Re: [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15
       [not found] <52e2a013.89e8440a.49e1.fffffee3SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-01-28 12:10 ` Ramana Radhakrishnan
  2014-04-23 12:51   ` Christophe Lyon
  0 siblings, 1 reply; 4+ messages in thread
From: Ramana Radhakrishnan @ 2014-01-28 12:10 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc-patches

On Fri, Jan 24, 2014 at 5:16 PM, Ian Bolton <ian.bolton@arm.com> wrote:
> Hi there!
>
> An existing optimisation for Thumb-2 converts t32 encodings to
> t16 encodings to reduce codesize, at the expense of causing
> redundant flag setting for ADD, AND, etc.  This redundant flag
> setting can have negative performance impact on cortex-a15.
>
> This patch introduces two new tuning options so that the conversion
> from t32 to t16, which takes place in thumb2_reorg, can be suppressed
> for cortex-a15.
>
> To maintain some of the original benefit (reduced codesize), the
> suppression is only done where the enclosing basic block is deemed
> worthy of optimising for speed.
>
> This tested with no regressions and performance has improved for
> the workloads tested on cortex-a15.  (It might be beneficial to
> other processors too, but that has not been investigated yet.)
>
> OK for stage 1?

This is OK for stage1.

Ramana

>
> Cheers,
> Ian
>
>
> 2014-01-24  Ian Bolton  <ian.bolton@arm.com>
>
> gcc/
>         * config/arm/arm-protos.h (tune_params): New struct members.
>         * config/arm/arm.c: Initialise tune_params per processor.
>         (thumb2_reorg): Suppress conversion from t32 to t16 when
>         optimizing for speed, based on new tune_params.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15
  2014-01-28 12:10 ` [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15 Ramana Radhakrishnan
@ 2014-04-23 12:51   ` Christophe Lyon
  2014-04-24 17:24     ` Ian Bolton
  0 siblings, 1 reply; 4+ messages in thread
From: Christophe Lyon @ 2014-04-23 12:51 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Ian Bolton, gcc-patches

Hi,

On 28 January 2014 13:10, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
> On Fri, Jan 24, 2014 at 5:16 PM, Ian Bolton <ian.bolton@arm.com> wrote:
>> Hi there!
>>
>> An existing optimisation for Thumb-2 converts t32 encodings to
>> t16 encodings to reduce codesize, at the expense of causing
>> redundant flag setting for ADD, AND, etc.  This redundant flag
>> setting can have negative performance impact on cortex-a15.
>>
>> This patch introduces two new tuning options so that the conversion
>> from t32 to t16, which takes place in thumb2_reorg, can be suppressed
>> for cortex-a15.
>>
>> To maintain some of the original benefit (reduced codesize), the
>> suppression is only done where the enclosing basic block is deemed
>> worthy of optimising for speed.
>>
>> This tested with no regressions and performance has improved for
>> the workloads tested on cortex-a15.  (It might be beneficial to
>> other processors too, but that has not been investigated yet.)
>>
>> OK for stage 1?
>
> This is OK for stage1.
>
> Ramana
>
>>
>> Cheers,
>> Ian
>>
>>
>> 2014-01-24  Ian Bolton  <ian.bolton@arm.com>
>>
>> gcc/
>>         * config/arm/arm-protos.h (tune_params): New struct members.
>>         * config/arm/arm.c: Initialise tune_params per processor.
>>         (thumb2_reorg): Suppress conversion from t32 to t16 when
>>         optimizing for speed, based on new tune_params.

This causes
gcc.target/arm/negdi-1.c
gcc.target/arm/negdi-2.c
to FAIL when GCC is configured as:
--with-mode=ar
--with-cpu=cortex-a15
--with-fpu=neon-vfpv4

both tests used to PASS.
(see http://cbuild.validation.linaro.org/build/cross-validation/gcc/209561/report-build-info.html)

Christophe.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15
  2014-04-23 12:51   ` Christophe Lyon
@ 2014-04-24 17:24     ` Ian Bolton
  0 siblings, 0 replies; 4+ messages in thread
From: Ian Bolton @ 2014-04-24 17:24 UTC (permalink / raw)
  To: 'Christophe Lyon', Ramana Radhakrishnan; +Cc: gcc-patches

> Hi,
> 
> On 28 January 2014 13:10, Ramana Radhakrishnan
> <ramana.gcc@googlemail.com> wrote:
> > On Fri, Jan 24, 2014 at 5:16 PM, Ian Bolton <ian.bolton@arm.com>
> wrote:
> >> Hi there!
> >>
> >> An existing optimisation for Thumb-2 converts t32 encodings to
> >> t16 encodings to reduce codesize, at the expense of causing
> >> redundant flag setting for ADD, AND, etc.  This redundant flag
> >> setting can have negative performance impact on cortex-a15.
> >>
> >> This patch introduces two new tuning options so that the conversion
> >> from t32 to t16, which takes place in thumb2_reorg, can be
> suppressed
> >> for cortex-a15.
> >>
> >> To maintain some of the original benefit (reduced codesize), the
> >> suppression is only done where the enclosing basic block is deemed
> >> worthy of optimising for speed.
> >>
> >> This tested with no regressions and performance has improved for
> >> the workloads tested on cortex-a15.  (It might be beneficial to
> >> other processors too, but that has not been investigated yet.)
> >>
> >> OK for stage 1?
> >
> > This is OK for stage1.
> >
> > Ramana
> >
> >>
> >> Cheers,
> >> Ian
> >>
> >>
> >> 2014-01-24  Ian Bolton  <ian.bolton@arm.com>
> >>
> >> gcc/
> >>         * config/arm/arm-protos.h (tune_params): New struct members.
> >>         * config/arm/arm.c: Initialise tune_params per processor.
> >>         (thumb2_reorg): Suppress conversion from t32 to t16 when
> >>         optimizing for speed, based on new tune_params.
> 
> This causes
> gcc.target/arm/negdi-1.c
> gcc.target/arm/negdi-2.c
> to FAIL when GCC is configured as:
> --with-mode=ar
> --with-cpu=cortex-a15
> --with-fpu=neon-vfpv4
> 
> both tests used to PASS.
> (see http://cbuild.validation.linaro.org/build/cross-
> validation/gcc/209561/report-build-info.html)

Hi Christophe,

I don't recall the failure when I did the work, but I see now that
the test is looking for negs when my patch is specifically trying to
avoid flag-setting operations.

So we are now getting an rsb instead of a negs, as intended, and the
test needs fixing!

Open question: Should I look for either rsb or negs in a single
scan-assembler or look for different ones dependent on the cpu in
question or just not run the test for cortex-a15?

Cheers,
Ian



^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15
@ 2014-01-24 17:16 Ian Bolton
  0 siblings, 0 replies; 4+ messages in thread
From: Ian Bolton @ 2014-01-24 17:16 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

Hi there!

An existing optimisation for Thumb-2 converts t32 encodings to
t16 encodings to reduce codesize, at the expense of causing
redundant flag setting for ADD, AND, etc.  This redundant flag
setting can have negative performance impact on cortex-a15.

This patch introduces two new tuning options so that the conversion
from t32 to t16, which takes place in thumb2_reorg, can be suppressed
for cortex-a15.

To maintain some of the original benefit (reduced codesize), the
suppression is only done where the enclosing basic block is deemed
worthy of optimising for speed.

This tested with no regressions and performance has improved for
the workloads tested on cortex-a15.  (It might be beneficial to
other processors too, but that has not been investigated yet.)

OK for stage 1?

Cheers,
Ian


2014-01-24  Ian Bolton  <ian.bolton@arm.com>

gcc/
	* config/arm/arm-protos.h (tune_params): New struct members.
	* config/arm/arm.c: Initialise tune_params per processor.
	(thumb2_reorg): Suppress conversion from t32 to t16 when
	optimizing for speed, based on new tune_params.

[-- Attachment #2: aarch32_prefer_t32_encodings_patch-v5.txt --]
[-- Type: text/plain, Size: 11783 bytes --]

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 13874ee..74645ee 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -272,6 +272,11 @@ struct tune_params
   const struct cpu_vec_costs* vec_costs;
   /* Prefer Neon for 64-bit bitops.  */
   bool prefer_neon_for_64bits;
+  /* Prefer 32-bit encoding instead of flag-setting 16-bit encoding.  */
+  bool disparage_flag_setting_t16_encodings;
+  /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
+     would be set.  */
+  bool disparage_partial_flag_setting_t16_encodings;
 };
 
 extern const struct tune_params *current_tune;
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index fc81bf6..1ebaf84 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -1481,7 +1481,8 @@ const struct tune_params arm_slowmul_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1497,7 +1498,8 @@ const struct tune_params arm_fastmul_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1516,7 +1518,8 @@ const struct tune_params arm_strongarm_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1532,7 +1535,8 @@ const struct tune_params arm_xscale_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_9e_tune =
@@ -1548,7 +1552,8 @@ const struct tune_params arm_9e_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_v6t2_tune =
@@ -1564,7 +1569,8 @@ const struct tune_params arm_v6t2_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1581,7 +1587,8 @@ const struct tune_params arm_cortex_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1597,7 +1604,8 @@ const struct tune_params arm_cortex_a7_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false						/* Prefer Neon for 64-bits bitops.  */
+  false,					/* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
@@ -1613,7 +1621,8 @@ const struct tune_params arm_cortex_a15_tune =
   true,						/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  true, true                                    /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
@@ -1629,7 +1638,8 @@ const struct tune_params arm_cortex_a53_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false						/* Prefer Neon for 64-bits bitops.  */
+  false,					/* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1648,7 +1658,8 @@ const struct tune_params arm_cortex_a5_tune =
   false,					/* Prefer LDRD/STRD.  */
   {false, false},				/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1664,7 +1675,8 @@ const struct tune_params arm_cortex_a9_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1703,7 +1715,8 @@ const struct tune_params arm_v7m_tune =
   false,					/* Prefer LDRD/STRD.  */
   {false, false},				/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1721,7 +1734,8 @@ const struct tune_params arm_v6m_tune =
   false,					/* Prefer LDRD/STRD.  */
   {false, false},				/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1737,7 +1751,8 @@ const struct tune_params arm_fa726te_tune =
   false,					/* Prefer LDRD/STRD.  */
   {true, true},					/* Prefer non short circuit.  */
   &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false                                         /* Prefer Neon for 64-bits bitops.  */
+  false,                                        /* Prefer Neon for 64-bits bitops.  */
+  false, false                                  /* Prefer 32-bit encodings.  */
 };
 
 
@@ -16763,9 +16778,20 @@ thumb2_reorg (void)
   compute_bb_for_insn ();
   df_analyze ();
 
+  enum Convert_Action {SKIP, CONV, SWAP_CONV};
+
   FOR_EACH_BB_FN (bb, cfun)
     {
+      if (current_tune->disparage_flag_setting_t16_encodings
+	  && optimize_bb_for_speed_p (bb))
+	continue;
+
       rtx insn;
+      Convert_Action action = SKIP;
+      Convert_Action action_for_partial_flag_setting
+	= (current_tune->disparage_partial_flag_setting_t16_encodings
+	   && optimize_bb_for_speed_p (bb))
+	  ? SKIP : CONV;
 
       COPY_REG_SET (&live, DF_LR_OUT (bb));
       df_simulate_initialize_backwards (bb, &live);
@@ -16775,7 +16801,7 @@ thumb2_reorg (void)
 	      && !REGNO_REG_SET_P (&live, CC_REGNUM)
 	      && GET_CODE (PATTERN (insn)) == SET)
 	    {
-	      enum {SKIP, CONV, SWAP_CONV} action = SKIP;
+	      action = SKIP;
 	      rtx pat = PATTERN (insn);
 	      rtx dst = XEXP (pat, 0);
 	      rtx src = XEXP (pat, 1);
@@ -16856,10 +16882,11 @@ thumb2_reorg (void)
 		      /* ANDS <Rdn>,<Rm>  */
 		      if (rtx_equal_p (dst, op0)
 			  && low_register_operand (op1, SImode))
-			action = CONV;
+			action = action_for_partial_flag_setting;
 		      else if (rtx_equal_p (dst, op1)
 			       && low_register_operand (op0, SImode))
-			action = SWAP_CONV;
+			action = action_for_partial_flag_setting == SKIP
+				 ? SKIP : SWAP_CONV;
 		      break;
 
 		    case ASHIFTRT:
@@ -16870,26 +16897,30 @@ thumb2_reorg (void)
 		      /* LSLS <Rdn>,<Rm> */
 		      if (rtx_equal_p (dst, op0)
 			  && low_register_operand (op1, SImode))
-			action = CONV;
+			action = action_for_partial_flag_setting;
 		      /* ASRS <Rd>,<Rm>,#<imm5> */
 		      /* LSRS <Rd>,<Rm>,#<imm5> */
 		      /* LSLS <Rd>,<Rm>,#<imm5> */
 		      else if (low_register_operand (op0, SImode)
 			       && CONST_INT_P (op1)
 			       && IN_RANGE (INTVAL (op1), 0, 31))
-			action = CONV;
+			action = action_for_partial_flag_setting;
 		      break;
 
 		    case ROTATERT:
 		      /* RORS <Rdn>,<Rm>  */
 		      if (rtx_equal_p (dst, op0)
 			  && low_register_operand (op1, SImode))
-			action = CONV;
+			action = action_for_partial_flag_setting;
 		      break;
 
 		    case NOT:
-		    case NEG:
 		      /* MVNS <Rd>,<Rm>  */
+		      if (low_register_operand (op0, SImode))
+			action = action_for_partial_flag_setting;
+		      break;
+
+		    case NEG:
 		      /* NEGS <Rd>,<Rm>  (a.k.a RSBS)  */
 		      if (low_register_operand (op0, SImode))
 			action = CONV;
@@ -16899,7 +16930,7 @@ thumb2_reorg (void)
 		      /* MOVS <Rd>,#<imm8>  */
 		      if (CONST_INT_P (src)
 			  && IN_RANGE (INTVAL (src), 0, 255))
-			action = CONV;
+			action = action_for_partial_flag_setting;
 		      break;
 
 		    case REG:

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2014-04-24 17:18 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <52e2a013.89e8440a.49e1.fffffee3SMTPIN_ADDED_BROKEN@mx.google.com>
2014-01-28 12:10 ` [PATCH, ARM] Suppress Redundant Flag Setting for Cortex-A15 Ramana Radhakrishnan
2014-04-23 12:51   ` Christophe Lyon
2014-04-24 17:24     ` Ian Bolton
2014-01-24 17:16 Ian Bolton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).