public inbox for gcc-patches@gcc.gnu.org
* Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles
@ 2019-09-17 16:50 Richard Sandiford
  2019-09-17 18:35 ` Toon Moene
  2019-09-20 15:39 ` Vladimir Makarov
  0 siblings, 2 replies; 3+ messages in thread
From: Richard Sandiford @ 2019-09-17 16:50 UTC (permalink / raw)
  To: gcc-patches; +Cc: vmakarov

If an insn requires two operands to be tied, and the input operand dies
in the insn, IRA acts as though there were a copy from the input to the
output with the same execution frequency as the insn.  Allocating the
same register to the input and the output then saves the cost of a move.

If there is no such tie, but an input operand nevertheless dies
in the insn, IRA creates a similar move, but with an eighth of the
frequency.  This helps to ensure that chains of instructions reuse
registers in a natural way, rather than using arbitrarily different
registers for no reason.

This heuristic seems to work well in the vast majority of cases.
However, for SVE, the handling of untied operands ends up creating
copies between dying predicate registers and vector outputs, even though
vector and predicate registers are distinct classes and can never be
tied.  This is a particular problem because the dying predicate tends
to be the loop control predicate, which is used by most instructions
in a vector loop and so (rightly) has a very high allocation priority.
Any copies involving the loop predicate therefore tend to get processed
before copies involving only vector registers.  The end result is that
we tend to allocate the output of the last vector instruction in a loop
ahead of its natural place in the allocation order and don't benefit
from chains created between vector registers.

This patch tries to avoid the problem by not adding register shuffle
copies if there appears to be no chance that the two operands could be
allocated to the same register.

Tested on aarch64-linux-gnu and x86_64-linux-gnu (which needs the
pr82361-*.c tweak I just posted).  Also tested by comparing the
gcc.c-torture, gcc.dg and g++.dg output for one target per CPU
directory at -O2 -ftree-vectorize.  The table below summarises
the tests that had more or fewer assembly lines after the patch
(always a bad metric, but it's just to get a flavour):

Target                 Tests  Delta   Best  Worst Median
======                 =====  =====   ====  ===== ======
aarch64-linux-gnu         11    -16    -15      9     -1
aarch64_be-linux-gnu      10    -16    -15      9     -1
alpha-linux-gnu           10     12     -5     10      2
amdgcn-amdhsa             51     48   -138    174     -1
arc-elf                    4     -3     -2      2     -2
bfin-elf                  53     19    -13      5      1
cris-elf                   1     -1     -1     -1     -1
frv-linux-gnu            139     22    -16     12     -1
h8300-elf                  2      2      1      1      1
hppa64-hp-hpux11.23       20      5     -3      3      1
i686-apple-darwin         42   -305    -87     11     -1
i686-pc-linux-gnu         32  -3205   -402     15    -15
ia64-linux-gnu           774   -636   -274    261     -1
m32r-elf                 179  -1386   -161     31     -2
m68k-linux-gnu            21    -64    -18      4     -2
mipsel-linux-gnu          61      8    -25     34      1
mipsisa64-linux-gnu       40    -11    -10     13     -1
mmix                       4      1     -2      6     -2
mn10300-elf                9      0     -6      5      1
pdp11                     13     -6     -5      2      1
powerpc-ibm-aix7.0       152   1736   -142    481     -1
powerpc64-linux-gnu      111  -1366   -370     39     -1
powerpc64le-linux-gnu    136  -1183   -409    241     -1
pru-elf                   45     24     -6     17      1
riscv32-elf               22    -83    -62     10     -2
riscv64-elf               26    -78    -19      4     -1
s390-linux-gnu            43   -145    -32     16     -1
s390x-linux-gnu           40   -285    -96     16     -1
visium-elf               125   -317    -50     10     -1
x86_64-darwin             40    -83    -42     13     -1
x86_64-linux-gnu          39   -577   -164     23     -1
xstormy16-elf             26    -12     -4      4     -1

Richard


2019-09-17  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* ira-conflicts.c (can_use_same_reg_p): New function.
	(process_reg_shuffles): Take an insn parameter.  Ignore cases
	in which input operand op_num could seemingly never be allocated
	to the same register as the destination.
	(add_insn_allocno_copies): Update call to process_reg_shuffles.

gcc/testsuite/
	* gcc.target/aarch64/sve/cond_convert_1.c: Remove XFAILs.
	* gcc.target/aarch64/sve/cond_convert_4.c: Likewise.
	* gcc.target/aarch64/sve/cond_unary_2.c: Likewise.

Index: gcc/ira-conflicts.c
===================================================================
--- gcc/ira-conflicts.c	2019-09-12 10:52:55.000000000 +0100
+++ gcc/ira-conflicts.c	2019-09-17 17:43:06.426031052 +0100
@@ -325,12 +325,37 @@ process_regs_for_copy (rtx reg1, rtx reg
   return true;
 }
 
-/* Process all of the output registers of the current insn which are
-   not bound (BOUND_P) and the input register REG (its operand number
+/* Return true if output operand OUTPUT and input operand INPUT of
+   INSN can use the same register class for at least one alternative.
+   INSN is already described in recog_data and recog_op_alt.  */
+static bool
+can_use_same_reg_p (rtx_insn *insn, int output, int input)
+{
+  alternative_mask preferred = get_preferred_alternatives (insn);
+  for (int nalt = 0; nalt < recog_data.n_alternatives; nalt++)
+    {
+      if (!TEST_BIT (preferred, nalt))
+	continue;
+
+      const operand_alternative *op_alt
+	= &recog_op_alt[nalt * recog_data.n_operands];
+      if (op_alt[input].matches == output)
+	return true;
+
+      if (ira_reg_class_intersect[op_alt[input].cl][op_alt[output].cl]
+	  != NO_REGS)
+	return true;
+    }
+  return false;
+}
+
+/* Process all of the output registers of the current insn (INSN) which
+   are not bound (BOUND_P) and the input register REG (its operand number
    OP_NUM) which dies in the insn as if there were a move insn between
    them with frequency FREQ.  */
 static void
-process_reg_shuffles (rtx reg, int op_num, int freq, bool *bound_p)
+process_reg_shuffles (rtx_insn *insn, rtx reg, int op_num, int freq,
+		      bool *bound_p)
 {
   int i;
   rtx another_reg;
@@ -342,7 +367,13 @@ process_reg_shuffles (rtx reg, int op_nu
 
       if (!REG_SUBREG_P (another_reg) || op_num == i
 	  || recog_data.operand_type[i] != OP_OUT
-	  || bound_p[i])
+	  || bound_p[i]
+	  || (!can_use_same_reg_p (insn, i, op_num)
+	      && (recog_data.constraints[op_num][0] != '%'
+		  || !can_use_same_reg_p (insn, i, op_num + 1))
+	      && (op_num == 0
+		  || recog_data.constraints[op_num - 1][0] != '%'
+		  || !can_use_same_reg_p (insn, i, op_num - 1))))
 	continue;
 
       process_regs_for_copy (reg, another_reg, false, NULL, freq);
@@ -412,7 +443,8 @@ add_insn_allocno_copies (rtx_insn *insn)
 	   the corresponding allocno copies.  The cost will not
 	   correspond to a real move insn cost, so make the frequency
 	   smaller.  */
-	process_reg_shuffles (operand, i, freq < 8 ? 1 : freq / 8, bound_p);
+	process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
+			      bound_p);
     }
 }
 
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-08-14 11:56:54.223221654 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-09-17 17:43:06.426031052 +0100
@@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-08-14 11:56:54.223221654 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-09-17 17:43:06.430031021 +0100
@@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-08-14 11:53:04.636898923 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-09-17 17:43:06.430031021 +0100
@@ -54,8 +54,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX for the
-   floating-point functions.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\t} 6 } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */


* Re: Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles
  2019-09-17 16:50 Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles Richard Sandiford
@ 2019-09-17 18:35 ` Toon Moene
  2019-09-20 15:39 ` Vladimir Makarov
  1 sibling, 0 replies; 3+ messages in thread
From: Toon Moene @ 2019-09-17 18:35 UTC (permalink / raw)
  To: gcc-patches

On 9/17/19 6:50 PM, Richard Sandiford wrote:

[ ... ]

> This patch tries to avoid the problem by not adding register shuffle
> copies if there appears to be no chance that the two operands could be
> allocated to the same register.

> The table below summarises
> the tests that had more or fewer assembly lines after the patch
> (always a bad metric, but it's just to get a flavour):
> 
> Target                 Tests  Delta   Best  Worst Median
> ======                 =====  =====   ====  ===== ======

> x86_64-linux-gnu          39   -577   -164     23     -1

Hmmm, this certainly sounds interesting enough to try on its own
merits, even if it's not committed by tomorrow morning ...

Fascinating analysis - thanks !

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


* Re: Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles
  2019-09-17 16:50 Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles Richard Sandiford
  2019-09-17 18:35 ` Toon Moene
@ 2019-09-20 15:39 ` Vladimir Makarov
  1 sibling, 0 replies; 3+ messages in thread
From: Vladimir Makarov @ 2019-09-20 15:39 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 09/17/2019 12:50 PM, Richard Sandiford wrote:
[ ... ]

As usual, a great report and patch explanation, Richard.
Thank you for finding this problem and resolving it.
The patch is ok for the trunk.

>
> 2019-09-17  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
> 	* ira-conflicts.c (can_use_same_reg_p): New function.
> 	(process_reg_shuffles): Take an insn parameter.  Ignore cases
> 	in which input operand op_num could seemingly never be allocated
> 	to the same register as the destination.
> 	(add_insn_allocno_copies): Update call to process_reg_shuffles.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/sve/cond_convert_1.c: Remove XFAILs.
> 	* gcc.target/aarch64/sve/cond_convert_4.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_unary_2.c: Likewise.
>
> Index: gcc/ira-conflicts.c
> ===================================================================
> --- gcc/ira-conflicts.c	2019-09-12 10:52:55.000000000 +0100
> +++ gcc/ira-conflicts.c	2019-09-17 17:43:06.426031052 +0100
> @@ -325,12 +325,37 @@ process_regs_for_copy (rtx reg1, rtx reg
>     return true;
>   }
>   
> -/* Process all of the output registers of the current insn which are
> -   not bound (BOUND_P) and the input register REG (its operand number
> +/* Return true if output operand OUTPUT and input operand INPUT of
> +   INSN can use the same register class for at least one alternative.
> +   INSN is already described in recog_data and recog_op_alt.  */
> +static bool
> +can_use_same_reg_p (rtx_insn *insn, int output, int input)
> +{
> +  alternative_mask preferred = get_preferred_alternatives (insn);
> +  for (int nalt = 0; nalt < recog_data.n_alternatives; nalt++)
> +    {
> +      if (!TEST_BIT (preferred, nalt))
> +	continue;
> +
> +      const operand_alternative *op_alt
> +	= &recog_op_alt[nalt * recog_data.n_operands];
> +      if (op_alt[input].matches == output)
> +	return true;
> +
> +      if (ira_reg_class_intersect[op_alt[input].cl][op_alt[output].cl]
> +	  != NO_REGS)
> +	return true;
> +    }
> +  return false;
> +}
> +
> +/* Process all of the output registers of the current insn (INSN) which
> +   are not bound (BOUND_P) and the input register REG (its operand number
>      OP_NUM) which dies in the insn as if there were a move insn between
>      them with frequency FREQ.  */
>   static void
> -process_reg_shuffles (rtx reg, int op_num, int freq, bool *bound_p)
> +process_reg_shuffles (rtx_insn *insn, rtx reg, int op_num, int freq,
> +		      bool *bound_p)
>   {
>     int i;
>     rtx another_reg;
> @@ -342,7 +367,13 @@ process_reg_shuffles (rtx reg, int op_nu
>   
>         if (!REG_SUBREG_P (another_reg) || op_num == i
>   	  || recog_data.operand_type[i] != OP_OUT
> -	  || bound_p[i])
> +	  || bound_p[i]
> +	  || (!can_use_same_reg_p (insn, i, op_num)
> +	      && (recog_data.constraints[op_num][0] != '%'
> +		  || !can_use_same_reg_p (insn, i, op_num + 1))
> +	      && (op_num == 0
> +		  || recog_data.constraints[op_num - 1][0] != '%'
> +		  || !can_use_same_reg_p (insn, i, op_num - 1))))
>   	continue;
>   
>         process_regs_for_copy (reg, another_reg, false, NULL, freq);
> @@ -412,7 +443,8 @@ add_insn_allocno_copies (rtx_insn *insn)
>   	   the corresponding allocno copies.  The cost will not
>   	   correspond to a real move insn cost, so make the frequency
>   	   smaller.  */
> -	process_reg_shuffles (operand, i, freq < 8 ? 1 : freq / 8, bound_p);
> +	process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
> +			      bound_p);
>       }
>   }
>   
> Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-08-14 11:56:54.223221654 +0100
> +++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-09-17 17:43:06.426031052 +0100
> @@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
>   /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>   
>   /* { dg-final { scan-assembler-not {\tmov\tz} } } */
> -/* At the moment we don't manage to avoid using MOVPRFX.  */
> -/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
>   /* { dg-final { scan-assembler-not {\tsel\t} } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-08-14 11:56:54.223221654 +0100
> +++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-09-17 17:43:06.430031021 +0100
> @@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
>   /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>   
>   /* { dg-final { scan-assembler-not {\tmov\tz} } } */
> -/* At the moment we don't manage to avoid using MOVPRFX.  */
> -/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
>   /* { dg-final { scan-assembler-not {\tsel\t} } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-08-14 11:53:04.636898923 +0100
> +++ gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-09-17 17:43:06.430031021 +0100
> @@ -54,8 +54,5 @@ TEST_ALL (DEF_LOOP)
>   /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>   
>   /* { dg-final { scan-assembler-not {\tmov\tz} } } */
> -/* At the moment we don't manage to avoid using MOVPRFX for the
> -   floating-point functions.  */
> -/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\t} 6 } } */
> +/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
>   /* { dg-final { scan-assembler-not {\tsel\t} } } */


