public inbox for gcc-patches@gcc.gnu.org
* [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
@ 2011-03-24  3:57 Chung-Lin Tang
  2011-03-24 10:52 ` Richard Earnshaw
  0 siblings, 1 reply; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-24  3:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Earnshaw, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 1512 bytes --]

Hi,
PR48250 happens under TARGET_NEON, where DImode is included among the
valid NEON modes. This restricts the range of legitimate constant
indexes to multiples of 4 (coproc load/store), so when
arm_legitimize_reload_address() tries to decompose a [reg+index] reload
address into [(reg+index_high)+index_low], it can cause an ICE later
when the 'index_low' part is not aligned to 4.

I'm not sure why the current DImode index is computed as
low = ((val & 0xf) ^ 0x8) - 0x8;  sign-extending the low nibble into
negative values, then compensating in the high part, actually creates
indexes that are further off.  E.g. in the supplied testcase, [sp+13]
was turned into [(sp+16)-3].
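
For illustration, here is a standalone sketch (plain C, not the GCC
source) of what the current computation does to the testcase offset:

  #include <stdio.h>

  int main (void)
  {
    long val = 13;                             /* the [sp+13] case */
    long low = ((val & 0xf) ^ 0x8) - 0x8;      /* sign-extends to -3 */
    long high = val - low;                     /* compensates to 16 */
    printf ("low=%ld high=%ld\n", low, high);  /* i.e. [(sp+16)-3] */
    return 0;
  }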

My patch changes the index decomposition to a more straightforward
method; it also suggests that the way the other reload address indexes
are split using AND masks is not the most effective.  Since the address
is computed by addition, subtracting away the parts to obtain low and
high should be the optimal way of giving the largest computable index
range.
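
To illustrate, a sketch of the subtraction-based decomposition under
the same assumptions as the patch below (a +-255 index range, stepped
by 4 for NEON/Thumb-2):

  #include <stdio.h>

  int main (void)
  {
    long val = 13;                         /* the [sp+13] case again */
    long low = (-255 <= val && val <= 255
                ? val
                : (val > 0 ? 255 : -255));    /* clamp to +-255 */
    low &= ~0x3;                              /* align to 4: now 12 */
    long high = val - low;                    /* remainder: 1 */
    printf ("low=%ld high=%ld\n", low, high); /* i.e. [(sp+1)+12] */
    return 0;
  }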

I have included a few Thumb-2 bits in the patch; I know currently
arm_legitimize_reload_address() is only used under TARGET_ARM, but I
guess it might eventually be turned into TARGET_32BIT.

Cross-tested on QEMU without regressions; is this okay?

Thanks,
Chung-Lin

2011-03-24  Chung-Lin Tang  <cltang@codesourcery.com>

	PR target/48250
	* config/arm/arm.c (arm_legitimize_reload_address): Adjust
	DImode constant index decomposing. Mask out lower 2-bits for
	NEON and Thumb-2.

	testsuite/
	* gcc.target/arm/pr48250.c: New.

[-- Attachment #2: pr48250.diff --]
[-- Type: text/plain, Size: 1838 bytes --]

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 171379)
+++ config/arm/arm.c	(working copy)
@@ -6416,7 +6416,22 @@
       HOST_WIDE_INT low, high;
 
       if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
-	low = ((val & 0xf) ^ 0x8) - 0x8;
+	{
+	  /* ??? There may be more adjustments later for Thumb-2,
+	     which has an ldrd insn with +-1020 index range.  */
+	  int max_idx = 255;
+
+	  /* low == val, if val is within range [-max_idx, +max_idx].
+	     If not, low is set to the boundary +-max_idx.  */
+	  low = (-max_idx <= val && val <= max_idx
+		 ? val : (val > 0 ? max_idx : -max_idx));
+
+	  /* Thumb-2 ldrd, and NEON coprocessor load/store indexes
+	     are in steps of 4, so the two least significant bits
+	     need to be cleared to zero.  */
+	  if (TARGET_NEON || TARGET_THUMB2)
+	    low &= ~0x3;
+	}
       else if (TARGET_MAVERICK && TARGET_HARD_FLOAT)
 	/* Need to be careful, -256 is not a valid offset.  */
 	low = val >= 0 ? (val & 0xff) : -((-val) & 0xff);
Index: testsuite/gcc.target/arm/pr48250.c
===================================================================
--- testsuite/gcc.target/arm/pr48250.c	(revision 0)
+++ testsuite/gcc.target/arm/pr48250.c	(revision 0)
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O2" } */
+/* { dg-add-options arm_neon } */
+
+void bar();
+
+struct S
+{
+  unsigned int u32;
+  unsigned long long int u64;
+};
+
+void
+foo ()
+{
+  char a[100];
+  unsigned int ptr = (unsigned int) &a;
+  struct S *unaligned_S;
+  int index;
+
+  ptr = ptr + (ptr & 1 ? 0 : 1);
+  unaligned_S = (struct S *) ptr;
+
+  for (index = 0; index < 3; index++)
+    {
+      switch (index)
+	{
+	case 1:
+	  unaligned_S->u64 = 0;
+	  bar ();
+	}
+    }
+}


* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-24  3:57 [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing Chung-Lin Tang
@ 2011-03-24 10:52 ` Richard Earnshaw
  2011-03-29 10:27   ` Chung-Lin Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Earnshaw @ 2011-03-24 10:52 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Ramana Radhakrishnan


On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
> Hi,
> PR48250 happens under TARGET_NEON, where DImode is included among the
> valid NEON modes. This restricts the range of legitimate constant
> indexes to multiples of 4 (coproc load/store), so when
> arm_legitimize_reload_address() tries to decompose a [reg+index] reload
> address into [(reg+index_high)+index_low], it can cause an ICE later
> when the 'index_low' part is not aligned to 4.
> 
> I'm not sure why the current DImode index is computed as
> low = ((val & 0xf) ^ 0x8) - 0x8;  sign-extending the low nibble into
> negative values, then compensating in the high part, actually creates
> indexes that are further off.  E.g. in the supplied testcase, [sp+13]
> was turned into [(sp+16)-3].
> 

Hysterical Raisins... the code there was clearly written for the days
before we had LDRD in the architecture.  At that time the most efficient
way to load a 64-bit object was to use the LDM{ia,ib,da,db}
instructions.  The computation here was (I think) intended to try and
make the most efficient use of an add/sub instruction followed by
LDM/STM offsetting.  At that time the architecture had no unaligned
access either, so dealing with 64-bit objects that were less than 32-bit
aligned (in those days 32-bit was the maximum alignment) probably wasn't
considered, or couldn't even get through to reload.

> My patch changes the index decomposition to a more straightforward
> method; it also suggests that the way the other reload address indexes
> are split using AND masks is not the most effective.  Since the address
> is computed by addition, subtracting away the parts to obtain low and
> high should be the optimal way of giving the largest computable index
> range.
> 
> I have included a few Thumb-2 bits in the patch; I know currently
> arm_legitimize_reload_address() is only used under TARGET_ARM, but I
> guess it might eventually be turned into TARGET_32BIT.
> 

I think this needs to be looked at carefully on ARMv4/ARMv4T to check
that it doesn't cause regressions there when we don't have LDRD in the
instruction set.

> Cross-tested on QEMU without regressions; is this okay?
> 
> Thanks,
> Chung-Lin
> 
> 2011-03-24  Chung-Lin Tang  <cltang@codesourcery.com>
> 
> 	PR target/48250
> 	* config/arm/arm.c (arm_legitimize_reload_address): Adjust
> 	DImode constant index decomposing. Mask out lower 2-bits for
> 	NEON and Thumb-2.
> 
> 	testsuite/
> 	* gcc.target/arm/pr48250.c: New.

R.



* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-24 10:52 ` Richard Earnshaw
@ 2011-03-29 10:27   ` Chung-Lin Tang
  2011-03-29 14:35     ` Richard Earnshaw
  0 siblings, 1 reply; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-29 10:27 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 2926 bytes --]

On 2011/3/24 06:51 PM, Richard Earnshaw wrote:
> Hysterical Raisins... the code there was clearly written for the days
> before we had LDRD in the architecture.  At that time the most efficient
> way to load a 64-bit object was to use the LDM{ia,ib,da,db}
> instructions.  The computation here was (I think) intended to try and
> make the most efficient use of an add/sub instruction followed by
> LDM/STM offsetting.  At that time the architecture had no unaligned
> access either, so dealing with 64-bit objects that were less than 32-bit
> aligned (in those days 32-bit was the maximum alignment) probably wasn't
> considered, or couldn't even get through to reload.
>

I see it now. The code in output_move_double() returning assembly for
ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

I have changed the patch to let the new code handle the TARGET_LDRD case
only.  The pre-LDRD case is still handled by the original code, with an
additional & ~0x3 for aligning the offset to 4.

I've also added a comment for the pre-TARGET_LDRD case. Please see if
the description is accurate enough.

>> I have included a few Thumb-2 bits in the patch; I know currently
>> arm_legitimize_reload_address() is only used under TARGET_ARM, but I
>> guess it might eventually be turned into TARGET_32BIT.
>>
> 
> I think this needs to be looked at carefully on ARMv4/ARMv4T to check
> that it doesn't cause regressions there when we don't have LDRD in the
> instruction set.

I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
Okay for trunk if no regressions?

Thanks,
Chung-Lin

	PR target/48250
	* config/arm/arm.c (arm_legitimize_reload_address): Adjust
	DImode constant index decomposing under TARGET_LDRD. Clear
	lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
	comment for !TARGET_LDRD case.

[-- Attachment #2: pr48250-2.diff --]
[-- Type: text/plain, Size: 1369 bytes --]

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 171652)
+++ config/arm/arm.c	(working copy)
@@ -6420,7 +6420,32 @@
       HOST_WIDE_INT low, high;
 
       if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
-	low = ((val & 0xf) ^ 0x8) - 0x8;
+	{
+	  if (TARGET_LDRD)
+	    {
+	      /* ??? There may be more adjustments later for Thumb-2,
+		 which has an ldrd insn with +-1020 index range.  */
+	      int max_idx = 255;
+
+	      /* low == val, if val is within range [-max_idx, +max_idx].
+		 If not, low is set to the boundary +-max_idx.  */
+	      low = (-max_idx <= val && val <= max_idx
+		     ? val : (val > 0 ? max_idx : -max_idx));
+
+	      /* Thumb-2 ldrd, and NEON coprocessor load/store indexes
+		 are in steps of 4, so the two least significant bits
+		 need to be cleared to zero.  */
+	      if (TARGET_NEON || TARGET_THUMB2)
+		low &= ~0x3;
+	    }
+	  else
+	    {
+	      /* For pre-ARMv5TE (without ldrd), we use ldm/stm(db/da/ib)
+		 to access doublewords. The supported load/store offsets are
+		 -8, -4, and 4, which we try to produce here.  */
+	      low = (((val & 0xf) ^ 0x8) - 0x8) & ~0x3;
+	    }
+	}
       else if (TARGET_MAVERICK && TARGET_HARD_FLOAT)
 	/* Need to be careful, -256 is not a valid offset.  */
 	low = val >= 0 ? (val & 0xff) : -((-val) & 0xff);


* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-29 10:27   ` Chung-Lin Tang
@ 2011-03-29 14:35     ` Richard Earnshaw
  2011-03-29 15:46       ` Chung-Lin Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Earnshaw @ 2011-03-29 14:35 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Ramana Radhakrishnan

On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
> I see it now. The code in output_move_double() returning assembly for
> ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.
> 
> I have changed the patch to let the new code handle the TARGET_LDRD case
> only.  The pre-LDRD case is still handled by the original code, with an
> additional & ~0x3 for aligning the offset to 4.
> 
> I've also added a comment for the pre-TARGET_LDRD case. Please see if
> the description is accurate enough.
> 
> I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
> Okay for trunk if no regressions?
> 
> Thanks,
> Chung-Lin
> 
> 	PR target/48250
> 	* config/arm/arm.c (arm_legitimize_reload_address): Adjust
> 	DImode constant index decomposing under TARGET_LDRD. Clear
> 	lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
> 	comment for !TARGET_LDRD case.

This looks technically correct, but I can't help feeling that trying to
deal with the bottom two bits being non-zero is not really worthwhile.
This hook is an optimization that's intended to generate better code
than the default mechanisms that reload provides.  It is allowed to
fail, but it must say so if it does (by returning false).

What this hook is trying to do for DImode is to take an address of the
form (reg + TOO_BIG_CONST) and turn it into two instructions that are
legitimate:

	tmp = (REG + LEGAL_BIG_CONST)
	some_use_of (mem (tmp + SMALL_LEGAL_CONST))

The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
impossible in ARM state) that pushing the bottom two bits of the address
into the LEGAL_BIG_CONST part of the offset is going to lead to a better
code sequence.  

So I think it would be more sensible to just return false if we have a
DImode address with the bottom 2 bits non-zero and then let the normal
reload recovery mechanism take over.
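
For illustration (made-up numbers, assuming an ldrd index range of
+-255 in steps of 4): a word-aligned offset such as reg+0x108 splits
cleanly into one ADD immediate plus an in-range index, whereas a
misaligned offset would have to push its low bits into the
LEGAL_BIG_CONST part, which rarely helps.

  /* A sketch of the intended two-instruction split; not GCC code.  */
  #include <stdio.h>

  int main (void)
  {
    long val = 0x108;                   /* TOO_BIG_CONST for ldrd */
    long low = (val <= 255 ? val : 255) & ~0x3;  /* 0xfc, in range */
    long high = val - low;                       /* 0xc, one ADD */
    printf ("tmp = reg + %#lx; mem (tmp + %#lx)\n", high, low);
    return 0;
  }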



* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-29 14:35     ` Richard Earnshaw
@ 2011-03-29 15:46       ` Chung-Lin Tang
  2011-03-29 16:42         ` Richard Earnshaw
  0 siblings, 1 reply; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-29 15:46 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches, Ramana Radhakrishnan

On 2011/3/29 10:26 PM, Richard Earnshaw wrote:
> This looks technically correct, but I can't help feeling that trying to
> deal with the bottom two bits being non-zero is not really worthwhile.
> This hook is an optimization that's intended to generate better code
> than the default mechanisms that reload provides.  It is allowed to
> fail, but it must say so if it does (by returning false).
> 
> What this hook is trying to do for DImode is to take an address of the
> form (reg + TOO_BIG_CONST) and turn it into two instructions that are
> legitimate:
> 
> 	tmp = (REG + LEGAL_BIG_CONST)
> 	some_use_of (mem (tmp + SMALL_LEGAL_CONST))
> 
> The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
> instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
> impossible in ARM state) that pushing the bottom two bits of the address
> into the LEGAL_BIG_CONST part of the offset is going to lead to a better
> code sequence.  
> 
> So I think it would be more sensible to just return false if we have a
> DImode address with the bottom 2 bits non-zero and then let the normal
> reload recovery mechanism take over.

It is supposed to provide better utilization of the full range of
LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
(reg)) for the load/store (correct me if I'm wrong).

Also, the new code slightly improves the reloading; for example,
currently [reg+64] is broken into [(reg+72)-8], creating an additional
unneeded reload, which is certainly not good when we have ldrd/strd
available.

Thanks,
Chung-Lin


* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-29 15:46       ` Chung-Lin Tang
@ 2011-03-29 16:42         ` Richard Earnshaw
  2011-03-30  9:00           ` Chung-Lin Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Earnshaw @ 2011-03-29 16:42 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Ramana Radhakrishnan


On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
> It is supposed to provide better utilization of the full range of
> LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
> will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
> (reg)) for the load/store (correct me if I'm wrong).
> 
> Also, the new code slightly improves the reloading; for example,
> currently [reg+64] is broken into [(reg+72)-8], creating an additional
> unneeded reload, which is certainly not good when we have ldrd/strd
> available.
> 

Sorry, didn't mean to suggest that we don't want to do something better
for addresses that are a multiple of 4, just that for addresses that
aren't (at least) word-aligned we should just bail, as the code in that
case won't benefit from the optimization.  So something like

       if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
	 {
	    if (val & 3)
		return false;  /* No point in trying to handle this.  */
	    ... /* Cases that are useful to handle */

R.





* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-29 16:42         ` Richard Earnshaw
@ 2011-03-30  9:00           ` Chung-Lin Tang
  2011-03-30  9:46             ` Richard Earnshaw
  0 siblings, 1 reply; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-30  9:00 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches, Ramana Radhakrishnan

On 2011/3/30 12:23 AM, Richard Earnshaw wrote:
> Sorry, didn't mean to suggest that we don't want to do something better
> for addresses that are a multiple of 4, just that for addresses that
> aren't (at least) word-aligned we should just bail, as the code in that
> case won't benefit from the optimization.  So something like
> 
>        if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
> 	 {
> 	    if (val & 3)
> 		return false;  /* No point in trying to handle this.  */
> 	    ... /* Cases that are useful to handle */

I've looked at the reload code surrounding the call to
LEGITIMIZE_RELOAD_ADDRESS. It looks like for ARM, reload transforms the
address from [reg+#const] to newreg=#const; [reg+newreg]. ARM/Thumb-2
has 16 bits to move that constant, which is much wider in range than a
12-bit constant operand + 8-bit index. So I agree that simply bailing
out should be okay.

OTOH, I'll still add that, for some micro-architectures, register read
ports may be a critical resource; for those cores, handling as many
reloads here as possible by breaking into an address add is still
slightly better than a 'move + [reg+reg]', since the latter load/store
uses one more register read.  So maybe the best approach would be to
handle the case where the 'high' part is a valid add-immediate operand,
and bail out if not...

C.L.


* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-30  9:00           ` Chung-Lin Tang
@ 2011-03-30  9:46             ` Richard Earnshaw
  2011-03-31  6:31               ` Chung-Lin Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Earnshaw @ 2011-03-30  9:46 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Ramana Radhakrishnan


On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
> I've looked at the reload code surrounding the call to
> LEGITIMIZE_RELOAD_ADDRESS. It looks like for ARM, reload transforms the
> address from [reg+#const] to newreg=#const; [reg+newreg]. ARM/Thumb-2
> has 16 bits to move that constant, which is much wider in range than a
> 12-bit constant operand + 8-bit index. So I agree that simply bailing
> out should be okay.
> 
> OTOH, I'll still add that, for some micro-architectures, register read
> ports may be a critical resource; for those cores, handling as many
> reloads here as possible by breaking into an address add is still
> slightly better than a 'move + [reg+reg]', since the latter load/store
> uses one more register read.  So maybe the best approach would be to
> handle the case where the 'high' part is a valid add-immediate operand,
> and bail out if not...
> 
> C.L.

If the address is unaligned, then the access is going to be slow anyway;
but this is all corner case stuff - the vast majority of accesses will
be at natural alignment.  I think it's better to seek clarity in these
cases than outright performance in theoretical micro-architectural
corner cases.

The largest number of read ports would be needed by store[reg+reg].
That's only 3 ports for a normal store (four for a pair of registers),
but cores can normally handle this without penalty by reading the
address registers in one cycle and the data to be stored in a later
cycle -- critical paths tend to be on address generation, not data to be
stored.

R.

> 



* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-30  9:46             ` Richard Earnshaw
@ 2011-03-31  6:31               ` Chung-Lin Tang
  2011-03-31 10:16                 ` Richard Earnshaw
  0 siblings, 1 reply; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-31  6:31 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 8104 bytes --]

On 2011/3/30 05:28 PM, Richard Earnshaw wrote:
> If the address is unaligned, then the access is going to be slow anyway;
> but this is all corner case stuff - the vast majority of accesses will
> be at natural alignment.  I think it's better to seek clarity in these
> cases than outright performance in theoretical micro-architectural
> corner cases.
> 
> The largest number of read ports would be needed by store[reg+reg].
> That's only 3 ports for a normal store (four for a pair of registers),
> but cores can normally handle this without penalty by reading the
> address registers in one cycle and the data to be stored in a later
> cycle -- critical paths tend to be on address generation, not data to be
> stored.

Actually, I was thinking of cores with dual-issue, where an additional
port read may prevent it from happening...

Anyway, here's a new patch. I've removed the unaligned handling bits as
you suggested, simply returning false for those cases.

The points above did inspire another improvement, I think. I have added
a test to also return false when the high part is not a valid immediate
operand.  The rationale is that after such a reg=reg+high address
computation is created, it will still have to be split into multiple
adds later, so it may be better to let reload turn it into the
[reg+reg] form.
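
For reference, here is a rough standalone sketch of the immediate
validity test relied on here (my approximation; const_ok_for_arm is the
authoritative check): an ARM-state data-processing immediate is an
8-bit value rotated right by an even amount.

  /* Return nonzero if V is encodable as an ARM-state immediate
     (rough approximation of const_ok_for_arm).  */
  static int
  valid_arm_immediate (unsigned int v)
  {
    int r;
    for (r = 0; r < 32; r += 2)
      {
        /* Rotate V left by R bits; V is encodable if some even
           rotation fits in 8 bits.  */
        unsigned int rot = r ? ((v << r) | (v >> (32 - r))) : v;
        if (rot <= 0xff)
          return 1;
      }
    return 0;
  }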

I'll be testing this patch later; here it is for review.

Thanks,
Chung-Lin

[-- Attachment #2: pr48250-3.diff --]
[-- Type: text/plain, Size: 1737 bytes --]

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 171716)
+++ config/arm/arm.c	(working copy)
@@ -6420,7 +6420,30 @@
       HOST_WIDE_INT low, high;
 
       if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
-	low = ((val & 0xf) ^ 0x8) - 0x8;
+	{
+	  /* We handle the aligned to 4 case only.  */
+	  if ((val & 0x3) != 0)
+	    return false;
+
+	  if (TARGET_LDRD)
+	    {
+	      /* ??? There may be more adjustments later for Thumb-2,
+		 which has an ldrd insn with +-1020 index range.  */
+	      int max_idx = 255;
+
+	      /* low == val, if val is within range [-max_idx, +max_idx].
+		 If not, low is set to the boundary +-max_idx.  */
+	      low = (-max_idx <= val && val <= max_idx
+		     ? val : (val > 0 ? max_idx : -max_idx));
+	    }
+	  else
+	    {
+	      /* For pre-ARMv5TE (without ldrd), we use ldm/stm(db/da/ib)
+		 to access doublewords. The supported load/store offsets are
+		 -8, -4, and 4, which we try to produce here.  */
+	      low = ((val & 0xf) ^ 0x8) - 0x8;
+	    }
+	}
       else if (TARGET_MAVERICK && TARGET_HARD_FLOAT)
 	/* Need to be careful, -256 is not a valid offset.  */
 	low = val >= 0 ? (val & 0xff) : -((-val) & 0xff);
@@ -6446,6 +6469,12 @@
       if (low == 0 || high == 0 || (high + low != val))
 	return false;
 
+      /* If the high part is not a valid immediate operand,
+	 creating the (reg+high) address reload here will result
+	 in more split insns after reload, so bail out.  */
+      if (!const_ok_for_arm (high) && !const_ok_for_arm (-high))
+	return false;
+
       /* Reload the high part into a base reg; leave the low part
 	 in the mem.  */
       *p = gen_rtx_PLUS (GET_MODE (*p),


* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-31  6:31               ` Chung-Lin Tang
@ 2011-03-31 10:16                 ` Richard Earnshaw
  2011-03-31 14:36                   ` Chung-Lin Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Earnshaw @ 2011-03-31 10:16 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Ramana Radhakrishnan


On Thu, 2011-03-31 at 11:33 +0800, Chung-Lin Tang wrote:
> [...]
> > address registers in one cycle and the data to be stored in a later
> > cycle -- critical paths tend to be on address generation, not data to be
> > stored.
> 
> Actually, I was thinking of cores with dual-issue, where an additional
> port read may prevent it from happening...
> 
> Anyway, here's a new patch. I've removed the unaligned handling bits as
> you suggested, simply returning false for those cases.
> 
> The points above did inspire another improvement, I think. I have added
> a test to also return false when the high part is not a valid immediate
> operand.  The rationale is, after such a reg=reg+high address computation
> is created, it will still have to be split into multiple adds later, so it
> may be better to let reload turn it into the [reg+reg] form.
> 

Hmm, I think you've missed the point with some of this: not only is it
generally more efficient to try to use offset addressing, but careful
selection of the immediate values used in the load and the ADD insns
can often also lead to better post-reload CSE.  For example:

	ldr	r0, [r2, #4100]  // Offset too large
	ldr	r1, [r2, #4104]  // Offset too large

is best reloaded as
	add	t1, r2, #4096
	ldr	r0, [t1, #4]
	add	t2, r2, #4096
	ldr	r1, [t2, #8]

which of course post-reload CSE can simplify in most cases to eliminate
the second add instruction:

	add	t1, r2, #4096
	ldr	r0, [t1, #4]
	ldr	r1, [t1, #8]

This is true even if the amount of the offset being split out is larger
than a simple legitimate_constant.

The idea here is that we want to split out the bits of the constant as a
mask rather than by subtracting the maximum offset that ldr can handle
(the same principle applies to LDRD too).
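
To make the mask idea concrete, here is a minimal standalone sketch
(illustrative values only, not the arm.c code itself) of splitting an
over-large ldr offset by masking rather than by subtracting the maximum
reachable offset:

#include <stdio.h>

int
main (void)
{
  long long val = 4100;             /* offset too large for 12-bit ldr  */
  long long low = val & 0xfff;      /* stays in the ldr offset field    */
  long long high = val & ~0xfffLL;  /* masked-out part, done by the add */

  /* Prints: add t1, r2, #4096 ; ldr r0, [t1, #4]  */
  printf ("add t1, r2, #%lld ; ldr r0, [t1, #%lld]\n", high, low);
  return 0;
}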

A further trick is that we can make use of negative offsets even if the
overall offset is positive, and that sometimes this may lead to an
immediate that can be constructed with one fewer add instruction.  For
example,

	ldr	r0, [r2, #0x3FFFFC]

This is best reloaded as:

	add	t1, r2, #0x400000
	ldr	r0, [t1, #-4]

The trick for spotting this for a load instruction with N bits of offset
(i.e. bits N-1:0) is to look at bit N: if it is set, then choose a negative
offset that is going to make bit N and all the bits below it come to
zero in the remainder part.
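
Worked through in C (again only a sketch, using the 12-bit ldr offset
field and the value from the example above), the rule reproduces the
#0x400000/#-4 split:

#include <stdio.h>

int
main (void)
{
  long long val = 0x3FFFFC;
  int n = 12;                       /* ldr offset lives in bits 11:0  */
  long long low, high;

  if (val & (1LL << n))             /* bit N set: use a negative low  */
    low = (val & ((1LL << (n + 1)) - 1)) - (1LL << (n + 1));
  else
    low = val & ((1LL << n) - 1);
  high = val - low;                 /* bits N:0 of high are now zero  */

  /* Prints: add t1, r2, #0x400000 ; ldr r0, [t1, #-4]  */
  printf ("add t1, r2, #0x%llX ; ldr r0, [t1, #%lld]\n", high, low);
  return 0;
}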

The final thing to note is that the offsets for negative values in Thumb-2
are asymmetric with respect to the positive values that are available.  That
makes selecting the best offset more complicated, and thus using
negative values is less likely to be worthwhile.

> I'll be testing this patch later; here it is for review.
> 
> Thanks,
> Chung-Lin


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-31 10:16                 ` Richard Earnshaw
@ 2011-03-31 14:36                   ` Chung-Lin Tang
  2011-03-31 18:25                     ` Richard Earnshaw
  0 siblings, 1 reply; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-31 14:36 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 10536 bytes --]

On 2011/3/31 06:14 PM, Richard Earnshaw wrote:
> 
> Hmm, I think you've missed the point with some of this: not only is it
> generally more efficient to try to use offset addressing, but careful
> selection of the immediate values used in the load and the ADD insns
> can often also lead to better post-reload CSE.  For example:
> 
> 	ldr	r0, [r2, #4100]  // Offset too large
> 	ldr	r1, [r2, #4104]  // Offset too large
> 
> is best reloaded as
> 	add	t1, r2, #4096
> 	ldr	r0, [t1, #4]
> 	add	t2, r2, #4096
> 	ldr	r1, [t2, #8]
> 
> which of course post-reload CSE can simplify in most cases to eliminate
> the second add instruction:
> 
> 	add	t1, r2, #4096
> 	ldr	r0, [t1, #4]
> 	ldr	r1, [t1, #8]
> 
> This is true even if the amount of the offset being split out is larger
> than a simple legitimate_constant.
> 
> The idea here is that we want to split out the bits of the constant as a
> mask rather than by subtracting the maximum offset that ldr can handle
> (the same principle applies to LDRD too).
> 
> A further trick is that we can make use of negative offsets even if the
> overall offset is positive, and that sometimes this may lead to an
> immediate that can be constructed with one fewer add instruction.  For
> example,
> 
> 	ldr	r0, [r2, #0x3FFFFC]
> 
> This is best reloaded as:
> 
> 	add	t1, r2, #0x400000
> 	ldr	r0, [t1, #-4]
> 
> The trick for spotting this for a load instruction with N bits of offset
> (i.e. bits N-1:0) is to look at bit N: if it is set, then choose a negative
> offset that is going to make bit N and all the bits below it come to
> zero in the remainder part.
> 
> The final thing to note is that the offsets for negative values in Thumb-2
> are asymmetric with respect to the positive values that are available.  That
> makes selecting the best offset more complicated, and thus using
> negative values is less likely to be worthwhile.

Richard, thanks for the detailed explanation; you should really turn
this into a comment in arm.c some time.

I have revised the patch again, this time mostly in a form analogous
to the other cases, as you explained above.
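
With the patch below, the PR's original crash case is simply rejected:
for the [sp+13] access in the testcase, val & 0x3 is nonzero, so the
hook returns false and reload falls back to its default [reg+reg]
handling instead of creating the misaligned [(sp+16)-3] decomposition.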

Thanks,
Chung-Lin

[-- Attachment #2: pr48250-4.diff --]
[-- Type: text/plain, Size: 1035 bytes --]

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 171716)
+++ config/arm/arm.c	(working copy)
@@ -6420,7 +6420,21 @@
       HOST_WIDE_INT low, high;
 
       if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
-	low = ((val & 0xf) ^ 0x8) - 0x8;
+	{
+	  /* We handle the aligned to 4 case only.  */
+	  if ((val & 0x3) != 0)
+	    return false;
+
+	  if (TARGET_LDRD)
+	    /* ??? There may be more adjustments later for Thumb-2,
+	       which has a ldrd insn with +-1020 index range.  */
+	    low = val >= 0 ? (val & 0xff) : -((-val) & 0xff);
+	  else
+	    /* For pre-ARMv5TE (without ldrd), we use ldm/stm(db/da/ib)
+	       to access doublewords. The supported load/store offsets are
+	       -8, -4, and 4, which we try to produce here.  */
+	    low = ((val & 0xf) ^ 0x8) - 0x8;
+	}
       else if (TARGET_MAVERICK && TARGET_HARD_FLOAT)
 	/* Need to be careful, -256 is not a valid offset.  */
 	low = val >= 0 ? (val & 0xff) : -((-val) & 0xff);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-31 14:36                   ` Chung-Lin Tang
@ 2011-03-31 18:25                     ` Richard Earnshaw
  2011-03-31 19:47                       ` Chung-Lin Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Earnshaw @ 2011-03-31 18:25 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Ramana Radhakrishnan


On Thu, 2011-03-31 at 22:18 +0800, Chung-Lin Tang wrote:
> 
> Richard, thanks for the detailed explanation; you should really turn
> this into a comment in arm.c some time.
> 

Feel free to put it in...

> I have revised the patch again, this time mostly in a form analogous
> to the other cases, as you explained above.
> 
> Thanks,
> Chung-Lin

Except that the existing ARM cases aren't quite doing the best they can
(in fact, the only case that fully exploits the negative offsets is the
pre-v5 DImode one, but that uses 2's complement addressing so it's
fairly easy :-)

The correct calculation for low for the TARGET_LDRD case should be

#define SIGN_MAG_LOW_ADDR_BITS(VAL, N) \
  (((VAL) & ((1 << (N)) - 1)) \
   ? (((VAL) & ((1 << ((N) + 1)) - 1)) ^ (1 << (N))) - (1 << (N)) : 0)

Now you can simply write

	low = SIGN_MAG_LOW_ADDR_BITS (val, 8);
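
As a quick check, a standalone sketch (the values and offset widths are
only illustrative) shows the macro reproducing both the positive and the
negative-offset splits discussed earlier:

#include <stdio.h>

#define SIGN_MAG_LOW_ADDR_BITS(VAL, N) \
  (((VAL) & ((1 << (N)) - 1)) \
   ? (((VAL) & ((1 << ((N) + 1)) - 1)) ^ (1 << (N))) - (1 << (N)) : 0)

int
main (void)
{
  long long val[] = { 4100, 0x1FC, 0x3FFFFC };
  int n[] = { 8, 8, 12 };          /* 8 offset bits for ldrd, 12 for ldr */
  int i;

  /* Prints 0x1004 = 0x1000 + (4), 0x1fc = 0x200 + (-4),
     and 0x3ffffc = 0x400000 + (-4).  */
  for (i = 0; i < 3; i++)
    {
      long long low = SIGN_MAG_LOW_ADDR_BITS (val[i], n[i]);
      printf ("%#llx = %#llx + (%lld)\n", val[i], val[i] - low, low);
    }
  return 0;
}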


Really the other cases that use sign-magnitude addressing should be
updated similarly.

R.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
  2011-03-31 18:25                     ` Richard Earnshaw
@ 2011-03-31 19:47                       ` Chung-Lin Tang
  0 siblings, 0 replies; 13+ messages in thread
From: Chung-Lin Tang @ 2011-03-31 19:47 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc-patches, Ramana Radhakrishnan

On 2011/4/1 01:33 AM, Richard Earnshaw wrote:
> 
> On Thu, 2011-03-31 at 22:18 +0800, Chung-Lin Tang wrote:
>>
> >> Richard, thanks for the detailed explanation; you should really turn
> >> this into a comment in arm.c some time.
>>
> 
> Feel free to put it in...
> 
> >> I have revised the patch again, this time mostly in a form analogous
> >> to the other cases, as you explained above.
>>
>> Thanks,
>> Chung-Lin
> 
> Except that the existing ARM cases aren't quite doing the best they can
> (in fact, the only case that fully exploits the negative offsets is the
> pre-v5 DImode one, but that uses 2's complement addressing so it's
> fairly easy :-)
> 
> The correct calculation for low for the TARGET_LDRD case should be
> 
> #define SIGN_MAG_LOW_ADDR_BITS(VAL, N) \
>   (((VAL) & ((1 << (N)) - 1)) \
>    ? (((VAL) & ((1 << ((N) + 1)) - 1)) ^ (1 << (N))) - (1 << (N)) : 0)
> 
> Now you can simply write
> 
> 	low = SIGN_MAG_LOW_ADDR_BITS (val, 8);
> 
> 
> Really the other cases that use sign-magnitude addressing should be
> updated similarly.

Thanks for the convenient macro. I guess I'll overhaul the patch to do
the other cases together, since this discussion has gone this far. I
also noticed that significant cases like TARGET_VFP && TARGET_HARD_FLOAT
are not handled here at all. Might as well add them too.
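
For TARGET_VFP && TARGET_HARD_FLOAT, something along these lines might
work (a hypothetical sketch only -- the condition, the choice of N = 10,
and the reuse of SIGN_MAG_LOW_ADDR_BITS are my assumptions, not tested
code; VFP load/store offsets cover +/-1020 in steps of 4):

      else if (TARGET_HARD_FLOAT && TARGET_VFP && mode == DFmode)
	{
	  /* Assume only word-aligned addresses are worth handling.  */
	  if (val & 3)
	    return false;

	  /* 8 magnitude bits scaled by 4, i.e. bits 9:2, giving a
	     +/-1020 range with the low two bits clear.  */
	  low = SIGN_MAG_LOW_ADDR_BITS (val, 10);
	}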

Thanks,
C.L.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-03-31 19:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-24  3:57 [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing Chung-Lin Tang
2011-03-24 10:52 ` Richard Earnshaw
2011-03-29 10:27   ` Chung-Lin Tang
2011-03-29 14:35     ` Richard Earnshaw
2011-03-29 15:46       ` Chung-Lin Tang
2011-03-29 16:42         ` Richard Earnshaw
2011-03-30  9:00           ` Chung-Lin Tang
2011-03-30  9:46             ` Richard Earnshaw
2011-03-31  6:31               ` Chung-Lin Tang
2011-03-31 10:16                 ` Richard Earnshaw
2011-03-31 14:36                   ` Chung-Lin Tang
2011-03-31 18:25                     ` Richard Earnshaw
2011-03-31 19:47                       ` Chung-Lin Tang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).