[Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt

public inbox for gcc-patches@gcc.gnu.org

* [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
@ 2016-01-11 11:53 James Greenhalgh
  2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: James Greenhalgh @ 2016-01-11 11:53 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes


Hi,

I'd like to switch the logic around in aarch64.c such that
-mlow-precision-recip-sqrt causes us to always emit the low-precision
software expansion for reciprocal square root. I have two reasons to do
this; first is consistency across -mcpu targets, second is enabling more
-mcpu targets to use the flag for peak tuning.

I don't much like that the precision we use for -mlow-precision-recip-sqrt
differs between cores (and possibly compiler revisions). Yes, we're
under -ffast-math but I take this flag to mean the user explicitly wants the
low-precision expansion, and we should not diverge from that based on an
internal decision as to what is optimal for performance in the
high-precision case. I'd prefer to keep things as predictable as possible,
and here that means always emitting the low-precision expansion when asked.

Judging by the comments in the thread proposing the reciprocal square
root optimisation, this will benefit all cores currently supported by GCC.
To be clear, we would still not expand in the high-precision case for any
cores which do not explicitly ask for it. Currently that is Cortex-A57
and xgene, though I will be proposing a patch to remove Cortex-A57 from
that list shortly.

Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
is intended as a tuning flag for situations where performance is more
important than precision, but the current logic requires setting an
internal flag which also changes the performance characteristics where
high-precision is needed. This conflates two decisions the target might
want to make, and reduces the applicability of an option targets might
want to enable for performance. In particular, I'd still like to see
-mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
sequence for floats under Cortex-A57.

Based on that reasoning, this patch makes the appropriate change to the
logic. I've checked with the current -mcpu values to ensure that behaviour
without -mlow-precision-recip-sqrt does not change, and that behaviour
with -mlow-precision-recip-sqrt is to emit the low precision sequences.

I've also put this through bootstrap and test on aarch64-none-linux-gnu
with no issues.

OK?

Thanks,
James

---
2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.c (use_rsqrt_p): Always use software
	reciprocal sqrt for -mlow-precision-recip-sqrt.

[-- Attachment #2: 0001-Patch-AArch64-Use-software-sqrt-expansion-always-for.patch --]
[-- Type: text/x-patch;  name=0001-Patch-AArch64-Use-software-sqrt-expansion-always-for.patch, Size: 567 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..1d5d898 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && (aarch64_tune_params.extra_tuning_flags
-	      & AARCH64_EXTRA_TUNE_RECIP_SQRT));
+	  && ((aarch64_tune_params.extra_tuning_flags
+	       & AARCH64_EXTRA_TUNE_RECIP_SQRT)
+	      || flag_mrecip_low_precision_sqrt));
 }

 /* Function to decide when to use

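The following is a minimal standalone model of the predicate before and after this change, which lays out the truth table directly. The struct and its field names are hypothetical stand-ins for the GCC globals flag_trapping_math, flag_unsafe_math_optimizations and flag_mrecip_low_precision_sqrt, and for the AARCH64_EXTRA_TUNE_RECIP_SQRT tuning bit. It is a sketch, not the aarch64.c code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the compiler state that use_rsqrt_p reads.  */
struct rsqrt_state
{
  bool trapping_math;          /* models flag_trapping_math */
  bool unsafe_math;            /* models flag_unsafe_math_optimizations */
  bool core_wants_recip_sqrt;  /* models AARCH64_EXTRA_TUNE_RECIP_SQRT */
  bool low_precision;          /* models flag_mrecip_low_precision_sqrt */
};

/* Before the patch: only the per-core tuning bit enables the expansion.  */
bool
use_rsqrt_before (const struct rsqrt_state *s)
{
  return !s->trapping_math
	 && s->unsafe_math
	 && s->core_wants_recip_sqrt;
}

/* After the patch: -mlow-precision-recip-sqrt enables it as well,
   whatever the selected core's tuning says.  */
bool
use_rsqrt_after (const struct rsqrt_state *s)
{
  return !s->trapping_math
	 && s->unsafe_math
	 && (s->core_wants_recip_sqrt || s->low_precision);
}
```

With the core's tuning bit clear, the old predicate refuses the expansion even under -mlow-precision-recip-sqrt, while the new one accepts it. Without the option, both predicates give the same answer.
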

* [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-01-11 11:53 [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt James Greenhalgh
@ 2016-01-11 12:05 ` James Greenhalgh
  2016-01-11 13:31   ` Dr. Philipp Tomsich
                     ` (2 more replies)
  2016-01-11 22:58 ` [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt Evandro Menezes
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 21+ messages in thread
From: James Greenhalgh @ 2016-01-11 12:05 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes



Hi,

I've seen a couple of large performance issues caused by expanding
the high-precision reciprocal square root for Cortex-A57, so I'd like
to turn it off by default.

This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
some private microbenchmark kernels which stress the divide/sqrt/multiply
units. It therefore seems to me to be the correct choice to make across
a number of workloads.

Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

OK?

Thanks,
James

---
2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
	AARCH64_EXTRA_TUNE_RECIP_SQRT.


[-- Attachment #2: 0001-AArch64-Remove-AARCH64_EXTRA_TUNE_RECIP_SQRT-from-Co.patch --]
[-- Type: text/x-patch;  name=0001-AArch64-Remove-AARCH64_EXTRA_TUNE_RECIP_SQRT-from-Co.patch, Size: 598 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1d5d898..999c9fc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
-   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
 };
 
 static const struct tune_params cortexa72_tunings =

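The sketch below models the tuning word as a bitmask to show roughly what dropping the bit changes. The enumerator names and bit positions are hypothetical. The real AARCH64_EXTRA_TUNE_* values are defined in the AArch64 backend and may be numbered differently.

```c
#include <stdbool.h>

/* Hypothetical bit assignments standing in for AARCH64_EXTRA_TUNE_*.  */
enum
{
  TUNE_RENAME_FMA_REGS = 1 << 0,
  TUNE_RECIP_SQRT = 1 << 1
};

/* Cortex-A57 extra tuning flags before and after this patch.  */
const unsigned cortexa57_flags_before = TUNE_RENAME_FMA_REGS | TUNE_RECIP_SQRT;
const unsigned cortexa57_flags_after = TUNE_RENAME_FMA_REGS;

/* The tuning half of the use_rsqrt_p test: does the core itself ask for
   the high-precision software expansion?  */
bool
core_wants_recip_sqrt (unsigned flags)
{
  return (flags & TUNE_RECIP_SQRT) != 0;
}
```

Combined with the previous patch, Cortex-A57 would then keep the hardware sqrt/divide instructions by default. It would use the low-precision software sequence only under -mlow-precision-recip-sqrt.
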

* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
@ 2016-01-11 13:31   ` Dr. Philipp Tomsich
  2016-01-25 11:20   ` James Greenhalgh
  2016-02-16  8:49   ` Marcus Shawcroft
  2 siblings, 0 replies; 21+ messages in thread
From: Dr. Philipp Tomsich @ 2016-01-11 13:31 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: GCC Patches, nd, Marcus Shawcroft, richard.earnshaw, Kumar,
	Venkataramanan, Andrew Pinski, Kyrylo.Tkachov, Evandro Menezes,
	Benedikt Huber

James,

OK from our side; good to see that this also benefits the A57.

Best,
Philipp.

> On 11 Jan 2016, at 13:04, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> 
> Hi,
> 
> I've seen a couple of large performance issues caused by expanding
> the high-precision reciprocal square root for Cortex-A57, so I'd like
> to turn it off by default.
> 
> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> some private microbenchmark kernels which stress the divide/sqrt/multiply
> units. It therefore seems to me to be the correct choice to make across
> a number of workloads.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> 
> OK?
> 
> Thanks,
> James
> 
> ---
> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> 
> <0001-AArch64-Remove-AARCH64_EXTRA_TUNE_RECIP_SQRT-from-Co.patch>


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-11 11:53 [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt James Greenhalgh
  2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
@ 2016-01-11 22:58 ` Evandro Menezes
  2016-01-12 11:32   ` James Greenhalgh
  2016-01-12  5:53 ` Kumar, Venkataramanan
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Evandro Menezes @ 2016-01-11 22:58 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov

On 01/11/2016 05:53 AM, James Greenhalgh wrote:
> I'd like to switch the logic around in aarch64.c such that
> -mlow-precision-recip-sqrt causes us to always emit the low-precision
> software expansion for reciprocal square root. I have two reasons to do
> this; first is consistency across -mcpu targets, second is enabling more
> -mcpu targets to use the flag for peak tuning.
>
> I don't much like that the precision we use for -mlow-precision-recip-sqrt
> differs between cores (and possibly compiler revisions). Yes, we're
> under -ffast-math but I take this flag to mean the user explicitly wants the
> low-precision expansion, and we should not diverge from that based on an
> internal decision as to what is optimal for performance in the
> high-precision case. I'd prefer to keep things as predictable as possible,
> and here that means always emitting the low-precision expansion when asked.
>
> Judging by the comments in the thread proposing the reciprocal square
> root optimisation, this will benefit all cores currently supported by GCC.
> To be clear, we would still not expand in the high-precision case for any
> cores which do not explicitly ask for it. Currently that is Cortex-A57
> and xgene, though I will be proposing a patch to remove Cortex-A57 from
> that list shortly.
>
> Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> is intended as a tuning flag for situations where performance is more
> important than precision, but the current logic requires setting an
> internal flag which also changes the performance characteristics where
> high-precision is needed. This conflates two decisions the target might
> want to make, and reduces the applicability of an option targets might
> want to enable for performance. In particular, I'd still like to see
> -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> sequence for floats under Cortex-A57.
>
> Based on that reasoning, this patch makes the appropriate change to the
> logic. I've checked with the current -mcpu values to ensure that behaviour
> without -mlow-precision-recip-sqrt does not change, and that behaviour
> with -mlow-precision-recip-sqrt is to emit the low precision sequences.
>
> I've also put this through bootstrap and test on aarch64-none-linux-gnu
> with no issues.
>
> OK?

Yes, it LGTM.

I appreciate the idea of uniformity when an option is specified, which
led me to wonder whether it would be a good idea to add an option that
would force the emission of the reciprocal square root, effectively
forcing the flag AARCH64_EXTRA_TUNE_RECIP_SQRT on, regardless of the
tuning flags for the given core.  I think that this option would be
particularly useful when specifying flags for specific functions,
irrespective of the core.

Thoughts?

-- 
Evandro Menezes


* RE: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-11 11:53 [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt James Greenhalgh
  2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
  2016-01-11 22:58 ` [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt Evandro Menezes
@ 2016-01-12  5:53 ` Kumar, Venkataramanan
  2016-01-12 11:48   ` James Greenhalgh
  2016-01-25 11:21 ` James Greenhalgh
  2016-02-16  8:40 ` Marcus Shawcroft
  4 siblings, 1 reply; 21+ messages in thread
From: Kumar, Venkataramanan @ 2016-01-12  5:53 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, philipp.tomsich, pinskia,
	Kyrylo.Tkachov, e.menezes

Hi James,

> -----Original Message-----
> From: James Greenhalgh [mailto:james.greenhalgh@arm.com]
> Sent: Monday, January 11, 2016 5:24 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd@arm.com; marcus.shawcroft@arm.com;
> richard.earnshaw@arm.com; Kumar, Venkataramanan;
> philipp.tomsich@theobroma-systems.com; pinskia@gmail.com;
> Kyrylo.Tkachov@arm.com; e.menezes@samsung.com
> Subject: [Patch AArch64] Use software sqrt expansion always for
> -mlow-precision-recip-sqrt
> 
> 
> Hi,
> 
> I'd like to switch the logic around in aarch64.c such that
> -mlow-precision-recip-sqrt causes us to always emit the low-precision software expansion for
> reciprocal square root. I have two reasons to do this; first is consistency
> across -mcpu targets, second is enabling more -mcpu targets to use the flag
> for peak tuning.
> 
> I don't much like that the precision we use for -mlow-precision-recip-sqrt
> differs between cores (and possibly compiler revisions). Yes, we're under
> -ffast-math but I take this flag to mean the user explicitly wants the
> low-precision expansion, and we should not diverge from that based on an
> internal decision as to what is optimal for performance in the high-precision
> case. I'd prefer to keep things as predictable as possible, and here that
> means always emitting the low-precision expansion when asked.
> 
> Judging by the comments in the thread proposing the reciprocal square root
> optimisation, this will benefit all cores currently supported by GCC.
> To be clear, we would still not expand in the high-precision case for any cores
> which do not explicitly ask for it. Currently that is Cortex-A57 and xgene,
> though I will be proposing a patch to remove Cortex-A57 from that list
> shortly.
> 
> Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> is intended as a tuning flag for situations where performance is more
> important than precision, but the current logic requires setting an internal
> flag which also changes the performance characteristics where high-precision
> is needed. This conflates two decisions the target might want to make, and
> reduces the applicability of an option targets might want to enable for
> performance. In particular, I'd still like to see -mlow-precision-recip-sqrt
> continue to emit the cheaper, low-precision sequence for floats under
> Cortex-A57.
> 
> Based on that reasoning, this patch makes the appropriate change to the
> logic. I've checked with the current -mcpu values to ensure that behaviour
> without -mlow-precision-recip-sqrt does not change, and that behaviour
> with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> 
> I've also put this through bootstrap and test on aarch64-none-linux-gnu with
> no issues.
> 
> OK?
> 
> Thanks,
> James
> 

Yes, I like enabling this optimization for all CPU targets via -mlow-precision-recip-sqrt.

If my understanding is correct, for Cortex-A57 we will now emit the software sqrt expansion only with -mlow-precision-recip-sqrt?

In the below code 
---snip---
void
aarch64_emit_swrsqrt (rtx dst, rtx src)
{
............
............
  int iterations = double_mode ? 3 : 2;

  if (flag_mrecip_low_precision_sqrt)
    iterations--;
 ---snip---

So in the Cortex-A57 case we will always do 2 and 1 steps for double and float, and 3 and 2 will never be used.
Should we make 2 and 1 the default? Or does any target still need 3 and 2?

PS: I remember reducing the iterations benefited gromacs but caused some VE (verification errors) in other FP benchmarks.

Regards,
Venkat.



> ---
> 2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> 	reciprocal sqrt for -mlow-precision-recip-sqrt.



* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-11 22:58 ` [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt Evandro Menezes
@ 2016-01-12 11:32   ` James Greenhalgh
  2016-01-12 11:44     ` Kyrill Tkachov
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-01-12 11:32 UTC (permalink / raw)
  To: Evandro Menezes
  Cc: gcc-patches, nd, marcus.shawcroft, richard.earnshaw,
	Venkataramanan.Kumar, philipp.tomsich, pinskia, Kyrylo.Tkachov

On Mon, Jan 11, 2016 at 04:57:56PM -0600, Evandro Menezes wrote:
> On 01/11/2016 05:53 AM, James Greenhalgh wrote:
> >I'd like to switch the logic around in aarch64.c such that
> >-mlow-precision-recip-sqrt causes us to always emit the low-precision
> >software expansion for reciprocal square root. I have two reasons to do
> >this; first is consistency across -mcpu targets, second is enabling more
> >-mcpu targets to use the flag for peak tuning.
> >
> >I don't much like that the precision we use for -mlow-precision-recip-sqrt
> >differs between cores (and possibly compiler revisions). Yes, we're
> >under -ffast-math but I take this flag to mean the user explicitly wants the
> >low-precision expansion, and we should not diverge from that based on an
> >internal decision as to what is optimal for performance in the
> >high-precision case. I'd prefer to keep things as predictable as possible,
> >and here that means always emitting the low-precision expansion when asked.
> >
> >Judging by the comments in the thread proposing the reciprocal square
> >root optimisation, this will benefit all cores currently supported by GCC.
> >To be clear, we would still not expand in the high-precision case for any
> >cores which do not explicitly ask for it. Currently that is Cortex-A57
> >and xgene, though I will be proposing a patch to remove Cortex-A57 from
> >that list shortly.
> >
> >Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> >is intended as a tuning flag for situations where performance is more
> >important than precision, but the current logic requires setting an
> >internal flag which also changes the performance characteristics where
> >high-precision is needed. This conflates two decisions the target might
> >want to make, and reduces the applicability of an option targets might
> >want to enable for performance. In particular, I'd still like to see
> >-mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> >sequence for floats under Cortex-A57.
> >
> >Based on that reasoning, this patch makes the appropriate change to the
> >logic. I've checked with the current -mcpu values to ensure that behaviour
> >without -mlow-precision-recip-sqrt does not change, and that behaviour
> >with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> >
> >I've also put this through bootstrap and test on aarch64-none-linux-gnu
> >with no issues.
> >
> >OK?
> 
> Yes, it LGTM.

Thanks.

> I appreciate the idea of uniformity when an option is specified,
> which led me to wonder whether it would be a good idea to add an option
> that would force the emission of the reciprocal
> square root, effectively forcing the flag
> AARCH64_EXTRA_TUNE_RECIP_SQRT on, regardless of the tuning flags for
> the given core.  I think that this option would be particularly useful
> when specifying flags for specific functions, irrespective of the
> core.
> 
> Thoughts?

Currently you can do this using the (mostly unsupported) -moverride
mechanism as -moverride=tune=recip_sqrt from the command line.
I'm not sure how reliable using this from
__attribute__((target("override=tune=recip_sqrt"))) would be; I wrote a small
testcase that didn't work as intended, but I'm not yet sure whether that is
a bug or a design decision. I think the logic for parsing the
target attribute is set up to reapply the command-line override string
over whichever tuning options you apply through the attribute, rather than
to allow you to apply a per-function override.
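
In effect, the override ORs extra bits into the selected core's tuning flags. The parser below is a hypothetical simplification that accepts only a single "tune=NAME" item. It is not GCC's actual -moverride parser, and the names and bit values are illustrative.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical bits standing in for AARCH64_EXTRA_TUNE_*.  */
#define TUNE_RENAME_FMA_REGS (1u << 0)
#define TUNE_RECIP_SQRT (1u << 1)

/* Apply an override of the form "tune=NAME" on top of a core's default
   tuning flags.  Returns false, leaving *flags unchanged, if the string
   is not recognised.  */
bool
apply_tune_override (const char *override, unsigned *flags)
{
  static const char prefix[] = "tune=";
  const size_t prefix_len = sizeof prefix - 1;

  if (strncmp (override, prefix, prefix_len) != 0)
    return false;

  const char *name = override + prefix_len;
  if (strcmp (name, "recip_sqrt") == 0)
    *flags |= TUNE_RECIP_SQRT;
  else if (strcmp (name, "rename_fma_regs") == 0)
    *flags |= TUNE_RENAME_FMA_REGS;
  else
    return false;
  return true;
}
```

For example, apply_tune_override ("tune=recip_sqrt", &flags) sets the recip_sqrt bit while keeping the core's existing bits. A tuning check such as the one in use_rsqrt_p would then behave as though the core had requested the expansion itself.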

As to whether we'd want to expose this as a fully supported,
user-visible setting, I'd rather not. Our claim is that for the
higher-precision sequences the results are close enough that we can
consider this like reassociation width or other core-specific tuning
parameters that we don't expose. What I'm hoping to avoid is a
proliferation of supported options which are not in anybody's regular
testing matrix. This one would not be so bad as it is automatically
enabled by some cores. For now I'd rather not add the option.

Thanks,
James


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-12 11:32   ` James Greenhalgh
@ 2016-01-12 11:44     ` Kyrill Tkachov
  0 siblings, 0 replies; 21+ messages in thread
From: Kyrill Tkachov @ 2016-01-12 11:44 UTC (permalink / raw)
  To: James Greenhalgh, Evandro Menezes
  Cc: gcc-patches, nd, marcus.shawcroft, richard.earnshaw,
	Venkataramanan.Kumar, philipp.tomsich, pinskia

Hi all,

On 12/01/16 11:32, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 04:57:56PM -0600, Evandro Menezes wrote:
>> On 01/11/2016 05:53 AM, James Greenhalgh wrote:
>>> I'd like to switch the logic around in aarch64.c such that
>>> -mlow-precision-recip-sqrt causes us to always emit the low-precision
>>> software expansion for reciprocal square root. I have two reasons to do
>>> this; first is consistency across -mcpu targets, second is enabling more
>>> -mcpu targets to use the flag for peak tuning.
>>>
>>> I don't much like that the precision we use for -mlow-precision-recip-sqrt
>>> differs between cores (and possibly compiler revisions). Yes, we're
>>> under -ffast-math but I take this flag to mean the user explicitly wants the
>>> low-precision expansion, and we should not diverge from that based on an
>>> internal decision as to what is optimal for performance in the
>>> high-precision case. I'd prefer to keep things as predictable as possible,
>>> and here that means always emitting the low-precision expansion when asked.
>>>
>>> Judging by the comments in the thread proposing the reciprocal square
>>> root optimisation, this will benefit all cores currently supported by GCC.
>>> To be clear, we would still not expand in the high-precision case for any
>>> cores which do not explicitly ask for it. Currently that is Cortex-A57
>>> and xgene, though I will be proposing a patch to remove Cortex-A57 from
>>> that list shortly.
>>>
>>> Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
>>> is intended as a tuning flag for situations where performance is more
>>> important than precision, but the current logic requires setting an
>>> internal flag which also changes the performance characteristics where
>>> high-precision is needed. This conflates two decisions the target might
>>> want to make, and reduces the applicability of an option targets might
>>> want to enable for performance. In particular, I'd still like to see
>>> -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
>>> sequence for floats under Cortex-A57.
>>>
>>> Based on that reasoning, this patch makes the appropriate change to the
>>> logic. I've checked with the current -mcpu values to ensure that behaviour
>>> without -mlow-precision-recip-sqrt does not change, and that behaviour
>>> with -mlow-precision-recip-sqrt is to emit the low precision sequences.
>>>
>>> I've also put this through bootstrap and test on aarch64-none-linux-gnu
>>> with no issues.
>>>
>>> OK?
>> Yes, it LGTM.
> Thanks.
>
>> I appreciate the idea of uniformity when an option is specified,
>> which led me to wonder whether it would be a good idea to add an option
>> that would force the emission of the reciprocal
>> square root, effectively forcing the flag
>> AARCH64_EXTRA_TUNE_RECIP_SQRT on, regardless of the tuning flags for
>> the given core.  I think that this option would be particularly useful
>> when specifying flags for specific functions, irrespective of the
>> core.
>>
>> Thoughts?
> Currently you can do this using the (mostly unsupported) -moverride
> mechanism as -moverride=tune=recip_sqrt from the command line.
> I'm not sure how reliable using this from
> __attribute__((target("override=tune=recip_sqrt"))) would be; I wrote a small
> testcase that didn't work as intended, but I'm not yet sure whether that is
> a bug or a design decision. I think the logic for parsing the
> target attribute is set up to reapply the command-line override string
> over whichever tuning options you apply through the attribute, rather than
> to allow you to apply a per-function override.

As a clarification: we don't support an "override" target attribute on aarch64.
I had a patch earlier in the year to hook up the override string parsing machinery
into the attributes parsing code, but didn't end up proposing it.
IIRC the syntax of the override string (using '=' multiple times) would needlessly
complicate the parsing code for something that's not intended to be used by regular
users but rather by power users that are exploring gcc internals.

Thanks,
Kyrill

> As to whether we'd want to expose this as a fully supported,
> user-visible setting, I'd rather not. Our claim is that for the
> higher-precision sequences the results are close enough that we can
> consider this like reassociation width or other core-specific tuning
> parameters that we don't expose. What I'm hoping to avoid is a
> proliferation of supported options which are not in anybody's regular
> testing matrix. This one would not be so bad as it is automatically
> enabled by some cores. For now I'd rather not add the option.
>
> Thanks,
> James
>


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-12  5:53 ` Kumar, Venkataramanan
@ 2016-01-12 11:48   ` James Greenhalgh
  0 siblings, 0 replies; 21+ messages in thread
From: James Greenhalgh @ 2016-01-12 11:48 UTC (permalink / raw)
  To: Kumar, Venkataramanan
  Cc: gcc-patches, nd, marcus.shawcroft, richard.earnshaw,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes

On Tue, Jan 12, 2016 at 05:53:21AM +0000, Kumar, Venkataramanan wrote:
> Hi James,
> 
> > -----Original Message-----
> > From: James Greenhalgh [mailto:james.greenhalgh@arm.com]
> > Sent: Monday, January 11, 2016 5:24 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: nd@arm.com; marcus.shawcroft@arm.com;
> > richard.earnshaw@arm.com; Kumar, Venkataramanan;
> > philipp.tomsich@theobroma-systems.com; pinskia@gmail.com;
> > Kyrylo.Tkachov@arm.com; e.menezes@samsung.com
> > Subject: [Patch AArch64] Use software sqrt expansion always for
> > -mlow-precision-recip-sqrt
> > 
> > 
> > Hi,
> > 
> > I'd like to switch the logic around in aarch64.c such that
> > -mlow-precision-recip-sqrt causes us to always emit the low-precision software expansion for
> > reciprocal square root. I have two reasons to do this; first is consistency
> > across -mcpu targets, second is enabling more -mcpu targets to use the flag
> > for peak tuning.
> > 
> > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > differs between cores (and possibly compiler revisions). Yes, we're under
> > -ffast-math but I take this flag to mean the user explicitly wants the
> > low-precision expansion, and we should not diverge from that based on an
> > internal decision as to what is optimal for performance in the high-precision
> > case. I'd prefer to keep things as predictable as possible, and here that
> > means always emitting the low-precision expansion when asked.
> > 
> > Judging by the comments in the thread proposing the reciprocal square root
> > optimisation, this will benefit all cores currently supported by GCC.
> > To be clear, we would still not expand in the high-precision case for any cores
> > which do not explicitly ask for it. Currently that is Cortex-A57 and xgene,
> > though I will be proposing a patch to remove Cortex-A57 from that list
> > shortly.
> > 
> > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> > is intended as a tuning flag for situations where performance is more
> > important than precision, but the current logic requires setting an internal
> > flag which also changes the performance characteristics where high-precision
> > is needed. This conflates two decisions the target might want to make, and
> > reduces the applicability of an option targets might want to enable for
> > performance. In particular, I'd still like to see -mlow-precision-recip-sqrt
> > continue to emit the cheaper, low-precision sequence for floats under
> > Cortex-A57.
> > 
> > Based on that reasoning, this patch makes the appropriate change to the
> > logic. I've checked with the current -mcpu values to ensure that behaviour
> > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > 
> > I've also put this through bootstrap and test on aarch64-none-linux-gnu with
> > no issues.
> > 
> > OK?
> > 
> > Thanks,
> > James
> > 
> 
> Yes, I like enabling this optimization for all CPU targets via
> -mlow-precision-recip-sqrt.
>
> If my understanding is correct, for Cortex-A57 we will now emit the
> software sqrt expansion only with -mlow-precision-recip-sqrt?
> 
> In the below code 
> ---snip---
> void
> aarch64_emit_swrsqrt (rtx dst, rtx src)
> {
> ............
> ............
>   int iterations = double_mode ? 3 : 2;
> 
>   if (flag_mrecip_low_precision_sqrt)
>     iterations--;
>  ---snip---
> 
> So in the Cortex-A57 case we will always do 2 and 1 steps for double and float,
> and 3 and 2 will never be used. Should we make 2 and 1 the default? Or does
> any target still need 3 and 2?

The code here should handle two cases:

  1) Normal -Ofast case -> Some targets use the estimate expansion with
     3 iterations for double, 2 for float. Other targets use the hardware
     fsqrt/fdiv instructions.
  2) -mlow-precision-recip-sqrt -> All targets use the estimate expansion
     with 2 iterations for double, 1 for float.
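
The software expansion in both cases is a Newton-Raphson refinement of an initial estimate. Each step, x' = x * (3 - a*x*x) / 2 (the step the FRSQRTS instruction is designed to support), roughly squares the relative error. The sketch below mirrors the iteration counts from the aarch64_emit_swrsqrt snippet quoted above and measures the error after each step for 1/sqrt(4) = 0.5. It makes two simplifying assumptions. First, the initial estimate is modelled as a fixed relative error of 1/256 (about 8 bits) rather than the real FRSQRTE result. Second, it uses plain C doubles instead of the RTL GCC emits.

```c
#include <assert.h>
#include <stdbool.h>

/* Iteration counts as in the quoted aarch64_emit_swrsqrt snippet:
   3 steps for double, 2 for float, one fewer under
   -mlow-precision-recip-sqrt.  */
int
rsqrt_iterations (bool double_mode, bool low_precision)
{
  int iterations = double_mode ? 3 : 2;
  if (low_precision)
    iterations--;
  return iterations;
}

/* Refine an estimate X of 1/sqrt(A) with ITERATIONS Newton-Raphson
   steps, each computing x * (3 - a*x*x) / 2.  */
double
rsqrt_refine (double a, double x, int iterations)
{
  assert (a > 0.0);
  for (int i = 0; i < iterations; i++)
    x = x * (3.0 - a * x * x) / 2.0;
  return x;
}

/* Absolute relative error of APPROX against a nonzero EXACT value.  */
double
rel_error (double approx, double exact)
{
  double e = (approx - exact) / exact;
  return e < 0.0 ? -e : e;
}
```

Under these assumptions, one step leaves a relative error of about 2.3e-5, which falls short of float's 24-bit significand. Two steps bring it to about 7.9e-10, which is enough for float but short of double. Three steps reach double precision. This is why dropping a step is only acceptable when the programmer has opted into lower precision.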

-mlow-precision-recip-sqrt is a specialisation to be used only when the
programmer knows the lower precision is acceptable. It should not be on
by default...

> PS: I remember reducing the iterations benefited gromacs but caused some VE
> (verification errors) in other FP benchmarks.

... For exactly this reason :-)

Thanks,
James


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
  2016-01-11 13:31   ` Dr. Philipp Tomsich
@ 2016-01-25 11:20   ` James Greenhalgh
  2016-02-01 14:00     ` James Greenhalgh
  2016-02-16  8:49   ` Marcus Shawcroft
  2 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-01-25 11:20 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes

On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> 
> Hi,
> 
> I've seen a couple of large performance issues caused by expanding
> the high-precision reciprocal square root for Cortex-A57, so I'd like
> to turn it off by default.
> 
> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> some private microbenchmark kernels which stress the divide/sqrt/multiply
> units. It therefore seems to me to be the correct choice to make across
> a number of workloads.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> 
> OK?

*Ping*

Thanks,
James

> ---
> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 1d5d898..999c9fc 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
>  };
>  
>  static const struct tune_params cortexa72_tunings =


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-11 11:53 [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt James Greenhalgh
                   ` (2 preceding siblings ...)
  2016-01-12  5:53 ` Kumar, Venkataramanan
@ 2016-01-25 11:21 ` James Greenhalgh
  2016-02-01 13:59   ` James Greenhalgh
  2016-02-16  8:40 ` Marcus Shawcroft
  4 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-01-25 11:21 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes

On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote:
> 
> Hi,
> 
> I'd like to switch the logic around in aarch64.c such that
> -mlow-precision-recip-sqrt causes us to always emit the low-precision
> software expansion for reciprocal square root. I have two reasons to do
> this; first is consistency across -mcpu targets, second is enabling more
> -mcpu targets to use the flag for peak tuning.
> 
> I don't much like that the precision we use for -mlow-precision-recip-sqrt
> differs between cores (and possibly compiler revisions). Yes, we're
> under -ffast-math but I take this flag to mean the user explicitly wants the
> low-precision expansion, and we should not diverge from that based on an
> internal decision as to what is optimal for performance in the
> high-precision case. I'd prefer to keep things as predictable as possible,
> and here that means always emitting the low-precision expansion when asked.
> 
> Judging by the comments in the thread proposing the reciprocal square
> root optimisation, this will benefit all cores currently supported by GCC.
> To be clear, we would still not expand in the high-precision case for any
> cores which do not explicitly ask for it. Currently that is Cortex-A57
> and xgene, though I will be proposing a patch to remove Cortex-A57 from
> that list shortly.
> 
> Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> is intended as a tuning flag for situations where performance is more
> important than precision, but the current logic requires setting an
> internal flag which also changes the performance characteristics where
> high-precision is needed. This conflates two decisions the target might
> want to make, and reduces the applicability of an option targets might
> want to enable for performance. In particular, I'd still like to see
> -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> sequence for floats under Cortex-A57.
> 
> Based on that reasoning, this patch makes the appropriate change to the
> logic. I've checked with the current -mcpu values to ensure that behaviour
> without -mlow-precision-recip-sqrt does not change, and that behaviour
> with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> 
> I've also put this through bootstrap and test on aarch64-none-linux-gnu
> with no issues.
> 
> OK?

*Ping*

Thanks,
James

> 2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> 	reciprocal sqrt for -mlow-precision-recip-sqrt.
> 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 9142ac0..1d5d898 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
>  {
>    return (!flag_trapping_math
>  	  && flag_unsafe_math_optimizations
> -	  && (aarch64_tune_params.extra_tuning_flags
> -	      & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> +	  && ((aarch64_tune_params.extra_tuning_flags
> +	       & AARCH64_EXTRA_TUNE_RECIP_SQRT)
> +	      || flag_mrecip_low_precision_sqrt));
>  }
>  
>  /* Function to decide when to use
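
To make the effect of the hunk above concrete, here is a small standalone model of the use_rsqrt_p condition before and after the change. The struct and its field names are hypothetical stand-ins for the GCC globals named in the diff (flag_trapping_math, flag_unsafe_math_optimizations, the AARCH64_EXTRA_TUNE_RECIP_SQRT bit of aarch64_tune_params.extra_tuning_flags, and flag_mrecip_low_precision_sqrt); it sketches the boolean logic only, not how GCC stores these values.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the globals tested in use_rsqrt_p.  */
struct rsqrt_opts
{
  bool trapping_math;             /* flag_trapping_math */
  bool unsafe_math;               /* flag_unsafe_math_optimizations */
  bool tune_recip_sqrt;           /* AARCH64_EXTRA_TUNE_RECIP_SQRT set by -mcpu */
  bool low_precision_recip_sqrt;  /* flag_mrecip_low_precision_sqrt */
};

/* Before the patch: only the per-core tuning bit enables the expansion.  */
static bool
use_rsqrt_p_before (const struct rsqrt_opts *o)
{
  return !o->trapping_math && o->unsafe_math && o->tune_recip_sqrt;
}

/* After the patch: -mlow-precision-recip-sqrt alone is also enough.  */
static bool
use_rsqrt_p_after (const struct rsqrt_opts *o)
{
  return !o->trapping_math && o->unsafe_math
         && (o->tune_recip_sqrt || o->low_precision_recip_sqrt);
}
```

Under -ffast-math on a core without the tuning bit, only the patched condition returns true when -mlow-precision-recip-sqrt is given. Without that option both versions agree, which matches the claim that behaviour without the flag does not change.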


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-25 11:21 ` James Greenhalgh
@ 2016-02-01 13:59   ` James Greenhalgh
  2016-02-08 10:57     ` James Greenhalgh
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-02-01 13:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, marcus.shawcroft, richard.earnshaw

On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote:
> > 
> > Hi,
> > 
> > I'd like to switch the logic around in aarch64.c such that
> > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > software expansion for reciprocal square root. I have two reasons to do
> > this; first is consistency across -mcpu targets, second is enabling more
> > -mcpu targets to use the flag for peak tuning.
> > 
> > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > differs between cores (and possibly compiler revisions). Yes, we're
> > under -ffast-math but I take this flag to mean the user explicitly wants the
> > low-precision expansion, and we should not diverge from that based on an
> > internal decision as to what is optimal for performance in the
> > high-precision case. I'd prefer to keep things as predictable as possible,
> > and here that means always emitting the low-precision expansion when asked.
> > 
> > Judging by the comments in the thread proposing the reciprocal square
> > root optimisation, this will benefit all cores currently supported by GCC.
> > To be clear, we would still not expand in the high-precision case for any
> > cores which do not explicitly ask for it. Currently that is Cortex-A57
> > and xgene, though I will be proposing a patch to remove Cortex-A57 from
> > that list shortly.
> > 
> > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> > is intended as a tuning flag for situations where performance is more
> > important than precision, but the current logic requires setting an
> > internal flag which also changes the performance characteristics where
> > high-precision is needed. This conflates two decisions the target might
> > want to make, and reduces the applicability of an option targets might
> > want to enable for performance. In particular, I'd still like to see
> > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> > sequence for floats under Cortex-A57.
> > 
> > Based on that reasoning, this patch makes the appropriate change to the
> > logic. I've checked with the current -mcpu values to ensure that behaviour
> > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > 
> > I've also put this through bootstrap and test on aarch64-none-linux-gnu
> > with no issues.
> > 
> > OK?
> 
> *Ping*

*Pingx2*

Thanks,
James

> 
> Thanks,
> James
> 
> > 2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>
> > 
> > 	* config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> > 	reciprocal sqrt for -mlow-precision-recip-sqrt.
> > 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 9142ac0..1d5d898 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
> >  {
> >    return (!flag_trapping_math
> >  	  && flag_unsafe_math_optimizations
> > -	  && (aarch64_tune_params.extra_tuning_flags
> > -	      & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> > +	  && ((aarch64_tune_params.extra_tuning_flags
> > +	       & AARCH64_EXTRA_TUNE_RECIP_SQRT)
> > +	      || flag_mrecip_low_precision_sqrt));
> >  }
> >  
> >  /* Function to decide when to use
> 


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-01-25 11:20   ` James Greenhalgh
@ 2016-02-01 14:00     ` James Greenhalgh
  2016-02-08 10:57       ` James Greenhalgh
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-02-01 14:00 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes

On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > 
> > Hi,
> > 
> > I've seen a couple of large performance issues caused by expanding
> > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > to turn it off by default.
> > 
> > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > units. It therefore seems to me to be the correct choice to make across
> > a number of workloads.
> > 
> > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > 
> > OK?
> 
> *Ping*

*pingx2*

Thanks,
James

> > ---
> > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> > 
> > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 1d5d898..999c9fc 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> >    0,	/* max_case_values.  */
> >    0,	/* cache_line_size.  */
> >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> >  };
> >  
> >  static const struct tune_params cortexa72_tunings =
> 


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-02-01 14:00     ` James Greenhalgh
@ 2016-02-08 10:57       ` James Greenhalgh
  2016-02-15 10:50         ` James Greenhalgh
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-02-08 10:57 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes

On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > > 
> > > Hi,
> > > 
> > > I've seen a couple of large performance issues caused by expanding
> > > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > > to turn it off by default.
> > > 
> > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > > Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> > > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > > units. It therefore seems to me to be the correct choice to make across
> > > a number of workloads.
> > > 
> > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > > 
> > > OK?
> > 
> > *Ping*
> 
> *pingx2*

*ping^3*

Thanks,
James

> > > ---
> > > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> > > 
> > > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > > 
> > 
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 1d5d898..999c9fc 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> > >    0,	/* max_case_values.  */
> > >    0,	/* cache_line_size.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> > >  };
> > >  
> > >  static const struct tune_params cortexa72_tunings =
> > 
> 


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-02-01 13:59   ` James Greenhalgh
@ 2016-02-08 10:57     ` James Greenhalgh
  2016-02-15 10:48       ` James Greenhalgh
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-02-08 10:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, marcus.shawcroft, richard.earnshaw

On Mon, Feb 01, 2016 at 01:59:34PM +0000, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote:
> > On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote:
> > > 
> > > Hi,
> > > 
> > > I'd like to switch the logic around in aarch64.c such that
> > > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > > software expansion for reciprocal square root. I have two reasons to do
> > > this; first is consistency across -mcpu targets, second is enabling more
> > > -mcpu targets to use the flag for peak tuning.
> > > 
> > > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > > differs between cores (and possibly compiler revisions). Yes, we're
> > > under -ffast-math but I take this flag to mean the user explicitly wants the
> > > low-precision expansion, and we should not diverge from that based on an
> > > internal decision as to what is optimal for performance in the
> > > high-precision case. I'd prefer to keep things as predictable as possible,
> > > and here that means always emitting the low-precision expansion when asked.
> > > 
> > > Judging by the comments in the thread proposing the reciprocal square
> > > root optimisation, this will benefit all cores currently supported by GCC.
> > > To be clear, we would still not expand in the high-precision case for any
> > > cores which do not explicitly ask for it. Currently that is Cortex-A57
> > > and xgene, though I will be proposing a patch to remove Cortex-A57 from
> > > that list shortly.
> > > 
> > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> > > is intended as a tuning flag for situations where performance is more
> > > important than precision, but the current logic requires setting an
> > > internal flag which also changes the performance characteristics where
> > > high-precision is needed. This conflates two decisions the target might
> > > want to make, and reduces the applicability of an option targets might
> > > want to enable for performance. In particular, I'd still like to see
> > > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> > > sequence for floats under Cortex-A57.
> > > 
> > > Based on that reasoning, this patch makes the appropriate change to the
> > > logic. I've checked with the current -mcpu values to ensure that behaviour
> > > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > > 
> > > I've also put this through bootstrap and test on aarch64-none-linux-gnu
> > > with no issues.
> > > 
> > > OK?
> > 
> > *Ping*
> 
> *Pingx2*

*Ping^3*

Thanks,
James

> > > 2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>
> > > 
> > > 	* config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> > > 	reciprocal sqrt for -mlow-precision-recip-sqrt.
> > > 
> > 
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 9142ac0..1d5d898 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
> > >  {
> > >    return (!flag_trapping_math
> > >  	  && flag_unsafe_math_optimizations
> > > -	  && (aarch64_tune_params.extra_tuning_flags
> > > -	      & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> > > +	  && ((aarch64_tune_params.extra_tuning_flags
> > > +	       & AARCH64_EXTRA_TUNE_RECIP_SQRT)
> > > +	      || flag_mrecip_low_precision_sqrt));
> > >  }
> > >  
> > >  /* Function to decide when to use
> > 
> 


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-02-08 10:57     ` James Greenhalgh
@ 2016-02-15 10:48       ` James Greenhalgh
  0 siblings, 0 replies; 21+ messages in thread
From: James Greenhalgh @ 2016-02-15 10:48 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, marcus.shawcroft, richard.earnshaw

On Mon, Feb 08, 2016 at 10:57:44AM +0000, James Greenhalgh wrote:
> On Mon, Feb 01, 2016 at 01:59:34PM +0000, James Greenhalgh wrote:
> > On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote:
> > > On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > I'd like to switch the logic around in aarch64.c such that
> > > > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > > > software expansion for reciprocal square root. I have two reasons to do
> > > > this; first is consistency across -mcpu targets, second is enabling more
> > > > -mcpu targets to use the flag for peak tuning.
> > > > 
> > > > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > > > differs between cores (and possibly compiler revisions). Yes, we're
> > > > under -ffast-math but I take this flag to mean the user explicitly wants the
> > > > low-precision expansion, and we should not diverge from that based on an
> > > > internal decision as to what is optimal for performance in the
> > > > high-precision case. I'd prefer to keep things as predictable as possible,
> > > > and here that means always emitting the low-precision expansion when asked.
> > > > 
> > > > Judging by the comments in the thread proposing the reciprocal square
> > > > root optimisation, this will benefit all cores currently supported by GCC.
> > > > To be clear, we would still not expand in the high-precision case for any
> > > > cores which do not explicitly ask for it. Currently that is Cortex-A57
> > > > and xgene, though I will be proposing a patch to remove Cortex-A57 from
> > > > that list shortly.
> > > > 
> > > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> > > > is intended as a tuning flag for situations where performance is more
> > > > important than precision, but the current logic requires setting an
> > > > internal flag which also changes the performance characteristics where
> > > > high-precision is needed. This conflates two decisions the target might
> > > > want to make, and reduces the applicability of an option targets might
> > > > want to enable for performance. In particular, I'd still like to see
> > > > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> > > > sequence for floats under Cortex-A57.
> > > > 
> > > > Based on that reasoning, this patch makes the appropriate change to the
> > > > logic. I've checked with the current -mcpu values to ensure that behaviour
> > > > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > > > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > > > 
> > > > I've also put this through bootstrap and test on aarch64-none-linux-gnu
> > > > with no issues.
> > > > 
> > > > OK?
> > > 
> > > *Ping*
> > 
> > *Pingx2*
> 
> *Ping^3*

*ping^4*

Thanks,
James

> > > > 2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>
> > > > 
> > > > 	* config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> > > > 	reciprocal sqrt for -mlow-precision-recip-sqrt.
> > > > 
> > > 
> > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > > index 9142ac0..1d5d898 100644
> > > > --- a/gcc/config/aarch64/aarch64.c
> > > > +++ b/gcc/config/aarch64/aarch64.c
> > > > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
> > > >  {
> > > >    return (!flag_trapping_math
> > > >  	  && flag_unsafe_math_optimizations
> > > > -	  && (aarch64_tune_params.extra_tuning_flags
> > > > -	      & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> > > > +	  && ((aarch64_tune_params.extra_tuning_flags
> > > > +	       & AARCH64_EXTRA_TUNE_RECIP_SQRT)
> > > > +	      || flag_mrecip_low_precision_sqrt));
> > > >  }
> > > >  
> > > >  /* Function to decide when to use
> > > 
> > 
> 


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-02-08 10:57       ` James Greenhalgh
@ 2016-02-15 10:50         ` James Greenhalgh
  2016-02-15 17:25           ` Evandro Menezes
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-02-15 10:50 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov, e.menezes

On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
> On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> > On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> > > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > I've seen a couple of large performance issues caused by expanding
> > > > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > > > to turn it off by default.
> > > > 
> > > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > > > Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> > > > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > > > units. It therefore seems to me to be the correct choice to make across
> > > > a number of workloads.
> > > > 
> > > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > > > 
> > > > OK?
> > > 
> > > *Ping*
> > 
> > *pingx2*
> 
> *ping^3*

*ping^4*

Thanks,
James

> > > > ---
> > > > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> > > > 
> > > > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > > > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > > > 
> > > 
> > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > > index 1d5d898..999c9fc 100644
> > > > --- a/gcc/config/aarch64/aarch64.c
> > > > +++ b/gcc/config/aarch64/aarch64.c
> > > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> > > >    0,	/* max_case_values.  */
> > > >    0,	/* cache_line_size.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> > > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> > > >  };
> > > >  
> > > >  static const struct tune_params cortexa72_tunings =
> > > 
> > 
> 


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-02-15 10:50         ` James Greenhalgh
@ 2016-02-15 17:25           ` Evandro Menezes
  2016-02-16 10:28             ` James Greenhalgh
  0 siblings, 1 reply; 21+ messages in thread
From: Evandro Menezes @ 2016-02-15 17:25 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches
  Cc: nd, marcus.shawcroft, richard.earnshaw, Venkataramanan.Kumar,
	philipp.tomsich, pinskia, Kyrylo.Tkachov

On 02/15/16 04:50, James Greenhalgh wrote:
> On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
>> On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
>>> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
>>>> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
>>>>> Hi,
>>>>>
>>>>> I've seen a couple of large performance issues caused by expanding
>>>>> the high-precision reciprocal square root for Cortex-A57, so I'd like
>>>>> to turn it off by default.
>>>>>
>>>>> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
>>>>> Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
>>>>> some private microbenchmark kernels which stress the divide/sqrt/multiply
>>>>> units. It therefore seems to me to be the correct choice to make across
>>>>> a number of workloads.
>>>>>
>>>>> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
>>>>>
>>>>> OK?
>>>> *Ping*
>>> *pingx2*
>> *ping^3*
> *ping^4*
>
> Thanks,
> James
>
>>>>> ---
>>>>> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>>>>>
>>>>> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
>>>>> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
>>>>>
>>>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>>>> index 1d5d898..999c9fc 100644
>>>>> --- a/gcc/config/aarch64/aarch64.c
>>>>> +++ b/gcc/config/aarch64/aarch64.c
>>>>> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
>>>>>     0,	/* max_case_values.  */
>>>>>     0,	/* cache_line_size.  */
>>>>>     tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>>>>> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
>>>>> -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
>>>>> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
>>>>>   };
>>>>>   
>>>>>   static const struct tune_params cortexa72_tunings =
>

James,

There seem to be SPEC CPU2000fp validation issues on A57 when this flag
is present too.  I had evaluated the algorithm against a huge random set
of values and it always delivered accuracy of around 1ulp, which should
be enough for CPU2000fp (with x86-64), so I expected the benchmarks to pass.

My suspicion is that the Newton series on AArch64 is probably good only 
for SP.  Then, DP might require an extra round, probably exacerbating 
the performance penalty.

I'd like to try to split this tuning option into one for SP and another 
for DP.  Thoughts?
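
The suspicion that DP needs an extra round can be checked with the standard error recurrence for Newton-Raphson reciprocal square root. Writing x = (1 + e) / sqrt(d) and substituting into x' = x (3 - d x^2) / 2 gives e' = -(3/2) e^2 - (1/2) e^3, so each step roughly doubles the number of correct bits. The sketch below assumes an initial estimate good to about 8 bits (an assumption about FRSQRTE, not something stated in this thread) and works in exact-arithmetic terms, ignoring the rounding that every floating-point step adds, so its bit counts are upper bounds. The helper names are hypothetical.

```c
/* Relative error after one Newton-Raphson rsqrt step, given the
   relative error E before it: e' = -(3/2) e^2 - (1/2) e^3.
   Exact-arithmetic model; floating-point rounding is ignored.  */
static double
next_error (double e)
{
  return -1.5 * e * e - 0.5 * e * e * e;
}

/* floor (-log2 |E|): the number of correct leading bits implied by a
   relative error E, computed without libm.  */
static int
correct_bits (double e)
{
  int bits = 0;

  if (e < 0)
    e = -e;
  if (e == 0.0)
    return 1100;  /* Exact result: more bits than any double can hold.  */
  while (e <= 0.5)
    {
      e *= 2.0;
      bits++;
    }
  return bits;
}

/* Correct bits after STEPS refinements of an estimate with relative
   error E0.  Compare with 24 bits for float and 53 for double.  */
static int
bits_after_steps (double e0, int steps)
{
  double e = e0;

  for (int i = 0; i < steps; i++)
    e = next_error (e);
  return correct_bits (e);
}
```

Starting from e = 2^-8, bits_after_steps reports 8, 15, 30 and 59 bits after 0 to 3 steps. Against the 24-bit float and 53-bit double significands, that is consistent with 2 steps for float and 3 for double in the default expansion, and with the one-step-fewer low-precision variant falling well short of full precision in both modes.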

Thank you,

-- 
Evandro Menezes


* Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt
  2016-01-11 11:53 [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt James Greenhalgh
                   ` (3 preceding siblings ...)
  2016-01-25 11:21 ` James Greenhalgh
@ 2016-02-16  8:40 ` Marcus Shawcroft
  4 siblings, 0 replies; 21+ messages in thread
From: Marcus Shawcroft @ 2016-02-16  8:40 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: gcc-patches, nd, Marcus Shawcroft, Richard Earnshaw,
	Venkataramanan.Kumar, philipp.tomsich, pinskia, Kyrill Tkachov,
	e.menezes

On 11 January 2016 at 11:53, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>

> ---
> 2015-12-10  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * config/aarch64/aarch64.c (use_rsqrt_p): Always use software
>         reciprocal sqrt for -mlow-precision-recip-sqrt.
>

OK /Marcus


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
  2016-01-11 13:31   ` Dr. Philipp Tomsich
  2016-01-25 11:20   ` James Greenhalgh
@ 2016-02-16  8:49   ` Marcus Shawcroft
  2 siblings, 0 replies; 21+ messages in thread
From: Marcus Shawcroft @ 2016-02-16  8:49 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: gcc-patches, nd, Marcus Shawcroft, Richard Earnshaw,
	Venkataramanan.Kumar, philipp.tomsich, pinskia, Kyrill Tkachov,
	e.menezes

On 11 January 2016 at 12:04, James Greenhalgh <james.greenhalgh@arm.com> wrote:

> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * config/aarch64/aarch64.c (cortexa57_tunings): Remove
>         AARCH64_EXTRA_TUNE_RECIP_SQRT.
>

OK /Marcus


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-02-15 17:25           ` Evandro Menezes
@ 2016-02-16 10:28             ` James Greenhalgh
  2016-02-16 20:46               ` Evandro Menezes
  0 siblings, 1 reply; 21+ messages in thread
From: James Greenhalgh @ 2016-02-16 10:28 UTC (permalink / raw)
  To: Evandro Menezes
  Cc: gcc-patches, nd, marcus.shawcroft, richard.earnshaw,
	Venkataramanan.Kumar, philipp.tomsich, pinskia, Kyrylo.Tkachov

On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote:
> On 02/15/16 04:50, James Greenhalgh wrote:
> >On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
> >>On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> >>>On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> >>>>On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> >>>>>Hi,
> >>>>>
> >>>>>I've seen a couple of large performance issues caused by expanding
> >>>>>the high-precision reciprocal square root for Cortex-A57, so I'd like
> >>>>>to turn it off by default.
> >>>>>
> >>>>>This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> >>>>>Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> >>>>>some private microbenchmark kernels which stress the divide/sqrt/multiply
> >>>>>units. It therefore seems to me to be the correct choice to make across
> >>>>>a number of workloads.
> >>>>>
> >>>>>Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> >>>>>
> >>>>>OK?
> >>>>*Ping*
> >>>*pingx2*
> >>*ping^3*
> >*ping^4*
> >
> >Thanks,
> >James
> >
> >>>>>---
> >>>>>2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> >>>>>
> >>>>>	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> >>>>>	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> >>>>>
> >>>>>diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >>>>>index 1d5d898..999c9fc 100644
> >>>>>--- a/gcc/config/aarch64/aarch64.c
> >>>>>+++ b/gcc/config/aarch64/aarch64.c
> >>>>>@@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> >>>>>    0,	/* max_case_values.  */
> >>>>>    0,	/* cache_line_size.  */
> >>>>>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> >>>>>-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> >>>>>-   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> >>>>>+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> >>>>>  };
> >>>>>  static const struct tune_params cortexa72_tunings =
> >
> 
> James,
> 
> There seem to be SPEC CPU2000fp validation issues on A57 when this
> flag is present too.  Though I evaluated the algorithm with a huge
> random set of values, always delivering accuracy around 1ulp, which
> should be enough for CPU2000fp (with x86-64), I expected the
> benchmarks to pass.
> 
> My suspicion is that the Newton series on AArch64 is probably good
> only for SP.  Then, DP might require an extra round, probably
> exacerbating the performance penalty.
> 
> I'd like to try to split this tuning option into one for SP and
> another for DP.  Thoughts?

I haven't seen validation issues with the default expansion, but with
-mlow-precision-recip-sqrt I do see failures. I think this is to be
expected. I don't support splitting the low-precision flag into
"-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt";
I think that is pushing a particular set of Spec tuning flags over any
meaningful use case.

I could imagine a case for splitting the internal tuning flag to give
AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but
I'm not sure I understand the benefits of this? Certainly, I think your goals
for performance (turn on for 64-bit divide/sqrt) would contradict your goals
for accuracy (turn off for 64-bit divide/sqrt).
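For comparison, the internal split described above might look roughly like the sketch below. Only the two flag names come from this message. The bit values, the `fp_mode` enum, and `use_recip_sqrt_expansion` are hypothetical and do not reflect GCC's actual tuning code. The `low_precision` parameter models -mlow-precision-recip-sqrt forcing the expansion, as proposed at the head of this thread.

```c
#include <stdbool.h>

/* Hypothetical bit values; only the names come from the discussion. */
enum {
    AARCH64_EXTRA_TUNE_SF_RECIP_SQRT = 1 << 0,
    AARCH64_EXTRA_TUNE_DF_RECIP_SQRT = 1 << 1
};

/* Hypothetical stand-in for GCC's SFmode/DFmode. */
enum fp_mode { FP_SINGLE, FP_DOUBLE };

/* Decide whether to emit the software reciprocal square root
   expansion for a given mode. `low_precision` models the command-line
   flag, which forces the expansion regardless of per-core tuning. */
static bool use_recip_sqrt_expansion(unsigned tune_flags,
                                     enum fp_mode mode,
                                     bool low_precision)
{
    if (low_precision)
        return true;
    unsigned bit = (mode == FP_DOUBLE)
                       ? AARCH64_EXTRA_TUNE_DF_RECIP_SQRT
                       : AARCH64_EXTRA_TUNE_SF_RECIP_SQRT;
    return (tune_flags & bit) != 0;
}
```

Under a scheme like this, a core could enable the expansion for float while leaving double on the default path. That is the performance-versus-accuracy trade-off being questioned here.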

I'm happy with these flags as they are, but I might be missing a subtlety in
your argument?

Thanks,
James


* Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning
  2016-02-16 10:28             ` James Greenhalgh
@ 2016-02-16 20:46               ` Evandro Menezes
  0 siblings, 0 replies; 21+ messages in thread
From: Evandro Menezes @ 2016-02-16 20:46 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: gcc-patches, nd, marcus.shawcroft, richard.earnshaw,
	Venkataramanan.Kumar, philipp.tomsich, pinskia, Kyrylo.Tkachov

On 02/16/16 04:28, James Greenhalgh wrote:
> On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote:
>> James,
>>
>> There seem to be SPEC CPU2000fp validation issues on A57 when this
>> flag is present too.  Having evaluated the algorithm with a huge
>> random set of values, always delivering accuracy around 1ulp, which
>> should be enough for CPU2000fp (with x86-64), I expected the
>> benchmarks to pass.
>>
>> My suspicion is that the Newton series on AArch64 is probably good
>> only for SP.  Then, DP might require an extra round, probably
>> exacerbating the performance penalty.
>>
>> I'd like to try to split this tuning option into one for SP and
>> another for DP.  Thoughts?
> I haven't seen validation issues with the default expansion, but with
> -mlow-precision-recip-sqrt I do see failures. I think this is to be
> expected. I don't support splitting the low-precision flag into
> "-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt";
> I think that is pushing a particular set of Spec tuning flags over any
> meaningful use case.
>
> I could imagine a case for splitting the internal tuning flag to give
> AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but
> I'm not sure I understand the benefits of this? Certainly, I think your goals
> for performance (turn on for 64-bit divide/sqrt) would contradict your goals
> for accuracy (turn off for 64-bit divide/sqrt).
>
> I'm happy with these flags as they are, but I might be missing a subtlety in
> your argument?

James,

I'm still sorting out the data, but, indeed, I see no validation 
issues with the approximate reciprocal square root in CPU2000.

However, while working on a patch to use the Newton series for square root 
based on the approximate reciprocal square root (d^(1/2) = d * d^(-1/2)), I 
stumbled upon validation errors.  I'll take the discussion to that thread.

Stand by...

-- 
Evandro Menezes


end of thread

Thread overview: 21+ messages
2016-01-11 11:53 [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt James Greenhalgh
2016-01-11 12:05 ` [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning James Greenhalgh
2016-01-11 13:31   ` Dr. Philipp Tomsich
2016-01-25 11:20   ` James Greenhalgh
2016-02-01 14:00     ` James Greenhalgh
2016-02-08 10:57       ` James Greenhalgh
2016-02-15 10:50         ` James Greenhalgh
2016-02-15 17:25           ` Evandro Menezes
2016-02-16 10:28             ` James Greenhalgh
2016-02-16 20:46               ` Evandro Menezes
2016-02-16  8:49   ` Marcus Shawcroft
2016-01-11 22:58 ` [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt Evandro Menezes
2016-01-12 11:32   ` James Greenhalgh
2016-01-12 11:44     ` Kyrill Tkachov
2016-01-12  5:53 ` Kumar, Venkataramanan
2016-01-12 11:48   ` James Greenhalgh
2016-01-25 11:21 ` James Greenhalgh
2016-02-01 13:59   ` James Greenhalgh
2016-02-08 10:57     ` James Greenhalgh
2016-02-15 10:48       ` James Greenhalgh
2016-02-16  8:40 ` Marcus Shawcroft
