Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.

public inbox for gcc-help@gcc.gnu.org

* Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.
@ 2020-02-25 15:27 Wilco Dijkstra
  2020-02-26 13:01 ` Bu Le
  0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2020-02-25 15:27 UTC (permalink / raw)
  To: gcc-help, cityubule

Hi,

> I found that the -mlow-precision-div option has a fixed number of Newton iterations,
> which is 2 for float type and 3 for double type.
>
> I noticed that if I alter the number of Newton iterations as follows, it could lead
> to faster performance in the SPEC2017 fpspeed test on AArch64, with less but
> acceptable precision.

Which CPU did you try this on? Those results look suspicious - lbm hardly does any
divisions for example, so either the computation has gone wrong due to the lower
accuracy or your CPU has a really slow divide...

On modern cores it is faster to do a division than to use the division approximation
instructions. Eg. on Neoverse N1 a float division takes at most 10 cycles while the
reduced approximation takes 13 cycles (and needs 3 extra instructions which take up
decode and issue slots).

Cheers,
Wilco

* Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.
  2020-02-25 15:27 [AArch64][Spec2017]Question about mlow-precision-div optimization Wilco Dijkstra
@ 2020-02-26 13:01 ` Bu Le
  2020-02-27 22:08   ` Wilco Dijkstra
  0 siblings, 1 reply; 7+ messages in thread
From: Bu Le @ 2020-02-26 13:01 UTC (permalink / raw)
  To: Wilco Dijkstra, gcc-help

Hi,

Thanks for the reply.

The data I presented were acquired on a Cortex-A57 CPU.

Since SPEC2017 checks results and produces a test report that flags miscomputed cases, I suppose the performance improvement is valid.

Your point that fdiv is faster than the reciprocal approximation on some modern CPUs is a new aspect I hadn't come across.

Nevertheless, on a CPU where reciprocal approximation pays off, as in my case, may I ask why the number of Newton iterations is fixed at 2 and 3?

And do you think it is worth providing a parameter to alter the iteration count, so that accuracy can be traded off against speed?

By the way, the original data are as follows.

Test case        | Improvement
603.bwaves_s     | 7.92%
607.cactuBSSN_s  | Output miscompare
619.lbm_s        | 32.34%
621.wrf_s        | Output miscompare
627.cam4_s       | Output miscompare
628.pop2_s       | Output miscompare
638.imagick_s    | -0.97%
644.nab_s        | 9.09%
649.fotonik3d_s  | Output miscompare
654.roms_s       | -3.45%

------------------ Original ------------------
From: "Wilco Dijkstra" <Wilco.Dijkstra@arm.com>
Date: Mon, Feb 24, 2020 08:59 PM
To: "gcc-help@gcc.gnu.org" <gcc-help@gcc.gnu.org>; "Bu Le" <cityubule@qq.com>

Subject: Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.

Hi,

> I found that the -mlow-precision-div option has a fixed number of Newton iterations,
> which is 2 for float type and 3 for double type.
>
> I noticed that if I alter the number of Newton iterations as follows, it could lead
> to faster performance in the SPEC2017 fpspeed test on AArch64, with less but
> acceptable precision.

Which CPU did you try this on? Those results look suspicious - lbm hardly does any
divisions for example, so either the computation has gone wrong due to the lower
accuracy or your CPU has a really slow divide...

On modern cores it is faster to do a division than to use the division approximation
instructions. Eg. on Neoverse N1 a float division takes at most 10 cycles while the
reduced approximation takes 13 cycles (and needs 3 extra instructions which take up
decode and issue slots).

Cheers,
Wilco

* Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.
  2020-02-26 13:01 ` Bu Le
@ 2020-02-27 22:08   ` Wilco Dijkstra
  2020-03-03 16:22     ` Richard Sandiford
  0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2020-02-27 22:08 UTC (permalink / raw)
  To: Bu Le, gcc-help

Hi,

> The data I presented were acquired on a Cortex-A57 CPU.

> Your point that fdiv is faster than the reciprocal approximation on some modern
> CPUs is a new aspect I hadn't come across.

Well on Cortex-A57 division is also faster, eg. lbm_r is ~3% slower using reciprocal divide.

> And do you think it is worth providing a parameter to alter the iteration count,
> so that accuracy can be traded off against speed?

What do you mean? We already have -mlow-precision-div (and -sqrt/-recip-sqrt).

> Since SPEC2017 checks results and produces a test report that flags miscomputed cases,
> I suppose the performance improvement is valid.

Try perf stat to show instruction counts, and if they are not increasing due to the extra reciprocal
operations, the benchmark is running incorrectly even if it passes basic checks.

Cheers,
Wilco

* Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.
  2020-02-27 22:08   ` Wilco Dijkstra
@ 2020-03-03 16:22     ` Richard Sandiford
  2020-03-04 13:26       ` Wilco Dijkstra
       [not found]       ` <tencent_5C7FA4816F6BB9D3236327A73C9BA5A39105@qq.com>
  0 siblings, 2 replies; 7+ messages in thread
From: Richard Sandiford @ 2020-03-03 16:22 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Bu Le, gcc-help

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> And do you think it is worth providing a parameter to alter the iteration count,
>> so that accuracy can be traded off against speed?
>
> What do you mean? We already have -mlow-precision-div (and -sqrt/-recip-sqrt).

The suggestion was to have a parameter to control the number of steps,
rather than always use the values that are currently hard-coded into
aarch64.c.

That sounds OK in principle.  It would fix one of the downsides of the
current code, in which users can force reciprocal approximation to be
used at low precision, but can't force it to be used at the precisions
normally chosen by -mtune.

It's probably not worth promoting to a full -m option that in theory
would be supported for evermore.  But now that targets can define their
own --params, it might make sense to use --params here.

Thanks,
Richard

* Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.
  2020-03-03 16:22     ` Richard Sandiford
@ 2020-03-04 13:26       ` Wilco Dijkstra
       [not found]       ` <tencent_5C7FA4816F6BB9D3236327A73C9BA5A39105@qq.com>
  1 sibling, 0 replies; 7+ messages in thread
From: Wilco Dijkstra @ 2020-03-04 13:26 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Bu Le, gcc-help

Hi Richard,

> The suggestion was to have a parameter to control the number of steps,
> rather than always use the values that are currently hard-coded into
> aarch64.c.
>
> That sounds OK in principle.  It would fix one of the downsides of the
> current code, in which users can force reciprocal approximation to be
> used at low precision, but can't force it to be used at the precisions
> normally chosen by -mtune.

Well there are only 2 possible settings, so I don't see how a param would be
reasonable... However given there is no performance improvement for the low
precision versions, what is the point of adding new options for the even slower,
higher precision variant?!?

Cheers,
Wilco

[parent not found: <tencent_5C7FA4816F6BB9D3236327A73C9BA5A39105@qq.com>]

* Re: Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.
       [not found]       ` <tencent_5C7FA4816F6BB9D3236327A73C9BA5A39105@qq.com>
@ 2020-03-06 15:24         ` Richard Sandiford
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Sandiford @ 2020-03-06 15:24 UTC (permalink / raw)
  To: Bu Le; +Cc: Wilco Dijkstra, gcc-help

Hi,

Sorry for the slow reply, the last few days have been a bit hectic.

"Bu Le" <cityubule@qq.com> writes:
>>It's probably not worth promoting to a full -m option that in theory
>>would be supported for evermore.  But now that targets can define their
>>own --params, it might make sense to use --params here.
> Thanks for the reply.
> I tried the patch in the attachment, and it works as we expected. Do you mean
> something like this?
>
> A simple example:
> double foo(double a, double b) { return a / b; }
>
> -O2 -ffast-math -mlow-precision-div foo.c will give:
> foo:
>   frecpe d2, d1
>   frecps d3, d2, d1
>   fmul d2, d2, d3
>   frecps d3, d2, d1
>   fmul d2, d2, d0
>   fmul d0, d2, d3
>   ret
> -O2 -ffast-math -mlow-precision-div --param=aarch64-double-recp-precision=2
> foo.c results in one less step:
> foo:
>   frecpe d2, d1
>   frecps d3, d2, d1
>   fmul d2, d2, d0
>   fmul d0, d2, d3
>   ret
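
For readers who want to experiment with the two listings above without AArch64 hardware, here is a minimal C sketch of the same arithmetic. It rests on several assumptions: `recip_estimate` is a hypothetical stand-in for FRECPE that simply truncates 1/den to 8 fraction bits (it is not the hardware table), `recip_step` computes 2 - a * b as FRECPS does but ignores the instruction's rounding behaviour, and the function names are invented for illustration. It is not GCC's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FRECPE: 1/den truncated to 8 fraction bits,
   so the relative error is below 2^-8.  Not the real hardware estimate. */
static double recip_estimate(double den)
{
    double r = 1.0 / den;
    uint64_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= ~((UINT64_C(1) << 44) - 1);   /* keep sign, exponent, top 8 fraction bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* The Newton correction factor that FRECPS produces: 2 - a * b. */
static double recip_step(double a, double b)
{
    return 2.0 - a * b;
}

/* Approximate num / den with `steps` Newton steps (steps >= 1).
   As in the listings, the last step is folded into the multiplication
   by the numerator. */
double approx_div(double num, double den, int steps)
{
    assert(steps >= 1 && den != 0.0);

    double x = recip_estimate(den);       /* frecpe d2, d1        */
    for (int i = 1; i < steps; i++)
        x = x * recip_step(x, den);       /* frecps, then fmul    */

    double t = recip_step(x, den);        /* frecps d3, d2, d1    */
    return (x * num) * t;                 /* fmul d2, d2, d0; fmul d0, d2, d3 */
}
```

In this sketch `approx_div(a, b, 2)` corresponds to the first listing and `approx_div(a, b, 1)` to the second. Comparing either against `a / b` gives a rough feel for how much accuracy each dropped step costs.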

Yeah, this is the kind of thing I had in mind.  However, rather than
calculating the value here:

> diff -Nurp a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> --- a/gcc/config/aarch64/aarch64.c	2020-02-11 11:51:04.000000000 +0800
> +++ b/gcc/config/aarch64/aarch64.c	2020-03-04 23:01:16.600403598 +0800
> @@ -12851,8 +12851,8 @@ aarch64_emit_approx_div (rtx quo, rtx nu
>    rtx xrcp = gen_reg_rtx (mode);
>    emit_insn (gen_aarch64_frecpe (mode, xrcp, den));
>  
> -  /* Iterate over the series twice for SF and thrice for DF.  */
> -  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
> +  /* Iterate over the series twice for SF and thrice for DF by default.  */
> +  int iterations = (GET_MODE_INNER (mode) == DFmode) ? aarch64_double_recp_precision : aarch64_float_recp_precision;

and then decrementing it here:

>    /* Optionally iterate over the series once less for faster performance,
>       while sacrificing the accuracy.  */
>    if ((recp && flag_mrecip_low_precision_sqrt)
>        || (!recp && flag_mlow_precision_sqrt))
>      iterations--;

it might be better to keep the original 3 : 2 calculation above and
then override it with the param values:

    if ((recp && flag_mrecip_low_precision_sqrt)
        || (!recp && flag_mlow_precision_sqrt))
      iterations = ...param values...;

That way, the --param value reflects the actual number of steps.
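
The difference between the two shapes can be sketched outside of GCC as follows. This is only one reading of the suggestion: the enum, the function names, and passing the two param values as plain `int` arguments are all hypothetical, and the real code would read the flag and param variables directly.

```c
#include <assert.h>
#include <stdbool.h>

enum fp_mode { MODE_SF, MODE_DF };   /* float and double; hypothetical names */

/* Patch as posted: the param seeds the count and the low-precision
   flag then subtracts one from it. */
int steps_as_posted(enum fp_mode mode, bool low_precision,
                    int float_param, int double_param)
{
    int iterations = (mode == MODE_DF) ? double_param : float_param;
    if (low_precision)
        iterations--;
    return iterations;
}

/* Suggested shape: hard-coded 3 : 2 by default, with the param used
   verbatim once the low-precision path is selected. */
int steps_as_suggested(enum fp_mode mode, bool low_precision,
                       int float_param, int double_param)
{
    int iterations = (mode == MODE_DF) ? 3 : 2;
    if (low_precision)
        iterations = (mode == MODE_DF) ? double_param : float_param;
    assert(iterations >= 0);
    return iterations;
}
```

With the low-precision flag on and a double param of 2, the posted shape yields 1 step (matching the shorter listing quoted earlier), while the suggested shape yields 2, so the number on the command line is the number of steps actually emitted.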

Minor formatting point, but GCC code uses a maximum line length
of 80 characters, so the conventional way of formatting the
calculation above would be:

  int iterations = (GET_MODE_INNER (mode) == DFmode
		    ? aarch64_double_recp_precision
		    : aarch64_float_recp_precision);

> diff -Nurp a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> --- a/gcc/config/aarch64/aarch64.opt	2020-02-04 09:23:30.000000000 +0800
> +++ b/gcc/config/aarch64/aarch64.opt	2020-03-04 23:01:18.173777158 +0800
> @@ -262,3 +262,12 @@ Generate local calls to out-of-line atom
>  -param=aarch64-sve-compare-costs=
>  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
>  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> +
> +-param=aarch64-float-recp-precision=
> +Target Joined UInteger Var(aarch64_float_recp_precision) Init(2) IntegerRange(1, 5) Param
> +The number of Newton-iteration for calculating the reciprocal for float type. The precision of division is propotional to this param when division approximation is enabled. The default value is 2.
> +
> +-param=aarch64-double-recp-precision=
> +Target Joined UInteger Var(aarch64_double_recp_precision) Init(3) IntegerRange(1, 5) Param
> +The number of Newton-iteration for calculating the reciprocal for double type. The precision of division is propotional to this param when division approximation is enabled. The default value is 3.
> +

typo: s/propotional/proportional/.  Also, maybe
s/Newton-iteration/Newton iterations/

Looks good otherwise.  However, the patch is unfortunately big enough to
need a copyright assignment to the FSF.  Do you already have one on file
(either a personal or a corporate one, depending on your circumstances)?
If not, would you be willing to sign one?  I can send you the forms
off-list if so.

Thanks,
Richard

* [AArch64][Spec2017]Question about mlow-precision-div optimization.
@ 2020-02-23  4:06 Bu Le
  0 siblings, 0 replies; 7+ messages in thread
From: Bu Le @ 2020-02-23  4:06 UTC (permalink / raw)
  To: gcc-help

Hello world,

I found that the -mlow-precision-div option has a fixed number of Newton iterations, which is 2 for float type and 3 for double type.

I noticed that if I alter the number of Newton iterations as follows, it could lead to faster performance in the SPEC2017 fpspeed test on AArch64, with less but acceptable precision.

Before change:

frecpe  s2, s8
frecps  s4, s2, s8
fmul    s2, s2, s4
frecps  s4, s2, s8
fmul    s2, s2, s4
fmul    s10, s2

After change:

frecpe  s2, s8
frecps  s4, s2, s8
fmul    s2, s2, s4
fmul    s10, s2

The details of the improvement are shown below (with the number of Newton iterations changed to 1 for float and 2 for double):

Test case        | Improvement
603.bwaves_s     | 7.92%
607.cactuBSSN_s  | Output miscompare
619.lbm_s        | 32.34%
621.wrf_s        | Output miscompare
627.cam4_s       | Output miscompare
628.pop2_s       | Output miscompare
638.imagick_s    | -0.97%
644.nab_s        | 9.09%
649.fotonik3d_s  | Output miscompare
654.roms_s       | -3.45%

This may benefit the performance of some test cases that do not demand high precision.

Given that the precision of the division is already below what the IEEE standard requires when this option is on, why is the iteration count fixed at the magic numbers 2 and 3?

Should we provide a parameter so that users can alter this value according to their needs?
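
One way to see where 2 and 3 could come from is the error recurrence of the Newton step. The derivation below assumes the initial FRECPE estimate is good to roughly 8 bits; that figure is an assumption here and is not stated anywhere in this thread. The symbols d (the divisor), x_n (the n-th reciprocal estimate) and ε_n (its relative error) are introduced only for this sketch.

```latex
x_{n+1} = x_n\,(2 - d\,x_n), \qquad x_n = \frac{1 - \varepsilon_n}{d}
\;\Longrightarrow\;
x_{n+1} = \frac{(1 - \varepsilon_n)(1 + \varepsilon_n)}{d}
        = \frac{1 - \varepsilon_n^{2}}{d}

\varepsilon_0 \approx 2^{-8}
\;\Longrightarrow\;
\varepsilon_1 \approx 2^{-16},\quad
\varepsilon_2 \approx 2^{-32},\quad
\varepsilon_3 \approx 2^{-64}
```

Under that assumption, two steps cover the 24-bit significand of float and three cover the 53-bit significand of double, ignoring rounding in the intermediate operations. Dropping one step would leave about 16 good bits for float and about 32 for double.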
