RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

public inbox for gcc-patches@gcc.gnu.org

From: Evandro Menezes <e.menezes@samsung.com>
To: "'Kumar, Venkataramanan'" <Venkataramanan.Kumar@amd.com>,
	pinskia@gmail.com,
	"'Dr. Philipp Tomsich'" <philipp.tomsich@theobroma-systems.com>
Cc: 'James Greenhalgh' <james.greenhalgh@arm.com>,
	'Benedikt Huber' <benedikt.huber@theobroma-systems.com>,
	gcc-patches@gcc.gnu.org,
	'Marcus Shawcroft' <Marcus.Shawcroft@arm.com>,
	'Ramana Radhakrishnan' <ramrad01@arm.com>,
	'Richard Earnshaw' <rearnsha@arm.com>
Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
Date: Mon, 20 Jul 2015 15:58:00 -0000
Message-ID: <026e01d0c300$6db61450$49223cf0$@samsung.com>
In-Reply-To: <7794A52CE4D579448B959EED7DD0A4723DD1E464@satlexdag06.amd.com>

Hi, Venkat.

Since x^(1/2) = x * x^(-1/2), the Newton series for the reciprocal square root can also be used for the regular square root at the cost of one extra multiplication, as is done on x86.  That's what I was trying to estimate below.
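
To make the identity concrete, here is a minimal C sketch; it is not the code from the patch.  All function names are hypothetical.  The bit-level initial guess is only an assumed stand-in for the hardware estimate instruction (FRSQRTE), and rsqrt_step() mirrors the (3 - a*b)/2 form that FRSQRTS computes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed stand-in for the hardware estimate (FRSQRTE on AArch64):
   the well-known bit-level approximation, accurate to a few bits only. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton step for f(y) = 1/y^2 - x:  y' = y * (3 - x*y*y) / 2. */
static float rsqrt_step(float x, float y)
{
    return y * (3.0f - x * y * y) * 0.5f;
}

/* Reciprocal square root: estimate, then `steps` refinement steps. */
float rsqrt_newton(float x, int steps)
{
    assert(steps >= 0);
    float y = rsqrt_estimate(x);
    for (int i = 0; i < steps; i++)
        y = rsqrt_step(x, y);
    return y;
}

/* Square root via x^(1/2) = x * x^(-1/2): one extra multiplication. */
float sqrt_newton(float x, int steps)
{
    return x * rsqrt_newton(x, steps);
}
```

The sketch assumes a positive, finite, normal x.  Because its initial guess is cruder than a hardware estimate, the step counts it needs say nothing about the 2-step and 3-step figures discussed below.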

Cheers,

-- 
Evandro Menezes                              Austin, TX

> -----Original Message-----
> From: Kumar, Venkataramanan [mailto:Venkataramanan.Kumar@amd.com]
> Sent: Monday, July 20, 2015 2:53
> To: Evandro Menezes; pinskia@gmail.com; 'Dr. Philipp Tomsich'
> Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org; 'Marcus
> Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Hi,
> 
> I missed your email and noticed it this week.
> 
> What does column 2 test?  Are you trying to implement square root using the
> reciprocal estimate and step?
> 
> But reciprocal square root using the reciprocal estimate and step (2 steps
> for SP, 3 for DP) seems to be better than using fdiv and fsqrt in your case.
> 
> Regards,
> Venkat.
> 
> > -----Original Message-----
> > From: Evandro Menezes [mailto:e.menezes@samsung.com]
> > Sent: Wednesday, July 15, 2015 3:45 AM
> > To: Kumar, Venkataramanan; pinskia@gmail.com; 'Dr. Philipp Tomsich'
> > Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org;
> > 'Marcus Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> > I ran a simple test on A57 rev. 0, looping a million times around
> > sqrt{,f} and the respective series iterations with the values in the
> > sequence 1..1000000 and got these results:
> >
> > sqrt(x):        36593844/s      1/sqrt(x):      18283875/s
> > 3 Steps:        47922557/s      3 Steps:        49005194/s
> >
> > sqrtf(x):       143988480/s     1/sqrtf(x):     69516857/s
> > 2 Steps:        78740157/s      2 Steps:        80385852/s
> >
> > I'm a bit surprised that the 3-iteration series for DP is faster than
> > sqrt(), but not that it's much faster for the reciprocal of sqrt().
> > As for SP, the 2-iteration series is faster only for the reciprocal of
> > sqrtf().
> >
> > There might still be some legs to this patch in real-world cases, which
> > I'd like to investigate.
> >
> > --
> > Evandro Menezes                              Austin, TX
> >
> >
> > > -----Original Message-----
> > > From: gcc-patches-owner@gcc.gnu.org
> > > [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Kumar,
> > > Venkataramanan
> > > Sent: Monday, June 29, 2015 13:50
> > > To: pinskia@gmail.com; Dr. Philipp Tomsich
> > > Cc: James Greenhalgh; Benedikt Huber; gcc-patches@gcc.gnu.org;
> > > Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> > > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > > (rsqrt) estimation in -ffast-math
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > > > Sent: Monday, June 29, 2015 10:23 PM
> > > > To: Dr. Philipp Tomsich
> > > > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber;
> > > > gcc-patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan;
> > > > Richard Earnshaw
> > > > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > > > (rsqrt) estimation in -ffast-math
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > > > > <philipp.tomsich@theobroma-systems.com> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > >> On 29 Jun 2015, at 13:36, James Greenhalgh
> > > > >> <james.greenhalgh@arm.com> wrote:
> > > > >>
> > > > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar,
> > > > >>> Venkataramanan wrote:
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: Dr. Philipp Tomsich
> > > > >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> > > > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > > > >>>> To: Kumar, Venkataramanan
> > > > >>>> Cc: pinskia@gmail.com; Benedikt Huber;
> > > > >>>> gcc-patches@gcc.gnu.org
> > > > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square
> > > > >>>> root
> > > > >>>> (rsqrt) estimation in -ffast-math
> > > > >>>>
> > > > >>>> Kumar,
> > > > >>>>
> > > > >>>> This is not unexpected, as the initial estimation and
> > > > >>>> each iteration will add an architecturally-defined number of
> > > > >>>> bits of precision (ARMv8 guarantees only a minimum number of
> > > > >>>> bits provided per operation… the exact number is specific to
> > > > >>>> each micro-arch, though).
> > > > >>>> Depending on your architecture and on the required number of
> > > > >>>> precise bits by any given benchmark, one may see miscompares.
> > > > >>>
> > > > >>> True.
> > > > >>
> > > > >> I would be very uncomfortable with this approach.
> > > > >
> > > > > Same here. The default must be safe. Always.
> > > > > Unlike other architectures, we don’t have a problem with making
> > > > > the proper defaults for “safety”, as the ARMv8 ISA guarantees a
> > > > > minimum number of precise bits per iteration.
> > > > >
> > > > >> From Richard Biener's post in the thread Michael Matz linked
> > > > >> earlier in the thread:
> > > > >>
> > > > >>   It would follow existing practice of things we allow in
> > > > >>   -funsafe-math-optimizations.  Existing practice in that we
> > > > >>   want to allow -ffast-math use with common benchmarks we care
> > > > >>   about.
> > > > >>
> > > > >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > > > >>
> > > > >> With the solution you seem to be converging on (2-steps for
> > > > >> some microarchitectures, 3 for others), a binary generated for
> > > > >> one micro-arch may drop below a minimum guarantee of precision
> > > > >> when run on another. This seems to go against the spirit of the
> > > > >> practice above. I would only support adding this optimization
> > > > >> to -Ofast if we could keep to architectural guarantees of
> > > > >> precision in the generated code (i.e. 3-steps everywhere).
> > > > >>
> > > > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > > > >> option, which would be off by default, would enable the 2-step
> > > > >> mode, and would need to be explicitly enabled (i.e. not implied
> > > > >> by
> > > > >> -mcpu=foo) but I don't see what this buys you beyond the
> > > > >> Gromacs boost (and even there you would be creating an Invalid
> > > > >> Run as optimization flags must be applied across all workloads).
> > > > >
> > > > > Any flag that reduces precision (and thus breaks IEEE
> > > > > floating-point semantics) needs to be gated with an “unsafe”
> > > > > flag (i.e. one that is never on by default).
> > > > > As a consequence, the “peak”-tuning for SPEC will turn this on…
> > > > > but barely anyone else would.
> > > > >
> > > > >> For the 3-step optimization, it is clear to me that for "generic"
> > > > >> tuning we don't want this to be enabled by default; experimental
> > > > >> results and advice in this thread argue against it for
> > > > >> thunderx and cortex-a57 targets.
> > > > >> However, enabling it based on the CPU tuning selected seems
> > > > >> fine to me.
> > > > >
> > > > > I do not agree on this one, as I would like to see the safe form
> > > > > (i.e. 3 and 5 iterations respectively) to become the default. Most
> > > > > “server-type” chips should not see a performance regression,
> > > > > while it will be easier to optimise for this in hardware than
> > > > > for a (potentially microcoded) sqrt-instruction (and subsequent,
> > > > > dependent divide).
> > > > >
> > > > > I have not heard anyone claim a performance regression (either
> > > > > on thunderx or on cortex-a57), but merely heard a “no speed-up”.
> > > >
> > > > Actually it does regress performance on ThunderX; I just assumed
> > > > that when I said it was not going to be a win, it was taken as a
> > > > slowdown. It regresses gromacs by more than 10% on ThunderX, but I
> > > > can't remember how much, as I had someone else run it. The latency
> > > > difference is also over 40%; for example, single precision: 29
> > > > cycles with div (12) and sqrt (17) directly vs 42 cycles with the
> > > > rsqrte and 2 iterations of 2mul/rsqrts (double is 53 vs 60). That
> > > > is a huge difference right there. ThunderX has a fast div and a
> > > > fast sqrt for 32-bit and a reasonable one for double. So again,
> > > > this is not just not a win but rather a regression for ThunderX.
> > > > I suspect the same is true for cortex-a57.
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > >
> > > Yes, theoretically that should be true for the cortex-a57 case as well.
> > > But I believe hardware pipelining with instruction scheduling in the
> > > compiler helps a little for the gromacs case: ~3% to 4% with the
> > > original patch.
> > >
> > > I have not tested other FP benchmarks. As James said, a
> > > -mlow-precision-recip-sqrt flag, if allowed, can be used as a peak flag.
> > >
> > > > >
> > > > > So I am strongly in favor of defaulting to the ‘safe’ number of
> > > > > iterations, even when compiling for a generic target.
> > > > >
> > > > > Best,
> > > > > Philipp.
> > > > >
> > >
> > > Regards,
> > > Venkat.
