From mboxrd@z Thu Jan  1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
From: Evandro Menezes <e.menezes@samsung.com>
To: "'Dr. Philipp Tomsich'" <philipp.tomsich@theobroma-systems.com>, 'James Greenhalgh' <james.greenhalgh@arm.com>
Cc: "'Kumar, Venkataramanan'" <Venkataramanan.Kumar@amd.com>, pinskia@gmail.com, 'Benedikt Huber' <benedikt.huber@theobroma-systems.com>, gcc-patches@gcc.gnu.org, 'Marcus Shawcroft' <Marcus.Shawcroft@arm.com>, 'Ramana Radhakrishnan' <ramrad01@arm.com>, 'Richard Earnshaw' <rearnsha@arm.com>
References: <1434629045-24650-1-git-send-email-benedikt.huber@theobroma-systems.com> <8B73CF78-11D4-4963-A60A-E1C2A3B219E2@gmail.com> <F2FF9755-1DF9-4000-8602-77AB12077240@theobroma-systems.com> <7794A52CE4D579448B959EED7DD0A4723DD10430@satlexdag06.amd.com> <1E4680F0-02C8-4999-958C-8B531BC850DA@theobroma-systems.com> <7794A52CE4D579448B959EED7DD0A4723DD104AF@satlexdag06.amd.com> <08D3EBD5-B67B-4D97-9940-3CAE6D020DC6@gmail.com> <7794A52CE4D579448B959EED7DD0A4723DD109D3@satlexdag06.amd.com> <1FEA8C0A-15E0-4309-B10D-B45032A68306@theobroma-systems.com> <7794A52CE4D579448B959EED7DD0A4723DD10A1C@satlexdag06.amd.com> <20150629113635.GA14400@arm.com> <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
In-reply-to: <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
Date: Mon, 13 Jul 2015 19:09:00 -0000
Message-id: <06ba01d0bd9f$6152a400$23f7ec00$@samsung.com>
MIME-version: 1.0
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2015-07/txt/msg01077.txt.bz2

FWIW, I was curious about the precision of the results when using such
instructions for the standard sqrt{,f} functions.  This is not a wide sample,
but it does point to a floor of 3 series iterations for DP and 2 for SP:

x               sqrt(x)         1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
2.2251e-308     1.4917e-154     1.4917e-154 (999)       1.4917e-154 (999)       1.4917e-154 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (999)        4.0027e-10 (999)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (999)        1.4142e+00 (999)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (999)        1.5000e+00 (999)        1.5000e+00 (000)
2.5600e+00      1.6000e+00      1.6000e+00 (000)        1.6000e+00 (000)        1.6000e+00 (000)
3.1416e+00      1.7725e+00      1.7725e+00 (999)        1.7725e+00 (999)        1.7725e+00 (000)
6.0221e+23      7.7602e+11      7.7602e+11 (999)        7.7602e+11 (999)        7.7602e+11 (000)
1.7977e+308     1.3408e+154     1.3408e+154 (000)       1.3408e+154 (000)       1.3408e+154 (000)

x               sqrtf(x)        1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
1.1755e-38      1.0842e-19      1.0842e-19 (096)        1.0842e-19 (000)        1.0842e-19 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (008)        4.0027e-10 (000)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (096)        1.0000e+00 (000)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (094)        1.0000e+00 (001)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (146)        1.4142e+00 (001)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (018)        1.5000e+00 (000)        1.5000e+00 (001)
2.5600e+00      1.6000e+00      1.6000e+00 (001)        1.6000e+00 (001)        1.6000e+00 (001)
3.1416e+00      1.7725e+00      1.7725e+00 (006)        1.7725e+00 (001)        1.7725e+00 (001)
6.0221e+23      7.7602e+11      7.7602e+11 (069)        7.7602e+11 (001)        7.7602e+11 (000)
3.4028e+38      1.8447e+19      1.8447e+19 (000)        1.8447e+19 (000)        1.8447e+19 (000)

In the tables above, the error in ULPs saturates at 999.
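
For reference, an error in ULPs with the same saturation at 999 can be
computed along the following lines.  This is only a minimal sketch and not
the code that produced the tables: the helper names ordered_bits and
ulp_error are made up for illustration, and it handles double only (a float
version would use 32-bit integers in the same way).

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Map a double onto a monotonically ordered 64-bit integer, so that
   adjacent representable values differ by exactly 1 and +0.0 and -0.0
   both map to 0.  (Hypothetical helper, for illustration only.)  */
static int64_t ordered_bits(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);            /* reinterpret bits without aliasing UB */
    return i < 0 ? INT64_MIN - i : i;    /* sign-magnitude -> ordered integer */
}

/* Distance between approx and exact in units in the last place,
   saturated at 999.  NaNs are not supported.  */
static unsigned ulp_error(double approx, double exact)
{
    assert(!isnan(approx) && !isnan(exact));
    int64_t a = ordered_bits(approx);
    int64_t b = ordered_bits(exact);
    /* Subtract as unsigned so that a large distance cannot overflow.  */
    uint64_t diff = a > b ? (uint64_t)a - (uint64_t)b
                          : (uint64_t)b - (uint64_t)a;
    return diff > 999 ? 999u : (unsigned)diff;
}
```

Calling ulp_error(estimate, sqrt(x)) for each refined estimate would give one
of the figures in parentheses, under the assumptions stated above.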

Having to use so many iterations to achieve accuracy would defeat the purpose
of using the Newton series, as it would likely be slower than the FSQRT
instruction.

Unlike on x86, I have the impression that the initial estimate in AArch64 is
meant to be used in applications that do not require precision, such as
graphics.  In that case, a single series iteration for SP would perhaps be
good enough.

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 6:45
> To: James Greenhalgh
> Cc: Kumar, Venkataramanan; pinskia@gmail.com; Benedikt Huber;
> gcc-patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
>
> James,
>
> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >
> > On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> >>
> >>> -----Original Message-----
> >>> From: Dr. Philipp Tomsich
> >>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>> Sent: Monday, June 29, 2015 2:17 PM
> >>> To: Kumar, Venkataramanan
> >>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>> (rsqrt) estimation in -ffast-math
> >>>
> >>> Kumar,
> >>>
> >>> This is not unexpected, as the initial estimation and each
> >>> iteration will add an architecturally-defined number of bits of
> >>> precision (ARMv8 guarantees only a minimum number of bits provided
> >>> per operation… the exact number is specific to each micro-arch, though).
> >>> Depending on your architecture and on the number of precise bits
> >>> required by any given benchmark, one may see miscompares.
> >>
> >> True.
> >
> > I would be very uncomfortable with this approach.
>
> Same here. The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with choosing the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
>
> > From Richard Biener's post in the thread that Michael Matz linked
> > earlier in this thread:
> >
> >    It would follow existing practice of things we allow in
> >    -funsafe-math-optimizations.  Existing practice in that we
> >    want to allow -ffast-math use with common benchmarks we care
> >    about.
> >
> >    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >
> > With the solution you seem to be converging on (2-steps for some
> > microarchitectures, 3 for others), a binary generated for one
> > micro-arch may drop below a minimum guarantee of precision when run on
> > another. This seems to go against the spirit of the practice above. I
> > would only support adding this optimization to -Ofast if we could keep
> > to architectural guarantees of precision in the generated code (i.e.
> > 3-steps everywhere).
> >
> > I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> > which would be off by default, would enable the 2-step mode, and would
> > need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I
> > don't see what this buys you beyond the Gromacs boost (and even there
> > you would be creating an Invalid Run as optimization flags must be
> > applied across all workloads).
>
> Any flag that reduces precision (and thus breaks IEEE floating-point
> semantics) needs to be gated with an “unsafe” flag (i.e. one that is never on
> by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
> anyone else would.
>
> > For the 3-step optimization, it is clear to me that for "generic"
> > tuning we don't want this to be enabled by default; experimental
> > results and advice in this thread argue against it for thunderx and
> > cortex-a57 targets.
> > However, enabling it based on the CPU tuning selected seems fine to me.
>
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and
> 5 iterations respectively) become the default. Most “server-type” chips
> should not see a performance regression, while it will be easier to optimise
> for this in hardware than for a (potentially microcoded) sqrt-instruction
> (and subsequent, dependent divide).
>
> I have not heard anyone claim a performance regression (either on thunderx or
> on cortex-a57), but merely heard a “no speed-up”.
>
> So I am strongly in favor of defaulting to the ‘safe’ number of iterations,
> even when compiling for a generic target.
>
> Best,
> Philipp.