From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-401562-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 128123 invoked by alias); 29 Jun 2015 16:53:31 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 128081 invoked by uid 89); 29 Jun 2015 16:53:29 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.6 required=5.0 tests=BAYES_00,FREEMAIL_FROM,MIME_QP_LONG_LINE,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mail-pa0-f51.google.com
Received: from mail-pa0-f51.google.com (HELO mail-pa0-f51.google.com) (209.85.220.51) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Mon, 29 Jun 2015 16:53:27 +0000
Received: by pabvl15 with SMTP id vl15so108020446pab.1        for <gcc-patches@gcc.gnu.org>; Mon, 29 Jun 2015 09:53:25 -0700 (PDT)
X-Received: by 10.70.134.198 with SMTP id pm6mr34276825pdb.17.1435596805326;        Mon, 29 Jun 2015 09:53:25 -0700 (PDT)
Received: from ?IPv6:2602:304:cfd0:15a0:4947:fcf9:72f7:7ccf? ([2602:304:cfd0:15a0:4947:fcf9:72f7:7ccf])        by mx.google.com with ESMTPSA id cz1sm42752102pbc.84.2015.06.29.09.53.22        (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);        Mon, 29 Jun 2015 09:53:23 -0700 (PDT)
References: <1434629045-24650-1-git-send-email-benedikt.huber@theobroma-systems.com> <8B73CF78-11D4-4963-A60A-E1C2A3B219E2@gmail.com> <F2FF9755-1DF9-4000-8602-77AB12077240@theobroma-systems.com> <7794A52CE4D579448B959EED7DD0A4723DD10430@satlexdag06.amd.com> <1E4680F0-02C8-4999-958C-8B531BC850DA@theobroma-systems.com> <7794A52CE4D579448B959EED7DD0A4723DD104AF@satlexdag06.amd.com> <08D3EBD5-B67B-4D97-9940-3CAE6D020DC6@gmail.com> <7794A52CE4D579448B959EED7DD0A4723DD109D3@satlexdag06.amd.com> <1FEA8C0A-15E0-4309-B10D-B45032A68306@theobroma-systems.com> <7794A52CE4D579448B959EED7DD0A4723DD10A1C@satlexdag06.amd.com> <20150629113635.GA14400@arm.com> <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
Mime-Version: 1.0 (1.0)
In-Reply-To: <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
Content-Type: text/plain;	charset=utf-8
Content-Transfer-Encoding: quoted-printable
Message-Id: <326A6111-183B-4F72-BEF9-4FE1AA708DE4@gmail.com>
Cc: James Greenhalgh <james.greenhalgh@arm.com>, "Kumar, Venkataramanan" <Venkataramanan.Kumar@amd.com>, Benedikt Huber <benedikt.huber@theobroma-systems.com>, "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, Marcus Shawcroft <Marcus.Shawcroft@arm.com>, Ramana Radhakrishnan <ramrad01@arm.com>, Richard Earnshaw <rearnsha@arm.com>
From: pinskia@gmail.com
Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
Date: Mon, 29 Jun 2015 16:57:00 -0000
To: "Dr. Philipp Tomsich" <philipp.tomsich@theobroma-systems.com>
X-IsSubscribed: yes
X-SW-Source: 2015-06/txt/msg02097.txt.bz2


> On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> <philipp.tomsich@theobroma-systems.com> wrote:
>
> James,
>
>> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com>
>> wrote:
>>
>>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
>>>
>>>> -----Original Message-----
>>>> From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
>>>> Sent: Monday, June 29, 2015 2:17 PM
>>>> To: Kumar, Venkataramanan
>>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>>> estimation in -ffast-math
>>>>
>>>> Kumar,
>>>>
>>>> This does not come unexpected, as the initial estimation and each iteration
>>>> will add an architecturally-defined number of bits of precision (ARMv8
>>>> guarantees only a minimum number of bits provided per operation… the
>>>> exact number is specific to each micro-arch, though).
>>>> Depending on your architecture and on the required number of precise bits
>>>> by any given benchmark, one may see miscompares.
>>>
>>> True.
>>
>> I would be very uncomfortable with this approach.
>
> Same here. The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with making the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
>
>> From Richard Biener's post in the thread Michael Matz linked earlier
>> in the thread:
>>
>>   It would follow existing practice of things we allow in
>>   -funsafe-math-optimizations.  Existing practice in that we
>>   want to allow -ffast-math use with common benchmarks we care
>>   about.
>>
>>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
>>
>> With the solution you seem to be converging on (2-steps for some
>> microarchitectures, 3 for others), a binary generated for one micro-arch
>> may drop below a minimum guarantee of precision when run on another. This
>> seems to go against the spirit of the practice above. I would only support
>> adding this optimization to -Ofast if we could keep to architectural
>> guarantees of precision in the generated code (i.e. 3-steps everywhere).
>>
>> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
>> which would be off by default, would enable the 2-step mode, and would
>> need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
>> see what this buys you beyond the Gromacs boost (and even there you would
>> be creating an Invalid Run as optimization flags must be applied across
>> all workloads).
>
> Any flag that reduces precision (and thus breaks IEEE floating-point
> semantics) needs to be gated with an “unsafe” flag (i.e. one that is
> never on by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
> anyone else would.
>
>> For the 3-step optimization, it is clear to me that for "generic" tuning
>> we don't want this to be enabled by default; experimental results and
>> advice in this thread argue against it for thunderx and cortex-a57 targets.
>> However, enabling it based on the CPU tuning selected seems fine to me.
>
> I do not agree on this one, as I would like to see the safe form (i.e. 3
> and 5 iterations respectively) to become the default. Most “server-type”
> chips should not see a performance regression, while it will be easier to
> optimise for this in hardware than for a (potentially microcoded)
> sqrt-instruction (and subsequent, dependent divide).
>
> I have not heard anyone claim a performance regression (either on thunderx
> or on cortex-a57), but merely heard a “no speed-up”.

Actually, it does regress performance on ThunderX; I just assumed that
when I said it was not going to be a win, it was taken as a slowdown. It
regresses gromacs by more than 10% on ThunderX, but I can't remember
exactly how much, as I had someone else run it. The latency difference is
also over 40%. For example, in single precision: 29 cycles with div (12)
and sqrt (17) directly vs 42 cycles with the rsqrte and 2 iterations of
2 mul/rsqrts (double is 53 vs 60). That is a huge difference right there.
ThunderX has a fast div and a fast sqrt for 32-bit and a reasonable one
for double. So again, this is not just not a win but rather a regression
for ThunderX. I suspect the same is true for cortex-a57.
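
To make the trade-off concrete, here is a minimal Python sketch of the
estimate-plus-Newton-Raphson scheme discussed above. It is a numerical
model only, not the GCC patch and not the AArch64 instructions:
rsqrt_estimate is a hypothetical stand-in for an rsqrte-style estimate,
and its 8-bit default accuracy is an assumption chosen for illustration
(the real number of bits depends on the hardware). slowdown just redoes
the latency arithmetic with the cycle counts quoted above.

```python
import math


def rsqrt_estimate(x, bits=8):
    """Hypothetical stand-in for a hardware estimate instruction.

    Returns 1/sqrt(x) rounded to `bits` binary digits of mantissa.
    The 8-bit default is an assumption, not an architectural figure.
    """
    exact = 1.0 / math.sqrt(x)
    mantissa, exponent = math.frexp(exact)  # exact == mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def rsqrt_step(x, y):
    """One Newton-Raphson refinement of y ~ 1/sqrt(x): y * (3 - x*y*y) / 2."""
    return y * (3.0 - x * y * y) / 2.0


def rsqrt_refined(x, steps, bits=8):
    """Initial estimate followed by `steps` refinement steps."""
    y = rsqrt_estimate(x, bits)
    for _ in range(steps):
        y = rsqrt_step(x, y)
    return y


def relative_error(x, steps, bits=8):
    """Relative error of the refined value against 1/sqrt(x) in double."""
    exact = 1.0 / math.sqrt(x)
    return abs(rsqrt_refined(x, steps, bits) - exact) / exact


def slowdown(direct_cycles, estimate_cycles):
    """Fractional latency increase of the estimate sequence over div+sqrt."""
    return (estimate_cycles - direct_cycles) / direct_cycles
```

With the cycle counts above, slowdown(29, 42) is about 0.45 for single
precision and slowdown(53, 60) about 0.13 for double. In this model each
refinement step roughly doubles the number of correct bits, so the step
count is exactly what trades latency against precision.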

Thanks,
Andrew

>
> So I am strongly in favor of defaulting to the ‘safe’ number of
> iterations, even when compiling for a generic target.
>
> Best,
> Philipp.
>