From: Evandro Menezes
To: "'Dr.
Philipp Tomsich'", 'James Greenhalgh'
Cc: "'Kumar, Venkataramanan'", pinskia@gmail.com, 'Benedikt Huber', gcc-patches@gcc.gnu.org, 'Marcus Shawcroft', 'Ramana Radhakrishnan', 'Richard Earnshaw'
In-reply-to: <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
Date: Mon, 13 Jul 2015 19:09:00 -0000
Message-id: <06ba01d0bd9f$6152a400$23f7ec00$@samsung.com>

FWIW, I was curious about the precision of the results using such
instructions for the standard sqrt{,f} functions.
This is not a wide sample, but it does point to a floor of 3 series
iterations for DP and 2 for SP:

x            sqrt(x)      1 Step (ulps)      2 Steps (ulps)     3 Steps (ulps)
2.2251e-308  1.4917e-154  1.4917e-154 (999)  1.4917e-154 (999)  1.4917e-154 (000)
1.6022e-19   4.0027e-10   4.0027e-10 (999)   4.0027e-10 (999)   4.0027e-10 (000)
1.0000e+00   1.0000e+00   1.0000e+00 (001)   1.0000e+00 (001)   1.0000e+00 (001)
1.0000e+00   1.0000e+00   9.9999e-01 (999)   1.0000e+00 (999)   1.0000e+00 (000)
1.0000e+00   1.0000e+00   9.9999e-01 (999)   1.0000e+00 (999)   1.0000e+00 (000)
2.0000e+00   1.4142e+00   1.4142e+00 (999)   1.4142e+00 (999)   1.4142e+00 (000)
2.2500e+00   1.5000e+00   1.5000e+00 (999)   1.5000e+00 (999)   1.5000e+00 (000)
2.5600e+00   1.6000e+00   1.6000e+00 (000)   1.6000e+00 (000)   1.6000e+00 (000)
3.1416e+00   1.7725e+00   1.7725e+00 (999)   1.7725e+00 (999)   1.7725e+00 (000)
6.0221e+23   7.7602e+11   7.7602e+11 (999)   7.7602e+11 (999)   7.7602e+11 (000)
1.7977e+308  1.3408e+154  1.3408e+154 (000)  1.3408e+154 (000)  1.3408e+154 (000)

x            sqrtf(x)     1 Step (ulps)      2 Steps (ulps)     3 Steps (ulps)
1.1755e-38   1.0842e-19   1.0842e-19 (096)   1.0842e-19 (000)   1.0842e-19 (000)
1.6022e-19   4.0027e-10   4.0027e-10 (008)   4.0027e-10 (000)   4.0027e-10 (000)
1.0000e+00   1.0000e+00   1.0000e+00 (001)   1.0000e+00 (001)   1.0000e+00 (001)
1.0000e+00   1.0000e+00   9.9999e-01 (096)   1.0000e+00 (000)   1.0000e+00 (000)
1.0000e+00   1.0000e+00   9.9999e-01 (094)   1.0000e+00 (001)   1.0000e+00 (000)
2.0000e+00   1.4142e+00   1.4142e+00 (146)   1.4142e+00 (001)   1.4142e+00 (000)
2.2500e+00   1.5000e+00   1.5000e+00 (018)   1.5000e+00 (000)   1.5000e+00 (001)
2.5600e+00   1.6000e+00   1.6000e+00 (001)   1.6000e+00 (001)   1.6000e+00 (001)
3.1416e+00   1.7725e+00   1.7725e+00 (006)   1.7725e+00 (001)   1.7725e+00 (001)
6.0221e+23   7.7602e+11   7.7602e+11 (069)   7.7602e+11 (001)   7.7602e+11 (000)
3.4028e+38   1.8447e+19   1.8447e+19 (000)   1.8447e+19 (000)   1.8447e+19 (000)

The error in ULPs is capped at 999 in the tables above.

Having to use so many iterations to achieve accuracy would defeat the
purpose of using the Newton series, as it would likely be slower than the
FSQRT instruction.

Unlike on x86, I have the impression that the initial estimate in AArch64
is meant to be used in applications that do not require precision, like
graphics.  In that case, a single series iteration for SP would perhaps be
good enough.

-- 
Evandro Menezes
Austin, TX

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 6:45
> To: James Greenhalgh
> Cc: Kumar, Venkataramanan; pinskia@gmail.com; Benedikt Huber;
> gcc-patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
>
> James,
>
> On 29 Jun 2015, at 13:36, James Greenhalgh wrote:
> >
> > On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> >>
> >>> -----Original Message-----
> >>> From: Dr. Philipp Tomsich
> >>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>> Sent: Monday, June 29, 2015 2:17 PM
> >>> To: Kumar, Venkataramanan
> >>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>> (rsqrt) estimation in -ffast-math
> >>>
> >>> Kumar,
> >>>
> >>> This does not come unexpected, as the initial estimation and each
> >>> iteration will add an architecturally-defined number of bits of
> >>> precision (ARMv8 guarantees only a minimum number of bits provided
> >>> per operation… the exact number is specific to each micro-arch, though).
> >>> Depending on your architecture and on the required number of precise
> >>> bits by any given benchmark, one may see miscompares.
> >>
> >> True.
> >
> > I would be very uncomfortable with this approach.
>
> Same here.
> The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with making the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
>
> > From Richard Biener's post in the thread Michael Matz linked earlier
> > in the thread:
> >
> >     It would follow existing practice of things we allow in
> >     -funsafe-math-optimizations.  Existing practice in that we
> >     want to allow -ffast-math use with common benchmarks we care
> >     about.
> >
> >     https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >
> > With the solution you seem to be converging on (2-steps for some
> > microarchitectures, 3 for others), a binary generated for one
> > micro-arch may drop below a minimum guarantee of precision when run on
> > another. This seems to go against the spirit of the practice above. I
> > would only support adding this optimization to -Ofast if we could keep
> > to architectural guarantees of precision in the generated code (i.e.
> > 3-steps everywhere).
> >
> > I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> > which would be off by default, would enable the 2-step mode, and would
> > need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I
> > don't see what this buys you beyond the Gromacs boost (and even there
> > you would be creating an Invalid Run as optimization flags must be
> > applied across all workloads).
>
> Any flag that reduces precision (and thus breaks IEEE floating-point
> semantics) needs to be gated with an “unsafe” flag (i.e. one that is never on
> by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
> anyone else would.
>
> > For the 3-step optimization, it is clear to me that for "generic"
> > tuning we don't want this to be enabled by default experimental
> > results and advice in this thread argues against it for thunderx and
> > cortex-a57 targets.
> > However, enabling it based on the CPU tuning selected seems fine to me.
>
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and
> 5 iterations respectively) to become the default. Most “server-type” chips
> should not see a performance regression, while it will be easier to optimise
> for this in hardware than for a (potentially microcoded) sqrt-instruction
> (and subsequent, dependent divide).
>
> I have not heard anyone claim a performance regression (either on thunderx or
> on cortex-a57), but merely heard a “no speed-up”.
>
> So I am strongly in favor of defaulting to the ‘safe’ number of iterations,
> even when compiling for a generic target.
>
> Best,
> Philipp.