From: pinskia@gmail.com
To: "Dr. Philipp Tomsich"
Cc: James Greenhalgh, "Kumar, Venkataramanan", Benedikt Huber, "gcc-patches@gcc.gnu.org", Marcus Shawcroft, Ramana Radhakrishnan, Richard Earnshaw
Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
Date: Mon, 29 Jun 2015 16:57:00 -0000
Message-Id: <326A6111-183B-4F72-BEF9-4FE1AA708DE4@gmail.com>
In-Reply-To: <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
References: <1434629045-24650-1-git-send-email-benedikt.huber@theobroma-systems.com>
 <8B73CF78-11D4-4963-A60A-E1C2A3B219E2@gmail.com>
 <7794A52CE4D579448B959EED7DD0A4723DD10430@satlexdag06.amd.com>
 <1E4680F0-02C8-4999-958C-8B531BC850DA@theobroma-systems.com>
 <7794A52CE4D579448B959EED7DD0A4723DD104AF@satlexdag06.amd.com>
 <08D3EBD5-B67B-4D97-9940-3CAE6D020DC6@gmail.com>
 <7794A52CE4D579448B959EED7DD0A4723DD109D3@satlexdag06.amd.com>
 <1FEA8C0A-15E0-4309-B10D-B45032A68306@theobroma-systems.com>
 <7794A52CE4D579448B959EED7DD0A4723DD10A1C@satlexdag06.amd.com>
 <20150629113635.GA14400@arm.com>
 <00DB569E-D1C5-4CC5-AA2A-7572DCFEDB11@theobroma-systems.com>
X-SW-Source: 2015-06/txt/msg02097.txt.bz2

> On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich wrote:
>
> James,
>
>> On 29 Jun 2015, at 13:36, James Greenhalgh wrote:
>>
>>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
>>>
>>>> -----Original Message-----
>>>> From: Dr.
Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
>>>> Sent: Monday, June 29, 2015 2:17 PM
>>>> To: Kumar, Venkataramanan
>>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>>> estimation in -ffast-math
>>>>
>>>> Kumar,
>>>>
>>>> This does not come unexpected, as the initial estimation and each iteration
>>>> will add an architecturally-defined number of bits of precision (ARMv8
>>>> guarantees only a minimum number of bits provided per operation… the
>>>> exact number is specific to each micro-arch, though).
>>>> Depending on your architecture and on the required number of precise bits
>>>> by any given benchmark, one may see miscompares.
>>>
>>> True.
>>
>> I would be very uncomfortable with this approach.
>
> Same here. The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with making the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
>
>> From Richard Biener's post in the thread Michael Matz linked earlier
>> in the thread:
>>
>>   It would follow existing practice of things we allow in
>>   -funsafe-math-optimizations. Existing practice in that we
>>   want to allow -ffast-math use with common benchmarks we care
>>   about.
>>
>>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
>>
>> With the solution you seem to be converging on (2-steps for some
>> microarchitectures, 3 for others), a binary generated for one micro-arch
>> may drop below a minimum guarantee of precision when run on another. This
>> seems to go against the spirit of the practice above. I would only support
>> adding this optimization to -Ofast if we could keep to architectural
>> guarantees of precision in the generated code (i.e. 3-steps everywhere).
>>
>> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
>> which would be off by default, would enable the 2-step mode, and would
>> need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
>> see what this buys you beyond the Gromacs boost (and even there you would
>> be creating an Invalid Run as optimization flags must be applied across
>> all workloads).
>
> Any flag that reduces precision (and thus breaks IEEE floating-point semantics)
> needs to be gated with an “unsafe” flag (i.e. one that is never on by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
> anyone else would.
>
>> For the 3-step optimization, it is clear to me that for "generic" tuning
>> we don't want this to be enabled by default; experimental results and advice
>> in this thread argue against it for thunderx and cortex-a57 targets.
>> However, enabling it based on the CPU tuning selected seems fine to me.
>
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and 5
> iterations respectively) to become the default. Most “server-type” chips
> should not see a performance regression, while it will be easier to optimise for
> this in hardware than for a (potentially microcoded) sqrt-instruction (and
> subsequent, dependent divide).
>
> I have not heard anyone claim a performance regression (either on thunderx
> or on cortex-a57), but merely heard a “no speed-up”.

Actually it does regress performance on ThunderX; I just assumed that
when I said it was not going to be a win, that was taken to mean a
slowdown. It regresses gromacs by more than 10% on ThunderX, but I can't
remember exactly how much, as I had someone else run it.
The latency difference is also over 40%; for example, single precision
takes 29 cycles with div (12) and sqrt (17) directly vs. 42 cycles with
rsqrte and 2 iterations of 2 mul/rsqrts (double is 53 vs. 60). That is a
huge difference right there. ThunderX has a fast div and a fast sqrt for
32-bit and a reasonable one for double. So again, this is not merely
"not a win" but rather a regression for ThunderX. I suspect the same is
true for cortex-a57.

Thanks,
Andrew

>
> So I am strongly in favor of defaulting to the ‘safe’ number of iterations, even
> when compiling for a generic target.
>
> Best,
> Philipp.
>
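
For readers following the 2-step vs. 3-step discussion, here is a rough sketch of the refinement scheme being argued about: a coarse reciprocal-square-root estimate followed by Newton-Raphson steps, each costing two multiplies and one step operation. It is written in Python purely for illustration and is not code from the patch. The function names are hypothetical, the 8-bit starting precision is an assumption rather than a figure from this thread, and `rsqrt_step` computes (3 - a*b) / 2, which is how I understand the AArch64 reciprocal-square-root step operation to behave.

```python
import math


def estimate_rsqrt(x: float, bits: int = 8) -> float:
    """Hypothetical stand-in for a hardware estimate instruction.

    Truncates the exact 1/sqrt(x) to `bits` mantissa bits, so the
    result is only a coarse approximation (assumed precision).
    """
    mantissa, exponent = math.frexp(1.0 / math.sqrt(x))  # 0.5 <= mantissa < 1
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def rsqrt_step(a: float, b: float) -> float:
    """Newton-Raphson correction factor (3 - a*b) / 2."""
    return (3.0 - a * b) / 2.0


def refined_rsqrt(x: float, steps: int, bits: int = 8) -> float:
    """Refine the coarse estimate with `steps` Newton-Raphson iterations."""
    y = estimate_rsqrt(x, bits)
    for _ in range(steps):
        # One iteration: two multiplies (y*y and y*factor) plus one step op.
        y = y * rsqrt_step(x, y * y)
    return y


def relative_error(x: float, steps: int, bits: int = 8) -> float:
    """Relative error of the refined value against the exact 1/sqrt(x)."""
    exact = 1.0 / math.sqrt(x)
    return abs(refined_rsqrt(x, steps, bits) - exact) / exact


if __name__ == "__main__":
    # Print how the relative error shrinks as iterations are added.
    for steps in range(4):
        print(steps, relative_error(3.0, steps))
```

Each step roughly squares the relative error, so the number of correct bits about doubles per iteration. That is why the starting precision of the estimate decides whether 2 or 3 steps are enough, and why every extra step adds latency on a core that already has a fast div and sqrt.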