From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-220708-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 17720 invoked by alias); 10 Jun 2007 16:25:11 -0000
Received: (qmail 17483 invoked by uid 48); 10 Jun 2007 16:24:59 -0000
Date: Sun, 10 Jun 2007 16:25:00 -0000
Message-ID: <20070610162459.17482.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: <bug-31723-11659@http.gcc.gnu.org/bugzilla/>
Subject: [Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math
In-Reply-To: <bug-31723-11659@http.gcc.gnu.org/bugzilla/>
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "ubizjak at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2007-06/txt/msg00686.txt.bz2


------- Comment #16 from ubizjak at gmail dot com  2007-06-10 16:24 -------
(In reply to comment #13)

> > x1 = 0.5 X0 (3.0 - A x0 x0 x0)

Whoops! One x0 too many above. The correct calculation reads:

rsqrt = 0.5 rsqrt(a) (3.0 - a rsqrt(a) rsqrt(a)).

> Well, I suppose it depends on the hardware. IIRC older cpu:s did division with
> microcode whereas at least core2 and K8 do it in hardware, so I guess the
> hundreds of cycles doesn't apply to current cpu:s. 
> 
> Also, supposedly Penryn will have a much improved divider..

Well, mubench says for my Core2Duo that _all_ sqrt and div instructions have a
latency of 6 clocks and a reciprocal throughput of 5 clocks. By _all_ I mean
divss, divps, divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss
and rcpss have a latency of 3 clocks and a reciprocal throughput of 2 clocks.
This is just amazing.

> That being said, I think there is still a case for the reciprocal square root,
> as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
> linked to in the first message in this PR (in short, ifort does sqrt(a/b) about
> twice as fast as gfortran by using reciprocal approximations + NR). If indeed
> div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show, then it
> suggests almost all the performance benefit ifort gets is due to the
> rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the
> sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

It is true that my "benchmark" only measures a trivial accumulation function.
I can prepare a bunch of expanders to expand:

a / b <=> a [rcpss(b) (2.0 - b rcpss(b))]

a / sqrtss(b) <=> a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))].

sqrtss (a) <=> a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))

The second and third cases indeed look similar...
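
For illustration, the three expansions above can be written out in plain C.
This is only a sketch, not the proposed expanders: the helper names
(rcp_estimate, rsqrt_estimate, nr_div, nr_div_sqrt, nr_sqrt) are made up here,
and on targets without SSE the hardware estimate is replaced by a deliberately
perturbed exact value, which is purely an assumption for portability.

```c
#include <math.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Rough estimate of 1/b: rcpss where SSE is available. The fallback is a
   hypothetical stand-in (exact value with a small relative error). */
static float rcp_estimate(float b)
{
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(b)));
#else
    return (1.0f / b) * 1.0002f;
#endif
}

/* Rough estimate of 1/sqrt(b): rsqrtss where SSE is available. */
static float rsqrt_estimate(float b)
{
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(b)));
#else
    return (1.0f / sqrtf(b)) * 1.0002f;
#endif
}

/* a / b  <=>  a [rcpss(b) (2.0 - b rcpss(b))] */
static float nr_div(float a, float b)
{
    float x0 = rcp_estimate(b);
    return a * (x0 * (2.0f - b * x0));
}

/* a / sqrt(b)  <=>  a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))] */
static float nr_div_sqrt(float a, float b)
{
    float x0 = rsqrt_estimate(b);
    return a * (0.5f * x0 * (3.0f - b * x0 * x0));
}

/* sqrt(a)  <=>  a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a)) */
static float nr_sqrt(float a)
{
    float x0 = rsqrt_estimate(a);
    return a * 0.5f * x0 * (3.0f - a * x0 * x0);
}
```

Each function performs a single NR step on the estimate, so the result is
close to, but not necessarily identical to, what divss or sqrtss would return.
Note also that nr_sqrt(0.0f) evaluates 0 * inf and yields NaN, so a real
expander would presumably have to treat a zero argument specially.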

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As most
> MD programs, gromacs spends almost all it's time in the force calculations,
> where the majority of time is spent calculating 1/sqrt(...). So perhaps one
> should watch out for compilers that get suspiciously high scores on that
> benchmark. :)

Yes, look at the hpcwire article in comment #12.

> No, I'm not suggesting gcc should do this.

;))


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723