From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 17720 invoked by alias); 10 Jun 2007 16:25:11 -0000
Received: (qmail 17483 invoked by uid 48); 10 Jun 2007 16:24:59 -0000
Date: Sun, 10 Jun 2007 16:25:00 -0000
Message-ID: <20070610162459.17482.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References:
Subject: [Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math
In-Reply-To:
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "ubizjak at gmail dot com"
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2007-06/txt/msg00686.txt.bz2

------- Comment #16 from ubizjak at gmail dot com 2007-06-10 16:24 -------
(In reply to comment #13)
> > x1 = 0.5 X0 (3.0 - A x0 x0 x0)

Whoops! One x0 too many above. The correct calculation reads:

rsqrt = 0.5 rsqrt(a) (3.0 - a rsqrt(a) rsqrt(a)).

> Well, I suppose it depends on the hardware. IIRC older cpu:s did division with
> microcode whereas at least core2 and K8 do it in hardware, so I guess the
> hundreds of cycles doesn't apply to current cpu:s.
>
> Also, supposedly Penryn will have a much improved divider..

Well, mubench says for my Core2Duo that _all_ sqrt and div functions have a
latency of 6 clocks and a reciprocal throughput of 5 clocks. By _all_ I mean
divss, divps, divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss
and rcpss have a latency of 3 clocks and a reciprocal throughput of 2 clocks.
This is just amazing.

> That being said, I think there is still a case for the reciprocal square root,
> as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
> linked to in the first message in this PR (in short, ifort does sqrt(a/b) about
> twice as fast as gfortran by using reciprocal approximations + NR).
> If indeed div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks
> show, then it suggests almost all the performance benefit ifort gets is due
> to the rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In
> gas_dyn the sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

It is true that only a trivial accumulation function is benchmarked by my
"benchmark". I can prepare a bunch of expanders to expand:

a / b         <=> a [rcpss(b) (2.0 - b rcpss(b))]

a / sqrtss(b) <=> a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))]

sqrtss(a)     <=> a [0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))]

The second and third cases indeed look similar...

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As most
> MD programs, gromacs spends almost all it's time in the force calculations,
> where the majority of time is spent calculating 1/sqrt(...). So perhaps one
> should watch out for compilers that get suspiciously high scores on that
> benchmark. :)

Yes, look at the hpcwire article in comment #12.

> No, I'm not suggesting gcc should do this.

;))

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723