Date: Mon, 07 Aug 2006 21:57:00 -0000
Subject: [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
From: "whaley at cs dot utsa dot edu"
To: gcc-bugs@gcc.gnu.org

------- Comment #44 from whaley at cs dot utsa dot edu  2006-08-07 21:56 -------

Guys,

OK, the mystery of why my hand-patched gcc didn't work is now cleared up.  My first clue was that neither did the SVN-built gcc!  It turns out your peephole opt is only done if I throw the flag -O3 rather than -O, which is what my tarfile used.  Any reason it's done only at the high levels, since it makes such a performance difference?

FYI, in gcc 3, -O gets better performance than -O3, which is why those are my default flags.  However, it appears that gcc 4 gets very nice performance with -O3.  It's fairly common for -O to give better performance than -O3 (since the ATLAS code is already aggressively optimized, gcc's maximal optimizations often de-optimize an already-optimal code), so turning this on at the default level, or being able to turn it off and on manually, would be ideal . . .

>That's why you should compare 4.2 before and after my patch, instead.

Yeah, except 4.2 w/o your patch has horrible performance.  Our goal is not to beat horrible performance, but rather to get good performance!  Gcc 3 provides a measure of good performance.
However, I take your point that it'd be nice to see the new stuff put a headlock on the crap performance, so I include that below as well :)

Here's some initial data.  I report MFLOPS achieved by the kernel as compiled by: gcc3 (usually gcc 3.2 or 3.4.3), gccS (current SVN gcc), and gcc4 (usually gcc 4.1.1).  I will try to get more data later, but this is pretty suggestive, IMHO.

                      DOUBLE           SINGLE
             PEAK  gcc3/gccS/gcc4  gcc3/gccS/gcc4
             ====  ==============  ==============
 Pentium-D : 2800  2359/2417/2067  2685/2684/2362
 Ath64-X2  : 5600  3677/3585/2102  3680/3914/2207
 Opteron   : 3200  2590/2517/1507  2625/2800/1580

So, it appears to me we are seeing the same pattern I previously saw in my hand-tuned SSE code: Intel likes the new pattern of doing the last load as part of the FMUL instruction, but AMD is hampered by it.  Note that gccS is the best compiler for both single & double on the Intel.  On both AMD machines, however, it wins only for single precision, where the cost of the load is lower.  It loses to gcc3 for double, where load performance more completely determines matmul performance.  This is consistent with the view that gcc 4 does some other optimizations better than gcc 3, so if we got the fldl removed, gcc 4 would win for all precisions . . .

Don't get me wrong, your patch has already removed the emergency: in the worst case so far you are less than 3% slower.  However, I suspect that if we added the optional (for AMD chips only) peephole step to get rid of all possible fmul[s,l], then we'd win for double, and win even more for single, on AMD chips . . .  So, any chance of an AMD-only or flag-controlled peephole step to get rid of the last fmul[s,l]?

>Or you can disable the fmul[sl] instructions altogether.

As I mentioned, my own hand-tuning has indicated that the final fmul[sl] is good for Intel netburst archs, but bad for AMD hammer archs.  I'll see about posting some vectorization data ASAP.
Can someone create a new bug report so that the two threads of inquiry don't get mixed up, or do you want to just intermix them here?

Thanks,
Clint

P.S.: I tried to run this on the Core by hand-translating gccS-genned assembly to OS X assembly.  The double precision gccS code runs at the same speed as Apple's gcc.  However, the single precision is an order of magnitude slower, as I experienced this morning on the P4E.  This is almost certainly an error in my makefile, but damned if I can find it.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827