From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 25918 invoked by alias); 7 Aug 2006 15:32:41 -0000
Received: (qmail 25840 invoked by uid 48); 7 Aug 2006 15:32:30 -0000
Date: Mon, 07 Aug 2006 15:32:00 -0000
Message-ID: <20060807153230.25839.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References:
Subject: [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
In-Reply-To:
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "whaley at cs dot utsa dot edu"
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2006-08/txt/msg00494.txt.bz2
List-Id:

------- Comment #38 from whaley at cs dot utsa dot edu 2006-08-07 15:32 -------
Paolo,

Thanks for all the help. I'm not sure I understand everything perfectly, though, so there are some questions below . . .

>I don't see how the last fmul[sl] can be removed without increasing code size.

Since the flags are asking for performance, not size optimization, this should only be an argument if the fmul[s,l]'s are performance-neutral. A lot of performance optimizations increase code size, after all . . . Obviously, code with no fmul[sl] is possible, since gcc 3 achieves it. However, I can see that the peephole phase might not be able to change the register usage.

>Can you please try re-running the tests? It takes skill^W^W

Yes, I found the results confusing as well, which is why I reran them 50 times before posting. I also posted the tarfile (with Makefile and assemblies) that built them, so that my mistakes could be caught by someone with more skill. Just as a check, maybe you can confirm the .s you posted is the right one? I can't find the loads of the matrix C anywhere in its assembly, and I can find them in the double version . . .
Anyway, I like your suggestion (below) of getting the compiler so we won't have to worry about assemblies, so that's probably the way to go. On this front, is there some reason you cannot post the patch(es) as attachments, just to rule out copy problems, as I've asked in the last several messages? Note there's no need if I can grab your stuff from SVN, as below . . .

>because my tests were run on a similar Prescott (P4e)

You didn't post the gcc 3 performance numbers. What were those like? If you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big deal. If gcc 3 is still winning, on the other hand . . .

>It also would be interesting to re-run your code generator on a compiler built from svn trunk.

Are your changes on a branch I could check out? If so, give me the commands to get that branch, as we are scoping assemblies only because of the patching problem. Having a full compiler would indeed enable more detailed investigations, including turning the full code generator loose on the improved compiler.

>Also, I strongly believe that you should implement vectorization,

ATLAS implements vectorization by writing the entire GEMM kernel in assembly and directly using SSE. However, there are cases where generated C code must be called, and that's where gcc comes in . . .

>or at least find out *why* GCC does not vectorize your code. It may be simply that it does not have any guarantee on the alignment.

I'm all for this. info gcc says that w/o a guarantee of alignment, loops are duped, with an if selecting between vector and scalar loops; is this not accurate? I spent a day trying to get gcc to vectorize any of the generator's loops, and did not succeed (can you make it vectorize the provided benchmark code?). I also tried various unrollings of the inner loop, particularly no unrolling and unroll=2 (vector length). I was unable to truly decipher the warning messages explaining the lack of vectorization, and I would truly welcome some help in fixing this.
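
For readers unfamiliar with the "duped loops" idea mentioned above, here is a minimal hand-written sketch of loop versioning on alignment. It is an illustration only, not GCC's actual output and not ATLAS code: the names `is_aligned16` and `dot_versioned`, the dot-product kernel, and the 16-byte (SSE-width) boundary are all assumptions chosen for the example. The "vector" branch is still written in scalar C, processing two doubles per iteration to mirror one 16-byte register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical kernel: dot product laid out like a versioned loop.
 * A runtime alignment test selects the blocked ("vector") loop;
 * otherwise everything falls through to the plain scalar loop. */
double dot_versioned(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    if (is_aligned16(a) && is_aligned16(b)) {
        /* "Vector" version: two independent accumulators, two doubles
         * per iteration. This changes the summation order relative to
         * the scalar loop, so results may differ in the last bits for
         * general floating-point data. */
        double s0 = 0.0, s1 = 0.0;
        for (; i + 1 < n; i += 2) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        sum = s0 + s1;
    }

    /* Scalar version: the whole loop when unaligned, or just the
     * leftover element after the blocked loop. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because the two-accumulator branch reorders a floating-point sum, a compiler is generally only allowed to make that transformation itself under relaxed floating-point flags (for example `-ffast-math`); whether that applies to the generated code discussed here is not established in this thread. On GCC releases of this era the vectorizer is enabled with `-ftree-vectorize`, and `-ftree-vectorizer-verbose=N` prints its per-loop diagnostics.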
This is a separate issue from the x87 code, and this tracker item is already fairly complex :) I'm assuming that if I attempted to open a bug report of "gcc will not vectorize atlas's generated code" it would be closed pretty quickly. Maybe you can recommend how to approach this, or open another report that we can exchange info on? I would truly appreciate the opportunity to get some feedback from gcc authors to help guide me to solving this problem.

Thanks for all the info,
Clint

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827