Date: Sat, 12 Sep 2009 23:32:00 -0000
From: Mikulas Patocka
To: "H.J. Lu"
cc: Jakub Jelinek, gcc-patches@gcc.gnu.org, ubizjak@gmail.com
Subject: Re: PATCH: PR target/40838: gcc shouldn't assume that the stack is aligned
In-Reply-To: <6dc9ffc80908240900l73d3c97fo2c31fbd0142e75d2@mail.gmail.com>
References: <20090807071305.GX4462@tyan-ft48-01.lab.bos.redhat.com> <6dc9ffc80908070553q6f9b1b78lc19e6e4a4a5ec73b@mail.gmail.com> <6dc9ffc80908071530x7d4a3965u8021df66a142a0bf@mail.gmail.com> <6dc9ffc80908240900l73d3c97fo2c31fbd0142e75d2@mail.gmail.com>

On Mon, 24 Aug 2009, H.J.
Lu wrote:

> On Fri, Aug 7, 2009 at 3:30 PM, H.J. Lu wrote:
> > On Fri, Aug 7, 2009 at 5:53 AM, H.J. Lu wrote:
> >> On Fri, Aug 7, 2009 at 12:13 AM, Jakub Jelinek wrote:
> >>> On Fri, Aug 07, 2009 at 02:54:46AM +0200, Mikulas Patocka wrote:
> >>>> > > In 32bit, the incoming stack may not be 16 byte aligned.  This patch
> >>>> > > assumes the incoming stack is 4 byte aligned and realigns stack if any
> >>>> > > SSE variable is put on stack. Any comments?
> >>>> >
> >>>> > IMHO this is wrong, I could live with a non-default option for those who
> >>>> > don't care about performance and think a SCO document from 1996 has any
> >>>> > relevance to Linux these days.  In reality a Linux ABI for years assumes
> >>>> > 16 byte stack alignment for 32-bit code.
> >>>>
> >>>> Tell me which Linux distribution did you run with 16-byte stack alignment
> >>>> checking (as proposed in bug 40838) and what was the result?
> >>>>
> >>>> For me, the result was that 75% of binaries in /bin in Debian Lenny do not
> >>>> align the stack on 16-byte boundary.
> >>>
> >>> Besides the obstack glibc bug which has been fixed since then you haven't
> >>> reported anything particular.  It is true that parts of i?86 glibc are
> >>> compiled with -mpreferred-stack-boundary=2, but only parts that don't call
> >>> callbacks.  Async signals AFAIK will align the stack properly.
> >>>
> >>> I simply don't trust your 75% claim, lots of stuff would break if things
> >>> weren't aligned properly.
> >>>
> >>
> >> From gcc 3.4:
> >>
> >>  /* Validate -mpreferred-stack-boundary= value, or provide default.
> >>     The default of 128 bits is for Pentium III's SSE __m128, but we
> >>     don't want additional code to keep the stack aligned when
> >>     optimizing for code size.  */
> >>  ix86_preferred_stack_boundary = (optimize_size
> >>                                   ? TARGET_64BIT ? 128 : 32
> >>                                   : 128);
> >>
> >> If you compile code with -Os, you will get 4 byte stack alignment.
> >> Just step back, we changed stack alignment from 4 byte to 16byte
> >> for SSE since we couldn't realign stack at the time. Now we can
> >> realign the stack very efficiently. I think we should do it for SSE
> >> to support the existing Linux binaries which have 4 byte stack
> >> alignment. If it helps, I can compare -m32 -O3 -msse2 -mfpmath=sse
> >> results with SPEC CPU 2006, before and after my patch.
> >>
> >
> > Here are the differences of -m32 -O3 -msse2 -mfpmath=sse -ffast-math
> > -funroll-loops
> > before and after my patch:
> >
> > 400.perlbench           -0.384615%
> > 401.bzip2               0%
> > 403.gcc                 -0.362319%
> > 429.mcf                 -0.813008%
> > 445.gobmk               0.921659%
> > 456.hmmer               0.549451%
> > 458.sjeng               -0.438596%
> > 462.libquantum          0%
> > 464.h264ref             0%
> > 471.omnetpp             -0.478469%
> > 473.astar               -0.645161%
> > 483.xalancbmk           -0.727273%
> > SPECint(R)_base2006     -0.411523%
> > 410.bwaves              -0.406504%
> > 416.gamess              0%
> > 433.milc                -1.36986%
> > 434.zeusmp              -0.44843%
> > 435.gromacs             0%
> > 436.cactusADM           0%
> > 437.leslie3d            -0.888889%
> > 444.namd                1.20482%
> > 447.dealII              -0.350877%
> > 450.soplex              -0.31746%
> > 453.povray              0.458716%
> > 454.calculix            0%
> > 459.GemsFDTD            0%
> > 465.tonto               0%
> > 470.lbm                 0%
> > 481.wrf                 0.480769%
> > 482.sphinx3             0.940439%
> > SPECfp(R)_base2006      0%
> >
> > I think we should align stack if SSE variables are put on stack.
> >
>
> Darwin ia32 psABI specifies 16byte stack alignment and enforces it
> with
>
> #define PREFERRED_STACK_BOUNDARY \
>         MAX (STACK_BOUNDARY, ix86_preferred_stack_boundary)
>
> On other ia32 targets, 4byte outgoing stack alignment is
> correct and allowed. My patch assumes 4 byte incoming
> stack alignment only when SSE variables are put on stack.
> Automatic stack alignment implementation is quite efficient.
> Its performance impact is very limited as shown in SPEC CPU
> 2006 results. It also fixed a regression:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41156
>
> OK for trunk?
>
> Thanks.
>
> --
> H.J.

I tried this patch on 4.4.1 (the one posted in bugzilla 40838) with
seamonkey 1.1.17, and it really misses alignment in some parts of it. I used
CFLAGS and CXXFLAGS "-O3 -fomit-frame-pointer -frename-registers
-march=barcelona".

I found the functions that are miscompiled. The trivial way to find them is to
compile seamonkey, do find .
-name "*.o"|xargs objdump -d|less, search for
"movdqa.*esp" or "movaps.*esp" (or even "esp.*xmm" and "xmm.*esp", but that
gives more false positives), and check that each matching function has
alignment code. There are a few functions that don't. Please look at these
functions; the idea of your patch is good, it just needs to be fixed.

BTW, the Gentoo documentation lists -O3 as a problematic flag that leads to
instability with gcc 4 (they don't say why, but the most likely reason is
this alignment issue). It would be good if we could fix it and allow
people to use SSE.

Mikulas

Some of the miscompiled functions are:

pow5mult:
   a00:  55                     push   %ebp
   a01:  57                     push   %edi
   a02:  56                     push   %esi
   a03:  53                     push   %ebx
   a04:  89 d3                  mov    %edx,%ebx
   a06:  83 ec 6c               sub    $0x6c,%esp
  ...
   a7d:  0f 29 44 24 10         movaps %xmm0,0x10(%esp)
   a82:  e8 79 f8 ff ff         call   300
   a87:  85 c0                  test   %eax,%eax
   a89:  89 44 24 4c            mov    %eax,0x4c(%esp)
   a8d:  66 0f 6f 44 24 10      movdqa 0x10(%esp),%xmm0

PR_strtod:
  10a0:  55                     push   %ebp
  10a1:  57                     push   %edi
  10a2:  56                     push   %esi
  10a3:  53                     push   %ebx
  10a4:  81 ec fc 00 00 00      sub    $0xfc,%esp
  ...
  172c:  66 0f 6f 44 24 20      movdqa 0x20(%esp),%xmm0

PR_select:
  3da0:  55                     push   %ebp
  3da1:  57                     push   %edi
  3da2:  56                     push   %esi
  3da3:  89 ce                  mov    %ecx,%esi
  3da5:  53                     push   %ebx
  3da6:  89 d3                  mov    %edx,%ebx
  3da8:  81 ec ec 01 00 00      sub    $0x1ec,%esp
  ...
  3dcb:  0f 29 84 24 50 01 00   movaps %xmm0,0x150(%esp)
  3dd2:  00
  3dd3:  0f 29 84 24 60 01 00   movaps %xmm0,0x160(%esp)
  3dda:  00
  3ddb:  0f 29 84 24 70 01 00   movaps %xmm0,0x170(%esp)
  3de2:  00
  3de3:  0f 29 84 24 80 01 00   movaps %xmm0,0x180(%esp)
  3dea:  00
  3deb:  0f 29 84 24 90 01 00   movaps %xmm0,0x190(%esp)
  3df2:  00

_ZL18wait_for_retrievalP13_GtkClipboardP17retrieval_context:
   560:  55                     push   %ebp
   561:  57                     push   %edi
   562:  56                     push   %esi
   563:  89 d6                  mov    %edx,%esi
   565:  53                     push   %ebx
   566:  81 ec bc 01 00 00      sub    $0x1bc,%esp
  ...
   5b0:  0f 29 44 24 20         movaps %xmm0,0x20(%esp)
   5b5:  0f 29 44 24 30         movaps %xmm0,0x30(%esp)
   5ba:  0f 29 44 24 40         movaps %xmm0,0x40(%esp)
   5bf:  0f 29 44 24 50         movaps %xmm0,0x50(%esp)
   5c4:  0f 29 44 24 60         movaps %xmm0,0x60(%esp)
   5c9:  0f 29 44 24 70         movaps %xmm0,0x70(%esp)
   5ce:  0f 29 84 24 80 00 00   movaps %xmm0,0x80(%esp)
   5d5:  00
   5d6:  0f 29 84 24 90 00 00   movaps %xmm0,0x90(%esp)
   5dd:  00

_ZN13XRemoteClient7GetLockEmPi:
   680:  55                     push   %ebp
   681:  57                     push   %edi
   682:  56                     push   %esi
   683:  53                     push   %ebx
   684:  89 c3                  mov    %eax,%ebx
   686:  81 ec 0c 02 00 00      sub    $0x20c,%esp
  ...
   6ee:  0f 29 44 24 30         movaps %xmm0,0x30(%esp)