From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-248636-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 19365 invoked by alias); 12 Sep 2009 23:32:19 -0000
Received: (qmail 19356 invoked by uid 22791); 12 Sep 2009 23:32:18 -0000
X-SWARE-Spam-Status: No, hits=-2.4 required=5.0 	tests=AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Received: from artax.karlin.mff.cuni.cz (HELO artax.karlin.mff.cuni.cz) (195.113.26.195)     by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Sat, 12 Sep 2009 23:32:12 +0000
Received: by artax.karlin.mff.cuni.cz (Postfix, from userid 17421) 	id 39E18980E2; Sun, 13 Sep 2009 01:32:09 +0200 (CEST)
Received: from localhost (localhost [127.0.0.1]) 	by artax.karlin.mff.cuni.cz (Postfix) with ESMTP id 35CAE9809E; 	Sun, 13 Sep 2009 01:32:09 +0200 (CEST)
Date: Sat, 12 Sep 2009 23:32:00 -0000
From: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
To: "H.J. Lu" <hjl.tools@gmail.com>
cc: Jakub Jelinek <jakub@redhat.com>, gcc-patches@gcc.gnu.org,      ubizjak@gmail.com
Subject: Re: PATCH: PR target/40838: gcc shouldn't assume that the stack is   aligned
In-Reply-To: <6dc9ffc80908240900l73d3c97fo2c31fbd0142e75d2@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0909130108530.6268@artax.karlin.mff.cuni.cz>
References: <Pine.LNX.4.64.0908070254030.30134@artax.karlin.mff.cuni.cz>   <20090807071305.GX4462@tyan-ft48-01.lab.bos.redhat.com>   <6dc9ffc80908070553q6f9b1b78lc19e6e4a4a5ec73b@mail.gmail.com>   <6dc9ffc80908071530x7d4a3965u8021df66a142a0bf@mail.gmail.com>  <6dc9ffc80908240900l73d3c97fo2c31fbd0142e75d2@mail.gmail.com>
X-Personality-Disorder: Schizoid
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="1908654874-1471531285-1252798328=:6268"
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
X-SW-Source: 2009-09/txt/msg00892.txt.bz2

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--1908654874-1471531285-1252798328=:6268
Content-Type: TEXT/PLAIN; charset=ISO-8859-2
Content-Transfer-Encoding: QUOTED-PRINTABLE
Content-length: 9519


On Mon, 24 Aug 2009, H.J. Lu wrote:

> On Fri, Aug 7, 2009 at 3:30 PM, H.J. Lu<hjl.tools@gmail.com> wrote:
> > On Fri, Aug 7, 2009 at 5:53 AM, H.J. Lu<hjl.tools@gmail.com> wrote:
> >> On Fri, Aug 7, 2009 at 12:13 AM, Jakub Jelinek<jakub@redhat.com> wrote:
> >>> On Fri, Aug 07, 2009 at 02:54:46AM +0200, Mikulas Patocka wrote:
> >>>> > > In 32bit, the incoming stack may not be 16 byte aligned. This patch
> >>>> > > assumes the incoming stack is 4 byte aligned and realigns stack if any
> >>>> > > SSE variable is put on stack. Any comments?
> >>>> >
> >>>> > IMHO this is wrong, I could live with a non-default option for those who
> >>>> > don't care about performance and think a SCO document from 1996 has any
> >>>> > relevance to Linux these days. In reality a Linux ABI for years assumes
> >>>> > 16 byte stack alignment for 32-bit code.
> >>>>
> >>>> Tell me which Linux distribution did you run with 16-byte stack alignment
> >>>> checking (as proposed in bug 40838) and what was the result?
> >>>>
> >>>> For me, the result was that 75% of binaries in /bin in Debian Lenny do not
> >>>> align the stack on 16-byte boundary.
> >>>
> >>> Besides the obstack glibc bug which has been fixed since then you haven't
> >>> reported anything particular. It is true that parts of i?86 glibc are
> >>> compiled with -mpreferred-stack-boundary=2, but only parts that don't call
> >>> callbacks. Async signals AFAIK will align the stack properly.
> >>>
> >>> I simply don't trust your 75% claim, lots of stuff would break if things
> >>> weren't aligned properly.
> >>>
> >>
> >> From gcc 3.4:
> >>
> >>  /* Validate -mpreferred-stack-boundary= value, or provide default.
> >>     The default of 128 bits is for Pentium III's SSE __m128, but we
> >>     don't want additional code to keep the stack aligned when
> >>     optimizing for code size.  */
> >>  ix86_preferred_stack_boundary = (optimize_size
> >>                                   ? TARGET_64BIT ? 128 : 32
> >>                                   : 128);
> >>
> >> If you compile code with -Os, you will get 4 byte stack alignment.
> >> Just step back, we changed stack alignment from 4 byte to 16byte
> >> for SSE since we couldn't realign stack at the time. Now we can
> >> realign the stack very efficiently. I think we should do it for SSE
> >> to support the existing Linux binaries which have 4 byte stack
> >> alignment. If it helps, I can compare -m32 -O3 -msse2 -mfpmath=sse
> >> results with SPEC CPU 2006, before and after my patch.
> >>
> >
> > Here are the differences of -m32 -O3 -msse2 -mfpmath=sse -ffast-math
> > -funroll-loops
> > before and after my patch:
> >
> > 400.perlbench           -0.384615%
> > 401.bzip2               0%
> > 403.gcc                 -0.362319%
> > 429.mcf                 -0.813008%
> > 445.gobmk               0.921659%
> > 456.hmmer               0.549451%
> > 458.sjeng               -0.438596%
> > 462.libquantum          0%
> > 464.h264ref             0%
> > 471.omnetpp             -0.478469%
> > 473.astar               -0.645161%
> > 483.xalancbmk           -0.727273%
> > SPECint(R)_base2006     -0.411523%
> > 410.bwaves              -0.406504%
> > 416.gamess              0%
> > 433.milc                -1.36986%
> > 434.zeusmp              -0.44843%
> > 435.gromacs             0%
> > 436.cactusADM           0%
> > 437.leslie3d            -0.888889%
> > 444.namd                1.20482%
> > 447.dealII              -0.350877%
> > 450.soplex              -0.31746%
> > 453.povray              0.458716%
> > 454.calculix            0%
> > 459.GemsFDTD            0%
> > 465.tonto               0%
> > 470.lbm                 0%
> > 481.wrf                 0.480769%
> > 482.sphinx3             0.940439%
> > SPECfp(R)_base2006      0%
> >
> > I think we should align stack if SSE variables are put on stack.
> >
>
> Darwin ia32 psABI specifies 16byte stack alignment and enforces it
> with
>
> #define PREFERRED_STACK_BOUNDARY                        \
>   MAX (STACK_BOUNDARY, ix86_preferred_stack_boundary)
>
> On other ia32 targets, 4byte outgoing stack alignment is
> correct and allowed. My patch assumes 4 byte incoming
> stack alignment only when SSE variables are put on stack.
> Automatic stack alignment implementation is quite efficient.
> Its performance impact is very limited as shown in SPEC CPU
> 2006 results. It also fixed a regression:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41156
>
> OK for trunk?
>
> Thanks.
>
> --
> H.J.

I tried this patch for 4.4.1 (the one posted in bugzilla 40838) with
seamonkey 1.1.17 and it really misses alignment in some parts of it. I used
CFLAGS and CXXFLAGS "-O3 -fomit-frame-pointer -frename-registers
-march=barcelona".

I found the functions that miscompile. The trivial way to find them is to
compile seamonkey, run find . -name "*.o"|xargs objdump -d|less, search for
"movdqa.*esp" or "movaps.*esp" (or even "esp.*xmm" and "xmm.*esp", but that
gives more false positives) and check that the function has alignment
code. There are a few functions that don't. Please look at these
functions; the idea of your patch is good, it just needs to be fixed.
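
The manual search above could be automated roughly as follows. This is
only a sketch under assumptions: it expects the text output of objdump -d
in AT&T syntax, it treats an "and $0xffffffX0,%esp" instruction as the
alignment code, and the name find_suspect_functions and the three regular
expressions are made up for illustration. It will miss aligned accesses
made through %ebp and can give false positives, just like the grep.

```python
import re

# Function header in objdump -d output, e.g. "00000a00 <pow5mult>:"
FUNC_HEADER = re.compile(r"^[0-9a-f]+ <(?P<name>[^>]+)>:\s*$")
# Aligned SSE move that mentions %esp (same idea as "movaps.*esp")
ALIGNED_SSE_ON_STACK = re.compile(r"\b(movaps|movdqa)\b.*%esp")
# Assumed shape of the realignment code: and $0xfffffff0,%esp (or stricter)
REALIGN = re.compile(r"\band[lw]?\s+\$0xffffff[0-9a-f]0,\s*%esp")


def find_suspect_functions(disassembly):
    """Return names of functions that use aligned SSE moves relative to
    %esp but contain no stack realignment instruction."""
    suspects = []
    name, uses_aligned, realigns = None, False, False

    def flush():
        # Record the function we just finished scanning, if it looks bad.
        if name is not None and uses_aligned and not realigns:
            suspects.append(name)

    for line in disassembly.splitlines():
        header = FUNC_HEADER.match(line)
        if header:
            flush()
            name, uses_aligned, realigns = header.group("name"), False, False
            continue
        if ALIGNED_SSE_ON_STACK.search(line):
            uses_aligned = True
        if REALIGN.search(line):
            realigns = True
    flush()
    return suspects
```

Feeding it the combined objdump -d output of all object files gives a
list of candidates that still has to be checked by hand.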

BTW, Gentoo documentation lists -O3 as a problematic flag that leads to
instability with gcc 4 (they don't say why, but the most likely reason is
this alignment issue). It would be good if we could fix it and allow
people to use SSE.

Mikulas


Some of the miscompiled functions are:

pow5mult:
a00:       55                      push   %ebp
a01:       57                      push   %edi
a02:       56                      push   %esi
a03:       53                      push   %ebx
a04:       89 d3                   mov    %edx,%ebx
a06:       83 ec 6c                sub    $0x6c,%esp
...
a7d:       0f 29 44 24 10          movaps %xmm0,0x10(%esp)
a82:       e8 79 f8 ff ff          call   300 <Balloc>
a87:       85 c0                   test   %eax,%eax
a89:       89 44 24 4c             mov    %eax,0x4c(%esp)
a8d:       66 0f 6f 44 24 10       movdqa 0x10(%esp),%xmm0

PR_strtod:
10a0:       55                      push   %ebp
10a1:       57                      push   %edi
10a2:       56                      push   %esi
10a3:       53                      push   %ebx
10a4:       81 ec fc 00 00 00       sub    $0xfc,%esp
...
172c:       66 0f 6f 44 24 20       movdqa 0x20(%esp),%xmm0

PR_select:
3da0:       55                      push   %ebp
3da1:       57                      push   %edi
3da2:       56                      push   %esi
3da3:       89 ce                   mov    %ecx,%esi
3da5:       53                      push   %ebx
3da6:       89 d3                   mov    %edx,%ebx
3da8:       81 ec ec 01 00 00       sub    $0x1ec,%esp
...
3dcb:       0f 29 84 24 50 01 00    movaps %xmm0,0x150(%esp)
3dd2:       00
3dd3:       0f 29 84 24 60 01 00    movaps %xmm0,0x160(%esp)
3dda:       00
3ddb:       0f 29 84 24 70 01 00    movaps %xmm0,0x170(%esp)
3de2:       00
3de3:       0f 29 84 24 80 01 00    movaps %xmm0,0x180(%esp)
3dea:       00
3deb:       0f 29 84 24 90 01 00    movaps %xmm0,0x190(%esp)
3df2:       00

_ZL18wait_for_retrievalP13_GtkClipboardP17retrieval_context:
560:       55                      push   %ebp
561:       57                      push   %edi
562:       56                      push   %esi
563:       89 d6                   mov    %edx,%esi
565:       53                      push   %ebx
566:       81 ec bc 01 00 00       sub    $0x1bc,%esp
...
5b0:       0f 29 44 24 20          movaps %xmm0,0x20(%esp)
5b5:       0f 29 44 24 30          movaps %xmm0,0x30(%esp)
5ba:       0f 29 44 24 40          movaps %xmm0,0x40(%esp)
5bf:       0f 29 44 24 50          movaps %xmm0,0x50(%esp)
5c4:       0f 29 44 24 60          movaps %xmm0,0x60(%esp)
5c9:       0f 29 44 24 70          movaps %xmm0,0x70(%esp)
5ce:       0f 29 84 24 80 00 00    movaps %xmm0,0x80(%esp)
5d5:       00
5d6:       0f 29 84 24 90 00 00    movaps %xmm0,0x90(%esp)
5dd:       00

_ZN13XRemoteClient7GetLockEmPi:
680:       55                      push   %ebp
681:       57                      push   %edi
682:       56                      push   %esi
683:       53                      push   %ebx
684:       89 c3                   mov    %eax,%ebx
686:       81 ec 0c 02 00 00       sub    $0x20c,%esp
...
6ee:       0f 29 44 24 30          movaps %xmm0,0x30(%esp)
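
The arithmetic behind these prologues can be checked with a small
calculation. It is a sketch that assumes the usual i386 convention (call
pushes a 4-byte return address, every push is 4 bytes); the push counts
and frame sizes are copied from the listings above, and the helper name
esp_mod16_after_prologue is hypothetical. "Incoming" here means %esp
modulo 16 in the caller just before the call instruction.

```python
def esp_mod16_after_prologue(incoming_mod16, pushes, frame_size):
    """Return %esp modulo 16 once the prologue has finished."""
    esp = incoming_mod16
    esp -= 4               # return address pushed by call
    esp -= 4 * pushes      # callee-saved registers pushed in the prologue
    esp -= frame_size      # sub $frame_size,%esp
    return esp % 16


# (number of pushes, frame size) taken from the listings above
PROLOGUES = {
    "pow5mult": (4, 0x6C),
    "PR_strtod": (4, 0xFC),
    "PR_select": (4, 0x1EC),
    "_ZL18wait_for_retrievalP13_GtkClipboardP17retrieval_context": (4, 0x1BC),
    "_ZN13XRemoteClient7GetLockEmPi": (4, 0x20C),
}

for name, (pushes, frame) in PROLOGUES.items():
    # Try every residue that a 4-byte aligned caller can have.
    residues = [esp_mod16_after_prologue(r, pushes, frame) for r in (0, 4, 8, 12)]
    print(name, residues)
```

In every one of these functions the prologue moves %esp by a multiple of
16 bytes, so the residue of the caller is passed through unchanged. The
movaps/movdqa offsets in the listings are all multiples of 16, so they
are only safe if the caller had a 16-byte aligned stack; with a caller
that is merely 4-byte aligned the access faults for three of the four
possible residues.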

--1908654874-1471531285-1252798328=:6268--