In-Reply-To: <63EE40A00BA43F49B85FACBB03F078B6090AC041A7@sausexmbp02.amd.com>
References: <20110712212201.23194.45716.sendpatchset@gccpike4.amd.com> <4E1CC32B.3060004@redhat.com> <63EE40A00BA43F49B85FACBB03F078B60821086630@sausexmbp02.amd.com> <63EE40A00BA43F49B85FACBB03F078B6090A9D90F2@sausexmbp02.amd.com> <63EE40A00BA43F49B85FACBB03F078B6090AC041A7@sausexmbp02.amd.com>
Date: Wed, 02 Nov 2011 20:50:00 -0000
Subject: Re: AVX generic mode tuning discussion.
From: Richard Guenther
To: "Jagasia, Harsha"
Cc: Richard Henderson, "gcc-patches@gcc.gnu.org", "hubicka@ucw.cz", "ubizjak@gmail.com", "hjl.tools@gmail.com"

On Wed, Nov 2, 2011 at 5:57 PM, Jagasia, Harsha wrote:
>> >> > > We would like to propose changing AVX generic mode tuning to
>> >> > > generate 128-bit AVX instead of 256-bit AVX.
>> >> >
>> >> > You indicate a 3% reduction on bulldozer with avx256.
>> >> > How does avx128 compare to -mno-avx -msse4.2?
>> >>
>> >> We see these % differences going from SSE42 to AVX128 to AVX256 on
>> >> Bulldozer with "-mtune=generic -Ofast".
>> >> (Positive is improvement, negative is degradation)
>> >>
>> >> Bulldozer:
>> >>                   AVX128/SSE42    AVX256/AVX-128
>> >> 410.bwaves           -1.4%            -1.4%
>> >> 416.gamess           -1.1%             0.0%
>> >> 433.milc              0.5%            -2.4%
>> >> 434.zeusmp            9.7%            -2.1%
>> >> 435.gromacs           5.1%             0.5%
>> >> 436.cactusADM         8.2%           -23.8%
>> >> 437.leslie3d          8.1%             0.4%
>> >> 444.namd              3.6%             0.0%
>> >> 447.dealII           -1.4%            -0.4%
>> >> 450.soplex           -0.4%            -0.4%
>> >> 453.povray            0.0%            -1.5%
>> >> 454.calculix         15.7%            -8.3%
>> >> 459.GemsFDTD          4.9%             1.4%
>> >> 465.tonto             1.3%            -0.6%
>> >> 470.lbm               0.9%             0.3%
>> >> 481.wrf               7.3%            -3.6%
>> >> 482.sphinx3           5.0%            -9.8%
>> >> SPECFP                3.8%            -3.2%
>> >>
>> >> > Will the next AMD generation have a useable avx256?
>> >> > I'm not keen on the idea of generic mode being tuned
>> >> > for a single processor revision that maybe shouldn't
>> >> > actually be using avx at all.
>> >>
>> >> We see a substantial gain in several SPECFP benchmarks going from
>> >> SSE42 to AVX128 on Bulldozer.
>> >> IMHO, accomplishing even a 5% gain in an individual benchmark takes
>> >> a hardware company several man months.
>> >> The loss with AVX256 for Bulldozer is much more significant than the
>> >> gain for SandyBridge.
>> >> While the general trend in the industry is a move toward AVX256, for
>> >> now we would be disadvantaging Bulldozer with this choice.
>> >>
>> >> We have several customers who use -mtune=generic and it is default,
>> >> unless a user explicitly overrides it with -mtune=native. They are
>> >> the ones who want to experiment with latest ISA using gcc, but want
>> >> to keep their ISA selection and tuning agnostic on x86/64. IMHO, it
>> >> is with these customers in mind that generic was introduced in the
>> >> first place.
>> >
>> > Since stage 1 closure is around the corner, just wanted to ping to
>> > see if the maintainers have made up their mind on this one.
>> > AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
>> > out pretty much all of that gain in generic mode.
>> > Until there is a convergence on AVX-256 for x86/64, we would like to
>> > propose having generic generate avx-128 by default and have a user
>> > override to avx-256 manually when known to benefit performance.
>>
>> Did somebody spend the time analyzing why CactusADM shows so much of a
>> difference?
>> With the recent improvements in vectorizing for AVX, did you
>> re-do the measurements with a recent trunk?
>>
>> I don't think disabling avx-256 by default is a good idea until we
>> understand why these numbers happen and are convinced we cannot fix
>> this by proper cost modeling.
>
> We have observed cases where AVX-256 bit code is slower than AVX-128 bit
> code on Bulldozer. This is because internally the front end, data paths
> etc. for Bulldozer are designed for optimal AVX 128-bit. Throwing densely
> packed 256-bit code at the pipeline can congest the front end causing
> stalls and hence slowdowns. We expect the behavior of cactus, calculix
> and sphinx, which are the 3 benchmarks with the biggest avx-256 gaps, to
> be in the same vein. In general, the hardware design engineers recommend
> running AVX 128-bit code on Bulldozer. Given the underlying hardware
> design, software tuning can't really change the results here. Any further
> analysis of cactus would be a cycle sink at our end and we may not even
> be able to discuss the details on a public mailing list. x86/64 has not
> yet converged on avx-256 and generic mode should reflect that.

Well, generic hasn't converged on AVX at all.

Cost modeling can deal with code density just fine - are there any
differences between code density issues of say, loads vs. stores vs.
arithmetic?

I specifically ask about analysis because AVX-256 has instruction set
issues for certain patterns the vectorizer generates and the cost model
currently does not reflect these at all.

Richard.

> Posting the re-measurements on trunk for cactus, calculix and sphinx on
> Bulldozer:
>                   AVX128/SSE42    AVX256/AVX-128
> 436.cactusADM        10%              -30%
> 454.calculix         14.7%             -6%
> 482.sphinx3           7%               -9%
>
> All positive % above are improvements, all negative % are degradations.
>
> I will post re-measurements for all of Spec with latest trunk as soon as
> I have them.
>
> Thoughts?
>
> Thanks,
> Harsha
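
The two columns in the quoted tables are each relative to the previous
step (AVX128 vs. SSE42, then AVX256 vs. AVX128), so the net effect of
AVX256 against SSE42 is not stated directly. A minimal sketch of how the
columns might be combined follows. It assumes the percentages are relative
speedups that compose multiplicatively, which the thread does not state;
the helper name `compound` is hypothetical, and the input figures are
copied from the first table.

```python
def compound(first_pct: float, second_pct: float) -> float:
    """Combine two successive relative changes (in percent) into one.

    Assumption: each percentage is a relative change in performance, so
    the two steps multiply rather than add.
    """
    return ((1 + first_pct / 100) * (1 + second_pct / 100) - 1) * 100


# (AVX128 vs SSE42, AVX256 vs AVX128), in percent, from the first table.
results = {
    "436.cactusADM": (8.2, -23.8),
    "454.calculix": (15.7, -8.3),
    "482.sphinx3": (5.0, -9.8),
    "SPECFP": (3.8, -3.2),
}

if __name__ == "__main__":
    for name, (avx128_vs_sse42, avx256_vs_avx128) in results.items():
        net = compound(avx128_vs_sse42, avx256_vs_avx128)
        print(f"{name:<14} AVX256 vs SSE42: {net:+.1f}%")
```

Under that assumption the SPECFP aggregate nets out at roughly +0.5% for
AVX256 over SSE42, consistent with the remark that AVX-256 "wipes out
pretty much all of that gain", while cactusADM ends up about 17.6% below
SSE42.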