From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org
MIME-Version: 1.0
From: Richard Biener
Date: Tue, 12 Nov 2019 08:29:00 -0000
Subject: Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.
To: Hongtao Liu
Cc: "H. J. Lu", GCC Patches, Uros Bizjak
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2019-11/txt/msg00833.txt.bz2

On Tue, Nov 12, 2019 at 9:19 AM Richard Biener wrote:
>
> On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu wrote:
> >
> > Hi:
> >   This patch sets X86_TUNE_AVX128_OPTIMAL as the default for all
> > AVX targets because we found there is still a performance gap between
> > 128-bit auto-vectorization and 256-bit auto-vectorization, even with
> > the epilog vectorized.
> >   The performance impact of setting avx128_optimal as the default on
> > SPEC2017 with the options "-march=native -funroll-loops -Ofast -flto"
> > on CLX is as below:
> >
> > INT rate
> > 500.perlbench_r         -0.32%
> > 502.gcc_r               -1.32%
> > 505.mcf_r               -0.12%
> > 520.omnetpp_r           -0.34%
> > 523.xalancbmk_r         -0.65%
> > 525.x264_r               2.23%
> > 531.deepsjeng_r          0.81%
> > 541.leela_r             -0.02%
> > 548.exchange2_r         10.89%  ----------> big improvement
> > 557.xz_r                 0.38%
> > geomean for intrate      1.10%
> >
> > FP rate
> > 503.bwaves_r             1.41%
> > 507.cactuBSSN_r         -0.14%
> > 508.namd_r               1.54%
> > 510.parest_r            -0.87%
> > 511.povray_r             0.28%
> > 519.lbm_r                0.32%
> > 521.wrf_r               -0.54%
> > 526.blender_r            0.59%
> > 527.cam4_r              -2.70%
> > 538.imagick_r            3.92%
> > 544.nab_r                0.59%
> > 549.fotonik3d_r         -5.44%  -------------> regression
> > 554.roms_r              -2.34%
> > geomean for fprate      -0.28%
> >
> > The 10% improvement of 548.exchange2_r is because there is a 9-layer
> > nested loop, and the loop count of the innermost layer is small
> > (enough for 128-bit vectorization, but not for 256-bit vectorization).
> > Since the loop count cannot be determined statically, the vectorizer
> > will choose 256-bit vectorization, which would never be triggered.
> > Vectorizing the epilog introduces some extra instructions; normally
> > it brings back some performance, but since this is a 9-layer nested
> > loop, the cost of the extra instructions outweighs the gain.
> >
> > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > vectorization is better than 128-bit vectorization. Generally,
> > enabling 256-bit or 512-bit vectorization reduces instruction
> > clockticks but also reduces frequency. When the frequency reduction
> > is smaller than the clocktick reduction, the longer vector width is
> > better than the shorter one; otherwise the opposite holds. The
> > regression of 549.fotonik3d_r is due to this, and similarly for
> > 554.roms_r and 527.cam4_r; for those 3 benchmarks, 512-bit
> > vectorization is best.
> >
> > Bootstrap and regression test on i386 is ok.
> > Ok for trunk?
>
> I don't think 128_optimal does what you think it does.  If you want to
> prefer 128bit AVX adjust the preference, but 128_optimal describes
> a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
> and is _not_ intended for "tuning".

So yes, it's poorly named.  A preparatory patch to clean this up
(and maybe split it into TARGET_AVX256_SPLIT_REGS and
TARGET_AVX128_OPTIMAL) would be nice.

And I'm not convinced that a single SPEC benchmark is good enough to
penalize this for all users.  GCC isn't a benchmark compiler and GCC
does exactly what you expect it to do - try FDO if you want to tell it
more.

Richard.

> Richard.
>
> > Changelog
> > gcc/
> >         * config/i386/i386-option.c (m_CORE_AVX): New macro.
> >         * config/i386/x86-tune.def: Enable 128_optimal for avx and
> >         replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> >         * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> >         * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> >         * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> >         * testsuite/gcc.target/i386/pr70021.c: Ditto.
> >         * testsuite/gcc.target/i386/pr90579.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao
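
Two of the arguments in the quoted message come down to simple arithmetic, and a small model may make them easier to follow. The sketch below is not GCC code: `split_loop` and `relative_runtime` are hypothetical helpers, and the 32-bit element size, the trip count of 6, and the ratios in the usage note are assumptions chosen only for illustration. The first function shows how a small trip count leaves a wide vector loop with nothing to do; the second compares run times under the usual model that time equals clockticks divided by frequency.

```c
#include <stddef.h>

/* How a loop's iterations divide between the vectorized main loop and
   the elements left over for the epilog. Illustrative model only. */
struct loop_split {
    size_t vector_iters;    /* iterations of the vectorized main loop */
    size_t leftover_elems;  /* elements left for the epilog */
};

/* vector_bits is the vector width, elem_bits the element width;
   both are assumptions supplied by the caller. */
struct loop_split split_loop(size_t trip_count, size_t vector_bits,
                             size_t elem_bits)
{
    size_t vf = vector_bits / elem_bits;  /* elements per vector */
    struct loop_split s;
    s.vector_iters = trip_count / vf;
    s.leftover_elems = trip_count % vf;
    return s;
}

/* Run time of the wide-vector code relative to the narrow-vector code,
   assuming time = clockticks / frequency.
   clocktick_ratio = wide clockticks / narrow clockticks
   frequency_ratio = wide frequency  / narrow frequency
   A result below 1.0 means the wider vectors are faster. */
double relative_runtime(double clocktick_ratio, double frequency_ratio)
{
    return clocktick_ratio / frequency_ratio;
}
```

With an assumed trip count of 6 and 32-bit elements, `split_loop(6, 256, 32)` gives zero main-loop iterations and six leftover elements, while `split_loop(6, 128, 32)` gives one main-loop iteration and two leftover elements. With made-up ratios, `relative_runtime(0.80, 0.90)` is below 1.0 (the clocktick saving outweighs the frequency drop), whereas `relative_runtime(0.95, 0.90)` is above 1.0.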