From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-513040-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 72947 invoked by alias); 12 Nov 2019 08:29:06 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 72938 invoked by uid 89); 12 Nov 2019 08:29:05 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-2.6 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.1 spammy=convinced, H*i:pd6Z7zT, H*f:pd6Z7zT, H*f:sk:AP_1dSB
X-HELO: mail-lf1-f68.google.com
Received: from mail-lf1-f68.google.com (HELO mail-lf1-f68.google.com) (209.85.167.68) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 12 Nov 2019 08:29:04 +0000
Received: by mail-lf1-f68.google.com with SMTP id q5so6076148lfo.10        for <gcc-patches@gcc.gnu.org>; Tue, 12 Nov 2019 00:29:03 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com; s=20161025;        h=mime-version:references:in-reply-to:from:date:message-id:subject:to         :cc;        bh=jBnfKvSn9wtcAxb3Kgj+ZGI7+mOQPanJX2I4gMJ5Uc8=;        b=tOxCtGHLlQ7iJICGP5yLwIEeg1KI72e9D8HqdR5jtYdp+wjcRsX0sXfLG4tyZxoEQd         Zj8xU4O9kdioT8CKOVlxs4G46yzp7xwFxe8g2TAv+5lBz1Zol8SQuavhSci8+xGG4/Q1         MhE9jex02I0mW0yXgWs3irOS6bthhsYNE4Cq15l4ggXzTwhVqTPdAfIwc9HPL29pJB4s         MpC9himXb/ZXctJectyDRdIX9fkNMag5D8KL9xhwBWFW0M1/2zkQ2k+T2gSxgC3z+0x1         NIctRXfynneITSo1NIMTXoeCNNFxDuWdYHTttR+IrKiJl4Kqh8nWVON16rQnPm27vS0r         t50Q==
MIME-Version: 1.0
References: <CAMZc-byz4N3PUqAk0RqZU+=DEJhYw_curYd1JDn_dNjun5xskw@mail.gmail.com> <CAFiYyc0WZpbEWs9Rqahv4rvdM=pd6Z7zT+AP_1dSBt1UUd70EA@mail.gmail.com>
In-Reply-To: <CAFiYyc0WZpbEWs9Rqahv4rvdM=pd6Z7zT+AP_1dSBt1UUd70EA@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 12 Nov 2019 08:29:00 -0000
Message-ID: <CAFiYyc0hzDBVton+caPSHgGzqkw5CXDpqejYrhGEASPtsNUBWQ@mail.gmail.com>
Subject: Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.
To: Hongtao Liu <crazylht@gmail.com>
Cc: "H. J. Lu" <hjl.tools@gmail.com>, GCC Patches <gcc-patches@gcc.gnu.org>, 	Uros Bizjak <ubizjak@gmail.com>
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2019-11/txt/msg00833.txt.bz2

On Tue, Nov 12, 2019 at 9:19 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   This patch sets X86_TUNE_AVX128_OPTIMAL by default for all AVX
> > targets, because we found there is still a performance gap between
> > 128-bit auto-vectorization and 256-bit auto-vectorization even with
> > the epilog vectorized.
> >   The performance impact of setting avx128_optimal by default on
> > SPEC2017 with options "-march=native -funroll-loops -Ofast -flto" on
> > CLX is as follows:
> >
> >     INT rate
> >     500.perlbench_r         -0.32%
> >     502.gcc_r               -1.32%
> >     505.mcf_r               -0.12%
> >     520.omnetpp_r           -0.34%
> >     523.xalancbmk_r         -0.65%
> >     525.x264_r               2.23%
> >     531.deepsjeng_r          0.81%
> >     541.leela_r             -0.02%
> >     548.exchange2_r         10.89%  ----------> big improvement
> >     557.xz_r                 0.38%
> >     geomean for intrate      1.10%
> >
> >     FP rate
> >     503.bwaves_r             1.41%
> >     507.cactuBSSN_r         -0.14%
> >     508.namd_r               1.54%
> >     510.parest_r            -0.87%
> >     511.povray_r             0.28%
> >     519.lbm_r                0.32%
> >     521.wrf_r               -0.54%
> >     526.blender_r            0.59%
> >     527.cam4_r              -2.70%
> >     538.imagick_r            3.92%
> >     544.nab_r                0.59%
> >     549.fotonik3d_r         -5.44%  ----------> regression
> >     554.roms_r              -2.34%
> >     geomean for fprate      -0.28%
> >
> > The 10% improvement of 548.exchange2_r is because there is a 9-level
> > nested loop, and the loop count for the innermost level is small (enough
> > for 128-bit vectorization, but not for 256-bit vectorization).
> > Since the loop count cannot be determined statically, the vectorizer
> > chooses 256-bit vectorization, which is never triggered at run time. The
> > vectorization of the epilog introduces some extra instructions;
> > normally it brings back some performance, but since this is a 9-level
> > nested loop, the cost of the extra instructions outweighs the gain.
> >
> > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > vectorization is better than 128-bit vectorization. Generally, when
> > enabling 256-bit or 512-bit vectorization, there is a reduction in
> > instruction clockticks but also a reduction in frequency. When the
> > frequency reduction is smaller than the clocktick reduction, the longer
> > vector width is better than the shorter one; otherwise the opposite
> > holds. The regression of 549.fotonik3d_r is due to this, and similarly
> > for 554.roms_r and 527.cam4_r; for those 3 benchmarks, 512-bit
> > vectorization is best.
> >
> > Bootstrap and regression tests on i386 are ok.
> > Ok for trunk?
>
> I don't think 128_optimal does what you think it does.  If you want to
> prefer 128bit AVX adjust the preference, but 128_optimal describes
> a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
> and is _not_ intended for "tuning".

So yes, it's poorly named.  A preparatory patch to clean this up
(and maybe split it into TARGET_AVX256_SPLIT_REGS and TARGET_AVX128_OPTIMAL)
would be nice.

And I'm not convinced that a single SPEC benchmark is good enough to
penalize this for all users.  GCC isn't a benchmark compiler and GCC
does exactly what you expect it to do - try FDO if you want to tell it more.

Richard.

> Richard.
>
> > Changelog
> >     gcc/
> >             * config/i386/i386-option.c (m_CORE_AVX): New macro.
> >             * config/i386/x86-tune.def: Enable 128_optimal for avx and
> >             replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> >             * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> >             * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> >             * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> >             * testsuite/gcc.target/i386/pr70021.c: Ditto.
> >             * testsuite/gcc.target/i386/pr90579.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao