From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-196424-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 26022 invoked by alias); 20 Jun 2018 21:12:18 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 26011 invoked by uid 89); 20 Jun 2018 21:12:17 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=HTo:U*joel, goals, scientists, chunk
X-HELO: mail-yb0-f195.google.com
Received: from mail-yb0-f195.google.com (HELO mail-yb0-f195.google.com) (209.85.213.195) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 20 Jun 2018 21:12:15 +0000
Received: by mail-yb0-f195.google.com with SMTP id q62-v6so387959ybg.5        for <gcc@gcc.gnu.org>; Wed, 20 Jun 2018 14:12:15 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=1e100.net; s=20161025;        h=x-gm-message-state:mime-version:in-reply-to:references:from:date         :message-id:subject:to:cc:content-transfer-encoding;        bh=ASxAJPdNeCMLUolBu5QqDOks9cFfUgdzaPxeoHhd4Is=;        b=S15ZMePyaGTjlST9DawOfWzrId96p5PtiiRTxXJ903XMyhDIkyZEo5ZIyXYYK+Vlu0         cX0k+ksyGI9wt8R7F/pab1rPIMzb26RL2dYXGTCdbIr953TK8hx7rnCmRhCqshJ/ApNC         V/hXfVhywTCckbS3wHwPZ4yCL95RJj9TwIYBM+ZtuenRfpoOs5V0BROlu+/EPrhHMZ7+         h8pFvrIq87+vkDurUz1muvqvcB3Bh4lKsIeSat0pfgBiD8gluEJvDdX5QSUNC3D3ofms         h0B+qlw8v+6XMp+ntdzGub4rQWecxoreXlUcHl4NRu9SH/PoB7xAi3ggC2BQiK2Vmddc         spJg==
X-Gm-Message-State: APt69E0fdWqsMQa8y0jkPiXH1itjtA8ilq5LCS/BD4k6iKD4abuMZTJX	uVXZXeiDnZl726dJa2MBzqxWjNNcCu1H78J6xPI=
X-Google-Smtp-Source: ADUXVKI2WBqogxIAvoM0H8ixZ9keZmJQbZDMPBhpNBMq9AxTd0l3M1unGtgNBkpV0j3oVfC+10GVGmUHdUDqseCXpiY=
X-Received: by 2002:a25:3:: with SMTP id 3-v6mr1563207yba.324.1529529133332; Wed, 20 Jun 2018 14:12:13 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a25:cf87:0:0:0:0:0 with HTTP; Wed, 20 Jun 2018 14:11:52 -0700 (PDT)
In-Reply-To: <CAF9ehCVU26swweYZk+8cOKTk-vBfHyr0SrAMhmaXMBxuxGN92w@mail.gmail.com>
References: <01126195-f718-7dd0-063b-6997e5b82559@molgen.mpg.de> <CAF9ehCVU26swweYZk+8cOKTk-vBfHyr0SrAMhmaXMBxuxGN92w@mail.gmail.com>
From: NightStrike <nightstrike@gmail.com>
Date: Wed, 20 Jun 2018 22:42:00 -0000
Message-ID: <CAF1jjLuoNjPrpWQL4fiwCBs2-xJX2MR7GDdhWYPhWGvXdKG8kA@mail.gmail.com>
Subject: Re: How to get GCC on par with ICC?
To: joel@rtems.org
Cc: Paul Menzel <pmenzel+gcc.gnu.org@molgen.mpg.de>, "gcc@gcc.gnu.org" <gcc@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2018-06/txt/msg00225.txt.bz2

On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill <joel@rtems.org> wrote:
>
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
>
> > Dear GCC folks,
> >
> >
> > Some scientists in our organization still want to use the Intel compiler,
> > as they say, it produces faster code, which is then executed on clusters.
> > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > heavily dependent on the actual program.)
> >
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Some times a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.

If I could perhaps jump in here for a moment...  Just today I hit upon
a series of small (in lines of code) loops that gcc can't vectorize,
and Intel vectorizes like a madman.  They all make heavy use of
std::vector<std::vector<float>>.  Comparisons were with gcc 8.1 and
Intel 2018.u1 on an AMD Opteron 6386 SE, with the program running
under SCHED_FIFO, memory locked with mlockall, affinity set to its own
core, and all interrupts steered off that core.  So, as close to
not-noisy as possible.

I was surprised at the results, but using each compiler's methods of
dumping vectorization info, Intel wins on two points:

1) It actually vectorizes
2) Its vectorization output is much more easily readable

Options were:

gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native

vs:

icc -Ofast -std=gnu++14


So, not exactly exact, but pretty close.
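
For anyone who wants to compare the two vectorization reports on
something trivial first, here is a minimal sketch.  The file name and
the scale_add function are made up for illustration.  The gcc flags
are the -fopt-info ones from the GCC manual; the icc flags are as I
understand the 2018 documentation, so treat them as an assumption and
check them against your install.

```cpp
// report_demo.cpp (hypothetical file name): a loop simple enough that
// both compilers should have something to say about it.
//
// gcc:
//   g++ -Ofast -march=native -fopt-info-vec-optimized \
//       -fopt-info-vec-missed -c report_demo.cpp
// icc (assumed flags, verify locally):
//   icpc -Ofast -qopt-report=5 -qopt-report-phase=vec -c report_demo.cpp
#include <cstddef>

// out[i] += in[i] * k over n contiguous floats, unit stride.
void scale_add(float * out, float const * in, float k, std::size_t n)
{
        for (std::size_t i = 0; i < n; ++i)
                out[i] += in[i] * k;
}
```

gcc prints its notes to stderr by default, while icc writes a separate
report file next to the object file, as far as I know.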


So here's an example of a chunk of code (not very readable, sorry
about that) that Intel can vectorize, making it about 50% faster:

        std::size_t nLayers { input.nn.size() };
        //std::size_t ySize = std::max_element(input.nn.cbegin(),
        //        input.nn.cend(), [](auto a, auto b){ return a.size() < b.size();
        //        })->size();
        std::size_t ySize = 0;
        for (auto const & nn: input.nn)
                ySize = std::max(ySize, nn.size());

        float yNorm[ySize];
        for (auto & y: yNorm)
                y = 0.0f;
        for (std::size_t i = 0; i < xSize; ++i)
                yNorm[i] = xNorm[i];
        for (std::size_t layer = 0; layer < nLayers; ++layer) {
                auto & nn = input.nn[layer];
                auto & b = nn.back();
                float y[ySize];
                for (std::size_t i = 0; i < nn[0].size(); ++i) {
                        y[i] = b[i];
                        for (std::size_t j = 0; j < nn.size() - 1; ++j)
                                y[i] += nn.at(j).at(i) * yNorm[j];
                }
                for (std::size_t i = 0; i < ySize; ++i) {
                        if (layer < nLayers - 1)
                                y[i] = std::max(y[i], 0.0f);
                        yNorm[i] = y[i];
                }
        }


If I was better at godbolt, I could show the asm, but I'm not.  I'm
willing to learn, though.
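
In case it helps narrow things down, below is a rough sketch of the
same computation with the two inner loops interchanged, so the
innermost loop runs over one contiguous row with unit stride and no
bounds-checked .at() calls.  I have not verified that gcc 8.1
vectorizes this form; it is just the usual kind of restructuring that
tends to help.  The Layer alias and the forward() function are
hypothetical stand-ins for my real types, and the sketch assumes every
row of a layer has the same length and that the input holds at least
nn.size() - 1 values.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one element of input.nn: rows 0..n-2 hold
// weights, the last row holds the bias (what nn.back() returns above).
using Layer = std::vector<std::vector<float>>;

// Same arithmetic as the original loop nest, but with i innermost.
// Each output still accumulates bias first, then j = 0, 1, ... in order.
std::vector<float> forward(std::vector<Layer> const & layers,
                           std::vector<float> in)
{
        for (std::size_t layer = 0; layer < layers.size(); ++layer) {
                Layer const & nn = layers[layer];
                std::size_t const nOut = nn[0].size();
                std::size_t const nIn = nn.size() - 1;

                std::vector<float> out(nn.back());  // start from the bias row
                float * o = out.data();

                for (std::size_t j = 0; j < nIn; ++j) {
                        float const * row = nn[j].data();  // contiguous row
                        float const scale = in[j];
                        for (std::size_t i = 0; i < nOut; ++i)
                                o[i] += row[i] * scale;
                }

                // Clamp at zero on every layer except the last one.
                if (layer + 1 < layers.size())
                        for (std::size_t i = 0; i < nOut; ++i)
                                o[i] = std::max(o[i], 0.0f);

                in = std::move(out);
        }
        return in;
}
```

Unlike the original, this only ever touches the nn[0].size() outputs of
each layer instead of all ySize slots, so it is not a drop-in
replacement, just a shape for the compiler to chew on.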