References: <01126195-f718-7dd0-063b-6997e5b82559@molgen.mpg.de>
From: NightStrike
Date: Wed, 20 Jun 2018 22:42:00 -0000
Subject: Re: How to get GCC on par with ICC?
To: joel@rtems.org
Cc: Paul Menzel, "gcc@gcc.gnu.org"

On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill wrote:
>
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
>
> > Dear GCC folks,
> >
> >
> > Some scientists in our organization still want to use the Intel compiler,
> > as they say, it produces faster code, which is then executed on clusters.
> > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > heavily dependent on the actual program.)
> >
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Some times a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.

If I could perhaps jump in here for a moment... Just today I hit upon
a series of small (in lines of code) loops that gcc can't vectorize,
and intel vectorizes like a madman. They all involve a lot of heavy
use of nested std::vector. Comparisons were with gcc 8.1, intel
2018.u1, an AMD Opteron 6386 SE, with the program running as
sched_FIFO, mlockall, affinity set to its own core, and all interrupts
vectored off that core. So, as close to not-noisy as possible.
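
For anyone who wants to reproduce a similarly quiet measurement environment, a minimal Linux-only sketch could look like the following. The helper names (pin_to_core, lock_memory, use_fifo_scheduling) are made up for illustration and are not taken from the original program. Steering interrupts away from the core happens outside the process (for example through /proc/irq/*/smp_affinity) and is not shown. SCHED_FIFO and mlockall usually need elevated privileges, so each helper simply reports success or failure.

```cpp
// Linux-only sketch; helper names are hypothetical, not from the original program.
#include <sched.h>      // sched_setaffinity, sched_setscheduler, CPU_SET
#include <sys/mman.h>   // mlockall

// Pin the calling thread to a single core. Returns true on success.
bool pin_to_core(int core)
{
    if (core < 0 || core >= CPU_SETSIZE)
        return false;                       // reject cores the mask cannot hold
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Lock current and future pages into RAM so page faults do not disturb timing.
bool lock_memory()
{
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}

// Switch the calling thread to the SCHED_FIFO real-time policy.
bool use_fifo_scheduling(int priority)
{
    sched_param param{};
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;
}
```

A benchmark would call these once at startup and print a warning for any helper that returns false, since a missing privilege silently changes what is being measured.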

I was surprised at the results, but using each compiler's methods of
dumping vectorization info, intel wins on two points:

1) It actually vectorizes
2) Its vectorization output is much more easily readable

Options were:

gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native

vs:

icc -Ofast -std=gnu++14

So, not exactly identical, but pretty close.

So here's an example of a chunk of code (not very readable, sorry
about that) that intel can vectorize, and subsequently make about 50%
faster:

    std::size_t nLayers { input.nn.size() };
    //std::size_t ySize = std::max_element(input.nn.cbegin(), input.nn.cend(), [](auto a, auto b){ return a.size() < b.size(); })->size();
    std::size_t ySize = 0;
    for (auto const & nn: input.nn)
        ySize = std::max(ySize, nn.size());

    float yNorm[ySize];
    for (auto & y: yNorm)
        y = 0.0f;
    for (std::size_t i = 0; i < xSize; ++i)
        yNorm[i] = xNorm[i];
    for (std::size_t layer = 0; layer < nLayers; ++layer) {
        auto & nn = input.nn[layer];
        auto & b = nn.back();
        float y[ySize];
        for (std::size_t i = 0; i < nn[0].size(); ++i) {
            y[i] = b[i];
            for (std::size_t j = 0; j < nn.size() - 1; ++j)
                y[i] += nn.at(j).at(i) * yNorm[j];
        }
        for (std::size_t i = 0; i < ySize; ++i) {
            if (layer < nLayers - 1)
                y[i] = std::max(y[i], 0.0f);
            yNorm[i] = y[i];
        }
    }

If I was better at godbolt, I could show the asm, but I'm not. I'm
willing to learn, though.
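
To make the snippet above easier to paste into Compiler Explorer, here is one possible self-contained version. It is a sketch under several assumptions, not the original program: the names Layer and forward are invented, input.nn is assumed to be a vector of layers in which each layer is a vector of equally sized rows whose last row holds the biases, and xNorm/xSize are replaced by a std::vector parameter. It also deviates from the original in three ways that may affect what the vectorizer sees: the variable-length arrays become std::vector, the bounds-checked .at() calls become operator[], and y is zero-filled for every layer instead of being left uninitialized past nn[0].size().

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical alias: rows 0..n-2 are weights, the last row is the bias vector.
using Layer = std::vector<std::vector<float>>;

// Runs the layers in order, applying max(y, 0) after every layer except the last.
std::vector<float> forward(std::vector<Layer> const & layers,
                           std::vector<float> const & x)
{
    // Scratch size: large enough for the input, every row count and every row length.
    std::size_t ySize = x.size();
    for (auto const & nn : layers) {
        ySize = std::max(ySize, nn.size());
        if (!nn.empty())
            ySize = std::max(ySize, nn.back().size());
    }

    std::vector<float> yNorm(ySize, 0.0f);
    std::copy(x.begin(), x.end(), yNorm.begin());
    std::vector<float> y(ySize);

    for (std::size_t layer = 0; layer < layers.size(); ++layer) {
        Layer const & nn = layers[layer];
        if (nn.empty())
            continue;                              // nothing to apply
        std::vector<float> const & b = nn.back();
        std::size_t const nIn = nn.size() - 1;     // number of weight rows
        std::size_t const nOut = b.size();         // assumes all rows have this length

        std::fill(y.begin(), y.end(), 0.0f);
        for (std::size_t i = 0; i < nOut; ++i) {
            float acc = b[i];
            for (std::size_t j = 0; j < nIn; ++j)
                acc += nn[j][i] * yNorm[j];
            y[i] = acc;
        }

        bool const last = (layer + 1 == layers.size());
        for (std::size_t i = 0; i < ySize; ++i)
            yNorm[i] = last ? y[i] : std::max(y[i], 0.0f);
    }
    return yNorm;
}
```

Compiling this with the gcc options listed above plus -fopt-info-vec-missed should make GCC report the loops it declined to vectorize along with a reason, which would be a concrete starting point for a bug report. Whether GCC treats this version the same way as the original is untested here.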