In-Reply-To: <7ac2a8d9-0031-cdd0-17d2-7c00284e9e09@redhat.com>
References: <20160519193619.GB40563@msticlxl57.ims.intel.com>
 <7ac2a8d9-0031-cdd0-17d2-7c00284e9e09@redhat.com>
From: Ilya Enkovich
Date: Fri, 17 Jun 2016 10:41:00 -0000
Subject: Re: [PATCH, vec-tails 01/10] New compiler options
To: Jeff Law
Cc: Richard Biener, GCC Patches

2016-06-16 8:06 GMT+03:00 Jeff Law:
> On 05/20/2016 05:40 AM, Ilya Enkovich wrote:
>>
>> 2016-05-20 14:17 GMT+03:00 Richard Biener:
>>>
>>> On Fri, May 20, 2016 at 11:50 AM, Ilya Enkovich wrote:
>>>>
>>>> 2016-05-20 12:26 GMT+03:00 Richard Biener:
>>>>>
>>>>> On Thu, May 19, 2016 at 9:36 PM, Ilya Enkovich wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This patch introduces new options used for loop epilogue
>>>>>> vectorization.
>>>>>
>>>>> Why's that?  This is a bit too much for the casual user, and if it
>>>>> is really necessary to control this via options then it is not
>>>>> fine-grained enough.
>>>>>
>>>>> Why doesn't the vectorizer/backend have enough info to decide this
>>>>> itself?
>>>>
>>>> I don't expect a casual user to decide which modes to choose.  These
>>>> controls are added for debugging and performance measurement
>>>> purposes.  I see now that I'm missing -ftree-vectorize-epilogues
>>>> aliased to -ftree-vectorize-epilogues=all.  Surely I expect epilogue
>>>> and short-loop vectorization to be enabled by default at -O3 or by
>>>> -ftree-vectorize-loops.
>>>
>>> Can you make all these --params then?  I think to be useful to users
>>> we'd want them to be loop pragmas rather than options.
>>
>> OK, I'll change it to params.  I didn't think about control via
>> pragmas but will do now.
>
> So the questions I'd like to see answered:
>
> 1. You've got 3 modes for epilogue vectorization.  Is this an artifact
> of not really having good heuristics yet for which mode to apply to a
> particular loop at this time?
>
> 2. Similarly for cost models.

All three modes are profitable in different situations.
Which mode is profitable depends on the loop structure and on target
capabilities.  The ultimate goal is to have all three modes enabled by
default.  I can't state that the current heuristics are good enough for
all cases and targets, and therefore don't enable epilogue vectorization
by default for now.  This is to be measured, analyzed and tuned in time
for GCC 7.1.

I added the cost model option simply to have a way to force epilogue
vectorization for stability testing (force some mode of epilogue
vectorization and check that nothing fails) and for performance
testing/tuning (try to find cases where we could benefit from epilogue
vectorization but don't, due to a bad cost model).  Also, I don't want
to force epilogue vectorization for all loops whose vectorization is
forced via the unlimited cost model, because that may hurt performance
for simd loops.

> In the cover message you indicated you were getting expected gains on
> KNL, but not on Haswell.  Do you have any sense yet why you're not
> getting good results on Haswell yet?  For KNL are you getting those
> speedups with a generic set of options or are those with a custom set
> of options to set the mode & cost models?

Currently I have numbers collected on various suites on a KNL machine.
The masking mode (-ftree-vectorize-epilogues=mask) shows decent results
with the dynamic cost model (-Ofast -flto -funroll-loops): I see no
significant losses and a few significant gains.  For the combine and
nomask modes the results are not good enough yet; there are several
significant performance losses.  My guess is that the current threshold
for combine is way too high, and that for the nomask variant we should
choose the smallest vector size for the epilogue instead of the next
available one (use zmm for the body and xmm for the epilogue instead of
zmm for the body and ymm for the epilogue).  ICC shows better results in
these modes, which makes me believe we can tune them as well.  Overall,
the nomask mode shows worse results than the options with masking, which
is quite expected on KNL.
Unfortunately, some big gains demonstrated by ICC are not reproducible
with GCC because we fail to vectorize the required hot loops in the
first place.  E.g. on 200.sixtrack GCC gains nothing while ICC gains
~40% in all three modes.

I don't have full statistics for Haswell yet, but synthetic tests show
the situation is really different from KNL.  Even for the "perfect"
iteration count (VF * 2 - 1, i.e. the largest possible epilogue, e.g. 15
iterations with VF = 8), the scalar version of the epilogue performs the
same as a masked one.  This means the ratio of vector to scalar code
performance is not as high as on KNL (KNL is more vector oriented and
has weaker scalar performance; its doubled vector size also matters
here), and the masking cost is higher on Haswell.  We still focus more
on AVX-512 targets because of their rich masking capabilities and wider
vectors.

Thanks,
Ilya

>
> jeff