Subject: Re: [PATCH, vec-tails 01/10] New compiler options
From: Jeff Law
To: Ilya Enkovich
Cc: Richard Biener, GCC Patches
Date: Mon, 20 Jun 2016 22:33:00 -0000

On 06/17/2016 04:41 AM, Ilya Enkovich wrote:
>>
>> 1. You've got 3 modes for epilogue vectorization.  Is this an artifact of
>> not really having good heuristics yet for which mode to apply to a
>> particular loop at this time?
>>
>> 2. Similarly for cost models.
>
> All three modes are profitable in different situations.  Which mode is
> profitable depends on the loop structure and target capabilities.  The
> ultimate goal is to have all three modes enabled by default.  I can't say
> the current heuristics are good enough for all cases and targets, so I
> don't enable epilogue vectorization by default for now.  This is to be
> measured, analyzed and tuned in time for GCC 7.1.
>
> I added the cost model simply to be able to force epilogue vectorization
> for stability testing (force some mode of epilogue vectorization and check
> that nothing fails) and for performance testing/tuning (try to find cases
> where we could benefit from epilogue vectorization but don't because of a
> bad cost model).  Also, I don't want to force epilogue vectorization for
> every loop whose vectorization is forced via the unlimited cost model,
> because that may hurt performance for simd loops.
Thanks.  That overview helps a lot.

We've done something similar to what you're doing with cost models for
testing in the scheduler and other places in the past.  The costing models
seem geared more towards us as developers than towards users; you might
consider keeping those as local changes and not documenting them.

Understood completely on the modes.
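To make the discussion concrete, here is a minimal sketch of the kind of
loop these options target.  The saxpy loop, the assumed VF of 16, and the
example command line are illustrative only; the mode summaries paraphrase
the description above, and the exact semantics are defined by the patch
series.

/* Illustrative sketch only: a loop whose remainder ("epilogue") iterations
   are what -ftree-vectorize-epilogues (from this patch series) is about.
   Assuming 512-bit vectors of float, the vectorization factor VF is 16,
   so for n = 1000 the main vector loop covers 992 elements and 8 remain.
   Roughly, per the description above:
     mask    - execute the epilogue as one more full-width iteration with
               a mask enabling only the remaining 8 lanes;
     nomask  - vectorize the epilogue with a narrower, unmasked vector
               (ymm/xmm instead of zmm);
     combine - fold the remaining iterations into the masked main loop
               instead of emitting a separate epilogue loop.
   Example command line with the patched compiler (illustrative):
     gcc -Ofast -flto -funroll-loops -march=knl \
         -ftree-vectorize-epilogues=mask saxpy.c  */

#include <stddef.h>

void
saxpy (float *restrict y, const float *restrict x, float a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    y[i] += a * x[i];
}
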
> Currently I have numbers collected on various suites for a KNL machine.
> Masking mode (-ftree-vectorize-epilogues=mask) shows decent results
> (dynamic cost model, -Ofast -flto -funroll-loops).  I don't see
> significant losses and there are a few significant gains.  For the combine
> and nomask modes the results are not good enough yet - there are several
> significant performance losses.  My guess is that the current threshold
> for combine is way too high, and for the nomask variant we had better
> choose the smallest vector size for epilogues instead of the next
> available one (use zmm for the body and xmm for the epilogue instead of
> zmm for the body and ymm for the epilogue).
>
> ICC shows better results in these modes, which makes me believe we can
> tune them as well.  Overall, nomask mode shows worse results compared to
> the masking options, which is quite expected for KNL.
>
> Unfortunately, some big gains demonstrated by ICC are not reproducible
> with GCC because we can't vectorize the required hot loops in the first
> place.  E.g. on 200.sixtrack GCC gets nothing while ICC gets ~40% in all
> three modes.
I hadn't pondered that case.  Certainly if GCC isn't vectorizing as much,
we're not going to have as many opportunities for optimizing the
vec-tails.

Given the results with ICC, we're probably best off keeping all 3 modes
and working to get them tuned correctly.

> I don't have full statistics for Haswell, but synthetic tests show the
> situation is really different from KNL.  Even for the 'perfect' iteration
> count (VF * 2 - 1), the scalar version of the epilogue shows the same
> result as a masked one.  It means the ratio of vector code performance to
> scalar code performance is not as high as on KNL (KNL is more vector
> oriented and has weaker scalar performance; the doubled vector size also
> matters here) and the masking cost is higher on Haswell.  We still focus
> more on AVX-512 targets because of their rich masking capabilities and
> wider vectors.
Understood.

Jeff
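As a concrete illustration of the 'perfect' iteration count mentioned
above: with an assumed VF of 16 (512-bit vectors of float; the thread does
not fix a value), n = VF * 2 - 1 = 31 gives the main vector loop exactly
one full iteration and leaves the largest possible remainder, VF - 1 = 15,
for the epilogue - the case where a vectorized epilogue should pay off
most.

/* Worked example of the "perfect" epilogue case n = VF * 2 - 1, using an
   assumed VF of 16.  The main loop gets one full vector; the epilogue
   gets the maximum possible VF - 1 = 15 iterations.  */
#include <stdio.h>

int
main (void)
{
  const unsigned vf = 16;         /* assumed vectorization factor */
  const unsigned n = 2 * vf - 1;  /* "perfect" iteration count: 31 */

  printf ("main vector loop iterations: %u\n", n / vf);  /* 1  */
  printf ("epilogue iterations:         %u\n", n % vf);  /* 15 */
  return 0;
}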