From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-412472-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 88186 invoked by alias); 3 Nov 2015 11:47:45 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 88170 invoked by uid 89); 3 Nov 2015 11:47:44 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mail-yk0-f182.google.com
Received: from mail-yk0-f182.google.com (HELO mail-yk0-f182.google.com) (209.85.160.182) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Tue, 03 Nov 2015 11:47:43 +0000
Received: by ykdr3 with SMTP id r3so12493960ykd.1        for <gcc-patches@gcc.gnu.org>; Tue, 03 Nov 2015 03:47:41 -0800 (PST)
MIME-Version: 1.0
X-Received: by 10.129.107.8 with SMTP id g8mr19838178ywc.267.1446551261331; Tue, 03 Nov 2015 03:47:41 -0800 (PST)
Received: by 10.37.117.136 with HTTP; Tue, 3 Nov 2015 03:47:41 -0800 (PST)
In-Reply-To: <CAEoMCqSmMRW1C2LniYShbfdA+JfSS6kzfrPYCcdd-rdVXa4mzg@mail.gmail.com>
References: <CAEoMCqSmMRW1C2LniYShbfdA+JfSS6kzfrPYCcdd-rdVXa4mzg@mail.gmail.com>
Date: Tue, 03 Nov 2015 11:47:00 -0000
Message-ID: <CAFiYyc2badGgiQDyAuW6N5CnD6qMGCNCHD3fFvqK=un5V5BmWg@mail.gmail.com>
Subject: Re: [RFC] Combine vectorized loops with its scalar remainder.
From: Richard Biener <richard.guenther@gmail.com>
To: Yuri Rumyantsev <ysrumyan@gmail.com>
Cc: gcc-patches <gcc-patches@gcc.gnu.org>, Jeff Law <law@redhat.com>, 	Igor Zamyatin <izamyatin@gmail.com>, =?UTF-8?B?0JjQu9GM0Y8g0K3QvdC60L7QstC40Yc=?= <enkovich.gnu@gmail.com>
Content-Type: text/plain; charset=UTF-8
X-IsSubscribed: yes
X-SW-Source: 2015-11/txt/msg00188.txt.bz2

On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi All,
>
> Here is a preliminary patch to combine a vectorized loop with its scalar
> remainder, a draft of which was proposed by Kirill Yukhin a month ago:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
> It was tested with the '-mavx2' option on a Haswell processor.
> Its main goal is to improve the performance of vectorized loops for AVX512.
> Note that only loads/stores and simple reductions with binary operations are
> converted to masked form, e.g. load --> masked load and reduction like
> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed through
> creation of a new vector induction variable initialized with consecutive values
> from 0..VF-1, a new constant vector upper bound which contains the number of
> iterations, and the result of their comparison, which serves as the mask vector.
> This implementation has several restrictions:
>
> 1. Multiple types are not supported.
> 2. SLP is not supported.
> 3. Gathers/scatters are also not supported.
> 4. Vectorization of loops with low trip count is not implemented yet since
>    it requires additional design and tuning.
>
> We are planning to eliminate all these restrictions in GCC 7.
>
> This patch will be extended to include a cost model to reject unprofitable
> transformations, e.g. the new vector body cost will be evaluated through a
> new target hook which estimates the cost of masking different vector
> statements. A new threshold parameter will be introduced which determines
> the permissible cost increase; it will be tuned on an AVX512 machine.
> This patch is not in sync with Ilya Enkovich's changes for AVX512 masked
> load/store support since only part of them is in the trunk compiler.
>
> Any comments will be appreciated.
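
For reference, the in-loop masking scheme quoted above can be sketched in
plain C, with each vector lane emulated by an inner loop. This is only an
illustration of the description, not code from the patch: VF = 8, the name
masked_sum and the choice of <op> = + are assumptions made for the example.

```c
#include <stddef.h>

#define VF 8 /* assumed vectorization factor (hypothetical, e.g. 8 x 32-bit) */

/* Sum a[0..n-1] with one "vector" loop and no scalar remainder loop.
   Vector lanes are emulated with inner loops over l. */
int masked_sum(const int *a, size_t n)
{
    size_t iv[VF]; /* vector induction variable, starts as {0, ..., VF-1} */
    int acc[VF];   /* vector accumulator (r2 / r1 in the description) */

    for (size_t l = 0; l < VF; l++) {
        iv[l] = l;
        acc[l] = 0;
    }

    for (size_t i = 0; i < n; i += VF) {
        for (size_t l = 0; l < VF; l++) {
            int m = iv[l] < n;         /* mask = (iv < upper bound) */
            int f = m ? a[iv[l]] : 0;  /* masked load: inactive lanes do not read */
            int t = f + acc[l];        /* t = f <op> r2 */
            acc[l] = m ? t : acc[l];   /* r1 = m ? t : r2 */
            iv[l] += VF;               /* advance the vector IV */
        }
    }

    /* Horizontal reduction of the vector accumulator. */
    int sum = 0;
    for (size_t l = 0; l < VF; l++)
        sum += acc[l];
    return sum;
}
```

Note that the mask is recomputed in every iteration, although it can only
be partially false in the last one.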

As stated in the previous discussion, I don't think the extra mask IV is a
good idea; we should instead have a masked final iteration for the epilogue
(yes, that's not really "combined" then).  This is because in the end we'd
want not only AVX512 to benefit from this work but also other ISAs that can
do unaligned or masked operations (we can overlap the epilogue work with the
vectorized work or use masked loads/stores available with AVX).  Note that
the same applies to the alignment prologue if present; I can't see how you
can handle that with the in-loop approach.
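
To illustrate the two epilogue variants, here is a rough sketch in plain C
with vector lanes emulated by loops. It is not taken from any patch: VF = 8,
the function names and the example operation dst[i] = src[i] + c are all
assumptions. The overlap variant is only valid here because dst and src do
not alias, so recomputing a few elements is harmless.

```c
#include <stddef.h>

#define VF 8 /* assumed vectorization factor (hypothetical) */

/* One full-width, unmasked "vector" operation on VF elements. */
static void vec_add(int *restrict dst, const int *restrict src, int c)
{
    for (size_t l = 0; l < VF; l++)
        dst[l] = src[l] + c;
}

/* Variant 1: unmasked main loop, then one masked final iteration. */
void add_masked_epilogue(int *restrict dst, const int *restrict src,
                         int c, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        vec_add(dst + i, src + i, c);

    /* Single masked iteration covering the remaining n - i < VF elements. */
    for (size_t l = 0; l < VF; l++)
        if (i + l < n) /* lane mask */
            dst[i + l] = src[i + l] + c;
}

/* Variant 2: unmasked main loop, then one unaligned full-width iteration
   that overlaps the tail of the already processed range. */
void add_overlap_epilogue(int *restrict dst, const int *restrict src,
                          int c, size_t n)
{
    if (n < VF) { /* too short to overlap: scalar fallback */
        for (size_t k = 0; k < n; k++)
            dst[k] = src[k] + c;
        return;
    }

    size_t i = 0;
    for (; i + VF <= n; i += VF)
        vec_add(dst + i, src + i, c);

    if (i < n) /* recompute the last VF elements, ending exactly at n */
        vec_add(dst + (n - VF), src + (n - VF), c);
}
```

In both variants the main loop carries no mask at all; only the final
iteration pays for masking or for the overlapping recomputation.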

Richard.