Date: Tue, 03 Nov 2015 11:47:00 -0000
Subject: Re: [RFC] Combine vectorized loops with its scalar remainder.
From: Richard Biener
To: Yuri Rumyantsev
Cc: gcc-patches, Jeff Law, Igor Zamyatin, Илья Энкович

On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev wrote:
> Hi All,
>
> Here is a preliminary patch to combine a vectorized loop with its scalar
> remainder, a draft of which was proposed by Kirill Yukhin a month ago:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
> It was tested with the '-mavx2' option on a Haswell processor.
> The main goal is to improve the performance of vectorized loops for AVX512.
> Note that only loads/stores and simple reductions with binary operations are
> converted to masked form, e.g. load --> masked load and reduction like
> r1 = f r2 --> t = f r2; r1 = m ? t : r2. Masking is performed by creating
> a new vector induction variable initialized with the consecutive values
> 0..VF-1, a new constant vector upper bound that contains the number of
> iterations, and the result of their comparison, which is used as the mask
> vector.
> This implementation has several restrictions:
>
> 1. Multiple types are not supported.
> 2. SLP is not supported.
> 3. Gathers/scatters are also not supported.
> 4. Vectorization of loops with a low trip count is not implemented yet since
> it requires additional design and tuning.
>
> We are planning to eliminate all these restrictions in GCC 7.
>
> This patch will be extended to include a cost model that rejects unprofitable
> transformations, e.g. the new vector body cost will be evaluated through a new
> target hook which estimates the cost of masking different vector statements.
> A new threshold parameter will be introduced that determines the permissible
> cost increase; it will be tuned on an AVX512 machine.
> This patch is not in sync with Ilya Enkovich's changes for AVX512 masked
> load/store support since only part of them is in the trunk compiler.
>
> Any comments will be appreciated.

As stated in the previous discussion, I don't think the extra mask IV is
a good idea; we should instead have a masked final iteration for the
epilogue (yes, that's not really "combined" then). In the end we want not
only AVX512 to benefit from this work but also other ISAs that can do
unaligned or masked operations (we can overlap the epilogue work with the
vectorized work, or use the masked loads/stores available with AVX).

Note that the same applies to the alignment prologue, if present; I can't
see how you can handle that with the in-loop approach.

Richard.
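
For illustration only, here is a scalar C model of the two schemes discussed
in this thread, applied to a simple sum reduction. It is not taken from the
patch: VF, the function names sum_mask_iv and sum_masked_epilogue, and the
lane-by-lane emulation of vector operations are all assumptions made for
this sketch. The select in the reduction keeps the old accumulator value in
inactive lanes.

```c
#include <assert.h>

#define VF 8  /* assumed vectorization factor, chosen only for this sketch */

/* In-loop masking as described in the quoted proposal: a vector IV that
   starts at {0, ..., VF-1} is compared against the trip count n in every
   iteration, and the comparison result masks the load and the reduction. */
static int sum_mask_iv(const int *a, int n)
{
    assert(n >= 0);
    int acc[VF] = {0};
    int iv[VF];
    for (int l = 0; l < VF; l++)
        iv[l] = l;                          /* IV = {0, 1, ..., VF-1} */

    for (int i = 0; i < n; i += VF) {
        for (int l = 0; l < VF; l++) {      /* one emulated vector statement */
            int m = iv[l] < n;              /* mask bit for this lane */
            int x = m ? a[iv[l]] : 0;       /* masked load: inactive lanes read nothing */
            int t = acc[l] + x;
            acc[l] = m ? t : acc[l];        /* masked reduction update */
            iv[l] += VF;                    /* advance the vector IV */
        }
    }

    int sum = 0;
    for (int l = 0; l < VF; l++)
        sum += acc[l];
    return sum;
}

/* Alternative suggested in the reply: keep the vector body unmasked and
   run a single masked iteration for the epilogue. */
static int sum_masked_epilogue(const int *a, int n)
{
    assert(n >= 0);
    int acc[VF] = {0};
    int i = 0;

    for (; i + VF <= n; i += VF)            /* unmasked main vector loop */
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l];

    for (int l = 0; l < VF; l++)            /* one masked final iteration */
        if (i + l < n)                      /* mask covers the n - i leftover elements */
            acc[l] += a[i + l];

    int sum = 0;
    for (int l = 0; l < VF; l++)
        sum += acc[l];
    return sum;
}
```

Both functions return the same sum for any n >= 0. The difference is that
the first pays for the IV update and the compare in every iteration, while
the second computes a mask only once, after the main loop.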