From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=qweC=CD=suse.de=rguenther@sourceware.org>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [IPv6:2001:67c:2178:6::1c])
	by sourceware.org (Postfix) with ESMTPS id 3E41B3858C1F
	for <gcc-patches@gcc.gnu.org>; Thu, 15 Jun 2023 06:52:01 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3E41B3858C1F
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
	by smtp-out1.suse.de (Postfix) with ESMTP id 32A9F22383;
	Thu, 15 Jun 2023 06:52:00 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1686811920; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=zqZen88s8MWmx/8fvrvVthlrESLwGBJ7f8nDcLXUK5o=;
	b=mmq2HE5cOz/K2WLB8TDgxBJIb911rfCbHLpmV+5h+alLSyXnvJvivjFy7OFn8Z4T3qRF1o
	6voNPVShj8/NH6wjQTP8oRClTrytOeOK31zjNoOTeg9snoVcWu0ej+U8jnxA7awQWWJycU
	wnCgP+vSOjrshYqJfvqHDIrgJkiZkiw=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1686811920;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=zqZen88s8MWmx/8fvrvVthlrESLwGBJ7f8nDcLXUK5o=;
	b=XYSCLMvvIYpY1abDPJwY1BVvB0kLilovUxfoLPc64++syDhK3EH9iai/wMST/pejTPM/ey
	PO4OjsIaE5qB8sDg==
Received: from wotan.suse.de (wotan.suse.de [10.160.0.1])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by relay2.suse.de (Postfix) with ESMTPS id DE7F02C141;
	Thu, 15 Jun 2023 06:51:59 +0000 (UTC)
Date: Thu, 15 Jun 2023 06:51:59 +0000 (UTC)
From: Richard Biener <rguenther@suse.de>
To: "Liu, Hongtao" <hongtao.liu@intel.com>
cc: Andrew Stubbs <ams@codesourcery.com>, 
    "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, 
    "richard.sandiford@arm.com" <richard.sandiford@arm.com>, 
    Jan Hubicka <hubicka@ucw.cz>, 
    "kirill.yukhin@gmail.com" <kirill.yukhin@gmail.com>
Subject: RE: [PATCH 3/3] AVX512 fully masked vectorization
In-Reply-To: <SA1PR11MB67573854CBC1B787A4395D53E55BA@SA1PR11MB6757.namprd11.prod.outlook.com>
Message-ID: <nycvar.YFH.7.77.849.2306150646550.4723@jbgna.fhfr.qr>
References: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com> <DA5221DD-8F3C-47A3-821D-8CB09DA5E60B@suse.de> <SA1PR11MB67573854CBC1B787A4395D53E55BA@SA1PR11MB6757.namprd11.prod.outlook.com>
User-Agent: Alpine 2.22 (LSU 394 2020-01-19)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="-1609957120-1158412266-1686811920=:4723"
X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

---1609957120-1158412266-1686811920=:4723
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT

On Thu, 15 Jun 2023, Liu, Hongtao wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Wednesday, June 14, 2023 10:30 PM
> > To: Andrew Stubbs <ams@codesourcery.com>
> > Cc: gcc-patches@gcc.gnu.org; richard.sandiford@arm.com; Jan Hubicka
> > <hubicka@ucw.cz>; Liu, Hongtao <hongtao.liu@intel.com>;
> > kirill.yukhin@gmail.com
> > Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
> > 
> > 
> > 
> > > Am 14.06.2023 um 16:27 schrieb Andrew Stubbs
> > <ams@codesourcery.com>:
> > >
> > > ﻿On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> > >> This implemens fully masked vectorization or a masked epilog for
> > >> AVX512 style masks which single themselves out by representing each
> > >> lane with a single bit and by using integer modes for the mask (both
> > >> is much like GCN).
> > >> AVX512 is also special in that it doesn't have any instruction to
> > >> compute the mask from a scalar IV like SVE has with while_ult.
> > >> Instead the masks are produced by vector compares and the loop
> > >> control retains the scalar IV (mainly to avoid dependences on mask
> > >> generation, a suitable mask test instruction is available).
> > >
> > > This is also sounds like GCN. We currently use WHILE_ULT in the middle end
> > which expands to a vector compare against a vector of stepped values. This
> > requires an additional instruction to prepare the comparison vector
> > (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> > the DImode bitmask, so it works reasonably well.
> > >
> > >> Like RVV code generation prefers a decrementing IV though IVOPTs
> > >> messes things up in some cases removing that IV to eliminate it with
> > >> an incrementing one used for address generation.
> > >> One of the motivating testcases is from PR108410 which in turn is
> > >> extracted from x264 where large size vectorization shows issues with
> > >> small trip loops.  Execution time there improves compared to classic
> > >> AVX512 with AVX2 epilogues for the cases of less than 32 iterations.
> > >> size   scalar     128     256     512    512e    512f
> > >>     1    9.42   11.32    9.35   11.17   15.13   16.89
> > >>     2    5.72    6.53    6.66    6.66    7.62    8.56
> > >>     3    4.49    5.10    5.10    5.74    5.08    5.73
> > >>     4    4.10    4.33    4.29    5.21    3.79    4.25
> > >>     6    3.78    3.85    3.86    4.76    2.54    2.85
> > >>     8    3.64    1.89    3.76    4.50    1.92    2.16
> > >>    12    3.56    2.21    3.75    4.26    1.26    1.42
> > >>    16    3.36    0.83    1.06    4.16    0.95    1.07
> > >>    20    3.39    1.42    1.33    4.07    0.75    0.85
> > >>    24    3.23    0.66    1.72    4.22    0.62    0.70
> > >>    28    3.18    1.09    2.04    4.20    0.54    0.61
> > >>    32    3.16    0.47    0.41    0.41    0.47    0.53
> > >>    34    3.16    0.67    0.61    0.56    0.44    0.50
> > >>    38    3.19    0.95    0.95    0.82    0.40    0.45
> > >>    42    3.09    0.58    1.21    1.13    0.36    0.40
> > >> 'size' specifies the number of actual iterations, 512e is for a
> > >> masked epilog and 512f for the fully masked loop.  From
> > >> 4 scalar iterations on the AVX512 masked epilog code is clearly the
> > >> winner, the fully masked variant is clearly worse and it's size
> > >> benefit is also tiny.
> > >
> > > Let me check I understand correctly. In the fully masked case, there is a
> > single loop in which a new mask is generated at the start of each iteration. In
> > the masked epilogue case, the main loop uses no masking whatsoever, thus
> > avoiding the need for generating a mask, carrying the mask, inserting
> > vec_merge operations, etc, and then the epilogue looks much like the fully
> > masked case, but unlike smaller mode epilogues there is no loop because the
> > eplogue vector size is the same. Is that right?
> > 
> > Yes.
> What about vectorizer and unroll, when vector size is the same, unroll 
> factor is N, but there're at most N - 1 iterations for epilogue loop, 
> will there still a loop?

Yes, it looks like vect_determine_partial_vectors_and_peeling has
a special exception for that, it does

         If we are unrolling we also do not want to use partial vectors. 
This
         is to avoid the overhead of generating multiple masks and also to
         avoid having to execute entire iterations of FALSE masked 
instructions
         when dealing with one or less full iterations.
...
      if ((param_vect_partial_vector_usage == 1
           || loop_vinfo->suggested_unroll_factor > 1)
          && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
          && !vect_known_niters_smaller_than_vf (loop_vinfo))
        LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
      else
        LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;

so when we unroll the main loop we are forcing partial vectors on
the epilog even though that's going to iterate.  The comment suggests
that wasn't the main idea but the idea was to not use partial vectors
for the main loop.

Of course since we can at most have a single vectorized epilog
(but not a vectorized epilog of the epilog) it might still make
sense to use partial vectors there.

> > > This scheme seems like it might also benefit GCN, in so much as it simplifies
> > the hot code path.
> > >
> > > GCN does not actually have smaller vector sizes, so there's no analogue to
> > AVX2 (we pretend we have some smaller sizes, but that's because the
> > middle end can't do masking everywhere yet, and it helps make some vector
> > constants smaller, perhaps).
> > >
> > >> This patch does not enable using fully masked loops or masked
> > >> epilogues by default.  More work on cost modeling and vectorization
> > >> kind selection on x86_64 is necessary for this.
> > >> Implementation wise this introduces
> > LOOP_VINFO_PARTIAL_VECTORS_STYLE
> > >> which could be exploited further to unify some of the flags we have
> > >> right now but there didn't seem to be many easy things to merge, so
> > >> I'm leaving this for followups.
> > >> Mask requirements as registered by vect_record_loop_mask are kept in
> > >> their original form and recorded in a hash_set now instead of being
> > >> processed to a vector of rgroup_controls.  Instead that's now left to
> > >> the final analysis phase which tries forming the rgroup_controls
> > >> vector using while_ult and if that fails now tries AVX512 style which
> > >> needs a different organization and instead fills a hash_map with the
> > >> relevant info.  vect_get_loop_mask now has two implementations, one
> > >> for the two mask styles we then have.
> > >> I have decided against interweaving
> > >> vect_set_loop_condition_partial_vectors
> > >> with conditions to do AVX512 style masking and instead opted to
> > >> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> > >> Likewise for vect_verify_full_masking vs
> > vect_verify_full_masking_avx512.
> > >> I was split between making 'vec_loop_masks' a class with methods,
> > >> possibly merging in the _len stuff into a single registry.  It seemed
> > >> to be too many changes for the purpose of getting AVX512 working.
> > >> I'm going to play wait and see what happens with RISC-V here since
> > >> they are going to get both masks and lengths registered I think.
> > >> The vect_prepare_for_masked_peels hunk might run into issues with
> > >> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> > >> looked odd.
> > >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run the
> > >> testsuite with --param vect-partial-vector-usage=2 with and without
> > >> -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one
> > >> latent wrong-code (PR110237).
> > >> There's followup work to be done to try enabling masked epilogues for
> > >> x86-64 by default (when AVX512 is enabled, possibly only when
> > >> -mprefer-vector-width=512).  Getting cost modeling and decision right
> > >> is going to be challenging.
> > >> Any comments?
> > >> OK?
> > >> Btw, testing on GCN would be welcome - the _avx512 paths could work
> > >> for it so in case the while_ult path fails (not sure if it ever does)
> > >> it could get _avx512 style masking.  Likewise testing on ARM just to
> > >> see I didn't break anything here.
> > >> I don't have SVE hardware so testing is probably meaningless.
> > >
> > > I can set some tests going. Is vect.exp enough?
> > 
> > Well, only you know (from experience), but sure that?s a nice start.
> > 
> > Richard
> > 
> > > Andrew
> > >
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
---1609957120-1158412266-1686811920=:4723--