Date: Thu, 15 Jun 2023 13:34:13 +0000 (UTC)
From: Richard Biener
To: Andrew Stubbs
Cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, Jan Hubicka, hongtao.liu@intel.com, kirill.yukhin@gmail.com
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
On Thu, 15 Jun 2023, Andrew Stubbs wrote:

> On 15/06/2023 12:06, Richard Biener wrote:
> > On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >
> >> On 15/06/2023 10:58, Richard Biener wrote:
> >>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >>>
> >>>> On 14/06/2023 15:29, Richard Biener wrote:
> >>>>>
> >>>>>> On 14.06.2023 at 16:27, Andrew Stubbs wrote:
> >>>>>>
> >>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >>>>>>> This implements fully masked vectorization or a masked epilog for
> >>>>>>> AVX512 style masks, which set themselves apart by representing
> >>>>>>> each lane with a single bit and by using integer modes for the
> >>>>>>> mask (both much like GCN).
> >>>>>>>
> >>>>>>> AVX512 is also special in that it doesn't have any instruction
> >>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
> >>>>>>> Instead the masks are produced by vector compares and the loop
> >>>>>>> control retains the scalar IV (mainly to avoid dependences on
> >>>>>>> mask generation; a suitable mask test instruction is available).
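
To make that scheme concrete -- a minimal sketch in AVX-512 intrinsics,
not code from the patch; the function and all names in it are made up
for the example.  The mask comes from an unsigned vector compare of the
lane indices against the remaining trip count, while loop control stays
on the scalar IV:

  #include <immintrin.h>
  #include <stddef.h>

  void
  add_one (int *a, size_t n)
  {
    /* Lane indices {0, 1, ..., 15} of one 16-lane vector iteration.  */
    const __m512i lane = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    for (size_t i = 0; i < n; i += 16)
      {
        /* Lanes with index < n - i are active in this iteration.  */
        size_t rem = n - i;
        __m512i bound = _mm512_set1_epi32 (rem > 16 ? 16 : (int) rem);
        __mmask16 m = _mm512_cmplt_epu32_mask (lane, bound);
        __m512i v = _mm512_maskz_loadu_epi32 (m, a + i);
        v = _mm512_add_epi32 (v, _mm512_set1_epi32 (1));
        _mm512_mask_storeu_epi32 (a + i, m, v);
      }
  }
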
> >>>>>>
> >>>>>> This also sounds like GCN. We currently use WHILE_ULT in the
> >>>>>> middle end, which expands to a vector compare against a vector of
> >>>>>> stepped values. This requires an additional instruction to prepare
> >>>>>> the comparison vector (compared to SVE), but the "while_ultv64sidi"
> >>>>>> pattern (for example) returns the DImode bitmask, so it works
> >>>>>> reasonably well.
> >>>>>>
> >>>>>>> Like RVV, code generation prefers a decrementing IV, though IVOPTs
> >>>>>>> messes things up in some cases, removing that IV and replacing it
> >>>>>>> with an incrementing one used for address generation.
> >>>>>>>
> >>>>>>> One of the motivating testcases is from PR108410, which in turn
> >>>>>>> is extracted from x264, where large size vectorization shows
> >>>>>>> issues with small trip loops. Execution time there improves
> >>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
> >>>>>>> of less than 32 iterations.
> >>>>>>>
> >>>>>>> size  scalar    128    256    512   512e   512f
> >>>>>>>    1    9.42  11.32   9.35  11.17  15.13  16.89
> >>>>>>>    2    5.72   6.53   6.66   6.66   7.62   8.56
> >>>>>>>    3    4.49   5.10   5.10   5.74   5.08   5.73
> >>>>>>>    4    4.10   4.33   4.29   5.21   3.79   4.25
> >>>>>>>    6    3.78   3.85   3.86   4.76   2.54   2.85
> >>>>>>>    8    3.64   1.89   3.76   4.50   1.92   2.16
> >>>>>>>   12    3.56   2.21   3.75   4.26   1.26   1.42
> >>>>>>>   16    3.36   0.83   1.06   4.16   0.95   1.07
> >>>>>>>   20    3.39   1.42   1.33   4.07   0.75   0.85
> >>>>>>>   24    3.23   0.66   1.72   4.22   0.62   0.70
> >>>>>>>   28    3.18   1.09   2.04   4.20   0.54   0.61
> >>>>>>>   32    3.16   0.47   0.41   0.41   0.47   0.53
> >>>>>>>   34    3.16   0.67   0.61   0.56   0.44   0.50
> >>>>>>>   38    3.19   0.95   0.95   0.82   0.40   0.45
> >>>>>>>   42    3.09   0.58   1.21   1.13   0.36   0.40
> >>>>>>>
> >>>>>>> 'size' specifies the number of actual iterations; 512e is for
> >>>>>>> a masked epilog and 512f for the fully masked loop. From
> >>>>>>> 4 scalar iterations on, the AVX512 masked epilog code is clearly
> >>>>>>> the winner; the fully masked variant is clearly worse and
> >>>>>>> its size benefit is also tiny.
> >>>>>>
> >>>>>> Let me check I understand correctly. In the fully masked case,
> >>>>>> there is a single loop in which a new mask is generated at the
> >>>>>> start of each iteration. In the masked epilogue case, the main
> >>>>>> loop uses no masking whatsoever, thus avoiding the need for
> >>>>>> generating a mask, carrying the mask, inserting vec_merge
> >>>>>> operations, etc, and then the epilogue looks much like the fully
> >>>>>> masked case, but unlike smaller mode epilogues there is no loop
> >>>>>> because the epilogue vector size is the same. Is that right?
> >>>>>
> >>>>> Yes.
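
To make the two shapes concrete, here is a minimal, compilable sketch --
not from the patch; VF, body() and masked_body() are hypothetical
helpers standing in for the vectorized loop bodies:

  #include <stddef.h>

  #define VF 16   /* lanes per vector iteration */

  extern void body (size_t i);                          /* unmasked body */
  extern void masked_body (size_t i, unsigned short m); /* __mmask16-style */

  static unsigned short
  lanes_lt (size_t rem)  /* mask of lanes with index < rem */
  {
    return rem >= VF ? 0xffff : (unsigned short) ((1u << rem) - 1);
  }

  /* Fully masked: a single loop, a fresh mask on every iteration.  */
  void
  fully_masked (size_t n)
  {
    for (size_t i = 0; i < n; i += VF)
      masked_body (i, lanes_lt (n - i));
  }

  /* Masked epilogue: the main loop runs unmasked; the tail is a single
     masked "iteration" at the same vector size, so it never loops.  */
  void
  masked_epilogue (size_t n)
  {
    size_t i;
    for (i = 0; i + VF <= n; i += VF)
      body (i);
    if (i < n)
      masked_body (i, lanes_lt (n - i));
  }
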
> >>>>>>
> >>>>>> This scheme seems like it might also benefit GCN, in so much as
> >>>>>> it simplifies the hot code path.
> >>>>>>
> >>>>>> GCN does not actually have smaller vector sizes, so there's no
> >>>>>> analogue to AVX2 (we pretend we have some smaller sizes, but
> >>>>>> that's because the middle end can't do masking everywhere yet,
> >>>>>> and it helps make some vector constants smaller, perhaps).
> >>>>>>
> >>>>>>> This patch does not enable using fully masked loops or
> >>>>>>> masked epilogues by default. More work on cost modeling
> >>>>>>> and vectorization kind selection on x86_64 is necessary
> >>>>>>> for this.
> >>>>>>>
> >>>>>>> Implementation-wise this introduces
> >>>>>>> LOOP_VINFO_PARTIAL_VECTORS_STYLE, which could be exploited
> >>>>>>> further to unify some of the flags we have right now, but there
> >>>>>>> didn't seem to be many easy things to merge, so I'm leaving this
> >>>>>>> for followups.
> >>>>>>>
> >>>>>>> Mask requirements as registered by vect_record_loop_mask are
> >>>>>>> kept in their original form and recorded in a hash_set now
> >>>>>>> instead of being processed to a vector of rgroup_controls.
> >>>>>>> Instead that's now left to the final analysis phase, which tries
> >>>>>>> forming the rgroup_controls vector using while_ult and, if that
> >>>>>>> fails, now tries the AVX512 style, which needs a different
> >>>>>>> organization and instead fills a hash_map with the relevant
> >>>>>>> info. vect_get_loop_mask now has two implementations, one for
> >>>>>>> each of the two mask styles we then have.
> >>>>>>>
> >>>>>>> I have decided against interweaving
> >>>>>>> vect_set_loop_condition_partial_vectors with conditions to do
> >>>>>>> AVX512 style masking and instead opted to "duplicate" this to
> >>>>>>> vect_set_loop_condition_partial_vectors_avx512. Likewise for
> >>>>>>> vect_verify_full_masking vs vect_verify_full_masking_avx512.
> >>>>>>>
> >>>>>>> I was split on making 'vec_loop_masks' a class with methods,
> >>>>>>> possibly merging the _len stuff into a single registry. It
> >>>>>>> seemed to be too many changes for the purpose of getting AVX512
> >>>>>>> working. I'm going to wait and see what happens with RISC-V
> >>>>>>> here, since they are going to get both masks and lengths
> >>>>>>> registered, I think.
> >>>>>>>
> >>>>>>> The vect_prepare_for_masked_peels hunk might run into issues
> >>>>>>> with SVE; I didn't check yet, but using
> >>>>>>> LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd.
> >>>>>>>
> >>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run
> >>>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
> >>>>>>> without -fno-vect-cost-model and filed two bugs, one ICE
> >>>>>>> (PR110221) and one latent wrong-code (PR110237).
> >>>>>>>
> >>>>>>> There's followup work to be done to try enabling masked
> >>>>>>> epilogues for x86-64 by default (when AVX512 is enabled,
> >>>>>>> possibly only when -mprefer-vector-width=512). Getting cost
> >>>>>>> modeling and decision right is going to be challenging.
> >>>>>>>
> >>>>>>> Any comments? OK?
> >>>>>>>
> >>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
> >>>>>>> work for it, so in case the while_ult path fails (not sure if
> >>>>>>> it ever does) it could get _avx512 style masking. Likewise
> >>>>>>> testing on ARM, just to see I didn't break anything here.
> >>>>>>> I don't have SVE hardware so testing is probably meaningless.
> >>>>>>
> >>>>>> I can set some tests going. Is vect.exp enough?
> >>>>>
> >>>>> Well, only you know (from experience), but sure, that's a nice
> >>>>> start.
> >>>>
> >>>> I tested vect.exp for both gcc and gfortran and there were no
> >>>> regressions. I have another run going with the other param settings.
> >>>>
> >>>> (Side note: vect.exp used to be a nice quick test for use during
> >>>> development, but the tsvc tests are now really slow, at least when
> >>>> run on a single GPU thread.)
> >>>>
> >>>> I tried some small examples with --param vect-partial-vector-usage=1
> >>>> (IIUC this prevents masked loops, but not masked epilogues, right?)
> >>>
> >>> Yes. That should also work with the while_ult style btw.
> >>>
> >>>> and the results look good. I plan to do some benchmarking shortly.
> >>>> One comment: building a vector constant {0, 1, 2, 3, ..., 63}
> >>>> results in a very large entry in the constant pool and an
> >>>> unnecessary memory load (it literally has to use this sequence to
> >>>> generate the addresses to load the constant!). Generating the
> >>>> sequence via VEC_SERIES would be a no-op for GCN, because we have
> >>>> an ABI-mandated register that already holds that value. (Perhaps I
> >>>> have another piece missing here, IDK?)
> >>>
> >>> I failed to special-case the {0, 1, 2, 3, ...} constant because I
> >>> couldn't see how to do a series that creates {0, 0, 1, 1, 2, 2, ...}.
> >>> It might be that the target needs to pattern match these constants
> >>> at RTL expansion time?
> >>>
> >>> Btw, did you disable your while_ult pattern for the experiment?
> >>
> >> I tried it both ways; both appear to work, and the while_ult case
> >> does avoid the constant vector. I also don't seem to need while_ult
> >> for the fully masked case any more (is that new?).
> >
> > Yes, while_ult always compares against {0, 1, 2, 3, ...}, which seems
> > conveniently available, but it has to multiply the IV by the number
> > of scalars per iteration, which has overflow issues it has to
> > compensate for by choosing a wider IV. I'm avoiding that issue
> > (besides for alignment peeling) by instead altering the constant
> > vector to compare against. On x86 the constant vector is always a
> > load, but the multiplication would add to the latency of mask
> > production, which already isn't too great.
>
> Is the multiplication not usually a shift?
>
> > And yes, the alternate scheme doesn't rely on while_ult but instead
> > on vec_cmpu to produce the masks.
> >
> > You might be able to produce the {0, 0, 1, 1, ...} constant
> > by interleaving v1 with itself? Any non-power-of-two duplication
> > looks more difficult though.
>
> I think that would need to use a full permutation, which is probably
> faster than a cold load, but in all these cases the vector that
> defines the permutation looks exactly like the result, so ...
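
For what it's worth, a minimal sketch of that self-interleave idea
using GCC's generic vector extensions (a hypothetical example, not tied
to GCN) -- it does come out as a full permute whose selector looks just
like the desired result:

  typedef int v8si __attribute__ ((vector_size (32)));

  v8si
  dup_series (v8si s)  /* s = {0, 1, 2, 3, 4, 5, 6, 7} */
  {
    /* Interleave the low half of s with itself, giving
       {0, 0, 1, 1, 2, 2, 3, 3}; indices 8-15 select from the second
       (identical) operand.  */
    return __builtin_shufflevector (s, s, 0, 8, 1, 9, 2, 10, 3, 11);
  }
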
> I've been playing with this stuff some more and I find that even
> though GCN supports fully masked loops and uses them when I test
> without offload, it's actually been running in
> param_vect_partial_vector_usage==0 mode for offload, because i386.cc
> has that hardcoded and the offload compiler inherits param settings
> from the host.

Doesn't that mean it will have a scalar epilog and a very large VF for
the main loop due to the large vector size?

> I tried running the Babelstream benchmark with the various settings
> and it's a wash for most of the measurements (memory limited, most
> likely), but the "Dot" benchmark is considerably slower when fully
> masked (about 50%). This probably explains why adding the additional
> "fake" smaller vector sizes was so good for our numbers, but confirms
> that the partial epilogue is a good option.

Ah, the "fake" smaller vector sizes probably then made up for this with
"fixed" size epilogue vectorization? But yes, I think a vectorized
epilog with partial vectors that then does not iterate would get you
the best of both worlds. So param_vect_partial_vector_usage == 1.
Whether with a while_ult optab or the vec_cmpu scheme should then
depend on generated code quality.

Richard.