From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Thu, 15 Jun 2023 18:16:53 +0200
Message-Id: <7B2A55C9-E276-42DA-BA36-F6863C3EF969@suse.de>
In-Reply-To: <7758ea65-3a6b-c49b-c2c1-bfdf9b1bcbee@codesourcery.com>
References: <7758ea65-3a6b-c49b-c2c1-bfdf9b1bcbee@codesourcery.com>
To: Andrew Stubbs
Cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, Jan Hubicka,
 hongtao.liu@intel.com, kirill.yukhin@gmail.com

> On 15/06/2023 16:04, Andrew Stubbs wrote:
>
> On 15/06/2023 15:00, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>> On 15/06/2023 14:34, Richard Biener wrote:
>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>
>>>>> On 15/06/2023 12:06, Richard Biener wrote:
>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>
>>>>>>> On 15/06/2023 10:58, Richard Biener wrote:
>>>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>>>
>>>>>>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 14/06/2023 16:27, Andrew Stubbs wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>>>>>>> This implements fully masked vectorization or a masked
>>>>>>>>>>>> epilog for
>>>>>>>>>>>> AVX512 style masks which single themselves out by representing
>>>>>>>>>>>> each lane with a single bit and by using integer modes for the
>>>>>>>>>>>> mask (both much like GCN).
>>>>>>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>>>>>>> to compute the mask from a scalar IV like SVE has with
>>>>>>>>>>>> while_ult.  Instead the masks are produced by vector compares
>>>>>>>>>>>> and the loop control retains the scalar IV (mainly to avoid
>>>>>>>>>>>> dependences on mask generation; a suitable mask test
>>>>>>>>>>>> instruction is available).
>>>>>>>>>>>
>>>>>>>>>>> This also sounds like GCN. We currently use WHILE_ULT in the
>>>>>>>>>>> middle end, which expands to a vector compare against a vector
>>>>>>>>>>> of stepped values. This requires an additional instruction to
>>>>>>>>>>> prepare the comparison vector (compared to SVE), but the
>>>>>>>>>>> "while_ultv64sidi" pattern (for example) returns the DImode
>>>>>>>>>>> bitmask, so it works reasonably well.
>>>>>>>>>>>
>>>>>>>>>>>> Like RVV, code generation prefers a decrementing IV, though
>>>>>>>>>>>> IVOPTs messes things up in some cases, removing that IV to
>>>>>>>>>>>> eliminate it with an incrementing one used for address
>>>>>>>>>>>> generation.
>>>>>>>>>>>> One of the motivating testcases is from PR108410, which in
>>>>>>>>>>>> turn is extracted from x264, where large size vectorization
>>>>>>>>>>>> shows issues with small trip loops. Execution time there
>>>>>>>>>>>> improves compared to classic AVX512 with AVX2 epilogues for
>>>>>>>>>>>> the cases of less than 32 iterations.
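The WHILE_ULT expansion described above (a lane-wise unsigned compare of a stepped vector against the trip count, yielding an integer-mode bitmask) can be modeled in scalar C. This is purely an illustrative sketch, not GCC code; the function name is made up:

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of a WHILE_ULT-style mask: bit i of the result is set
   iff iv + i < niters, i.e. a lane-wise unsigned compare of the
   stepped vector {iv+0, iv+1, ...} against the trip count, returned
   as an integer-mode bitmask (one bit per lane, AVX512/GCN style).  */
uint64_t
while_ult_mask (uint64_t iv, uint64_t niters, unsigned nlanes)
{
  uint64_t mask = 0;
  for (unsigned i = 0; i < nlanes; i++)
    if (iv + i < niters)
      mask |= (uint64_t) 1 << i;
  return mask;
}
```

For a full vector iteration the result is all-ones; on the final partial iteration only the low bits corresponding to remaining scalar iterations survive.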
>>>>>>>>>>>>
>>>>>>>>>>>>  size  scalar    128    256    512   512e   512f
>>>>>>>>>>>>     1    9.42  11.32   9.35  11.17  15.13  16.89
>>>>>>>>>>>>     2    5.72   6.53   6.66   6.66   7.62   8.56
>>>>>>>>>>>>     3    4.49   5.10   5.10   5.74   5.08   5.73
>>>>>>>>>>>>     4    4.10   4.33   4.29   5.21   3.79   4.25
>>>>>>>>>>>>     6    3.78   3.85   3.86   4.76   2.54   2.85
>>>>>>>>>>>>     8    3.64   1.89   3.76   4.50   1.92   2.16
>>>>>>>>>>>>    12    3.56   2.21   3.75   4.26   1.26   1.42
>>>>>>>>>>>>    16    3.36   0.83   1.06   4.16   0.95   1.07
>>>>>>>>>>>>    20    3.39   1.42   1.33   4.07   0.75   0.85
>>>>>>>>>>>>    24    3.23   0.66   1.72   4.22   0.62   0.70
>>>>>>>>>>>>    28    3.18   1.09   2.04   4.20   0.54   0.61
>>>>>>>>>>>>    32    3.16   0.47   0.41   0.41   0.47   0.53
>>>>>>>>>>>>    34    3.16   0.67   0.61   0.56   0.44   0.50
>>>>>>>>>>>>    38    3.19   0.95   0.95   0.82   0.40   0.45
>>>>>>>>>>>>    42    3.09   0.58   1.21   1.13   0.36   0.40
>>>>>>>>>>>>
>>>>>>>>>>>> 'size' specifies the number of actual iterations; 512e is for
>>>>>>>>>>>> a masked epilog and 512f for the fully masked loop. From
>>>>>>>>>>>> 4 scalar iterations on, the AVX512 masked epilog code is
>>>>>>>>>>>> clearly the winner, the fully masked variant is clearly worse
>>>>>>>>>>>> and its size benefit is also tiny.
>>>>>>>>>>>
>>>>>>>>>>> Let me check I understand correctly. In the fully masked case,
>>>>>>>>>>> there is a single loop in which a new mask is generated at the
>>>>>>>>>>> start of each iteration. In the masked epilogue case, the main
>>>>>>>>>>> loop uses no masking whatsoever, thus avoiding the need for
>>>>>>>>>>> generating a mask, carrying the mask, inserting vec_merge
>>>>>>>>>>> operations, etc., and then the epilogue looks much like the
>>>>>>>>>>> fully masked case, but unlike smaller mode epilogues there is
>>>>>>>>>>> no loop because the epilogue vector size is the same. Is that
>>>>>>>>>>> right?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> This scheme seems like it might also benefit GCN, in so much
>>>>>>>>>>> as it simplifies the hot code path.
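To make the two schemes being compared concrete, here is a rough scalar model (hypothetical code, not from the patch): the fully masked loop pays for a mask on every iteration, while the masked-epilogue variant runs the main loop unmasked and then executes a single masked pass, at the same vector size, for the remainder.

```c
#include <stddef.h>
#include <assert.h>

/* Fully masked: every iteration computes a fresh mask, even when the
   vector is entirely full.  The inner loop models lane predication.  */
void
add_fully_masked (int *a, const int *b, size_t n, size_t vf)
{
  for (size_t iv = 0; iv < n; iv += vf)
    for (size_t l = 0; l < vf; l++)
      if (iv + l < n)          /* mask bit for lane l */
        a[iv + l] += b[iv + l];
}

/* Masked epilogue: unmasked main loop over whole vectors, then one
   masked pass (no loop) at the same vector size for the tail.  */
void
add_masked_epilogue (int *a, const int *b, size_t n, size_t vf)
{
  size_t main_n = n - n % vf;
  for (size_t iv = 0; iv < main_n; iv += vf)
    for (size_t l = 0; l < vf; l++)   /* hot path: no mask at all */
      a[iv + l] += b[iv + l];
  for (size_t l = 0; l < vf; l++)     /* single masked epilogue pass */
    if (main_n + l < n)
      a[main_n + l] += b[main_n + l];
}
```

Both produce identical results; the difference is only where the masking cost is paid.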
>>>>>>>>>>>
>>>>>>>>>>> GCN does not actually have smaller vector sizes, so there's no
>>>>>>>>>>> analogue to AVX2 (we pretend we have some smaller sizes, but
>>>>>>>>>>> that's because the middle end can't do masking everywhere yet,
>>>>>>>>>>> and it helps make some vector constants smaller, perhaps).
>>>>>>>>>>>
>>>>>>>>>>>> This patch does not enable using fully masked loops or
>>>>>>>>>>>> masked epilogues by default. More work on cost modeling
>>>>>>>>>>>> and vectorization kind selection on x86_64 is necessary
>>>>>>>>>>>> for this.
>>>>>>>>>>>> Implementation-wise this introduces
>>>>>>>>>>>> LOOP_VINFO_PARTIAL_VECTORS_STYLE, which could be exploited
>>>>>>>>>>>> further to unify some of the flags we have right now, but
>>>>>>>>>>>> there didn't seem to be many easy things to merge, so I'm
>>>>>>>>>>>> leaving this for followups.
>>>>>>>>>>>> Mask requirements as registered by vect_record_loop_mask are
>>>>>>>>>>>> kept in their original form and recorded in a hash_set now
>>>>>>>>>>>> instead of being processed to a vector of rgroup_controls.
>>>>>>>>>>>> Instead that's now left to the final analysis phase, which
>>>>>>>>>>>> tries forming the rgroup_controls vector using while_ult and,
>>>>>>>>>>>> if that fails, now tries AVX512 style, which needs a
>>>>>>>>>>>> different organization and instead fills a hash_map with the
>>>>>>>>>>>> relevant info. vect_get_loop_mask now has two
>>>>>>>>>>>> implementations, one for each of the two mask styles we then
>>>>>>>>>>>> have.
>>>>>>>>>>>> I have decided against interweaving
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors with conditions to do
>>>>>>>>>>>> AVX512 style masking and instead opted to "duplicate" this to
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors_avx512. Likewise for
>>>>>>>>>>>> vect_verify_full_masking vs vect_verify_full_masking_avx512.
>>>>>>>>>>>> I was split between making 'vec_loop_masks' a class with
>>>>>>>>>>>> methods, possibly merging in the _len stuff into a single
>>>>>>>>>>>> registry. It seemed to be too many changes for the purpose
>>>>>>>>>>>> of getting AVX512 working. I'm going to play wait and see
>>>>>>>>>>>> what happens with RISC-V here since they are going to get
>>>>>>>>>>>> both masks and lengths registered, I think.
>>>>>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues
>>>>>>>>>>>> with SVE; I didn't check yet, but using
>>>>>>>>>>>> LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd.
>>>>>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've
>>>>>>>>>>>> run the testsuite with --param vect-partial-vector-usage=2
>>>>>>>>>>>> with and without -fno-vect-cost-model and filed two bugs,
>>>>>>>>>>>> one ICE (PR110221) and one latent wrong-code (PR110237).
>>>>>>>>>>>> There's followup work to be done to try enabling masked
>>>>>>>>>>>> epilogues for x86-64 by default (when AVX512 is enabled,
>>>>>>>>>>>> possibly only when -mprefer-vector-width=512). Getting cost
>>>>>>>>>>>> modeling and the decision right is going to be challenging.
>>>>>>>>>>>> Any comments?
>>>>>>>>>>>> OK?
>>>>>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths
>>>>>>>>>>>> could work for it, so in case the while_ult path fails (not
>>>>>>>>>>>> sure if it ever does) it could get _avx512 style masking.
>>>>>>>>>>>> Likewise testing on ARM just to see I didn't break anything
>>>>>>>>>>>> here.  I don't have SVE hardware so testing is probably
>>>>>>>>>>>> meaningless.
>>>>>>>>>>>
>>>>>>>>>>> I can set some tests going. Is vect.exp enough?
>>>>>>>>>>
>>>>>>>>>> Well, only you know (from experience), but sure, that's a nice
>>>>>>>>>> start.
>>>>>>>>>
>>>>>>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>>>>>>> regressions. I have another run going with the other param
>>>>>>>>> settings.
>>>>>>>>>
>>>>>>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>>>>>>> development, but the tsvc tests are now really slow, at least
>>>>>>>>> when run on a single GPU thread.)
>>>>>>>>>
>>>>>>>>> I tried some small examples with --param
>>>>>>>>> vect-partial-vector-usage=1 (IIUC this prevents masked loops,
>>>>>>>>> but not masked epilogues, right?)
>>>>>>>>
>>>>>>>> Yes. That should also work with the while_ult style, btw.
>>>>>>>>
>>>>>>>>> and the results look good. I plan to do some benchmarking
>>>>>>>>> shortly. One comment: building a vector constant
>>>>>>>>> {0, 1, 2, 3, ..., 63} results in a very large entry in the
>>>>>>>>> constant pool and an unnecessary memory load (it literally has
>>>>>>>>> to use this sequence to generate the addresses to load the
>>>>>>>>> constant!). Generating the sequence via VEC_SERIES would be a
>>>>>>>>> no-op, for GCN, because we have an ABI-mandated register that
>>>>>>>>> already holds that value. (Perhaps I have another piece missing
>>>>>>>>> here, IDK?)
>>>>>>>>
>>>>>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because
>>>>>>>> I couldn't see how to do a series that creates
>>>>>>>> { 0, 0, 1, 1, 2, 2, ... }. It might be that the target needs to
>>>>>>>> pattern match these constants at RTL expansion time?
>>>>>>>>
>>>>>>>> Btw, did you disable your while_ult pattern for the experiment?
>>>>>>>
>>>>>>> I tried it both ways; both appear to work, and the while_ult case
>>>>>>> does avoid the constant vector. I also don't seem to need
>>>>>>> while_ult for the fully masked case any more (is that new?).
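The { 0, 0, 1, 1, 2, 2, ... } constant discussed above could in principle be formed by interleaving the stepped vector { 0, 1, 2, ... } with itself. A scalar model of that interleave permute (illustrative only; whether a target can do this cheaply is a separate question):

```c
#include <assert.h>

/* Model of interleaving v1 = {0, 1, 2, ...} with itself (an
   interleave-lo/hi style permute) to produce {0, 0, 1, 1, ...}, the
   constant needed when each mask element covers two scalar lanes.  */
void
interleave_self (const int *v1, int *out, unsigned nlanes)
{
  for (unsigned i = 0; i < nlanes; i++)
    {
      out[2 * i] = v1[i];      /* even lanes take element i */
      out[2 * i + 1] = v1[i];  /* odd lanes duplicate it */
    }
}
```

Non-power-of-two duplication factors (three scalars per mask element, say) do not decompose into simple self-interleaves, which matches the observation that those cases look harder.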
>>>>>>
>>>>>> Yes, while_ult always compares to {0, 1, 2, 3, ...}, which seems
>>>>>> conveniently available, but it has to multiply the IV with the
>>>>>> number of scalars per iter, which has overflow issues it has to
>>>>>> compensate for by choosing a wider IV. I'm avoiding that issue
>>>>>> (besides for alignment peeling) by instead altering the constant
>>>>>> vector to compare against. On x86 the constant vector is always a
>>>>>> load, but the multiplication would add to the latency of mask
>>>>>> production, which already isn't too great.
>>>>>
>>>>> Is the multiplication not usually a shift?
>>>>>
>>>>>> And yes, the alternate scheme doesn't rely on while_ult but
>>>>>> instead on vec_cmpu to produce the masks.
>>>>>>
>>>>>> You might be able to produce the { 0, 0, 1, 1, ... } constant by
>>>>>> interleaving v1 with itself? Any non-power-of-two duplication
>>>>>> looks more difficult though.
>>>>>
>>>>> I think that would need to use a full permutation, which is
>>>>> probably faster than a cold load, but in all these cases the vector
>>>>> that defines the permutation looks exactly like the result, so...
>>>>>
>>>>> I've been playing with this stuff some more and I find that even
>>>>> though GCN supports fully masked loops and uses them when I test
>>>>> without offload, it's actually been running in
>>>>> param_vect_partial_vector_usage==0 mode for offload because i386.cc
>>>>> has that hardcoded and the offload compiler inherits param settings
>>>>> from the host.
>>>>
>>>> Doesn't that mean it will have a scalar epilog and a very large VF
>>>> for the main loop due to the large vector size?
>>>>
>>>>> I tried running the Babelstream benchmark with the various settings
>>>>> and it's a wash for most of the measurements (memory limited, most
>>>>> likely), but the "Dot" benchmark is considerably slower when fully
>>>>> masked (about 50%).
>>>>> This probably explains why adding the additional "fake" smaller
>>>>> vector sizes was so good for our numbers, but confirms that the
>>>>> partial epilogue is a good option.
>>>>
>>>> Ah, "fake" smaller vector sizes probably then made up for this with
>>>> "fixed" size epilogue vectorization? But yes, I think a vectorized
>>>> epilog with partial vectors that then does not iterate would get you
>>>> the best of both worlds.
>>>
>>> Yes, it uses V32 for the epilogue, which won't fit every case but is
>>> better than nothing.
>>>
>>>> So param_vect_partial_vector_usage == 1.
>>>
>>> Unfortunately, there doesn't seem to be a way to set this *only* for
>>> the offload compiler. If you could fix it for x86_64 soon then that
>>> would be awesome. :)
>>
>> So some opts tweaking in the GCN option_override hook doesn't work?
>
> I didn't try that, but
> -foffload-options=--param=vect-partial-vector-usage=1 had no effect.

I guess the flag is streamed as a function-specific optimization…

>
>>>> Whether with a while_ult optab or the vec_cmpu scheme should then
>>>> depend on generated code quality.
>>>
>>> Which looks like it depends on these constants alone.
>>
>> Understood, compared to while_ult these now appear explicitly. As
>> Richard said, you likely have to tweak the backend to make use of
>> 'r1'.
>
> Yes, I think I understood what he meant. Should be doable.
>
> Andrew
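For reference, the vec_cmpu-based masking discussed in the thread - keeping the scalar IV and folding the scalars-per-mask-element scaling into the pre-built constant vector instead of multiplying the IV - might be modeled roughly as follows. This is a sketch under my own assumptions, not the GIMPLE the patch actually emits:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch: lane i is active iff cst[i] < remaining, where 'remaining'
   is the number of scalar iterations left and cst[] is a pre-built
   constant such as {0, 0, 1, 1, ...} when one mask element covers two
   scalars.  No IV multiplication is needed, hence no overflow to
   compensate for with a wider IV.  */
uint64_t
vec_cmpu_mask (const uint64_t *cst, unsigned nlanes, uint64_t remaining)
{
  uint64_t mask = 0;
  for (unsigned i = 0; i < nlanes; i++)
    if (cst[i] < remaining)     /* unsigned lane-wise compare */
      mask |= (uint64_t) 1 << i;
  return mask;
}
```

With cst = {0, 0, 1, 1} and one scalar iteration remaining, only the first pair of lanes stays active; with two remaining, all four do.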