From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=9AtS=CC=suse.de=rguenther@sourceware.org>
Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d])
	by sourceware.org (Postfix) with ESMTPS id 193A33858288
	for <gcc-patches@gcc.gnu.org>; Wed, 14 Jun 2023 14:30:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 193A33858288
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id 2F2FF1FE08;
	Wed, 14 Jun 2023 14:30:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1686753005; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=TaPCHDkD4vOe0nIVU8aCwEiNa1pd6nfgG4IBrVMpZH4=;
	b=BVRsUVbSMjeIAh9pYHghQ24PoJzkKR2kkINwBXIDv2jWYpv3Ys3zyc9ITnigQE2cBd5plu
	Dl/7GwSmuUlAXEYDTuQgsi/rzmUntrafOnUwRSEGNHHdHak819UaGyle1KZTyMn5I/g0H0
	fpOhHB2AoF8J0OtHFmDMvgU6VH4y3pc=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1686753005;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=TaPCHDkD4vOe0nIVU8aCwEiNa1pd6nfgG4IBrVMpZH4=;
	b=LcmLwHHoUxCCLbtZkzB8S6bzy4BwVKWzQ1xGH/4PMoDVuWrfLnKa7UfemfMeaiQsxh6uMo
	1Aq4p/Eo3IGr8dCg==
Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512)
	(No client certificate requested)
	by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 182791391E;
	Wed, 14 Jun 2023 14:30:05 +0000 (UTC)
Received: from dovecot-director2.suse.de ([192.168.254.65])
	by imap2.suse-dmz.suse.de with ESMTPSA
	id CZbYBe3OiWSsfQAAMHmgww
	(envelope-from <rguenther@suse.de>); Wed, 14 Jun 2023 14:30:05 +0000
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
From: Richard Biener <rguenther@suse.de>
Mime-Version: 1.0 (1.0)
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Wed, 14 Jun 2023 16:29:54 +0200
Message-Id: <DA5221DD-8F3C-47A3-821D-8CB09DA5E60B@suse.de>
References: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com>
Cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com,
 Jan Hubicka <hubicka@ucw.cz>, hongtao.liu@intel.com, kirill.yukhin@gmail.com
In-Reply-To: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com>
To: Andrew Stubbs <ams@codesourcery.com>
X-Mailer: iPhone Mail (20F66)
X-Spam-Status: No, score=-5.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>


> On 14.06.2023 at 16:27, Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>> This implements fully masked vectorization or a masked epilog for
>> AVX512 style masks, which single themselves out by representing
>> each lane with a single bit and by using integer modes for the mask
>> (both are much like GCN).
>> AVX512 is also special in that it doesn't have any instruction
>> to compute the mask from a scalar IV like SVE has with while_ult.
>> Instead the masks are produced by vector compares and the loop
>> control retains the scalar IV (mainly to avoid dependences on
>> mask generation, a suitable mask test instruction is available).
>
> This also sounds like GCN. We currently use WHILE_ULT in the middle end,
> which expands to a vector compare against a vector of stepped values. This
> requires an additional instruction to prepare the comparison vector
> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> the DImode bitmask, so it works reasonably well.
>
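
For illustration only, a scalar C sketch of the mask computation described
above: a compare of a vector of stepped values against the loop limit,
producing one bit per lane in an integer. The function name
while_ult_mask and its parameters are hypothetical and appear in neither
the patch nor the GCN backend; the real pattern is a single vector
compare, not a loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar emulation of a while_ult-style mask.
   Bit i of the result is set when (iv + i) < limit, i.e. the compare of
   the stepped vector {iv, iv+1, ..., iv+lanes-1} against limit.
   Wrap-around of iv + i is ignored in this sketch.  */
static uint64_t
while_ult_mask (uint64_t iv, uint64_t limit, unsigned lanes)
{
  /* The mask lives in a 64-bit integer, like a DImode bitmask.  */
  assert (lanes >= 1 && lanes <= 64);

  uint64_t mask = 0;
  for (unsigned lane = 0; lane < lanes; lane++)
    {
      uint64_t stepped = iv + lane;   /* element of the stepped vector */
      if (stepped < limit)
        mask |= UINT64_C (1) << lane; /* one bit per active lane */
    }
  return mask;
}
```

With 64 lanes and a limit of 5, only the low five bits are set for the
first iteration (iv = 0).
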
>> Like RVV, code generation prefers a decrementing IV, though IVOPTs
>> messes things up in some cases, removing that IV and replacing
>> it with the incrementing one used for address generation.
>> One of the motivating testcases is from PR108410 which in turn
>> is extracted from x264 where large size vectorization shows
>> issues with small trip loops.  Execution time there improves
>> compared to classic AVX512 with AVX2 epilogues for the cases
>> of less than 32 iterations.
>> size   scalar     128     256     512    512e    512f
>>     1    9.42   11.32    9.35   11.17   15.13   16.89
>>     2    5.72    6.53    6.66    6.66    7.62    8.56
>>     3    4.49    5.10    5.10    5.74    5.08    5.73
>>     4    4.10    4.33    4.29    5.21    3.79    4.25
>>     6    3.78    3.85    3.86    4.76    2.54    2.85
>>     8    3.64    1.89    3.76    4.50    1.92    2.16
>>    12    3.56    2.21    3.75    4.26    1.26    1.42
>>    16    3.36    0.83    1.06    4.16    0.95    1.07
>>    20    3.39    1.42    1.33    4.07    0.75    0.85
>>    24    3.23    0.66    1.72    4.22    0.62    0.70
>>    28    3.18    1.09    2.04    4.20    0.54    0.61
>>    32    3.16    0.47    0.41    0.41    0.47    0.53
>>    34    3.16    0.67    0.61    0.56    0.44    0.50
>>    38    3.19    0.95    0.95    0.82    0.40    0.45
>>    42    3.09    0.58    1.21    1.13    0.36    0.40
>> 'size' specifies the number of actual iterations, 512e is for
>> a masked epilog and 512f for the fully masked loop.  From
>> 4 scalar iterations on, the AVX512 masked epilog code is clearly
>> the winner; the fully masked variant is clearly worse and
>> its size benefit is also tiny.
>
> Let me check I understand correctly. In the fully masked case, there is a
> single loop in which a new mask is generated at the start of each iteration.
> In the masked epilogue case, the main loop uses no masking whatsoever, thus
> avoiding the need for generating a mask, carrying the mask, inserting
> vec_merge operations, etc., and then the epilogue looks much like the fully
> masked case, but unlike smaller mode epilogues there is no loop because the
> epilogue vector size is the same. Is that right?

Yes.
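
As a rough illustration of the two loop shapes, here is a scalar C sketch
that emulates 16-lane vectors with a 16-bit mask. It is not the code the
vectorizer emits; VF, lane_mask, masked_add, add_fully_masked and
add_masked_epilogue are all made-up names, and the return values merely
count how many masks each variant has to generate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VF = 16 };  /* assumed lanes per vector (16 x 32-bit) */

/* One bit per lane: bit i is set when element iv + i is below n.  */
static uint16_t
lane_mask (size_t iv, size_t n)
{
  uint16_t mask = 0;
  for (unsigned lane = 0; lane < VF; lane++)
    if (iv + lane < n)
      mask |= (uint16_t) (1u << lane);
  return mask;
}

/* Emulates one masked vector add: only active lanes are touched.  */
static void
masked_add (int32_t *dst, const int32_t *src, uint16_t mask)
{
  for (unsigned lane = 0; lane < VF; lane++)
    if (mask & (1u << lane))
      dst[lane] += src[lane];
}

/* Fully masked: a single loop, a new mask in every iteration.  */
static unsigned
add_fully_masked (int32_t *dst, const int32_t *src, size_t n)
{
  unsigned masks_generated = 0;
  for (size_t iv = 0; iv < n; iv += VF)
    {
      uint16_t mask = lane_mask (iv, n);
      masks_generated++;
      masked_add (dst + iv, src + iv, mask);
    }
  return masks_generated;
}

/* Masked epilogue: unmasked main loop over full vectors, then at most
   one masked vector operation for the remainder.  */
static unsigned
add_masked_epilogue (int32_t *dst, const int32_t *src, size_t n)
{
  unsigned masks_generated = 0;
  size_t iv = 0;
  for (; iv + VF <= n; iv += VF)
    for (unsigned lane = 0; lane < VF; lane++)
      dst[iv + lane] += src[iv + lane];   /* no mask in the hot loop */

  /* The remainder always fits one vector, so no epilogue loop.  */
  assert (n - iv < VF);
  if (iv < n)
    {
      masked_add (dst + iv, src + iv, lane_mask (iv, n));
      masks_generated++;
    }
  return masks_generated;
}
```

For 42 elements the fully masked variant builds three masks, the masked
epilogue variant builds one, and both compute the same result.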

> This scheme seems like it might also benefit GCN, in so much as it
> simplifies the hot code path.
>
> GCN does not actually have smaller vector sizes, so there's no analogue to
> AVX2 (we pretend we have some smaller sizes, but that's because the middle
> end can't do masking everywhere yet, and it helps make some vector constants
> smaller, perhaps).
>
>> This patch does not enable using fully masked loops or
>> masked epilogues by default.  More work on cost modeling
>> and vectorization kind selection on x86_64 is necessary
>> for this.
>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>> which could be exploited further to unify some of the flags
>> we have right now but there didn't seem to be many easy things
>> to merge, so I'm leaving this for followups.
>> Mask requirements as registered by vect_record_loop_mask are kept in their
>> original form and recorded in a hash_set now instead of being
>> processed to a vector of rgroup_controls.  Instead that's now
>> left to the final analysis phase which tries forming the rgroup_controls
>> vector using while_ult and if that fails now tries AVX512 style
>> which needs a different organization and instead fills a hash_map
>> with the relevant info.  vect_get_loop_mask now has two implementations,
>> one for each of the two mask styles we then have.
>> I have decided against interweaving vect_set_loop_condition_partial_vectors
>> with conditions to do AVX512 style masking and instead opted to
>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>> I was split between making 'vec_loop_masks' a class with methods,
>> possibly merging in the _len stuff into a single registry.  It
>> seemed to be too many changes for the purpose of getting AVX512
>> working.  I'm going to play wait and see what happens with RISC-V
>> here since they are going to get both masks and lengths registered
>> I think.
>> The vect_prepare_for_masked_peels hunk might run into issues with
>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>> looked odd.
>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>> the testsuite with --param vect-partial-vector-usage=2 with and
>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>> and one latent wrong-code (PR110237).
>> There's followup work to be done to try enabling masked epilogues
>> for x86-64 by default (when AVX512 is enabled, possibly only when
>> -mprefer-vector-width=512).  Getting cost modeling and decision
>> right is going to be challenging.
>> Any comments?
>> OK?
>> Btw, testing on GCN would be welcome - the _avx512 paths could
>> work for it so in case the while_ult path fails (not sure if
>> it ever does) it could get _avx512 style masking.  Likewise
>> testing on ARM just to see I didn't break anything here.
>> I don't have SVE hardware so testing is probably meaningless.
>
> I can set some tests going. Is vect.exp enough?

Well, only you know (from experience), but sure that’s a nice start.

Richard

> Andrew
>