From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=VXlO=CC=mentor.com=Andrew_Stubbs@sourceware.org>
Received: from esa4.mentor.iphmx.com (esa4.mentor.iphmx.com [68.232.137.252])
	by sourceware.org (Postfix) with ESMTPS id 522FB3858C5E
	for <gcc-patches@gcc.gnu.org>; Wed, 14 Jun 2023 14:27:04 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 522FB3858C5E
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="6.00,242,1681200000"; 
   d="scan'208";a="8802300"
Received: from orw-gwy-02-in.mentorg.com ([192.94.38.167])
  by esa4.mentor.iphmx.com with ESMTP; 14 Jun 2023 06:27:03 -0800
IronPort-SDR: DHx/MiKkabU6Dj8KM5YpU2mTIXFY4BcPytMu0qXOBiJJyCSsIuMgdFDkuRY8OLfCD4MMlRbXyp
 Jd7x8uMyXrzSnf8+SOXx07CG1Ke0t4aYxy9rdPIo0CQ+tPKacF8GrKFnIpoHHjzMWOCyqcyvvj
 xup0EvQUSCNKb+Y6Kb812vbz9EcDsLwAc5pIsJdst2rxd9OWaZ/JMLEJiABDjJpgRUKfnUnWau
 GchT+SIi85sUgA1h2gLqi3L2FkdF7cnsBPk88HIjMj/uqtucLMjoJIr8sI82mEWlveHqI2QWh1
 k5w=
Message-ID: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com>
Date: Wed, 14 Jun 2023 15:26:57 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Firefox/102.0 Thunderbird/102.12.0
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
To: Richard Biener <rguenther@suse.de>, <gcc-patches@gcc.gnu.org>
CC: <richard.sandiford@arm.com>, Jan Hubicka <hubicka@ucw.cz>,
	<hongtao.liu@intel.com>, <kirill.yukhin@gmail.com>
References: <20230614115450.28CEA3858288@sourceware.org>
Content-Language: en-GB
From: Andrew Stubbs <ams@codesourcery.com>
In-Reply-To: <20230614115450.28CEA3858288@sourceware.org>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-10.mgc.mentorg.com (139.181.222.10) To
 svr-ies-mbx-11.mgc.mentorg.com (139.181.222.11)
X-Spam-Status: No, score=-6.1 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> This implemens fully masked vectorization or a masked epilog for
> AVX512 style masks which single themselves out by representing
> each lane with a single bit and by using integer modes for the mask
> (both is much like GCN).
> 
> AVX512 is also special in that it doesn't have any instruction
> to compute the mask from a scalar IV like SVE has with while_ult.
> Instead the masks are produced by vector compares and the loop
> control retains the scalar IV (mainly to avoid dependences on
> mask generation, a suitable mask test instruction is available).

This is also sounds like GCN. We currently use WHILE_ULT in the middle 
end which expands to a vector compare against a vector of stepped 
values. This requires an additional instruction to prepare the 
comparison vector (compared to SVE), but the "while_ultv64sidi" pattern 
(for example) returns the DImode bitmask, so it works reasonably well.

> Like RVV code generation prefers a decrementing IV though IVOPTs
> messes things up in some cases removing that IV to eliminate
> it with an incrementing one used for address generation.
> 
> One of the motivating testcases is from PR108410 which in turn
> is extracted from x264 where large size vectorization shows
> issues with small trip loops.  Execution time there improves
> compared to classic AVX512 with AVX2 epilogues for the cases
> of less than 32 iterations.
> 
> size   scalar     128     256     512    512e    512f
>      1    9.42   11.32    9.35   11.17   15.13   16.89
>      2    5.72    6.53    6.66    6.66    7.62    8.56
>      3    4.49    5.10    5.10    5.74    5.08    5.73
>      4    4.10    4.33    4.29    5.21    3.79    4.25
>      6    3.78    3.85    3.86    4.76    2.54    2.85
>      8    3.64    1.89    3.76    4.50    1.92    2.16
>     12    3.56    2.21    3.75    4.26    1.26    1.42
>     16    3.36    0.83    1.06    4.16    0.95    1.07
>     20    3.39    1.42    1.33    4.07    0.75    0.85
>     24    3.23    0.66    1.72    4.22    0.62    0.70
>     28    3.18    1.09    2.04    4.20    0.54    0.61
>     32    3.16    0.47    0.41    0.41    0.47    0.53
>     34    3.16    0.67    0.61    0.56    0.44    0.50
>     38    3.19    0.95    0.95    0.82    0.40    0.45
>     42    3.09    0.58    1.21    1.13    0.36    0.40
> 
> 'size' specifies the number of actual iterations, 512e is for
> a masked epilog and 512f for the fully masked loop.  From
> 4 scalar iterations on the AVX512 masked epilog code is clearly
> the winner, the fully masked variant is clearly worse and
> it's size benefit is also tiny.

Let me check I understand correctly. In the fully masked case, there is 
a single loop in which a new mask is generated at the start of each 
iteration. In the masked epilogue case, the main loop uses no masking 
whatsoever, thus avoiding the need for generating a mask, carrying the 
mask, inserting vec_merge operations, etc, and then the epilogue looks 
much like the fully masked case, but unlike smaller mode epilogues there 
is no loop because the eplogue vector size is the same. Is that right?

This scheme seems like it might also benefit GCN, in so much as it 
simplifies the hot code path.

GCN does not actually have smaller vector sizes, so there's no analogue 
to AVX2 (we pretend we have some smaller sizes, but that's because the 
middle end can't do masking everywhere yet, and it helps make some 
vector constants smaller, perhaps).

> This patch does not enable using fully masked loops or
> masked epilogues by default.  More work on cost modeling
> and vectorization kind selection on x86_64 is necessary
> for this.
> 
> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> which could be exploited further to unify some of the flags
> we have right now but there didn't seem to be many easy things
> to merge, so I'm leaving this for followups.
> 
> Mask requirements as registered by vect_record_loop_mask are kept in their
> original form and recorded in a hash_set now instead of being
> processed to a vector of rgroup_controls.  Instead that's now
> left to the final analysis phase which tries forming the rgroup_controls
> vector using while_ult and if that fails now tries AVX512 style
> which needs a different organization and instead fills a hash_map
> with the relevant info.  vect_get_loop_mask now has two implementations,
> one for the two mask styles we then have.
> 
> I have decided against interweaving vect_set_loop_condition_partial_vectors
> with conditions to do AVX512 style masking and instead opted to
> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> 
> I was split between making 'vec_loop_masks' a class with methods,
> possibly merging in the _len stuff into a single registry.  It
> seemed to be too many changes for the purpose of getting AVX512
> working.  I'm going to play wait and see what happens with RISC-V
> here since they are going to get both masks and lengths registered
> I think.
> 
> The vect_prepare_for_masked_peels hunk might run into issues with
> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> looked odd.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> the testsuite with --param vect-partial-vector-usage=2 with and
> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> and one latent wrong-code (PR110237).
> 
> There's followup work to be done to try enabling masked epilogues
> for x86-64 by default (when AVX512 is enabled, possibly only when
> -mprefer-vector-width=512).  Getting cost modeling and decision
> right is going to be challenging.
> 
> Any comments?
> 
> OK?
> 
> Btw, testing on GCN would be welcome - the _avx512 paths could
> work for it so in case the while_ult path fails (not sure if
> it ever does) it could get _avx512 style masking.  Likewise
> testing on ARM just to see I didn't break anything here.
> I don't have SVE hardware so testing is probably meaningless.

I can set some tests going. Is vect.exp enough?

Andrew