From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=reDU=CD=mentor.com=Andrew_Stubbs@sourceware.org>
Received: from esa2.mentor.iphmx.com (esa2.mentor.iphmx.com [68.232.141.98])
	by sourceware.org (Postfix) with ESMTPS id A16703858D3C
	for <gcc-patches@gcc.gnu.org>; Thu, 15 Jun 2023 10:13:26 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A16703858D3C
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="6.00,244,1681200000"; 
   d="scan'208";a="9925003"
Received: from orw-gwy-02-in.mentorg.com ([192.94.38.167])
  by esa2.mentor.iphmx.com with ESMTP; 15 Jun 2023 02:13:25 -0800
IronPort-SDR: zbTOfctMUokojjlazYuGq1fBTXN1MN//wc0gkmffa6blDpdfLa/MGOWGT6SaCkl/G5b/ZjNeP0
 jWNjluPs/KCdIMfeomfUN/ruYNzI9EptL6xMT6nNvZet2MRneROZsiwE98VRNYq/0qLMRZg+Gv
 Cp0pK2ZxsROWEnmzCawTkNmTK2KFXFH7K5euwHkH7Mie0fA3V9Qb3eHkxo5CqG03WvLG2Ftx7X
 GNsQww/RxH+KVDG3oadhVpOCZbvSys7ZoemVz0lFE7N+eouyDl7ZV/+MBjGsTnYKgMeFidT4VG
 BTQ=
Message-ID: <d33d0c45-12f1-75dc-d162-8d2da3b0a810@codesourcery.com>
Date: Thu, 15 Jun 2023 11:13:20 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Firefox/102.0 Thunderbird/102.12.0
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Content-Language: en-GB
To: Richard Biener <rguenther@suse.de>
CC: <gcc-patches@gcc.gnu.org>, <richard.sandiford@arm.com>, Jan Hubicka
	<hubicka@ucw.cz>, <hongtao.liu@intel.com>, <kirill.yukhin@gmail.com>
References: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com>
 <DA5221DD-8F3C-47A3-821D-8CB09DA5E60B@suse.de>
 <378905b9-2383-d564-1c91-2c6b1e06629d@codesourcery.com>
 <nycvar.YFH.7.77.849.2306150956130.4723@jbgna.fhfr.qr>
From: Andrew Stubbs <ams@codesourcery.com>
In-Reply-To: <nycvar.YFH.7.77.849.2306150956130.4723@jbgna.fhfr.qr>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-10.mgc.mentorg.com (139.181.222.10) To
 svr-ies-mbx-11.mgc.mentorg.com (139.181.222.11)
X-Spam-Status: No, score=-6.1 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On 15/06/2023 10:58, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> 
>> On 14/06/2023 15:29, Richard Biener wrote:
>>>
>>>
>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
>>>>
>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>> This implements fully masked vectorization or a masked epilog for
>>>>> AVX512 style masks which single themselves out by representing
>>>>> each lane with a single bit and by using integer modes for the mask
>>>>> (both are much like GCN).
>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>>>> Instead the masks are produced by vector compares and the loop
>>>>> control retains the scalar IV (mainly to avoid dependences on
>>>>> mask generation, a suitable mask test instruction is available).
>>>>
>>>> This also sounds like GCN. We currently use WHILE_ULT in the middle end
>>>> which expands to a vector compare against a vector of stepped values. This
>>>> requires an additional instruction to prepare the comparison vector
>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
>>>> the DImode bitmask, so it works reasonably well.
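A minimal sketch, in portable C, of the mask computation described here: lane i is active when iv + i < niters, which is the compare of a vector of stepped values against the loop bound. It assumes 64 lanes and a 64-bit integer mask (one bit per lane). The name `while_ult_mask` and its parameters are illustrative only; this is not the GCC pattern or its expansion.

```c
#include <stdint.h>

/* Hypothetical helper: emulate a WHILE_ULT-style mask for 64 lanes.
   Bit i of the result is set when lane i is still inside the loop
   bounds, i.e. when iv + i < niters.  Overflow of iv + lane is
   ignored in this sketch.  */
static uint64_t
while_ult_mask (uint64_t iv, uint64_t niters)
{
  uint64_t mask = 0;
  for (unsigned lane = 0; lane < 64; lane++)
    /* Compare the stepped vector {iv, iv+1, ..., iv+63} against niters.  */
    if (iv + lane < niters)
      mask |= (uint64_t) 1 << lane;
  return mask;
}
```

With iv = 64 and niters = 70, only the low six lanes remain active, so the mask is 0x3f.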
>>>>
>>>>> Like RVV, code generation prefers a decrementing IV, though IVOPTs
>>>>> messes things up in some cases, removing that IV and replacing
>>>>> it with an incrementing one used for address generation.
>>>>> One of the motivating testcases is from PR108410 which in turn
>>>>> is extracted from x264 where large size vectorization shows
>>>>> issues with small trip loops.  Execution time there improves
>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
>>>>> of less than 32 iterations.
>>>>>    size  scalar     128     256     512    512e    512f
>>>>>       1    9.42   11.32    9.35   11.17   15.13   16.89
>>>>>       2    5.72    6.53    6.66    6.66    7.62    8.56
>>>>>       3    4.49    5.10    5.10    5.74    5.08    5.73
>>>>>       4    4.10    4.33    4.29    5.21    3.79    4.25
>>>>>       6    3.78    3.85    3.86    4.76    2.54    2.85
>>>>>       8    3.64    1.89    3.76    4.50    1.92    2.16
>>>>>      12    3.56    2.21    3.75    4.26    1.26    1.42
>>>>>      16    3.36    0.83    1.06    4.16    0.95    1.07
>>>>>      20    3.39    1.42    1.33    4.07    0.75    0.85
>>>>>      24    3.23    0.66    1.72    4.22    0.62    0.70
>>>>>      28    3.18    1.09    2.04    4.20    0.54    0.61
>>>>>      32    3.16    0.47    0.41    0.41    0.47    0.53
>>>>>      34    3.16    0.67    0.61    0.56    0.44    0.50
>>>>>      38    3.19    0.95    0.95    0.82    0.40    0.45
>>>>>      42    3.09    0.58    1.21    1.13    0.36    0.40
>>>>> 'size' specifies the number of actual iterations, 512e is for
>>>>> a masked epilog and 512f for the fully masked loop.  From
>>>>> 4 scalar iterations on, the AVX512 masked epilog code is clearly
>>>>> the winner; the fully masked variant is clearly worse and
>>>>> its size benefit is also tiny.
>>>>
>>>> Let me check I understand correctly. In the fully masked case, there is a
>>>> single loop in which a new mask is generated at the start of each
>>>> iteration. In the masked epilogue case, the main loop uses no masking
>>>> whatsoever, thus avoiding the need for generating a mask, carrying the
>>>> mask, inserting vec_merge operations, etc, and then the epilogue looks much
>>>> like the fully masked case, but unlike smaller mode epilogues there is no
>>>> loop because the epilogue vector size is the same. Is that right?
>>>
>>> Yes.
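The two loop shapes confirmed here can be sketched in plain C. This is only a scalar model under stated assumptions: `VF` stands in for the vectorization factor, the inner lane loops stand in for single vector instructions, and all function names are hypothetical. It does not reflect what the vectorizer actually emits.

```c
#include <stddef.h>
#include <stdint.h>

#define VF 16  /* assumed vectorization factor (lanes per vector) */

/* Model of one masked vector operation on VF lanes: only lanes whose
   mask bit is set are read and written.  */
static void
vec_add_masked (int *dst, const int *src, uint32_t mask)
{
  for (unsigned lane = 0; lane < VF; lane++)
    if (mask & (1u << lane))
      dst[lane] += src[lane];
}

/* Fully masked: a single loop, with a new mask computed at the start
   of every iteration.  */
static void
add_fully_masked (int *dst, const int *src, size_t n)
{
  for (size_t i = 0; i < n; i += VF)
    {
      size_t left = n - i;
      uint32_t mask = left >= VF ? (1u << VF) - 1 : (1u << left) - 1;
      vec_add_masked (dst + i, src + i, mask);
    }
}

/* Masked epilogue: the main loop uses no mask at all; the remainder is
   handled by at most one masked operation at the same vector size, so
   the epilogue needs no loop.  */
static void
add_masked_epilogue (int *dst, const int *src, size_t n)
{
  size_t i = 0;
  for (; i + VF <= n; i += VF)
    for (unsigned lane = 0; lane < VF; lane++)  /* unmasked vector op */
      dst[i + lane] += src[i + lane];
  if (i < n)
    vec_add_masked (dst + i, src + i, (1u << (n - i)) - 1);
}
```

Both functions compute the same result; the difference is that the hot loop of the second carries no mask generation or mask dependence.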
>>>
>>>> This scheme seems like it might also benefit GCN, in so much as it
>>>> simplifies the hot code path.
>>>>
>>>> GCN does not actually have smaller vector sizes, so there's no analogue to
>>>> AVX2 (we pretend we have some smaller sizes, but that's because the middle
>>>> end can't do masking everywhere yet, and it helps make some vector
>>>> constants smaller, perhaps).
>>>>
>>>>> This patch does not enable using fully masked loops or
>>>>> masked epilogues by default.  More work on cost modeling
>>>>> and vectorization kind selection on x86_64 is necessary
>>>>> for this.
>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>>>>> which could be exploited further to unify some of the flags
>>>>> we have right now but there didn't seem to be many easy things
>>>>> to merge, so I'm leaving this for followups.
>>>>> Mask requirements as registered by vect_record_loop_mask are kept in their
>>>>> original form and recorded in a hash_set now instead of being
>>>>> processed to a vector of rgroup_controls.  Instead that's now
>>>>> left to the final analysis phase which tries forming the rgroup_controls
>>>>> vector using while_ult and if that fails now tries AVX512 style
>>>>> which needs a different organization and instead fills a hash_map
>>>>> with the relevant info.  vect_get_loop_mask now has two implementations,
>>>>> one for each of the two mask styles we then have.
>>>>> I have decided against interweaving
>>>>> vect_set_loop_condition_partial_vectors
>>>>> with conditions to do AVX512 style masking and instead opted to
>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>>>>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>>>>> I was split between making 'vec_loop_masks' a class with methods,
>>>>> possibly merging in the _len stuff into a single registry.  It
>>>>> seemed to be too many changes for the purpose of getting AVX512
>>>>> working.  I'm going to play wait and see what happens with RISC-V
>>>>> here since they are going to get both masks and lengths registered
>>>>> I think.
>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>>>>> looked odd.
>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>>>>> and one latent wrong-code (PR110237).
>>>>> There's followup work to be done to try enabling masked epilogues
>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
>>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
>>>>> right is going to be challenging.
>>>>> Any comments?
>>>>> OK?
>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>> work for it so in case the while_ult path fails (not sure if
>>>>> it ever does) it could get _avx512 style masking.  Likewise
>>>>> testing on ARM just to see I didn't break anything here.
>>>>> I don't have SVE hardware so testing is probably meaningless.
>>>>
>>>> I can set some tests going. Is vect.exp enough?
>>>
>>> Well, only you know (from experience), but sure that's a nice start.
>>
>> I tested vect.exp for both gcc and gfortran and there were no regressions. I
>> have another run going with the other param settings.
>>
>> (Side note: vect.exp used to be a nice quick test for use during development,
>> but the tsvc tests are now really slow, at least when run on a single GPU
>> thread.)
>>
>> I tried some small examples with --param vect-partial-vector-usage=1 (IIUC
>> this prevents masked loops, but not masked epilogues, right?)
> 
> Yes.  That should also work with the while_ult style btw.
> 
>> and the results
>> look good. I plan to do some benchmarking shortly. One comment: building a
>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the
>> constant pool and an unnecessary memory load (it literally has to use this
>> sequence to generate the addresses to load the constant!) Generating the
>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
>> ABI-mandated register that already holds that value. (Perhaps I have another
>> piece missing here, IDK?)
> 
> I failed to special-case the {0, 1, 2, 3, ... } constant because I
> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
> It might be that the target needs to pattern match these constants
> at RTL expansion time?
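For illustration only, the repeated-index constant mentioned here can be written as a plain lane-index series shifted right, so that each value appears 2^k times. The sketch below is ordinary C showing that arithmetic relationship. Whether the middle end or a target could express the constant this way is an assumption and is not established in this thread; `repeated_series` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: fill OUT with { 0, 0, 1, 1, 2, 2, ... } style
   values.  Each lane holds its own index (the series 0, 1, 2, ...)
   shifted right by REPEAT_LOG2, so every value is repeated
   1 << REPEAT_LOG2 times.  */
static void
repeated_series (uint32_t *out, unsigned nlanes, unsigned repeat_log2)
{
  for (unsigned lane = 0; lane < nlanes; lane++)
    out[lane] = lane >> repeat_log2;
}
```

With repeat_log2 = 0 this degenerates to the plain {0, 1, 2, 3, ...} series.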
> 
> Btw, did you disable your while_ult pattern for the experiment?

I tried it both ways; both appear to work, and the while_ult case does 
avoid the constant vector. I also don't seem to need while_ult for the 
fully masked case any more (is that new?).

Andrew