From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <andre.simoesdiasvieira@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id 243593861036
 for <gcc-patches@gcc.gnu.org>; Wed,  5 May 2021 16:58:34 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 243593861036
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 6619A1FB;
 Wed,  5 May 2021 09:58:33 -0700 (PDT)
Received: from [10.57.1.74] (unknown [10.57.1.74])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 7B7493F70D;
 Wed,  5 May 2021 09:58:32 -0700 (PDT)
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue
 vectorization
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
 Richard Sandiford <richard.sandiford@arm.com>
References: <ea077462-00c3-f2dc-e1ec-1f95130e918c@arm.com>
 <nycvar.YFH.7.76.2105041149180.9200@zhemvz.fhfr.qr>
 <3a5de6dc-d5ec-7dda-8eb9-85ea6f77984f@arm.com>
 <nycvar.YFH.7.76.2105051414540.9200@zhemvz.fhfr.qr>
From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
Message-ID: <6e4541ef-e0a1-1d2d-53f5-4bfed9a65598@arm.com>
Date: Wed, 5 May 2021 17:58:30 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.9.1
MIME-Version: 1.0
In-Reply-To: <nycvar.YFH.7.76.2105051414540.9200@zhemvz.fhfr.qr>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 NICE_REPLY_A, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 05 May 2021 16:58:38 -0000


On 05/05/2021 13:34, Richard Biener wrote:
> On Wed, 5 May 2021, Andre Vieira (lists) wrote:
>
>> I tried to see what IVOPTs would make of this and it is able to analyze the
>> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
>> could be used as the base for the other (loop 2). I don't know if this is
>> where you'd want such optimizations to be made, on one side I think it would
> be great as it would also help with non-vectorized loops as you alluded to.
> Hmm, OK.  So there's the first loop that has a looparound jump and thus
> we do not always enter the 2nd loop with the first loop final value of the
> IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
> And for a followup transform to catch this it would need to compute
> the final value of the IV and then match this up with the initial
> value computation.  I suppose FRE could be taught to do this, at
> least for very simple cases.
I will admit I am not at all familiar with how FRE works. I know it
exists because running it often breaks my vector patches :P
But that's about all I know.
I will have a look and see if it makes sense from my perspective to 
address it there, because ...
>
>> Anyway, I digress. Back to the main question of this patch. How do you suggest
>> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
>> in the epilogue(s) (both vector and scalar!) and then teach it to merge IVs
>> if one ends where the other begins?
> I don't think we will make that work easily.  So indeed attacking this
> in the vectorizer sounds most promising.

The problem I found with my approach is that it only tackles the
vectorized epilogues, and that leads to regressions. I don't have the
example at hand, but what I saw was that increased register pressure
led to a spill in the hot path. I believe this was caused by the
vectorized epilogue loop using the updated pointers as the base for its
DRs (in this case there were three DRs: two loads and one store), while
the scalar epilogue was still using the original base + niters, since
this data_reference approach only changes the vectorized epilogues.
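
A minimal C sketch of the two addressing forms described above, with two
loads and one store. It is only an illustration, not the testcase in which
the regression was seen: VF, add_rebased and add_carried are made-up names,
the "main loop" is plain scalar code standing in for a vector loop, and a
compiler is free to transform either form.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 }; /* hypothetical vectorization factor */

/* The epilogue rebuilds its addresses from the original bases plus the
   number of elements the main loop handled (the "base + niters" form), so
   a, b, c and done all have to stay live across the main loop. */
void add_rebased(int *c, const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t done = n - n % VF;
    const int *pa = a, *pb = b;
    int *pc = c;
    for (size_t i = 0; i < done; i += VF, pa += VF, pb += VF, pc += VF)
        for (size_t l = 0; l < VF; ++l)
            pc[l] = pa[l] + pb[l];
    for (size_t i = done; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* The epilogue starts from the pointers the main loop already advanced,
   so the original bases are dead once the main loop has started. */
void add_carried(int *c, const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    const int *a_end = a + n;
    const int *main_end = a + (n - n % VF);
    for (; a != main_end; a += VF, b += VF, c += VF)
        for (size_t l = 0; l < VF; ++l)
            c[l] = a[l] + b[l];
    for (; a != a_end; ++a, ++b, ++c)
        *c = *a + *b;
}
```

Both functions compute the same result. The difference is only in which
values must be kept live between the main loop and the epilogue, which is
where a mix of the two forms can increase register pressure.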


>   I'll note there's also
> the issue of epilogue vectorization and reductions where we seem
> to not re-use partially reduced reduction vectors but instead
> reduce to a scalar in each step.  That's a related issue - we're
> not able to carry forward a (reduction) IV we generated for the
> main vector loop to the epilogue loops.  Like for
>
> double foo (double *a, int n)
> {
>    double sum = 0.;
>    for (int i = 0; i < n; ++i)
>      sum += a[i];
>    return sum;
> }
>
> with AVX512 we get three reductions to scalars instead of
> a partial reduction from zmm to ymm before the first vectorized
> epilogue followed by a reduction from ymm to xmm before the second
> (the jump-around for the epilogues needs to jump to the further
> reduction piece obviously).
>
> So I think we want to record IVs we generate (the reduction IVs
> are already nicely associated with the stmt-infos), one might
> consider to refer to them from the dr_vec_info for example.
>
> It's just going to be "interesting" to wire everything up
> correctly with all the jump-arounds we have ...
I have a downstream hack for the reductions, but it only worked for
partial-vector-usage, as there you have the guarantee that it's the same
vector-mode, so you don't need to faff around with half and full
vectors. Obviously what you are suggesting has much wider applications,
and, not surprisingly, I think Richard Sandiford also pointed out to me
that these are somewhat related and that we might be able to reuse the
IV-creation to manage it all. But I feel like I am currently light years
away from that.
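
For reference, the carried-forward reduction described in the quoted text
can be modelled in plain C roughly as follows. This is a sketch under
assumptions, not the downstream hack and not what GCC emits: the arrays
acc8, acc4 and acc2 stand in for zmm, ymm and xmm accumulators of doubles,
sum_carried is a made-up name, and the jump-arounds are not modelled (a
skipped loop simply leaves its accumulators at zero). Because the additions
are reassociated, the result can differ from the scalar loop for general
floating-point input.

```c
#include <assert.h>
#include <stddef.h>

double sum_carried(const double *a, size_t n)
{
    assert(a != NULL);
    double acc8[8] = {0}, acc4[4], acc2[2];
    size_t i = 0;

    /* Main loop: 8 lanes, standing in for a zmm accumulator. */
    for (; i + 8 <= n; i += 8)
        for (size_t l = 0; l < 8; ++l)
            acc8[l] += a[i + l];

    /* Partial reduction 8 -> 4 lanes instead of a full reduction to scalar. */
    for (size_t l = 0; l < 4; ++l)
        acc4[l] = acc8[l] + acc8[l + 4];

    /* First epilogue: 4 lanes, runs at most once. */
    if (i + 4 <= n) {
        for (size_t l = 0; l < 4; ++l)
            acc4[l] += a[i + l];
        i += 4;
    }

    /* Partial reduction 4 -> 2 lanes. */
    for (size_t l = 0; l < 2; ++l)
        acc2[l] = acc4[l] + acc4[l + 2];

    /* Second epilogue: 2 lanes, runs at most once. */
    if (i + 2 <= n) {
        for (size_t l = 0; l < 2; ++l)
            acc2[l] += a[i + l];
        i += 2;
    }

    /* Single final reduction to scalar, then the scalar remainder. */
    double sum = acc2[0] + acc2[1];
    for (; i < n; ++i)
        sum += a[i];
    return sum;
}
```

The point of the sketch is that only one reduction to a scalar remains, at
the very end, while each epilogue continues from the narrowed accumulator of
the loop before it.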

I had started to look at removing the data_reference updating we have
now and dealing with this in the 'create_iv' calls from
'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then I
thought it would be good to discuss it with you first. This will require
keeping track of the 'end-value' of the IV, which, for loops where we can
skip the previous loop, means we will need to construct a phi-node
containing the updated pointer and the initial base. But I'm not
entirely sure where to keep track of all this. I also don't know whether
I can replace the base address of the data_reference right there at the
'create_iv' call: can a data_reference be used multiple times in the
same loop?
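
The phi-node mentioned above can be pictured in source terms as below. This
is only a sketch with made-up names (STEP, double_all, main_ran), written as
scalar C rather than anything the vectorizer produces: the conditional
expression plays the role of the phi that merges the main loop's end value
with the initial base on the skip edge.

```c
#include <assert.h>
#include <stddef.h>

enum { STEP = 4 }; /* hypothetical vectorization factor */

void double_all(int *base, size_t n)
{
    assert(base != NULL);
    int *updated = base; /* pointer IV of the main loop */
    int main_ran = 0;

    if (n >= STEP) { /* guard; its false edge skips the main loop */
        int *main_end = base + (n - n % STEP);
        do {
            for (size_t l = 0; l < STEP; ++l)
                updated[l] *= 2;
            updated += STEP;
        } while (updated != main_end);
        main_ran = 1;
    }

    /* Stand-in for the phi: end value of the main-loop IV if the loop ran,
       initial base if it was skipped. */
    int *p = main_ran ? updated : base;

    /* The epilogue starts from the merged pointer, not from base + niters. */
    for (int *end = base + n; p != end; ++p)
        *p *= 2;
}
```

With more than one epilogue, each following loop would need the same kind of
merge for every pointer IV it inherits, which is the bookkeeping the
paragraph above is asking about.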

I'll go and do a bit more nosing around this idea and the ivmap you
mentioned before. Let me know if you have any ideas on what this should
all look like, even if it's only an 'in an ideal world' version.

Andre
>
>> On 04/05/2021 10:56, Richard Biener wrote:
>>> On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
>>>
>>>> Hi,
>>>>
>>>> The aim of this RFC is to explore a way of cleaning up the codegen around
>>>> data_references.  To be specific, I'd like to reuse the main-loop's updated
>>>> data_reference as the base_address for the epilogue's corresponding
>>>> data_reference, rather than use the niters.  We have found this leads to
>>>> better codegen in the vectorized epilogue loops.
>>>>
>>>> The approach in this RFC creates a map of iv_updates, each of which
>>>> always contains an updated pointer that is captured in
>>>> vectorizable_{load,store}. An iv_update may also contain a skip_edge in
>>>> case we decide the vectorization can be skipped in 'vect_do_peeling'.
>>>> During the epilogue update this map of iv_updates is then checked to see
>>>> if it contains an entry for a data_reference; if so, it is used
>>>> accordingly, and if not, we revert to the old behavior of using the
>>>> niters to advance the data_reference.
>>>>
>>>> The motivation for this work is to improve codegen for the option
>>>> `--param vect-partial-vector-usage=1` for SVE. We found that one of the
>>>> main problems for the codegen here was coming from unnecessary
>>>> conversions caused by the way we update the data_references in the
>>>> epilogue.
>>>>
>>>> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
>>>> still not optimal in some cases, specifically those where we have a scalar
>>>> epilogue, as this does not use the data_references and will rely on the
>>>> gimple scalar code, thus again constructing a memory access using the
>>>> niters. This is a limitation for which I haven't quite worked out a
>>>> solution yet, and it does cause some minor regressions due to
>>>> unfortunate spills.
>>>>
>>>> Let me know what you think and if you have ideas of how we can better
>>>> achieve this.
>>> Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
>>>
>>> I think it might be interesting to create a C testcase mimicking the
>>> update problem without involving the vectorizer.  That way we can
>>> see how the various components involved behave (FRE + ivopts most
>>> specifically).
>>>
>>> That said, a cleaner approach to dealing with this would be to
>>> explicitly track the IVs we generate for vectorized DRs, eventually
>>> factoring that out from vectorizable_{store,load} so we can simply
>>> carry over the actual pointer IV final value to the epilogue as
>>> initial value.  For each DR group we'd create a single IV (we can
>>> even do better in case we have load + store of the "same" group).
>>>
>>> We already kind-of track things via the ivexpr_map, but I'm not sure
>>> if this lazily populated map can be reliably re-used to "re-populate"
>>> the epilogue one (walk the map, create epilogue IVs with the appropriate
>>> initial value & adjusted update).
>>>
>>> Richard.
>>>
>>>> Kind regards,
>>>> Andre Vieira
>>>>
>>>>
>>>>
>>