From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <andre.simoesdiasvieira@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id C828839888AA
 for <gcc-patches@gcc.gnu.org>; Wed,  5 May 2021 11:34:20 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org C828839888AA
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 627C0ED1;
 Wed,  5 May 2021 04:34:20 -0700 (PDT)
Received: from [10.57.1.74] (unknown [10.57.1.74])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CD5A23F70D;
 Wed,  5 May 2021 04:34:19 -0700 (PDT)
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue
 vectorization
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
 Richard Sandiford <richard.sandiford@arm.com>
References: <ea077462-00c3-f2dc-e1ec-1f95130e918c@arm.com>
 <nycvar.YFH.7.76.2105041149180.9200@zhemvz.fhfr.qr>
From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
Message-ID: <3a5de6dc-d5ec-7dda-8eb9-85ea6f77984f@arm.com>
Date: Wed, 5 May 2021 12:34:18 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.9.1
MIME-Version: 1.0
In-Reply-To: <nycvar.YFH.7.76.2105041149180.9200@zhemvz.fhfr.qr>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
X-Spam-Status: No, score=-5.7 required=5.0 tests=BAYES_00, BODY_8BITS,
 KAM_DMARC_STATUS, NICE_REPLY_A, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 05 May 2021 11:34:22 -0000

Hi Richi,

So I'm trying to look at what IVOPTs does right now and how it might be 
able to help us. Looking at these two code examples:
#include <stddef.h>
#if 0
int foo(short * a, short * b, unsigned int n)
{
     int sum = 0;
     for (unsigned int i = 0; i < n; ++i)
         sum += a[i] + b[i];

     return sum;
}


#else

int bar (short * a, short *b, unsigned int n)
{
     int sum = 0;
     unsigned int i = 0;
     for (; i < (n / 16); i += 1)
     {
         // Accesses elements [0, 16, ..., (n/16 - 1) * 16]
         // Example n = 127, so n / 16 = 7 and i runs over [0, 6],
         // accesses [0, 16, 32, 48, 64, 80, 96]
         sum += a[i*16] + b[i*16];
     }
     for (size_t j =  (size_t) ((n / 16) * 16); j < n; ++j)
     {
         // Iterates [(n/16) * 16, (n/16) * 16 + 1, ..., n - 1]
         // Example n = 127,
         // j starts at (127/16) * 16 = 7 * 16 = 112,
         // So iterates over [112, 113, 114, 115, ..., 126]
         sum += a[j] + b[j];
     }
     return sum;
}
#endif

I compiled the bottom one (#if 0) for 'aarch64-linux-gnu' with the 
following options: '-O3 -march=armv8-a -fno-tree-vectorize 
-fdump-tree-ivopts-all -fno-unroll-loops'. See the godbolt link here: 
https://godbolt.org/z/MEf6j6ebM

I tried to see what IVOPTs would make of this. It is able to analyze 
the IVs, but it doesn't realize (I'm not even sure it tries) that one 
IV's end (loop 1) could be used as the base for the other (loop 2). I 
don't know if this is where you'd want such optimizations to be made. On 
the one hand, I think it would be great, as it would also help with 
non-vectorized loops, as you alluded to.
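
To make the idea concrete, here is a minimal hand-written sketch of what 
"using loop 1's final IV as the base of loop 2" could look like at the 
source level for 'bar'. The name 'bar_merged' and the pointer variables 
are made up for illustration; this is not output from IVOPTs or from any 
patch, just the shape of code one might hope to end up with.

```c
#include <stddef.h>

/* Hypothetical variant of bar: both loops share one pair of pointer IVs,
   so the second loop starts where the first one stopped instead of
   recomputing its start address from n.  */
int bar_merged (short *a, short *b, unsigned int n)
{
    int sum = 0;
    short *pa = a;
    short *pb = b;
    /* End of the first loop's range: (n / 16) * 16 elements in.  */
    short *main_end = a + (size_t) (n / 16) * 16;
    /* End of the whole array.  */
    short *end = a + n;

    /* Loop 1: same accesses as a[i*16] + b[i*16] in bar.  */
    for (; pa != main_end; pa += 16, pb += 16)
        sum += *pa + *pb;

    /* Loop 2: pa and pb already point at element (n / 16) * 16.  */
    for (; pa != end; ++pa, ++pb)
        sum += *pa + *pb;

    return sum;
}
```

With this shape there is no conversion from n to an address at the start 
of the second loop, since the pointers are simply carried over.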

However, if you compile the top test case (#if 1) and let the 
tree-vectorizer have a go, you will see different behaviours for 
different vectorization approaches:
With '-O3 -march=armv8-a', using NEON and epilogue vectorization, it 
seems IVOPTs only picks up on one loop.
With '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it 
will detect two loops. This may well be because epilogue vectorization 
'un-loops' the epilogue, since it knows it will only have to do one 
iteration of the vectorized epilogue. vect-partial-vector-usage=1 could 
have done the same, but it fails to because we are dealing with 
polymorphic vector modes. I have a hack that works around this for 
vect-partial-vector-usage, but I think we can probably do better and 
try to reason about boundaries in poly_ints rather than integers (TBC).

Anyway, I digress. Back to the main question of this patch: how do you 
suggest I go about this? Is there a way to make IVOPTs aware of the 
'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then 
teach it to merge IVs if one ends where the other begins?

On 04/05/2021 10:56, Richard Biener wrote:
> On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>> The aim of this RFC is to explore a way of cleaning up the codegen around
>> data_references.  To be specific, I'd like to reuse the main-loop's updated
>> data_reference as the base_address for the epilogue's corresponding
>> data_reference, rather than use the niters.  We have found this leads to
>> better codegen in the vectorized epilogue loops.
>>
>> The approach in this RFC creates a map of iv_updates which always contain an
>> updated pointer that is captured in vectorizable_{load,store}, an iv_update may
>> also contain a skip_edge in case we decide the vectorization can be skipped in
>> 'vect_do_peeling'. During the epilogue update this map of iv_updates is then
>> checked to see if it contains an entry for a data_reference and it is used
>> accordingly and if not it reverts back to the old behavior of using the niters
>> to advance the data_reference.
>>
>> The motivation for this work is to improve codegen for the option `--param
>> vect-partial-vector-usage=1` for SVE. We found that one of the main problems
>> for the codegen here was coming from unnecessary conversions caused by the way
>> we update the data_references in the epilogue.
>>
>> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
>> still not optimal in some cases. Specifically those where we have a scalar
>> epilogue, as this does not use the data_reference's and will rely on the
>> gimple scalar code, thus constructing again a memory access using the niters.
>> This is a limitation for which I haven't quite worked out a solution yet and
>> does cause some minor regressions due to unfortunate spills.
>>
>> Let me know what you think and if you have ideas of how we can better achieve
>> this.
> Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
>
> I think it might be interesting to create a C testcase mimicking the
> update problem without involving the vectorizer.  That way we can
> see how the various components involved behave (FRE + ivopts most
> specifically).
>
> That said, a cleaner approach to dealing with this would be to
> explicitly track the IVs we generate for vectorized DRs, eventually
> factoring that out from vectorizable_{store,load} so we can simply
> carry over the actual pointer IV final value to the epilogue as
> initial value.  For each DR group we'd create a single IV (we can
> even do better in case we have load + store of the "same" group).
>
> We already kind-of track things via the ivexpr_map, but I'm not sure
> if this lazily populated map can be reliably re-used to "re-populate"
> the epilogue one (walk the map, create epilogue IVs with the appropriate
> initial value & adjusted update).
>
> Richard.
>
>> Kind regards,
>> Andre Vieira
>>
>>
>>