From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <andre.simoesdiasvieira@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id BF043388A40F
 for <gcc-patches@gcc.gnu.org>; Wed, 16 Jun 2021 10:24:27 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org BF043388A40F
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 585321042;
 Wed, 16 Jun 2021 03:24:27 -0700 (PDT)
Received: from [10.57.75.172] (unknown [10.57.75.172])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id C18B53F70D;
 Wed, 16 Jun 2021 03:24:26 -0700 (PDT)
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue
 vectorization
To: Richard Biener <rguenther@suse.de>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
References: <ea077462-00c3-f2dc-e1ec-1f95130e918c@arm.com>
 <nycvar.YFH.7.76.2105041149180.9200@zhemvz.fhfr.qr>
 <3a5de6dc-d5ec-7dda-8eb9-85ea6f77984f@arm.com>
 <nycvar.YFH.7.76.2105051414540.9200@zhemvz.fhfr.qr>
 <6e4541ef-e0a1-1d2d-53f5-4bfed9a65598@arm.com>
 <c03961fa-ef0f-7ce4-180b-8414a610ee73@arm.com>
 <nycvar.YFH.7.76.2105201147390.9200@zhemvz.fhfr.qr>
 <4925fee5-dcea-0c14-388c-85c881ffd918@arm.com>
 <nycvar.YFH.7.76.2106141014261.9200@zhemvz.fhfr.qr>
 <nycvar.YFH.7.76.2106141253390.9200@zhemvz.fhfr.qr>
From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
Message-ID: <077c4a2f-d745-dd0a-4d6d-d73d6a558529@arm.com>
Date: Wed, 16 Jun 2021 11:24:24 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.10.1
MIME-Version: 1.0
In-Reply-To: <nycvar.YFH.7.76.2106141253390.9200@zhemvz.fhfr.qr>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
X-Spam-Status: No, score=-7.0 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 NICE_REPLY_A, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 16 Jun 2021 10:24:29 -0000


On 14/06/2021 11:57, Richard Biener wrote:
> On Mon, 14 Jun 2021, Richard Biener wrote:
>
>> Indeed. For example a simple
>> int a[1024], b[1024], c[1024];
>>
>> void foo(int n)
>> {
>>    for (int i = 0; i < n; ++i)
>>      a[i+1] += c[i+i] ? b[i+1] : 0;
>> }
>>
>> should usually see peeling for alignment (though on x86 you need
>> exotic -march= since cost models generally have equal aligned and
>> unaligned access costs).  For example with -mavx2 -mtune=atom
>> we'll see an alignment peeling prologue, an AVX2 vector loop,
>> an SSE2 vectorized epilogue and a scalar epilogue.  It also
>> shows the original scalar loop being used in the scalar prologue
>> and epilogue.
>>
>> We're not even trying to make the counting IV easily used
>> across loops (we're not counting scalar iterations in the
>> vector loops).
> Specifically we see
>
> <bb 33> [local count: 94607391]:
> niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> _67 = niters_vector_mult_vf.10_62 + 7;
> _64 = (int) niters_vector_mult_vf.10_62;
> tmp.11_63 = i_43 + _64;
> if (niters.8_45 == niters_vector_mult_vf.10_62)
>    goto <bb 37>; [12.50%]
> else
>    goto <bb 36>; [87.50%]
>
> after the main vect loop, recomputing the original IV (i) rather
> than using the inserted canonical IV.  And then the vectorized
> epilogue header check doing
>
> <bb 36> [local count: 93293400]:
> # i_59 = PHI <tmp.11_63(33), 0(18)>
> # _66 = PHI <_67(33), 0(18)>
> _96 = (unsigned int) n_10(D);
> niters.26_95 = _96 - _66;
> _108 = (unsigned int) n_10(D);
> _109 = _108 - _66;
> _110 = _109 + 4294967295;
> if (_110 <= 3)
>    goto <bb 47>; [10.00%]
> else
>    goto <bb 40>; [90.00%]
>
> re-computing everything from scratch again (also notice how
> the main vect loop guard jumps around the alignment prologue
> as well and lands here - and the vectorized epilogue using
> unaligned accesses - good!).
>
> That is, I'd expect _much_ easier jobs if we'd manage to
> track the number of performed scalar iterations (or the
> number of scalar iterations remaining) using the canonical
> IV we add to all loops across all of the involved loops.
>
> Richard.


So I am now looking at using an IV that counts scalar iterations rather 
than vector iterations, and at reusing that IV through all loops 
(prologue, main loop, vectorized epilogue and scalar epilogue). The 
first part is easy, since that's what we already do for partial vectors 
or non-constant VFs. The second requires some plumbing and removing a 
lot of the existing code that creates new IVs going from [0, niters - 
previous iterations]. I don't yet have a clear-cut view of how to do 
this. I first thought of keeping track of the 'control' IV in the 
loop_vinfo, but the prologue and scalar epilogues won't have one. 
'loop' keeps a control_ivs struct, but that is used for overflow 
detection and only keeps track of what looks like a constant 'base' and 
'step'. I'm not quite sure how all that works, but intuitively it 
doesn't seem like the right thing to reuse.
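
To make the idea concrete, here is a rough source-level sketch in plain 
C of what a single scalar-iteration counter shared by all four loops 
could look like for Richard's example. This is only an illustration and 
not what the vectorizer emits: the names (foo_shared_counter, 
loop_body, done, peel, MAIN_VF, EPI_VF) are made up, the "vector" loops 
simulate their lanes with scalar code, and the peel count is passed in 
rather than derived from an address.

```c
/* Hypothetical vectorization factors: 8 lanes for the main loop,
   4 lanes for the vectorized epilogue (as in the AVX2/SSE2 example).  */
enum { MAIN_VF = 8, EPI_VF = 4 };

/* One scalar iteration of the example loop body.  */
static void loop_body(int *a, const int *b, const int *c, unsigned i)
{
    a[i + 1] += c[i + i] ? b[i + 1] : 0;
}

/* Plain scalar loop, kept for comparison.  */
void foo_reference(int *a, const int *b, const int *c, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        loop_body(a, b, c, i);
}

/* All four loops advance the same counter 'done', which always holds
   the number of scalar iterations performed so far.  No loop has to
   recompute "niters - previous iterations" from scratch.
   Returns the final value of 'done'.  */
unsigned foo_shared_counter(int *a, const int *b, const int *c,
                            unsigned n, unsigned peel)
{
    unsigned done = 0;

    /* Alignment prologue: at most 'peel' scalar iterations.  */
    for (; done < n && done < peel; ++done)
        loop_body(a, b, c, done);

    /* Main vector loop: MAIN_VF scalar iterations per step.  */
    for (; n - done >= MAIN_VF; done += MAIN_VF)
        for (unsigned lane = 0; lane < MAIN_VF; ++lane)
            loop_body(a, b, c, done + lane);

    /* Vectorized epilogue: EPI_VF scalar iterations per step.  */
    for (; n - done >= EPI_VF; done += EPI_VF)
        for (unsigned lane = 0; lane < EPI_VF; ++lane)
            loop_body(a, b, c, done + lane);

    /* Scalar epilogue: whatever is left.  */
    for (; done < n; ++done)
        loop_body(a, b, c, done);

    return done;
}
```

Each loop guard here is just a comparison of n - done against that 
loop's step, which is the kind of simplification I'd hope to get from 
carrying one counter through instead of creating a fresh IV per loop.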

I'll go hack around and keep you posted on progress.

Regards,
Andre