[RFC][ivopts] Generate better code for IVs with uses outside the loop


From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
To: "Bin.Cheng" <amker.cheng@gmail.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
	"bin.cheng" <bin.cheng@linux.alibaba.com>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	Richard Biener <rguenther@suse.de>
Subject: [RFC][ivopts] Generate better code for IVs with uses outside the loop
Date: Thu, 10 Jun 2021 12:51:12 +0100
Message-ID: <daf7312f-47c0-06e4-2a1c-a64d47890c12@arm.com>
In-Reply-To: <ecf8f81a-ab2e-0b47-72aa-8befdcb6933f@arm.com>


On 08/06/2021 16:00, Andre Simoes Dias Vieira via Gcc-patches wrote:
> Hi Bin,
>
> Thank you for the reply, I have some questions, see below.
>
> On 07/06/2021 12:28, Bin.Cheng wrote:
>> On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi Andre,
>> I didn't look into the details of the IV sharing RFC.  It seems to me
>> costing outside uses is trying to generate better code for later code
>> (epilogue loop here).  The only problem is IVOPTs doesn't know that
>> the outside use is not in the final form - which will be transformed
>> by IVOPTs again.
>>
>> I think this example is not good at describing your problem because it
>> shows exactly that considering outside use results in better code,
>> compared to the other two approaches.
> I don't quite understand what you are saying here :( What do you mean 
> by final form? It seems to me that costing uses inside and outside 
> loop the same way is wrong because calculating the IV inside the loop 
> has to be done every iteration, whereas if you can resolve it to a 
> single update (without an IV) then you can sink it outside the loop. 
> This is why I think this example shows why we need to cost these uses 
> differently.
>>> 2) Is there a cleaner way to generate the optimal 'post-increment' use
>>> for the outside-use variable? I first thought the position in the
>>> candidate might be something I could use or even the var_at_stmt
>>> functionality, but the outside IV has the actual increment of the
>>> variable as its use, rather than the outside uses. I find this to be
>>> this RFC's main weakness.
>> To answer why IVOPTs behaves like this w/o your two patches: the main
>> problem is the point at which IVOPTs rewrites the outside-use IV. I
>> don't remember the exact point, but it looks like it is at the end of
>> the loop, yet before the incrementing instruction of the main IV.  It's
>> a known issue that an outside use should be costed/re-written on the
>> exit edge along which its value flows out of the loop.  I had a patch a
>> long time ago but discarded it, because it didn't bring obvious
>> improvement and is complicated in the case of multi-exit edges.
> Yeah I haven't looked at multi-exit edges and I understand that 
> complicates things. But for now we could disable the special casing of 
> outside uses when dealing with multi-exit loops and keep the current 
> behavior.
>>
>> But in general, I am less convinced that either of the two patches is
>> the right direction for solving the IV sharing issue between the
>> vectorized loop and the epilogue loop.  I would need to read the
>> previous RFC before giving further comments though.
>
> The previous RFC still has a lot of unanswered questions too, but 
> regardless of that, take the following (non-vectorizer) example:
>
> #include <arm_neon.h>
> #include <arm_sve.h>
>
> void bar (char * __restrict__ a, char * __restrict__ b,
>           char * __restrict__ c, unsigned long long n)
> {
>   svbool_t all_true = svptrue_b8 ();
>   unsigned long long i = 0;
>   /* Main loop: whole vectors only, unpredicated.  */
>   for (; i < (n & ~(svcntb() - 1)); i += svcntb()) {
>       svuint8_t va = svld1 (all_true, (uint8_t*)a);
>       svuint8_t vb = svld1 (all_true, (uint8_t*)b);
>       svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
>       a += svcntb();
>       b += svcntb();
>       c += svcntb();
>   }
>   svbool_t pred;
>   /* Tail loop: remaining elements, predicated on i < n.  */
>   for (; i < (n); i += svcntb()) {
>       pred = svwhilelt_b8 (i, n);
>       svuint8_t va = svld1 (pred, (uint8_t*)a);
>       svuint8_t vb = svld1 (pred, (uint8_t*)b);
>       svst1 (pred, (uint8_t *)c, svadd_z (pred, va,vb));
>       a += svcntb();
>       b += svcntb();
>       c += svcntb();
>   }
> }
>
> Current IVOPTs will use 4 iterators for the first loop, when it could 
> do with just 1. In fact, if you use my patches it will create just a 
> single IV and sink the uses and it is then able to merge them with 
> loads & stores of the next loop.
I mixed things up here. I think an earlier version of my patch (with 
even more hacks) managed to rewrite these properly, but it looks like 
the current ones are messing things up.
I'll continue to try to understand how this works, as I do still think 
IVOPTs should be able to do better.
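
As a rough illustration of the "4 iterators versus 1" point, here is a
minimal scalar sketch in plain C. It is not SVE code and it is not what
IVOPTs emits; the names add_four_ivs, add_one_iv, struct cursors and the
'step' parameter (standing in for svcntb()) are made up for this example.
The first function bumps i, a, b and c on every iteration. The second
keeps only i as an IV, addresses memory as base + i, and computes the
final pointer values once after the loop, which is the "sunk" form the
outside uses could take.

```c
#include <stddef.h>

/* Hypothetical helper type: values that are live after the loop.  */
struct cursors
{
  const unsigned char *a;
  const unsigned char *b;
  unsigned char *c;
  size_t i;
};

/* Four IVs: i, a, b and c are all updated on every iteration.
   'step' must be non-zero; it stands in for svcntb().  */
struct cursors
add_four_ivs (const unsigned char *a, const unsigned char *b,
              unsigned char *c, size_t n, size_t step)
{
  size_t limit = n - n % step;   /* whole blocks only */
  size_t i = 0;
  for (; i < limit; i += step)
    {
      for (size_t k = 0; k < step; k++)
        c[k] = (unsigned char) (a[k] + b[k]);
      a += step;
      b += step;
      c += step;
    }
  return (struct cursors) { a, b, c, i };
}

/* One IV: only i is updated in the loop; the pointer values needed
   after the loop are computed once, outside it.  */
struct cursors
add_one_iv (const unsigned char *a, const unsigned char *b,
            unsigned char *c, size_t n, size_t step)
{
  size_t limit = n - n % step;
  size_t i = 0;
  for (; i < limit; i += step)
    for (size_t k = 0; k < step; k++)
      c[i + k] = (unsigned char) (a[i + k] + b[i + k]);
  /* Sunk updates: a single add per pointer on the exit path.  */
  return (struct cursors) { a + i, b + i, c + i, i };
}
```

Both functions leave the same values live after the loop, so a following
tail loop could start from either result.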

You mentioned you had a patch you thought might help earlier, but you 
dropped it. Do you still have it lying around anywhere?
>
> I am not saying setting outside costs to 0 is the right thing to do by 
> the way. It is absolutely not! It will break cost considerations for 
> other cases. Like I said above I've been playing around with using 
> '!use->outside' as a multiplier for the cost. Unfortunately it won't 
> help with the case above, because the cost there seems to come out as 
> 'infinite_cost', since the candidate IV has a lower precision than the 
> use IV. I don't quite understand yet how candidates are created, but 
> that is something I'm going to try to look at. I just wanted to show 
> this as an example of how IVOPTs would not improve code with multiple 
> loops that don't involve the vectorizer.
>
> BR,
> Andre
>
>
>>
>> Thanks,
>> bin

Thread overview: 5+ messages
2021-06-03 16:34 [RFC] Implementing detection of saturation and rounding arithmetic Andre Vieira (lists)
2021-06-03 16:41 ` [RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic) Andre Vieira (lists)
2021-06-07 11:28 ` [RFC] Implementing detection of saturation and rounding arithmetic Bin.Cheng
2021-06-08 15:00   ` Andre Simoes Dias Vieira
2021-06-10 11:51     ` Andre Vieira (lists) [this message]