From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] tree-optimization/115385 - handle more gaps with peeling of a single iteration
Date: Wed, 12 Jun 2024 10:38:22 +0100	[thread overview]
Message-ID: <mpth6dyxzzl.fsf@arm.com> (raw)
In-Reply-To: <372q9s4p-86qp-q2p9-9pn1-1n185pn86s7q@fhfr.qr> (Richard Biener's message of "Wed, 12 Jun 2024 11:06:29 +0200 (CEST)")

Richard Biener <rguenther@suse.de> writes:
> On Wed, 12 Jun 2024, Richard Biener wrote:
>
>> On Tue, 11 Jun 2024, Richard Sandiford wrote:
>> 
>> > Don't think it makes any difference, but:
>> > 
>> > Richard Biener <rguenther@suse.de> writes:
>> > > @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>> > >  		     access excess elements.
>> > >  		     ???  Enhancements include peeling multiple iterations
>> > >  		     or using masked loads with a static mask.  */
>> > > -		  || (group_size * cvf) % cnunits + group_size - gap < cnunits))
>> > > +		  || ((group_size * cvf) % cnunits + group_size - gap < cnunits
>> > > +		      /* But peeling a single scalar iteration is enough if
>> > > +			 we can use the next power-of-two sized partial
>> > > +			 access.  */
>> > > +		      && ((cremain = (group_size * cvf - gap) % cnunits), true
>> > 
>> > ...this might be less surprising as:
>> > 
>> > 		      && ((cremain = (group_size * cvf - gap) % cnunits, true)
>> > 
>> > in terms of how the &&s line up.
>> 
>> Yeah - I'll fix before pushing.
>
> The aarch64 CI shows that a few testcases no longer use SVE
> (gcc.target/aarch64/sve/slp_perm_{4,7,8}.c) because peeling
> for gaps is deemed insufficient.  Formerly we had
>
>           if (loop_vinfo
>               && *memory_access_type == VMAT_CONTIGUOUS
>               && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
>               && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>                               nunits))
>             {
>               unsigned HOST_WIDE_INT cnunits, cvf;
>               if (!can_overrun_p
>                   || !nunits.is_constant (&cnunits)
>                   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
>                   /* Peeling for gaps assumes that a single scalar iteration
>                      is enough to make sure the last vector iteration doesn't
>                      access excess elements.
>                      ???  Enhancements include peeling multiple iterations
>                      or using masked loads with a static mask.  */
>                   || (group_size * cvf) % cnunits + group_size - gap < cnunits)
>                 {
>                   if (dump_enabled_p ())
>                     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                                      "peeling for gaps insufficient for "
>                                      "access\n");
>
> and in all cases multiple_p (group_size * LOOP_VINFO_VECT_FACTOR, nunits)
> is true so we didn't check for whether peeling one iteration is
> sufficient.  But after the refactoring the outer checks merely
> indicate there's overrun (which is there already because gap != 0).
>
> That is, we never verified, for the "regular" gap case, whether peeling
> for a single iteration is sufficient.  But now of course we run into
> the inner check which will always trigger if earlier checks didn't
> work out to set overrun_p to false.
>
> For slp_perm_8.c we have a group_size of two, nunits is {16, 16}
> and VF is {8, 8} and gap is one.  Given we know the
> multiple_p we know that (group_size * cvf) % cnunits is zero,
> so what remains is group_size - gap < nunits but 1 is probably
> always less than {16, 16}.
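
(For concreteness, here's a throwaway sketch of that arithmetic with the
poly values collapsed to their 128-bit minimum; this is just the
slp_perm_8.c numbers plugged into the pre-existing check, not vectorizer
code:)

  #include <cstdio>

  int
  main ()
  {
    unsigned group_size = 2;   /* two-element interleaving group */
    unsigned cvf = 8;          /* VF = {8, 8}, 128-bit minimum 8 */
    unsigned cnunits = 16;     /* nunits = {16, 16}, 128-bit minimum 16 */
    unsigned gap = 1;

    /* multiple_p (group_size * VF, nunits) holds for this testcase, so the
       modulo term is zero and the check degenerates to
       group_size - gap < cnunits.  */
    unsigned lhs = (group_size * cvf) % cnunits + group_size - gap;
    std::printf ("%u < %u -> %s\n", lhs, cnunits,
		 lhs < cnunits ? "peeling deemed insufficient" : "peeling OK");
    return 0;
  }

which prints "1 < 16 -> peeling deemed insufficient".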

I thought the idea was that the size of the gap was immaterial
for VMAT_CONTIGUOUS, on the assumption that it would never be
bigger than a page.  That is, any gap loaded by the final
unpeeled iteration would belong to the same page as the non-gap
data from either the same vector iteration or the subsequent
peeled scalar iteration.

Will have to think more about this, but in case it doesn't affect the
rest of the message, FWIW...

> The new logic I added in the later patch, namely that peeling a single
> iteration is OK when we use a smaller, power-of-two sized partial
> access, is
>
>                   || ((group_size * cvf) % cnunits + group_size - gap < cnunits
>                       /* But peeling a single scalar iteration is enough if
>                          we can use the next power-of-two sized partial
>                          access.  */
>                       && (cremain = (group_size * cvf - gap) % cnunits, true)
>                       && (cpart_size = (1 << ceil_log2 (cremain))) != cnunits
>                       && vector_vector_composition_type
>                            (vectype, cnunits / cpart_size,
>                             &half_vtype) == NULL_TREE)))
>
> again knowing the multiple we know cremain is nunits - gap and with
> gap == 1 rounding this size up will yield nunits and thus the existing
> peeling is OK.  Something is inconsistent here and the pre-existing
>
>   (group_size * cvf) % cnunits + group_size - gap < cnunits
>
> check looks suspicious as a general check.
>
>   (group_size * cvf - gap)
>
> should be the number of elements we can access without touching
> excess elements.  Peeling a single iteration will make sure
> group_size * cvf + group_size - gap is accessed
> (that's group_size * (cvf + 1) - gap).  The excess elements
> touched in the vector loop are
>
>   cnunits - (group_size * cvf - gap) % cnunits
>
> I think that number needs to be less than or equal to group_size, so
> the correct check should be
>
>   (group_size * cvf - gap) % cnunits + group_size < cnunits
>
> for the SVE case that's (nunits - 1) + 2 < nunits which should
> simplify to false.  Now the question is how to formulate this
> with poly-ints in a way that it works out, for the case in
> question doing the overrun check as
>
>           if (overrun_p
>               && can_div_trunc_p (group_size
>                                   * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
>                                   nunits, &tem, &remain)
>               && maybe_lt (remain + group_size, nunits))

...it looks like this should be known_lt, if we're relying on it
for correctness.
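
(FWIW, a throwaway model of that comparison for the slp_perm_8.c numbers.
The poly values are simulated as c0 + c1 * x over a range of VL multiples,
so the maybe/known predicates below are brute-force illustrations, not the
poly-int implementation.  For this case both come out false, matching the
"should simplify to false" above:)

  #include <cstdio>

  /* Toy degree-1 poly value c0 + c1 * x, x the number of extra 128-bit
     blocks in the runtime vector length (x >= 0).  */
  struct poly
  {
    unsigned c0, c1;
    unsigned at (unsigned x) const { return c0 + c1 * x; }
  };

  /* "Might a be less than b for some VL?"  Brute-forced for illustration.  */
  static bool
  maybe_lt_sim (poly a, poly b)
  {
    for (unsigned x = 0; x < 64; ++x)
      if (a.at (x) < b.at (x))
	return true;
    return false;
  }

  /* "Is a known to be less than b for every VL?"  Brute-forced likewise.  */
  static bool
  known_lt_sim (poly a, poly b)
  {
    for (unsigned x = 0; x < 64; ++x)
      if (a.at (x) >= b.at (x))
	return false;
    return true;
  }

  int
  main ()
  {
    poly nunits = { 16, 16 };   /* VNx16QI: 16 + 16 * x elements */
    poly vf = { 8, 8 };         /* vectorization factor */
    unsigned group_size = 2, gap = 1;

    /* can_div_trunc_p (group_size * VF - gap, nunits, &tem, &remain):
       {16,16} - 1 = {15,16} divided by {16,16} gives quotient 0, so the
       remainder is {15,16} itself.  */
    poly remain = { group_size * vf.c0 - gap, group_size * vf.c1 };

    /* The proposed condition compares remain + group_size with nunits,
       i.e. {17,16} with {16,16}; 17 + 16x is never below 16 + 16x.  */
    poly lhs = { remain.c0 + group_size, remain.c1 };
    std::printf ("maybe_lt: %d  known_lt: %d\n",
		 maybe_lt_sim (lhs, nunits), known_lt_sim (lhs, nunits));
    return 0;
  }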

Thanks,
Richard

> seems to do the trick.  I'm going to test this, will post an updated
> series for the CI.
>
> Richard.
>
>> Thanks,
>> Richard.
>> 
>> > Thanks,
>> > Richard
>> > 
>> > > +			  && ((cpart_size = (1 << ceil_log2 (cremain)))
>> > > +			      != cnunits)
>> > > +			  && vector_vector_composition_type
>> > > +			       (vectype, cnunits / cpart_size,
>> > > +				&half_vtype) == NULL_TREE))))
>> > >  	    {
>> > >  	      if (dump_enabled_p ())
>> > >  		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> > > @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
>> > >  			      gcc_assert (new_vtype
>> > >  					  || LOOP_VINFO_PEELING_FOR_GAPS
>> > >  					       (loop_vinfo));
>> > > +			    /* But still reduce the access size to the next
>> > > +			       required power-of-two so peeling a single
>> > > +			       scalar iteration is sufficient.  */
>> > > +			    unsigned HOST_WIDE_INT cremain;
>> > > +			    if (remain.is_constant (&cremain))
>> > > +			      {
>> > > +				unsigned HOST_WIDE_INT cpart_size
>> > > +				  = 1 << ceil_log2 (cremain);
>> > > +				if (known_gt (nunits, cpart_size)
>> > > +				    && constant_multiple_p (nunits, cpart_size,
>> > > +							    &num))
>> > > +				  {
>> > > +				    tree ptype;
>> > > +				    new_vtype
>> > > +				      = vector_vector_composition_type (vectype,
>> > > +									num,
>> > > +									&ptype);
>> > > +				    if (new_vtype)
>> > > +				      ltype = ptype;
>> > > +				  }
>> > > +			      }
>> > >  			  }
>> > >  		      }
>> > >  		    tree offset
>> > 
>> 
>> 
