public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps
@ 2021-11-07 11:36 rsandifo at gcc dot gnu.org
  2021-11-08  8:20 ` [Bug tree-optimization/103116] " rguenth at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-11-07 11:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

            Bug ID: 103116
           Summary: SLP vectoriser fails to peel for gaps
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: wrong-code
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

Created attachment 51744
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51744&action=edit
testcase

Compiling the attached testcase on x86_64-linux-gnu with -O2
-fno-vect-cost-model triggers a segfault.  We implement the y[] reads in:

  for (int i = 0; i < COUNT; ++i)
    {
      x[i * 4] = y[i * 2] + 1;
      x[i * 4 + 1] = y[i * 2] + 2;
      x[i * 4 + 2] = y[i * 2 + 1] + 3;
      x[i * 4 + 3] = y[i * 2 + 1] + 4;
    }

using a load and permute in which only the low half of the loaded data is used.
 The high half of the final iteration overruns the array.

I guess the usual approach of peeling one iteration for gaps will be enough,
but haven't thought about it much.  Are there corner cases where we might need
to peel more?

(We get away with this on AArch64 because we fall back to load/store lanes
instead.)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
@ 2021-11-08  8:20 ` rguenth at gcc dot gnu.org
  2022-05-03 13:19 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-11-08  8:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-11-08
      Known to fail|                            |11.2.1, 12.0, 7.5.0
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  Indeed peeling for gaps is what we usually do here.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
  2021-11-08  8:20 ` [Bug tree-optimization/103116] " rguenth at gcc dot gnu.org
@ 2022-05-03 13:19 ` rguenth at gcc dot gnu.org
  2022-05-04  7:34 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-03 13:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the issue is we have group_size == 2 but nunits == 4 but still gap == 0.
That makes get_group_load_store_type assume overrun_p = false.

I suppose that when we'd have 8 elements in x and four times the first and
second in y peeling one vector iteration as scalar is not enough to avoid the
breakage.

So while peeling for gaps in this particular case helps it's not the solution
for the more general problem.  Here instead I think we need to enforce a
minimum vectorization factor so that nunits divides group_size * vf (or at
least
nunits/2 does to allow peeling for gaps to work).

ISTR we specifically did not do this to allow more vectorization though.  The
better alternative would then be to allow a smaller vector size to be used
for the load with all the ripple down effects that might have (and only
enforce a larger VF if there is no such vector type).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
  2021-11-08  8:20 ` [Bug tree-optimization/103116] " rguenth at gcc dot gnu.org
  2022-05-03 13:19 ` rguenth at gcc dot gnu.org
@ 2022-05-04  7:34 ` rguenth at gcc dot gnu.org
  2022-05-04  8:42 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-04  7:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
We could make peeling for gaps handle this by making it not a flag but indicate
the number of vector(!?) iterations we need to peel.  We're doing the "correct"
thing in adjusting the IV increment via

          if (slp_perm
              && (group_size != scalar_lanes
                  || !multiple_p (nunits, group_size)))
            {
              /* We don't yet generate such SLP_TREE_LOAD_PERMUTATIONs for
                 variable VF; see vect_transform_slp_perm_load.  */
              unsigned int const_vf = vf.to_constant ();
              unsigned int const_nunits = nunits.to_constant ();
              vec_num = CEIL (group_size * const_vf, const_nunits);
              group_gap_adj = vf * group_size - nunits * vec_num;

The problem also shows up for loops like

  for (int i = 0; i < COUNT; ++i)
    {
      x[i * 4] = y[i * 3] + 1;
      x[i * 4 + 1] = y[i * 3] + 2;
      x[i * 4 + 2] = y[i * 3 + 1] + 3;
      x[i * 4 + 3] = y[i * 3 + 2] + 4;
    }

where we cannot use a smaller vector type.

We could also use masked loads if available of course (not sure about the
cost of that vs peeling for gaps).

A conservative fix would be to detect when peeling for gaps as implemented
is good enough and do that and otherwise reject vectorization.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2022-05-04  7:34 ` rguenth at gcc dot gnu.org
@ 2022-05-04  8:42 ` rguenth at gcc dot gnu.org
  2022-05-04 13:05 ` rsandifo at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-04  8:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is currently visible in gcc.dg/vect/slp-perm-12.c which uses a VF of
2 and

  vect__1.20_88 = MEM <vector(8) short int> [(short int *)vectp_tptr.18_90];
  vectp_tptr.18_87 = vectp_tptr.18_90 + 16;
  vect__1.21_86 = MEM <vector(8) short int> [(short int *)vectp_tptr.18_87];
  vectp_tptr.18_85 = vectp_tptr.18_90 + 32;
  vect__1.22_84 = MEM <vector(8) short int> [(short int *)vectp_tptr.18_85];
  vectp_tptr.18_83 = vectp_tptr.18_90 + 28;
  vect__1.23_82 = VEC_PERM_EXPR <vect__1.21_86, vect__1.22_84, { 3, 4, 5, 6, 7,
8, 9, 10 }>;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-05-04  8:42 ` rguenth at gcc dot gnu.org
@ 2022-05-04 13:05 ` rsandifo at gcc dot gnu.org
  2022-05-04 13:06 ` rguenther at suse dot de
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2022-05-04 13:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #5 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> We could make peeling for gaps handle this by making it not a flag but
> indicate the number of vector(!?) iterations we need to peel.
I think it'd be better to keep it in scalar iterations if possible.
For example, we still need to peel for gaps when using masked LD2s
if only the first vector result is needed.  But we do only need to
peel 1 scalar iteration, rather than a whole vector's worth.

But yeah, agree that moving from 0-or-1 to a general number sounds good.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-05-04 13:05 ` rsandifo at gcc dot gnu.org
@ 2022-05-04 13:06 ` rguenther at suse dot de
  2022-05-04 13:12 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenther at suse dot de @ 2022-05-04 13:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 4 May 2022, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116
> 
> --- Comment #5 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> > We could make peeling for gaps handle this by making it not a flag but
> > indicate the number of vector(!?) iterations we need to peel.
> I think it'd be better to keep it in scalar iterations if possible.
> For example, we still need to peel for gaps when using masked LD2s
> if only the first vector result is needed.  But we do only need to
> peel 1 scalar iteration, rather than a whole vector's worth.
> 
> But yeah, agree that moving from 0-or-1 to a general number sounds good.

Yep.  I'll keep that for GCC 13, the posted fix should be what we can
reasonably easy backport.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2022-05-04 13:06 ` rguenther at suse dot de
@ 2022-05-04 13:12 ` cvs-commit at gcc dot gnu.org
  2022-05-04 13:14 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-05-04 13:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:52b7b86f8c72eb19e637f1e72ffd10f39b8cb829

commit r13-110-g52b7b86f8c72eb19e637f1e72ffd10f39b8cb829
Author: Richard Biener <rguenther@suse.de>
Date:   Wed May 4 10:43:07 2022 +0200

    tree-optimization/103116 - SLP permutes and peeling for gaps

    The testcase shows that we can end up with a contiguous access across
    loop iterations but by means of permutations the elements accessed
    might only cover parts of a vector.  In this case we end up with
    GROUP_GAP == 0 but still need to avoid accessing excess elements
    in the last loop iterations.  Peeling for gaps is designed to cover
    this but a single scalar iteration might not cover all of the excess
    elements.  The following ensures peeling for gaps is done in this
    situation and when that isn't sufficient because we need to peel
    more than one iteration (gcc.dg/vect/pr103116-2.c), fail the SLP
    vectorization.

    2022-05-04  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/103116
            * tree-vect-stmts.cc (get_group_load_store_type): Handle the
            case we need peeling for gaps even though GROUP_GAP is zero.

            * gcc.dg/vect/pr103116-1.c: New testcase.
            * gcc.dg/vect/pr103116-2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2022-05-04 13:12 ` cvs-commit at gcc dot gnu.org
@ 2022-05-04 13:14 ` rguenth at gcc dot gnu.org
  2022-05-20  7:10 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-04 13:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |13.0

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed on trunk sofar, queued for backporting.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2022-05-04 13:14 ` rguenth at gcc dot gnu.org
@ 2022-05-20  7:10 ` cvs-commit at gcc dot gnu.org
  2022-07-22 11:21 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-05-20  7:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-12 branch has been updated by Richard Biener
<rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:fdf50499a40399a48ac5e5d521ef93ed302be157

commit r12-8405-gfdf50499a40399a48ac5e5d521ef93ed302be157
Author: Richard Biener <rguenther@suse.de>
Date:   Wed May 4 10:43:07 2022 +0200

    tree-optimization/103116 - SLP permutes and peeling for gaps

    The testcase shows that we can end up with a contiguous access across
    loop iterations but by means of permutations the elements accessed
    might only cover parts of a vector.  In this case we end up with
    GROUP_GAP == 0 but still need to avoid accessing excess elements
    in the last loop iterations.  Peeling for gaps is designed to cover
    this but a single scalar iteration might not cover all of the excess
    elements.  The following ensures peeling for gaps is done in this
    situation and when that isn't sufficient because we need to peel
    more than one iteration (gcc.dg/vect/pr103116-2.c), fail the SLP
    vectorization.

    2022-05-04  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/103116
            * tree-vect-stmts.cc (get_group_load_store_type): Handle the
            case we need peeling for gaps even though GROUP_GAP is zero.

            * gcc.dg/vect/pr103116-1.c: New testcase.
            * gcc.dg/vect/pr103116-2.c: Likewise.

    (cherry picked from commit 52b7b86f8c72eb19e637f1e72ffd10f39b8cb829)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2022-05-20  7:10 ` cvs-commit at gcc dot gnu.org
@ 2022-07-22 11:21 ` cvs-commit at gcc dot gnu.org
  2022-07-22 11:23 ` rguenth at gcc dot gnu.org
  2024-03-10  4:59 ` pinskia at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-07-22 11:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-11 branch has been updated by Richard Biener
<rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:ff408622a5f8fb395e5584a494aaaad05b3611db

commit r11-10167-gff408622a5f8fb395e5584a494aaaad05b3611db
Author: Richard Biener <rguenther@suse.de>
Date:   Wed May 4 10:43:07 2022 +0200

    tree-optimization/103116 - SLP permutes and peeling for gaps

    The testcase shows that we can end up with a contiguous access across
    loop iterations but by means of permutations the elements accessed
    might only cover parts of a vector.  In this case we end up with
    GROUP_GAP == 0 but still need to avoid accessing excess elements
    in the last loop iterations.  Peeling for gaps is designed to cover
    this but a single scalar iteration might not cover all of the excess
    elements.  The following ensures peeling for gaps is done in this
    situation and when that isn't sufficient because we need to peel
    more than one iteration (gcc.dg/vect/pr103116-2.c), fail the SLP
    vectorization.

    2022-05-04  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/103116
            * tree-vect-stmts.c (get_group_load_store_type): Handle the
            case we need peeling for gaps even though GROUP_GAP is zero.

            * gcc.dg/vect/pr103116-1.c: New testcase.
            * gcc.dg/vect/pr103116-2.c: Likewise.

    (cherry picked from commit 52b7b86f8c72eb19e637f1e72ffd10f39b8cb829)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2022-07-22 11:21 ` cvs-commit at gcc dot gnu.org
@ 2022-07-22 11:23 ` rguenth at gcc dot gnu.org
  2024-03-10  4:59 ` pinskia at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-22 11:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED
      Known to fail|11.2.1                      |11.3.0
      Known to work|                            |11.3.1

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 11+

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/103116] SLP vectoriser fails to peel for gaps
  2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2022-07-22 11:23 ` rguenth at gcc dot gnu.org
@ 2024-03-10  4:59 ` pinskia at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-10  4:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103116

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.4

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-03-10  4:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-07 11:36 [Bug tree-optimization/103116] New: SLP vectoriser fails to peel for gaps rsandifo at gcc dot gnu.org
2021-11-08  8:20 ` [Bug tree-optimization/103116] " rguenth at gcc dot gnu.org
2022-05-03 13:19 ` rguenth at gcc dot gnu.org
2022-05-04  7:34 ` rguenth at gcc dot gnu.org
2022-05-04  8:42 ` rguenth at gcc dot gnu.org
2022-05-04 13:05 ` rsandifo at gcc dot gnu.org
2022-05-04 13:06 ` rguenther at suse dot de
2022-05-04 13:12 ` cvs-commit at gcc dot gnu.org
2022-05-04 13:14 ` rguenth at gcc dot gnu.org
2022-05-20  7:10 ` cvs-commit at gcc dot gnu.org
2022-07-22 11:21 ` cvs-commit at gcc dot gnu.org
2022-07-22 11:23 ` rguenth at gcc dot gnu.org
2024-03-10  4:59 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).