[Bug tree-optimization/116583] New: vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

From: rguenth at gcc dot gnu.org @ 2024-09-03 12:37 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

            Bug ID: 116583
           Summary: vectorizable_slp_permutation cannot handle even/odd
                    extract from VLA vector
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

gcc.dg/vect/O3-pr39675-2.c:9:1: note: node 0x450c0e0 (max_nunits=1, refcnt=1)
vector([4,4]) int
gcc.dg/vect/O3-pr39675-2.c:9:1: note: op: VEC_PERM_EXPR
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     stmt 0 a0_8 = in[_1];
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     stmt 1 a2_10 = in[_3];
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     lane permutation { 0[0] 0[2] }
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     children 0x450bf18
gcc.dg/vect/O3-pr39675-2.c:9:1: note: node 0x450bf18 (max_nunits=4, refcnt=2)
vector([4,4]) int
gcc.dg/vect/O3-pr39675-2.c:9:1: note: op template: a0_8 = in[_1];
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     stmt 0 a0_8 = in[_1];
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     stmt 1 a1_9 = in[_2];
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     stmt 2 a2_10 = in[_3];
gcc.dg/vect/O3-pr39675-2.c:9:1: note:     stmt 3 a3_11 = in[_4];

Because the number of lanes in the two SLP nodes does not agree, we end up with
repeating_p == false, which causes the permute to be rejected as unsupported
for VLA vectors.  repeating_p is initially set to

  tree vectype = SLP_TREE_VECTYPE (node);
  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
  bool repeating_p = multiple_p (nunits, SLP_TREE_LANES (node));

I suppose that as long as 'child' is repeating in the same sense, the overall
permute is still repeating.  With that change we get

  vect_a0_8.6_27 = .MASK_LOAD (vectp_in.4_23, 32B, loop_mask_19);
  vectp_in.4_28 = vectp_in.4_23 + POLY_INT_CST [16, 16];
  vect_a0_8.7_29 = .MASK_LOAD (vectp_in.4_28, 32B, loop_mask_18);
  vectp_in.4_30 = vectp_in.4_23 + POLY_INT_CST [32, 32];
  vect_a0_8.8_31 = .MASK_LOAD (vectp_in.4_30, 32B, loop_mask_6);
  vectp_in.4_32 = vectp_in.4_23 + POLY_INT_CST [48, 48];
  vect_a0_8.9_33 = .MASK_LOAD (vectp_in.4_32, 32B, loop_mask_5);
  _43 = VEC_PERM_EXPR <vect_a0_8.6_27, vect_a0_8.6_27, { 0, 2, 4, ... }>;
  _44 = VEC_PERM_EXPR <vect_a0_8.7_29, vect_a0_8.7_29, { 0, 2, 4, ... }>;
  _45 = VEC_PERM_EXPR <_43, _43, { 0, 2, 4, ... }>;

That isn't entirely what we expect, though.  We'd have expected _27 and _29
in the first permute, _31 and _33 in the second, and _43 and _44 in the
third.
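
As a rough illustration of the repeating_p reasoning above, here is a minimal
standalone sketch in C++.  It is not GCC code: PolyU64, multiple_p_model,
repeating_as_reported and repeating_relaxed are hypothetical names, and the
lane-count check only mirrors the behaviour described in this report.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a two-coefficient poly_uint64: the value is
// c0 + c1 * x for some runtime x >= 0.  vector([4,4]) int has nunits {4, 4}.
struct PolyU64 {
  std::uint64_t c0;
  std::uint64_t c1;
};

// Model of multiple_p (poly, constant): c0 + c1 * x is a multiple of n for
// every x exactly when both coefficients are multiples of n.
constexpr bool multiple_p_model(PolyU64 v, std::uint64_t n) {
  return n != 0 && v.c0 % n == 0 && v.c1 % n == 0;
}

// Behaviour as described in the report: start from the node's own lane
// count, then give up once the child's lane count differs.
constexpr bool repeating_as_reported(PolyU64 nunits, unsigned node_lanes,
                                     unsigned child_lanes) {
  bool repeating_p = multiple_p_model(nunits, node_lanes);
  if (child_lanes != node_lanes)
    repeating_p = false;
  return repeating_p;
}

// Assumed form of the relaxation floated in the report: the child may have
// a different lane count as long as it is "repeating in the same sense".
constexpr bool repeating_relaxed(PolyU64 nunits, unsigned node_lanes,
                                 unsigned child_lanes) {
  return multiple_p_model(nunits, node_lanes)
         && multiple_p_model(nunits, child_lanes);
}

// The case from the dump: a 2-lane permute node with a 4-lane child.
void check_repeating_examples() {
  constexpr PolyU64 nunits{4, 4};
  assert(!repeating_as_reported(nunits, 2, 4));
  assert(repeating_relaxed(nunits, 2, 4));
  assert(!repeating_relaxed(nunits, 2, 8));  // 8 lanes do not divide [4,4]
}
```

Under this model the 2-lane node with a 4-lane child is rejected by the
reported check but accepted by the relaxed one.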

From: rguenth at gcc dot gnu.org @ 2024-09-03 12:39 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org
             Blocks|                            |116578

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Richard, can you help here?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116578
[Bug 116578] vectorizer SLP transition issues / dependences

From: rguenth at gcc dot gnu.org @ 2024-09-20  9:04 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-09-20
             Target|                            |aarch64, riscv
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another example showing this is gcc.dg/vect/slp-42.c - we definitely can
do the interleaving scheme, as non-SLP vectorization shows.

gcc.dg/vect/slp-42.c also shows we're not yet "lowering" all SLP load permutes.
The original SLP attempt still has

   node 0x45d5050 (max_nunits=4, refcnt=2) vector([4,4]) int
   op template: _2 = q[_1];
        stmt 0 _2 = q[_1];
        stmt 1 _8 = q[_7];
        stmt 2 _14 = q[_13];
        stmt 3 _20 = q[_19];
        load permutation { 0 1 2 3 }
   node 0x45d50e8 (max_nunits=4, refcnt=2) vector([4,4]) int
   op template: _4 = q[_3];
        stmt 0 _4 = q[_3];
        stmt 1 _10 = q[_9];
        stmt 2 _16 = q[_15];
        stmt 3 _22 = q[_21];
        load permutation { 4 5 6 7 }

instead of a single contiguous load and two VEC_PERM_EXPR nodes to extract
the lo/hi parts (which is also an even/odd extract, but with a larger mode
encompassing 4 elements).

I'd say for VLA operation this is one of the major blockers for all-SLP.
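
To make the "lo/hi is also even/odd with a larger mode" remark concrete, here
is a small standalone C++ sketch.  It only models the data movement on plain
arrays; extract_chunks is a hypothetical helper, not a GCC function, and the
vector length used in the example is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Treat the concatenation of a and b as a sequence of chunks of `chunk`
// consecutive elements and keep every other chunk, starting at chunk
// `first` (0 = even chunks, 1 = odd chunks).  With chunk == 1 this is the
// ordinary even/odd extract; with chunk == 4 it picks the lo or hi half of
// each 8-element group.
std::vector<int> extract_chunks(const std::vector<int>& a,
                                const std::vector<int>& b,
                                std::size_t chunk, std::size_t first) {
  assert(chunk > 0 && a.size() == b.size() && a.size() % chunk == 0);
  std::vector<int> cat(a);
  cat.insert(cat.end(), b.begin(), b.end());
  std::vector<int> out;
  for (std::size_t c = first; c * chunk < cat.size(); c += 2)
    for (std::size_t j = 0; j < chunk; ++j)
      out.push_back(cat[c * chunk + j]);
  return out;
}
```

For two contiguous 8-element vectors holding q[0..7] and q[8..15], calling
extract_chunks with chunk 4 and first 0 yields q[0..3] followed by q[8..11],
which corresponds to load permutation { 0 1 2 3 } repeated per group; first 1
yields the { 4 5 6 7 } lanes.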

From: tnfchris at gcc dot gnu.org @ 2024-09-20  9:09 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)

I'll take a look, if Richard hasn't yet, once I finish the early break
transition :)

From: rguenth at gcc dot gnu.org @ 2024-09-20  9:33 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the key thing to notice here is that the regular interleaving scheme knows
there are enough vectors to perform two-vector-to-one permutes within the same
group, whereas we only have a single child for the VEC_PERM_EXPR, which for the
permute in question effectively means we have to take "two" VLA vectors.
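
A minimal standalone C++ sketch of why the even extract needs "two" vectors
follows.  vec_perm models the documented two-input VEC_PERM_EXPR semantics
(selector indices address the concatenation of both inputs, modulo 2N) on
plain arrays; vec_perm and even_selector are hypothetical names used only
for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of VEC_PERM_EXPR <v0, v1, sel> for vectors of N elements: each
// selector index is taken modulo 2 * N and picks from v0 (index < N) or
// from v1 (index >= N).
std::vector<int> vec_perm(const std::vector<int>& v0,
                          const std::vector<int>& v1,
                          const std::vector<std::size_t>& sel) {
  const std::size_t n = v0.size();
  assert(n > 0 && v1.size() == n && sel.size() == n);
  std::vector<int> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    const std::size_t idx = sel[i] % (2 * n);
    out[i] = idx < n ? v0[idx] : v1[idx - n];
  }
  return out;
}

// The stepped selector { 0, 2, 4, ... } for vectors of n elements.
std::vector<std::size_t> even_selector(std::size_t n) {
  std::vector<std::size_t> sel(n);
  for (std::size_t i = 0; i < n; ++i)
    sel[i] = 2 * i;
  return sel;
}
```

With v0 = {0,1,2,3} and v1 = {4,5,6,7}, the even selector produces {0,2,4,6}.
Passing the same vector for both inputs produces {0,2,0,2} instead, so the
second half repeats data rather than continuing into the next vector.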

The non-SLP interleaving scheme for this performs multiple VLA loads, while
with SLP we'd have a contiguous load node that we'd permute later on, but we're
usually not emitting multiple loads(?).  For gcc.dg/vect/slp-42.c we do end up
(after re-analyzing with single-lane SLP) with store-lanes for the 4-element
store, but SVE doesn't support 8-element load-lanes (we could use 4-element
load-lanes with u64 elements, which is a missing feature).

I do think the VLA interleaving scheme we produce is quite inefficient
(and the cost modeling agrees and would choose V4SI fixed-size regs).
