public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/115340] New: Loop/SLP vectorization possible inefficiency
@ 2024-06-04  7:46 rdapp at gcc dot gnu.org
From: rdapp at gcc dot gnu.org @ 2024-06-04  7:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340

            Bug ID: 115340
           Summary: Loop/SLP vectorization possible inefficiency
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
            Target: riscv

The following loop is heavily adjusted from x264 satd's last loop to
showcase a particular problem.  I'm not sure it's still representative of
x264, but it serves to illustrate the issue.  I used riscv as the target,
but avx512 and aarch64 look similar.

typedef unsigned int uint32_t;

void
foo (uint32_t **restrict tmp, uint32_t *restrict out)
{
    uint32_t sum = 0;
    for( int i = 0; i < 4; i++ )
    {
        out[i + 0] = tmp[0][i] + 1;
        out[i + 4] = tmp[1][i] + 1;
        out[i + 8] = tmp[2][i] + 1;
        out[i + 12] = tmp[3][i] + 1;
    }
}
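
Note that the sixteen stores together cover out[0..15] contiguously; an
interchanged form of the loop (my own sketch, just to make the access
pattern explicit) shows this directly:

void
foo_interchanged (uint32_t **restrict tmp, uint32_t *restrict out)
{
    /* Row j of tmp fills the contiguous chunk out[4*j .. 4*j+3],
       so the sixteen stores tile out[0..15] without gaps.  */
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++)
            out[4 * j + i] = tmp[j][i] + 1;
}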

We loop-vectorize it as if unrolled 4x:
  vect__5.6_70 = MEM <vector(4) unsigned int> [(uint32_t *)vectp.4_72];
  vect__7.7_68 = vect__5.6_70 + { 1, 1, 1, 1 };
  MEM <vector(4) unsigned int> [(uint32_t *)vectp_out.8_67] = vect__7.7_68;
  ...

On riscv:

        vsetivli        zero,4,e32,mf2,ta,ma
        vle32.v v1,0(a4)
        vadd.vi v1,v1,1
        vse32.v v1,0(a1)
        addi    a4,a1,16
        vle32.v v1,0(a2)
        vadd.vi v1,v1,1
        vse32.v v1,0(a4)
        addi    a4,a1,32
        vle32.v v1,0(a3)
        vadd.vi v1,v1,1
        vse32.v v1,0(a4)
        addi    a1,a1,48
        vle32.v v1,0(a5)
        vadd.vi v1,v1,1
        vse32.v v1,0(a1)

That's not ideal when a 64-byte vector can hold all values simultaneously
(and permuting four 16-byte vectors into one is acceptable).

Of course the exact sequence will depend on costs, but on riscv I would
expect code along these lines (similar to what clang produces):

        vsetivli        zero, 4, e32, mf2, ta, ma
        vle32.v v8, (a2)
        vle32.v v9, (a4)
        vle32.v v10, (a0)
        vle32.v v11, (a3)
        vsetivli        zero, 8, e32, mf2, ta, ma
        vslideup.vi     v9, v10, 4
        vslideup.vi     v8, v11, 4
        vsetivli        zero, 16, e32, m1, ta, ma
        vslideup.vi     v8, v9, 8
        vadd.vi v8, v8, 1
        vse32.v v8, (a1)

That's 4 loads, 3 permutes, one add, and one store.
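
For comparison, a hand-written rvv intrinsics version of that sequence (my
sketch, assuming VLEN >= 512 so that all 16 e32 elements fit into a single
m1 register, as in the asm above):

#include <stdint.h>
#include <riscv_vector.h>

void
foo_intrin (uint32_t **restrict tmp, uint32_t *restrict out)
{
    /* Four 4-element row loads.  */
    vuint32m1_t r0 = __riscv_vle32_v_u32m1 (tmp[0], 4);
    vuint32m1_t r1 = __riscv_vle32_v_u32m1 (tmp[1], 4);
    vuint32m1_t r2 = __riscv_vle32_v_u32m1 (tmp[2], 4);
    vuint32m1_t r3 = __riscv_vle32_v_u32m1 (tmp[3], 4);
    /* Three slideups concatenate the rows into one 16-element vector;
       elements below the slide offset are taken from the dest operand.  */
    vuint32m1_t lo = __riscv_vslideup_vx_u32m1 (r0, r1, 4, 8);
    vuint32m1_t hi = __riscv_vslideup_vx_u32m1 (r2, r3, 4, 8);
    vuint32m1_t v = __riscv_vslideup_vx_u32m1 (lo, hi, 8, 16);
    /* One add, one store.  */
    v = __riscv_vadd_vx_u32m1 (v, 1, 16);
    __riscv_vse32_v_u32m1 (out, v, 16);
}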

When disabling loop vectorization (-fno-tree-loop-vectorize), basic-block
SLP gets us halfway there but still loads each value separately:

  ...
  _17 = MEM[(uint32_tD.2756 *)_15 + 12B clique 1 base 0];
  _24 = MEM[(uint32_tD.2756 *)_22 + 12B clique 1 base 0];
  _69 = {_66, _81, _114, _5, _61, _86, _119, _10, _54, _93, _126, _17, _47,
_100, _133, _24};
  vect__64.5_68 = _69 + { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
  MEM <vector(16) unsigned intD.4> [(uint32_tD.2756 *)out_33(D) clique 1 base
1] = vect__64.5_68;

I suppose that's working as intended, but I'd rather have the grouped loads
done with vector loads and permuted into place (rather than scalar loads,
with each element "permuted" into the vector individually).

Now, with SLPification in progress, I suppose that goal is a bit more
realistic than before.  Richi, is there something that can be done for now
to get us closer?
I'm currently looking into your rguenth/load-perm branch; is that a
reasonable way to start?


* [Bug tree-optimization/115340] Loop/SLP vectorization possible inefficiency
From: rguenth at gcc dot gnu.org @ 2024-06-04  8:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-06-04
             Blocks|                            |53947
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that the data references (DRs) for the loads tmp[0][i] and
tmp[1][i] are not related - they are off different base pointers.  At the
moment we do not merge such unrelated "groups" (even though the loads are
not marked as grouped) into one SLP node.

The stores are not considered "grouped" because they have gaps.

With SLP-ification you'd get four instances and the same code-gen as now.

To do better we'd have to improve the store dataref analysis to see
that a vectorization factor of four would "close" the gaps, or more
generally support store groups with gaps.  Stores with gaps can be
handled by masking for example.
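
A tiny standalone check (my illustration of the arithmetic, not vectorizer
code): each store-group member writes with stride 4, leaving gaps of three
elements, but over a vectorization factor of four the accesses tile
out[0..15] exactly once each:

#include <stdio.h>

int
main (void)
{
  int covered[16] = { 0 };
  /* 4 iterations (the VF) x 4 group members at stride 4.  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      covered[i + 4 * j]++;
  /* Prints nothing: every element of out[0..15] is written exactly once.  */
  for (int k = 0; k < 16; k++)
    if (covered[k] != 1)
      printf ("gap at out[%d]\n", k);
  return 0;
}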

You get the store side handled when using -fno-tree-loop-vectorize to
get basic-block vectorization after unrolling the loop.  But you
still run into the issue that we do not combine loads from different
load groups during SLP discovery.  That's another angle you could attack:
during greedy discovery we do not consider splitting the store
but instead build the loads from scalars, which is of course less than
optimal.  We also do not re-process the built vector CTORs for
further basic-block vectorization opportunities.
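
For reference, the basic-block vectorization mentioned above should be
reproducible with something like (the exact -march/-mabi flags are my
assumption of a typical rvv setup, not taken from this thread):

gcc -O3 -march=rv64gcv -mabi=lp64d -fno-tree-loop-vectorize -S foo.c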


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

