[Bug tree-optimization/98542] New: Redundant loads in vectorised loop

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/98542] New: Redundant loads in vectorised loop
@ 2021-01-05 16:17 rsandifo at gcc dot gnu.org
  2021-01-06  9:01 ` [Bug tree-optimization/98542] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-01-05 16:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542

            Bug ID: 98542
           Summary: Redundant loads in vectorised loop
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

For the testcase below, based loosely on one from 450.soplex,
the vectoriser loads the v and i fields twice:

struct s { double v; long i; };
double
f (struct s *x, double *y, int n)
{
  double res = 0;
  for (int i = 0; i < n; ++i)
    res += x[i].v * y[x[i].i];
  return res;
}

SVE loop:

        add     x5, x0, 8
        ...
.L4:
        ld2d    {z2.d - z3.d}, p0/z, [x5, x4, lsl 3]
        ld2d    {z4.d - z5.d}, p0/z, [x0, x4, lsl 3]
        ld1d    z2.d, p0/z, [x1, z2.d, lsl 3]
        incd    x3
        fmla    z0.d, p0/m, z4.d, z2.d
        incw    x4
        whilelo p0.d, w3, w2
        b.any   .L4

where z5 from the second ld2d is the same as z2
from the first ld2d.

In the soplex example, "i" is instead a 32-bit value,
but I guess we need to fix this case first.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/98542] Redundant loads in vectorised loop
  2021-01-05 16:17 [Bug tree-optimization/98542] New: Redundant loads in vectorised loop rsandifo at gcc dot gnu.org
@ 2021-01-06  9:01 ` rguenth at gcc dot gnu.org
  2021-01-06 13:16 ` rsandifo at gcc dot gnu.org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-06  9:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
What do you mean with "twice"?  We seem to do interleaving here (on x86_64)
but since 'v' and 'i' have different types they do not belong to the same
interleaving chain (but we have two that "interleave" - heh).

x.c:6:21: note:   === vect_analyze_data_ref_accesses ===
x.c:6:21: note:   Detected single element interleaving _3->v step 16
x.c:6:21: note:   Detected single element interleaving _3->i step 16

so if that's the main complaint then a testcase w/o gather is probably
more relevant at first?  For x86 the two loads are offsetted by one
element, for your asm that looks like to be the same (x5 vs x0).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/98542] Redundant loads in vectorised loop
  2021-01-05 16:17 [Bug tree-optimization/98542] New: Redundant loads in vectorised loop rsandifo at gcc dot gnu.org
  2021-01-06  9:01 ` [Bug tree-optimization/98542] " rguenth at gcc dot gnu.org
@ 2021-01-06 13:16 ` rsandifo at gcc dot gnu.org
  2021-01-07  7:38 ` rguenther at suse dot de
  2021-01-07 10:44 ` rsandifo at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-01-06 13:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> What do you mean with "twice"?  We seem to do interleaving here (on x86_64)
> but since 'v' and 'i' have different types they do not belong to the same
> interleaving chain (but we have two that "interleave" - heh).
> 
> x.c:6:21: note:   === vect_analyze_data_ref_accesses ===
> x.c:6:21: note:   Detected single element interleaving _3->v step 16
> x.c:6:21: note:   Detected single element interleaving _3->i step 16
> 
> so if that's the main complaint then a testcase w/o gather is probably
> more relevant at first?  For x86 the two loads are offsetted by one
> element, for your asm that looks like to be the same (x5 vs x0).
Yeah, that's the main complaint.  Because we implement the interleaving
with load-lanes, the load of the “v” vector provides the corresponding
“i” vector as a byproduct.  But rather than use that “i” vector,
we load the “i” fields a second time, which again provides a vector
of the following “v” fields.  In other words, we effectively do
four loads and four permutes in order to get two vectors.

This is in contrast to targets that use separate loads and permutes,
where we only emit permutes for the vectors that we use, and where
we stand a chance of CSEing four loads into three.

It also means that we force peeling for gaps when it shouldn't be
necessary.

With the follow-on mentioned (“i” being 32-bits rather than 64) we'd
still want to treat the structure access as a single group, even though
the fields are different sizes.

Using gather seems fine to me, but then I'm biased :-)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/98542] Redundant loads in vectorised loop
  2021-01-05 16:17 [Bug tree-optimization/98542] New: Redundant loads in vectorised loop rsandifo at gcc dot gnu.org
  2021-01-06  9:01 ` [Bug tree-optimization/98542] " rguenth at gcc dot gnu.org
  2021-01-06 13:16 ` rsandifo at gcc dot gnu.org
@ 2021-01-07  7:38 ` rguenther at suse dot de
  2021-01-07 10:44 ` rsandifo at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: rguenther at suse dot de @ 2021-01-07  7:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 6 Jan 2021, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542
> 
> --- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> With the follow-on mentioned (“i” being 32-bits rather than 64) we'd
> still want to treat the structure access as a single group, even though
> the fields are different sizes.

Supporting that is going to be tricky though.  Presumably this only
works for { int, double }, not for { int, int, double } and accessing
the 2nd int (or it all depends on endianess?).

Anyway, one thing to look at is classify SLP load + permute as
load-lanes and think of how to represent this in the SLP graph.
A load-lanes is similar to a VEC_PERM node but it has a single
input and multiple outputs (in contrast to VEC_PERM with multiple
inputs and a single output).  Guess I'd call it SCATTER and we'd
have a load-permutation style specification of the output lanes
(it simplifies things if we constrain it to scatter all incoming
scalar lanes plus have the same number of output lanes per output).
Multiple outputs do not fit the SLP node style very well
(SCALAR_STMTS/DEFs is not meaningful, etc.), so it really needs some
more thought on the representation side.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/98542] Redundant loads in vectorised loop
  2021-01-05 16:17 [Bug tree-optimization/98542] New: Redundant loads in vectorised loop rsandifo at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-01-07  7:38 ` rguenther at suse dot de
@ 2021-01-07 10:44 ` rsandifo at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-01-07 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542

--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #3)
> On Wed, 6 Jan 2021, rsandifo at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98542
> > 
> > --- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> > With the follow-on mentioned (“i” being 32-bits rather than 64) we'd
> > still want to treat the structure access as a single group, even though
> > the fields are different sizes.
> 
> Supporting that is going to be tricky though.  Presumably this only
> works for { int, double }, not for { int, int, double } and accessing
> the 2nd int (or it all depends on endianess?).
It should work for both in the SVE example.  The idea is that the
int field will be loaded into an unpacked vector, with each 32-bit
element being stored in a 64-bit container.  We then need to extend
the int (if the int starts off in the low half) or right-shift the
container (if the int starts off in the high half).  The nice thing
about the gather example is that the extend case can be folded into
the gather itself.

Like you say, the choice between extending and shifting will depend
on which int we need and on the endianness.  All four combinations
should be doable though.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-01-07 10:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-05 16:17 [Bug tree-optimization/98542] New: Redundant loads in vectorised loop rsandifo at gcc dot gnu.org
2021-01-06  9:01 ` [Bug tree-optimization/98542] " rguenth at gcc dot gnu.org
2021-01-06 13:16 ` rsandifo at gcc dot gnu.org
2021-01-07  7:38 ` rguenther at suse dot de
2021-01-07 10:44 ` rsandifo at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).