[Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types
@ 2021-03-23 13:22 martin@mpa-garching.mpg.de
  2021-03-23 13:23 ` [Bug c++/99728] " martin@mpa-garching.mpg.de
                   ` (20 more replies)
  0 siblings, 21 replies; 22+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-03-23 13:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

            Bug ID: 99728
           Summary: code pessimization when using wrapper classes around
                    SIMD types
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: martin@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 50456
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50456&action=edit
test case

I originally reported this at
https://gcc.gnu.org/pipermail/gcc-help/2021-March/139976.html, but I'm now
fairly confident that this warrants a PR.

The test case needs to be processed on x86_64 with the command

g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc

Code for two functions will be generated, and I would expect that the generated
assembler for both should be identical. However, for the version using the
wrapper class around __m256d, g++ does not seem to recognize the dead stores at
the end of the loop and leaves them inside the loop body instead of moving them
after the final jump instruction of the loop, which reduces performance
considerably.

clang++ generates nearly identical code for both functions and manages to
remove the dead stores, so I think that g++ might be able to do better here and
is not pessimizing the code due to some C++ intricacies.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
@ 2021-03-23 13:23 ` martin@mpa-garching.mpg.de
  2021-03-23 13:50 ` martin@mpa-garching.mpg.de
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-03-23 13:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #1 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Created attachment 50457
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50457&action=edit
generated assembler

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
  2021-03-23 13:23 ` [Bug c++/99728] " martin@mpa-garching.mpg.de
@ 2021-03-23 13:50 ` martin@mpa-garching.mpg.de
  2021-03-23 13:56 ` amonakov at gcc dot gnu.org
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-03-23 13:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #2 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Created attachment 50458
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50458&action=edit
additional test case by Alexander Monakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
  2021-03-23 13:23 ` [Bug c++/99728] " martin@mpa-garching.mpg.de
  2021-03-23 13:50 ` martin@mpa-garching.mpg.de
@ 2021-03-23 13:56 ` amonakov at gcc dot gnu.org
  2021-03-23 14:08 ` kretz at kde dot org
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: amonakov at gcc dot gnu.org @ 2021-03-23 13:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
My testcase changes __m256d to __v2df to avoid __may_alias__, and changes
overloaded operators to make Tvsimple objects passed by value everywhere (in an
attempt to simplify GIMPLE IR).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (2 preceding siblings ...)
  2021-03-23 13:56 ` amonakov at gcc dot gnu.org
@ 2021-03-23 14:08 ` kretz at kde dot org
  2021-03-23 14:28 ` martin@mpa-garching.mpg.de
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: kretz at kde dot org @ 2021-03-23 14:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #4 from Matthias Kretz (Vir) <kretz at kde dot org> ---
FWIW, using std::experimental::native_simd<double> also does not hoist the
stores out of the loop. However, if you pass d by value and return d, the issue
goes away. So I guess this is an aliasing pessimization. Even though you added
__restrict__. In any case __m256 has the problem that it is declared with the
may_alias attribute. I recommend to just never use __m256 unless you have no
other choice. Note that `s0data_s<__m256d>` warns "ignoring attributes on
template argument '__m256d' [-Wignored-attributes]", meaning it drops the
may_alias attribute.

Reduced test case (https://godbolt.org/z/PW98Wsfoj):
#include <immintrin.h>

struct Tvsimple {
  __m256d v;
};

struct s0data_s {
  Tvsimple lam1, lam2;
};
struct s0data_s_intrin {
  __m256d lam1, lam2;
};

template <typename T>
void foo(T &__restrict__ d, size_t l, size_t lmax) {
  while (l <= lmax) {
    d.lam1 = d.lam2;
    l += 2;
  }
}

// hoists load out of the loop but loops over the store
template void foo<>(s0data_s &__restrict__ d, size_t l, size_t lmax);

// turns loop into a single branch
template void foo<>(s0data_s_intrin &__restrict__ d, size_t l, size_t lmax);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (3 preceding siblings ...)
  2021-03-23 14:08 ` kretz at kde dot org
@ 2021-03-23 14:28 ` martin@mpa-garching.mpg.de
  2021-03-23 14:32 ` kretz at kde dot org
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-03-23 14:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #5 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
(In reply to Matthias Kretz (Vir) from comment #4)
> FWIW, using std::experimental::native_simd<double> also does not hoist the
> stores out of the loop. However, if you pass d by value and return d, the
> issue goes away. So I guess this is an aliasing pessimization.

This is an interesting data point ... In my first test case (attached to
https://gcc.gnu.org/pipermail/gcc-help/2021-March/139976.html), I explicitly
make a local copy of d and copy back at the end of the function, and this
didn't help. Strange.

> Even though
> you added __restrict__. In any case __m256 has the problem that it is
> declared with the may_alias attribute. I recommend to just never use __m256
> unless you have no other choice.

I guess I need it for unaligned loads/stores, correct? Otherwise __v4df should
work everywhere.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (4 preceding siblings ...)
  2021-03-23 14:28 ` martin@mpa-garching.mpg.de
@ 2021-03-23 14:32 ` kretz at kde dot org
  2021-03-23 14:36 ` martin@mpa-garching.mpg.de
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: kretz at kde dot org @ 2021-03-23 14:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #6 from Matthias Kretz (Vir) <kretz at kde dot org> ---
> I guess I need it for unaligned loads/stores, correct? Otherwise __v4df should work everywhere.

1. You can freely reinterpret_cast by value between all the different
[[gnu::vector_size(N)]] types (with equal N).

2. I use std::memcpy for loads and stores. Use __builtin_assume_aligned to get
aligned loads & stores.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (5 preceding siblings ...)
  2021-03-23 14:32 ` kretz at kde dot org
@ 2021-03-23 14:36 ` martin@mpa-garching.mpg.de
  2021-03-23 15:00 ` redi at gcc dot gnu.org
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-03-23 14:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #7 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Thanks!

(BTW, I'm aware your code and will immediately switch to it once it lands in
gcc! But for the time being I try to make do with my poor man's version to
avoid the external dependency.)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug c++/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (6 preceding siblings ...)
  2021-03-23 14:36 ` martin@mpa-garching.mpg.de
@ 2021-03-23 15:00 ` redi at gcc dot gnu.org
  2021-03-24  8:52 ` [Bug tree-optimization/99728] " rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: redi at gcc dot gnu.org @ 2021-03-23 15:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #8 from Jonathan Wakely <redi at gcc dot gnu.org> ---
It's landed r11-6935 aka g:2bcceb6fc59fcdaf51006d4fcfc71c2d26761396

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (7 preceding siblings ...)
  2021-03-23 15:00 ` redi at gcc dot gnu.org
@ 2021-03-24  8:52 ` rguenth at gcc dot gnu.org
  2021-03-24  9:15 ` kretz at kde dot org
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-24  8:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
          Component|c++                         |tree-optimization
   Last reconfirmed|                            |2021-03-24

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue for store-motion is that we see an aggregate copy:

Unanalyzed memory reference 0: *d_28(D).lam1 = *d_28(D).lam2;

  __MEM <struct s0data_s> (d_28(D)).lam1 = __MEM <struct s0data_s>
(d_28(D)).lam2;
  __MEM <struct s0data_s> (d_28(D)).lam2.v = _38;
  il_36 = il_74 + 1ul;

there is another PR about those Unanalyzed refs preventing LIM/SM but then
getting rid of those aggregate copies would be nice as well since many
passes do not like them.  I suppose 'vtype' in this case has a FP mode
which prevents us from simplistic folding of this (unless we'd always
expand those to FP load/store sequences).

Indeed, we're copying

    type <record_type 0x7ffff437dbd0 Tvsimple sizes-gimplified
needs-constructing cxx-odr-p type_1 type_5 type_6 V4DF
        size <integer_cst 0x7ffff658b228 constant 256>
        unit-size <integer_cst 0x7ffff658b318 constant 32>
        align:256 warn_if_not_align:0 symtab:0 alias-set 1 canonical-type
0x7ffff437dbd0
        fields <function_decl 0x7ffff4391400 operator= type <method_type
0x7ffff43933f0>
            public external autoinline decl_3 QI t.C:3:8 align:16
warn_if_not_align:0 context <record_type 0x7ffff437dbd0 Tvsimple>
            full-name "constexpr Tvsimple& Tvsimple::operator=(Tvsimple&&)
noexcept (<uninstantiated>)"
            not-really-extern chain <function_decl 0x7ffff4391300 operator=>>
context <translation_unit_decl 0x7ffff6578168 t.C>
        full-name "struct Tvsimple"
        needs-constructor X() X(constX&) this=(X&) n_parents=0 use_template=0
interface-unknown
        pointer_to_this <pointer_type 0x7ffff437dd20> reference_to_this
<reference_type 0x7ffff437d690> chain <type_decl 0x7ffff4a554c0 Tvsimple>>

OK, so for a simple

struct X { double x; };

void foo (struct X *x, struct X *y)
{
  *x = *y;
}

we do generate x87 FP load/store insns and do not transfer bytes.  Probably
OK from a C language perspective but questionable on the GIMPLE side
(we've been there before).

So one thing we can experiment with is to gimplify those aggregate
copies to register load/store when the aggregates have been assigned
non-BLKmode by the target.  This might of course confuse SRA which
means that SRA itself might be a better place to perform this
optimization.  [mind struct { double; double; } on x86 gets TImode
for example]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (8 preceding siblings ...)
  2021-03-24  8:52 ` [Bug tree-optimization/99728] " rguenth at gcc dot gnu.org
@ 2021-03-24  9:15 ` kretz at kde dot org
  2021-03-24  9:26 ` rguenther at suse dot de
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: kretz at kde dot org @ 2021-03-24  9:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #10 from Matthias Kretz (Vir) <kretz at kde dot org> ---
Is this the same issue:

struct A {
  double v;
};

struct B {
  double v;
  B& operator=(const B& rhs) {
      v = rhs.v;
      return *this;
  }
};

// 10 loads & stores
void f(A& a, const A& b) {
  for (int i = 0; i < 10; ++i) a = b;
}

// 1 load & store
void f(B& a, const B& b) {
  for (int i = 0; i < 10; ++i) a = b;
}

I.e. by turning the aggregate assignment into an explicit assignment of the
fundamental (including vector) type, the issue can be worked around.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (9 preceding siblings ...)
  2021-03-24  9:15 ` kretz at kde dot org
@ 2021-03-24  9:26 ` rguenther at suse dot de
  2021-07-02  8:27 ` martin@mpa-garching.mpg.de
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenther at suse dot de @ 2021-03-24  9:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 24 Mar 2021, kretz at kde dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728
> 
> --- Comment #10 from Matthias Kretz (Vir) <kretz at kde dot org> ---
> Is this the same issue:
> 
> struct A {
>   double v;
> };
> 
> struct B {
>   double v;
>   B& operator=(const B& rhs) {
>       v = rhs.v;
>       return *this;
>   }
> };
> 
> // 10 loads & stores
> void f(A& a, const A& b) {
>   for (int i = 0; i < 10; ++i) a = b;
> }
> 
> // 1 load & store
> void f(B& a, const B& b) {
>   for (int i = 0; i < 10; ++i) a = b;
> }
> 
> I.e. by turning the aggregate assignment into an explicit assignment of the
> fundamental (including vector) type, the issue can be worked around.

Yes, that's a good cut-down of the issue at hand.  Using
memcpy () for the assignment also tends to work since we inline
that using a load/store sequence if we can.

Note this might also point at a solution in the C++ FE and it's
default-generated copy/assign operations to copy the ultimate
single element there rather than the containing aggregate.
(one could argue that this would be premature optimization of course)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (10 preceding siblings ...)
  2021-03-24  9:26 ` rguenther at suse dot de
@ 2021-07-02  8:27 ` martin@mpa-garching.mpg.de
  2021-07-02  9:52 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-07-02  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #12 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Any hope of addressing this for gcc 12?
I have a real-world test case where this effect causes roughly 15-20% slowdown,
and I expect that with the wider availability of std::simd types more people
will encounter this soon. (And the workaround pretty much defeats the purpose
of having such convenient functionality as std::simd in the first place...)

I'm happy to help out in any way I can, but unfortunately I'm more of a
numerics guy than a compiler expert.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (11 preceding siblings ...)
  2021-07-02  8:27 ` martin@mpa-garching.mpg.de
@ 2021-07-02  9:52 ` rguenth at gcc dot gnu.org
  2021-07-02  9:52 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-02  9:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 51100
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51100&action=edit
hack

The attached tries to rewrite the aggregate assignments into a load/store
sequence producing

  _33 = VIEW_CONVERT_EXPR<vector(32) unsigned char>(d_42(D)->lam2D.32702);
  VIEW_CONVERT_EXPR<vector(32) unsigned char>(d_42(D)->lam1D.32701) = _33;

from originally

  d_42(D)->lam1D.32701 = d_42(D)->lam2D.32702;

that's a bit ugly but still falls short of doing the full store-motion but
at least now moves all but the above store:

...
  _35 = _36 + val$v_63;
  _30 = VIEW_CONVERT_EXPR<vector(32) unsigned char>(_56);
  VIEW_CONVERT_EXPR<vector(32) unsigned char>(*d_28(D).lam1D.32701) = _30;
  *d_28(D).lam2D.32702.vD.32579 = _35;
  il_33 = il_69 + 1;
  l_34 = l_68 + 2;
  if (lmax_26(D) >= l_34)
    goto <bb 6>; [89.00%]
  else
    goto <bb 7>; [11.00%]

  <bb 6> [local count: 850510901]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 105119324]:
  # _84 = PHI <_30(3)>
  # _85 = PHI <_35(3)>
  # d__v_lsm.37_86 = PHI <d__v_lsm.37_74(3)>
  # d__v_lsm.38_87 = PHI <d__v_lsm.38_75(3)>
  # d__v_lsm.39_88 = PHI <d__v_lsm.39_76(3)>
  # d__v_lsm.40_89 = PHI <d__v_lsm.40_77(3)>
  MEM[(struct TvsimpleD.32577 *)d_28(D) + 192B].vD.32579 = d__v_lsm.37_86;
  MEM[(struct TvsimpleD.32577 *)d_28(D) + 224B].vD.32579 = d__v_lsm.38_87;
  MEM[(struct TvsimpleD.32577 *)d_28(D) + 256B].vD.32579 = d__v_lsm.39_88;
  MEM[(struct TvsimpleD.32577 *)d_28(D) + 288B].vD.32579 = d__v_lsm.40_89;
  VIEW_CONVERT_EXPR<vector(32) unsigned char>(*d_28(D).lam1D.32701) = _84;
  *d_28(D).lam2D.32702.vD.32579 = _85;

the dependence analysis of store-motion considers the last stores (ref 14 and
15) dependent:

Querying dependency of refs 2 and 15: dependent.
Querying RAW dependencies of ref 2 in loop 1: dependent
Querying dependency of refs 13 and 14: dependent.
Querying RAW dependencies of ref 13 in loop 1: dependent
Querying dependency of refs 14 and 13: dependent.
Querying SM WAR dependencies of ref 14 in loop 1: dependent
Querying dependency of refs 15 and 2: dependent.
Querying SM WAR dependencies of ref 15 in loop 1: dependent

That's the usual issue of LIM needing to identify "identical" refs
but appearanlty failing to do so for:

Memory reference 2: MEM[(const struct Tvsimple *)d_28(D) + 128B].v
Memory reference 15: *d_28(D).lam2.v

which is because we don't factor the MEM_REF contained offset.  I'll see
to do that independently of the "hack" (which I'm not sure is an appropriate
way of avoiding to change LIM to deal with aggregates ...)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (12 preceding siblings ...)
  2021-07-02  9:52 ` rguenth at gcc dot gnu.org
@ 2021-07-02  9:52 ` rguenth at gcc dot gnu.org
  2021-07-02 10:53 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-02  9:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (13 preceding siblings ...)
  2021-07-02  9:52 ` rguenth at gcc dot gnu.org
@ 2021-07-02 10:53 ` rguenth at gcc dot gnu.org
  2021-07-02 11:55 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-02 10:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
With the hack and PR101293 fixed we're still left with

Memory reference 13: *d_28(D).lam1.v
Memory reference 14: VIEW_CONVERT_EXPR<vector(32) unsigned char>(*d_28(D).lam1)

because the access types are incompatible (v4df vs v32qi).  Possible to hack
around but the initial hack creating the V_C_E has issues enough (though
handling the aggregate copy in LIM will inevitably run into incomaptible
types compared to other refs as well).

How I love the fun of C++ and all the abstraction when using it ... :/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (14 preceding siblings ...)
  2021-07-02 10:53 ` rguenth at gcc dot gnu.org
@ 2021-07-02 11:55 ` rguenth at gcc dot gnu.org
  2021-07-07 11:49 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-02 11:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728
Bug 99728 depends on bug 101293, which changed state.

Bug 101293 Summary: LIM ref canonicalization incomplete
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101293

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (15 preceding siblings ...)
  2021-07-02 11:55 ` rguenth at gcc dot gnu.org
@ 2021-07-07 11:49 ` cvs-commit at gcc dot gnu.org
  2021-07-07 12:09 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-07-07 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #15 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:9f34b780b0461ec7b2b2defe96e44ab616ea2aa3

commit r12-2097-g9f34b780b0461ec7b2b2defe96e44ab616ea2aa3
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jul 7 11:41:03 2021 +0200

    tree-optimization/99728 - improve LIM for loops with aggregate copies

    This improves LIM by recording aggregate copies for disambiguation
    purposes instead of as UNANALYZABLE_MEM which will prevent any
    invariant or store motion across it.  This allows four of the six
    references in the loop of the testcase to be promoted.

    2021-07-07  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/99728
            * tree-ssa-loop-im.c (gather_mem_refs_stmt): Record
            aggregate copies.
            (mem_refs_may_alias_p): Add assert we handled aggregate
            copies elsewhere.
            (sm_seq_valid_bb): Give up when running into aggregate copies.
            (ref_indep_loop_p): Handle aggregate copies as never
            being invariant themselves but allow other refs to be
            disambiguated against them.
            (can_sm_ref_p): Do not try to apply store-motion to aggregate
            copies.

            * g++.dg/opt/pr99728.C: New testcase.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (16 preceding siblings ...)
  2021-07-07 11:49 ` cvs-commit at gcc dot gnu.org
@ 2021-07-07 12:09 ` rguenth at gcc dot gnu.org
  2021-07-07 13:59 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-07 12:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
This has the situation improved but the aggregate copy and one dependent store
are still not moved (so not fixed).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (17 preceding siblings ...)
  2021-07-07 12:09 ` rguenth at gcc dot gnu.org
@ 2021-07-07 13:59 ` rguenth at gcc dot gnu.org
  2021-07-14 11:37 ` rguenth at gcc dot gnu.org
  2021-07-14 13:04 ` rguenth at gcc dot gnu.org
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-07 13:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jason at gcc dot gnu.org

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we have

  val$v_62 = *d_28(D).lam1D.32701.vD.32579;
  *d_28(D).lam1D.32701 = *d_28(D).lam2D.32702;
  *d_28(D).lam2D.32702.vD.32579 = _34;

I believe that the SRA pass has the best analysis capabilities to eventually
decompose aggregate copies into register pieces (with cost considerations).
In particular it knows (but without flow info) what kind of types
sub-accesses use.  Since we want the aggregate copy replaced with pieces
that match the rest of the accesses (here because of LIMs restrictions).

In particular we'd like to use 'vector double' typed accesses here, sth
the middle-end usually avoids for block-copies which aggregate copies
are to the middle-end.

That said, it would be _much_ easier if the frontend with its language specific
semantic knowledge could avoid doing block-copies for such simple wrappers
and instead perform (recursively) memberwise copy (for single member
aggregates).

Of course the simple fix in source is to add

  Tvsimple &operator=(const Tvsimple &other) { v = other.v; return *this;}

producing optimal code.  Jason - would you consider this premature
"optimization" in the C++ frontend?  It doesn't seem that there's
a operator= synthesized, instead we directly emit

   <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (d->lam1 = *(const struct Tvsimple &) &d->lam2) >>>>>;

from

    d.lam1 = d.lam2;

from build_over_call which has a series of optimizations at

  else if (DECL_ASSIGNMENT_OPERATOR_P (fn)
           && DECL_OVERLOADED_OPERATOR_IS (fn, NOP_EXPR)
           && trivial_fn_p (fn))
    {
...
      if (is_really_empty_class (type, /*ignore_vptr*/true))
        {
          /* Avoid copying empty classes.  */
          val = build2 (COMPOUND_EXPR, type, arg, to);
          suppress_warning (val, OPT_Wunused);
        }
      else if (tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (as_base)))
        {
          if (is_std_init_list (type)
              && conv_binds_ref_to_prvalue (convs[1]))
            warning_at (loc, OPT_Winit_list_lifetime,
                        "assignment from temporary %<initializer_list%> does "
                        "not extend the lifetime of the underlying array");
          arg = cp_build_fold_indirect_ref (arg);
          val = build2 (MODIFY_EXPR, TREE_TYPE (to), to, arg);

so we handle empty classes, maybe we can also handle single data-member
classes (not sure how to exactly test for this - walking TYPE_FIELDs
repeatedly for each considered assignment would be slow I guess).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (18 preceding siblings ...)
  2021-07-07 13:59 ` rguenth at gcc dot gnu.org
@ 2021-07-14 11:37 ` rguenth at gcc dot gnu.org
  2021-07-14 13:04 ` rguenth at gcc dot gnu.org
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-14 11:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
OTOH we call is_really_empty_class which for not trivially empty classes ends
up
walking all non-FIELD_DECL fields as well:

      for (field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
        if (TREE_CODE (field) == FIELD_DECL
            && !DECL_ARTIFICIAL (field)
            /* An unnamed bit-field is not a data member.  */
            && !DECL_UNNAMED_BIT_FIELD (field)
            && !is_really_empty_class (TREE_TYPE (field), ignore_vptr))
          return false;

So the following performs single-member copying optimization (also avoiding
the as-base special-casing when the copy involves a single member).  The
copy_single_member function is based on is_really_empty_class which could
be refactored to also compute this as alternate output.  In principle the
single member to copy could be inside an aggregate member/base so as-is
this doesn't always avoid an aggregate copy if the single member in 'type'
is an aggregate.  It fixes the testcase fully.

diff --git a/gcc/cp/call.c b/gcc/cp/call.c
index e4df72ec1a3..31eea7c935b 100644
--- a/gcc/cp/call.c
+++ b/gcc/cp/call.c
@@ -8881,6 +8881,41 @@ immediate_invocation_p (tree fn, int nargs)
          && (nargs > 1 || !source_location_current_p (fn)));
 }

+/* Return the FIELD_DECL of the single member of TYPE that needs to be copied
+   by copy assignment if any or NULL_TREE if there are none or multiple.  */
+
+static tree
+copy_single_member (tree type)
+{
+  if (!CLASS_TYPE_P (type))
+    return NULL_TREE;
+
+  tree binfo;
+  tree base_binfo;
+  int i;
+  for (binfo = TYPE_BINFO (type), i = 0;
+       BINFO_BASE_ITERATE (binfo, i, base_binfo); ++i)
+    if (!is_really_empty_class (BINFO_TYPE (base_binfo), true))
+      return NULL_TREE;
+
+  tree single_field = NULL_TREE;
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+    {
+      if (TREE_CODE (field) != FIELD_DECL)
+       continue;
+      if (!(!DECL_ARTIFICIAL (field)
+           /* An unnamed bit-field is not a data member.  */
+           && !DECL_UNNAMED_BIT_FIELD (field)
+           && !is_really_empty_class (TREE_TYPE (field), true)))
+       continue;
+      if (single_field)
+       return NULL_TREE;
+      single_field = field;
+    }
+
+  return single_field;
+}
+
 /* Subroutine of the various build_*_call functions.  Overload resolution
    has chosen a winning candidate CAND; build up a CALL_EXPR accordingly.
    ARGS is a TREE_LIST of the unconverted arguments to the call.  FLAGS is a
@@ -9501,6 +9536,16 @@ build_over_call (struct z_candidate *cand, int flags,
tsubst_flags_t complain)
          val = build2 (COMPOUND_EXPR, type, arg, to);
          suppress_warning (val, OPT_Wunused);
        }
+      else if (tree field = copy_single_member (type))
+       {
+         arg = cp_build_fold_indirect_ref (arg);
+         tree t = build2 (MODIFY_EXPR, TREE_TYPE (field),
+                          build3 (COMPONENT_REF, TREE_TYPE (field),
+                                  to, field, NULL_TREE),
+                          build3 (COMPONENT_REF, TREE_TYPE (field),
+                                  arg, field, NULL_TREE));
+         val = build2 (COMPOUND_EXPR, type, t, to);
+         suppress_warning (val, OPT_Wunused);
+       }
       else if (tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (as_base)))
        {
          if (is_std_init_list (type)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug tree-optimization/99728] code pessimization when using wrapper classes around SIMD types
  2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
                   ` (19 preceding siblings ...)
  2021-07-14 11:37 ` rguenth at gcc dot gnu.org
@ 2021-07-14 13:04 ` rguenth at gcc dot gnu.org
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-14 13:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 51150
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51150&action=edit
patch

This is what I bootstrapped & tested on x86_64-unknown-linux-gnu.  It causes

FAIL: g++.dg/warn/Wparentheses-22.C  -std=gnu++14 correct warning (test for
warnings, line 104)
FAIL: g++.dg/warn/Wparentheses-22.C  -std=gnu++14 correct warning (test for
warnings, line 33)
...
FAIL: g++.dg/warn/Wparentheses-23.C  -std=gnu++14  (test for warnings, line
117)
FAIL: g++.dg/warn/Wparentheses-23.C  -std=gnu++14  (test for warnings, line
118)
...

interfering with the intent to warn about code like

  if (a = b) // { dg-warning "assignment" "correct warning" }
    foo (0);

where a and b are aggregates convertible to bool.  We end up generating
something that acts as a valid "fix" for the diagnostic:

  if ((a = b), a)

similarly we'd beforehand not diagnose the case of an empty aggregate
assignment not wrapped in parentheses.

Adjusting the warning code to also look for the COMPOUND_EXPR case
wouldn't be valid if we'd not somehow mark those "fake" COMPOUND_EXPRs
in a special way.

Not sure what to do here.  Maybe simply ignore the lost diagnostic and
XFAIL the testcase or adjust it to have a second member in the aggregate
used.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-07-14 13:04 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-23 13:22 [Bug c++/99728] New: code pessimization when using wrapper classes around SIMD types martin@mpa-garching.mpg.de
2021-03-23 13:23 ` [Bug c++/99728] " martin@mpa-garching.mpg.de
2021-03-23 13:50 ` martin@mpa-garching.mpg.de
2021-03-23 13:56 ` amonakov at gcc dot gnu.org
2021-03-23 14:08 ` kretz at kde dot org
2021-03-23 14:28 ` martin@mpa-garching.mpg.de
2021-03-23 14:32 ` kretz at kde dot org
2021-03-23 14:36 ` martin@mpa-garching.mpg.de
2021-03-23 15:00 ` redi at gcc dot gnu.org
2021-03-24  8:52 ` [Bug tree-optimization/99728] " rguenth at gcc dot gnu.org
2021-03-24  9:15 ` kretz at kde dot org
2021-03-24  9:26 ` rguenther at suse dot de
2021-07-02  8:27 ` martin@mpa-garching.mpg.de
2021-07-02  9:52 ` rguenth at gcc dot gnu.org
2021-07-02  9:52 ` rguenth at gcc dot gnu.org
2021-07-02 10:53 ` rguenth at gcc dot gnu.org
2021-07-02 11:55 ` rguenth at gcc dot gnu.org
2021-07-07 11:49 ` cvs-commit at gcc dot gnu.org
2021-07-07 12:09 ` rguenth at gcc dot gnu.org
2021-07-07 13:59 ` rguenth at gcc dot gnu.org
2021-07-14 11:37 ` rguenth at gcc dot gnu.org
2021-07-14 13:04 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).