* g++: Suboptimal code generation for simple wrapper class around vector data type
@ 2021-03-22 14:34 Martin Reinecke
2021-03-22 16:42 ` Martin Reinecke
2021-03-23 10:32 ` Martin Reinecke
0 siblings, 2 replies; 5+ messages in thread
From: Martin Reinecke @ 2021-03-22 14:34 UTC
To: gcc-help
Hi,
the attached test case is the (slightly simplified) hot loop from a
library for spherical harmonic transforms.
This code uses explicit vectorization, and I try to use simple wrapper
classes around the primitive vector types (like __m256d) to simplify
operations like initialization with a scalar etc.
However, it seems that using the wrapper type inside the critical loop
causes g++ to produce suboptimal code. This can be seen by running
g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc
and inspecting the generated assembler code (I'm using gcc 10.2.1).
The version where I use the wrapper type even in the hot loop (i.e.
"foo<Tvsimple, 2>") has a few unnecessary "vmovapd" instructions before
the end of the loop body, which are missing in the version where I cast
to __m256d before doing the heavy computation (i.e. "foo<__m256d,2>").
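The idea of dropping to the raw vector type before the heavy computation can
be shown in isolation. What follows is only a sketch under assumptions, not
code from the library: the names "v4d", "Vwrap", "State" and "accumulate" are
invented for illustration, and a GCC/clang "vector_size(32)" typedef stands in
for __m256d so that the snippet also builds without -mavx. Whether a particular
g++ version then keeps everything in registers would still have to be confirmed
by reading the generated assembler.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for __m256d (assumption: GCC/clang vector extension, 4 doubles).
typedef double v4d __attribute__((vector_size(32)));

// Hypothetical minimal wrapper and state struct, for illustration only.
struct Vwrap { v4d v; };
struct State { Vwrap acc, lam; };

// Computes acc += lam * c[i] for i in [0, n), lane by lane.
// The wrapper is unwrapped once before the loop and re-wrapped once after it,
// so the loop body itself only ever sees the raw vector type.
void accumulate(State &s, const double *c, std::size_t n)
{
  assert(c != nullptr || n == 0);
  v4d acc = s.acc.v;        // unwrap before the loop
  const v4d lam = s.lam.v;
  for (std::size_t i = 0; i < n; ++i)
  {
    const v4d ci = {c[i], c[i], c[i], c[i]};  // broadcast the scalar
    acc += lam * ci;
  }
  s.acc.v = acc;            // write back after the loop
}
```

The wrapper is touched exactly once on entry and once on exit, which is the
same pattern as instantiating the test case with the raw vector type for the
inner computation.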
My suspicion is that the "Tvsimple" type is somehow not completely POD
and that this prohibits g++ from optimizing more aggressively. On the
other hand, clang++ produces identical code for both versions, which is
comparable in speed with the faster version generated by g++.
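The POD suspicion can be put to the compiler directly with type traits. The
sketch below is an illustration under assumptions: "v4d", "WrapDefaulted" and
"WrapUserCtor" are made-up names, and a "vector_size(32)" typedef again stands
in for __m256d so that no AVX flags are needed. It shows that a wrapper of this
shape is trivially copyable, trivially default constructible and
standard-layout, whereas a variant with a user-provided empty default
constructor loses only the trivial default construction.

```cpp
#include <type_traits>

// Stand-in for __m256d (assumption: GCC/clang vector extension).
typedef double v4d __attribute__((vector_size(32)));

// Same shape as the wrapper in the test case: private vector member,
// defaulted default constructor, converting constructors.
// Arithmetic operators are left out; non-virtual operators do not affect
// the traits checked below.
class WrapDefaulted
{
  private:
    v4d v;
  public:
    WrapDefaulted() = default;
    WrapDefaulted(const double &val) { const v4d t = {val, val, val, val}; v = t; }
    WrapDefaulted(const v4d &val) : v(val) {}
};

// Variant with a user-provided (empty) default constructor.
struct WrapUserCtor
{
  v4d v;
  WrapUserCtor() {}
};

static_assert(std::is_trivially_copyable<WrapDefaulted>::value,
              "wrapper is trivially copyable");
static_assert(std::is_trivially_default_constructible<WrapDefaulted>::value,
              "defaulted constructor is trivial");
static_assert(std::is_standard_layout<WrapDefaulted>::value,
              "wrapper is standard-layout");

static_assert(std::is_trivially_copyable<WrapUserCtor>::value,
              "still trivially copyable");
static_assert(!std::is_trivially_default_constructible<WrapUserCtor>::value,
              "user-provided constructor is not trivial");
```

If the same assertions hold for the real class with __m256d, the wrapper is
POD by the language's definitions, which would suggest (but not prove) that
the difference in code generation has another cause.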
Is g++ missing an opportunity to optimize here? If so, is there a way to
alter the "Tvsimple" class so that it doesn't stop g++ from optimizing?
Thanks,
Martin
[-- Attachment #2: testcase.cc --]
[-- Type: text/x-c++src, Size: 2513 bytes --]
#include <complex>
#include <vector>
#include <array>
#include <immintrin.h>

using namespace std;

struct dbl2 { double a, b; };

// simple OO wrapper around __m256d
class Tvsimple
{
  private:
    __m256d v;

  public:
    Tvsimple()=default;
    Tvsimple(const double &val) :v(_mm256_set1_pd(val)) {}
    Tvsimple(const __m256d &val) :v(val) {}
    operator __m256d() const {return v;}

    Tvsimple &operator*=(double val) {v*=val; return *this;}
    Tvsimple &operator+=(const Tvsimple &other) {v+=other.v; return *this;}
    Tvsimple operator*(double val) const {return v*_mm256_set1_pd(val);}
    Tvsimple operator*(Tvsimple val) const {return v*val.v;}
    Tvsimple operator+(Tvsimple val) const {return v+val.v;}
    Tvsimple operator+(double val) const {return v+_mm256_set1_pd(val);}
};

constexpr size_t VLEN=4;
constexpr size_t nv0 = 128/VLEN;
typedef std::array<Tvsimple,nv0> Tbv0;

struct s0data_v
  { Tbv0 sth, corfac, scale, lam1, lam2, csq, p1r, p1i, p2r, p2i; };

template<typename T_inner, size_t nv2> void foo(s0data_v & d,
  const vector<dbl2> &coef, const complex<double> * alm,
  size_t l, size_t il, size_t lmax, size_t start)
{
  T_inner p1r[nv2], p1i[nv2], p2r[nv2], p2i[nv2], lam1[nv2], lam2[nv2], csq[nv2];
  for (size_t i=0; i<nv2; ++i)
  {
    p1r[i] = d.p1r[i+start];
    p2r[i] = d.p2r[i+start];
    p1i[i] = d.p1i[i+start];
    p2i[i] = d.p2i[i+start];
    lam1[i] = d.lam1[i+start];
    lam2[i] = d.lam2[i+start];
    csq[i] = d.csq[i+start];
  }
  // critical loop
  while (l<=lmax)
  {
    for (size_t i=0; i<nv2; ++i)
      p1r[i] += lam2[i]*alm[l  ].real();
    for (size_t i=0; i<nv2; ++i)
      p1i[i] += lam2[i]*alm[l  ].imag();
    for (size_t i=0; i<nv2; ++i)
      p2r[i] += lam2[i]*alm[l+1].real();
    for (size_t i=0; i<nv2; ++i)
      p2i[i] += lam2[i]*alm[l+1].imag();
    for (size_t i=0; i<nv2; ++i)
    {
      T_inner tmp = lam2[i]*(csq[i]*coef[il].a + coef[il].b) + lam1[i];
      lam1[i] = lam2[i];
      lam2[i] = tmp;
    }
    ++il; l+=2;
  }
  for (size_t i=0; i<nv2; ++i)
  {
    d.p1r[i+start] = p1r[i];
    d.p2r[i+start] = p2r[i];
    d.p1i[i+start] = p1i[i];
    d.p2i[i+start] = p2i[i];
  }
}

template void foo<Tvsimple, 2>(s0data_v & __restrict__ d,
  const vector<dbl2> &coef, const complex<double> * alm,
  size_t l, size_t il, size_t lmax, size_t start);

template void foo<__m256d,2>(s0data_v & __restrict__ d,
  const vector<dbl2> &coef, const complex<double> * alm,
  size_t l, size_t il, size_t lmax, size_t start);
* Re: g++: Suboptimal code generation for simple wrapper class around vector data type
From: Martin Reinecke @ 2021-03-22 16:42 UTC
To: gcc-help
Just for reference: for the inner loop in "foo<Tvsimple, 2>"
I get the following assembler output:
.L4:
vbroadcastsd (%rax), %ymm2
addq $2, %rcx
addq $32, %rax
addq $16, %r8
vmovapd 200(%rsp), %ymm15
vbroadcastsd -8(%r8), %ymm11
vfmadd231pd %ymm2, %ymm1, %ymm4
vfmadd231pd %ymm2, %ymm0, %ymm3
vbroadcastsd -24(%rax), %ymm2
vfmadd231pd %ymm2, %ymm1, %ymm6
vfmadd231pd %ymm2, %ymm0, %ymm5
vbroadcastsd -16(%rax), %ymm2
vfmadd231pd %ymm1, %ymm2, %ymm8
vfmadd231pd %ymm0, %ymm2, %ymm7
vbroadcastsd -8(%rax), %ymm2
vmovapd %ymm4, -120(%rsp)
vfmadd231pd %ymm1, %ymm2, %ymm10
vfmadd231pd %ymm0, %ymm2, %ymm9
vbroadcastsd -16(%r8), %ymm2
vmovapd %ymm3, -88(%rsp)
vmovapd %ymm6, -56(%rsp)
vmovapd %ymm2, %ymm14
vfmadd132pd %ymm12, %ymm11, %ymm2
vmovapd %ymm5, -24(%rsp)
vfmadd132pd %ymm13, %ymm11, %ymm14
vmovapd %ymm8, 8(%rsp)
vfmadd213pd 168(%rsp), %ymm2, %ymm0
vmovapd 232(%rsp), %ymm2
vfmadd213pd 136(%rsp), %ymm14, %ymm1
vmovapd %ymm7, 40(%rsp)
vmovapd %ymm10, 72(%rsp)
vmovapd %ymm9, 104(%rsp)
vmovapd %ymm15, 136(%rsp)
vmovapd %ymm2, 168(%rsp)
vmovapd %ymm1, 200(%rsp)
vmovapd %ymm0, 232(%rsp)
cmpq %rcx, %r9
jnb .L4
For "foo<__m256d,2>" I get:
.L12:
vmovapd %ymm2, %ymm0
vmovapd %ymm3, %ymm1
.L11:
vmovapd %ymm15, %ymm3
addq $2, %rcx
addq $32, %rax
addq $16, %r8
vbroadcastsd -32(%rax), %ymm2
vbroadcastsd -8(%r8), %ymm4
vfmadd231pd %ymm2, %ymm1, %ymm12
vfmadd231pd %ymm2, %ymm0, %ymm8
vbroadcastsd -24(%rax), %ymm2
vfmadd231pd %ymm2, %ymm1, %ymm10
vfmadd231pd %ymm2, %ymm0, %ymm6
vbroadcastsd -16(%rax), %ymm2
vfmadd231pd %ymm2, %ymm1, %ymm11
vfmadd231pd %ymm2, %ymm0, %ymm7
vbroadcastsd -8(%rax), %ymm2
vfmadd231pd %ymm2, %ymm1, %ymm9
vfmadd231pd %ymm2, %ymm0, %ymm5
vbroadcastsd -16(%r8), %ymm2
vfmadd132pd %ymm2, %ymm4, %ymm3
vfmadd132pd -120(%rsp), %ymm4, %ymm2
vfmadd132pd %ymm1, %ymm14, %ymm3
vfmadd132pd %ymm0, %ymm13, %ymm2
vmovapd %ymm1, %ymm14
vmovapd %ymm0, %ymm13
cmpq %rcx, %r9
jnb .L12
And the assembler generated by clang++ is (in both cases, except for
minimal differences):
.LBB1_2: # =>This Inner Loop Header: Depth=1
vbroadcastsd -16(%rdx,%rsi,2), %ymm14
vfmadd231pd %ymm10, %ymm14, %ymm4
vfmadd231pd %ymm14, %ymm11, %ymm0
vbroadcastsd -8(%rdx,%rsi,2), %ymm14
vfmadd231pd %ymm10, %ymm14, %ymm6
vfmadd231pd %ymm14, %ymm11, %ymm2
vbroadcastsd (%rdx,%rsi,2), %ymm14
vfmadd231pd %ymm10, %ymm14, %ymm5
vfmadd231pd %ymm14, %ymm11, %ymm1
vbroadcastsd 8(%rdx,%rsi,2), %ymm14
vfmadd231pd %ymm10, %ymm14, %ymm7
vfmadd231pd %ymm14, %ymm11, %ymm3
vbroadcastsd -8(%r8,%rsi), %ymm14
vbroadcastsd (%r8,%rsi), %ymm15
vmovapd %ymm9, %ymm8
vfmadd213pd %ymm15, %ymm14, %ymm8
vfmadd213pd %ymm13, %ymm10, %ymm8
vfmadd132pd -56(%rsp), %ymm15, %ymm14
vfmadd213pd %ymm12, %ymm11, %ymm14
addq $2, %rcx
addq $16, %rsi
vmovapd %ymm11, %ymm12
vmovapd %ymm10, %ymm13
vmovapd %ymm14, %ymm11
vmovapd %ymm8, %ymm10
cmpq %r9, %rcx
jbe .LBB1_2
* Re: g++: Suboptimal code generation for simple wrapper class around vector data type
From: Martin Reinecke @ 2021-03-23 10:32 UTC
To: gcc-help
Here is a further reduced test case, together with the generated
assembler output.
I'm really at my wits' end here ... should I file this as a
"missed-optimization" PR?
Cheers,
Martin
[-- Attachment #2: testcase.cc --]
[-- Type: text/x-c++src, Size: 1575 bytes --]
#include <immintrin.h>

// simple OO wrapper around __m256d
struct Tvsimple
{
  __m256d v;
  Tvsimple &operator+=(const Tvsimple &other) {v+=other.v; return *this;}
  Tvsimple operator*(double val) const { Tvsimple res; res.v = v*_mm256_set1_pd(val); return res;}
  Tvsimple operator*(Tvsimple val) const { Tvsimple res; res.v = v*val.v; return res; }
  Tvsimple operator+(Tvsimple val) const { Tvsimple res; res.v = v+val.v; return res; }
  Tvsimple operator+(double val) const { Tvsimple res; res.v = v+_mm256_set1_pd(val); return res;}
};

template<typename vtype> struct s0data_s
  { vtype sth, corfac, scale, lam1, lam2, csq, p1r, p1i, p2r, p2i; };

template<typename vtype> void foo(s0data_s<vtype> & __restrict__ d,
  const double * __restrict__ coef, const double * __restrict__ alm,
  size_t l, size_t il, size_t lmax)
{
  // critical loop
  while (l<=lmax)
  {
    d.p1r += d.lam2*alm[2*l];
    d.p1i += d.lam2*alm[2*l+1];
    d.p2r += d.lam2*alm[2*l+2];
    d.p2i += d.lam2*alm[2*l+3];
    auto tmp = d.lam2*(d.csq*coef[2*il] + coef[2*il+1]) + d.lam1;
    d.lam1 = d.lam2;
    d.lam2 = tmp;
    ++il; l+=2;
  }
}

// this version has dead stores at the end of the loop
template void foo<>(s0data_s<Tvsimple> & __restrict__ d,
  const double * __restrict__ coef, const double * __restrict__ alm,
  size_t l, size_t il, size_t lmax);

// this version moves the stores after the end of the loop
template void foo<>(s0data_s<__m256d> & __restrict__ d,
  const double * __restrict__ coef, const double * __restrict__ alm,
  size_t l, size_t il, size_t lmax);
[-- Attachment #3: testcase.s --]
[-- Type: text/plain, Size: 2906 bytes --]
.file "testcase.cc"
.text
.section .text._Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm,"axG",@progbits,_Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm,comdat
.p2align 4
.weak _Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm
.type _Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm, @function
_Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm:
.LFB5360:
.cfi_startproc
cmpq %r9, %rcx
ja .L5
movq %rcx, %rax
salq $4, %r8
vmovapd 160(%rdi), %ymm7
vmovapd 288(%rdi), %ymm5
salq $4, %rax
vmovapd 256(%rdi), %ymm4
vmovapd 224(%rdi), %ymm3
addq %r8, %rsi
vmovapd 192(%rdi), %ymm2
vmovapd 128(%rdi), %ymm0
addq %rax, %rdx
.p2align 4,,10
.p2align 3
.L4:
vbroadcastsd (%rdx), %ymm1
addq $2, %rcx
addq $32, %rdx
addq $16, %rsi
vbroadcastsd -8(%rsi), %ymm6
vfmadd231pd %ymm0, %ymm1, %ymm2
vbroadcastsd -24(%rdx), %ymm1
vfmadd231pd %ymm0, %ymm1, %ymm3
vbroadcastsd -16(%rdx), %ymm1
vfmadd231pd %ymm0, %ymm1, %ymm4
vbroadcastsd -8(%rdx), %ymm1
vmovapd %ymm2, 192(%rdi)
vfmadd231pd %ymm0, %ymm1, %ymm5
vbroadcastsd -16(%rsi), %ymm1
vmovapd %ymm3, 224(%rdi)
vfmadd132pd %ymm7, %ymm6, %ymm1
vmovapd 128(%rdi), %ymm6
vfmadd213pd 96(%rdi), %ymm1, %ymm0
vmovapd %ymm4, 256(%rdi)
vmovapd %ymm6, 96(%rdi)
vmovapd %ymm5, 288(%rdi)
vmovapd %ymm0, 128(%rdi)
cmpq %rcx, %r9
jnb .L4
vzeroupper
.L5:
ret
.cfi_endproc
.LFE5360:
.size _Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm, .-_Z3fooI8TvsimpleEvR8s0data_sIT_EPKdS6_mmm
.section .text._Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm,"axG",@progbits,_Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm,comdat
.p2align 4
.weak _Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm
.type _Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm, @function
_Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm:
.LFB5361:
.cfi_startproc
cmpq %r9, %rcx
ja .L11
movq %rcx, %rax
salq $4, %r8
vmovapd 192(%rdi), %ymm5
vmovapd 128(%rdi), %ymm0
salq $4, %rax
vmovapd 224(%rdi), %ymm4
vmovapd 256(%rdi), %ymm3
addq %r8, %rsi
vmovapd 288(%rdi), %ymm2
vmovapd 160(%rdi), %ymm8
addq %rax, %rdx
vmovapd 96(%rdi), %ymm6
jmp .L9
.p2align 4,,10
.p2align 3
.L10:
vmovapd %ymm1, %ymm0
.L9:
vbroadcastsd (%rdx), %ymm1
addq $2, %rcx
addq $32, %rdx
addq $16, %rsi
vbroadcastsd -8(%rsi), %ymm7
vfmadd231pd %ymm0, %ymm1, %ymm5
vbroadcastsd -24(%rdx), %ymm1
vfmadd231pd %ymm0, %ymm1, %ymm4
vbroadcastsd -16(%rdx), %ymm1
vfmadd231pd %ymm0, %ymm1, %ymm3
vbroadcastsd -8(%rdx), %ymm1
vfmadd231pd %ymm0, %ymm1, %ymm2
vbroadcastsd -16(%rsi), %ymm1
vfmadd132pd %ymm8, %ymm7, %ymm1
vfmadd132pd %ymm0, %ymm6, %ymm1
vmovapd %ymm0, %ymm6
cmpq %rcx, %r9
jnb .L10
vmovapd %ymm5, 192(%rdi)
vmovapd %ymm1, 128(%rdi)
vmovapd %ymm4, 224(%rdi)
vmovapd %ymm3, 256(%rdi)
vmovapd %ymm2, 288(%rdi)
vmovapd %ymm0, 96(%rdi)
vzeroupper
.L11:
ret
.cfi_endproc
.LFE5361:
.size _Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm, .-_Z3fooIDv4_dEvR8s0data_sIT_EPKdS6_mmm
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits
* Re: g++: Suboptimal code generation for simple wrapper class around vector data type
From: Alexander Monakov @ 2021-03-23 13:42 UTC
To: Martin Reinecke; +Cc: gcc-help
On Tue, 23 Mar 2021, Martin Reinecke wrote:
> Here is a further reduced test case, together with the generated
> assembler output.
>
> I'm really at my wits' end here ... should I file this as a
> "missed-optimization" PR?
Yes. Note that __m256d is declared with the 'may_alias' attribute, so in general
it should weaken optimizations by disabling type-based alias analysis,
but in this case replacing __m256d with __v4df does not help.
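To illustrate what 'may_alias' changes, here is a small sketch. It rests on
assumptions: "v4d_alias" and "store_ones" are hypothetical names, the typedef
only imitates how GCC's header declares __m256d (vector_size plus may_alias),
and the expected bit pattern assumes IEEE-754 doubles. Because the type carries
may_alias, a store through it is allowed to modify objects of an unrelated
type, so the compiler has to treat such stores conservatively.

```cpp
#include <cassert>
#include <cstdint>

// Imitates the declaration style of __m256d: a 32-byte vector of doubles
// that is additionally marked may_alias (assumption: GCC/clang attributes).
typedef double v4d_alias __attribute__((vector_size(32), may_alias));

// Overwrites four 64-bit integers with the bit pattern of the double 1.0.
// The store goes through a vector-of-double lvalue although the underlying
// objects are integers; may_alias is what makes this access acceptable to
// type-based alias analysis.
void store_ones(std::uint64_t *bits)
{
  // The vector type requires 32-byte alignment of the destination.
  assert(reinterpret_cast<std::uintptr_t>(bits) % 32 == 0);
  const v4d_alias ones = {1.0, 1.0, 1.0, 1.0};
  *reinterpret_cast<v4d_alias *>(bits) = ones;
}
```

GCC's __v4df typedef has the same vector_size but no may_alias, which is why
swapping it in is a quick way to rule alias analysis in or out as the cause.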
The following version should be easier to optimize, but it is still not handled
well (at -O2 -mfma). I'd suggest presenting it in the PR in addition to your
reduced version.
(to be clear, I'm not in any way suggesting that the below version is a
"better C++ code" or anything, it just leads to a simpler GIMPLE IR)
Alexander
#include <immintrin.h>

struct Tvsimple
{
  __v4df v;
  Tvsimple() {}
  Tvsimple(double val) { v = _mm256_set1_pd(val); }
};

Tvsimple operator*(Tvsimple v1, Tvsimple v2)
{
  Tvsimple res; res.v = v1.v*v2.v; return res;
}

Tvsimple operator+(Tvsimple v1, Tvsimple v2)
{
  Tvsimple res; res.v = v1.v+v2.v; return res;
}

template<typename vtype> struct s0data_s
  { vtype sth, corfac, scale, lam1, lam2, csq, p1r, p1i, p2r, p2i; };

template<typename vtype> void foo(s0data_s<vtype> & __restrict__ d,
  const double * __restrict__ coef, const double * __restrict__ alm,
  size_t l, size_t il, size_t lmax)
{
  // critical loop
  while (l<=lmax)
  {
    d.p1r = d.p1r+d.lam2*alm[2*l];
    d.p1i = d.p1i+d.lam2*alm[2*l+1];
    d.p2r = d.p2r+d.lam2*alm[2*l+2];
    d.p2i = d.p2i+d.lam2*alm[2*l+3];
    auto tmp = d.lam2*(d.csq*coef[2*il] + coef[2*il+1]) + d.lam1;
    d.lam1 = d.lam2;
    d.lam2 = tmp;
    ++il; l+=2;
  }
}

// this version has dead stores at the end of the loop
template void foo<>(s0data_s<Tvsimple> & __restrict__ d,
  const double * __restrict__ coef, const double * __restrict__ alm,
  size_t l, size_t il, size_t lmax);

// this version moves the stores after the end of the loop
template void foo<>(s0data_s<__v4df> & __restrict__ d,
  const double * __restrict__ coef, const double * __restrict__ alm,
  size_t l, size_t il, size_t lmax);
* Re: g++: Suboptimal code generation for simple wrapper class around vector data type
From: Martin Reinecke @ 2021-03-23 13:48 UTC
To: Alexander Monakov; +Cc: gcc-help
Dear Alexander,
thanks a lot for the reply!
> Yes. Note that __m256d is declared with the 'may_alias' attribute, so in general
> it should weaken optimizations by disabling type-based alias analysis,
> but in this case replacing __m256d with __v4df does not help.
>
> The following version should be easier to optimize, but it is still not handled
> well (at -O2 -mfma). I'd suggest presenting it in the PR in addition to your
> reduced version.
I have opened PR c++/99728 for this in the meantime and will add your
test case!
Thanks again,
Martin