Here is a further reduced test case, together with the generated
assembler output.

I'm really at my wits' end here ... should I file this as a
"missed-optimization" PR?

Cheers,
  Martin

On 3/22/21 3:34 PM, Martin Reinecke wrote:
> Hi,
> 
> the attached test case is the (slightly simplified) hot loop from a
> library for spherical harmonic transforms.
> This code uses explicit vectorization, and I try to use simple wrapper
> classes around the primitive vector types (like __m256d) to simplify
> operations like initialization with a scalar etc.
> 
> However it seems that using the wrapper type inside the critical loop
> causes g++ to produce sub-optimal code. This can be seen by running
> 
> g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc
> 
> and inspecting the generated assembler code (I'm using gcc 10.2.1).
> The version where I use the wrapper type even in the hot loop (i.e.
> "foo<Tvsimple, 2>") has a few unnecessary "vmovapd" instructions before
> the end of the loop body, which are missing in the version where I cast
> to __m256d before doing the heavy computation (i.e. "foo<__m256d,2>").
> 
> My suspicion is that the "Tvsimple" type is somehow not completely POD
> and that this prohibits g++ from optimizing more aggressively. On the
> other hand, clang++ produces identical code for both versions, which is
> comparable in speed with the faster version generated by g++.
> 
> Is g++ missing an opportunity to optimize here? If so, is there a way to
> alter the "Tvsimple" class so that it doesn't stop g++ from optimizing?
> 
> Thanks,
>   Martin
>