Here is a further reduced test case, together with the generated
assembler output.

I'm really at my wits' end here ... should I file this as a
"missed-optimization" PR?

Cheers,
  Martin

On 3/22/21 3:34 PM, Martin Reinecke wrote:
> Hi,
>
> the attached test case is the (slightly simplified) hot loop from a
> library for spherical harmonic transforms.
> This code uses explicit vectorization, and I try to use simple wrapper
> classes around the primitive vector types (like __m256d) to simplify
> operations like initialization with a scalar etc.
>
> However it seems that using the wrapper type inside the critical loop
> causes g++ to produce sub-optimal code. This can be seen by running
>
> g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc
>
> and inspecting the generated assembler code (I'm using gcc 10.2.1).
> The version where I use the wrapper type even in the hot loop (i.e.
> "foo") has a few unnecessary "vmovapd" instructions before
> the end of the loop body, which are missing in the version where I cast
> to __m256d before doing the heavy computation (i.e. "foo<__m256d,2>").
>
> My suspicion is that the "Tvsimple" type is somehow not completely POD
> and that this prohibits g++ from optimizing more aggressively. On the
> other hand, clang++ produces identical code for both versions, which is
> comparable in speed with the faster version generated by g++.
>
> Is g++ missing an opportunity to optimize here? If so, is there a way to
> alter the "Tvsimple" class so that it doesn't stop g++ from optimizing?
>
> Thanks,
> Martin
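To make the "not completely POD" suspicion from the quoted message checkable, here is a minimal sketch. It is not the attached test case. The names RawVec, Wrapped and trait_report are hypothetical, and RawVec is only a plain-struct stand-in for __m256d so that the snippet builds without -mavx. It shows how the standard type traits can be used to see whether a thin wrapper with a scalar-broadcast constructor is still trivially copyable, trivially default constructible and standard-layout.

```cpp
#include <cassert>
#include <type_traits>

// Assumption: stand-in for __m256d (4 doubles, 32-byte aligned), so that
// this sketch compiles without AVX flags. Not the real intrinsic type.
struct alignas(32) RawVec { double d[4]; };

// Hypothetical thin wrapper, loosely modelled on the "Tvsimple" idea:
// one data member plus a constructor that broadcasts a scalar.
struct Wrapped {
  RawVec v;
  Wrapped() = default;                    // defaulted on first declaration, stays trivial
  Wrapped(double s) : v{{s, s, s, s}} {}  // user-provided scalar broadcast
  Wrapped &operator+=(const Wrapped &o) {
    for (int i = 0; i < 4; ++i) v.d[i] += o.v.d[i];
    return *this;
  }
};

// Collects the properties that together correspond to the old "POD" notion.
struct TraitReport {
  bool trivially_copyable;
  bool trivial_default_ctor;
  bool standard_layout;
  bool same_size;
};

constexpr TraitReport trait_report() {
  return {std::is_trivially_copyable<Wrapped>::value,
          std::is_trivially_default_constructible<Wrapped>::value,
          std::is_standard_layout<Wrapped>::value,
          sizeof(Wrapped) == sizeof(RawVec)};
}

// Compile-time checks: a failing line points at the property that is lost.
static_assert(trait_report().trivially_copyable, "wrapper is not trivially copyable");
static_assert(trait_report().trivial_default_ctor, "default constructor is not trivial");
static_assert(trait_report().standard_layout, "wrapper is not standard-layout");
static_assert(trait_report().same_size, "wrapper adds padding");

// Small runtime check that the broadcast constructor and += behave as intended.
inline void self_check() {
  Wrapped a(1.5);
  a += Wrapped(2.0);
  for (int i = 0; i < 4; ++i) assert(a.v.d[i] == 3.5);
}
```

If all static_asserts pass for the real wrapper as well, the type is trivial at the language level. That would not by itself explain the extra "vmovapd" instructions; it only rules out one hypothesis.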