Hi all,

Recently, I ran into an issue with auto-vectorization.

As the following code shows, using uint32_t for the index seems to prevent the
compiler (GCC 12.1 with -O3) from auto-vectorizing the loop. Why is that?
See https://godbolt.org/z/a3GfaKEq6.

#include <cstdint>

// no auto vectorization
void test32(uint32_t *array, uint32_t &nread, uint32_t from, uint32_t to) {
    for (uint32_t i = from; i < to; i++) {
        array[nread++] = i;
    }
}

// auto vectorization
void test64(uint32_t *array, uint64_t &nread, uint32_t from, uint32_t to) {
    for (uint32_t i = from; i < to; i++) {
        array[nread++] = i;
    }
}

// no auto vectorization
void test_another_32(uint32_t *array, uint32_t &nread, uint32_t from,
                     uint32_t to) {
    uint32_t index = nread;
    for (uint32_t i = from; i < to; i++) {
        array[index++] = i;
    }
    nread = index;
}

// auto vectorization
void test_another_64(uint32_t *array, uint32_t &nread, uint32_t from,
                     uint32_t to) {
    uint64_t index = nread;
    for (uint32_t i = from; i < to; i++) {
        array[index++] = i;
    }
    nread = index;
}

After running g++ -O3 -fopt-info-vec-missed -c test.cc -o /dev/null, I got the
following result. How should I interpret it?

bash> g++ -O3 -fopt-info-vec-missed -c test.cc -o /dev/null
test.cc:5:31: missed: couldn't vectorize loop
test.cc:6:24: missed: not vectorized: not suitable for scatter store *_5 = i_18;
test.cc:21:31: missed: couldn't vectorize loop
test.cc:22:24: missed: not vectorized: not suitable for scatter store *_4 = i_22;

--
Best regards,
Adonis

On Mon, 27 Jun 2022, Adonis Ling via Gcc-help wrote:

> Hi all,
>
> Recently, I met an issue with auto vectorization.
>
> As following code shows, why uint32_t prevents the compiler (GCC 12.1 + O3)
> from optimizing by auto vectorization. See https://godbolt.org/z/a3GfaKEq6.
>
> #include <cstdint>
>
> // no auto vectorization
> void test32(uint32_t *array, uint32_t &nread, uint32_t from, uint32_t to) {
>     for (uint32_t i = from; i < to; i++) {
>         array[nread++] = i;
>     }
> }

Here the main problem is '*array' and 'nread' have the same type, so they might
overlap. Ideally the compiler would recognize that that cannot happen because it
would make 'array[nread++] = i' undefined due to unsequenced modifications, but
GCC is not sufficiently smart (yet). The secondary issue is the same as below:

> // no auto vectorization
> void test_another_32(uint32_t *array, uint32_t &nread, uint32_t from,
>                      uint32_t to) {
>     uint32_t index = nread;
>     for (uint32_t i = from; i < to; i++) {
>         array[index++] = i;
>     }
>     nread = index;
> }

... here: the issue is that index is unsigned and shorter than pointer type, it
can wrap around from 0xffffffff to 0, making the access non-consecutive. When
you compile for 32-bit x86, this loop is vectorized.

Alexander

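To make the wrap-around point above concrete, here is a minimal sketch of the
index arithmetic only. It is not part of the original messages, and the helper
names index32_after, index64_after and check_index_walk are hypothetical. With
64-bit pointers, a 32-bit index that wraps sends the next store back to
array[0], while a 64-bit index keeps moving forward.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: value of a 32-bit index after `steps` increments.
// Unsigned arithmetic wraps modulo 2^32, so the result can be smaller than
// `start`.
constexpr std::uint32_t index32_after(std::uint32_t start, std::uint32_t steps) {
    return static_cast<std::uint32_t>(start + steps);
}

// Hypothetical helper: the same walk with a 64-bit index. The value is
// widened before the addition, so it cannot wrap for any 32-bit inputs.
constexpr std::uint64_t index64_after(std::uint32_t start, std::uint32_t steps) {
    return static_cast<std::uint64_t>(start) + steps;
}

// Checks the two cases discussed in the thread.
inline void check_index_walk() {
    // 32-bit index: the element after 0xffffffff is element 0 again.
    assert(index32_after(0xffffffffu, 1) == 0u);
    // 64-bit index: the element after 0xffffffff is element 0x100000000.
    assert(index64_after(0xffffffffu, 1) == 0x100000000ull);
}
```

Because the compiler has to preserve the first behaviour for every possible
starting value of the index, it cannot simply assume the stores are
consecutive.
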
Hi Alexander, thanks for your reply.

On Tue, Jun 28, 2022 at 9:06 PM Alexander Monakov <amonakov@ispras.ru> wrote:

> On Mon, 27 Jun 2022, Adonis Ling via Gcc-help wrote:
>
> > Hi all,
> >
> > Recently, I met an issue with auto vectorization.
> >
> > As following code shows, why uint32_t prevents the compiler (GCC 12.1 + O3)
> > from optimizing by auto vectorization. See https://godbolt.org/z/a3GfaKEq6.
> >
> > #include <cstdint>
> >
> > // no auto vectorization
> > void test32(uint32_t *array, uint32_t &nread, uint32_t from, uint32_t to) {
> >     for (uint32_t i = from; i < to; i++) {
> >         array[nread++] = i;
> >     }
> > }
>
> Here the main problem is '*array' and 'nread' have the same type, so they
> might overlap. Ideally the compiler would recognize that that cannot happen
> because it would make 'array[nread++] = i' undefined due to unsequenced
> modifications, but GCC is not sufficiently smart (yet). The secondary issue
> is the same as below:
>

I got your point.

After that, I tried to add __restrict__ to nread as the following shows and
GCC still doesn't optimize it.

#include <cstdint>

// no auto vectorization
void test32(uint32_t *array, uint32_t & __restrict__ nread, uint32_t from,
            uint32_t to) {
    for (uint32_t i = from; i < to; i++) {
        array[nread++] = i;
    }
}

However, when I used Clang to compile, I noticed the code was optimized by
Clang. See https://godbolt.org/z/eEz9W7o9z .

> > // no auto vectorization
> > void test_another_32(uint32_t *array, uint32_t &nread, uint32_t from,
> >                      uint32_t to) {
> >     uint32_t index = nread;
> >     for (uint32_t i = from; i < to; i++) {
> >         array[index++] = i;
> >     }
> >     nread = index;
> > }
>
> ... here: the issue is that index is unsigned and shorter than pointer type,
> it can wrap around from 0xffffffff to 0, making the access non-consecutive.
> When you compile for 32-bit x86, this loop is vectorized.
>
> Alexander
>

Clang also optimizes this function. See https://godbolt.org/z/eEz9W7o9z .

--
Best regards,
Adonis

On Tue, 28 Jun 2022, Adonis Ling via Gcc-help wrote:
> > Here the main problem is '*array' and 'nread' have the same type, so they
> > might overlap. Ideally the compiler would recognize that that cannot happen
> > because it would make 'array[nread++] = i' undefined due to unsequenced
> > modifications, but GCC is not sufficiently smart (yet). The secondary issue
> > is the same as below:
> >
>
> I got your point.
>
> After that, I tried to add __restrict__ to nread as the following shows and
> GCC still doesn't optimize it.
As I said, there's a secondary issue even if you add 'restrict'.
Alexander
On Tue, Jun 28, 2022 at 11:38 PM Alexander Monakov <amonakov@ispras.ru> wrote:
> On Tue, 28 Jun 2022, Adonis Ling via Gcc-help wrote:
>
> > > Here the main problem is '*array' and 'nread' have the same type, so they
> > > might overlap. Ideally the compiler would recognize that that cannot happen
> > > because it would make 'array[nread++] = i' undefined due to unsequenced
> > > modifications, but GCC is not sufficiently smart (yet). The secondary issue
> > > is the same as below:
> > >
> >
> > I got your point.
> >
> > After that, I tried to add __restrict__ to nread as the following shows and
> > GCC still doesn't optimize it.
>
> As I said, there's a secondary issue even if you add 'restrict'.
>
> Alexander
>
For the secondary issue, can I conclude that Clang simply chooses to ignore it?
--
Best regards,
Adonis
On Tue, 28 Jun 2022, Adonis Ling via Gcc-help wrote:
> For the secondary issue, can I conclude that Clang simply chooses to ignore it?
That would be a compiler bug if it ignored that case; instead, it appears to
emit a test that the index would not wrap around, branching to a non-vectorized
variant if it would.
Alexander
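
The run-time check described above can be imitated by hand. The following is
only a sketch of the general idea (often called loop versioning), not a
reconstruction of the code Clang actually generates; the function name
fill_versioned is hypothetical. It tests up front whether the 32-bit index
could wrap during the loop. If it cannot, the stores go to consecutive
addresses and are written through a plain pointer; otherwise it falls back to
the original scalar loop from test_another_32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hand-versioned variant of test_another_32 from this thread.
void fill_versioned(std::uint32_t *array, std::uint32_t &nread,
                    std::uint32_t from, std::uint32_t to) {
    assert(array != nullptr);
    if (from >= to) {
        return;  // the original loop would not execute; nread is unchanged
    }
    const std::uint32_t start = nread;      // read once, as in test_another_32
    const std::uint32_t count = to - from;  // at least 1 at this point

    // The last index written is start + count - 1. It fits in 32 bits
    // exactly when count - 1 <= UINT32_MAX - start.
    if (count - 1 <= UINT32_MAX - start) {
        // Fast path: no wrap-around, so the stores are consecutive.
        std::uint32_t *dst = array + start;
        for (std::uint32_t k = 0; k < count; k++) {
            dst[k] = from + k;
        }
    } else {
        // Slow path: the index wraps from 0xffffffff to 0 part-way through,
        // so keep the original element-by-element loop.
        std::uint32_t index = start;
        for (std::uint32_t i = from; i < to; i++) {
            array[index++] = i;
        }
    }
    nread = start + count;  // same final value as the original, modulo 2^32
}
```

Whether a particular compiler vectorizes the fast path is not something this
sketch guarantees; it is worth checking with -fopt-info-vec or on godbolt.
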
On Tue, Jun 28, 2022 at 11:56 PM Alexander Monakov <amonakov@ispras.ru> wrote:
> On Tue, 28 Jun 2022, Adonis Ling via Gcc-help wrote:
>
> > For the secondary issue, can I conclude that Clang simply chooses to ignore it?
>
> That would be a compiler bug if it ignored that case; instead, it appears to
> emit a test that the index would not wrap around, branching to a
> non-vectorized variant if it would.
>
> Alexander
>
Ok, thanks a lot for your help!
--
Best regards,
Adonis