[Bug tree-optimization/102404] New: Loop vectorized with 32 byte vectors actually uses 16 byte vectors

* [Bug tree-optimization/102404] New: Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: freddie at witherden dot org @ 2021-09-18 22:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

            Bug ID: 102404
           Summary: Loop vectorized with 32 byte vectors actually uses 16
                    byte vectors
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: freddie at witherden dot org
  Target Milestone: ---

Created attachment 51480
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51480&action=edit
Test case

Consider the loop on L11 of the attached file.  Compiling as:

❯ gcc -march=tigerlake -Ofast -mprefer-vector-width=512 -S -fopenmp test.c -fopt-info
test.c:25:37: optimized: loop vectorized using 32 byte vectors
test.c:4:6: optimized: loop turned into non-loop; it never loops

which notes that (as requested) the loop has been vectorized using 32-byte
(zmm) vectors.  Inspecting the resulting assembly (also attached) we observe
that it has actually been unrolled by a factor of two and then vectorized using
16-byte (ymm) vectors.

As a point of comparison recent versions of Clang use 32-byte vectors for this
loop, resulting in code which is half the size.

* [Bug tree-optimization/102404] Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: freddie at witherden dot org @ 2021-09-18 22:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

--- Comment #1 from Freddie Witherden <freddie at witherden dot org> ---
Created attachment 51481
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51481&action=edit
Generated assembly.

* [Bug target/102404] Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: rguenth at gcc dot gnu.org @ 2021-09-20  9:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|x86_64                      |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-09-20

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
32 bytes are 256 bits (ymm), 64 bytes are 512 bits (zmm).  GCC does not
consider zmm vectorization because

t.c:25:37: missed:  loop does not have enough iterations to support
vectorization.

because

t.c:25:37: note:  vectorization_factor = 16, niters = 8

the memory accesses cannot be related so we fail to SLP this.

Does clang use vpgathers/scatters on %zmm here?
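
The arithmetic behind the two diagnostics above can be sketched as follows.
This is a minimal illustration, assuming the vectorization factor is set by
the smallest element type in the loop (the 4-byte table indices mentioned
later in the thread).  The helper names lanes and enough_iterations are
hypothetical and are not part of GCC.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of elements of elem_bytes each that fit in one vector register. */
int lanes(int vector_bytes, int elem_bytes)
{
    assert(vector_bytes > 0 && elem_bytes > 0);
    return vector_bytes / elem_bytes;
}

/* Assumed rule: vectorizing at a given factor needs at least that many
   loop iterations. */
bool enough_iterations(int vectorization_factor, int niters)
{
    return niters >= vectorization_factor;
}
```

With 64-byte vectors and 4-byte elements, lanes(64, 4) is 16, which exceeds
the trip count of 8, so enough_iterations(16, 8) is false.  With 32-byte
vectors, lanes(32, 4) is 8 and the check passes, while the 8-byte doubles
then need two 32-byte vectors per 8 iterations.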

* [Bug target/102404] Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: freddie at witherden dot org @ 2021-09-20 12:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

--- Comment #3 from Freddie Witherden <freddie at witherden dot org> ---
(In reply to Richard Biener from comment #2)
> 32 bytes are 256 bits (ymm), 64 bytes are 512 bits (zmm).  GCC does not
> consider zmm vectorization because
> 
> t.c:25:37: missed:  loop does not have enough iterations to support
> vectorization.
> 
> because
> 
> t.c:25:37: note:  vectorization_factor = 16, niters = 8
> 
> the memory accesses cannot be related so we fail to SLP this.
> 
> Does clang use vpgathers/scatters on %zmm here?

Apologies for the typo.  However, I would still expect the loop to be
vectorized with %zmm as it is operating (fundamentally) on double precision
numbers (8 bytes) with a trip count of 8.

Clang assembly is attached, showing the expected structure.  Gathers and
scatters are used on %zmm here.

Could GCC be thrown off by the fact that the table indices are 4 byte integers?
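
For readers without the attachment, the kind of loop under discussion can be
sketched as below.  This is a hypothetical reduction, not the attached test
case: the function name gather_scale_scatter, the operation it performs, and
the NITERS macro are assumptions chosen only to show 8-byte data addressed
through 4-byte indices with a trip count of 8.

```c
#include <assert.h>

#define NITERS 8  /* trip count discussed in the thread */

/* Gather doubles through 4-byte indices, scale them, and scatter them back.
   Each iteration touches one 8-byte value and one 4-byte index, so eight
   iterations cover 64 bytes of data but only 32 bytes of indices. */
void gather_scale_scatter(double *restrict v, const int *restrict vix,
                          double scale)
{
    assert(v && vix);
    for (int i = 0; i < NITERS; i++)
        v[vix[i]] = scale * v[vix[i]];
}
```

Compiling a loop of this shape with the flags from the original report and
-fopt-info is one way to check which vector width a given compiler version
picks; the result may differ from what is reported here.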

* [Bug target/102404] Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: freddie at witherden dot org @ 2021-09-20 12:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

--- Comment #4 from Freddie Witherden <freddie at witherden dot org> ---
Created attachment 51485
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51485&action=edit
Clang assembly.

* [Bug target/102404] Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: crazylht at gmail dot com @ 2021-09-22  2:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Freddie Witherden from comment #4)
> Created attachment 51485 [details]
> Clang assembly.

This seems to be because the current GCC loop vectorizer does not support
mixing different vector sizes, and here the index vector is 256 bits wide.
Changing the trip count to 16 successfully generates zmm code.
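
The size mismatch described here can be made concrete with a little
arithmetic.  The sketch below is illustrative only: group_bytes and
fills_whole_vectors are hypothetical helpers, and the 4-byte index and 8-byte
data element sizes are taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Bytes occupied by niters consecutive elements of elem_bytes each. */
int group_bytes(int niters, int elem_bytes)
{
    assert(niters > 0 && elem_bytes > 0);
    return niters * elem_bytes;
}

/* True when that group fills one or more whole vectors of vector_bytes. */
bool fills_whole_vectors(int niters, int elem_bytes, int vector_bytes)
{
    assert(vector_bytes > 0);
    return group_bytes(niters, elem_bytes) % vector_bytes == 0;
}
```

At a trip count of 8 the doubles fill one 64-byte vector while the 4-byte
indices fill only 32 bytes (a 256-bit vector).  At a trip count of 16 the
indices fill one 64-byte vector and the doubles fill two.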

* [Bug target/102404] Loop vectorized with 32 byte vectors actually uses 16 byte vectors
From: crazylht at gmail dot com @ 2021-09-22  2:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Freddie Witherden from comment #4)
> > Created attachment 51485 [details]
> > Clang assembly.
> 
> It seems to be because the current GCC loop vectorizer does not support
> different vector sizes, and here the index vector is 256bit. Change
> tripcount to 16 successfully generate zmm.

Changing the index type from const int * to const long long * also generates
zmm code, i.e.:

void intcflux(int _nx, const double* __restrict__ magnl_v, const double*
__restrict__ nl_v, double* __restrict__ ul_v, const long long* __restrict__
ul_vix, double* __restrict__ ur_v, const long long* __restrict__ ur_vix)
{
