public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/109690] New: bad SLP vectorization on zen
From: hubicka at gcc dot gnu.org @ 2023-05-01 21:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

            Bug ID: 109690
           Summary: bad SLP vectorization on zen
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

model name      : AMD Ryzen 7 5800X 8-Core Processor
Reproduces on my znver1 laptop too.

jh@ryzen3:~/gcc-kub/build/gcc> cat tt.c
int a[100];

[[gnu::noipa]]
void loop()
{
          for (int i = 0; i < 3; i++)
                  a[i]+=a[i];
}
int
main()
{
        for (int j = 0; j < 1000000000; j++)
          loop ();
        return 0;
}


jh@ryzen3:~/gcc-kub/build/gcc> ./xgcc -B ./ -O2 -march=native tt.c ; perf stat ./a.out

 Performance counter stats for './a.out':

           2683.95 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                52      page-faults:u                    #   19.374 /sec
       13001141361      cycles:u                         #    4.844 GHz                         (83.31%)
            691180      stalled-cycles-frontend:u        #    0.01% frontend cycles idle        (83.31%)
            101980      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.31%)
       12999928665      instructions:u                   #    1.00  insn per cycle
                                                  #    0.00  stalled cycles per insn     (83.31%)
        3000013809      branches:u                       #    1.118 G/sec                       (83.41%)
              1525      branch-misses:u                  #    0.00% of all branches             (83.36%)

       2.684376360 seconds time elapsed

       2.684369000 seconds user
       0.000000000 seconds sys


jh@ryzen3:~/gcc-kub/build/gcc> ./xgcc -B ./ -O2 -march=native tt.c -fno-tree-vectorize ; perf stat ./a.out

 Performance counter stats for './a.out':

           1238.92 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                52      page-faults:u                    #   41.972 /sec
        6000338140      cycles:u                         #    4.843 GHz                         (83.21%)
            314660      stalled-cycles-frontend:u        #    0.01% frontend cycles idle        (83.21%)
                 0      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.23%)
        7999796562      instructions:u                   #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn     (83.53%)
        2999887795      branches:u                       #    2.421 G/sec                       (83.53%)
               698      branch-misses:u                  #    0.00% of all branches             (83.28%)

       1.239116606 seconds time elapsed

       1.239121000 seconds user
       0.000000000 seconds sys


* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 21:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---

Without -march=znver1, we get:
  vect__10.6_9 = MEM <vector(2) int> [(int *)&a];
  vect_patt_13.7_8 = VIEW_CONVERT_EXPR<vector(2) unsigned int>(vect__10.6_9);
  vect_patt_19.8_1 = vect_patt_13.7_8 << 1;
  vect_patt_25.9_2 = VIEW_CONVERT_EXPR<vector(2) int>(vect_patt_19.8_1);
  MEM <vector(2) int> [(int *)&a] = vect_patt_25.9_2;

Which looks reasonable. But with -march=znver1 we get:

  _10 = a[0];
  _11 = _10 * 2;
  _16 = a[1];
  _17 = _16 * 2;
  _13 = {_11, _17};
  MEM <vector(2) int> [(int *)&a] = _13;

So this is definitely a cost model issue.


* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 22:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even more interesting is:
          for (int i = 0; i < 3; i++)
                  a[i] = ((unsigned)a[i]) << 1;

produces different code.
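
A minimal sketch of why the dumps in comment #1 show `* 2` and `<< 1` for a loop written as `a[i] += a[i]`: the three spellings below give the same result whenever the doubled value fits in an int. The function names are hypothetical and exist only for this illustration.

```c
/* Form used in the test case: add the element to itself.
   Signed overflow is undefined, so x must be small enough. */
int double_by_add(int x)
{
    return x + x;
}

/* Form that appears in the scalar GIMPLE as "_10 * 2". */
int double_by_mul(int x)
{
    return x * 2;
}

/* Form that appears in the vector GIMPLE: shift on the unsigned view.
   The shift itself wraps modulo 2^32; converting an out-of-range result
   back to int is implementation-defined (GCC wraps). */
int double_by_shift(int x)
{
    return (int)((unsigned)x << 1);
}
```

The unsigned detour in the last form is what makes the shift well defined for negative inputs, which matches the VIEW_CONVERT_EXPR pair around the shift in the vectorized dump.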


* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 22:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So without -march, we get:

first:
/app/example.cpp:14:24: note: Cost model analysis for part in loop 0:
  Vector cost: 28
  Scalar cost: 24



So we reject that, then retry, this time with V8QI, and then it works.

With -march we get:

/app/example.cpp:14:24: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 32

which we then accept, so there is no retry ...
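
The two dumps above are consistent with a check that rejects a candidate only when its vector cost is strictly greater than the scalar cost, so a tie is accepted. A minimal sketch of that assumed comparison follows; `slp_part_profitable` is a hypothetical name, not a GCC function.

```c
#include <stdbool.h>

/* Assumed rule, inferred from the dumps in this comment:
   vectorize unless the vector version is strictly more expensive. */
bool slp_part_profitable(int vector_cost, int scalar_cost)
{
    return vector_cost <= scalar_cost;
}
```

Under this rule the 28 vs 24 case is rejected (which triggers the retry with another mode), while the 32 vs 32 tie is accepted and ends the search.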


* [Bug target/109690] bad SLP vectorization on zen
From: rguenth at gcc dot gnu.org @ 2023-05-02  6:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-05-02
             Status|UNCONFIRMED                 |NEW
                 CC|                            |uros at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The x86 target chooses not to go the "compare costs" route but chooses the first
(usually biggest size) vectorization that is profitable.

So the interesting thing is that with -march=znver3 we have the
integer multiplication in V2SImode unsupported.  Note that SLP chooses
V2SImode for the base V4SImode.

With V8QImode (aka V2SImode) base mode pattern recog works to produce
the desired shift.

I think the disconnect is that with V4SImode we have an integer multiplication
instruction pattern (so pattern recog creates no replacement) but with V2SImode
we do not (looks like the target chose not to implement that).

A solution would be to perform pattern recog in the vectorizable_* routines
or, at least in the cases where it is straightforward, simply code-gen a
supported variant.

Thus, mulv2si3 is missing.
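
To illustrate the difference between the two policies mentioned at the top of this comment, here is a sketch under stated assumptions: the `candidate` struct and both functions are hypothetical, and ranking candidates by saving (scalar cost minus vector cost) is only one plausible reading of "compare costs". Neither function is GCC code.

```c
#include <stddef.h>

/* Hypothetical description of one vectorization attempt. */
struct candidate {
    const char *mode;   /* label only, e.g. a vector mode name */
    int vector_cost;
    int scalar_cost;
};

/* "First profitable": stop at the first candidate that is not
   more expensive than scalar code.  Returns its index or -1. */
int pick_first_profitable(const struct candidate *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (c[i].vector_cost <= c[i].scalar_cost)
            return (int)i;
    return -1;
}

/* "Compare costs" (assumed reading): look at every profitable
   candidate and keep the one with the largest saving. */
int pick_best_saving(const struct candidate *c, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (c[i].vector_cost > c[i].scalar_cost)
            continue;   /* not profitable at all */
        if (best < 0 ||
            c[i].scalar_cost - c[i].vector_cost >
            c[best].scalar_cost - c[best].vector_cost)
            best = (int)i;
    }
    return best;
}
```

With a tie on the first candidate, the first policy stops there even if a later candidate would save more, which is the behaviour this report runs into.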


* [Bug target/109690] bad SLP vectorization on zen
From: ubizjak at gmail dot com @ 2023-05-04 20:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #5 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 55002
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55002&action=edit
Patch that introduces mulv2si3

The compiled code with -march=znver1 is now the same as without the flag:

loop:
        vmovq   a(%rip), %xmm0
        sall    a+8(%rip)
        vpslld  $1, %xmm0, %xmm0
        vmovq   %xmm0, a(%rip)
        ret


* [Bug target/109690] bad SLP vectorization on zen
From: ubizjak at gmail dot com @ 2023-05-05 12:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #6 from Uroš Bizjak <ubizjak at gmail dot com> ---
The missing pattern was committed as part of:

commit r14-493-g919642fa4b2bc4c32910336dd200d53766801c80
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Fri May 5 14:10:18 2023 +0200

    i386: Introduce mulv2si3 instruction

    For SSE2 targets the expander unpacks input elements into the correct
    position in the V4SI vector and emits PMULUDQ instruction.  The output
    elements are then shuffled back to their positions in the V2SI vector.

    For SSE4 targets PMULLD instruction is emitted directly.

    gcc/ChangeLog:

            * config/i386/mmx.md (mulv2si3): New expander.
            (*mulv2si3): New insn pattern.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/sse2-mmx-mult-vec.c: New test.
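
The SSE2 sequence described in the commit message can be sketched with intrinsics. This is an illustration of the unpack / PMULUDQ / shuffle idea only, not the expander from mmx.md; the function name `mul_v2si` and the scalar fallback for non-SSE2 builds are assumptions added here.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Element-wise product of two 2 x int32 vectors, low 32 bits kept. */
void mul_v2si(const int32_t a[2], const int32_t b[2], int32_t out[2])
{
#if defined(__SSE2__)
    /* Load both elements into the low 64 bits of an XMM register. */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    /* Unpack [x0, x1, 0, 0] -> [x0, x0, x1, x1] so that the inputs
       sit in lanes 0 and 2, where PMULUDQ reads them. */
    va = _mm_unpacklo_epi32(va, va);
    vb = _mm_unpacklo_epi32(vb, vb);

    /* PMULUDQ: two 32x32 -> 64 bit unsigned products. */
    __m128i prod = _mm_mul_epu32(va, vb);

    /* The low halves of the products are in lanes 0 and 2;
       shuffle them back to lanes 0 and 1 and store 64 bits. */
    prod = _mm_shuffle_epi32(prod, _MM_SHUFFLE(3, 3, 2, 0));
    _mm_storel_epi64((__m128i *)out, prod);
#else
    /* Portable fallback: unsigned multiplication wraps modulo 2^32. */
    for (int i = 0; i < 2; i++)
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
#endif
}
```

Using the unsigned multiply is fine for signed inputs because the low 32 bits of the product do not depend on signedness.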


* [Bug target/109690] bad SLP vectorization on zen
From: hubicka at gcc dot gnu.org @ 2023-05-05 22:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Thanks a lot!  However, there still seems to be a problem with vectorization.

On zen4 I now get:
jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp.c  ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,835.21 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                53      page-faults:u                    #   28.880 /sec
    10,000,113,961      cycles:u                         #    5.449 GHz                         (83.22%)
            31,284      stalled-cycles-frontend:u        #    0.00% frontend cycles idle        (83.23%)
            64,771      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.43%)
     9,000,118,863      instructions:u                   #    0.90  insn per cycle
                                                  #    0.00  stalled cycles per insn     (83.44%)
     2,999,980,507      branches:u                       #    1.635 G/sec                       (83.44%)
             1,445      branch-misses:u                  #    0.00% of all branches             (83.25%)

       1.835610338 seconds time elapsed

       1.835628000 seconds user
       0.000000000 seconds sys


jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native -fno-tree-vectorize slp.c  ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,107.63 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                53      page-faults:u                    #   47.850 /sec
     6,000,774,547      cycles:u                         #    5.418 GHz                         (83.35%)
            32,208      stalled-cycles-frontend:u        #    0.00% frontend cycles idle        (83.39%)
            57,126      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.39%)
     7,999,763,446      instructions:u                   #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn     (83.39%)
     2,999,982,314      branches:u                       #    2.708 G/sec                       (83.39%)
               911      branch-misses:u                  #    0.00% of all branches             (83.09%)

       1.108032230 seconds time elapsed

       1.104079000 seconds user
       0.003985000 seconds sys


With -fno-tree-slp-vectorize I get:
loop:
.LFB0:
        .cfi_startproc
        sall    a(%rip)
        sall    a+4(%rip)
        sall    a+8(%rip)
        ret

which still seems to be faster. It is the same if I do a[i]++:
jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp2.c  ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,832.63 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                54      page-faults:u                    #   29.466 /sec
    10,000,535,003      cycles:u                         #    5.457 GHz                         (83.19%)
            36,576      stalled-cycles-frontend:u        #    0.00% frontend cycles idle        (83.34%)
            75,320      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.41%)
     9,999,890,371      instructions:u                   #    1.00  insn per cycle
                                                  #    0.00  stalled cycles per insn     (83.41%)
     2,999,935,610      branches:u                       #    1.637 G/sec                       (83.41%)
             1,447      branch-misses:u                  #    0.00% of all branches             (83.23%)

       1.833046939 seconds time elapsed

       1.833062000 seconds user
       0.000000000 seconds sys


jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp2.c -fno-tree-vectorize ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,110.15 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                51      page-faults:u                    #   45.940 /sec
     6,000,096,821      cycles:u                         #    5.405 GHz                         (83.17%)
            28,459      stalled-cycles-frontend:u        #    0.00% frontend cycles idle        (83.43%)
            48,165      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.43%)
     7,999,665,012      instructions:u                   #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn     (83.43%)
     2,999,941,619      branches:u                       #    2.702 G/sec                       (83.43%)
               719      branch-misses:u                  #    0.00% of all branches             (83.12%)

       1.110557635 seconds time elapsed

       1.110575000 seconds user
       0.000000000 seconds sys


jh@ryzen4:~/gcc/build/gcc> cat slp2.c
int a[100];

[[gnu::noipa]]
void loop()
{
          for (int i = 0; i < 3; i++)
                  a[i]++;
}
int
main()
{
        for (int j = 0; j < 1000000000; j++)
          loop ();
        return 0;
}


* [Bug target/109690] bad SLP vectorization on zen
From: amonakov at gcc dot gnu.org @ 2023-05-06  8:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Note that the vectorized variant is latency-bound: vector load in loop() waits
for the vector store into the same location done in the previous invocation of
'loop'. This makes the microbenchmark take 10 cycles per iteration (9 cycles as
the vector store forwarding latency, plus 1 cycle for the ALU op).

In contrast, the fully-scalar variant benefits from "memory renaming" in Zen 2
and Zen 4 (absent in Zen 3) where store-forwarding happens earlier in the
pipeline with zero-cycle latency. I think it bottlenecks on taken branches.
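
The per-iteration figure can be checked against the perf output in comment #7 by dividing the cycle count by the number of calls to loop(). A small sketch of that arithmetic; `cycles_per_call` is a hypothetical helper, and the result is an average that also includes the overhead of the calling loop in main().

```c
/* Average cycles per call of loop(), from perf's total cycle count. */
double cycles_per_call(unsigned long long cycles, unsigned long long calls)
{
    if (calls == 0)
        return 0.0;     /* avoid dividing by zero */
    return (double)cycles / (double)calls;
}
```

With 1,000,000,000 calls, the vectorized zen4 run (10,000,113,961 cycles) comes out at about 10.0 cycles per call and the -fno-tree-vectorize run (6,000,774,547 cycles) at about 6.0, in line with the 9 + 1 cycle estimate above.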

