public inbox for gcc-bugs@sourceware.org
* [Bug c/105363] New: -ftree-slp-vectorize decreases performance significantly (x64)
@ 2022-04-24 14:25 mtzguido at gmail dot com
  2022-04-25  3:16 ` [Bug c/105363] " crazylht at gmail dot com
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: mtzguido at gmail dot com @ 2022-04-24 14:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105363

            Bug ID: 105363
           Summary: -ftree-slp-vectorize decreases performance
                    significantly (x64)
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mtzguido at gmail dot com
  Target Milestone: ---

Created attachment 52857
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52857&action=edit
Source file and outputs

Hello,

I found this example where using -O2 (which implies -ftree-slp-vectorize)
decreases performance by about 4x relative to -O1. I've pinned it down to
-ftree-slp-vectorize: -O3 -fno-tree-slp-vectorize performs well.

   $ gcc bug_opt.c -O3 -o bug_opt-O3
   $ time ./bug_opt-O3

   real 0m6.627s
   user 0m6.619s
   sys  0m0.005s

   $ gcc bug_opt.c -O3 -fno-tree-slp-vectorize -o bug_opt-O3-novec
   $ time ./bug_opt-O3-novec

   real 0m1.703s
   user 0m1.701s
   sys  0m0.000s

I've verified this with the current HEAD (1ceddd7497) and with 11.2 (though in
that version -O2 does not imply -ftree-slp-vectorize, so the problem starts to
appear at -O3).
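
As a possible workaround, SLP can be switched off for the one hot function instead of the whole translation unit. The following is a minimal sketch, assuming GCC's `optimize` function attribute (which the GCC manual describes as intended for debugging rather than production use). The macro name `NO_SLP` and the function name are made up for illustration, and the loop body is the minimized insertion sort discussed in this thread.

```c
/* Sketch: disable SLP vectorization for a single function.
   Assumption: GCC's optimize attribute prefixes the string with -f,
   so this corresponds to -fno-tree-slp-vectorize. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP /* other compilers: expands to nothing */
#endif

NO_SLP
void insertionsort_noslp(int a[], int n)
{
  for (int i = 1; i < n; i++) {
    /* Swap the new element downwards until it is in place. */
    for (int j = i - 1; j >= 0 && a[j] > a[j + 1]; j--) {
      int t = a[j + 1];
      a[j + 1] = a[j];
      a[j] = t;
    }
  }
}
```

Whether the attribute avoids the slowdown on a given GCC version should be checked by timing, as with the command-line flag above.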

I've minimized the example into a pretty basic insertion sort.

I have not checked the generated assembly.

I'm attaching the .c source, which has some more comments with timings. I'm
also attaching my /proc/cpuinfo and the temp files generated with -O3. I imagine
the .o and binary are not too helpful, but I can send them if needed.

Thanks,
Guido


* [Bug c/105363] -ftree-slp-vectorize decreases performance significantly (x64)
  2022-04-24 14:25 [Bug c/105363] New: -ftree-slp-vectorize decreases performance significantly (x64) mtzguido at gmail dot com
@ 2022-04-25  3:16 ` crazylht at gmail dot com
  2022-04-25  3:37 ` crazylht at gmail dot com
  2022-04-25  7:21 ` [Bug tree-optimization/105363] " rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: crazylht at gmail dot com @ 2022-04-25  3:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105363

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
Store-to-load forwarding (STLF) issues here.
 Performance counter stats for './12.out':

     1,248,728,604      ld_blocks.store_forward:u

       5.756169101 seconds time elapsed

       5.746946000 seconds user
       0.001999000 seconds sys


This case doesn't need IPA: it's SLP inside a loop that has a
cross-iteration data dependence. I think we need to prevent that.

#define N 50000
int a[N];

void insertionsort(int a[], int n)
{
  int i, j;

  for (i = 1; i < n; i++) {
    for (j = i-1; j >= 0 && a[j] > a[j+1]; j--) {
      int t  = a[j+1];
      a[j+1] = a[j];
      a[j]   = t;
    }
  }
}
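
To time the function above, a small harness along these lines may help. It is only a sketch: the original attachment is not reproduced in this thread, so the descending input and the helper names `fill_descending`, `is_sorted` and `time_sort` are assumptions.

```c
#include <time.h>

#define N 50000
static int a[N];

void insertionsort(int a[], int n)
{
  int i, j;

  for (i = 1; i < n; i++) {
    for (j = i - 1; j >= 0 && a[j] > a[j + 1]; j--) {
      int t = a[j + 1];
      a[j + 1] = a[j];
      a[j] = t;
    }
  }
}

/* Assumed worst-case input: strictly descending, so every inner-loop
   iteration performs a swap. */
static void fill_descending(int v[], int n)
{
  for (int i = 0; i < n; i++)
    v[i] = n - i;
}

/* Returns 1 if v[0..n-1] is in non-decreasing order, 0 otherwise. */
static int is_sorted(const int v[], int n)
{
  for (int i = 1; i < n; i++)
    if (v[i - 1] > v[i])
      return 0;
  return 1;
}

/* Sorts n elements (n <= N) and returns the CPU seconds used,
   or -1.0 if the result is not sorted. */
double time_sort(int n)
{
  fill_descending(a, n);
  clock_t start = clock();
  insertionsort(a, n);
  clock_t end = clock();
  if (!is_sorted(a, n))
    return -1.0;
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_sort(N)` from `main` and printing the result, once built with `-O3` and once with `-O3 -fno-tree-slp-vectorize`, should make the gap visible. With descending input the swap count is N(N-1)/2 = 1,249,975,000, which is in the same range as the `ld_blocks.store_forward` count above; whether the attached test uses such an input is a guess.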

dump:

  <bb 5> [local count: 958878294]:
  MEM <vector(2) int> [(int *)_37] = vect__4.9_45;
  ivtmp.17_47 = ivtmp.17_28 + 18446744073709551612;
  if (_11 != ivtmp.17_47)
    goto <bb 7>; [94.50%]
  else
    goto <bb 6>; [5.50%]

  <bb 6> [local count: 114863531]:
  ivtmp.25_50 = ivtmp.25_9 + 1;
  ivtmp.28_52 = ivtmp.28_51 + 4;
  if (ivtmp.25_50 != _59)
    goto <bb 4>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 7> [local count: 1014686024]:
  # ivtmp.17_28 = PHI <ivtmp.17_47(5), _61(4)>
  _37 = (void *) ivtmp.17_28;
  vect__8.8_46 = MEM <vector(2) int> [(int *)_37];
  vect__4.9_45 = VEC_PERM_EXPR <vect__8.8_46, vect__8.8_46, { 1, 0 }>;
  _43 = BIT_FIELD_REF <vect__8.8_46, 32, 0>;
  _44 = BIT_FIELD_REF <vect__8.8_46, 32, 32>;
  if (_43 > _44)
    goto <bb 5>; [94.50%]
  else
    goto <bb 6>; [5.50%]

  <bb 8> [local count: 14196616]:
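
Read back into C, the inner loop in the dump behaves roughly like the sketch below. This is an interpretation of the GIMPLE, not compiler output; the function name is made up, and `memcpy` stands in for the two-element vector load and store.

```c
#include <string.h>

/* Sketch of the access pattern in bb 5 / bb 7: each iteration does one
   8-byte load and one 8-byte store at &a[j], then moves down by one int
   (the constant 18446744073709551612 is -4 as an unsigned 64-bit value). */
void insertionsort_pairwise(int a[], int n)
{
  for (int i = 1; i < n; i++) {
    for (int j = i - 1; j >= 0; j--) {
      int v[2], swapped[2];
      memcpy(v, &a[j], sizeof v);              /* vect__8.8_46: load a[j], a[j+1] */
      if (!(v[0] > v[1]))                      /* _43 > _44 */
        break;
      swapped[0] = v[1];                       /* VEC_PERM_EXPR { 1, 0 } */
      swapped[1] = v[0];
      memcpy(&a[j], swapped, sizeof swapped);  /* store back to the same address */
    }
  }
}
```

The store in iteration j covers a[j] and a[j+1], while the load in the next iteration covers a[j-1] and a[j]. Only half of that load comes from the store that was just issued, so, as far as I understand store forwarding, the load cannot be served from a single pending store and is blocked, which would explain the `ld_blocks.store_forward` count.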


* [Bug c/105363] -ftree-slp-vectorize decreases performance significantly (x64)
  2022-04-24 14:25 [Bug c/105363] New: -ftree-slp-vectorize decreases performance significantly (x64) mtzguido at gmail dot com
  2022-04-25  3:16 ` [Bug c/105363] " crazylht at gmail dot com
@ 2022-04-25  3:37 ` crazylht at gmail dot com
  2022-04-25  7:21 ` [Bug tree-optimization/105363] " rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: crazylht at gmail dot com @ 2022-04-25  3:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105363

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
Looks like neither ICC nor LLVM vectorized the loop
https://godbolt.org/z/sbheqbE6Y


* [Bug tree-optimization/105363] -ftree-slp-vectorize decreases performance significantly (x64)
  2022-04-24 14:25 [Bug c/105363] New: -ftree-slp-vectorize decreases performance significantly (x64) mtzguido at gmail dot com
  2022-04-25  3:16 ` [Bug c/105363] " crazylht at gmail dot com
  2022-04-25  3:37 ` crazylht at gmail dot com
@ 2022-04-25  7:21 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25  7:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105363

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|c                           |tree-optimization
   Last reconfirmed|                            |2022-04-25
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.c:12:14: note: Cost model analysis:
_8 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
*_7 1 times scalar_load costs 12 in body
*_3 1 times scalar_load costs 12 in body
*_3 1 times vec_perm costs 4 in body
*_7 1 times unaligned_load (misalign -1) costs 12 in body
_8 1 times unaligned_store (misalign -1) costs 12 in body
*_7 1 times vec_to_scalar costs 4 in epilogue
*_3 1 times vec_to_scalar costs 4 in epilogue
t.c:12:14: note: Cost model analysis for part in loop 2:
  Vector cost: 36
  Scalar cost: 48
t.c:12:14: note: Basic block will be vectorized using SLP

which generally looks OK-ish.  It's true that GCC could take into account
the containing loop, which could make it detect the store forwarding problem,
but currently we are not analyzing data references with respect to any loop
for non-loop vectorization.
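
The totals in the dump can be reproduced by adding up the listed entries. The per-statement costs below are copied from the dump; that the totals are plain sums is my reading of it, and the identifiers are made up for illustration.

```c
/* Per-statement costs as printed in the cost model dump. */
enum {
  SCALAR_STORE    = 12,
  SCALAR_LOAD     = 12,
  VEC_PERM        = 4,
  UNALIGNED_LOAD  = 12,
  UNALIGNED_STORE = 12,
  VEC_TO_SCALAR   = 4
};

/* Two scalar stores plus two scalar loads. */
int scalar_cost(void)
{
  return 2 * SCALAR_STORE + 2 * SCALAR_LOAD;
}

/* One permute, one unaligned load and one unaligned store in the body,
   plus two vec_to_scalar extractions in the epilogue. */
int vector_cost(void)
{
  return VEC_PERM + UNALIGNED_LOAD + UNALIGNED_STORE + 2 * VEC_TO_SCALAR;
}
```

Here `scalar_cost()` gives 48 and `vector_cost()` gives 36, matching the dump. None of the listed entries accounts for a failed store forward, so the vector version looks cheaper on paper.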

