[Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
@ 2023-05-11 14:25 aros at gmx dot com
  2023-05-11 14:26 ` [Bug tree-optimization/109812] " aros at gmx dot com
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: aros at gmx dot com @ 2023-05-11 14:25 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

            Bug ID: 109812
           Summary: GraphicsMagick resize is a lot slower in GCC 13.1 vs
                    Clang 16
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: aros at gmx dot com
  Target Milestone: ---

Check this:

https://www.phoronix.com/review/gcc13-clang16-raptorlake/3


* [Bug tree-optimization/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
@ 2023-05-11 14:26 ` aros at gmx dot com
  2023-05-11 15:20 ` [Bug target/109812] " pinskia at gcc dot gnu.org
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: aros at gmx dot com @ 2023-05-11 14:26 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #1 from Artem S. Tashkinov <aros at gmx dot com> ---
Created attachment 55052
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55052&action=edit
Graphs


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
  2023-05-11 14:26 ` [Bug tree-optimization/109812] " aros at gmx dot com
@ 2023-05-11 15:20 ` pinskia at gcc dot gnu.org
  2023-05-11 15:50 ` aros at gmx dot com
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-05-11 15:20 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-05-11
             Status|UNCONFIRMED                 |WAITING
             Target|                            |x86_64-linux-gnu
     Ever confirmed|0                           |1
          Component|tree-optimization           |target

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This bug report and the other ones are really useless. Please read
https://gcc.gnu.org/bugs/ and file a decent bug report.


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
  2023-05-11 14:26 ` [Bug tree-optimization/109812] " aros at gmx dot com
  2023-05-11 15:20 ` [Bug target/109812] " pinskia at gcc dot gnu.org
@ 2023-05-11 15:50 ` aros at gmx dot com
  2023-05-12  8:47 ` aros at gmx dot com
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: aros at gmx dot com @ 2023-05-11 15:50 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Artem S. Tashkinov <aros at gmx dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|WAITING                     |RESOLVED

--- Comment #3 from Artem S. Tashkinov <aros at gmx dot com> ---
According to the latest Phoronix test, which can easily be downloaded, run and
reproduced, GCC 13.1 loses to Clang by a wide margin; in certain workloads it's
~30% (!) slower. I just wanted to alert its developers to a widening gap in
performance vs. Clang. I'm not a developer either, I'm simply no one.

My previous bug reports for performance regressions and deficiencies weren't
met with such ... words, so, I'm sorry, I'm not in the mood to prove anything.
I'll just go ahead and close it as useless, annoying and maybe even outright
invalid.


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (2 preceding siblings ...)
  2023-05-11 15:50 ` aros at gmx dot com
@ 2023-05-12  8:47 ` aros at gmx dot com
  2023-05-16 22:43 ` juzhe.zhong at rivai dot ai
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: aros at gmx dot com @ 2023-05-12  8:47 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Artem S. Tashkinov <aros at gmx dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|RESOLVED                    |NEW
         Resolution|INVALID                     |---


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (3 preceding siblings ...)
  2023-05-12  8:47 ` aros at gmx dot com
@ 2023-05-16 22:43 ` juzhe.zhong at rivai dot ai
  2023-05-17  0:08 ` sjames at gcc dot gnu.org
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-05-16 22:43 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #4 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Thanks for reporting this. Unfortunately, the report on its own does not help us.
Would you mind filing a bug with a simple piece of code that lets us reproduce
the issue and that matters for the benchmark?

Besides, I have read the report. I think this may be an x86 backend issue.
We (downstream RISC-V GCC) have tested various workloads, and it turns out GCC is
better than Clang in traditional CPU benchmarks. Also, Clang is much better than
GCC in AI benchmarks (for example MLPerf).

Starting with the benchmark you mentioned (GraphicsMagick), could you post the
most important piece of code belonging to this benchmark?


Thanks.


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (4 preceding siblings ...)
  2023-05-16 22:43 ` juzhe.zhong at rivai dot ai
@ 2023-05-17  0:08 ` sjames at gcc dot gnu.org
  2023-05-28 16:46 ` hubicka at gcc dot gnu.org
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: sjames at gcc dot gnu.org @ 2023-05-17  0:08 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Sam James <sjames at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sjames at gcc dot gnu.org

--- Comment #5 from Sam James <sjames at gcc dot gnu.org> ---
All of the benchmarks in that report are from
https://github.com/phoronix-test-suite/phoronix-test-suite.

For GraphicsMagick, the relevant benchmark seems to be:
https://github.com/phoronix-test-suite/phoronix-test-suite/blob/dea5e68ba7bc0eaa3646713a8e07100ffab929b5/ob-cache/test-profiles/pts/graphics-magick-1.6.1/test-definition.xml
(it might be a different version of the test, but note that '1.6.1' does NOT
equal the graphicsmagick version)

with a script at
https://github.com/phoronix-test-suite/phoronix-test-suite/blob/dea5e68ba7bc0eaa3646713a8e07100ffab929b5/ob-cache/test-profiles/pts/graphics-magick-1.6.1/install.sh#L25.

I think it runs individual commands like this (OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png $@ null), so:
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -colorspace HWB null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -blur 0x1.0 null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -lat 10x10-5% null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -resize 50% HWB null
* OMP_NUM_THREADS="$NUM_CPU_CORES" ./gm benchmark -duration 60 convert DSC_6782.png -sharpen 0x1.0 HWB null

with GraphicsMagick (gm) built with -fopenmp -O3 -march=native -flto -ltiff
-lfreetype -ljpeg -lXext -lSM -lICE -lX11 -lbz2 -lz -lzstd -lpthread. But I
can't actually find the test image DSC_6782.png, so...

I think we really need more information here before it's actionable. Perhaps
the reporter could reach out to Michael Larabel and ask him to comment here.


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (5 preceding siblings ...)
  2023-05-17  0:08 ` sjames at gcc dot gnu.org
@ 2023-05-28 16:46 ` hubicka at gcc dot gnu.org
  2023-05-28 17:29 ` [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake hubicka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 16:46 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I installed the Phoronix test suite and uploaded the sample data it uses to
http://www.ucw.cz/~hubicka/sample-photo-6000x4000-1.zip

I doubt they make much difference especially for resizing.


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (6 preceding siblings ...)
  2023-05-28 16:46 ` hubicka at gcc dot gnu.org
@ 2023-05-28 17:29 ` hubicka at gcc dot gnu.org
  2023-05-28 17:39 ` hubicka at gcc dot gnu.org
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 17:29 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GraphicsMagick resize is a  |GraphicsMagick resize is a
                   |lot slower in GCC 13.1 vs   |lot slower in GCC 13.1 vs
                   |Clang 16                    |Clang 16 on Intel Raptor
                   |                            |Lake

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
On zen3 hardware I get GCC:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3                     
    Estimated Time To Completion: 4 Minutes [17:00 UTC] 
        Started Run 1 @ 16:57:17
        Started Run 2 @ 16:58:22
        Started Run 3 @ 16:59:26

    Operation: Resizing:
        1390
        1386
        1383

    Average: 1386 Iterations Per Minute
    Deviation: 0.25%

clang16:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [16:54 UTC]
        Started Run 1 @ 16:51:48
        Started Run 2 @ 16:52:52
        Started Run 3 @ 16:53:56

    Operation: Resizing:
        180
        180
        180

    Average: 180 Iterations Per Minute
    Deviation: 0.00%


GCC profile:
  52.07%  VerticalFilter._omp_fn.0                                              
  24.59%  HorizontalFilter._omp_fn.0                                            
  11.78%  ReadCachePixels.isra.0                                                

The Clang build does not seem to use OpenMP, so to get comparable runs I added
OMP_THREAD_LIMIT=1

With this I get:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:17 UTC]
        Started Run 1 @ 17:14:14
        Started Run 2 @ 17:15:18
        Started Run 3 @ 17:16:22

    Operation: Resizing:
        184
        186
        186

    Average: 185 Iterations Per Minute
    Deviation: 0.62%

so the GCC build is still a bit faster. The inner loop of VerticalFilter is:
  0.00 │4a0:┌─→mov          0x8(%rdx),%rax                                  ▒
  1.33 │    │  vmovsd       (%rdx),%xmm1                                    ▒
  1.58 │    │  add          $0x10,%rdx                                      ▒
  0.00 │    │  sub          %r13,%rax                                       ▒
  4.77 │    │  imul         %r11,%rax                                       ▒
  1.01 │    │  add          %rcx,%rax                                       ▒
  0.04 │    │  movzbl       0x2(%r15,%rax,4),%r10d                          ▒
  8.38 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0                               ▒
  2.44 │    │  movzbl       0x1(%r15,%rax,4),%r10d                          ◆
  1.55 │    │  movzbl       (%r15,%rax,4),%eax                              ▒
  0.00 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4                               ▒
 13.91 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0                               ▒
  1.86 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5                               ▒
 13.00 │    │  vcvtsi2sd    %eax,%xmm2,%xmm0                                ▒
  2.02 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3                               ▒
 12.54 │    ├──cmp          %rdx,%rdi                                       ▒
  0.00 │    └──jne          4a0                                             ▒

HorizontalFilter:
  0.01 │520:┌─→mov          0x8(%r8),%rdx                         ▒
  0.96 │    │  vmovsd       (%r8),%xmm1                           ▒
  1.93 │    │  add          $0x10,%r8                             ▒
  0.50 │    │  sub          %r15,%rdx                             ▒
  4.02 │    │  add          %r11,%rdx                             ▒
  2.26 │    │  movzbl       0x2(%r14,%rdx,4),%ebx                 ▒
  0.09 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0                      ▒
 10.10 │    │  movzbl       0x1(%r14,%rdx,4),%ebx                 ◆
  0.92 │    │  movzbl       (%r14,%rdx,4),%edx                    ▒
  1.84 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4                     ▒
  6.82 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0                      ▒
 11.15 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3                     ▒
 13.81 │    │  vcvtsi2sd    %edx,%xmm2,%xmm0                      ▒
  6.16 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5                     ▒
  8.61 │    ├──cmp          %rsi,%r8                              ▒
  1.56 │    └──jne          520                                   ▒

ReadCachePixels:
       │2e0:┌─→mov    (%rbx,%rax,4),%edx                          ▒
 83.03 │    │  mov    %edx,(%r12,%rax,4)                          ▒
 12.34 │    │  inc    %rax                                        ▒
  0.02 │    ├──cmp    %rsi,%rax                                   ▒

With Clang I get:
  49.08% VerticalFilter                                                         
  24.66% HorizontalFilter                                                       
  18.41% ReadCachePixels                                                        
   6.75% SyncCacheViewPixels

  0.00 │1c50:┌─→mov          (%rdx,%rsi,1),%r9                    ▒
  0.09 │     │  vmovddup     -0x8(%rdx,%rsi,1),%xmm3              ▒
  0.00 │     │  add          $0x10,%rsi                           ▒
  0.75 │     │  sub          %rdi,%r9                             ▒
  0.00 │     │  imul         %rcx,%r9                             ▒
  1.07 │     │  add          %r11,%r9                             ▒
  0.81 │     │  movzbl       0x2(%r14,%r9,4),%r10d                ▒
  3.73 │     │  movzwl       (%r14,%r9,4),%r9d                    ▒
  0.00 │     │  vcvtsi2sd    %r10d,%xmm14,%xmm2                   ▒
  0.11 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1                    ▒
  2.57 │     │  vmovd        %r9d,%xmm2                           ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                          ▒
  0.95 │     │  vcvtdq2pd    %xmm2,%xmm2                          ▒
  0.74 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0                    ▒
 11.46 │     ├──cmp          %rsi,%r8                             ▒

       │1b50:┌─→mov          (%r10,%rdi,1),%rcx                   ▒
  0.76 │     │  vmovddup     -0x8(%r10,%rdi,1),%xmm3              ▒
  0.00 │     │  add          $0x10,%rdi                           ▒
  0.05 │     │  sub          %r8,%rcx                             ▒
  0.30 │     │  add          %rsi,%rcx                            ▒
  0.27 │     │  movzbl       0x2(%r14,%rcx,4),%ebp                ▒
  0.28 │     │  movzwl       (%r14,%rcx,4),%ecx                   ▒
  4.51 │     │  vcvtsi2sd    %ebp,%xmm13,%xmm2                    ▒
  0.75 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1                    ▒
  0.99 │     │  vmovd        %ecx,%xmm2                           ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                          ▒
  0.29 │     │  vcvtdq2pd    %xmm2,%xmm2                          ▒
  0.27 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0                    ▒
 12.37 │     ├──cmp          %rdi,%r9                             ▒
  0.16 │     └──jne          1b50                                 ▒

  0.01 │        test    %r10,%r10                                 ▒
  0.01 │      ↓ jle     28b4                                      ▒
       │        lea     0x0(,%r15,4),%rcx                         ▒
  0.01 │        mov     0xd8(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r8,4),%rcx                         ▒
  0.01 │        lea     (%rcx,%rbp,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%rdi,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%rax,4),%rcx                        ▒
  0.02 │        lea     (%rcx,%rdx,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xc8(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r9,4),%rcx                         ▒
  0.01 │        lea     (%rcx,%r13,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r11,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r12,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xb8(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0xb0(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.01 │        mov     0xa8(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x98(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%rsi,4),%rcx                        ▒
  0.03 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x88(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xa0(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.01 │        mov     0x90(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x58(%rsp),%r10                           ▒
  0.02 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x50(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x48(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x40(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x38(%rsp),%r10                           ▒
  0.02 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x60(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x68(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x70(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x78(%rsp),%r10                           ▒
  0.03 │        lea     (%rcx,%r10,4),%rcx                        ◆
  0.03 │        mov     0x80(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        add     0x28(%rsp),%rcx                           ▒
  0.03 │        mov     %rcx,0xf0(%rsp)                           ▒
  0.00 │        xor     %ecx,%ecx                                 ▒
       │2584:   mov     0xf0(%rsp),%r10                           ▒
  0.01 │        mov     (%r10,%rcx,4),%r10d                       ▒
  3.58 │        inc     %rcx                                      ▒
  0.03 │        mov     %r10d,(%r14)                              ▒
  0.02 │        mov     0x30(%rsp),%r10                           ▒
  0.01 │        add     $0x4,%r14                                 ▒
  0.01 │        mov     (%r10),%r10                               ▒
  0.06 │        cmp     %r10,%rcx                                 ▒
  0.05 │      ↑ jl      2584                                      ▒

So I suppose the filter loops are vectorized while the memcpy loop is unrolled
(in a very odd way).  I guess the vectorization does not help on zen3 but may
help on Raptor Lake.
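
For readers following the disassembly, the scalar loop that both compilers start from can be sketched in C roughly as below. This is only a reading of the VerticalFilter listings above, not the GraphicsMagick source: the type names, the function name and the order of the colour fields are made up for illustration. The 16-byte contribution stride is taken from the add $0x10, and the 4-byte pixel stride from the (...,4) addressing, assuming x86_64 Linux where long is 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the GraphicsMagick declarations. */
typedef struct {
    double weight;   /* loaded with vmovsd (%rdx)  */
    long   pixel;    /* loaded with mov 0x8(%rdx)  */
} contribution_t;    /* 16 bytes on x86_64, hence add $0x10 */

typedef struct {
    unsigned char red, green, blue, opacity;  /* field order is an assumption */
} pixel_t;           /* 4 bytes, hence the scale factor 4 in the loads */

typedef struct { double red, green, blue; } accum_t;

/* Weighted sum of one output pixel over n contributing source rows. */
accum_t vertical_accumulate(const pixel_t *p, const contribution_t *contrib,
                            size_t n, long columns, long x)
{
    accum_t acc = {0.0, 0.0, 0.0};
    assert(n > 0);                       /* contrib[0] is read below */
    const long first = contrib[0].pixel;

    for (size_t i = 0; i < n; i++) {
        /* sub, imul, add in the listings */
        long j = (contrib[i].pixel - first) * columns + x;
        double w = contrib[i].weight;

        /* three byte loads, three int->double conversions, three FMAs */
        acc.red   += w * p[j].red;
        acc.green += w * p[j].green;
        acc.blue  += w * p[j].blue;
    }
    return acc;
}
```

The HorizontalFilter loop looks the same except that the index computation has no multiply. In the Clang listings two of the three channels share one packed conversion and one vfmadd231pd, while the third stays scalar.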


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (7 preceding siblings ...)
  2023-05-28 17:29 ` [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake hubicka at gcc dot gnu.org
@ 2023-05-28 17:39 ` hubicka at gcc dot gnu.org
  2023-05-28 18:11 ` hubicka at gcc dot gnu.org
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 17:39 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 55178
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55178&action=edit
Preprocessed source of VerticalFilter and HorizontalFilter


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (8 preceding siblings ...)
  2023-05-28 17:39 ` hubicka at gcc dot gnu.org
@ 2023-05-28 18:11 ` hubicka at gcc dot gnu.org
  2023-05-28 18:50 ` hubicka at gcc dot gnu.org
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 18:11 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Oddly enough, a simplified version of the loop gets SLP vectorized for me:
struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b;};

struct drgb sum()
{
        struct drgb r;
        for (int i = 0; i < 100000; i++)
        {
          int j = addr[i];
          double w = weights[i];
          r.r += rgbs[j].r * w;
          r.g += rgbs[j].g * w;
          r.b += rgbs[j].b * w;
        }
        return r;
}
I get:
.L2:
        movslq  (%r9,%rdx,4), %rax
        vmovsd  (%r8,%rdx,8), %xmm1
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rsi, %rax
        movzbl  (%rax), %ecx
        vmovddup        %xmm1, %xmm4
        vmovd   %ecx, %xmm0
        movzbl  1(%rax), %ecx
        movzbl  2(%rax), %eax
        vpinsrd $1, %ecx, %xmm0, %xmm0
        vcvtdq2pd       %xmm0, %xmm0
        vfmadd231pd     %xmm4, %xmm0, %xmm2
        vcvtsi2sdl      %eax, %xmm5, %xmm0
        vfmadd231sd     %xmm1, %xmm0, %xmm3
        cmpq    $100000, %rdx
        jne     .L2


I think the actual loop is:
  <bb 53> [local count: 44202554]:
  _106 = _262->pixel;
  _109 = *source_231(D).columns;

  <bb 54> [local count: 401841405]:
  # pixel$green_332 = PHI <_124(89), pixel$green_265(53)>
  # i_357 = PHI <i_298(89), 0(53)>
  # pixel$red_371 = PHI <_119(89), pixel$red_263(53)>
  # pixel$blue_377 = PHI <_129(89), pixel$blue_267(53)>
  i.51_102 = (long unsigned int) i_357;
  _103 = i.51_102 * 16;
  _104 = _262 + _103;
  _105 = _104->pixel;
  _107 = _105 - _106;
  _108 = (long unsigned int) _107;
  _110 = _108 * _109;
  _112 = _110 + _621;
  weight_297 = _104->weight;
  _113 = _112 * 4;
  _114 = _276 + _113;
  _115 = _114->red;
  _116 = (int) _115;
  _117 = (double) _116;
  _118 = _117 * weight_297;
  _119 = _118 + pixel$red_371;
  _120 = _114->green;
  _121 = (int) _120;
  _122 = (double) _121;
  _123 = _122 * weight_297;
  _124 = _123 + pixel$green_332;
  _125 = _114->blue;
  _126 = (int) _125;
  _127 = (double) _126;
  _128 = _127 * weight_297;
  _129 = _128 + pixel$blue_377;
  i_298 = i_357 + 1;
  if (n_195 > i_298)
    goto <bb 89>; [89.00%]
  else
    goto <bb 118>; [11.00%]

  <bb 118> [local count: 44202554]:
  # _607 = PHI <_124(54)>
  # _606 = PHI <_119(54)>
  # _605 = PHI <_129(54)>
  goto <bb 55>; [100.00%]

  <bb 89> [local count: 357638851]:
  goto <bb 54>; [100.00%]


and the SLP vectorizer seems to claim:
../magick/resize.c:1284:52: note:       _125 = _114->blue;
../magick/resize.c:1284:52: note:       _120 = _114->green;
../magick/resize.c:1284:52: note:       _115 = _114->red;
../magick/resize.c:1284:52: missed:   not consecutive access weight_297 = _104->weight;
../magick/resize.c:1284:52: missed:   not consecutive access _105 = _104->pixel;
../magick/resize.c:1284:52: missed:   not consecutive access _134->red = iftmp.57_207;
../magick/resize.c:1284:52: missed:   not consecutive access _134->green = iftmp.60_208;
../magick/resize.c:1284:52: missed:   not consecutive access _134->blue = iftmp.63_209;
../magick/resize.c:1284:52: missed:   not consecutive access _134->opacity = 0;
../magick/resize.c:1284:52: missed:   not consecutive access _63 = *source_231(D).columns;
../magick/resize.c:1284:52: missed:   not consecutive access _60 = _262->pixel;

Not sure if that is related. Adding a fourth field to struct drgb, as in the
real testcase:


struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b,o;};

struct drgb sum()
{
        struct drgb r;
        for (int i = 0; i < 100000; i++)
        {
          int j = addr[i];
          double w = weights[i];
          r.r += rgbs[j].r * w;
          r.g += rgbs[j].g * w;
          r.b += rgbs[j].b * w;
        }
        return r;
}

makes us miss the vectorization even though nothing uses drgb->o:

sum:
.LFB0:
        .cfi_startproc
        movq    %rdi, %r8
        movq    weights(%rip), %rsi
        movq    addr(%rip), %rdi
        vxorps  %xmm2, %xmm2, %xmm2
        movq    rgbs(%rip), %rcx
        xorl    %edx, %edx
        .p2align 4
        .p2align 3
.L2:
        movslq  (%rdi,%rdx,4), %rax
        vmovsd  (%rsi,%rdx,8), %xmm0
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rcx, %rax
        movzbl  (%rax), %r9d
        vcvtsi2sdl      %r9d, %xmm2, %xmm1
        movzbl  1(%rax), %r9d
        movzbl  2(%rax), %eax
        vfmadd231sd     %xmm0, %xmm1, %xmm3
        vcvtsi2sdl      %r9d, %xmm2, %xmm1
        vfmadd231sd     %xmm0, %xmm1, %xmm5
        vcvtsi2sdl      %eax, %xmm2, %xmm1
        vfmadd231sd     %xmm0, %xmm1, %xmm4
        cmpq    $100000, %rdx
        jne     .L2
        vmovq   %xmm4, %xmm4
        vunpcklpd       %xmm5, %xmm3, %xmm0
        movq    %r8, %rax
        vinsertf128     $0x1, %xmm4, %ymm0, %ymm0
        vmovupd %ymm0, (%r8)
        vzeroupper
        ret
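
To make the difference between the two listings concrete, here is a hand-written sketch of what the vectorized three-field version computes, next to the plain scalar loop. It is an illustration only, not compiler output: the function names are invented, pointers are passed as parameters instead of the globals used in the testcases, the accumulators are zero-initialized, and the v2df typedef relies on the GCC/Clang vector_size extension. The r and g sums share one two-lane vector (the vfmadd231pd), while b stays scalar (the vfmadd231sd).

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang vector extension: two doubles in one 16-byte vector. */
typedef double v2df __attribute__((vector_size(16)));

struct rgb  { unsigned char r, g, b; };
struct drgb { double r, g, b; };

/* Plain scalar reference, same arithmetic as the simplified testcase. */
struct drgb sum_scalar(const struct rgb *rgbs, const int *addr,
                       const double *weights, size_t n)
{
    struct drgb acc = {0.0, 0.0, 0.0};
    assert(rgbs && addr && weights);
    for (size_t i = 0; i < n; i++) {
        const struct rgb *p = &rgbs[addr[i]];
        double w = weights[i];
        acc.r += p->r * w;
        acc.g += p->g * w;
        acc.b += p->b * w;
    }
    return acc;
}

/* Hand-written analogue of the SLP result: r and g packed, b scalar. */
struct drgb sum_split(const struct rgb *rgbs, const int *addr,
                      const double *weights, size_t n)
{
    v2df rg = {0.0, 0.0};
    double b = 0.0;
    assert(rgbs && addr && weights);
    for (size_t i = 0; i < n; i++) {
        const struct rgb *p = &rgbs[addr[i]];
        double w = weights[i];
        v2df px = {(double)p->r, (double)p->g};  /* packed conversion    */
        v2df ww = {w, w};                        /* broadcast (vmovddup) */
        rg += px * ww;                           /* packed multiply-add  */
        b  += p->b * w;                          /* scalar multiply-add  */
    }
    struct drgb acc = {rg[0], rg[1], b};
    return acc;
}
```

Both functions perform the same per-channel operations in the same order, so they should agree; the only question for the compiler is whether it discovers the packed form on its own.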


* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (9 preceding siblings ...)
  2023-05-28 18:11 ` hubicka at gcc dot gnu.org
@ 2023-05-28 18:50 ` hubicka at gcc dot gnu.org
  2023-05-30  0:05 ` zhangjungcc at gmail dot com
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-28 18:50 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
This is a benchmarkable version of the simplified testcase:

jan@localhost:/tmp> cat t.c
#define N 10000000
struct rgb {unsigned char r,g,b;} rgbs[N];
int *addr;
struct drgb {double r,g,b;
#ifdef OPACITY
             double o;
#endif
};

struct drgb sum(double w)
{
        struct drgb r;
        for (int i = 0; i < N; i++)
        {
          r.r += rgbs[i].r * w;
          r.g += rgbs[i].g * w;
          r.b += rgbs[i].b * w;
        }
        return r;
}
jan@localhost:/tmp> cat q.c
struct drgb {double r,g,b;
#ifdef OPACITY
             double o;
#endif
};
struct drgb sum(double w);
int
main()
{
        for (int i = 0; i < 1000; i++)
                sum(i);
}


jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g ; objdump -d a.out | grep vfmadd231pd  ; perf stat ./a.out
  40119d:       c4 e2 d9 b8 d1          vfmadd231pd %xmm1,%xmm4,%xmm2

 Performance counter stats for './a.out':

         12,148.04 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               736      page-faults:u                    #   60.586 /sec
    50,018,421,148      cycles:u                         #    4.117 GHz
           220,502      stalled-cycles-frontend:u        #    0.00% frontend cycles idle
    39,950,154,369      stalled-cycles-backend:u         #   79.87% backend cycles idle
   120,000,191,713      instructions:u                   #    2.40  insn per cycle
                                                  #    0.33  stalled cycles per insn
    10,000,048,918      branches:u                       #  823.182 M/sec
             7,959      branch-misses:u                  #    0.00% of all branches

      12.149466078 seconds time elapsed

      12.149084000 seconds user
       0.000000000 seconds sys


jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g -DOPACITY ; objdump -d a.out | grep vfmadd231pd  ; perf stat ./a.out

 Performance counter stats for './a.out':

         12,141.11 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               735      page-faults:u                    #   60.538 /sec
    50,018,839,129      cycles:u                         #    4.120 GHz
           185,034      stalled-cycles-frontend:u        #    0.00% frontend cycles idle
    29,963,999,798      stalled-cycles-backend:u         #   59.91% backend cycles idle
   120,000,191,729      instructions:u                   #    2.40  insn per cycle
                                                  #    0.25  stalled cycles per insn
    10,000,048,913      branches:u                       #  823.652 M/sec
             7,311      branch-misses:u                  #    0.00% of all branches

      12.142252354 seconds time elapsed

      12.138237000 seconds user
       0.004000000 seconds sys


So on zen2 hardware I get the same performance with both builds.  It may be
interesting to test it on Raptor Lake.
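
As a rough cross-check of the counters above, the totals can be divided by the
number of loop iterations (q.c calls sum() 1000 times, and each call walks
N = 10000000 pixels).  The helper below is only an illustrative sketch; the
function and variable names are made up here and are not part of the testcase.

```c
#include <stdio.h>

/* Counter values copied from the first perf stat run (built without -DOPACITY). */
static const double cycles       = 50018421148.0;
static const double instructions = 120000191713.0;
static const double branches     = 10000048918.0;

/* 1000 calls to sum() in q.c, each looping over N = 10000000 pixels. */
static const double iterations = 1000.0 * 10000000.0;

/* Hypothetical helper: average count of a perf event per loop iteration. */
double per_iteration(double counter)
{
    return counter / iterations;
}

/* Print the three per-iteration averages. */
void print_summary(void)
{
    printf("cycles/iter:       %.3f\n", per_iteration(cycles));
    printf("instructions/iter: %.3f\n", per_iteration(instructions));
    printf("branches/iter:     %.3f\n", per_iteration(branches));
}
```

This works out to about 5 cycles, 12 instructions and 1 branch per pixel.  The
-DOPACITY run gives essentially the same figures, which matches the
observation that both builds perform alike on this machine.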

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (10 preceding siblings ...)
  2023-05-28 18:50 ` hubicka at gcc dot gnu.org
@ 2023-05-30  0:05 ` zhangjungcc at gmail dot com
  2023-05-31 12:42 ` hubicka at ucw dot cz
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: zhangjungcc at gmail dot com @ 2023-05-30  0:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #11 from jun zhang <zhangjungcc at gmail dot com> ---
Hello Hubicka and Artem,
I am trying to reproduce this issue on Raptor Lake.
Building with -fopenmp -O3 -flto fails with the error below,
while building with -fopenmp -O3 (without -flto) succeeds.
Could you help me?

libtool: link: /home/sdp/jun/gcc0/install/bin/gcc -fopenmp -O3 -flto
-march=native -Wall -o utilities/gm utilities/gm.o
-L/home/sdp/jun/omp/Ofast/pts_g_gomp/install/.phoronix-test-suite/installed-tests/pts/graphics-magick-2.1.0/gm_/lib
magick/.libs/libGraphicsMagick.a -lfreetype -ljbig -ltiff -ljpeg
-lXext -lSM -lICE -lX11 -llzma -lbz2 -lz -lzstd -lm -lpthread -fopenmp
/home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
function `main':
<artificial>:(.text.startup+0x1): undefined reference to `GMCommand'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:6411: utilities/gm] Error 1
make[1]: Leaving directory



* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (11 preceding siblings ...)
  2023-05-30  0:05 ` zhangjungcc at gmail dot com
@ 2023-05-31 12:42 ` hubicka at ucw dot cz
  2023-05-31 16:11 ` hubicka at gcc dot gnu.org
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at ucw dot cz @ 2023-05-31 12:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #12 from Jan Hubicka <hubicka at ucw dot cz> ---
> /home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
> function `main':
> <artificial>:(.text.startup+0x1): undefined reference to `GMCommand'

I wonder if your linker plugin is configured correctly.  Can you try building
with -flto -fuse-linker-plugin?

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (12 preceding siblings ...)
  2023-05-31 12:42 ` hubicka at ucw dot cz
@ 2023-05-31 16:11 ` hubicka at gcc dot gnu.org
  2023-05-31 16:52 ` jamborm at gcc dot gnu.org
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-31 16:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenther at suse dot de
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=110062

--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The only difference between the two variants at SLP vectorization is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  <retval>.r = _68;
-  <retval>.g = _67;
-  <retval>.b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  <retval>.r = _70;
+  <retval>.g = _69;
+  <retval>.b = _68;
+  <retval>.o = r$o_33(D);

so SRA invents r$o_33(D) even though that variable is undefined.

The SLP vectorizer then sees it as interleaving stores:

-t.c:19:16: note:       _1 = rgbs[i_35].r;
-t.c:19:16: note:       _7 = rgbs[i_35].g;
-t.c:19:16: note:       _12 = rgbs[i_35].b;
-t.c:19:16: note:   Detected interleaving store of size 3
-t.c:19:16: note:       <retval>.r = _68;
-t.c:19:16: note:       <retval>.g = _67;
-t.c:19:16: note:       <retval>.b = _66;
+t.c:19:16: note:       _1 = rgbs[i_37].r;
+t.c:19:16: note:       _7 = rgbs[i_37].g;
+t.c:19:16: note:       _12 = rgbs[i_37].b;
+t.c:19:16: note:   Detected interleaving store of size 4
+t.c:19:16: note:       <retval>.r = _70;
+t.c:19:16: note:       <retval>.g = _69;
+t.c:19:16: note:       <retval>.b = _68;
+t.c:19:16: note:       <retval>.o = r$o_33(D);

In the first case it first tries to vectorize the group of 3 doubles and fails:

-t.c:19:16: note:     <retval>.r = _68;
-t.c:19:16: note:     <retval>.g = _67;
-t.c:19:16: note:     <retval>.b = _66;
-t.c:19:16: note:   starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note:   Build SLP for <retval>.r = _68;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for <retval>.g = _67;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for <retval>.b = _66;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   SLP discovery for node 0x2cb4fe8 failed

Later it tries to vectorize the first 2 items:

-t.c:19:16: note:   Splitting SLP group at stmt 2
-t.c:19:16: note:   Split group into 2 and 1
-t.c:19:16: note:   Starting SLP discovery for
-t.c:19:16: note:     <retval>.r = _68;
-t.c:19:16: note:     <retval>.g = _67;
-t.c:19:16

... and, after a lot more output, succeeds.

If the opacity field is present we start with a vector of size 4:
+t.c:19:16: note:     <retval>.r = _70;
+t.c:19:16: note:     <retval>.g = _69;
+t.c:19:16: note:     <retval>.b = _68;
+t.c:19:16: note:     <retval>.o = r$o_33(D);


+t.c:19:16: note:   vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _69 = PHI <_11(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _68 = PHI <_16(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand r$o_33(D), type of def: external
+t.c:19:16: missed:   treating operand as external
+t.c:19:16: note:   SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note:   SLP size 1 vs. limit 23.
+t.c:19:16: note:   Final SLP tree for instance 0x2def840:
+t.c:19:16: note:   node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note:   op template: <retval>.r = _70;
+t.c:19:16: note:       stmt 0 <retval>.r = _70;
+t.c:19:16: note:       stmt 1 <retval>.g = _69;
+t.c:19:16: note:       stmt 2 <retval>.b = _68;
+t.c:19:16: note:       stmt 3 <retval>.o = r$o_33(D);
+t.c:19:16: note:       children 0x2e800d8
+t.c:19:16: note:   node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note:       { _70, _69, _68, r$o_33(D) }

So it seems to succeed vectorizing with 4 entries but it does so for the single
return statement:

  <bb 3> [local count: 1063004409]:
  # i_37 = PHI <i_22(5), 0(2)>
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI <ivtmp_66(5), 10000000(2)>
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 1052374367]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 = {_70, _69, _68, r$o_33(D)};
  MEM <vector(4) double> [(double *)&<retval>] = _65;

That seems somewhat pointless.
If one adds code initializing the opacity field, then vectorization works well.
So perhaps the SLP vectorizer needs to be told how to deal with uninitialized
variables, which may be common in code like this after SRA?
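
For reference, a variant of the testcase with every field of the accumulator
initialized could look like the sketch below.  It is only an illustration of
what "initializing the opacity field" might mean here: the function name
sum_init and the smaller N are assumptions made to keep the example quick to
run, and no claim is made about the code GCC generates for it.

```c
#define N 1000  /* smaller than the N in the thread, to keep the sketch quick */

struct rgb  { unsigned char r, g, b; };
struct drgb { double r, g, b, o; };   /* layout of the -DOPACITY variant */

struct rgb rgbs[N];

/* Same loop as sum() in t.c, but with all four accumulator fields, including
   the opacity field o, explicitly set to zero before the loop. */
struct drgb sum_init(double w)
{
    struct drgb acc = { 0.0, 0.0, 0.0, 0.0 };

    for (int i = 0; i < N; i++) {
        acc.r += rgbs[i].r * w;
        acc.g += rgbs[i].g * w;
        acc.b += rgbs[i].b * w;
    }
    return acc;
}
```

With the accumulator initialized, the store to the o field of the return value
no longer reads an undefined value such as r$o_33(D).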

Richi, it is not clear to me where the SLP vectorizer discards the idea of
vectorizing the loop body in this case. But I think one needs to address:
+t.c:19:16: missed:   treating operand as external

I wonder if the loop would work faster if it used vectors of size 4 with the
last field unused.
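
The idea of using 4-wide vectors with an unused last lane can be written out
by hand with the GCC/Clang vector extension, as in the sketch below.  This
only shows the shape of such a loop: the names v4df and sum_padded and the
smaller N are assumptions of this example, and whether it actually runs faster
is exactly the open question.

```c
#define N 1000  /* smaller than the N in the thread, to keep the sketch quick */

struct rgb  { unsigned char r, g, b; };
struct drgb { double r, g, b, o; };

struct rgb rgbs[N];

/* GCC/Clang vector extension: four doubles treated as one value. */
typedef double v4df __attribute__((vector_size(32)));

/* Accumulate r, g and b in lanes 0..2 of one 4-wide vector; lane 3 is
   padding and stays zero. */
struct drgb sum_padded(double w)
{
    v4df acc = { 0.0, 0.0, 0.0, 0.0 };
    const v4df wv = { w, w, w, 0.0 };

    for (int i = 0; i < N; i++) {
        v4df px = { (double)rgbs[i].r, (double)rgbs[i].g,
                    (double)rgbs[i].b, 0.0 };
        acc += px * wv;   /* element-wise multiply and add on all four lanes */
    }

    struct drgb out = { acc[0], acc[1], acc[2], acc[3] };
    return out;
}
```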

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (13 preceding siblings ...)
  2023-05-31 16:11 ` hubicka at gcc dot gnu.org
@ 2023-05-31 16:52 ` jamborm at gcc dot gnu.org
  2023-06-01  9:38 ` jamborm at gcc dot gnu.org
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: jamborm at gcc dot gnu.org @ 2023-05-31 16:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #14 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #13)
> The only difference between slp vectorization is:
> 
> -  # _68 = PHI <_5(3)>
> -  # _67 = PHI <_11(3)>
> -  # _66 = PHI <_16(3)>
> -  <retval>.r = _68;
> -  <retval>.g = _67;
> -  <retval>.b = _66;
> +  # _70 = PHI <_5(3)>
> +  # _69 = PHI <_11(3)>
> +  # _68 = PHI <_16(3)>
> +  <retval>.r = _70;
> +  <retval>.g = _69;
> +  <retval>.b = _68;
> +  <retval>.o = r$o_33(D);
> 
> so SRA invents r$o_33(D) even if that variable is undefined.

Is this the testcase from comment #10 ?  I don't see r$o in my dumps.

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (14 preceding siblings ...)
  2023-05-31 16:52 ` jamborm at gcc dot gnu.org
@ 2023-06-01  9:38 ` jamborm at gcc dot gnu.org
  2023-06-01 11:19 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: jamborm at gcc dot gnu.org @ 2023-06-01  9:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #15 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Oh, because I missed the -DOPACITY in the second command line.  The reason for
SRA creating the replacement is total scalarization :-/

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (15 preceding siblings ...)
  2023-06-01  9:38 ` jamborm at gcc dot gnu.org
@ 2023-06-01 11:19 ` jakub at gcc dot gnu.org
  2023-06-01 12:28 ` hubicka at gcc dot gnu.org
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-06-01 11:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Shouldn't we DCE something = x_N(D); stores when x is a VAR_DECL, at least
provided something can't trap?  I mean, the previous content is one of the
possible uninitialized values.

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (16 preceding siblings ...)
  2023-06-01 11:19 ` jakub at gcc dot gnu.org
@ 2023-06-01 12:28 ` hubicka at gcc dot gnu.org
  2023-06-21  9:46 ` ubizjak at gmail dot com
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-06-01 12:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #17 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I was also thinking of DCE. It looks like a plausible idea.  It may lead to a
surprise where you store the same undefined variable to two places and later
compare them for equality, but that is undefined anyway.

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (17 preceding siblings ...)
  2023-06-01 12:28 ` hubicka at gcc dot gnu.org
@ 2023-06-21  9:46 ` ubizjak at gmail dot com
  2023-10-12  4:48 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2023-06-21  9:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #18 from Uroš Bizjak <ubizjak at gmail dot com> ---
One interesting observation:

clang is able to do this:

  0.09 │     │  vmovddup     -0x8(%rdx,%rsi,1),%xmm3
  ...
  0.11 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1
  ...
  0.74 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0

It figures out that the duplicated V2DFmode value in %xmm3 can also be accessed
in the same register as a DFmode value.

OTOH, current gcc does:

        vmovsd  (%rsi,%rax,8), %xmm1
        ...
        vmovddup        %xmm1, %xmm4
        ...
        vfmadd231pd     %xmm4, %xmm0, %xmm2
        ...
        vfmadd231sd     %xmm1, %xmm0, %xmm3

The above code needs two registers.
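
At source level the same idea can be sketched with the GCC/Clang vector
extension: the weight is broadcast once into a two-lane vector, the packed
multiply-add uses the whole vector, and the scalar multiply-add reads lane 0
of that same vector.  This is an illustration only.  The names v2df and
sum_split and the smaller N are assumptions of this example, and the compiler
remains free to allocate one or two registers for the weight.

```c
#define N 1000  /* smaller than the N in the thread, to keep the sketch quick */

struct rgb  { unsigned char r, g, b; };
struct drgb { double r, g, b; };

struct rgb rgbs[N];

/* GCC/Clang vector extension: two doubles treated as one value. */
typedef double v2df __attribute__((vector_size(16)));

/* r and g are accumulated as a pair (packed), b as a scalar: the 2 + 1 split. */
struct drgb sum_split(double w)
{
    const v2df wv = { w, w };   /* weight duplicated into both lanes */
    v2df rg = { 0.0, 0.0 };
    double b = 0.0;

    for (int i = 0; i < N; i++) {
        v2df px = { (double)rgbs[i].r, (double)rgbs[i].g };
        rg += px * wv;               /* packed multiply-add on two lanes */
        b  += rgbs[i].b * wv[0];     /* scalar multiply-add reusing lane 0 */
    }

    struct drgb out = { rg[0], rg[1], b };
    return out;
}
```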

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (18 preceding siblings ...)
  2023-06-21  9:46 ` ubizjak at gmail dot com
@ 2023-10-12  4:48 ` cvs-commit at gcc dot gnu.org
  2023-11-24 23:38 ` hubicka at gcc dot gnu.org
  2023-11-25 10:21 ` liuhongt at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-10-12  4:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #19 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:e1e127de18dbee47b88fa0ce74a1c7f4d658dc68

commit r14-4571-ge1e127de18dbee47b88fa0ce74a1c7f4d658dc68
Author: Zhang, Jun <jun.zhang@intel.com>
Date:   Fri Sep 22 23:56:37 2023 +0800

    x86: set spincount 1 for x86 hybrid platform

    By test, we find in hybrid platform spincount 1 is better.

    Use '-march=native -Ofast -funroll-loops -flto',
    results as follows:

    spec2017 speed   RPL     ADL
    657.xz_s         0.00%   0.50%
    603.bwaves_s     10.90%  26.20%
    607.cactuBSSN_s  5.50%   72.50%
    619.lbm_s        2.40%   2.50%
    621.wrf_s        -7.70%  2.40%
    627.cam4_s       0.50%   0.70%
    628.pop2_s       48.20%  153.00%
    638.imagick_s    -0.10%  0.20%
    644.nab_s        2.30%   1.40%
    649.fotonik3d_s  8.00%   13.80%
    654.roms_s       1.20%   1.10%
    Geomean-int      0.00%   0.50%
    Geomean-fp       6.30%   21.10%
    Geomean-all      5.70%   19.10%

    omp2012          RPL     ADL
    350.md           -1.81%  -1.75%
    351.bwaves       7.72%   12.50%
    352.nab          14.63%  19.71%
    357.bt331        -0.20%  1.77%
    358.botsalgn     0.00%   0.00%
    359.botsspar     0.00%   0.65%
    360.ilbdc        0.00%   0.25%
    362.fma3d        2.66%   -0.51%
    363.swim         10.44%  0.00%
    367.imagick      0.00%   0.12%
    370.mgrid331     2.49%   25.56%
    371.applu331     1.06%   4.22%
    372.smithwa      0.74%   3.34%
    376.kdtree       10.67%  16.03%
    GEOMEAN          3.34%   5.53%

    include/ChangeLog:

            PR target/109812
            * spincount.h: New file.

    libgomp/ChangeLog:

            * env.c (initialize_env): Use do_adjust_default_spincount.
            * config/linux/x86/spincount.h: New file.

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (19 preceding siblings ...)
  2023-10-12  4:48 ` cvs-commit at gcc dot gnu.org
@ 2023-11-24 23:38 ` hubicka at gcc dot gnu.org
  2023-11-25 10:21 ` liuhongt at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-11-24 23:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
On zen4 hardware I now get

GCC13 with -O3 -flto -march=native -fopenmp
        2163
        2161
        2153

    Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp
        2004
        1988
        1991

    Average: 1994 Iterations Per Minute

trunk -O3 -flto -march=native -fopenmp
    Operation: Resizing:
        2126
        2135
        2123

    Average: 2128 Iterations Per Minute

So no big changes here...

* [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
  2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (20 preceding siblings ...)
  2023-11-24 23:38 ` hubicka at gcc dot gnu.org
@ 2023-11-25 10:21 ` liuhongt at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2023-11-25 10:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

liuhongt at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #21 from liuhongt at gcc dot gnu.org ---
The main gap comes from OpenMP on hybrid machines.

Thread overview: 23+ messages
2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
2023-05-11 14:26 ` [Bug tree-optimization/109812] " aros at gmx dot com
2023-05-11 15:20 ` [Bug target/109812] " pinskia at gcc dot gnu.org
2023-05-11 15:50 ` aros at gmx dot com
2023-05-12  8:47 ` aros at gmx dot com
2023-05-16 22:43 ` juzhe.zhong at rivai dot ai
2023-05-17  0:08 ` sjames at gcc dot gnu.org
2023-05-28 16:46 ` hubicka at gcc dot gnu.org
2023-05-28 17:29 ` [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake hubicka at gcc dot gnu.org
2023-05-28 17:39 ` hubicka at gcc dot gnu.org
2023-05-28 18:11 ` hubicka at gcc dot gnu.org
2023-05-28 18:50 ` hubicka at gcc dot gnu.org
2023-05-30  0:05 ` zhangjungcc at gmail dot com
2023-05-31 12:42 ` hubicka at ucw dot cz
2023-05-31 16:11 ` hubicka at gcc dot gnu.org
2023-05-31 16:52 ` jamborm at gcc dot gnu.org
2023-06-01  9:38 ` jamborm at gcc dot gnu.org
2023-06-01 11:19 ` jakub at gcc dot gnu.org
2023-06-01 12:28 ` hubicka at gcc dot gnu.org
2023-06-21  9:46 ` ubizjak at gmail dot com
2023-10-12  4:48 ` cvs-commit at gcc dot gnu.org
2023-11-24 23:38 ` hubicka at gcc dot gnu.org
2023-11-25 10:21 ` liuhongt at gcc dot gnu.org
