public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/110472] New: 60% slowdown with fwrapv when using openmp
@ 2023-06-29  1:38 ryanpholt at me dot com
  2023-06-29  1:58 ` [Bug middle-end/110472] " pinskia at gcc dot gnu.org
  2023-06-29 15:13 ` ryanpholt at me dot com
  0 siblings, 2 replies; 3+ messages in thread
From: ryanpholt at me dot com @ 2023-06-29  1:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110472

            Bug ID: 110472
           Summary: 60% slowdown with fwrapv when using openmp
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ryanpholt at me dot com
  Target Milestone: ---

Created attachment 55423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55423&action=edit
Reproduction file

Compiling the attached example with -fwrapv inhibits some optimizations and
results in a massive slowdown. It appears to be related to the use of OpenMP. I
know that -fwrapv can result in slowdowns; however, I do not believe that it
needs to in this example.

In the loop nest below, gcc appears to believe the computations with the loop
induction variables (e.g. 'i2 = i * 21504 + i1 * 96;') may overflow. However,
the code is looping over fixed-size data, so I believe gcc should be able to
determine that overflow is not possible. Perhaps some range analysis stops
working across the OpenMP runtime boundary? The issue is fixed if I remove the
OpenMP pragma.

The issue is also fixed if I change the loop induction variables to be declared
as int64_t rather than int.

I also tried clang 16 and did not observe the issue with -fwrapv.

#pragma omp parallel for \
 num_threads(omp_get_max_threads()) \
 private(i1,u0,b_u0,i2,i3,i4,i5,i6,i7,i8,i10,i12,i14,i15) \
 firstprivate(r2)

  for (i = 0; i < 7; i++) {

    for (i1 = 0; i1 < 7; i1++) {
      u0 = i * -32 + 222;
      if (u0 > 32) {
        u0 = 32;
      }

      b_u0 = i1 * -32 + 222;
      if (b_u0 > 32) {
        b_u0 = 32;
      }

      i2 = i * 21504 + i1 * 96;
      i3 = i * 227328 + (i1 << 10);
      for (i4 = 0; i4 < u0; i4++) {
        for (i5 = 0; i5 < b_u0; i5++) {
          for (i6 = 0; i6 < 3; i6++) {
            i7 = i5 + i6;
            for (i8 = 0; i8 < 3; i8++) {
              for (i10 = 0; i10 < 3; i10++) {
                i12 = i4 + i10;
                for (i14 = 0; i14 < 32; i14++) {
                  i15 = ((i3 + 7104 * i4) + (i5 << 5)) + i14;
                  (r2)[i15] += ((float *)&in_0[0][0][0][0])[((i2 + 672 * i12) +
                    3 * i7) + i8] * ((float *)&__constant_3x3x3x32xf32[0][0][0][0])
                    [((288 * i10 + 96 * i6) + (i8 << 5)) + i14];
                }
              }
            }
          }
        }
      }
    }
  }


Repro:
gcc -O3 -fopenmp -lpthread -fwrapv predict.i
./a.out

(Remove the -fwrapv to observe a major speedup)



Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 10.2.1-6'
--with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-10-Km9U7s/gcc-10-10.2.1/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-Km9U7s/gcc-10-10.2.1/debian/tmp-gcn/usr,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20210110 (Debian 10.2.1-6) 
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-fopenmp' '-fwrapv'
'-mtune=generic' '-march=x86-64' '-pthread'
 /usr/lib/gcc/x86_64-linux-gnu/10/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu
-D_REENTRANT main.c -mtune=generic -march=x86-64 -fopenmp -fwrapv -O3
-fpch-preprocess -fasynchronous-unwind-tables -o main.i
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/10/include-fixed"
ignoring nonexistent directory
"/usr/lib/gcc/x86_64-linux-gnu/10/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/10/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-fopenmp' '-fwrapv'
'-mtune=generic' '-march=x86-64' '-pthread'
 /usr/lib/gcc/x86_64-linux-gnu/10/cc1 -fpreprocessed main.i -quiet -dumpbase
main.c -mtune=generic -march=x86-64 -auxbase main -O3 -version -fopenmp -fwrapv
-fasynchronous-unwind-tables -o main.s
GNU C17 (Debian 10.2.1-6) version 10.2.1 20210110 (x86_64-linux-gnu)
        compiled by GNU C version 10.2.1 20210110, GMP version 6.2.1, MPFR
version 4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU C17 (Debian 10.2.1-6) version 10.2.1 20210110 (x86_64-linux-gnu)
        compiled by GNU C version 10.2.1 20210110, GMP version 6.2.1, MPFR
version 4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 1f803793fa2e3418c492b25e7d3eac2f
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-fopenmp' '-fwrapv'
'-mtune=generic' '-march=x86-64' '-pthread'
 as -v --64 -o main.o main.s
GNU assembler version 2.35.2 (x86_64-linux-gnu) using BFD version (GNU Binutils
for Debian) 2.35.2
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/10/:/usr/lib/gcc/x86_64-linux-gnu/10/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/10/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/10/:/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/10/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/10/../../../:/lib/:/usr/lib/
Reading specs from /usr/lib/gcc/x86_64-linux-gnu/10/libgomp.spec
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-fopenmp' '-fwrapv'
'-mtune=generic' '-march=x86-64' '-pthread'
 /usr/lib/gcc/x86_64-linux-gnu/10/collect2 -plugin
/usr/lib/gcc/x86_64-linux-gnu/10/liblto_plugin.so
-plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
-plugin-opt=-fresolution=main.res -plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lpthread
-plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64
--hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie
/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/Scrt1.o
/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/crti.o
/usr/lib/gcc/x86_64-linux-gnu/10/crtbeginS.o
/usr/lib/gcc/x86_64-linux-gnu/10/crtoffloadbegin.o
-L/usr/lib/gcc/x86_64-linux-gnu/10
-L/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu
-L/usr/lib/gcc/x86_64-linux-gnu/10/../../../../lib -L/lib/x86_64-linux-gnu
-L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib
-L/usr/lib/gcc/x86_64-linux-gnu/10/../../.. -lpthread main.o -lgomp -lgcc
--push-state --as-needed -lgcc_s --pop-state -lpthread -lc -lgcc --push-state
--as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/10/crtendS.o
/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/crtn.o
/usr/lib/gcc/x86_64-linux-gnu/10/crtoffloadend.o
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-fopenmp' '-fwrapv'
'-mtune=generic' '-march=x86-64' '-pthread'


* [Bug middle-end/110472] 60% slowdown with fwrapv when using openmp
  2023-06-29  1:38 [Bug tree-optimization/110472] New: 60% slowdown with fwrapv when using openmp ryanpholt at me dot com
@ 2023-06-29  1:58 ` pinskia at gcc dot gnu.org
  2023-06-29 15:13 ` ryanpholt at me dot com
  1 sibling, 0 replies; 3+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-29  1:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110472

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux-gnu

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I think it is just wrong iv-opt choices.

Works just fine on aarch64-linux-gnu too:
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ~/upstream-gcc/bin/gcc t4.c -O2 -fopenmp
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
time: 15.191220
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ~/upstream-gcc/bin/gcc t4.c -O2 -fopenmp -fwrapv
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
time: 18.854280
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
time: 16.705876
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ~/upstream-gcc/bin/gcc t4.c -O2 -fopenmp
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
time: 17.491387
ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
time: 17.519264


* [Bug middle-end/110472] 60% slowdown with fwrapv when using openmp
  2023-06-29  1:38 [Bug tree-optimization/110472] New: 60% slowdown with fwrapv when using openmp ryanpholt at me dot com
  2023-06-29  1:58 ` [Bug middle-end/110472] " pinskia at gcc dot gnu.org
@ 2023-06-29 15:13 ` ryanpholt at me dot com
  1 sibling, 0 replies; 3+ messages in thread
From: ryanpholt at me dot com @ 2023-06-29 15:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110472

--- Comment #2 from Ryan Holt <ryanpholt at me dot com> ---
(In reply to Andrew Pinski from comment #1)
> I think it is just wrong iv-opt choices.
> 
> Works just fine on aarch64-linux-gnu too:
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ~/upstream-gcc/bin/gcc t4.c -O2 -fopenmp
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
> time: 15.191220
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ~/upstream-gcc/bin/gcc t4.c -O2 -fopenmp -fwrapv
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
> time: 18.854280
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
> time: 16.705876
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ~/upstream-gcc/bin/gcc t4.c -O2 -fopenmp
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
> time: 17.491387
> ubuntu@ubuntu:~/src/upstream-gcc-aarch64# ./a.out ;echo
> time: 17.519264

I forgot to explicitly call out that I can only reproduce the big speedup
without -fwrapv when compiling with -O3. I noticed you were using -O2 on
aarch64-linux-gnu.

