public inbox for gcc-bugs@sourceware.org
* [Bug libgomp/108494] New: Slow thread creation with nested loops in GFortran
@ 2023-01-23 11:37 dewhurst@mpi-halle.mpg.de
From: dewhurst@mpi-halle.mpg.de @ 2023-01-23 11:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108494
Bug ID: 108494
Summary: Slow thread creation with nested loops in GFortran
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: dewhurst@mpi-halle.mpg.de
CC: jakub at gcc dot gnu.org
Target Milestone: ---
Thread creation is very slow for nested parallel loops in code compiled with
GFortran; I suspect the cause lies in the libgomp library.
Here is a simple example of the problem:
  program test
  implicit none
  integer l
  !$OMP PARALLEL DO &
  !$OMP NUM_THREADS(1)
  do l=1,1000
    call foo
  end do
  !$OMP END PARALLEL DO
  end program

  subroutine foo
  implicit none
  integer, parameter :: l=200,m=100,n=10
  ! number of threads
  integer, parameter :: nthd=10
  integer i,j
  ! automatic arrays
  real(8) a(n,l),b(n,m),x(m)
  a(:,:)=2.d0
  b(:,:)=3.d0
  do i=1,l
  !$OMP PARALLEL DO DEFAULT(SHARED) &
  !$OMP NUM_THREADS(nthd)
    do j=1,m
      x(j)=dot_product(a(:,i),b(:,j))
    end do
  !$OMP END PARALLEL DO
  end do
  end subroutine
The wall-clock time is about 0.5 seconds when compiled with Intel or PGI
Fortran. However, for GFortran compiled with
gfortran -O3 -fopenmp test.f90
and OMP_NESTED set to true, the wall-clock time is about 70 seconds, or about
140 times slower. (The ‘dot_product’ can be removed from the loop – all the
time is taken with thread creation).
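As a rough back-of-the-envelope check (a sketch, not a measurement from this
report): the reproducer calls foo 1000 times and each call opens 200 inner
parallel regions, so 200,000 nested regions are started in total. Dividing the
reported 70 seconds by that count gives about 350 microseconds per region. The
helper names below are hypothetical, and the thread count assumes that each
nested region of 10 threads starts 9 new threads (the encountering thread being
reused), which is an assumption about the runtime rather than something
measured here.

```c
/* Loop bounds and team size copied from the reproducer above. */
enum { OUTER_CALLS = 1000, REGIONS_PER_CALL = 200, TEAM_SIZE = 10 };

/* Total number of inner (nested) parallel regions the program opens. */
long nested_region_count(void)
{
    return (long)OUTER_CALLS * REGIONS_PER_CALL;
}

/* ASSUMPTION: every nested region starts TEAM_SIZE - 1 fresh threads,
   because the thread that encounters the region joins the new team. */
long assumed_thread_starts(void)
{
    return nested_region_count() * (TEAM_SIZE - 1);
}

/* Average cost per nested region, in microseconds, for a given wall time. */
double usec_per_region(double wall_seconds)
{
    return wall_seconds * 1.0e6 / (double)nested_region_count();
}
```

For comparison, usec_per_region(0.5), using the 0.5 seconds reported for the
other compilers, works out to 2.5 microseconds per region.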
This only affects nested loops: if the OMP directives are removed from the loop
in the main program above, then GFortran is as fast as the other compilers.
I’ve tried several different versions of GFortran (from 7.5.0 to 12.1.0) on
different Linux machines and it’s slow on all of them.
It may be a problem with libgomp. If I replace the libgomp library with the one
provided with the NVIDIA compiler (on our machine this is in the directory
nvhpcsdk/22.11/Linux_x86_64/22.11/compilers/lib/libgomp.so.1) then it’s as fast
as the others.
This has been reproduced by others, including on Windows; see here:
https://fortran-lang.discourse.group/t/slow-thread-creation-with-nested-loops-in-gfortran/5062
* [Bug libgomp/108494] Slow thread creation with nested loops in GFortran
From: rguenth at gcc dot gnu.org @ 2023-01-23 13:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108494
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |12.2.1
           Keywords|                            |openmp
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed with GCC 12. perf shows
Samples: 9M of event 'cycles', Event count (approx.): 563584363996
Overhead  Samples  Command  Shared Object     Symbol
  48.21%  4487732  a.out    libgomp.so.1.0.0  [.] omp_get_num_procs
   4.17%    25651  a.out    [unknown]         [k] 0xffffffff9aba2a94
   2.15%    13393  a.out    [unknown]         [k] 0xffffffff9ab9f8c7
   1.18%     7245  a.out    [unknown]         [k] 0xffffffff9b0389f2
* [Bug libgomp/108494] Slow thread creation with nested loops in GFortran
From: amonakov at gcc dot gnu.org @ 2023-01-23 14:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108494
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(Note: omp_get_num_procs is just the closest dynamic symbol; with
libgomp1-debuginfo installed, the full symbol table will be available and the
perf output will be more sensible.)
We don't reuse threads from nested parallel regions. This comment in team.c
seems relevant:
/* We only allow the reuse of idle threads for non-nested PARALLEL
regions. This appears to be implied by the semantics of
threadprivate variables, but perhaps that's reading too much into
things. Certainly it does prevent any locking problems, since
only the initial program thread will modify gomp_threads. */
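Given that nested regions do not reuse idle threads, one possible way to reduce
the number of teams created is to open the inner parallel region once per call
and share out each j loop with a worksharing construct inside it. This is a
sketch only: it is not proposed anywhere in this thread and has not been timed.
The C translation of the reporter's foo below is hypothetical (the names dot10,
fill, foo_region_per_i and foo_hoisted are made up for illustration); C is used
because libgomp itself is written in C. It includes no OpenMP header, so it
also builds and gives the same result without -fopenmp.

```c
/* Sizes and team size taken from the reporter's subroutine foo. */
enum { L = 200, M = 100, N = 10, NTHD = 10 };

/* Dot product of two length-N vectors. */
static double dot10(const double *u, const double *v)
{
    double s = 0.0;
    for (int k = 0; k < N; k++)
        s += u[k] * v[k];
    return s;
}

/* Same initial values as the Fortran original (a = 2, b = 3). */
static void fill(double a[L][N], double b[M][N])
{
    for (int i = 0; i < L; i++)
        for (int k = 0; k < N; k++)
            a[i][k] = 2.0;
    for (int j = 0; j < M; j++)
        for (int k = 0; k < N; k++)
            b[j][k] = 3.0;
}

/* Shape used in the report: one parallel region per i iteration,
   i.e. L regions per call. */
double foo_region_per_i(void)
{
    double a[L][N], b[M][N], x[M];
    fill(a, b);
    for (int i = 0; i < L; i++) {
        #pragma omp parallel for default(shared) num_threads(NTHD)
        for (int j = 0; j < M; j++)
            x[j] = dot10(a[i], b[j]);
    }
    return x[M - 1];
}

/* Hoisted shape: one parallel region per call. Every thread runs the
   i loop, and "omp for" divides the j iterations among the team. */
double foo_hoisted(void)
{
    double a[L][N], b[M][N], x[M];
    fill(a, b);
    #pragma omp parallel default(shared) num_threads(NTHD)
    {
        for (int i = 0; i < L; i++) {
            #pragma omp for
            for (int j = 0; j < M; j++)
                x[j] = dot10(a[i], b[j]);
            /* The implicit barrier at the end of "omp for" keeps
               iteration i + 1 from overwriting x too early. */
        }
    }
    return x[M - 1];
}
```

In the Fortran original, the corresponding change would be to place
!$OMP PARALLEL outside the "do i" loop and use !$OMP DO on the "do j" loop.
Each call would then open one nested region instead of 200; whether that
removes the slowdown seen with libgomp would need to be measured.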
* [Bug libgomp/108494] Slow thread creation with nested loops in GFortran
From: jakub at gcc dot gnu.org @ 2023-01-23 14:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108494
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Yeah, we only cache threads at the outermost parallelism level; for nested
parallelism, threads are created and destroyed as needed.
I know libomp basically never destroys threads (except for
omp_pause_resource{,_all}?), but I am not convinced that is a good idea
resource-wise. While it makes pointless benchmarks faster, a program that uses
nested parallelism only briefly, say once from some library, and then doesn't
need it anymore will just waste resources.
Most programs don't use omp_pause_resource{,_all}, and in libraries especially
it is practically impossible, because the library doesn't know whether some
other part of the program actually uses OpenMP.