From: "dewhurst@mpi-halle.mpg.de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libgomp/108494] New: Slow thread creation with nested loops in GFortran
Date: Mon, 23 Jan 2023 11:37:07 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108494

            Bug ID: 108494
           Summary: Slow thread creation with nested loops in GFortran
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dewhurst@mpi-halle.mpg.de
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

This is an issue with very slow thread creation for nested loops in code compiled with GFortran; however, I
suspect it may be due to the libgomp library.

Here is a simple example of the problem:

program test
implicit none
integer l
!$OMP PARALLEL DO &
!$OMP NUM_THREADS(1)
do l=1,1000
  call foo
end do
!$OMP END PARALLEL DO
end program

subroutine foo
implicit none
integer, parameter :: l=200,m=100,n=10
! number of threads
integer, parameter :: nthd=10
integer i,j
! automatic arrays
real(8) a(n,l),b(n,m),x(m)
a(:,:)=2.d0
b(:,:)=3.d0
do i=1,l
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd)
  do j=1,m
    x(j)=dot_product(a(:,i),b(:,j))
  end do
!$OMP END PARALLEL DO
end do
end subroutine

The wall-clock time is about 0.5 seconds when compiled with Intel or PGI
Fortran. However, for GFortran compiled with

gfortran -O3 -fopenmp test.f90

and OMP_NESTED set to true, the wall-clock time is about 70 seconds, or about
140 times slower. (The ‘dot_product’ can be removed from the loop; all the
time is taken by thread creation.)

This only affects nested loops; if the OMP directives are removed from the
loop in the main program above, then GFortran is as fast as the other
compilers. I’ve tried several different versions of GFortran (from 7.5.0 to
12.1.0) on different Linux machines and it’s slow on all of them.

It may be a problem with libgomp. If I replace the libgomp library with the
one provided with the NVIDIA compiler (on our machine this is in the directory
nvhpcsdk/22.11/Linux_x86_64/22.11/compilers/lib/libgomp.so.1) then it’s as
fast as the others.

This has been reproduced by others and also on Windows, see here:

https://fortran-lang.discourse.group/t/slow-thread-creation-with-nested-loops-in-gfortran/5062
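
One possible workaround, sketched here under assumptions and not measured: open the inner parallel region once per call, outside the i loop, and share out only the j loop with a DO construct. The inner team is then requested once per call instead of once per iteration of i. The names test_hoisted and foo_hoisted, the output argument s and the system_clock timing are hypothetical additions for illustration. Whether this removes the slowdown with libgomp is an assumption to be checked, and it does not address the cost of creating nested teams itself.

```fortran
program test_hoisted
implicit none
integer l
integer(8) c0,c1,rate
real(8) s
s=0.d0
call system_clock(c0,rate)
! same outer region as in the report; s is LASTPRIVATE so that the value
! from the final iteration is defined after the loop
!$OMP PARALLEL DO LASTPRIVATE(s) &
!$OMP NUM_THREADS(1)
do l=1,1000
  call foo_hoisted(s)
end do
!$OMP END PARALLEL DO
call system_clock(c1)
! each x(j) is 10*2*3=60, so the sum over m=100 elements should be 6000
print '(a,f8.1)','sum(x) from last call: ',s
print '(a,f10.3)','wall-clock seconds: ',real(c1-c0,8)/real(rate,8)
end program

subroutine foo_hoisted(s)
implicit none
! s is a hypothetical output, added only so that the result can be checked
real(8), intent(out) :: s
integer, parameter :: l=200,m=100,n=10
! number of threads
integer, parameter :: nthd=10
integer i,j
! local arrays
real(8) a(n,l),b(n,m),x(m)
a(:,:)=2.d0
b(:,:)=3.d0
! one team for the whole i loop; every thread runs the i loop itself,
! so i has to be private
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) &
!$OMP NUM_THREADS(nthd)
do i=1,l
! the j iterations are divided among the threads of the existing team;
! the implicit barrier at END DO separates successive values of i
!$OMP DO
  do j=1,m
    x(j)=dot_product(a(:,i),b(:,j))
  end do
!$OMP END DO
end do
!$OMP END PARALLEL
s=sum(x)
end subroutine
```

This can be built and run in the same way as the example above (gfortran -O3 -fopenmp, with OMP_NESTED set to true), so the two timings can be compared directly.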