From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (qmail 7876 invoked by alias); 20 Apr 2010 12:23:51 -0000
Received: (qmail 7752 invoked by uid 48); 20 Apr 2010 12:23:12 -0000
Date: Tue, 20 Apr 2010 12:23:00 -0000
Message-ID: <20100420122312.7751.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: 
Subject: [Bug libgomp/43706] scheduling two threads on one core leads to starvation
In-Reply-To: 
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "mika dot fischer at kit dot edu" 
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: 
List-Archive: 
List-Post: 
List-Help: 
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2010-04/txt/msg02013.txt.bz2

------- Comment #7 from mika dot fischer at kit dot edu 2010-04-20 12:23 -------
> For performance reasons libgomp uses some busy waiting, which of course works
> well when there are available CPUs and cycles to burn (decreases latency a
> lot), but if you have more threads than CPUs it can make things worse.
> You can tweak this through OMP_WAIT_POLICY and GOMP_SPINCOUNT env vars.

This is definitely the reason for the behavior we're seeing. When we set
OMP_WAIT_POLICY=passive, the test program finishes normally; without it, it
takes very long.

Here are some measurements with "while (true)" replaced by
"for (int j=0; j<1000; ++j)":

All cores idle:
===============
$ /usr/bin/time ./openmp-bug
3.21user 0.00system 0:00.81elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+331minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.75user 0.05system 0:01.42elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

1 (out of 4) cores occupied:
============================
$ /usr/bin/time ./openmp-bug
133.65user 0.02system 0:45.30elapsed 295%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+330minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.67user 0.00system 0:02.35elapsed 113%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10 /usr/bin/time ./openmp-bug
2.91user 0.03system 0:01.73elapsed 169%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100 /usr/bin/time ./openmp-bug
2.77user 0.03system 0:01.90elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000 /usr/bin/time ./openmp-bug
2.87user 0.00system 0:01.70elapsed 168%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10000 /usr/bin/time ./openmp-bug
3.05user 0.06system 0:01.85elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+337minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100000 /usr/bin/time ./openmp-bug
5.25user 0.03system 0:03.10elapsed 170%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000000 /usr/bin/time ./openmp-bug
28.84user 0.00system 0:14.13elapsed 203%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

[I ran each of these several times and took a runtime close to the average.]
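For reference, the kernel behind these numbers looks roughly like the sketch
below. This is only a paraphrase of the test case (with the loop already
bounded as described above), not the exact attached code; the iteration counts
and the trivial work per iteration are made up. The point is that the parallel
regions are very short, so the threads presumably spend most of their time in
libgomp's barrier/wait path. Built with "g++ -fopenmp openmp-bug.cc -o openmp-bug".

// Rough sketch of the benchmark kernel (a paraphrase, not the exact test case):
// many short parallel regions, so nearly all of the time is spent entering and
// leaving parallel regions rather than doing user work.
#include <cstdio>

int main()
{
    long sum = 0;

    for (int j = 0; j < 1000; ++j)        // originally: while (true)
    {
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 100; ++i)     // tiny amount of real work
            sum += i;
    }

    std::printf("%ld\n", sum);            // keep the work from being optimized away
    return 0;
}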
> Although the implementation recognizes two kinds of spin counts (normal and
> throttled, the latter in use when number of threads is bigger than number of
> available CPUs), in some cases even that default might be too large (the
> default for throttled spin count is 1000 spins for OMP_WAIT_POLICY=active and
> 100 spins for no OMP_WAIT_POLICY in environment).

As the numbers show, a default spin count of 1000 would be fine. The problem,
however, is that OpenMP assumes it has all the cores of the CPU to itself. The
throttled spin count is only used if the number of OpenMP threads is larger
than the number of cores in the system (AFAICT). This will almost never happen
(AFAICT only if you set OMP_NUM_THREADS to something larger than the number of
cores).

Since it seems clear that the spin count should be smaller when the CPU cores
are busy with other work, the throttled spin count would have to be used
whenever the cores are actually in use at the moment the thread starts
waiting. That the number of running OpenMP threads is smaller than the number
of cores is not a sufficient condition for the cores being idle. If it's not
possible to determine this, or if it's too time-consuming, then maybe the
non-throttled default spin count could be reduced to 1000 or so.

So thanks for the workaround! But I still think the default behavior can
easily cause very significant slowdowns and should therefore be reconsidered.

Finally, I still don't understand why the spinlocking has such a large effect
on the runtime. I would expect even 2,000,000 spin iterations to be over very
quickly, not to cause a 20-fold increase in the total runtime of the program.
Just out of curiosity, maybe you can explain why this happens (see the sketch
appended after this message for the kind of waiting I have in mind).


-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
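Appended sketch referenced above: a generic spin-then-block wait loop of the
kind the quoted comment describes. This is an illustration only, not libgomp's
actual code; the helper name wait_for and the spin count are made up. On an
idle core the flag usually flips while the waiter is still spinning, which is
the low-latency case; on an oversubscribed core the spinning only burns the
waiter's timeslice while the thread that would set the flag cannot run, which
is presumably where the slowdown comes from.

// Generic spin-then-block wait (illustration only, not libgomp's code).
// A waiter polls a flag up to `spincount` times before falling back to a
// blocking/yielding wait; OMP_WAIT_POLICY / GOMP_SPINCOUNT effectively
// control how long that spinning phase is.
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper, not a real libgomp entry point.
static void wait_for(std::atomic<bool>& flag, long spincount)
{
    for (long i = 0; i < spincount; ++i)            // active (spinning) phase
        if (flag.load(std::memory_order_acquire))
            return;                                 // woken up cheaply

    while (!flag.load(std::memory_order_acquire))   // passive fallback
        std::this_thread::yield();                  // (a real implementation would block, e.g. on a futex)
}

int main()
{
    std::atomic<bool> flag{false};

    // One waiter thread; the main thread releases it after ~1ms.
    std::thread waiter(wait_for, std::ref(flag), 100000L);   // arbitrary spin count

    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    flag.store(true, std::memory_order_release);

    waiter.join();
    return 0;
}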