From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (qmail 7876 invoked by alias); 20 Apr 2010 12:23:51 -0000
Received: (qmail 7752 invoked by uid 48); 20 Apr 2010 12:23:12 -0000
Date: Tue, 20 Apr 2010 12:23:00 -0000
Message-ID: <20100420122312.7751.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: 
Subject: [Bug libgomp/43706] scheduling two threads on one core leads to starvation
In-Reply-To: 
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "mika dot fischer at kit dot edu" 
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: 
List-Archive: 
List-Post: 
List-Help: 
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2010-04/txt/msg02013.txt.bz2

------- Comment #7 from mika dot fischer at kit dot edu 2010-04-20 12:23 -------
> For performance reasons libgomp uses some busy waiting, which of course works
> well when there are available CPUs and cycles to burn (decreases latency a
> lot), but if you have more threads than CPUs it can make things worse.
> You can tweak this through OMP_WAIT_POLICY and GOMP_SPINCOUNT env vars.

This is definitely the reason for the behavior we're seeing. When we set
OMP_WAIT_POLICY=passive, the test program finishes normally; without it, it
takes very long.

Here are some measurements with "while (true)" replaced by
"for (int j=0; j<1000; ++j)":

All cores idle:
===============
$ /usr/bin/time ./openmp-bug
3.21user 0.00system 0:00.81elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+331minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.75user 0.05system 0:01.42elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

1 (out of 4) cores occupied:
============================
$ /usr/bin/time ./openmp-bug
133.65user 0.02system 0:45.30elapsed 295%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+330minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.67user 0.00system 0:02.35elapsed 113%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10 /usr/bin/time ./openmp-bug
2.91user 0.03system 0:01.73elapsed 169%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100 /usr/bin/time ./openmp-bug
2.77user 0.03system 0:01.90elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000 /usr/bin/time ./openmp-bug
2.87user 0.00system 0:01.70elapsed 168%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10000 /usr/bin/time ./openmp-bug
3.05user 0.06system 0:01.85elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+337minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100000 /usr/bin/time ./openmp-bug
5.25user 0.03system 0:03.10elapsed 170%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000000 /usr/bin/time ./openmp-bug
28.84user 0.00system 0:14.13elapsed 203%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

[I ran each of these several times and took a runtime close to the average.]
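For reference, the kernel behind these numbers looks roughly like the sketch
below. This is only a paraphrase of the test case (with the loop already
bounded as described above), not the exact attached code; the iteration counts
and the trivial work per iteration are made up. The point is that the parallel
regions are very short, so the threads presumably spend most of their time in
libgomp's barrier/wait path. Built with "g++ -fopenmp openmp-bug.cc -o openmp-bug".

// Rough sketch of the benchmark kernel (a paraphrase, not the exact test case):
// many short parallel regions, so nearly all of the time is spent entering and
// leaving parallel regions rather than doing user work.
#include <cstdio>

int main()
{
    long sum = 0;

    for (int j = 0; j < 1000; ++j)        // originally: while (true)
    {
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 100; ++i)     // tiny amount of real work
            sum += i;
    }

    std::printf("%ld\n", sum);            // keep the work from being optimized away
    return 0;
}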
> Although the implementation recognizes two kinds of spin counts (normal and
> throttled, the latter in use when number of threads is bigger than number of
> available CPUs), in some cases even that default might be too large (the
> default for throttled spin count is 1000 spins for OMP_WAIT_POLICY=active and
> 100 spins for no OMP_WAIT_POLICY in environment).

As the numbers show, a default spin count of 1000 would be fine. The problem,
however, is that OpenMP assumes it has all the cores of the CPU to itself. The
throttled spin count is only used if the number of OpenMP threads is larger
than the number of cores in the system (AFAICT). This will almost never happen
(AFAICT only if you set OMP_NUM_THREADS to something larger than the number of
cores).

Since it seems clear that the spin count should be smaller when the CPU cores
are busy with other work, the throttled spin count would have to be used
whenever the cores are actually in use at the moment the thread starts
waiting. That the number of running OpenMP threads is smaller than the number
of cores is not a sufficient condition for the cores being idle. If it's not
possible to determine this, or if it's too time-consuming, then maybe the
non-throttled default spin count could be reduced to 1000 or so.

So thanks for the workaround! But I still think the default behavior can
easily cause very significant slowdowns and should therefore be reconsidered.

Finally, I still don't understand why the spinlocking has such a large effect
on the runtime. I would expect even 2,000,000 spin iterations to be over very
quickly, not to cause a 20-fold increase in the total runtime of the program.
Just out of curiosity, maybe you can explain why this happens (see the sketch
appended after this message for the kind of waiting I have in mind).


-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
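Appended sketch referenced above: a generic spin-then-block wait loop of the
kind the quoted comment describes. This is an illustration only, not libgomp's
actual code; the helper name wait_for and the spin count are made up. On an
idle core the flag usually flips while the waiter is still spinning, which is
the low-latency case; on an oversubscribed core the spinning only burns the
waiter's timeslice while the thread that would set the flag cannot run, which
is presumably where the slowdown comes from.

// Generic spin-then-block wait (illustration only, not libgomp's code).
// A waiter polls a flag up to `spincount` times before falling back to a
// blocking/yielding wait; OMP_WAIT_POLICY / GOMP_SPINCOUNT effectively
// control how long that spinning phase is.
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper, not a real libgomp entry point.
static void wait_for(std::atomic<bool>& flag, long spincount)
{
    for (long i = 0; i < spincount; ++i)            // active (spinning) phase
        if (flag.load(std::memory_order_acquire))
            return;                                 // woken up cheaply

    while (!flag.load(std::memory_order_acquire))   // passive fallback
        std::this_thread::yield();                  // (a real implementation would block, e.g. on a futex)
}

int main()
{
    std::atomic<bool> flag{false};

    // One waiter thread; the main thread releases it after ~1ms.
    std::thread waiter(wait_for, std::ref(flag), 100000L);   // arbitrary spin count

    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    flag.store(true, std::memory_order_release);

    waiter.join();
    return 0;
}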