* pthread_create() slowdown with concurrent sched_yield()
From: Dan Bonachea @ 2017-03-07 21:20 UTC
To: cygwin; +Cc: gasnet-devel, Dan Bonachea
I suspect I may have discovered a corner-case performance bug in
Cygwin's pthread_create() implementation. The problem arises when a
call to pthread_create() is made concurrently with multiple pthreads
in the same process spinning on calls to sched_yield(). I've searched
the Cygwin mailing list archives, user guide, FAQ, and Google and not
found any mention of this particular misbehavior.
A minimal demo program is copied below and also available here:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=549
The demo program is a narrowed-down version of test code used in the
GASNet communication system (http://gasnet.lbl.gov).
The test code calls pthread_create to spawn a user-controlled number
of threads, which then execute 1000 "spin barriers" - implemented by
spinning on in-memory flags and stalling with sched_yield(). The test
can also optionally insert a pthread_barrier_wait() across all threads
before the first spin barrier.
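As a rough illustration of the spin barrier just described, here is a minimal sketch of a single barrier step. It is not the test program's code: it assumes C11 <stdatomic.h> is available and uses atomic flags where the test program uses volatile ints, and the names spin_barrier_t and spin_barrier_wait are hypothetical.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical container for one spin barrier.
   flags[0] is the master's broadcast flag; flags[i] for i > 0 is the
   arrival flag of worker i. All flags start at 0. */
typedef struct {
    atomic_int *flags;
    int nthreads;
} spin_barrier_t;

/* Thread `me` (0 = master) passes barrier number `iter` (1, 2, 3, ...).
   Waiters poll an in-memory flag and call sched_yield() between polls. */
static void spin_barrier_wait(spin_barrier_t *b, int me, int iter) {
    if (me == 0) {
        /* Master: wait for every worker to signal arrival... */
        for (int th = 1; th < b->nthreads; th++) {
            while (atomic_load_explicit(&b->flags[th], memory_order_acquire) != iter)
                sched_yield();
        }
        /* ...then broadcast the release. */
        atomic_store_explicit(&b->flags[0], iter, memory_order_release);
    } else {
        /* Worker: signal arrival, then wait for the master's broadcast. */
        atomic_store_explicit(&b->flags[me], iter, memory_order_release);
        while (atomic_load_explicit(&b->flags[0], memory_order_acquire) != iter)
            sched_yield();
    }
}
```

Each thread would call spin_barrier_wait(&b, my_index, iter) once per iteration, with iter counting up from 1.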
Here are some experimental results - these are full-process "real"
wall-clock timings (fastest over 5 runs) collected using the bash
'time' shell built-in. The systems are otherwise idle. All code has
been compiled with the default 64-bit /usr/bin/gcc (compile line
appears as a comment in the test), but the results are similar with
clang.
         8-core Win7-Cygwin/64 2.6.0       8-core Linux/64 3.13.0 (Ubuntu)
         i7-4800MQ @ 2.70GHz               Xeon E5420 @ 2.50GHz
         4 core x 2-way hyperthread        2 socket x 4 cores/socket
thread   create-vs-     create-vs-         create-vs-     create-vs-
count    spin/yield     pthread_barrier    spin/yield     pthread_barrier
------   ------------   ----------------   ------------   ----------------
     1    0m 0.000s     0m0.000s           0m0.001s       0m0.001s
     2    0m 0.000s     0m0.000s           0m0.002s       0m0.002s
     4    0m 0.000s     0m0.000s           0m0.002s       0m0.003s
     8    0m 0.000s     0m0.016s           0m0.003s       0m0.006s
    16    0m10.717s     0m0.000s           0m0.013s       0m0.012s
    32    2m23.988s     0m0.016s           0m0.018s       0m0.024s
    64   12m40.002s     0m0.016s           0m0.038s       0m0.046s
   128        >20m*     0m0.016s           0m0.063s       0m0.067s
   256        >20m*     0m0.047s           0m0.290s       0m0.631s
(*) = killed after >20m of wall time (>2.5 hours of cpu time)
When the number of pthreads begins to exceed the physical core count,
Cygwin's pthread_create() starts taking exponentially longer to return
when it is competing with concurrent calls to sched_yield(). During
the long pauses, Windows Task Manager shows the process consuming 100%
CPU on all cores and it becomes unresponsive to SIGINT. The observed
behavior seems to suggest that Cygwin's pthread creation operation
(and/or the newly spawned thread) is not being scheduled, despite
every OTHER application thread spamming calls to sched_yield().
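To separate the time spent inside pthread_create() from the rest of the run, each call can be timed individually instead of timing the whole process. The following is a sketch under assumptions and is not part of the original test: it assumes clock_gettime() with CLOCK_MONOTONIC is available, and timed_pthread_create, elapsed_sec and noop_thread are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC readings (b taken after a). */
static double elapsed_sec(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Trivial thread body, used only to demonstrate the wrapper. */
static void *noop_thread(void *p) {
    return p;
}

/* Calls pthread_create() with default attributes and stores the wall-clock
   duration of that single call in *secs.
   Returns pthread_create()'s own return value (0 on success). */
static int timed_pthread_create(pthread_t *th, void *(*fn)(void *),
                                void *arg, double *secs) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = pthread_create(th, NULL, fn, arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *secs = elapsed_sec(t0, t1);
    return rc;
}
```

Substituting such a wrapper for the pthread_create() call in the creation loop and printing *secs per thread would show which individual calls stall.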
If the other threads competing with pthread_create() are instead
stalled in a pthread_barrier_wait(), the problem goes away entirely
(i.e., by adding a semantically unnecessary pthread_barrier_wait(), the
worst-case performance gets over 75,000x better). The test results
demonstrate that the spin barriers themselves run quite fast, but
pthread_create() runs very slowly when other unrelated threads are
executing sched_yield(). Note that inserting pthread_barrier_wait() to
stall every thread in the process during a pthread_create() is not
always a practical solution in a real program, where the thread
creation behavior may be less regular than shown in this example.
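Where a barrier around thread creation is impractical, one generic pattern is to bound the yield loop and fall back to sleeping, so that waiters stop competing for the CPU. The sketch below illustrates that pattern only. It is an assumption for illustration, it is not the app-specific workaround mentioned in this report, and it has not been measured against the Cygwin behavior described here. The name wait_with_backoff, the max_yields parameter and the 1 ms sleep are all hypothetical choices.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

/* Polls *flag until it equals `want`.
   The first max_yields unsuccessful polls are followed by sched_yield();
   every later unsuccessful poll is followed by a 1 ms nanosleep().
   Returns the number of unsuccessful polls. */
static int wait_with_backoff(atomic_int *flag, int want, int max_yields) {
    int polls = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) != want) {
        if (polls < max_yields) {
            sched_yield();               /* cheap: stay runnable */
        } else {
            struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000L };
            nanosleep(&ts, NULL);        /* back off: block for about 1 ms */
        }
        polls++;
    }
    return polls;
}
```

The trade-off is latency: once a waiter has fallen back to sleeping, it may notice the flag change up to roughly one sleep interval late.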
Also shown are performance results for the same test on a Linux system
with somewhat comparable hardware (the CPU running Linux is 5 years
older on Intel's product calendar). The Linux system does NOT
demonstrate the problem. Similar code has run on several other POSIX
OSes (including OSX, FreeBSD, NetBSD, Solaris), in a wide variety of
architectural configurations -- all without problems.
This pthread_create() performance problem has been reproduced with
similar results on four different Windows machines (including laptops
and servers), running all combinations of the following Cygwin
configurations:
Windows 7/64 Cygwin {32,64} {2.7,2.6,2.0}
Windows 10/64 Cygwin 64 2.7
I realize this may represent a parallelism pattern that cannot be
supported efficiently on Cygwin (and we've internally found an
app-specific workaround not represented here), but I thought it
responsible to report the performance issue anyhow.
Thanks for your consideration.
-Dan Bonachea
========================================================================================
// pthread-spawn.c test, by Dan Bonachea
// compile with a command like:
//   gcc -std=c99 -D_REENTRANT -D_GNU_SOURCE pthread-spawn.c -o pthread-spawn -lpthread
// usage:
//   pthread-spawn <initialbarrier> <numthreads> <numiters>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int numthreads=256;
int numiters=1000;
int initialbarrier = 0;
pthread_barrier_t pthbarrier;
volatile int *spinbarrier;

void *thread_start(void *p) {
  volatile int *myspin = p;
  if (initialbarrier) {
    int ret = pthread_barrier_wait(&pthbarrier);
    if (ret && ret != PTHREAD_BARRIER_SERIAL_THREAD)
      perror("pthread_barrier_wait");
  }
  if (myspin == &spinbarrier[numthreads-1]) { // last thread
    printf("Running %d spin barriers...\n",numiters);
  }
  for (int iter=1; iter <= numiters; iter++) { // execute numiters spin barriers
    if (myspin == spinbarrier) { // master thread
      for (int th = 1; th < numthreads; th++) { // wait for each slave
        while (spinbarrier[th] != iter) {
          if (sched_yield()) perror("sched_yield"); // yield
        }
      }
      *spinbarrier = iter; // broadcast
    } else { // slave threads
      *myspin = iter; // signal
      while (*spinbarrier != iter) { // wait for master broadcast
        if (sched_yield()) perror("sched_yield"); // yield
      }
    }
  }
  return 0;
}

int main(int argc, char **argv) {
  // parse args
  if (argc > 1) initialbarrier = atoi(argv[1]);
  if (argc > 2) numthreads = atoi(argv[2]);
  if (argc > 3) numiters = atoi(argv[3]);
  // init data structures
  pthread_t *th = malloc(sizeof(pthread_t)*numthreads);
  spinbarrier = calloc(numthreads,sizeof(int));
  if (pthread_barrier_init(&pthbarrier, NULL, numthreads))
    perror("pthread_barrier_init");
  printf("Creating %d threads%s...\n",numthreads,
         (initialbarrier?", with initial pthread_barrier_wait":""));fflush(stdout);
  for (int i=0; i < numthreads; i++) {
    if (pthread_create(&th[i], NULL, thread_start, (void *)&spinbarrier[i]))
      perror("pthread_create");
  }
  printf("Creation complete!\n"); fflush(stdout);
  for (int i=0; i < numthreads; i++) {
    void *ret;
    if (pthread_join(th[i], &ret)) perror("pthread_join");
  }
  printf("Done!\n");fflush(stdout);
  return 0;
}
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
* Re: pthread_create() slowdown with concurrent sched_yield()
From: Corinna Vinschen @ 2017-03-08 16:48 UTC
To: cygwin
On Mar 7 16:19, Dan Bonachea wrote:
> I suspect I may have discovered a corner-case performance bug in
> Cygwin's pthread_create() implementation. The problem arises when a
> call to pthread_create() is made concurrently with multiple pthreads
> in the same process spinning on calls to sched_yield(). I've searched
> the Cygwin mailing list archives, user guide, FAQ, and Google and not
> found any mention of this particular misbehavior.
>
> A minimal demo program is copied below and also available here:
> https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=549
> The demo program is a narrowed-down version of test code used in the
> GASNet communication system (http://gasnet.lbl.gov).
>
> The test code calls pthread_create to spawn a user-controlled number
> of threads, which then execute 1000 "spin barriers" - implemented by
> spinning on in-memory flags and stalling with sched_yield(). The test
> can also optionally insert a pthread_barrier_wait() across all threads
> before the first spin barrier.
Thanks for the thorough analysis and especially the testcase!
I applied a fix for this problem and uploaded new developer snapshots
to https://cygwin.com/snapshots/
Please give them a try.
Thanks,
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Maintainer cygwin AT cygwin DOT com
Red Hat
* Re: pthread_create() slowdown with concurrent sched_yield()
From: Dan Bonachea @ 2017-03-08 23:02 UTC
To: cygwin
On Wed, Mar 8, 2017 at 11:48 AM, Corinna Vinschen
<corinna-cygwin@cygwin.com> wrote:
>
> Thanks for the thorough analysis and especially the testcase!
>
> I applied a fix for this problem and uploaded new developer snapshots
> to https://cygwin.com/snapshots/
>
> Please give them a try.
Hi Corinna -
Thanks for the quick response!
I've confirmed the fix on my system using the latest snapshot:
2017-03-08 16:47:12 UTC snapshot: 2.7.1(0.307/5/3) x86_64 and i686
The fix solves the problem for both the test program and the original
application, on both 32 and 64 bit.
Thanks again!
-D
* Re: pthread_create() slowdown with concurrent sched_yield()
From: Corinna Vinschen @ 2017-03-09 16:41 UTC
To: cygwin
On Mar 8 18:00, Dan Bonachea wrote:
> On Wed, Mar 8, 2017 at 11:48 AM, Corinna Vinschen
> <corinna-cygwin@cygwin.com> wrote:
> >
> > Thanks for the thorough analysis and especially the testcase!
> >
> > I applied a fix for this problem and uploaded new developer snapshots
> > to https://cygwin.com/snapshots/
> >
> > Please give them a try.
>
> Hi Corinna -
>
> Thanks for the quick response!
>
> I've confirmed the fix on my system using the latest snapshot:
>
> 2017-03-08 16:47:12 UTC snapshot: 2.7.1(0.307/5/3) x86_64 and i686
>
> The fix solves the problem for both the test program and the original
> application, on both 32 and 64 bit.
Thanks for testing,
Corinna