From: Erik Bray
Date: Mon, 09 Jan 2017 13:29:00 -0000
Subject: Re: Hangs on connect to UNIX socket being listened on in the same process
 (was: Cygwin hanging in pselect)
To: cygwin@cygwin.com

On Mon, Jan 9, 2017 at 12:01 PM, Erik Bray wrote:
> On Fri, Jan 6, 2017 at 12:40 PM, Erik Bray wrote:
>> Hello, and happy new-ish year,
>>
>> I've been working on and off over the past few months on bringing
>> Python's compatibility with Cygwin up to snuff, including having all
>> pertinent tests passing. I've noticed that there are several tests
>> (which I currently skip) that cause the process to hang indefinitely,
>> and not respond to any signals from Cygwin (it can only be killed from
>> Windows). This is Cygwin 64-bit--I have not tested 32-bit.
>>
>> I finally looked into this problem and found the lockup to be in
>> pselect() somewhere. Attached I've provided the most minimal example
>> I've been able to come up with so far that reproduces the problem,
>> which I'll describe in a bit more detail next. I would attach a
>> cygcheck output if requested, but I was also able to reproduce this on
>> a recent build from source.
>>
>> So far as I've been able to tell, the problem only occurs with AF_UNIX
>> sockets. In the example I have a 'server' socket and a 'client'
>> socket both set to non-blocking. The client connects to the socket,
>> returning errno EINPROGRESS as expected. Then I do a pselect on the
>> client socket to wait until it is ready to be read from. The hang
>> only happens when I pselect on the client socket, and not on the
>> server socket. It doesn't seem to make a difference what the timeout
>> is. One thing I have not tried is if the client and server are
>> actually different processes, but the example from the Python tests
>> this is reproducing is where they are both in the same process.
>>
>> Below is (I think) the most relevant output from strace on the test
>> case. It seems to hang somewhere in socket_cleanup, but I haven't
>> investigated any further than that.
>
> I made a little bit of progress debugging this, but now I'm stumped.
> It seems the problem is this:
>
> For each socket whose fd is passed to select() a thread_socket is
> started which calls peek_socket until there are bits ready on the
> socket, or until the timeout is reached. This in turn calls
> fhandler_socket::evaluate_events.
>
> The reason it's only locking up on my "client thread" on which
> connect() is called, is that evaluate_events notes that the socket is
> waiting to connect, and this passes control to
> fhandler_socket::af_local_connect(). af_local_connect() temporarily
> sets the socket to blocking, then sends a magic string to the socket
> (you can see in my strace log that this succeeds). What's strange,
> and what I don't understand, is that there are no FD_READ or FD_OOB
> events recorded for the WSASendTo call from af_local_send_secret().
> Then, after af_local_send_secret() it calls af_local_recv_secret().
> This calls recv_internal() which in turn calls recursively into
> fhandler_socket::evaluate_events where it waits for an FD_READ or
> FD_OOB event that never arrives. And since it set the socket to
> blocking it just sits in an infinite loop.
>
> Meanwhile the timer for the select() call expires and tries to shut
> down the thread_socket but it can't because it never completes.
>
> What I don't understand is why there is not an event recorded for the
> WSASendTo in send_internal.
> I even wrapped it with the following
> debug code to wait for an FD_READ event immediately following the
> WSASendTo:
>
>   else if (get_socket_type () == SOCK_STREAM)
>     {
>       WSAEventSelect(get_socket (), wsock_evt, EVENT_MASK);
>       res = WSASendTo (get_socket (), out_buf, out_idx, &ret, flags,
>                        wsamsg->name, wsamsg->namelen, NULL, NULL);
>       debug_printf("WSASendTo sent %d bytes; ret: %d", ret, res);
>       while (!(res=wait_for_events (FD_READ | FD_OOB, 0))) {
>           debug_printf("Waiting for socket to be readable");
>       }
>     }
>
> But the strace at this point just outputs:
>
>    62  108286 [socksel] poll_test 24152
> fhandler_socket::af_local_connect: af_local_connect called,
> no_getpeereid=0
>   156  108442 [socksel] poll_test 24152
> fhandler_socket::send_internal: WSASendTo sent 16 bytes; ret: 0
>
> It never returns from send_internal. I don't have deep knowledge of
> WinSock, but from what I've read ISTM WSASendTo should have triggered
> an FD_READ event on the socket, and it doesn't for some reason.

After playing around with this a bit more I came up with a much
simpler example. This has nothing to do with select() at all,
directly. The simplified example is just:

/* The header names were lost in the archived copy; the six below are
   assumed, being the ones this code needs. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void) {
    fd_set rfds;
    int sock_server, sock_client;
    int retval;
    struct sockaddr_un addr;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, "@test.sock");

    /* Server side: bind and listen, but never call accept(). */
    sock_server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(sock_server, (struct sockaddr*)&addr, sizeof(addr))) {
        printf("binding server socket failed");
        return 1;
    }

    retval = listen(sock_server, 5);
    printf("Ret from listen: %d\n", retval);

    /* Client side: a blocking connect() from the same process. */
    sock_client = socket(AF_UNIX, SOCK_STREAM, 0);
    retval = connect(sock_client, (struct sockaddr*)&addr, sizeof(addr));
    printf("Ret from client connect: %d; errno: %d\n", retval, errno);
    return 0;
}

On Linux this example works as I expect, and the connect() call
returns immediately.
On Cygwin, however, the connect() call hangs after
af_local_send_secret(), as described in my first message. When I
split this example up into separate client and server processes,
though, it works as expected: the connect() is properly negotiated
and returns immediately.