From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from lndn.lancelotsix.com (lndn.lancelotsix.com [51.195.220.111]) by sourceware.org (Postfix) with ESMTPS id 0CEDC3858035 for ; Wed, 9 Mar 2022 00:21:25 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0CEDC3858035 Received: from Plymouth (unknown [IPv6:2a02:390:9086:0:c265:bada:253c:abe1]) by lndn.lancelotsix.com (Postfix) with ESMTPSA id 8A81480D5D; Wed, 9 Mar 2022 00:21:23 +0000 (UTC) Date: Wed, 9 Mar 2022 00:21:20 +0000 From: Lancelot SIX To: Simon Marchi Cc: Pedro Alves , gdb-patches@sourceware.org Subject: Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux Message-ID: <20220309002120.bbmp67zkiiwk4hpr@Plymouth> References: <20220303144020.3601082-1-pedro@palves.net> <20220303144020.3601082-10-pedro@palves.net> <308a1654-45a3-3886-0094-eff4300c15a8@simark.ca> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <308a1654-45a3-3886-0094-eff4300c15a8@simark.ca> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.11 (lndn.lancelotsix.com [0.0.0.0]); Wed, 09 Mar 2022 00:21:23 +0000 (UTC) X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gdb-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 09 Mar 2022 00:21:26 -0000 On Mon, Mar 07, 2022 at 03:24:49PM -0500, Simon Marchi wrote: > On 2022-03-03 09:40, Pedro Alves wrote: > > This is problematic for infrun.c:stop_all_threads, which asks the > > target to report all thread exit events to infrun. For example, in > > stop_all_threads, core GDB counts 2 threads that needs to be stopped. > > It then asks the target to stop those 2 threads (with > > target_stop(ptid)), and waits for 2 events to come back from the > > target. Unfortunately, when waiting for events, the linux-nat target, > > due to the random event selecting mentioned above, reports the > > whole-process EXIT event even though the other thread has exited and > > its exit hasn't been reported yet. As a consequence, stop_all_threads > > receives one event, but waits indefinitely for the second one to come. > > Effectively, GDB hangs forever. > > I don't really understand how we end up waiting forever. Why does > reporting the leader exit event first make it so we discard the event > for the other thread? > > Other than that, it looks sensible to me to ensure we return events in > this order. > > Simon Hi, You are right, this patch is not directly linked to GDB hanging forever. It was written as part of a series originally started to fix a hang, which ended up handling more. I think there have been mixup when reworking the patches before sending. The hang can happen because of the false positive of the zombie detection (which is worked around in another patch in this series). In this case, if GDB is waiting for stop events from 2 threads in stop_all_threads and the zombie detection drops a thread at that moment, then only the second thread reports an event, causing GDB to hang forever waiting for an event from the first thread. Hanging issue left aside, we end up with a process terminating with the exit event + exit code from a non leader thread which is incorrect. This lead us to realize that there are other situations where the process exit is not reported with the correct thread, and so the exit code returned to infrun is not the expected one. When a multithreaded process terminates, we are supposed to first see the exit of each non-leader thread, and only then the exit of the leader thread. This is what we pulled from the kernel using waitpid in linux-nat. However linux-nat uses some "fairness" and randomly selects one of the events it cached to report to infrun first. In current GDB (i.e. before this series), if the random function chooses the exit from the leader first, linux-nat generates a THREAD_EXITED event for it, and will generated the EXITED event when it sees the last remaining (non-leader) thread exit. The exit code associated with this EXITED event is the one of the non-leader thread, and this is wrong. This patch improves the random selection and ensure that we do not report the leader exit as long as there are non-leader exit events pending (and there should be). For any well-behaved process, this ensures that we report the correct exit code with the EXITED event. This overall helps to reason in infrun: we receive the THREAD_EXITED events first, and the EXITED event last. We will rework the commit message and remove the mention of the hang in this patch. Best, Lancelot.