From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lsix@lancelotsix.com>
Received: from lndn.lancelotsix.com (lndn.lancelotsix.com [51.195.220.111])
 by sourceware.org (Postfix) with ESMTPS id 0CEDC3858035
 for <gdb-patches@sourceware.org>; Wed,  9 Mar 2022 00:21:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0CEDC3858035
Received: from Plymouth (unknown [IPv6:2a02:390:9086:0:c265:bada:253c:abe1])
 by lndn.lancelotsix.com (Postfix) with ESMTPSA id 8A81480D5D;
 Wed,  9 Mar 2022 00:21:23 +0000 (UTC)
Date: Wed, 9 Mar 2022 00:21:20 +0000
From: Lancelot SIX <lsix@lancelotsix.com>
To: Simon Marchi <simark@simark.ca>
Cc: Pedro Alves <pedro@palves.net>, gdb-patches@sourceware.org
Subject: Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
Message-ID: <20220309002120.bbmp67zkiiwk4hpr@Plymouth>
References: <20220303144020.3601082-1-pedro@palves.net>
 <20220303144020.3601082-10-pedro@palves.net>
 <308a1654-45a3-3886-0094-eff4300c15a8@simark.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <308a1654-45a3-3886-0094-eff4300c15a8@simark.ca>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.11
 (lndn.lancelotsix.com [0.0.0.0]); Wed, 09 Mar 2022 00:21:23 +0000 (UTC)
X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 09 Mar 2022 00:21:26 -0000

On Mon, Mar 07, 2022 at 03:24:49PM -0500, Simon Marchi wrote:
> On 2022-03-03 09:40, Pedro Alves wrote:
> > This is problematic for infrun.c:stop_all_threads, which asks the
> > target to report all thread exit events to infrun.  For example, in
> > stop_all_threads, core GDB counts 2 threads that needs to be stopped.
> > It then asks the target to stop those 2 threads (with
> > target_stop(ptid)), and waits for 2 events to come back from the
> > target.  Unfortunately, when waiting for events, the linux-nat target,
> > due to the random event selecting mentioned above, reports the
> > whole-process EXIT event even though the other thread has exited and
> > its exit hasn't been reported yet.  As a consequence, stop_all_threads
> > receives one event, but waits indefinitely for the second one to come.
> > Effectively, GDB hangs forever.
> 
> I don't really understand how we end up waiting forever.  Why does
> reporting the leader exit event first make it so we discard the event
> for the other thread?
> 
> Other than that, it looks sensible to me to ensure we return events in
> this order.
> 
> Simon

Hi,

You are right, this patch is not directly linked to GDB hanging forever.
It was written as part of a series originally started to fix a hang,
which ended up handling more.  I think there have been mixup when
reworking the patches before sending.

The hang can happen because of the false positive of the zombie
detection (which is worked around in another patch in this series).  In
this case, if GDB is waiting for stop events from 2 threads in
stop_all_threads and the zombie detection drops a thread at that moment,
then only the second thread reports an event, causing GDB to hang
forever waiting for an event from the first thread.  Hanging issue left
aside, we end up with a process terminating with the exit event + exit
code from a non leader thread which is incorrect.

This lead us to realize that there are other situations where the
process exit is not reported with the correct thread, and so the exit
code returned to infrun is not the expected one.

When a multithreaded process terminates, we are supposed to first see
the exit of each non-leader thread, and only then the exit of the leader
thread.  This is what we pulled from the kernel using waitpid in
linux-nat.  However linux-nat uses some "fairness" and randomly selects
one of the events it cached to report to infrun first.  In current GDB
(i.e.  before this series), if the random function chooses the exit from
the leader first, linux-nat generates a THREAD_EXITED event for it, and
will generated the EXITED event when it sees the last remaining
(non-leader) thread exit.  The exit code associated with this EXITED
event is the one of the non-leader thread, and this is wrong.

This patch improves the random selection and ensure that we do not
report the leader exit as long as there are non-leader exit events
pending (and there should be).  For any well-behaved process, this
ensures that we report the correct exit code with the EXITED event.

This overall helps to reason in infrun: we receive the THREAD_EXITED
events first, and the EXITED event last.

We will rework the commit message and remove the mention of the hang in
this patch.

Best,
Lancelot.