From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <alves.ped@gmail.com>
Received: from mail-wr1-f50.google.com (mail-wr1-f50.google.com
 [209.85.221.50])
 by sourceware.org (Postfix) with ESMTPS id E62FC3857C7E
 for <gdb-patches@sourceware.org>; Wed,  9 Mar 2022 14:45:13 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E62FC3857C7E
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=palves.net
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-wr1-f50.google.com with SMTP id q14so3439000wrc.4
 for <gdb-patches@sourceware.org>; Wed, 09 Mar 2022 06:45:13 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:message-id:date:mime-version:user-agent:subject
 :content-language:to:cc:references:from:in-reply-to
 :content-transfer-encoding;
 bh=jICkn93ZTHGusDhhH/eUZLm/MNXvIEfldG1nDTtY8IQ=;
 b=tZQ12OBLxOhbJHdQs1s+A869HmdirxXx/Hw/K8t1qMipZ7NP2dKLzPLU4z4jUj0rmX
 Ee3VVmIvR2ISLJAc+qvJ7Ox21ewbTRhONjtwFqeJClODFCyTR+9yqUHGXo92yzDHFurL
 hzImyV1f9d+jH1NHm5/1/c/7+fvri31Q2A5m2mK/olR4dG2B42R8o3BlbmgHIxuGGdQD
 h3nh9iv6FwVDsyBP1UVM4R6+Zfuow3PZEklKwxwgos7wnJ67Q7UeKKsv5B0hgV+Dr1Qm
 ydA/OIJ4rc0uhsa2zY+P7PqK56eKY761f0RfcqzcTdSjn+KdpniRAwAuJqi9yaemc47J
 Qimw==
X-Gm-Message-State: AOAM533iU136FnQCXbIjEV6OhE//rUq8q/LhKF1/hL9NsixD+/ER+zui
 bzutsY4glwTh3FEs0mO3TznkZ+nwfe0=
X-Google-Smtp-Source: ABdhPJy0dn8LgsWScIgl0GhR0YoKAy1cBEmurbn2htiCTqGnKW39nTMXO5tzp3ii0dnWqe0bWxZBbw==
X-Received: by 2002:a5d:424e:0:b0:1f0:3430:ffc8 with SMTP id
 s14-20020a5d424e000000b001f03430ffc8mr15676429wrr.672.1646837112897; 
 Wed, 09 Mar 2022 06:45:12 -0800 (PST)
Received: from ?IPV6:2001:8a0:f924:2600:209d:85e2:409e:8726?
 ([2001:8a0:f924:2600:209d:85e2:409e:8726])
 by smtp.gmail.com with ESMTPSA id
 f186-20020a1c38c3000000b00382a9b91515sm5134663wma.37.2022.03.09.06.45.11
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Wed, 09 Mar 2022 06:45:11 -0800 (PST)
Message-ID: <f88f7277-3801-d07e-d340-b148ed64d18e@palves.net>
Date: Wed, 9 Mar 2022 14:45:10 +0000
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.6.1
Subject: Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
Content-Language: en-US
To: Lancelot SIX <lsix@lancelotsix.com>, Simon Marchi <simark@simark.ca>
Cc: gdb-patches@sourceware.org
References: <20220303144020.3601082-1-pedro@palves.net>
 <20220303144020.3601082-10-pedro@palves.net>
 <308a1654-45a3-3886-0094-eff4300c15a8@simark.ca>
 <20220309002120.bbmp67zkiiwk4hpr@Plymouth>
From: Pedro Alves <pedro@palves.net>
In-Reply-To: <20220309002120.bbmp67zkiiwk4hpr@Plymouth>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-3.8 required=5.0 tests=BAYES_00,
 FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,
 KAM_DMARC_STATUS, NICE_REPLY_A, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 09 Mar 2022 14:45:16 -0000

On 2022-03-09 00:21, Lancelot SIX wrote:
> On Mon, Mar 07, 2022 at 03:24:49PM -0500, Simon Marchi wrote:
>> On 2022-03-03 09:40, Pedro Alves wrote:
>>> This is problematic for infrun.c:stop_all_threads, which asks the
>>> target to report all thread exit events to infrun.  For example, in
>>> stop_all_threads, core GDB counts 2 threads that needs to be stopped.
>>> It then asks the target to stop those 2 threads (with
>>> target_stop(ptid)), and waits for 2 events to come back from the
>>> target.  Unfortunately, when waiting for events, the linux-nat target,
>>> due to the random event selecting mentioned above, reports the
>>> whole-process EXIT event even though the other thread has exited and
>>> its exit hasn't been reported yet.  As a consequence, stop_all_threads
>>> receives one event, but waits indefinitely for the second one to come.
>>> Effectively, GDB hangs forever.
>>
>> I don't really understand how we end up waiting forever.  Why does
>> reporting the leader exit event first make it so we discard the event
>> for the other thread?
>>
>> Other than that, it looks sensible to me to ensure we return events in
>> this order.
>>
>> Simon
> 
> Hi,
> 
> You are right, this patch is not directly linked to GDB hanging forever.
> It was written as part of a series originally started to fix a hang,
> which ended up handling more.  I think there have been mixup when
> reworking the patches before sending.

My bad, I based the commit message off of the wrong patch's original commit
message, and never questioned it.  :-P

> 
> The hang can happen because of the false positive of the zombie
> detection (which is worked around in another patch in this series).  

Right, the zombie detection patch (#7) also ensures that we only ever
report the process-exit event for the leader thread, instead of whatever
is the last lwp in the list, which is what fixes it.

> In
> this case, if GDB is waiting for stop events from 2 threads in
> stop_all_threads and the zombie detection drops a thread at that moment,
> then only the second thread reports an event, causing GDB to hang
> forever waiting for an event from the first thread.  Hanging issue left
> aside, we end up with a process terminating with the exit event + exit
> code from a non leader thread which is incorrect.

Thanks.  I double checked now that this is already described in patch #7's
commit log, so there's nothing to change in any other patch, AFAICT.

> 
> This lead us to realize that there are other situations where the
> process exit is not reported with the correct thread, and so the exit
> code returned to infrun is not the expected one.
> 
> When a multithreaded process terminates, we are supposed to first see
> the exit of each non-leader thread, and only then the exit of the leader
> thread.  This is what we pulled from the kernel using waitpid in
> linux-nat.  However linux-nat uses some "fairness" and randomly selects
> one of the events it cached to report to infrun first.  In current GDB
> (i.e.  before this series), if the random function chooses the exit from
> the leader first, linux-nat generates a THREAD_EXITED event for it, and
> will generated the EXITED event when it sees the last remaining
> (non-leader) thread exit.  The exit code associated with this EXITED
> event is the one of the non-leader thread, and this is wrong.

Right.  This is no longer a concern after patch #7, though, as linux-nat
will only ever report the EXITED event for the leader.

> 
> This patch improves the random selection and ensure that we do not
> report the leader exit as long as there are non-leader exit events
> pending (and there should be).  For any well-behaved process, this
> ensures that we report the correct exit code with the EXITED event.

Right, agreed, if applied in isolation.  It would help in that case, though if the
zombie detection kicked in, the issue would still trigger, as you're aware.  But since
patch #7 ensured we only ever report the exit for the leader, that patch fixed both
the "well-behaved" case, and the case the leader really exits.

So the only remaining effect of this patch is the ensuring a natural order
of events out of linux-nat, and we must evaluate it on those grounds alone now.

> 
> This overall helps to reason in infrun: we receive the THREAD_EXITED
> events first, and the EXITED event last.
> 
> We will rework the commit message and remove the mention of the hang in
> this patch.
> 

As we discussed yesterday internally, for stop_all_threads, it actually doesn't
matter which order the events are sent, as all events will be left pending.
Later on, infrun itself will pick one event at random between all pending
events on the infrun side (infrun.c:random_pending_event_thread), and may well pick up
the EXITED event before the THREAD_EXITED event anyhow, so to ensure proper ordering
throughout we'd need to tweak infrun as well.

There's also another point that we're not addressing yet, which is that EXECD events are
similar to EXITED events, in that they are process-wide events.  For example, if a non-leader
thread returns some event to infrun and it is left pending, and then the leader execs, and
the target reports an EXECD event for it and that exec event is also left pending, we'd want to
be sure that the non-leader event is never ever processed after the EXECD event, as that
would result in GDB thinking that the post-exec program had a second thread, when it really
does not.

I think the "badness" with exec does not happen in practice on native Linux currently, because
before the leader sees an exec event out of the kernel, it must see an exit event for the other
thread, even if it was ptrace-stopped, and so linux-nat deletes the non-leader thread before the
leader sees the execd, and as the pending events as stored in the thread, the pending event is
deleted as well everywhere automatically.

However, for remote/gdbserver, it's not so clear something odd can't happen, because gdbserver's
linux-low.cc deleting a thread does not immediately result in gdb's corresponding thread being
deleted.  I'm not sure we're guaranteed to delete the non-leader's pending event in all stages 
that queue events.  I'm thinking of gdbserver's notification queue and remote.c's queue.

So overall, I'm thinking of dropping this patch from the series for now, as it's not really
needed AFAICT, and is incomplete.

I've dropped patches #9 and #10 locally, and left the testcase (patch #11) running in a loop
testing both native and gdbserver for over an hour and saw no failures.  Any objections
to pushing the series without those two patches?

Pedro Alves