From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr1-f50.google.com (mail-wr1-f50.google.com [209.85.221.50]) by sourceware.org (Postfix) with ESMTPS id E62FC3857C7E for ; Wed, 9 Mar 2022 14:45:13 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E62FC3857C7E Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=palves.net Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-wr1-f50.google.com with SMTP id q14so3439000wrc.4 for ; Wed, 09 Mar 2022 06:45:13 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent:subject :content-language:to:cc:references:from:in-reply-to :content-transfer-encoding; bh=jICkn93ZTHGusDhhH/eUZLm/MNXvIEfldG1nDTtY8IQ=; b=tZQ12OBLxOhbJHdQs1s+A869HmdirxXx/Hw/K8t1qMipZ7NP2dKLzPLU4z4jUj0rmX Ee3VVmIvR2ISLJAc+qvJ7Ox21ewbTRhONjtwFqeJClODFCyTR+9yqUHGXo92yzDHFurL hzImyV1f9d+jH1NHm5/1/c/7+fvri31Q2A5m2mK/olR4dG2B42R8o3BlbmgHIxuGGdQD h3nh9iv6FwVDsyBP1UVM4R6+Zfuow3PZEklKwxwgos7wnJ67Q7UeKKsv5B0hgV+Dr1Qm ydA/OIJ4rc0uhsa2zY+P7PqK56eKY761f0RfcqzcTdSjn+KdpniRAwAuJqi9yaemc47J Qimw== X-Gm-Message-State: AOAM533iU136FnQCXbIjEV6OhE//rUq8q/LhKF1/hL9NsixD+/ER+zui bzutsY4glwTh3FEs0mO3TznkZ+nwfe0= X-Google-Smtp-Source: ABdhPJy0dn8LgsWScIgl0GhR0YoKAy1cBEmurbn2htiCTqGnKW39nTMXO5tzp3ii0dnWqe0bWxZBbw== X-Received: by 2002:a5d:424e:0:b0:1f0:3430:ffc8 with SMTP id s14-20020a5d424e000000b001f03430ffc8mr15676429wrr.672.1646837112897; Wed, 09 Mar 2022 06:45:12 -0800 (PST) Received: from ?IPV6:2001:8a0:f924:2600:209d:85e2:409e:8726? ([2001:8a0:f924:2600:209d:85e2:409e:8726]) by smtp.gmail.com with ESMTPSA id f186-20020a1c38c3000000b00382a9b91515sm5134663wma.37.2022.03.09.06.45.11 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 09 Mar 2022 06:45:11 -0800 (PST) Message-ID: Date: Wed, 9 Mar 2022 14:45:10 +0000 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.1 Subject: Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux Content-Language: en-US To: Lancelot SIX , Simon Marchi Cc: gdb-patches@sourceware.org References: <20220303144020.3601082-1-pedro@palves.net> <20220303144020.3601082-10-pedro@palves.net> <308a1654-45a3-3886-0094-eff4300c15a8@simark.ca> <20220309002120.bbmp67zkiiwk4hpr@Plymouth> From: Pedro Alves In-Reply-To: <20220309002120.bbmp67zkiiwk4hpr@Plymouth> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-3.8 required=5.0 tests=BAYES_00, FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, NICE_REPLY_A, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gdb-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 09 Mar 2022 14:45:16 -0000 On 2022-03-09 00:21, Lancelot SIX wrote: > On Mon, Mar 07, 2022 at 03:24:49PM -0500, Simon Marchi wrote: >> On 2022-03-03 09:40, Pedro Alves wrote: >>> This is problematic for infrun.c:stop_all_threads, which asks the >>> target to report all thread exit events to infrun. For example, in >>> stop_all_threads, core GDB counts 2 threads that needs to be stopped. >>> It then asks the target to stop those 2 threads (with >>> target_stop(ptid)), and waits for 2 events to come back from the >>> target. Unfortunately, when waiting for events, the linux-nat target, >>> due to the random event selecting mentioned above, reports the >>> whole-process EXIT event even though the other thread has exited and >>> its exit hasn't been reported yet. As a consequence, stop_all_threads >>> receives one event, but waits indefinitely for the second one to come. >>> Effectively, GDB hangs forever. >> >> I don't really understand how we end up waiting forever. Why does >> reporting the leader exit event first make it so we discard the event >> for the other thread? >> >> Other than that, it looks sensible to me to ensure we return events in >> this order. >> >> Simon > > Hi, > > You are right, this patch is not directly linked to GDB hanging forever. > It was written as part of a series originally started to fix a hang, > which ended up handling more. I think there have been mixup when > reworking the patches before sending. My bad, I based the commit message off of the wrong patch's original commit message, and never questioned it. :-P > > The hang can happen because of the false positive of the zombie > detection (which is worked around in another patch in this series). Right, the zombie detection patch (#7) also ensures that we only ever report the process-exit event for the leader thread, instead of whatever is the last lwp in the list, which is what fixes it. > In > this case, if GDB is waiting for stop events from 2 threads in > stop_all_threads and the zombie detection drops a thread at that moment, > then only the second thread reports an event, causing GDB to hang > forever waiting for an event from the first thread. Hanging issue left > aside, we end up with a process terminating with the exit event + exit > code from a non leader thread which is incorrect. Thanks. I double checked now that this is already described in patch #7's commit log, so there's nothing to change in any other patch, AFAICT. > > This lead us to realize that there are other situations where the > process exit is not reported with the correct thread, and so the exit > code returned to infrun is not the expected one. > > When a multithreaded process terminates, we are supposed to first see > the exit of each non-leader thread, and only then the exit of the leader > thread. This is what we pulled from the kernel using waitpid in > linux-nat. However linux-nat uses some "fairness" and randomly selects > one of the events it cached to report to infrun first. In current GDB > (i.e. before this series), if the random function chooses the exit from > the leader first, linux-nat generates a THREAD_EXITED event for it, and > will generated the EXITED event when it sees the last remaining > (non-leader) thread exit. The exit code associated with this EXITED > event is the one of the non-leader thread, and this is wrong. Right. This is no longer a concern after patch #7, though, as linux-nat will only ever report the EXITED event for the leader. > > This patch improves the random selection and ensure that we do not > report the leader exit as long as there are non-leader exit events > pending (and there should be). For any well-behaved process, this > ensures that we report the correct exit code with the EXITED event. Right, agreed, if applied in isolation. It would help in that case, though if the zombie detection kicked in, the issue would still trigger, as you're aware. But since patch #7 ensured we only ever report the exit for the leader, that patch fixed both the "well-behaved" case, and the case the leader really exits. So the only remaining effect of this patch is the ensuring a natural order of events out of linux-nat, and we must evaluate it on those grounds alone now. > > This overall helps to reason in infrun: we receive the THREAD_EXITED > events first, and the EXITED event last. > > We will rework the commit message and remove the mention of the hang in > this patch. > As we discussed yesterday internally, for stop_all_threads, it actually doesn't matter which order the events are sent, as all events will be left pending. Later on, infrun itself will pick one event at random between all pending events on the infrun side (infrun.c:random_pending_event_thread), and may well pick up the EXITED event before the THREAD_EXITED event anyhow, so to ensure proper ordering throughout we'd need to tweak infrun as well. There's also another point that we're not addressing yet, which is that EXECD events are similar to EXITED events, in that they are process-wide events. For example, if a non-leader thread returns some event to infrun and it is left pending, and then the leader execs, and the target reports an EXECD event for it and that exec event is also left pending, we'd want to be sure that the non-leader event is never ever processed after the EXECD event, as that would result in GDB thinking that the post-exec program had a second thread, when it really does not. I think the "badness" with exec does not happen in practice on native Linux currently, because before the leader sees an exec event out of the kernel, it must see an exit event for the other thread, even if it was ptrace-stopped, and so linux-nat deletes the non-leader thread before the leader sees the execd, and as the pending events as stored in the thread, the pending event is deleted as well everywhere automatically. However, for remote/gdbserver, it's not so clear something odd can't happen, because gdbserver's linux-low.cc deleting a thread does not immediately result in gdb's corresponding thread being deleted. I'm not sure we're guaranteed to delete the non-leader's pending event in all stages that queue events. I'm thinking of gdbserver's notification queue and remote.c's queue. So overall, I'm thinking of dropping this patch from the series for now, as it's not really needed AFAICT, and is incomplete. I've dropped patches #9 and #10 locally, and left the testcase (patch #11) running in a loop testing both native and gdbserver for over an hour and saw no failures. Any objections to pushing the series without those two patches? Pedro Alves