From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 52443 invoked by alias); 25 Nov 2018 23:13:50 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 51963 invoked by uid 89); 25 Nov 2018 23:13:49 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-9.1 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_2,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 spammy=posts, admit, telling, identification X-HELO: mailsec110.isp.belgacom.be Received: from mailsec110.isp.belgacom.be (HELO mailsec110.isp.belgacom.be) (195.238.20.106) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Sun, 25 Nov 2018 23:13:45 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=skynet.be; i=@skynet.be; q=dns/txt; s=securemail; t=1543187626; x=1574723626; h=message-id:subject:from:to:date:in-reply-to:references: mime-version:content-transfer-encoding; bh=0X1QYTeFlybMjotDGgfaT110PLT0747KkLPBbZS2mPY=; b=ezGMS+ifCj9rc3w9xni3Ha8R58kGpHLXXdvtOjUA0rASsJ9zp7o0/D1Z yynk0yVM2UKNSNiF/8s3O79TmN9qTA==; Received: from 184.205-67-87.adsl-dyn.isp.belgacom.be (HELO md) ([87.67.205.184]) by relay.skynet.be with ESMTP/TLS/AES256-GCM-SHA384; 26 Nov 2018 00:13:35 +0100 Message-ID: <1543187614.10222.3.camel@skynet.be> Subject: Re: [RFA] Factorize killing the children in linux-ptrace.c, and fix a 'process leak'. From: Philippe Waroquiers To: Pedro Alves , Simon Marchi , "gdb-patches@sourceware.org" Date: Sun, 25 Nov 2018 23:13:00 -0000 In-Reply-To: <1541957075.1531.1.camel@skynet.be> References: <20181103201929.10345-1-philippe.waroquiers@skynet.be> <3830a73e-d630-1176-b6e9-00f88ed6468d@ericsson.com> <734d1ee3-c452-bf04-0084-26a59238a139@redhat.com> <1541957075.1531.1.camel@skynet.be> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 8bit X-IsSubscribed: yes X-SW-Source: 2018-11/txt/msg00431.txt.bz2 Ping ? The patch discussed here is already a first improvement (which makes running tests under valgrind more comfortable). Some more factorisation might be done with the 4 other places where children are killed, but these seems more risky. So, maybe this patch could be pushed as a first low risk step towards the good direction ? Thanks Philippe On Sun, 2018-11-11 at 18:24 +0100, Philippe Waroquiers wrote: > On Sat, 2018-11-10 at 18:29 +0000, Pedro Alves wrote: > > There's some long and ugly history around PTRACE_KILL vs SIGKILL. > > See gdb/linux-nat.c kill_wait_one_lwp and kill_one_lwp, and the comments > > within those functions. I don't know whether the kernels those were for > > are relevant any more. Likely using git blame and finding the corresponding > > patch posts would shed some more light. Whatever we do, it'd be nice to > > make gdb/linux-nat.c use the unified code as well. > > I have done some digging in the source code and searched some > more info using git blame. > => I found 7 different linux logics related to kill(SIGKILL) > and ptrace(PTRACE_KILL). > I found only one identification of kernels giving problems. > ChangeLogs/commit msg were not really giving much > background information : these are telling what is being changed, > not why it is being changed. > linux-low.c has the most interesting comments explaining > the kill related problems. > > Below is some updated text with the found info (sorry, you > will have to re-read some parts already in the RFA). > > I have to admit that a 'linux-ptrace.c only factorization' > as suggested in the RFA (at least as a first step, till we have > validated with it that a simpler kill logic is ok on various platforms) > is somewhat less frightening me than reducing all 7 different > logics to a single one. > (there is a table at the end of this mail that summarises all > 7 logics). > > I must say that all this is killing me :). > Maybe I would feel better if I would just have added my own > local new 8th kill logic where I need it :). > > Philippe > > > nat/linux-ptrace.c has 3 different logics to kill a child process. > So, this patch factorizes killing a child in the function kill_child. > > The 3 different logics are: > * linux_ptrace_test_ret_to_nx is calling both kill (child, SIGKILL) >   and ptrace (PTRACE_KILL, child, ...), and then is calling once >   waitpid. > * linux_check_ptrace_features is calling ptrace (PTRACE_KILL, child, ...) >   + my_waitpid in a loop, as long as the waitpid status was WIFSTOPPED. > * linux_test_for_tracefork is calling once ptrace (PTRACE_KILL, child, ...) >   + my_waitpid. > > The linux ptrace documentation indicates that PTRACE_KILL is deprecated, > and tells to not use it, as it might return success but not kill the tracee. > The documentation indicates to send SIGKILL directly. > > I suspect that linux_ptrace_test_ret_to_nx calls both kill and ptrace just > to be sure ... > git blame indicates that the call to kill(SIGKILL) was added in addition to the > pre-existing ptrace PTRACE_KILL. The commit log also indicates to ignore > PTRACE_KILL errors. But there is no explanation why adding the SIGKILL call > was needed, and why PTRACE_KILL was kept. > > I suspect that linux_check_ptrace_features calls ptrace in a loop > to bypass the PTRACE_KILL limitation. > git blame indicates that this function was heavily based on > linux-nat.c:linux_test_for_tracefork (which was moved to linux-ptrace.c, > see next paragraph). I found no log/comments about the killing subtilities. > > linux_test_for_tracefork seems to not handle the PTRACE_KILL > limitation/has not been updated like linux_check_ptrace_features > (which has a loop). > > Also, 2 of the 3 logics are calling my_waitpid, which seems better, > as this is protecting the waitpid syscall against EINTR. > > We also find a bunch of other kill logics in gdbserver/linux-low.c and > linux-nat.c and linux-fork.c. > > gdbserver/linux-low.c has linux_kill_one_lwp, that has an extensive > comment explaining why it uses both PTRACE_KILLME and SIGKILL > Note however that the gdbserver code does the SIGKILL using > 'syscall (__NR_tkill', not using 'kill (... SIGKILL). > gdbserver/linux-low.c does a loop calling both PTRACE_KILLME and SIGKILL, > but has a comment telling that the loop is most likely unnecessary. > > Also, gdbservers/linux-low.c handles the case of waiting for "clone" children, > indicating that __WALL depends on SIGCHLD, and killing a stopped process > does not generate this signal. > It also indicates it supports old kernels not providing clone(CLONE_THREAD), > by sending a SIGKILL for each thread (not only one for the process). > There is also a comment telling that only PTRACE_KILL was used for years, > and that paranoia tells to do both SIGKILL and PTRACE_KILL, in case > PTRACE_KILL might work better on old kernels (the comment says it is dubious > there is such an old kernel, but explains that this is the paranoia). > linux-low.c also has a special logic to kill the thread that has its > lwpid == pid, because killing the parent before the children can > cause zombies on kernels at least 2.6.0-test7 ... 2.6.8-rc4. > > linux-nat.c first uses SIGKILL, and then PTRACE_KILL. A comment (from 2011) > is telling that PTRACE_KILL may resume the inferior (so the need for SIGKILL). > Another comment (also 2011) tells that some kernels ignore even SIGKILL > for processes under ptrace (so the reason for keeping the PTRACE_KILL). > Note that linux-nat.c also loops (like linux-low.c), but indicates > the loop is needed as the linux kernel sometimes fails to kill a thread > after PTRACE_KILL. > To the contrary of linux-low.c, linux-nat.c uses waitpid (__WALL) rather > than do 2 successive waitpid calls like linux-low.c. > linux-nat.c also stops all threads before killing them, with a comment > explaining that this is needed to have PTRACE_KILL working. > > linux-fork.c has also its own duplicated kill logics. > One uses (only) SIGKILL (with a comment telling this always works, > contrary to PTRACE_KILL). > The other logic (related to killing checkpoints) only uses PTRACE_KILL. > This logic does not call waitpid, instead it does an inferior call > to have waitpid called by an inferior (no idea why, must be something > special related to checkpoint implementation). > > Below is a table summarising what has been found. > loop&stopcond is '-' if the logic at 'where' does not loop, otherwise > it details when the loop continues. SIGKILL and PTRACE_KILL column contains > a YES if the logic uses them, otherwise '-' indicates > absence of call. A 'V' indicates the return status is verified, > otherwise errors are ignored. TK indicates SIGKILL is sent with __NR_tkill, > not with the usual kill(SIGKILL). > > waitpid gives some details about how the waitpid is done: >   'MY' indicates my_waitpid is called (to protect against EINTR), >   otherwise, it is a direct call to waitpid. >   Note that the return status of waitpid is not systematically checked, >   or is not checked the same way or errors are not necessarily reported. >   Flags are 0, or the specifically used flags. 2 set of flags separated >   by a , means there are 2 calls to waitpid. >    > > where                        loop&cond        SIGKILL PTRACE_KILL waitpid > --------------------         -------------    ------- ----------- ------- > linux-ptrace.c >  linux_ptrace_test_ret_to_nx -                YES     YES         0 >  linux_check_ptrace_features WIFSTOPPED       -       YES V       MY 0 >  linux_test_for_tracefork    -                -       YES V       MY 0 >   > linux-low.c                  WIFSTOPPED       YES TK  YES         MY 0,__WCLONE > > linux-nat.c                  waitpid == pid   YES TK  YES         MY __WALL > > linux-fork.c >  linux_fork_killall          WIFSTOPPED       YES     -           0 >  delete_checkpoint_command   -                -       YES V       (infcall) 0 >