From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 42196 invoked by alias); 28 Jun 2016 03:20:35 -0000 Mailing-List: contact systemtap-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Post: List-Help: , Sender: systemtap-owner@sourceware.org Received: (qmail 42173 invoked by uid 89); 28 Jun 2016 03:20:33 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-3.2 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=Anand, anand, D*c, 10995 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Tue, 28 Jun 2016 03:20:22 +0000 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 537B485545; Tue, 28 Jun 2016 03:20:21 +0000 (UTC) Received: from [10.10.48.75] (vpn-48-75.rdu2.redhat.com [10.10.48.75]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u5S3KJpl028705; Mon, 27 Jun 2016 23:20:20 -0400 Subject: Re: exercising current aarch64 kprobe support with systemtap To: Pratyush Anand References: <8f40d0b9-5550-92f9-d1c5-8769f52304c0@redhat.com> <576B5501.1030106@linaro.org> <576C29E1.8060805@linaro.org> <0a594132-796b-779d-b473-a06c0f3e8ae8@redhat.com> <20160627141840.GB8139@dhcppc9> Cc: David Long , systemtap@sourceware.org, Mark Brown , Jeremy Linton , David Smith From: William Cohen Message-ID: Date: Tue, 28 Jun 2016 03:20:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <20160627141840.GB8139@dhcppc9> Content-Type: multipart/mixed; boundary="------------A1E4708A87E6D0A4A79A796F" X-IsSubscribed: yes X-SW-Source: 2016-q2/txt/msg00301.txt.bz2 This is a multi-part message in MIME format. --------------A1E4708A87E6D0A4A79A796F Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Content-length: 4847 On 06/27/2016 10:18 AM, Pratyush Anand wrote: > Hi Will, > > On 23/06/2016:03:22:44 PM, William Cohen wrote: >> On 06/23/2016 02:26 PM, David Long wrote: >>> On 06/23/2016 11:49 AM, William Cohen wrote: >>>> On 06/22/2016 11:18 PM, David Long wrote: >>>>> On 06/22/2016 04:24 PM, William Cohen wrote: >>>>>> Hi all, >>>>>> >>>>>> When running the current systemtap checked out from the git repository >>>>>> and a locally built kernel with the kprobes64-v13 patches (the >>>>>> test_upstream_arm64_devel branch of >>>>>> https://github.com/pratyushanand/linux) on Fedora 23 machine one of >>>>>> the kprobes_onthefly.exp tests is causing the machine to get in a >>>>>> state that requires rebooting to fix. This can be triggered by running a >>>>>> portion of the systemtap tests with: >>>>>> >>>>>> make installcheck RUNTESTFLAGS="--debug systemtap.onthefly/kprobes_onthefly.exp" >>>>>> >>>>>> When it gets to the kprobes_onthefly - otf_stress_max_iter_5000 test the >>>>>> console starts spewing the following and needs to be rebooted: >>>>>> >>>>>> [23394.036860] Unexpected kernel single-step exception at EL1 >>>>>> [23394.042434] Unexpected kernel single-step exception at EL1 >>>>>> [23394.048008] Unexpected kernel single-step exception at EL1 >>>>>> [23394.053541] Unexpected kernel single-step exception at EL1 >>>>>> [23394.059053] Unexpected kernel single-step exception at EL1 >>>>>> [23394.064545] Unexpected kernel single-step exception at EL1 >>>>>> >>>>>> Sorry I don't have the start of the failure it scrolled off the screen very quickly. >>>>>> >>>>>> -Will >>>>>> >>>>>> >>>>> >>>>> I'll take a look and see what I can figure out. >>>>> >>>>> In the meantime I did just push a v14 branch. I'm doubtful that it will address the above problem even though it contains a few bug fixes. >>>>> >>>>> -dl >>>>> >>>> >>>> Hi Dave and Pratyush, >>>> >>>> I tried the kprobes64-v13 kernel and it also seems to work, so it lookw like the problem might be in the the >>>> test_upstream_arm64_devel branch of https://github.com/pratyushanand/linux . >>>> >>>> -Will >>>> >>> >>> I'm going to interpret that as meaning you know of no problem in the kprobes v14 patch that would give me pause to email it upstream. Do you disagree? >>> >>> -dl >>> >> >> Hi Dave, >> >> Yes, the problem only seems to be in that other kernel from https://github.com/pratyushanand/linux with the kprobe and uprobe patches, so the arm64 patches do not appear to be the problem. I don't know what is causing the problem maybe there is something going on with the porting of the patches to that kernel or the patches included in there (uprobes/kexec) in there. > > Just to update: > > I confirm that problem arises after uprobe patches only, but not yet sure that > actual culprit is uprobe code. > > I can see that kprobes_onthefly.exp also exercises uprobes in the test. It > seems, when problem happens, there was a kprobe at print_worker_info(). > > Most likely re-entrant kprobe is called when kprobe is instrumented at > print_worker_info(). I guessed it could be show_regs() from arm64/kprobe code, > but commenting show_regs() did not make any difference. Even blacklisting > print_worker_info() also did not resolve it, probelem reproduced in a different > way after blacklisting. > > So, still its vague and debugging is continued. > If I can clearly understand the systemtap test code, then probably it will be > easier to debug. I mean, if I can get the kernel and user space symbols name > where this test is instrumenting probes then that would help a lot to zero it > down. > > ~Pratyush > Hi Pratyush, My understanding is that the systemtap onthefly support enables/disable the probe as metnioned in the following sytemtap bugzilla entry (and the ones that it is dependent on): https://sourceware.org/bugzilla/show_bug.cgi?id=10995. It would be handy to things pared down to the systemtap script that triggers the problem. Putting some diagnostic puts it looks like the script that triggers the problems it looks like it is something like the attached onthefly_trigger.stp (that was gathered on a x86_64 machine so it might not be exactly what is causing the problem on aarch64. David Smith, any suggestions on how to debug based on your experiences from https://sourceware.org/bugzilla/show_bug.cgi?id=17126 where the ppc64 had a similar issue with onthefly testing? The "Unexpected kernel single-step exception at EL1" reminds me of the times when kprobes couldn't find a handler. Maybe there is some situation where the kprobe is being removed but the breakpoint is still around. Did you get a backtrace with the insertino of the "BUG()" where that message is printed out? I wonder if it might be triggered by the (thread_flags & _TIF_UPROBE) somehow being true and the aarch64 do_notify_resume starts running. -Will --------------A1E4708A87E6D0A4A79A796F Content-Type: text/plain; charset=UTF-8; name="onthefly_trigger.stp" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="onthefly_trigger.stp" Content-length: 2276 # We want these probes to fire as deterministically as possible so that # their outputs can easily be predicted and compared. Unfortunately this # is complicated by multiple facts, including # (1) the timer might be so fast that the probes don't have time to # fire and print something, # (2) the kprobe might print something after it is re-enabled, but # before the kretprobe is re-enabled, # (3) disabling a kretprobe won't stop an already running handler. # # To get around these issues, we use a simple state machine. The states # are as follow: # # 0 = cond disabled # 1 = cond enabled, but kprobe && kretprobe not yet enabled # 2 = cond enabled, kprobe enabled, kretprobe not yet enabled # 3 = cond enabled, kprobe && kretprobe enabled, nothing printed yet # 4 = cond enabled, 'hit' printed but not 'rethit' # 5 = cond enabled, 'rethit' printed global state = 1 global toggles = 0 probe kernel.function("vfs_read").call if (state > 0) { if (state == 3) { println("hit") state++; } else if (state == 1) { state++; } } probe kernel.function("vfs_read").return if (state > 0) { # ensure that nothing changed during the vfs_read body if (state != @entry(state)) next if (state == 4) { println("rethit") state++; } else if (state == 2) { state++; } } probe begin, end, error, kernel.function("*@workqueue.c"), process.begin, process.end, kernel.trace("*"), process("echo").function("*"), netfilter.pf("NFPROTO_IPV4").hook("NF_INET_LOCAL_IN")?, netfilter.pf("NFPROTO_IPV4").hook("NF_INET_LOCAL_OUT")?, perf.sw.cpu_clock.sample(1000000)?, timer.profile.tick? { if (state != 0 && state != 5) next # give probes more time to move through the states toggles++ if (toggles > 5000) exit() else { println("toggling") state = !state } } probe timer.s(360) { println("timed out") exit() } --------------A1E4708A87E6D0A4A79A796F--