From: fche@redhat.com (Frank Ch. Eigler)
To: "Richard W.M. Jones"
Cc: Josh Stone, systemtap@sourceware.org
Subject: Re: Rapidly running systemtap causing hangs or oops
Date: Thu, 23 Jun 2011 14:13:00 -0000
References: <20110622230025.GG18438@amd.home.annexia.org> <4E028028.4010603@redhat.com> <20110623075126.GJ803@amd.home.annexia.org>
In-Reply-To: <20110623075126.GJ803@amd.home.annexia.org> (Richard W. M. Jones's message of "Thu, 23 Jun 2011 08:51:26 +0100")

Hi, Richard -

rjones wrote:

> [...]
>> Can you try running stap with "-D STP_ALIBI"? This alibi mode compiles
>> out most of stap's code, so each probe handler is reduced to just an
>> atomic increment, then a final hit count is reported on exit.
> Adding -D STP_ALIBI [...] did not change the behaviour. The mount
> process crashed quickly with the oops below:
> [ 159.454020] [] ext2_fill_super+0x9b5/0xc3b [ext2]
> [ 159.454020] [] mount_bdev+0x155/0x1b7
> [ 159.454020] [] ? ext2_error+0x112/0x112 [ext2]
> [...]

OK, that does seem to implicate the kernel or our registration /
unregistration process. Telling which is a bit tricky because the
kernel's own 'perf probe' widget cannot register/unregister as many
probes as quickly as we can, which means that if the kernel has race
conditions in all that text-segment manipulation, we are more likely
to hit it than e.g. perf. Such has happened before, and it's tough to
diagnose.

An intermediate option is to extract all the kprobe addresses from the
"stap -p2" processing loop, and modify systemtap source-tree
scripts/kprobes_test/gen_code.py to take a symbol+offset list rather
than just a symbol list, to generate a non-systemtap pure-kprobes
module. Then one could insmod;test;rmmod in a tight loop to see if
the same problem reappears. At that point, one punts to the kernel
folks.

Another hacky intermediate possibility is to put some deliberate time
delays here and there, like between your
    while true; do stap; done
loop iterations. Or disable runtime/autoconf-unregister-kprobes.c, so
stap doesn't use the kernel bulk-unregistration functions but rather
goes one by one.

- FChE
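
PS: The symbol+offset handling suggested above for gen_code.py might
look roughly like the sketch below. It is only an illustration under
stated assumptions, not the script's actual code: the function names
parse_probe_list and kprobe_initializers are hypothetical, and the
input format (one "symbol+0xoffset" entry per line, optionally with a
trailing "/0xsize" as printed in oops traces) is assumed rather than
taken from real "stap -p2" output. The only kernel detail relied on is
that struct kprobe has symbol_name and offset fields.

```python
import re

# Assumed input format (hypothetical): "symbol", "symbol+0xoff" or
# "symbol+0xoff/0xsize", one entry per line.
_ENTRY = re.compile(
    r"^(?P<sym>[A-Za-z_][\w.]*)"          # symbol name, dots allowed
    r"(?:\+(?P<off>0[xX][0-9a-fA-F]+|\d+))?"  # optional +offset
    r"(?:/\S+)?$"                          # optional /size, ignored
)


def parse_probe_list(text):
    """Return (symbol, offset) pairs; skip blank lines and # comments."""
    probes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        m = _ENTRY.match(line)
        if m is None:
            raise ValueError(f"unrecognized probe entry: {line!r}")
        off = m.group("off")
        if off is None:
            value = 0                      # bare symbol: probe its entry
        elif off.lower().startswith("0x"):
            value = int(off, 16)
        else:
            value = int(off, 10)
        probes.append((m.group("sym"), value))
    return probes


def kprobe_initializers(probes):
    """Render C initializer lines for an array of struct kprobe."""
    return "\n".join(
        f'\t{{ .symbol_name = "{sym}", .offset = {off:#x} }},'
        for sym, off in probes
    )
```

The rendered lines would go inside a "static struct kprobe probes[] = {
... };" array in the generated module source, which is then built and
exercised with the insmod;test;rmmod loop.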