public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* Updating from 0303 to 0505 has problems
@ 2007-05-08 20:14 Quentin Barnes
  2007-05-09 21:41 ` Quentin Barnes
  0 siblings, 1 reply; 6+ messages in thread
From: Quentin Barnes @ 2007-05-08 20:14 UTC (permalink / raw)
  To: systemtap

I wanted to start pushing some of my systemtap runtime changes back
to the mailing list, so I just updated Systemtap on my ARM system
from 20070303 to 20070505 to verify my work.

With 0505 I'm getting dozens of new "eof" failures such as:

	FAIL: systemtap.base/add.stp shutdown (eof)

This test worked as expected in 0303, but with 0505 so it's failing
with "eof".  I've gone over all my ARM patches and I don't see
anything I did wrong or needs to be changed with the move.  I don't
see any other new failures (so far), so I think everything else is
working.  Here's a longer version of the failing run:
====
....
Pass 4: compiled C into "stap_e26f197cff9d02147ee429af631d0efb_473.ko" in 71350usr/23200sys/98020real ms.
Pass 5: starting run.
systemtap starting probe
PASS: systemtap.base/add.stp startup
PASS: systemtap.base/add.stp load generation
poll warning: Interrupted system call
systemtap ending probe
systemtap test success
Pass 5: run completed in 400usr/1940sys/3481real ms.
FAIL: systemtap.base/add.stp shutdown (eof)
testcase /usr/src/systemtap-20070505/testsuite/systemtap.base/add.exp completed in 122 seconds
====

The common theme in the eof failures is the "poll warning: Interrupted
system call" diagnostic.  (Nothing unusual in the system log either,
standard systemtap load notification in both cases.)  This poll warning
message seems to come from new code in runtime/staprun/relay.c.  I've
noticed a lot of changes in the relayfs area between 0303 and 0505.

I'm using a 2.6.20.4 kernel compiling with gcc version 4.1.1
(CodeSourcery ARM Sourcery G++ 2006q3-26) on a ARM PXA27x board.
On the exact same system, kernel, and compiler, I can still run
0303 tests and they pass fine, so it should have to do with 0505.
(Okay, maybe an unexercised problem in one of the other items
is possible too.)

Anybody else see a problem like this?  Anything to try to help
isolate the cause?

Quentin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Updating from 0303 to 0505 has problems
  2007-05-08 20:14 Updating from 0303 to 0505 has problems Quentin Barnes
@ 2007-05-09 21:41 ` Quentin Barnes
  2007-05-10  3:22   ` Quentin Barnes
  0 siblings, 1 reply; 6+ messages in thread
From: Quentin Barnes @ 2007-05-09 21:41 UTC (permalink / raw)
  To: systemtap

On Tue, May 08, 2007 at 03:14:49PM -0500, Quentin Barnes wrote:
>I wanted to start pushing some of my systemtap runtime changes back
>to the mailing list, so I just updated Systemtap on my ARM system
>from 20070303 to 20070505 to verify my work.
>
>With 0505 I'm getting dozens of new "eof" failures such as:
>
>	FAIL: systemtap.base/add.stp shutdown (eof)

I upgraded from 2.6.20.4 to 2.6.21.1 and still have the problem, so I
continued researching it.

It seems to be an intermittent problem.  I can run the same test
several times in a row.  Sometimes it will pass just fine where
other times I'll get the (eof) failure.

Here's a run that works:
================
Pass 5: starting run.
systemtap starting probe
PASS: systemtap.base/be_order.stp startup
PASS: systemtap.base/be_order.stp load generation
systemtap ending probe
systemtap test success
systemtap test success
PASS: systemtap.base/be_order.stp shutdown and output
Pass 5: run completed in 400usr/1280sys/2862real ms.
metric: systemtap.base/be_order.stp     8240    8680    16931   180     70      250     0       0       0       0       0       0       400     1280    2862
testcase /usr/src/systemtap-20070505/testsuite/systemtap.base/be_order.exp completed in 22 seconds
================

Here's the same test, but from a run when it fails:
================
Pass 5: starting run.
systemtap starting probe
PASS: systemtap.base/be_order.stp startup
PASS: systemtap.base/be_order.stp load generation
poll warning: Interrupted system call
systemtap ending probe
systemtap test success
systemtap test success
Pass 5: run completed in 440usr/1520sys/3159real ms.^M
FAIL: systemtap.base/be_order.stp shutdown (eof)
testcase /usr/src/systemtap-20070505/testsuite/systemtap.base/be_order.exp completed in 22 seconds
================

Looking at stap_run.exp, it's looking for "systemtap ending probe\n
systemtap test success".  When I run stap directly on the .stp file,
it always outputs the right string.  Is this some sort of race condition
where sometimes "expect" is seeing the end-of-file condition on the file
descriptor before expect has a chance to match the pattern?  Any ideas
to try out?

Quentin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Updating from 0303 to 0505 has problems
  2007-05-09 21:41 ` Quentin Barnes
@ 2007-05-10  3:22   ` Quentin Barnes
  2007-05-10  5:28     ` Martin Hunt
  0 siblings, 1 reply; 6+ messages in thread
From: Quentin Barnes @ 2007-05-10  3:22 UTC (permalink / raw)
  To: systemtap

On Wed, May 09, 2007 at 04:41:33PM -0500, Quentin Barnes wrote:
>Here's a run that works:
>================
>Pass 5: starting run.
>systemtap starting probe
>PASS: systemtap.base/be_order.stp startup
>PASS: systemtap.base/be_order.stp load generation
>systemtap ending probe
>systemtap test success
>systemtap test success
>PASS: systemtap.base/be_order.stp shutdown and output
>Pass 5: run completed in 400usr/1280sys/2862real ms.
>metric: systemtap.base/be_order.stp     8240    8680    16931   180     70  
>250     0       0       0       0       0       0       400     1280    2862
>testcase /usr/src/systemtap-20070505/testsuite/systemtap.base/be_order.exp 
>completed in 22 seconds
>================
>
>Here's the same test, but from a run when it fails:
>================
>Pass 5: starting run.
>systemtap starting probe
>PASS: systemtap.base/be_order.stp startup
>PASS: systemtap.base/be_order.stp load generation
>poll warning: Interrupted system call
>systemtap ending probe
>systemtap test success
>systemtap test success
>Pass 5: run completed in 440usr/1520sys/3159real ms.^M
>FAIL: systemtap.base/be_order.stp shutdown (eof)
>testcase /usr/src/systemtap-20070505/testsuite/systemtap.base/be_order.exp 
>completed in 22 seconds
>================
>
>Looking at stap_run.exp, it's looking for "systemtap ending probe\n
>systemtap test success".  When I run stap directly on the .stp file,
>it always outputs the right string.  Is this some sort of race condition
>where sometimes "expect" is seeing the end-of-file condition on the file
>descriptor before expect has a chance to match the pattern?  Any ideas
>to try out?

I cracked the problem and came up with a fix/workaround that solves
the "eof" failures.  (Well, most all of them.  There are other
similar ones in the testsuite that are still unaddressed, but not
due to this particular problem exactly.)

The problem is the "poll warning: Interrupted system call"
diagnostic message from staprun that's throwing off expect.  Below
is a patch that fixes this problem for this message.  Now whether
this is a fix or a workaround I don't know.  There are two possible
fixes: 1) change stap_run.exp to account for the extra diagnostic,
2) change staprun's reader_thread() not to generate the diagnostic
in the first place.  I chose the former for this patch.  Which
approach would be the better fix?



Index: testsuite/lib/stap_run.exp
===================================================================
--- testsuite/lib/stap_run.exp	(revision 155)
+++ testsuite/lib/stap_run.exp	(working copy)
@@ -50,7 +50,7 @@ proc stap_run { TEST_NAME {LOAD_GEN_FUNC
 	    send "\003"
 
 	    # check the output to see if it is sane
-	    set output "^systemtap ending probe\r\n$OUTPUT_CHECK_STRING$"
+	    set output "^(poll warning: \[^\r\]*\r\n)?systemtap ending probe\r\n$OUTPUT_CHECK_STRING$"
 
 	    expect {
 		-re  $output {

Quentin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Updating from 0303 to 0505 has problems
  2007-05-10  3:22   ` Quentin Barnes
@ 2007-05-10  5:28     ` Martin Hunt
  2007-05-13  5:03       ` Quentin Barnes
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Hunt @ 2007-05-10  5:28 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: systemtap

On Wed, 2007-05-09 at 22:22 -0500, Quentin Barnes wrote:

> 
> The problem is the "poll warning: Interrupted system call"
> diagnostic message from staprun that's throwing off expect.  Below
> is a patch that fixes this problem for this message.  Now whether
> this is a fix or a workaround I don't know.  There are two possible
> fixes: 1) change stap_run.exp to account for the extra diagnostic,
> 2) change staprun's reader_thread() not to generate the diagnostic
> in the first place.  I chose the former for this patch.  Which
> approach would be the better fix?

I think I checked in a fix to this a few days ago. Certainly that error
message no longer is in the sources.  Of course, there could be new
problems now... Could you give either the current CVS version a try or a
snapshot from May 8 or later?  

Martin



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Updating from 0303 to 0505 has problems
  2007-05-10  5:28     ` Martin Hunt
@ 2007-05-13  5:03       ` Quentin Barnes
  2007-05-14  2:02         ` Quentin Barnes
  0 siblings, 1 reply; 6+ messages in thread
From: Quentin Barnes @ 2007-05-13  5:03 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

On Thu, May 10, 2007 at 01:28:44AM -0400, Martin Hunt wrote:
>On Wed, 2007-05-09 at 22:22 -0500, Quentin Barnes wrote:
>
>> 
>> The problem is the "poll warning: Interrupted system call"
>> diagnostic message from staprun that's throwing off expect.  Below
>> is a patch that fixes this problem for this message.  Now whether
>> this is a fix or a workaround I don't know.  There are two possible
>> fixes: 1) change stap_run.exp to account for the extra diagnostic,
>> 2) change staprun's reader_thread() not to generate the diagnostic
>> in the first place.  I chose the former for this patch.  Which
>> approach would be the better fix?
>
>I think I checked in a fix to this a few days ago. Certainly that error
>message no longer is in the sources.  Of course, there could be new
>problems now... Could you give either the current CVS version a try or a
>snapshot from May 8 or later?  

I tried it with the 0512 snapshot and it still has problems.

Running it with add.stp:
====================
# staprun .systemtap/cache/51/stap_51971f530865fabd17b2f297433cc8f6_473.ko    
systemtap starting probe
[pressed ^C here]
#
====================

After I pressed ^C there should have been the additional messages:
====================
systemtap ending probe
systemtap test success
====================

But there was nothing.  It just exited.  This is now also causing
a lot of "eof" failures with the test suite since those messages
are lost.


Rerunning now with -vvv:
====================
# staprun -vvv .systemtap/cache/51/stap_51971f530865fabd17b2f297433cc8f6_473.ko
main:141 modpath=".systemtap/cache/51/stap_51971f530865fabd17b2f297433cc8f6_473.ko", modname="stap_51971f530865fabd17b2f297433cc8f6_473"
run_stp_check:191 executing /usr/local/libexec/systemtap/stp_check
init_ctl_channel:59 Opening /sys/kernel/debug/systemtap/stap_51971f530865fabd17b2f297433cc8f6_473/cmd
stp_main_loop:335 in main loop
stp_main_loop:423 STP_SYMBOLS request received
stp_main_loop:416 STP_MODULES request received
init_relayfs:104 initializing relayfs
init_relayfs:120 attempting to open /sys/kernel/debug/systemtap/stap_51971f530865fabd17b2f297433cc8f6_473/trace0
init_relayfs:120 attempting to open /sys/kernel/debug/systemtap/stap_51971f530865fabd17b2f297433cc8f6_473/trace1
init_relayfs:126 ncpus=1
init_relayfs:164 starting threads
stp_main_loop:382 probe_start() returned 0
systemtap starting probe
[pressed ^C here]
sigproc:33 sigproc 2 (Interrupt)
reader_thread:74 poll=-1 errno=4
# stp_main_loop:375 got STP_EXIT, closed=0
cleanup_and_exit:288 CLEANUP AND EXIT  closed=0
close_relayfs:183 closing
close_relayfs:190 sent SIGUSR2
close_relayfs:204 done
cleanup_and_exit:301 closing control channel
cleanup_and_exit:305 removing module
sigproc:33 sigproc 17 (Child exited)
====================

I think what is going on is that the ^C forces the child in
reader_thread() to asynchronously send the STP_EXIT command making
for an early abort prior to it getting down to its read/write loop
to send the rest of the final messages first.

My guess at what you need to do is not have sigproc() actually make
the send_request(STP_EXIT, NULL, 0) call itself, but to just set
the "stop_threads" state to 1.  Only after letting reader_thread()
finish up its inner while and its outer do/while loops, have it only
then send the STP_EXIT directive.

I don't know the code well enough to know how complex this is to do
this tweak with its current multi-threaded design.

>Martin

Quentin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Updating from 0303 to 0505 has problems
  2007-05-13  5:03       ` Quentin Barnes
@ 2007-05-14  2:02         ` Quentin Barnes
  0 siblings, 0 replies; 6+ messages in thread
From: Quentin Barnes @ 2007-05-14  2:02 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

On Sun, May 13, 2007 at 12:02:54AM -0500, Quentin Barnes wrote:
>On Thu, May 10, 2007 at 01:28:44AM -0400, Martin Hunt wrote:
>>On Wed, 2007-05-09 at 22:22 -0500, Quentin Barnes wrote:
>>
>>>
>>>The problem is the "poll warning: Interrupted system call"
>>>diagnostic message from staprun that's throwing off expect.  Below
>>>is a patch that fixes this problem for this message.  Now whether
>>>this is a fix or a workaround I don't know.  There are two possible
>>>fixes: 1) change stap_run.exp to account for the extra diagnostic,
>>>2) change staprun's reader_thread() not to generate the diagnostic
>>>in the first place.  I chose the former for this patch.  Which
>>>approach would be the better fix?
>>
>>I think I checked in a fix to this a few days ago. Certainly that error
>>message no longer is in the sources.  Of course, there could be new
>>problems now... Could you give either the current CVS version a try or a
>>snapshot from May 8 or later?  
>
>I tried it with the 0512 snapshot and it still has problems.
>
>Running it with add.stp:
>====================
># staprun .systemtap/cache/51/stap_51971f530865fabd17b2f297433cc8f6_473.ko  
>systemtap starting probe
>[pressed ^C here]
>#
>====================
>
>After I pressed ^C there should have been the additional messages:
>====================
>systemtap ending probe
>systemtap test success
>====================
>
>But there was nothing.  It just exited.  This is now also causing
>a lot of "eof" failures with the test suite since those messages
>are lost.

I played some and found the change below fixed losing the post
interrupt messages.  Whether it is a hack workaround or a real fix,
I don't know the code well enough to know.  So far, this also seemed
to fix many of the "eof" failures.


--- runtime/staprun/Orig/relay.c	2007-05-12 13:17:34.000000000 -0500
+++ runtime/staprun/relay.c	2007-05-13 18:08:13.000000000 -0500
@@ -76,7 +76,6 @@ static void *reader_thread(void *data)
 				fprintf(stderr, "poll error: %s\n",strerror(errno));
 				pthread_exit(NULL);
                         }
-			stop_threads = 1;
                 }
 		while ((rc = read(relay_fd[cpu], buf, sizeof(buf))) > 0) {
 			if (write(out_fd[cpu], buf, rc) != rc) {

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-05-14  2:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-08 20:14 Updating from 0303 to 0505 has problems Quentin Barnes
2007-05-09 21:41 ` Quentin Barnes
2007-05-10  3:22   ` Quentin Barnes
2007-05-10  5:28     ` Martin Hunt
2007-05-13  5:03       ` Quentin Barnes
2007-05-14  2:02         ` Quentin Barnes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).