public inbox for frysk@sourceware.org
* Patch for TearDownProcess
@ 2007-09-07  2:22 Kris Van Hees
  2007-09-07 13:31 ` Andrew Cagney
  0 siblings, 1 reply; 5+ messages in thread
From: Kris Van Hees @ 2007-09-07  2:22 UTC (permalink / raw)
  To: frysk

[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]

The attached patch has been committed for TearDownProcess, to handle various
cases of TestRunner hanging due to stray child processes.  The bugzilla ticket
tracking this issue is #4996.

The typical symptom was that a test execution started a child process but
failed to register it.  This left the process hanging, sometimes in a utrace
call, causing the next invocation of tearDown() to hang on the waitAll
request, because the wait remained in effect due to the stuck process.

The attached patch *does* open the possibility that a stray process may remain
after tearDown() has been executed, e.g. if a process termination causes
another process to be created.  However, that will not cause a testsuite hang.
It is also not expected to pose a problem for the further execution of the
testsuite (in fact, it is very likely to get cleaned up during the tearDown()
of the next test).
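The pruning idea in the patch (visible in the diff below) can be sketched in
isolation: iterate over a *snapshot* of the registered PID set, try to signal
each process, and drop entries whose process is already gone so a later wait
is not left blocking on them.  This is a hypothetical, self-contained sketch;
`sendKill` stands in for frysk's `capturedSendTkill`, which returns false on
ESRCH (process already dead):

```java
import java.util.HashSet;
import java.util.Set;

public class PrunePidsSketch {
    // Stand-in for capturedSendTkill(): false means the process no
    // longer exists (ESRCH); true means the signal was delivered.
    // For this sketch, pretend odd PIDs are already dead.
    static boolean sendKill(int pid) {
        return pid % 2 == 0;
    }

    static void prune(Set<Integer> pidsToKill) {
        // Take a snapshot first: removing from the set while iterating
        // it directly would throw ConcurrentModificationException.
        Object[] snapshot = pidsToKill.toArray();
        for (int i = 0; i < snapshot.length; i++) {
            Integer pid = (Integer) snapshot[i];
            if (!sendKill(pid.intValue()))
                pidsToKill.remove(pid);   // already gone; stop waiting on it
        }
    }

    public static void main(String[] args) {
        Set<Integer> pids = new HashSet<Integer>();
        pids.add(100);
        pids.add(101);
        pids.add(102);
        prune(pids);
        // The dead PID (101) has been pruned; only live PIDs remain.
        System.out.println(pids);
    }
}
```

The snapshot-then-remove pattern is exactly why the patch below replaces the
Iterator loops with `toArray()` loops.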

The change has been tested multiple times to verify that it does not cause
any regressions.

	Cheers,
	Kris

[-- Attachment #2: FOO --]
[-- Type: text/plain, Size: 7289 bytes --]

--- TearDownProcess.java-orig	2007-09-06 17:37:05.000000000 -0400
+++ TearDownProcess.java	2007-09-06 20:07:51.000000000 -0400
@@ -1,5 +1,6 @@
 // This file is part of the program FRYSK.
 //
+// Copyright 2007 Oracle Corporation.
 // Copyright 2005, 2006, 2007, Red Hat Inc.
 //
 // FRYSK is free software; you can redistribute it and/or modify it
@@ -43,15 +44,16 @@
 import java.util.logging.Level;
 import java.util.Set;
 import java.util.HashSet;
+import frysk.junit.TestCase;
 import frysk.sys.Errno;
 import frysk.sys.Sig;
 import frysk.sys.Signal;
 import frysk.sys.Wait;
 import frysk.sys.proc.ProcBuilder;
 import frysk.sys.ProcessIdentifier;
-import java.util.Iterator;
 import frysk.sys.Ptrace;
 import frysk.sys.WaitBuilder;
+import frysk.sys.SignalBuilder;
 
 /**
  * Framework for cleaning up temporary processes created as part of a
@@ -160,7 +162,7 @@
      * continued. Work around this by first sending all tasks a
      * continue ...
      */
-    private static ProcessIdentifier capturedSendDetachContKill (ProcessIdentifier pid)
+    private static boolean capturedSendDetachContKill (ProcessIdentifier pid)
     {
 	// Do the detach
 	try {
@@ -181,11 +183,10 @@
 	    log("tkill -CONT", pid, "(failed - ESRCH)\n");
 	}
 	// Finally send it a kill to finish things off.
-	capturedSendTkill(pid);
-	return pid;
+	return capturedSendTkill(pid);
     }
 
-    private static ProcessIdentifier capturedSendDetachContKill (int pid)
+    private static boolean capturedSendDetachContKill (int pid)
     {
 	return capturedSendDetachContKill (new ProcessIdentifier (pid));
     }
@@ -195,9 +196,11 @@
 	// Make a preliminary pass through all the registered
 	// pidsToKillDuringTearDown trying to simply kill
 	// each. Someone else may have waited on their deaths already.
-	for (Iterator i = pidsToKillDuringTearDown.iterator(); i.hasNext();) {
-	    ProcessIdentifier child = (ProcessIdentifier) i.next();
-	    capturedSendTkill(child);
+	Object[] pidsToKill = pidsToKillDuringTearDown.toArray();
+	for (int i = 0; i < pidsToKill.length; i++) {
+	    ProcessIdentifier child = (ProcessIdentifier) pidsToKill[i];
+	    if (!capturedSendTkill(child))
+		pidsToKillDuringTearDown.remove(child);
 	}
 
 	// Go through all registered processes / tasks adding any of
@@ -214,16 +217,18 @@
 	// Iterate over a copy of the tids's collection as the
 	// missingTidsToKillDuringTearDown may modify the underlying
 	// collection.
-	Object[] pidsToKill = pidsToKillDuringTearDown.toArray();
+	pidsToKill = pidsToKillDuringTearDown.toArray();
 	for (int i = 0; i < pidsToKill.length; i++) {
 	    ProcessIdentifier child = (ProcessIdentifier) pidsToKill[i];
 	    missingTidsToKillDuringTearDown.construct(child);
 	}
 
 	// Blast all the processes for real.
-	for (Iterator i = pidsToKillDuringTearDown.iterator(); i.hasNext();) {
-	    ProcessIdentifier child = (ProcessIdentifier) i.next();
-	    capturedSendDetachContKill(child);
+	pidsToKill = pidsToKillDuringTearDown.toArray();
+	for (int i = 0; i < pidsToKill.length; i++) {
+	    ProcessIdentifier child = (ProcessIdentifier) pidsToKill[i];
+	    if (!capturedSendDetachContKill(child))
+		pidsToKillDuringTearDown.remove(child);
 	}
 
 	// Drain the wait event queue. This ensures that: there are
@@ -235,77 +240,77 @@
 	// For attached tasks, which will generate non-exit wait
 	// events (clone et.al.), the task is detached / killed.
 	// Doing that frees up the task so that it can run to exit.
-	try {
-	    while (! pidsToKillDuringTearDown.isEmpty()) {
-		log("waitAll -1 ....");
-		Wait.waitAll(-1, new WaitBuilder() {
-			public void cloneEvent (int pid, int clone)
-			{
-			    capturedSendDetachContKill(pid);
-			}
-
-			public void forkEvent (int pid, int child)
-			{
-			    capturedSendDetachContKill(pid);
-			}
-
-			public void exitEvent (int pid, boolean signal, int value,
-					       boolean coreDumped)
-			{
-			    capturedSendDetachContKill(pid);
-			    // Do not remove PID from
-			    // pidsToKillDuringTearDown list; need to
-			    // let the terminated event behind it
-			    // bubble up.
-			}
-
-			public void execEvent (int pid)
-			{
-			    capturedSendDetachContKill(pid);
-			}
-
-			public void syscallEvent (int pid)
-			{
-			    capturedSendDetachContKill(pid);
-			}
-
-			public void stopped (int pid, int signal)
-			{
-			    capturedSendDetachContKill(pid);
-			}
-
-			private void drainTerminated (int pid)
-			{
-			    // To be absolutly sure, again make
-			    // certain that the thread is detached.
-			    ProcessIdentifier id = capturedSendDetachContKill(pid);
-			    // True pidsToKillDuringTearDown can have
-			    // a second exit status behind this first
-			    // one, drain that also. Give up when
-			    // this PID has no outstanding events.
-			    log("Wait.drain", id, "\n");
-			    id.blockingDrain ();
-			    // Hopefully done with this PID.
-			    pidsToKillDuringTearDown.remove(id);
-			}
-
-			public void terminated (int pid, boolean signal,
-						int value,
-						boolean coreDumped)
-			{
-			    drainTerminated(pid);
-			}
-
-			public void disappeared (int pid, Throwable w)
-			{
-			    // The task vanished somehow, drain it.
-			    drainTerminated(pid);
-			}
-		    });
-	    }
-	}
-	catch (Errno.Echild e) {
-	    // No more events.
+	if (!pidsToKillDuringTearDown.isEmpty()) {
+	    log("waitAll " + pidsToKillDuringTearDown + " ....");
+	    Wait.waitAll(
+		TestCase.getTimeoutMilliseconds(),
+		new WaitBuilder() {
+		    public void cloneEvent (int pid, int clone)
+		    {
+			capturedSendDetachContKill(pid);
+		    }
+
+		    public void forkEvent (int pid, int child)
+		    {
+			capturedSendDetachContKill(pid);
+		    }
+
+		    public void exitEvent (int pid, boolean signal, int value,
+					   boolean coreDumped)
+		    {
+			capturedSendDetachContKill(pid);
+			// Do not remove PID from
+			// pidsToKillDuringTearDown list; need to
+			// let the terminated event behind it
+			// bubble up.
+		    }
+
+		    public void execEvent (int pid)
+		    {
+			capturedSendDetachContKill(pid);
+		    }
+
+		    public void syscallEvent (int pid)
+		    {
+			capturedSendDetachContKill(pid);
+		    }
+
+		    public void stopped (int pid, int signal)
+		    {
+			capturedSendDetachContKill(pid);
+		    }
+
+		    private void drainTerminated (int pid)
+		    {
+			// To be absolutly sure, again make
+			// certain that the thread is detached.
+			capturedSendDetachContKill(pid);
+			ProcessIdentifier id = new ProcessIdentifier (pid);
+			// True pidsToKillDuringTearDown can have
+			// a second exit status behind this first
+			// one, drain that also. Give up when
+			// this PID has no outstanding events.
+			log("Wait.drain", id, "\n");
+			id.blockingDrain ();
+			// Hopefully done with this PID.
+			pidsToKillDuringTearDown.remove(id);
+		    }
+
+		    public void terminated (int pid, boolean signal, int value,
+					    boolean coreDumped)
+		    {
+			drainTerminated(pid);
+		    }
+
+		    public void disappeared (int pid, Throwable w)
+		    {
+			// The task vanished somehow, drain it.
+			drainTerminated(pid);
+		    }
+		},
+		new SignalBuilder() {
+		    public void signal(Sig sig) {}
+		});
 	}
 
 	// Drain all the pending signals. Note that the process of killing


* Re: Patch for TearDownProcess
  2007-09-07  2:22 Patch for TearDownProcess Kris Van Hees
@ 2007-09-07 13:31 ` Andrew Cagney
  2007-09-07 14:30   ` Kris Van Hees
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Andrew Cagney @ 2007-09-07 13:31 UTC (permalink / raw)
  To: Kris Van Hees; +Cc: frysk

Kris Van Hees wrote:
> The attached patch *does* open the possibility that a stray process may remain
> after tearDown() has been executed if e.g. a process termination causes another
> process to be created, etc...  However, that will not cause a testsuite hang.
> It is also not expected to pose a problem with the further execution of the
> testsuite (in fact, it is very likely to get cleaned up during the tearDown()
> of the next test).
>   
Unfortunately this isn't theoretical.  On a slower FC5 machine this happens
consistently, invalidating test results.

Given the choice between a potential test-suite hang and tear-down leaving
waitpid events around, the decision was made in favor of the latter.  That
decision hasn't changed.  Having the test run hang is a lesser evil than the
test-suite continuing but producing bogus results.

I restored the old behavior and then added a timeout.  It currently logs a
message; I suspect it should abandon the test run, since the problem state
hasn't gone away.
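The timeout idea can be sketched in isolation: bound an otherwise-blocking
wait by a deadline, so a stuck process produces a diagnosable timeout instead
of a hung run.  This is a hypothetical sketch, not frysk's actual
`Wait.waitAll` implementation; `pollEvent` stands in for a non-blocking
waitpid, and here it always reports nothing to simulate the stuck case:

```java
public class TimedWaitSketch {
    // Minimal stand-in for frysk's WaitBuilder callback interface.
    interface EventHandler {
        void event(int pid);
    }

    // Stand-in for a non-blocking waitpid(); null means no event pending.
    // Always returning null simulates the stuck-process case.
    static Integer pollEvent() {
        return null;
    }

    // Returns true if an event arrived before the deadline, false if the
    // timeout elapsed with nothing to report (the hang case).
    static boolean waitWithTimeout(long timeoutMillis, EventHandler handler) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            Integer pid = pollEvent();
            if (pid != null) {
                handler.event(pid.intValue());
                return true;
            }
            try {
                Thread.sleep(10);   // back off briefly between polls
            } catch (InterruptedException e) {
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean got = waitWithTimeout(50, new EventHandler() {
            public void event(int pid) {
                System.out.println("event from " + pid);
            }
        });
        System.out.println(got
            ? "event handled"
            : "timed out; log the stuck state (or abandon the run)");
    }
}
```

On timeout the caller can log or abort rather than block forever, which is
the trade-off being debated in this thread.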

Andrew


* Re: Patch for TearDownProcess
  2007-09-07 13:31 ` Andrew Cagney
@ 2007-09-07 14:30   ` Kris Van Hees
  2007-09-07 18:39   ` Elena Zannoni
  2007-09-13 12:34   ` Kris Van Hees
  2 siblings, 0 replies; 5+ messages in thread
From: Kris Van Hees @ 2007-09-07 14:30 UTC (permalink / raw)
  To: Andrew Cagney; +Cc: Kris Van Hees, frysk

On Fri, Sep 07, 2007 at 09:30:28AM -0400, Andrew Cagney wrote:
> Kris Van Hees wrote:
>> The attached patch *does* open the possibility that a stray process may
>> remain after tearDown() has been executed if e.g. a process termination
>> causes another process to be created, etc...  However, that will not
>> cause a testsuite hang.  It is also not expected to pose a problem with
>> the further execution of the testsuite (in fact, it is very likely to
>> get cleaned up during the tearDown() of the next test).
>>   
> Unfortunately this isn't theoretical.  On a slower f5 machine; this happens 
> consistently; invalidating test results.

The information you added to ticket 4996 doesn't show anything indicating
that this is a problem with my patch.  In the future, could you at least
add the output of a -c FINEST run or something, to ensure there is relevant
information there to help work out a fix?

> Given the choices between a potential test-suite hang, and tear-down 
> leaving waitpid events around, the decision was made in favor of the 
> latter.  That decision hasn't changed.  Having the test run hang, is a 
> lesser evil then the test-suite continuing but producing bogus results.. 
> I restored the old behavior; and then added a timeout.  It currently logs a 
> message, I suspect it should abandon the test run, since the problem state 
> hasn't gone away.

Obviously, there can be a difference of opinion on this matter, but I believe
that reversing this patch without any consideration for the problem it aims
to solve is counterproductive.  You deliberately restore a problem behaviour.
Why?

It would have been more constructive to open a bug about the problem you have
encountered, and have that problem resolved, so that in the end we have a
fully working testsuite that doesn't need a tradeoff between hanging and
stray waitpid events.

Finally, given that you use FC5 as your reference platform here, perhaps you
could add it to the automated test system (i.e. have it submit results there)
so that test coverage can be extended to that release as well (especially
since it seems to uncover some problems that other systems do not, assuming
of course that no kernel issues are involved).

	Cheers,
	Kris


* Re: Patch for TearDownProcess
  2007-09-07 13:31 ` Andrew Cagney
  2007-09-07 14:30   ` Kris Van Hees
@ 2007-09-07 18:39   ` Elena Zannoni
  2007-09-13 12:34   ` Kris Van Hees
  2 siblings, 0 replies; 5+ messages in thread
From: Elena Zannoni @ 2007-09-07 18:39 UTC (permalink / raw)
  To: Andrew Cagney; +Cc: Kris Van Hees, frysk

Andrew Cagney wrote:
> Unfortunately this isn't theoretical.  On a slower f5 machine; this 
> happens consistently; invalidating test results.

Can you post more information about the things you are seeing?  Whatever is
in the bugzilla does not have enough details to diagnose the problem.  At
this point you are the only one who knows what is going on in your tests;
it would be nice if you could share that information with the rest of the
team, if nothing else out of respect for the work that Kris put into finding
the cause of the hangs in the first place, explaining it to you last night
on IRC, and writing a patch.  If the patch exposed other problems, the civil
procedure is to discuss things with the patch author instead of steamrolling
your changes in, obliterating other people's work.

>
> Given the choices between a potential test-suite hang, and tear-down 
> leaving waitpid events around, the decision was made in favor of the 
> latter.  That decision hasn't changed.  Having the test run hang, is a 
> lesser evil then the test-suite continuing but producing bogus results..

Who made that decision?  Can you explain why hangs in the testsuite (which
have been there for a long time) are reasonable to have?  I would think that
an intelligent person and a hard-core proponent of extreme programming and
other formal development methods like yourself would want to put a big
emphasis on the QA of this project, so I am surely missing something here.

> I restored the old behavior; and then added a timeout.  It currently 
> logs a message, I suspect it should abandon the test run, since the 
> problem state hasn't gone away.
>

I don't think that's what you did.  You reimplemented Kris's change.


* Re: Patch for TearDownProcess
  2007-09-07 13:31 ` Andrew Cagney
  2007-09-07 14:30   ` Kris Van Hees
  2007-09-07 18:39   ` Elena Zannoni
@ 2007-09-13 12:34   ` Kris Van Hees
  2 siblings, 0 replies; 5+ messages in thread
From: Kris Van Hees @ 2007-09-13 12:34 UTC (permalink / raw)
  To: frysk; +Cc: Kris Van Hees

In view of the quote below, and history over the past couple of months, I'd
like to pose an important question to the overall developer community behind
Frysk:

	Should we strive to have a consistent, dependable testsuite?

The reason to ask is that, on the one hand, some people seem to prefer to
look at tests as individual units, regardless of their side effects on later
tests in the testsuite (see the past discussions about the file descriptor
leak testing and its effects on the testsuite), whereas others prefer to see
the testsuite as a collection of tests that needs to be capable of producing
dependable results regardless of the outcome of any individual test.

Right now it seems that not everyone is even on a specific side of the issue,
preferring one behaviour for some test cases, and the other for other test
cases.  The end result is that the Frysk testsuite is not producing results
that can be considered a valid indicator in terms of correctness.

In addition, just yesterday we encountered a case where some build-time
testing is conditional, causing a commit to go through (tested on F7 prior
to commit) that broke the build on other platforms (FC6) because on the
latter platform the test was enabled.  That kind of thing obviously gives
developers a wrong impression of whether their patch is going to break
builds or not.

Today I am seeing the following (the first time I saw this was July 3rd, and
it was reported in bugzilla):

| Running testDataFuncPeekBytes(frysk.sys.TestPtrace) ...PASS
| Running testDataFuncPeekBytes(frysk.sys.TestPtrace) ...ERROR
|   close: Bad file descriptor (fd 0)
| Running testLengthUnderBound(frysk.sys.TestPtrace) ...PASS
| Running testLengthUnderBound(frysk.sys.TestPtrace) ...ERROR
|   close: Bad file descriptor (fd 0)
| Running testOffsetUnderBound(frysk.sys.TestPtrace) ...PASS
| Running testOffsetUnderBound(frysk.sys.TestPtrace) ...ERROR
|   close: Bad file descriptor (fd 0)
| Running testLengthOverBound(frysk.sys.TestPtrace) ...PASS
| Running testLengthOverBound(frysk.sys.TestPtrace) ...ERROR
|   close: Bad file descriptor (fd 0)
| Running testOffsetOverBound(frysk.sys.TestPtrace) ...PASS
| Running testOffsetOverBound(frysk.sys.TestPtrace) ...PASS
| Running testLengthOnBound(frysk.sys.TestPtrace) ...ERROR
|   close: Bad file descriptor (fd 0)
| Running testOffsetOnBound(frysk.sys.TestPtrace) ...PASS
| Running testOffsetOnBound(frysk.sys.TestPtrace) ...PASS

While it can be argued that a test reporting its result twice is hardly an
issue, I would think that a single test reporting both a PASS *and* an ERROR
result would be more reason for concern...

In all, most development methodologies put a lot of value in testing.  It is
also my experience that even without a formal methodology, testing tends to
be considered rather important.  I do not believe that the Frysk testsuite can
be taken seriously as long as we cannot have a consistent policy in terms of
test quality.  Either we allow failing tests to potentially corrupt the
validity of subsequent tests, or we prohibit failing tests from corrupting
the validity of subsequent tests.  Having it one way in some cases, and the
other way in others, simply leads to not being able to trust the results of
the testsuite at all.

	Cheers,
	Kris

On Fri, Sep 07, 2007 at 09:30:28AM -0400, Andrew Cagney wrote:
> Given the choices between a potential test-suite hang, and tear-down 
> leaving waitpid events around, the decision was made in favor of the 
> latter.  That decision hasn't changed.  Having the test run hang, is a 
> lesser evil then the test-suite continuing but producing bogus results.. 
> I restored the old behavior; and then added a timeout.  It currently logs a 
> message, I suspect it should abandon the test run, since the problem state 
> hasn't gone away.
>
> Andrew
>

