* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-28 18:53 Boehm, Hans
2002-03-28 23:08 ` Tom Tromey
2002-04-01 8:14 ` Michael Smith
0 siblings, 2 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-03-28 18:53 UTC (permalink / raw)
To: Boehm, Hans, 'Michael Smith', 'tromey@redhat.com'
Cc: 'Jeff Sturm ', ''Bryce McKinlay ' ',
'java@gcc.gnu.org'
Here's a patch. It's bigger than it needs to be in that it includes some
fixes for related things that I wasn't quite comfortable with while
proofreading the code. Most of those were already in my version. Many of
them make it easier to debug any similar problems we might encounter in the
future. Unless someone wants to object, or I encounter further test
failures, I will check the whole thing into both the trunk and the branch
sometime tomorrow.
The crucial part of the patch is the reclaim.c fix.
I have seen no more failures with the patch, so far. And I did finally
reproduce them without it. It would be interesting if particularly Jeff and
Michael could confirm that. (Michael - you already had some of the other
changes, so this probably won't apply cleanly without backing those out.)
Hans
* linux_threads.c (return_free_lists): Clear fl[i] unconditionally.
(GC_local_gcj_malloc): Add assertion.
(start_mark_threads): Fix abort message.
* mark.c (GC_mark_from): Generalize assertion.
* reclaim.c (GC_clear_fl_links): New function.
(GC_start_reclaim): Must clear some freelist links.
* specific.h, specific.c: Add assertions. Safer definition for
INVALID_QTID, quick_thread_id. Fix/add comments.
Rearrange tse fields.
Index: linux_threads.c
===================================================================
RCS file: /cvs/gcc/gcc/boehm-gc/linux_threads.c,v
retrieving revision 1.18.2.1
diff -u -r1.18.2.1 linux_threads.c
--- linux_threads.c 2002/03/25 18:23:36 1.18.2.1
+++ linux_threads.c 2002/03/29 02:15:18
@@ -231,15 +231,16 @@
nwords = i * (GRANULARITY/sizeof(word));
qptr = fl + i;
q = *qptr;
- if ((word)q < HBLKSIZE) continue;
- if (gfl[nwords] == 0) {
+ if ((word)q >= HBLKSIZE) {
+ if (gfl[nwords] == 0) {
gfl[nwords] = q;
- } else {
+ } else {
/* Concatenate: */
for (; (word)q >= HBLKSIZE; qptr = &(obj_link(q)), q = *qptr);
GC_ASSERT(0 == q);
*qptr = gfl[nwords];
gfl[nwords] = fl[i];
+ }
}
/* Clear fl[i], since the thread structure may hang around. */
/* Do it in a way that is likely to trap if we access it. */
@@ -412,6 +413,7 @@
/* A memory barrier is probably never needed, since the */
/* action of stopping this thread will cause prior writes */
/* to complete. */
+ GC_ASSERT(((void * volatile *)result)[1] == 0);
*(void * volatile *)result = ptr_to_struct_containing_descr;
return result;
} else if ((word)my_entry - 1 < DIRECT_GRANULES) {
@@ -544,7 +546,7 @@
ABORT("pthread_attr_getstacksize failed\n");
if (old_size < MIN_STACK_SIZE) {
if (pthread_attr_setstacksize(&attr, MIN_STACK_SIZE) != 0)
- ABORT("pthread_attr_getstacksize failed\n");
+ ABORT("pthread_attr_setstacksize failed\n");
}
}
# endif /* HPUX */
Index: mark.c
===================================================================
RCS file: /cvs/gcc/gcc/boehm-gc/mark.c,v
retrieving revision 1.12.2.1
diff -u -r1.12.2.1 mark.c
--- mark.c 2002/03/12 18:31:12 1.12.2.1
+++ mark.c 2002/03/29 02:15:19
@@ -546,13 +546,13 @@
/* Large length. */
/* Process part of the range to avoid pushing too much on the */
/* stack. */
+ GC_ASSERT(descr < GC_greatest_plausible_heap_addr
+ - GC_least_plausible_heap_addr);
# ifdef PARALLEL_MARK
# define SHARE_BYTES 2048
if (descr > SHARE_BYTES && GC_parallel
&& mark_stack_top < mark_stack_limit - 1) {
int new_size = (descr/2) & ~(sizeof(word)-1);
- GC_ASSERT(descr < GC_greatest_plausible_heap_addr
- - GC_least_plausible_heap_addr);
mark_stack_top -> mse_start = current_p;
mark_stack_top -> mse_descr = new_size + sizeof(word);
/* makes sure we handle */
Index: reclaim.c
===================================================================
RCS file: /cvs/gcc/gcc/boehm-gc/reclaim.c,v
retrieving revision 1.11
diff -u -r1.11 reclaim.c
--- reclaim.c 2002/02/12 04:37:53 1.11
+++ reclaim.c 2002/03/29 02:15:19
@@ -862,6 +862,25 @@
#endif /* NO_DEBUGGING */
/*
+ * Clear all obj_link pointers in the list of free objects *flp.
+ * Clear *flp.
+ * This must be done before dropping a list of free gcj-style objects,
+ * since we may otherwise end up with dangling "descriptor" pointers.
+ * It may help for other pointer-containing objects.
+ */
+void GC_clear_fl_links(flp)
+ptr_t *flp;
+{
+ ptr_t next = *flp;
+
+ while (0 != next) {
+ *flp = 0;
+ flp = &(obj_link(next));
+ next = *flp;
+ }
+}
+
+/*
* Perform GC_reclaim_block on the entire heap, after first clearing
* small object free lists (if we are not just looking for leaks).
*/
@@ -875,17 +894,24 @@
# endif
/* Clear reclaim- and free-lists */
for (kind = 0; kind < GC_n_kinds; kind++) {
- register ptr_t *fop;
- register ptr_t *lim;
- register struct hblk ** rlp;
- register struct hblk ** rlim;
- register struct hblk ** rlist = GC_obj_kinds[kind].ok_reclaim_list;
+ ptr_t *fop;
+ ptr_t *lim;
+ struct hblk ** rlp;
+ struct hblk ** rlim;
+ struct hblk ** rlist = GC_obj_kinds[kind].ok_reclaim_list;
+ GC_bool should_clobber = (GC_obj_kinds[kind].ok_descriptor != 0);
if (rlist == 0) continue; /* This kind not used. */
if (!report_if_found) {
lim = &(GC_obj_kinds[kind].ok_freelist[MAXOBJSZ+1]);
for( fop = GC_obj_kinds[kind].ok_freelist; fop < lim; fop++ ) {
- *fop = 0;
+ if (*fop != 0) {
+ if (should_clobber) {
+ GC_clear_fl_links(fop);
+ } else {
+ *fop = 0;
+ }
+ }
}
} /* otherwise free list objects are marked, */
/* and it's safe to leave them */
Index: specific.c
===================================================================
RCS file: /cvs/gcc/gcc/boehm-gc/specific.c,v
retrieving revision 1.3
diff -u -r1.3 specific.c
--- specific.c 2001/10/16 09:01:36 1.3
+++ specific.c 2002/03/29 02:15:19
@@ -16,17 +16,27 @@
#include "private/gc_priv.h" /* For GC_compare_and_exchange,
GC_memory_barrier */
#include "private/specific.h"
-static tse invalid_tse; /* 0 qtid is guaranteed to be invalid */
+static tse invalid_tse = {INVALID_QTID, 0, 0, INVALID_THREADID};
+ /* A thread-specific data entry which will never */
+ /* appear valid to a reader. Used to fill in empty */
+ /* cache entries to avoid a check for 0. */
int PREFIXED(key_create) (tsd ** key_ptr, void (* destructor)(void *)) {
int i;
tsd * result = (tsd *)MALLOC_CLEAR(sizeof (tsd));
+ /* A quick alignment check, since we need atomic stores */
+ GC_ASSERT((unsigned long)(&invalid_tse.next) % sizeof(tse *) == 0);
if (0 == result) return ENOMEM;
pthread_mutex_init(&(result -> lock), NULL);
for (i = 0; i < TS_CACHE_SIZE; ++i) {
result -> cache[i] = &invalid_tse;
}
+# ifdef GC_ASSERTIONS
+ for (i = 0; i < TS_HASH_SIZE; ++i) {
+ GC_ASSERT(result -> hash[i] == 0);
+ }
+# endif
*key_ptr = result;
return 0;
}
@@ -36,12 +46,14 @@
int hash_val = HASH(self);
volatile tse * entry = (volatile tse *)MALLOC_CLEAR(sizeof (tse));
+ GC_ASSERT(self != INVALID_THREADID);
if (0 == entry) return ENOMEM;
pthread_mutex_lock(&(key -> lock));
/* Could easily check for an existing entry here. */
entry -> next = key -> hash[hash_val];
entry -> thread = self;
entry -> value = value;
+ GC_ASSERT(entry -> qtid == INVALID_QTID);
/* There can only be one writer at a time, but this needs to be */
/* atomic with respect to concurrent readers. */
*(volatile tse **)(key -> hash + hash_val) = entry;
@@ -70,6 +82,10 @@
*link = entry -> next;
/* Atomic! concurrent accesses still work. */
/* They must, since readers don't lock. */
+ /* We shouldn't need a volatile access here, */
+ /* since both this and the preceding write */
+ /* should become visible no later than */
+ /* the pthread_mutex_unlock() call. */
}
/* If we wanted to deallocate the entry, we'd first have to clear */
/* any cache entries pointing to it. That probably requires */
@@ -91,6 +107,7 @@
unsigned hash_val = HASH(self);
tse *entry = key -> hash[hash_val];
+ GC_ASSERT(qtid != INVALID_QTID);
while (entry != NULL && entry -> thread != self) {
entry = entry -> next;
}
@@ -99,6 +116,8 @@
entry -> qtid = qtid;
/* It's safe to do this asynchronously. Either value */
/* is safe, though may produce spurious misses. */
+ /* We're replacing one qtid with another one for the */
+ /* same thread. */
*cache_ptr = entry;
/* Again this is safe since pointer assignments are */
/* presumed atomic, and either pointer is valid. */
Index: include/private/specific.h
===================================================================
RCS file: /cvs/gcc/gcc/boehm-gc/include/private/specific.h,v
retrieving revision 1.2
diff -u -r1.2 specific.h
--- specific.h 2001/08/17 18:30:50 1.2
+++ specific.h 2002/03/29 02:15:19
@@ -27,16 +27,22 @@
#define TS_HASH_SIZE 1024
#define HASH(n) (((((long)n) >> 8) ^ (long)n) & (TS_HASH_SIZE - 1))
+/* An entry describing a thread-specific value for a given thread. */
+/* All such accessible structures preserve the invariant that if either */
+/* thread is a valid pthread id or qtid is a valid "quick thread id" */
+/* for a thread, then value holds the corresponding thread-specific */
+/* value. This invariant must be preserved at ALL times, since */
+/* asynchronous reads are allowed. */
typedef struct thread_specific_entry {
unsigned long qtid; /* quick thread id, only for cache */
void * value;
- pthread_t thread;
struct thread_specific_entry *next;
+ pthread_t thread;
} tse;
/* We represent each thread-specific datum as two tables. The first is */
-/* a cache, index by a "quick thread identifier". The "quick" thread */
+/* a cache, indexed by a "quick thread identifier". The "quick" thread */
/* identifier is an easy to compute value, which is guaranteed to */
/* determine the thread, though a thread may correspond to more than */
/* one value. We typically use the address of a page in the stack. */
@@ -45,12 +51,15 @@
/* Return the "quick thread id". Default version. Assumes page size, */
/* or at least thread stack separation, is at least 4K. */
-static __inline__ long quick_thread_id() {
+/* Must be defined so that it never returns 0. (Page 0 can't really */
+/* be part of any stack, since that would make 0 a valid stack pointer.)*/
+static __inline__ unsigned long quick_thread_id() {
int dummy;
- return (long)(&dummy) >> 12;
+ return (unsigned long)(&dummy) >> 12;
}
-#define INVALID_QTID ((unsigned long)(-1))
+#define INVALID_QTID ((unsigned long)0)
+#define INVALID_THREADID ((pthread_t)0)
typedef struct thread_specific_data {
tse * volatile cache[TS_CACHE_SIZE];
@@ -76,7 +85,10 @@
unsigned hash_val = CACHE_HASH(qtid);
tse * volatile * entry_ptr = key -> cache + hash_val;
tse * entry = *entry_ptr; /* Must be loaded only once. */
- if (entry -> qtid == qtid) return entry -> value;
+ if (entry -> qtid == qtid) {
+ GC_ASSERT(entry -> thread == pthread_self());
+ return entry -> value;
+ }
return PREFIXED(slow_getspecific) (key, qtid, entry_ptr);
}
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-28 18:53 GC failure w/ THREAD_LOCAL_ALLOC ? Boehm, Hans
@ 2002-03-28 23:08 ` Tom Tromey
2002-03-29 14:05 ` Bryce McKinlay
2002-04-01 8:14 ` Michael Smith
1 sibling, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2002-03-28 23:08 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Michael Smith', 'Jeff Sturm ',
''Bryce McKinlay ' ', 'java@gcc.gnu.org'
>>>>> "Hans" == Boehm, Hans <hans_boehm@hp.com> writes:
Hans> Here's a patch. It's bigger than it needs to be in that it
Hans> includes some fixes for related things that I wasn't quite
Hans> comfortable with while proofreading the code. Most of those
Hans> were already in my version. Many of them make it easier to
Hans> debug any similar problems we might encounter in the future.
Hans> Unless someone wants to object, or I encounter further test
Hans> failures, I will check the whole thing into both the trunk and
Hans> the branch sometime tomorrow.
I applied this and tested it on alpha Linux.
First I found a situation where I could reliably make GCTest hang.
Then I rebuilt with the patch and re-ran GCTest with the same
arguments. I was unable to make it hang again. Then I tried running
it with other (larger, and thus presumably "harder") arguments. I was
still unable to make it hang.
So, I'm happy. Thanks for debugging this.
Hans> I have seen no more failures with the patch, so far. And I did
Hans> finally reproduce them without it. It would be interesting if
Hans> particularly Jeff and Michael could confirm that.
I'd also like to hear from Bryce.
Tom
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-28 23:08 ` Tom Tromey
@ 2002-03-29 14:05 ` Bryce McKinlay
2002-03-30 5:20 ` Jeff Sturm
0 siblings, 1 reply; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-29 14:05 UTC (permalink / raw)
To: tromey
Cc: Boehm, Hans, 'Michael Smith', 'Jeff Sturm ',
'java@gcc.gnu.org'
Tom Tromey wrote:
>So, I'm happy. Thanks for debugging this.
>
>Hans> I have seen no more failures with the patch, so far. And I did
>Hans> finally reproduce them without it. It would be interesting if
>Hans> particularly Jeff and Michael could confirm that.
>
>I'd also like to hear from Bryce.
>
Yup, this seems to have fixed it! Thanks Hans.
regards
Bryce.
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-29 14:05 ` Bryce McKinlay
@ 2002-03-30 5:20 ` Jeff Sturm
0 siblings, 0 replies; 36+ messages in thread
From: Jeff Sturm @ 2002-03-30 5:20 UTC (permalink / raw)
To: Bryce McKinlay
Cc: tromey, Boehm, Hans, 'Michael Smith', 'java@gcc.gnu.org'
On Sat, 30 Mar 2002, Bryce McKinlay wrote:
> >Hans> I have seen no more failures with the patch, so far. And I did
> >Hans> finally reproduce them without it. It would be interesting if
> >Hans> particularly Jeff and Michael could confirm that.
> >
> >I'd also like to hear from Bryce.
> >
>
> Yup, this seems to have fixed it! Thanks Hans.
I thought I saw the problem too, but it turns out to be harder to
reproduce than I expected. I'll try again later.
Jeff
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-28 18:53 GC failure w/ THREAD_LOCAL_ALLOC ? Boehm, Hans
2002-03-28 23:08 ` Tom Tromey
@ 2002-04-01 8:14 ` Michael Smith
1 sibling, 0 replies; 36+ messages in thread
From: Michael Smith @ 2002-04-01 8:14 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'tromey@redhat.com', 'Jeff Sturm ',
''Bryce McKinlay ' ', 'java@gcc.gnu.org'
Boehm, Hans wrote:
> I have seen no more failures with the patch, so far. And I did
> finally reproduce them without it. It would be interesting if
> particularly Jeff and Michael could confirm that. (Michael - you
> already had some of the other changes, so this probably won't apply
> cleanly without backing those out.)
I managed to manually merge them in, except:
> Index: linux_threads.c
> ===================================================================
> RCS file: /cvs/gcc/gcc/boehm-gc/linux_threads.c,v
> retrieving revision 1.18.2.1
> diff -u -r1.18.2.1 linux_threads.c
> --- linux_threads.c 2002/03/25 18:23:36 1.18.2.1
> +++ linux_threads.c 2002/03/29 02:15:18
[snip]
> @@ -544,7 +546,7 @@
> ABORT("pthread_attr_getstacksize failed\n");
> if (old_size < MIN_STACK_SIZE) {
> if (pthread_attr_setstacksize(&attr, MIN_STACK_SIZE) != 0)
> - ABORT("pthread_attr_getstacksize failed\n");
> + ABORT("pthread_attr_setstacksize failed\n");
> }
> }
> # endif /* HPUX */
this hunk didn't apply -- the code doesn't seem to even exist in my
version. Doesn't appear to matter though: a) the change does not appear
to be substantial, and b) HPUX isn't defined, so it probably wouldn't
have touched this anyway. :)
With the rest of the changes applied, I can no longer replicate the
problem using GCTest.
Thanks Hans!
Do you recommend running without GC_IGNORE_GCJ_INFO now that this fix is in?
regards,
michael
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-04-01 8:53 Boehm, Hans
0 siblings, 0 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-04-01 8:53 UTC (permalink / raw)
To: 'Michael Smith', Boehm, Hans
Cc: 'tromey@redhat.com', 'Jeff Sturm ',
''Bryce McKinlay ' ', 'java@gcc.gnu.org'
> -----Original Message-----
> From: Michael Smith [mailto:msmith@spinnakernet.com]
> With the rest of the changes applied, I can no longer replicate the
> problem using GCTest.
>
Great! Thanks for checking that.
>
> Do you recommend running without GC_IGNORE_GCJ_INFO now that
> this fix is in?
>
Yes, I would no longer set the environment variable. For at least 9 out of
10 applications it won't matter. The relative speed is probably platform
dependent. But every once in a while you may find one that introduces
enough misidentified pointers to cause appreciable leakage. (In my
experience, this tends to be fairly testable, i.e. it's usually more a
property of the application than the input, but still ...) In the absence
of known bugs, I would always use the available type information.
Hans
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-29 17:09 Boehm, Hans
0 siblings, 0 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-03-29 17:09 UTC (permalink / raw)
To: 'Bryce McKinlay', tromey
Cc: Boehm, Hans, 'Michael Smith', 'Jeff Sturm ',
'java@gcc.gnu.org'
Thanks everyone for your help.
I checked in the patch.
Hans
> -----Original Message-----
> From: Bryce McKinlay [mailto:bryce@waitaki.otago.ac.nz]
> Sent: Friday, March 29, 2002 2:05 PM
> To: tromey@redhat.com
> Cc: Boehm, Hans; 'Michael Smith'; 'Jeff Sturm '; 'java@gcc.gnu.org'
> Subject: Re: GC failure w/ THREAD_LOCAL_ALLOC ?
>
>
> Tom Tromey wrote:
>
> >So, I'm happy. Thanks for debugging this.
> >
> >Hans> I have seen no more failures with the patch, so far. And I did
> >Hans> finally reproduce them without it. It would be interesting if
> >Hans> particularly Jeff and Michael could confirm that.
> >
> >I'd also like to hear from Bryce.
> >
>
> Yup, this seems to have fixed it! Thanks Hans.
>
> regards
>
> Bryce.
>
>
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-28 12:08 Boehm, Hans
0 siblings, 0 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-03-28 12:08 UTC (permalink / raw)
To: 'Michael Smith', tromey, Boehm, Hans
Cc: 'Jeff Sturm ', ''Bryce McKinlay ' ',
'java@gcc.gnu.org'
FWIW -
I believe I have an explanation of the original problem, though I need to
think a little more about the best way to fix it, and I can't completely
confirm this without the fix (and perhaps not even then).
The problem should be limited to gcj objects. Turning off gcj type
information (GC_IGNORE_GCJ_INFO environment variable) should be an adequate
work-around for now. My guess is that this can occur with or without
thread-local allocation, though there are probably reasons to expect that
thread-local allocation increases the occurrence probability.
Details:
The problem is that at the end of a collection cycle, in GC_start_reclaim,
free lists remaining from the last GC cycle are just dropped, since they
will be rebuilt. My guess is:
1) A short free list of gcj objects is dropped. The page containing it is
scheduled to be reclaimed (swept) in the next cycle.
2) Objects of that size end up being in low demand or the GC is invoked
explicitly, and thus the page is never actually swept, and the free list
remains.
3) The next mark phase sees a bogus pointer to part of the free list. Since
the referenced object appears to have a 0 mark descriptor, the rest of the
free list is not marked.
4) The rest of the free list is reallocated. The one object that was
accidentally marked remains as it was.
5) The next mark cycle sees an object (the accidentally marked one) with a
vtable/free list pointer that points to an in-use object. Its second word
is nonzero and causes the collector to crash when it is misinterpreted as a
mark descriptor.
I clearly need to be more careful about just dropping freelists with gcj
objects. But this shouldn't be hard to fix.
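A minimal sketch of the aliasing that makes this dangerous (hypothetical
layout for illustration; not the collector's actual definitions): the first
word of a free object doubles as the gcj vtable slot once the object is
handed out, and the marker reads the second word of whatever that slot
points at as a mark descriptor.

    /* While an object sits on a free list, word 0 is the link; once    */
    /* allocated as a gcj object, word 0 is the vtable pointer, and the */
    /* marker reads the vtable's second word as the mark descriptor.    */
    union first_word {
        union first_word *next;    /* free-list link while free     */
        struct fake_vtable *vtab;  /* vtable pointer once allocated */
    };

    struct fake_vtable {
        void *class_info;
        unsigned long mark_descr;  /* what the marker interprets */
    };

    /* If a dropped-but-unswept free list survives a cycle and one of */
    /* its objects is misidentified as live, the marker follows       */
    /* obj->next as if it were obj->vtab and treats arbitrary memory  */
    /* as a mark descriptor - exactly the crash in step 5 above.      */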
Hans
> -----Original Message-----
> From: Michael Smith [mailto:msmith@spinnakernet.com]
> Sent: Thursday, March 28, 2002 6:52 AM
> To: tromey@redhat.com; Boehm, Hans
> Cc: 'Michael Smith'; 'Jeff Sturm '; ''Bryce McKinlay ' '
> Subject: RE: GC failure w/ THREAD_LOCAL_ALLOC ?
>
>
> > And here I thought that bug was mostly theoretical.
> > It's always nice to find out when the details matter. Thanks.
> >
> > Tom
>
> No, thank *you*. Saved me some work trying to fix it myself. :)
>
> Do you want the test case I wrote that showed the Process stuff
> breaking? It wasn't that hard to write one (once I figured out how/why
> it was happening):
>
> 1. Exec process with executable name that doesn't exist.
> 2. Open a file for reading.
> 3. Force garbage collection.
> 4. Read from file. --> IOException: Bad File Descriptor
>
> regards,
> michael
>
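(The quoted recipe is a classic double-close race: the stale descriptor
number is reused by an unrelated open(), and the second close() tears that
down. A minimal standalone C illustration of the same failure mode, not
libgcj's code:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int a = open("/dev/null", O_RDONLY);
        close(a);                             /* first close: fine */
        int b = open("/dev/zero", O_RDONLY);  /* kernel reuses the number */
        close(a);                             /* stale second close kills b */
        char c;
        if (read(b, &c, 1) < 0)
            perror("read");                   /* EBADF: Bad file descriptor */
        return 0;
    }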
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-22 22:13 Boehm, Hans
@ 2002-03-27 12:48 ` Michael Smith
0 siblings, 0 replies; 36+ messages in thread
From: Michael Smith @ 2002-03-27 12:48 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Jeff Sturm ', ''Bryce McKinlay ' ',
'''java@gcc.gnu.org' ' '
Boehm, Hans wrote:
> Can you put a breakpoint in close() to verify that the descriptor is being
> closed by a finalizer?
>
> If you can repeat this, and the GC is responsible, and you can figure out
> which object is getting collected, a good thing to do is to put a breakpoint
> in GC_finish_collection() during the offending GC, and then invoke
> GC_is_marked() on all objects in the reference chain from the debugger, with
> the process still stopped in GC_finish_collection. If you find a marked
> object pointing to an unmarked one, it would be good to know the results of
>
> print *GC_find_header(<last marked>)
>
> If it's a gcj object, the mark descriptor would be good to know as well.
>
> I'm also suspicious that this is a completely separate problem, possibly
> outside the GC.
After a bit of investigation, I can confidently say that the GC is not
collecting objects prematurely, causing my /dev/urandom file descriptor
to get closed.
It turns out I was hitting a double-close related to a failed Process
exec, which had been fixed in CVS since I got my version. I've pulled
down the patches to natPosixProcess.cc, applied them to my local tree
and my file descriptors are no longer disappearing unexpectedly.
thanks,
michael
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-26 16:11 ` Tom Tromey
@ 2002-03-26 19:49 ` Bryce McKinlay
0 siblings, 0 replies; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-26 19:49 UTC (permalink / raw)
To: tromey
Cc: Boehm, Hans, 'Michael Smith', 'java@gcc.gnu.org',
'Jeff Sturm'
Tom Tromey wrote:
>>>>>>"Bryce" == Bryce McKinlay <bryce@waitaki.otago.ac.nz> writes:
>>>>>>
>
>Bryce> Well, I noticed that it is much more likely to happen when the
>Bryce> machine is busy with other tasks. If the machine is otherwise
>Bryce> idle, the problem doesn't seem to occur.
>
>I tried a lot of things, including rebuilding gcc while running it,
>but can't cause it to fail on x86. On the alpha I can make it fail
>pretty easily, as it turns out, by using different parameters. `50
>50' makes it fail there.
>
Hmm, must be some other problem on alpha, because alpha does not use
THREAD_LOCAL_ALLOC. My problem definitely occurs only when
THREAD_LOCAL_ALLOC is enabled.
regards
Bryce.
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-26 15:57 ` Bryce McKinlay
@ 2002-03-26 16:11 ` Tom Tromey
2002-03-26 19:49 ` Bryce McKinlay
0 siblings, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2002-03-26 16:11 UTC (permalink / raw)
To: Bryce McKinlay
Cc: Boehm, Hans, 'Michael Smith', 'java@gcc.gnu.org',
'Jeff Sturm'
>>>>> "Bryce" == Bryce McKinlay <bryce@waitaki.otago.ac.nz> writes:
Bryce> Well, I noticed that it is much more likely to happen when the
Bryce> machine is busy with other tasks. If the machine is otherwise
Bryce> idle, the problem doesn't seem to occur.
I tried a lot of things, including rebuilding gcc while running it,
but can't cause it to fail on x86. On the alpha I can make it fail
pretty easily, as it turns out, by using different parameters. `50
50' makes it fail there.
BTW I submitted a high-priority PR for this so we can track it.
I found I was losing track of everything going on, so I've started
entering PRs for anything I think we need to fix for 3.1.
Tom
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-26 15:45 ` Tom Tromey
@ 2002-03-26 15:57 ` Bryce McKinlay
2002-03-26 16:11 ` Tom Tromey
0 siblings, 1 reply; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-26 15:57 UTC (permalink / raw)
To: tromey
Cc: Boehm, Hans, 'Michael Smith', 'java@gcc.gnu.org',
'Jeff Sturm'
Tom Tromey wrote:
>>>>>>"Hans" == Boehm, Hans <hans_boehm@hp.com> writes:
>>>>>>
>
>Hans> I can't reproduce the problem on X86 either. Questions:
>Hans> 0) Do other people see similar problems?
>
>I finally tried GCTest on my x86 box. It worked fine for me. I ran
>it with no arguments. I also tried it on alpha Linux and it worked
>fine. My PPC build is still running so I haven't tried it there yet.
>
Well, I noticed that it is much more likely to happen when the machine
is busy with other tasks. If the machine is otherwise idle, the problem
doesn't seem to occur.
Doing something like:
while true; do ./gctest; done
then running something else, like a compile or benchmark, at the same
time seems to help reproduce the problem on some machines.
regards
Bryce.
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 17:03 Boehm, Hans
2002-03-20 17:54 ` Bryce McKinlay
@ 2002-03-26 15:45 ` Tom Tromey
2002-03-26 15:57 ` Bryce McKinlay
1 sibling, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2002-03-26 15:45 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Michael Smith', 'Bryce McKinlay',
'java@gcc.gnu.org', 'Jeff Sturm'
>>>>> "Hans" == Boehm, Hans <hans_boehm@hp.com> writes:
Hans> I can't reproduce the problem on X86 either. Questions:
Hans> 0) Do other people see similar problems?
I finally tried GCTest on my x86 box. It worked fine for me. I ran
it with no arguments. I also tried it on alpha Linux and it worked
fine. My PPC build is still running so I haven't tried it there yet.
Tom
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-22 22:13 Boehm, Hans
2002-03-27 12:48 ` Michael Smith
0 siblings, 1 reply; 36+ messages in thread
From: Boehm, Hans @ 2002-03-22 22:13 UTC (permalink / raw)
To: 'Michael Smith ', Boehm, Hans
Cc: 'Jeff Sturm ', ''Bryce McKinlay ' ',
'''java@gcc.gnu.org' ' '
Can you put a breakpoint in close() to verify that the descriptor is being
closed by a finalizer?
If you can repeat this, and the GC is responsible, and you can figure out
which object is getting collected, a good thing to do is to put a breakpoint
in GC_finish_collection() during the offending GC, and then invoke
GC_is_marked() on all objects in the reference chain from the debugger, with
the process still stopped in GC_finish_collection. If you find a marked
object pointing to an unmarked one, it would be good to know the results of
print *GC_find_header(<last marked>)
If it's a gcj object, the mark descriptor would be good to know as well.
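(A hedged sketch of a helper for that chain walk; GC_is_marked is a
collector internal and the signature assumed here may differ between
versions:)

    #include <stdio.h>

    extern int GC_is_marked(void *obj_base);   /* assumed signature */

    /* With the process stopped in GC_finish_collection, report the */
    /* first marked -> unmarked transition along a known reference  */
    /* chain; the last marked object is the one whose header to     */
    /* inspect with GC_find_header.                                 */
    void report_chain(void **chain, int len)
    {
        int i;
        for (i = 0; i < len; i++) {
            int marked = GC_is_marked(chain[i]);
            printf("link %d: %p %s\n", i, chain[i],
                   marked ? "marked" : "UNMARKED");
            if (!marked) break;
        }
    }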
I'm also suspicious that this is a completely separate problem, possibly
outside the GC.
Hans
-----Original Message-----
From: Michael Smith
>
> Do you know how the FileInputStream is referenced? Is it only referenced
> from the stack? Does the reference chain go through large objects,
> particularly through objects which have pointers at displacements > 100 or
> so?
Referenced as a member variable of an instantiated java object, so it's
referenced from the heap. That instantiated java object is held by a
static class variable. Neither of those two objects are particularly
large, with no more than 6 member variables.
>>Hans, do you have enough information about the mark descriptor
>>clobbering to recommend a workaround even if you don't know exactly what
>>the problem is? For example, are you reasonably confident that building
>>with THREAD_LOCAL_ALLOC or USE_PTHREAD_SPECIFIC or something else will
>>eliminate the problem (and not just "hide" it like GC_IGNORE_GCJ_INFO
>>seems to do)?
>>
>
> Assuming it appears to work, I have the greatest confidence in
> GC_IGNORE_GCJ_INFO. This sidesteps a lot of subtle issues in the
collector,
> at the expense of possibly retaining extraneous memory. But without
fully
> understanding the problem, it's hard to say for sure. Turning off
> THREAD_LOCAL_ALLOC would be my next choice. My current assumption is
that
> Jeff was seeing a different problem.
I've been running with GC_IGNORE_GCJ_INFO for a month now. I haven't
seen the GC deadlock (same problem Bryce described) since I started
using this option. I *am* seeing the my file descriptor disappear using
this option. So, maybe this file descriptor thing really is a separate
problem.
> I'm currently in the middle of tracking down the (n+1)st Itanium stack
> unwinding issue. I'll get back to trying to reproduce this after that. If
> someone else has a similar failure report, especially with older glibc
> versions, that would be helpful.
Any pointers on what other information may be useful to you? If I can
do some tests now, then maybe you can have an easier time once you can
pick this problem up. Right now, I know of no workaround for this file
descriptor problem, so the sooner it gets resolved, the better.
regards,
michael
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-22 18:25 ` Jeff Sturm
@ 2002-03-22 21:22 ` Jeff Sturm
0 siblings, 0 replies; 36+ messages in thread
From: Jeff Sturm @ 2002-03-22 21:22 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Michael Smith', 'Bryce McKinlay ',
''java@gcc.gnu.org' '
On Fri, 22 Mar 2002, Jeff Sturm wrote:
> Notice the thread is created before blocking on accept(). After accept()
> returns we start() the thread and create another... isn't it true that
> this thread is momentarily invisible to the collector? Since a return
> from t.start() doesn't guarantee that t is running?
ugh... I forgot about the pthread_create() wrapper. So never mind the
above. Nevertheless, that makes the results even more puzzling. I'll
keep investigating.
Jeff
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-22 11:41 Boehm, Hans
2002-03-22 12:59 ` Michael Smith
@ 2002-03-22 18:25 ` Jeff Sturm
2002-03-22 21:22 ` Jeff Sturm
1 sibling, 1 reply; 36+ messages in thread
From: Jeff Sturm @ 2002-03-22 18:25 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Michael Smith', 'Bryce McKinlay ',
''java@gcc.gnu.org' '
On Fri, 22 Mar 2002, Boehm, Hans wrote:
> Assuming it appears to work, I have the greatest confidence in
> GC_IGNORE_GCJ_INFO. This sidesteps a lot of subtle issues in the collector,
> at the expense of possibly retaining extraneous memory. But without fully
> understanding the problem, it's hard to say for sure. Turning off
> THREAD_LOCAL_ALLOC would be my next choice. My current assumption is that
> Jeff was seeing a different problem.
Well, I'm seeing problems with and without GC_IGNORE_GCJ_INFO, but only a
deadlock with it. The more I think about it, I'm starting to suspect my
problem is related to Bryce's and Michael's reports after all.
Hmm... might be on to something here. What happens if a collection takes
place after a thread is created but before it gets a chance to start?
Our code is proprietary, but portions are based on free software
such as Apache JServ. Here's the main loop from JServ:
while (true) {
    // Here we make sure that the number of
    // parallel connections is limited
    semaphore.throttle();
    try {
        JServConnection connection = new JServConnection();
        Thread t = new Thread(connection);
        t.setDaemon(true);
        Socket clientSocket = listenSocket.accept();
        connection.init(clientSocket, semaphore);
        t.start();
    } catch (...
    ...
}
Notice the thread is created before blocking on accept(). After accept()
returns we start() the thread and create another... isn't it true that
this thread is momentarily invisible to the collector? Since a return
from t.start() doesn't guarantee that t is running?
As an experiment I rearranged the loop above so that we keep a reference
to the previous thread until the next accept() returns. That did seem
to make a difference... the application no longer deadlocks easily.
What's the right thing to do? I'd guess that Thread.start() should
arrange to keep a pointer to the new thread somewhere the collector can
see it.
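(As Jeff notes in his follow-up, libgcj's collector already wraps
pthread_create() for roughly this reason. A sketch of the general
technique, with hypothetical names, not libgcj's actual code:)

    #include <pthread.h>

    #define MAX_THREADS 256

    /* Hypothetical table scanned by the collector as part of its root */
    /* set, so a new thread's start argument stays reachable between   */
    /* create() and the moment the child registers its own stack.      */
    static void *registered_args[MAX_THREADS];
    static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

    int gc_pthread_create(pthread_t *t, const pthread_attr_t *attr,
                          void *(*start)(void *), void *arg)
    {
        int i;
        pthread_mutex_lock(&reg_lock);
        for (i = 0; i < MAX_THREADS && registered_args[i] != 0; i++)
            ;
        if (i < MAX_THREADS)
            registered_args[i] = arg;  /* root the argument first */
        pthread_mutex_unlock(&reg_lock);
        /* A real wrapper clears the slot once the child thread has */
        /* registered itself with the collector; omitted here.      */
        return pthread_create(t, attr, start, arg);
    }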
Jeff
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-22 11:41 Boehm, Hans
@ 2002-03-22 12:59 ` Michael Smith
2002-03-22 18:25 ` Jeff Sturm
1 sibling, 0 replies; 36+ messages in thread
From: Michael Smith @ 2002-03-22 12:59 UTC (permalink / raw)
To: Boehm, Hans
Cc: Jeff Sturm, 'Bryce McKinlay ',
''java@gcc.gnu.org' '
Boehm, Hans wrote:
>>From: Michael Smith [mailto:msmith@spinnakernet.com]
>>I've been seeing this as well and have been having difficulty tracking
>>it down. After seeing this message, I'm starting to think this may be
>>the cause.
>>
>>I have an open FileInputStream to /dev/urandom. An "ls -l" of
>>"/proc/<pid>/fd" shows file descriptor 51 linked to /dev/urandom. Upon
>>a seemingly innocuous command in my app (doesn't touch anything near the
>>FileInputStream), file descriptor 51 no longer exists in /proc/<pid>/fd.
>>Sometimes it exists as a socket instead.
>>
>>When I turned on GC_PRINT_STATS, I noticed this seemingly innocuous
>>command involves a garbage collection. When I started the application
>>with an enormous initial heap size, the GC did not occur, and the fd for
>>/dev/urandom stays around.
>
> Do you know how the FileInputStream is referenced? Is it only referenced
> from the stack? Does the reference chain go through large objects,
> particularly through objects which have pointers at displacements > 100 or
> so?
Referenced as a member variable of an instantiated java object, so it's
referenced from the heap. That instantiated java object is held by a
static class variable. Neither of those two objects are particularly
large, with no more than 6 member variables.
>>Hans, do you have enough information about the mark descriptor
>>clobbering to recommend a workaround even if you don't know exactly what
>>the problem is? For example, are you reasonably confident that building
>>with THREAD_LOCAL_ALLOC or USE_PTHREAD_SPECIFIC or something else will
>>eliminate the problem (and not just "hide" it like GC_IGNORE_GCJ_INFO
>>seems to do)?
>>
>
> Assuming it appears to work, I have the greatest confidence in
> GC_IGNORE_GCJ_INFO. This sidesteps a lot of subtle issues in the collector,
> at the expense of possibly retaining extraneous memory. But without fully
> understanding the problem, it's hard to say for sure. Turning off
> THREAD_LOCAL_ALLOC would be my next choice. My current assumption is that
> Jeff was seeing a different problem.
I've been running with GC_IGNORE_GCJ_INFO for a month now. I haven't
seen the GC deadlock (same problem Bryce described) since I started
using this option. I *am* seeing my file descriptor disappear using
this option. So, maybe this file descriptor thing really is a separate
problem.
> I'm currently in the middle of tracking down the (n+1)st Itanium stack
> unwinding issue. I'll get back to trying to reproduce this after that. If
> someone else has a similar failure report, especially with older glibc
> versions, that would be helpful.
Any pointers on what other information may be useful to you? If I can
do some tests now, then maybe you can have an easier time once you can
pick this problem up. Right now, I know of no workaround for this file
descriptor problem, so the sooner it gets resolved, the better.
regards,
michael
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-22 11:41 Boehm, Hans
2002-03-22 12:59 ` Michael Smith
2002-03-22 18:25 ` Jeff Sturm
0 siblings, 2 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-03-22 11:41 UTC (permalink / raw)
To: 'Michael Smith', Jeff Sturm
Cc: Boehm, Hans, 'Bryce McKinlay ',
''java@gcc.gnu.org' '
> From: Michael Smith [mailto:msmith@spinnakernet.com]
> I've been seeing this as well and have been having difficulty tracking
> it down. After seeing this message, I'm starting to think this may be
> the cause.
>
> I have an open FileInputStream to /dev/urandom. An "ls -l" of
> "/proc/<pid>/fd" shows file descriptor 51 linked to /dev/urandom. Upon
> a seemingly innocuous command in my app (doesn't touch anything near the
> FileInputStream), file descriptor 51 no longer exists in /proc/<pid>/fd.
> Sometimes it exists as a socket instead.
>
> When I turned on GC_PRINT_STATS, I noticed this seemingly innocuous
> command involves a garbage collection. When I started the application
> with an enormous initial heap size, the GC did not occur, and the fd for
> /dev/urandom stays around.
Do you know how the FileInputStream is referenced? Is it only referenced
from the stack? Does the reference chain go through large objects,
particularly through objects which have pointers at displacements > 100 or
so?
>
>
> Hans, do you have enough information about the mark descriptor
> clobbering to recommend a workaround even if you don't know exactly what
> the problem is? For example, are you reasonably confident that building
> with THREAD_LOCAL_ALLOC or USE_PTHREAD_SPECIFIC or something else will
> eliminate the problem (and not just "hide" it like GC_IGNORE_GCJ_INFO
> seems to do)?
>
Assuming it appears to work, I have the greatest confidence in
GC_IGNORE_GCJ_INFO. This sidesteps a lot of subtle issues in the collector,
at the expense of possibly retaining extraneous memory. But without fully
understanding the problem, it's hard to say for sure. Turning off
THREAD_LOCAL_ALLOC would be my next choice. My current assumption is that
Jeff was seeing a different problem.
I'm currently in the middle of tracking down the (n+1)st Itanium stack
unwinding issue. I'll get back to trying to reproduce this after that. If
someone else has a similar failure report, especially with older glibc
versions, that would be helpful.
Hans
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-21 10:59 ` Jeff Sturm
@ 2002-03-22 11:00 ` Michael Smith
0 siblings, 0 replies; 36+ messages in thread
From: Michael Smith @ 2002-03-22 11:00 UTC (permalink / raw)
To: Jeff Sturm
Cc: Boehm, Hans, 'Bryce McKinlay ',
''java@gcc.gnu.org' '
Jeff Sturm wrote:
> On Thu, 21 Mar 2002, Boehm, Hans wrote:
>>You should be able to get the same effect with the GC_IGNORE_GCJ_INFO
>>environment variable. (See boehm-gc/doc/README.environment.) I would still
>>really like to track this down, though.
>
>
> Thanks; that'll come in handy.
>
>
>>2) The mark descriptor was clobbered. This can be a consequence of a
>>premature object collection, due to any sort of collector bug. Since it's
>>most likely to be overwritten by a pointer, this tends to make the object
>>look VERY large.
>
>
> I strongly suspect premature collection. Another clue is that I'm
> getting files closed prematurely at random. That would happen if the
> finalizer ran while the file is in use.
I've been seeing this as well and have been having difficulty tracking
it down. After seeing this message, I'm starting to think this may be
the cause.
I have an open FileInputStream to /dev/urandom. An "ls -l" of
"/proc/<pid>/fd" shows file descriptor 51 linked to /dev/urandom. Upon
a seemingly innocuous command in my app (doesn't touch anything near the
FileInputStream), file descriptor 51 no longer exists in /proc/<pid>/fd.
Sometimes it exists as a socket instead.
When I turned on GC_PRINT_STATS, I noticed this seemingly innocuous
command involves a garbage collection. When I started the application
with an enormous initial heap size, the GC did not occur, and the fd for
/dev/urandom stays around.
Hans, do you have enough information about the mark descriptor
clobbering to recommend a workaround even if you don't know exactly what
the problem is? For example, are you reasonably confident that building
with THREAD_LOCAL_ALLOC or USE_PTHREAD_SPECIFIC or something else will
eliminate the problem (and not just "hide" it like GC_IGNORE_GCJ_INFO
seems to do)?
thanks,
michael
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-21 10:24 Boehm, Hans
@ 2002-03-21 10:59 ` Jeff Sturm
2002-03-22 11:00 ` Michael Smith
0 siblings, 1 reply; 36+ messages in thread
From: Jeff Sturm @ 2002-03-21 10:59 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Bryce McKinlay ', ''Michael Smith' ',
''java@gcc.gnu.org' '
On Thu, 21 Mar 2002, Boehm, Hans wrote:
> You should be able to get the same effect with the GC_IGNORE_GCJ_INFO
> environment variable. (See boehm-gc/doc/README.environment.) I would still
> really like to track this down, though.
Thanks; that'll come in handy.
> 2) The mark descriptor was clobbered. This can be a consequence of a
> premature object collection, due to any sort of collector bug. Since it's
> most likely to be overwritten by a pointer, this tends to make the object
> look VERY large.
I strongly suspect premature collection. Another clue is that I'm getting
files closed prematurely at random. That would happen if the finalizer
ran while the file is in use.
Jeff
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-21 10:24 Boehm, Hans
2002-03-21 10:59 ` Jeff Sturm
0 siblings, 1 reply; 36+ messages in thread
From: Boehm, Hans @ 2002-03-21 10:24 UTC (permalink / raw)
To: 'Jeff Sturm', Boehm, Hans
Cc: 'Bryce McKinlay ', ''Michael Smith' ',
''java@gcc.gnu.org' '
> From: Jeff Sturm [mailto:jsturm@one-point.com]
>
> On Wed, 20 Mar 2002, Boehm, Hans wrote:
> > GC_greatest_plausible_heap_addr is a very rough upper bound
> on the heap.
>
> OK, I was wondering because in my case it's awfully far from the
> heap, there's something like a 40MB hole in between the highest heap
> address and GC_greatest_plausible_heap_addr.
I think that's normal.
>
> I can do that again. For now I've rebuilt libgcj with conservative
> marking (no descriptors), and the failure I reported has vanished.
You should be able to get the same effect with the GC_IGNORE_GCJ_INFO
environment variable. (See boehm-gc/doc/README.environment.) I would still
really like to track this down, though.
>
> When marking from a descriptor, what tests are done to ensure
> a pointer is
> within heap boundaries?
>
That test should be accurate. The approximate bounds are used only to
eliminate obvious nonpointers. It looks up the header (descriptor) for the
block (roughly page) the object resides in. If it's not a valid object
pointer, the object isn't pushed on the mark stack. The reasons for
segmentation faults in the marker are generally one of:
1) The collector is confused about where the roots are, e.g. because there's
an unexpected hole in the main data segment, it's confused about the
boundaries of the main data segment, or it got the wrong stack base address.
This is typically a porting problem.
2) The mark descriptor was clobbered. This can be a consequence of a
premature object collection, due to any sort of collector bug. Since it's
most likely to be overwritten by a pointer, this tends to make the object
look VERY large.
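(A sketch of the two-stage test described above, with simplified names;
the real collector uses GC_least/greatest_plausible_heap_addr and
per-block headers:)

    extern void *least_plausible_heap_addr;
    extern void *greatest_plausible_heap_addr;
    extern struct hdr *lookup_header(void *p);  /* hypothetical: returns 0 */
                                                /* if p is not in a block  */
                                                /* managed by the collector */

    int plausible_object_pointer(void *p)
    {
        /* Stage 1: cheap bounds check; filters only obvious nonpointers. */
        if (p <= least_plausible_heap_addr
            || p >= greatest_plausible_heap_addr)
            return 0;
        /* Stage 2: accurate check via the block header.  A value that */
        /* passes stage 1 but fails here is what gets blacklisted.     */
        return lookup_header(p) != 0;
    }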
Hans
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 22:10 Boehm, Hans
@ 2002-03-21 10:00 ` Jeff Sturm
0 siblings, 0 replies; 36+ messages in thread
From: Jeff Sturm @ 2002-03-21 10:00 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Bryce McKinlay ', ''Michael Smith' ',
''java@gcc.gnu.org' '
On Wed, 20 Mar 2002, Boehm, Hans wrote:
> GC_greatest_plausible_heap_addr is a very rough upper bound on the heap.
OK, I was wondering because in my case it's awfully far from the
heap, there's something like a 40MB hole in between the highest heap
address and GC_greatest_plausible_heap_addr.
> If you see another crash/hang, it's useful to call GC_dump() from the
> debugger. That also dumps the root segments as well as some other stuff.
I can do that again. For now I've rebuilt libgcj with conservative
marking (no descriptors), and the failure I reported has vanished.
When marking from a descriptor, what tests are done to ensure a pointer is
within heap boundaries?
Jeff
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 20:35 ` Jeff Sturm
@ 2002-03-20 22:58 ` Bryce McKinlay
0 siblings, 0 replies; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-20 22:58 UTC (permalink / raw)
To: Jeff Sturm
Cc: Boehm, Hans, 'Michael Smith', 'java@gcc.gnu.org'
Jeff Sturm wrote:
>On Thu, 21 Mar 2002, Bryce McKinlay wrote:
>
>>Its a 650Mhz mobile Pentium 3, redhat kernel 2.4.9-13, redhat
>>glibc-2.2.4-19.3. I can try it on some other machines later to be sure.
>>
>
>I'm having trouble reproducing the GCTest failure. My configuration is
>similar to Bryce's, though with glibc-2.2.4-13 and kernel 2.4.7-10 on a
>PIII.
>
>It appeared to hang the very first time I ran it, and never again
>thereafter.
>
I confirmed that I can also reproduce the problem on a 2x 450MHz Celeron
machine, this time with a 2001-12-03 GCC. It is harder to reproduce on
that machine, only failing about 1 in 3 attempts rather than nearly 100%
on the mobile P3. But here's the thing: if you run another task on the
machine (e.g. a long-running benchmark), it seems to greatly increase the
chance of the race occurring. It almost never happens on an otherwise
idle machine!
regards
Bryce.
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-20 22:10 Boehm, Hans
2002-03-21 10:00 ` Jeff Sturm
0 siblings, 1 reply; 36+ messages in thread
From: Boehm, Hans @ 2002-03-20 22:10 UTC (permalink / raw)
To: 'Jeff Sturm ', 'Bryce McKinlay '
Cc: Boehm, Hans, ''Michael Smith' ',
''java@gcc.gnu.org' '
GC_greatest_plausible_heap_addr is a very rough upper bound on the heap.
There will normally be unmapped addresses between it and the heap. It's
used primarily for blacklisting. If we see something within the extreme
bounds but not on a proper heap page, we keep track that that page is likely
to contain a false reference.
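(A sketch of the bookkeeping that implies, with hypothetical names; the
real collector's black lists are more elaborate:)

    #define LOG_PAGE_SIZE 12
    #define BL_TABLE_SIZE 4096

    /* One flag per (hashed) page: set when a value inside the plausible */
    /* bounds turns out not to lie in any managed block, so the          */
    /* allocator can avoid placing new objects on that page.             */
    static unsigned char blacklist[BL_TABLE_SIZE];

    static unsigned page_index(void *p)
    {
        return (unsigned)(((unsigned long)p >> LOG_PAGE_SIZE)
                          % BL_TABLE_SIZE);
    }

    void note_false_reference(void *p) { blacklist[page_index(p)] = 1; }
    int  page_is_blacklisted(void *p)  { return blacklist[page_index(p)]; }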
If you see another crash/hang, it's useful to call GC_dump() from the
debugger. That also dumps the root segments as well as some other stuff.
Hans
-----Original Message-----
From: Jeff Sturm
...
On further inspection, the problem I reported probably has nothing to
do with THREAD_LOCAL_ALLOC. I'm still looking at it.
GC_greatest_plausible_heap_addr looks suspicious though... is there
supposed to be any assurance addresses below this are mapped?
Jeff
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 17:54 ` Bryce McKinlay
@ 2002-03-20 20:35 ` Jeff Sturm
2002-03-20 22:58 ` Bryce McKinlay
0 siblings, 1 reply; 36+ messages in thread
From: Jeff Sturm @ 2002-03-20 20:35 UTC (permalink / raw)
To: Bryce McKinlay
Cc: Boehm, Hans, 'Michael Smith', 'java@gcc.gnu.org'
On Thu, 21 Mar 2002, Bryce McKinlay wrote:
> It's a 650MHz mobile Pentium 3, redhat kernel 2.4.9-13, redhat
> glibc-2.2.4-19.3. I can try it on some other machines later to be sure.
I'm having trouble reproducing the GCTest failure. My configuration is
similar to Bryce's, though with glibc-2.2.4-13 and kernel 2.4.7-10 on a
PIII.
It appeared to hang the very first time I ran it, and never again
thereafter.
On further inspection, the problem I reported probably has nothing to
do with THREAD_LOCAL_ALLOC. I'm still looking at it.
GC_greatest_plausible_heap_addr looks suspicious though... is there
supposed to be any assurance addresses below this are mapped?
Jeff
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 12:13 ` Jeff Sturm
2002-03-20 14:42 ` Tom Tromey
2002-03-20 14:51 ` Bryce McKinlay
@ 2002-03-20 19:32 ` Loren James Rittle
2 siblings, 0 replies; 36+ messages in thread
From: Loren James Rittle @ 2002-03-20 19:32 UTC (permalink / raw)
To: java; +Cc: rittle
>On Wed, 20 Mar 2002, Bryce McKinlay wrote:
>> It runs fine without THREAD_LOCAL_ALLOC.
> You're having better luck than I am, then. I did get random, infrequent
> crashes in my application. But without THREAD_LOCAL_ALLOC I don't get
> that far:
> 0x40c45363 in GC_mark_from (mark_stack_top=0x805b0e8,
> mark_stack=0x805b000,
> mark_stack_limit=0x8063000) at ../../../boehm-gc/mark.c:654
> 654 deferred = *limit;
Based on my experience porting threaded boehm-gc to i386-*-freebsd*
(thanks really to Hans for further generalizing thread support before
my final required tweaks went in), I have developed unposted patches
required to get threaded boehm-gc on alpha*-*-freebsd* working (at
least I thought they would). Neither port uses THREAD_LOCAL_ALLOC to
my knowledge. i386-*-freebsd* is working. The new alpha port is not.
Two days ago, I saw this exact crash point on alpha*-*-freebsd* using
a blend of configuration from i386-*-freebsd* (for per-OS stuff) and
other supported alpha targets (for per-CPU stuff). I hadn't reported
it yet since I thought it was something boneheaded I had done... ;-)
...and I hadn't found time to fully debug it yet. I still haven't
debugged it yet but I feel better about reporting it now.
> (gdb) p limit
> $1 = (word *) 0x86e90c4
If memory serves me, ``p limit'' printed an address as you see.
``p *limit'' failed since no memory was actually mapped at that point
(i.e. why it SEGVs upon *limit).
> the address isn't mapped:
>
> (gdb) call GC_print_heap_sects()
> Total heap size: 5672960
> Section 0 from 0x806b000 to 0x807b000 2/16 blacklisted
[...]
>
> (gdb) p GC_greatest_plausible_heap_addr
> $2 = 0xaed9000
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 17:03 Boehm, Hans
@ 2002-03-20 17:54 ` Bryce McKinlay
2002-03-20 20:35 ` Jeff Sturm
2002-03-26 15:45 ` Tom Tromey
1 sibling, 1 reply; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-20 17:54 UTC (permalink / raw)
To: Boehm, Hans
Cc: 'Michael Smith', 'java@gcc.gnu.org',
'Jeff Sturm'
Boehm, Hans wrote:
>For Bryce and Jeff:
>
>1) How was gcj configured?
>
$ gcc -v
Reading specs from
/home/bryce/gcc/bin/../lib/gcc-lib/i686-pc-linux-gnu/3.1/specs
Configured with: ../configure --prefix=/home/bryce/gcc31
--enable-threads --enable-languages=c++,java : (reconfigured)
Thread model: posix
gcc version 3.1 20020320 (prerelease)
>2) What compiler was it built with?
>
GCC was built with itself. The problem occurs when the GC is built with
or without -O, with both mainline and the 3.1 branch.
>3) Does it appear that this problem was recently introduced (and thus
>different from Michael Smith's)? (My 3.1 tree is a few days old, and I
>built with a stable compiler.)
>
The test case certainly used to work, but I don't know if I ever tried
it since thread local alloc was turned on.
>4) What was the machine configuration, e.g. which X86 processor(s) was/were
>used? Is it reproducible on older processors? (I tried a plain Pentium, a
>Pentium II, and a 4xPPro machine, all of which are rather old.)
>
It's a 650MHz mobile Pentium 3, redhat kernel 2.4.9-13, redhat
glibc-2.2.4-19.3. I can try it on some other machines later to be sure.
regards
Bryce.
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-20 17:03 Boehm, Hans
2002-03-20 17:54 ` Bryce McKinlay
2002-03-26 15:45 ` Tom Tromey
0 siblings, 2 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-03-20 17:03 UTC (permalink / raw)
To: Boehm, Hans, 'Michael Smith', 'Bryce McKinlay'
Cc: 'java@gcc.gnu.org', 'Jeff Sturm'
I can't reproduce the problem on X86 either. Questions:
0) Do other people see similar problems?
For Bryce and Jeff:
1) How was gcj configured?
2) What compiler was it built with?
3) Does it appear that this problem was recently introduced (and thus
different from Michael Smith's)? (My 3.1 tree is a few days old, and I
built with a stable compiler.)
4) What was the machine configuration, e.g. which X86 processor(s) was/were
used? Is it reproducible on older processors? (I tried a plain Pentium, a
Pentium II, and a 4xPPro machine, all of which are rather old.)
5) Which Linux distribution? Was this a standard RedHat kernel? Any danger
the main stack starts at 0xffffffff? Anything weird about the kernel?
Hans
> -----Original Message-----
> From: Boehm, Hans
> Sent: Wednesday, March 20, 2002 2:03 PM
> To: 'Michael Smith'; Bryce McKinlay
> Cc: java@gcc.gnu.org; Boehm, Hans
> Subject: RE: GC failure w/ THREAD_LOCAL_ALLOC ?
>
>
> I just tried Bryce's test on an Itanium here, since I had a
> prebuilt gcj 3.1. It uses the stock CVS garbage collector.
> I couldn't get it to fail. I will try on X86, though that
> will take a bit longer.
>
> Hans
>
> > -----Original Message-----
> > From: Michael Smith [mailto:msmith@spinnakernet.com]
> > Sent: Wednesday, March 20, 2002 10:15 AM
> > To: Bryce McKinlay
> > Cc: java@gcc.gnu.org; Boehm, Hans
> > Subject: Re: GC failure w/ THREAD_LOCAL_ALLOC ?
> >
> >
> > Bryce McKinlay wrote:
> > > While testing thread local allocation on PowerPC, I ran into a problem
> > > which is also reproducible on x86. The attached stress-test-case
> > > GCTest.java will lock up with ~100% reproducibility with
> > > THREAD_LOCAL_ALLOC enabled. It runs fine without THREAD_LOCAL_ALLOC.
> > >
> > > What I am seeing in the debugger is most threads waiting in
> > > GC_suspend_handler, but one thread segfaulting in GC_mark_read.
> > > libjava's segv handler gets called and the collector is re-entered
> > > during the stack trace, causing the freeze.
> >
> > I actually ran into this problem in my application 2 months ago (using
> > gcc version 3.1 20010911 (experimental)), and reported it to Hans. I
> > couldn't water down my application to create such a simple test case, so
> > tracking it down was somewhat difficult.
> >
> > From the stack trace I provided back in January, Hans initially
> > responded with:
> >
> > Hans Boehm wrote:
> > > I'm not terribly worried about the SIGSEGV getting turned into a
> > > deadlock. Such things seem to be largely unavoidable.
> > >
> > > I would like to understand where the SIGSEGV is coming from. Typically
> > > a failure here is caused by a bogus object descriptor. This may
> > > happen because something was overwritten by client code, or because
> > > there's an undiscovered bug in the GC, or in the gcj generated
> > > descriptor.
> >
> > With some further pointers, it turns out there _was_ a bogus object
> > descriptor. At my last contact with Hans, he suspected the problem was
> > related to THREAD_LOCAL_ALLOC, but was unable to find any likely
> > problems when reviewing the code. Here's an excerpt:
> >
> > Hans Boehm wrote:
> > > I spent a bit of time:
> > >
> > > - Staring at the thread-specific-storage implementation, and
> > >
> > > - adding some tests for thread-local allocation to gctest.
> > >
> > > The new tests failed to make the problem reproducible here.
> > >
> > > I cleaned up a few things. The only thing substantive I found was
> > > that specific.c could fail if one of the thread stacks ended up at the
> > > extreme high end of the address space, i.e. if 0xfffff000 is the
> > > address of a valid stack page. Are you configuring your kernel in
> > > some nonstandard way, e.g. to maximize virtual address space?
> > > Otherwise this seems unlikely to account for the problem, since that's
> > > normally kernel address space on Linux/X86, as I recall. (I vaguely
> > > recall that Mandrake Linux might do something strange in this area.)
> >
> > Hans sent me new versions of specific.c and specific.h to fix the above
> > mentioned problem (thread stacks at the high end of the address space),
> > but I never had the chance to try them out. I had a workaround that
> > made the problem go away for me, and other work priorities are
> > preventing me from continuing to dig into the issue.
> >
> > My workarounds were to increase the initial heap size of my application
> > (reducing the required garbage collections), and turning on
> > GC_IGNORE_GCJ_INFO (which I had to add to gcj's version of the collector
> > since it was added after the version I am using). Neither of which
> > really "fixes" the problem though. They just make it much more unlikely
> > that I'll hit the problem (I haven't since then).
> >
> > regards,
> > michael
> >
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 14:51 ` Bryce McKinlay
@ 2002-03-20 15:07 ` Jeff Sturm
0 siblings, 0 replies; 36+ messages in thread
From: Jeff Sturm @ 2002-03-20 15:07 UTC (permalink / raw)
To: Bryce McKinlay; +Cc: java
On Thu, 21 Mar 2002, Bryce McKinlay wrote:
> Did you reconfigure libjava and touch things like boehm.cc, boehm.h, etc
> to make sure that all the traces of THREAD_LOCAL_ALLOC get removed?
I removed the entire build tree, twice. Is there anything to change
besides boehm-gc/configure.in?
I also made sure libgcj.so actually got installed, as I know others have
had problems with that lately.
Jeff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 14:42 ` Tom Tromey
@ 2002-03-20 15:03 ` Jeff Sturm
0 siblings, 0 replies; 36+ messages in thread
From: Jeff Sturm @ 2002-03-20 15:03 UTC (permalink / raw)
To: Tom Tromey; +Cc: Bryce McKinlay, java, Boehm, Hans
On 20 Mar 2002, Tom Tromey wrote:
> Yikes! I've been thinking that gcj 3.1 was, even right now, the best
> one yet. So I'm really surprised to hear this.
Certainly 3.1 will be the most complete. As for bugs, well, hard to
say... each application stresses the compiler/runtime in different ways,
so what might seem perfect to one will be buggy to another.
> Is the GC problem the only one?
I don't understand this one yet. I might've done something stupid, but
other than turning off THREAD_LOCAL_ALLOC this is a pretty normal build.
> Do you know of other regressions?
Problems with java.net.Socket, for which I'm going to check in a partial
fix soon. I don't know if we have any good testsuites for this... haven't
checked Mauve lately.
> Are there PRs you consider critical?
Due to java/5986 I can't compile from bytecode at all. I first thought
this was a fairly isolated case but I've found others in which simply
compiling from source rather than bytecode corrected the problem.
I can't say for sure this is a regression though... the test case in that
PR failed on 3.0.4 too.
> I do have some time right now for dealing with 3.1 release issues. I
> don't know how long this time will last, so I'm trying to strike while
> the iron is hot. If you could submit PRs for the serious
> (regression-level) problems you know of, and send me the numbers, that
> would help a lot.
Thanks. I have very limited time right now, but I'd like to be able to
use 3.1 for production, so I have an interest in stabilizing it too.
Unfortunately it's hard to say what must be fixed until I understand the
problem... frequently it takes more effort to file a good PR than to
actually fix it. The remaining issues (GC, EH) are likely to be difficult
to isolate. I'll do what I can.
Jeff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 12:13 ` Jeff Sturm
2002-03-20 14:42 ` Tom Tromey
@ 2002-03-20 14:51 ` Bryce McKinlay
2002-03-20 15:07 ` Jeff Sturm
2002-03-20 19:32 ` Loren James Rittle
2 siblings, 1 reply; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-20 14:51 UTC (permalink / raw)
To: Jeff Sturm; +Cc: java
Jeff Sturm wrote:
>On Wed, 20 Mar 2002, Bryce McKinlay wrote:
>
>>It runs fine without THREAD_LOCAL_ALLOC.
>>
>
>You're having better luck than I am, then. I did get random, infrequent
>crashes in my application. But without THREAD_LOCAL_ALLOC I don't get
>that far:
>
>0x40c45363 in GC_mark_from (mark_stack_top=0x805b0e8, mark_stack=0x805b000,
>    mark_stack_limit=0x8063000) at ../../../boehm-gc/mark.c:654
>654         deferred = *limit;
>
Did you reconfigure libjava and touch things like boehm.cc, boehm.h, etc
to make sure that all the traces of THREAD_LOCAL_ALLOC get removed?
regards
Bryce.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-20 12:13 ` Jeff Sturm
@ 2002-03-20 14:42 ` Tom Tromey
2002-03-20 15:03 ` Jeff Sturm
2002-03-20 14:51 ` Bryce McKinlay
2002-03-20 19:32 ` Loren James Rittle
2 siblings, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2002-03-20 14:42 UTC (permalink / raw)
To: Jeff Sturm; +Cc: Bryce McKinlay, java, Boehm, Hans
>>>>> "Jeff" == Jeff Sturm <jsturm@one-point.com> writes:
Jeff> I give up for now, other than to note that 3.1 isn't ready, at
Jeff> least for my uses.
Yikes! I've been thinking that gcj 3.1 was, even right now, the best
one yet. So I'm really surprised to hear this.
Is the GC problem the only one? Do you know of other regressions?
Are there PRs you consider critical?
I do have some time right now for dealing with 3.1 release issues. I
don't know how long this time will last, so I'm trying to strike while
the iron is hot. If you could submit PRs for the serious
(regression-level) problems you know of, and send me the numbers, that
would help a lot.
Tom
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-20 14:03 Boehm, Hans
0 siblings, 0 replies; 36+ messages in thread
From: Boehm, Hans @ 2002-03-20 14:03 UTC (permalink / raw)
To: 'Michael Smith', Bryce McKinlay; +Cc: java, Boehm, Hans
I just tried Bryce's test on an Itanium here, since I had a prebuilt gcj
3.1. It uses the stock CVS garbage collector. I couldn't get it to fail.
I will try on X86, though that will take a bit longer.
Hans
> -----Original Message-----
> From: Michael Smith [mailto:msmith@spinnakernet.com]
> Sent: Wednesday, March 20, 2002 10:15 AM
> To: Bryce McKinlay
> Cc: java@gcc.gnu.org; Boehm, Hans
> Subject: Re: GC failure w/ THREAD_LOCAL_ALLOC ?
>
>
> Bryce McKinlay wrote:
> > While testing thread local allocation on PowerPC, I ran into a
> > problem which is also reproducible on x86. The attached
> > stress-test-case GCTest.java will lock up with ~100% reproducibility
> > with THREAD_LOCAL_ALLOC enabled. It runs fine without
> > THREAD_LOCAL_ALLOC.
> >
> > What I am seeing in the debugger is most threads waiting in
> > GC_suspend_handler, but one thread segfaulting in GC_mark_from.
> > libjava's segv handler gets called and the collector is re-entered
> > during the stack trace, causing the freeze.
>
> I actually ran into this problem in my application 2 months ago (using
> gcc version 3.1 20010911 (experimental)), and reported it to Hans. I
> couldn't water down my application to create such a simple test case,
> so tracking it down was somewhat difficult.
>
> From the stack trace I provided back in January, Hans initially
> responded with:
>
> Hans Boehm wrote:
> > I'm not terribly worried about the SIGSEGV getting turned into a
> > deadlock. Such things seem to be largely unavoidable.
> >
> > I would like to understand where the SIGSEGV is coming from.
> > Typically a failure here is caused by a bogus object descriptor.
> > This may happen because something was overwritten by client code, or
> > because there's an undiscovered bug in the GC, or in the gcj
> > generated descriptor.
>
> With some further pointers, it turns out there _was_ a bogus object
> descriptor. At my last contact with Hans, he suspected the problem was
> related to THREAD_LOCAL_ALLOC, but was unable to find any likely
> problems when reviewing the code. Here's an excerpt:
>
> Hans Boehm wrote:
> > I spent a bit of time:
> >
> > - Staring at the thread-specific-storage implementation, and
> >
> > - adding some tests for thread-local allocation to gctest.
> >
> > The new tests failed to make the problem reproducible here.
> >
> > I cleaned up a few things. The only thing substantive I found was
> > that specific.c could fail if one of the thread stacks ended up at
> > the extreme high end of the address space, i.e. if 0xfffff000 is the
> > address of a valid stack page. Are you configuring your kernel in
> > some nonstandard way, e.g. to maximize virtual address space?
> > Otherwise this seems unlikely to account for the problem, since
> > that's normally kernel address space on Linux/X86, as I recall. (I
> > vaguely recall that Mandrake Linux might do something strange in
> > this area.)
>
> Hans sent me new versions of specific.c and specific.h to fix the
> above-mentioned problem (thread stacks at the high end of the address
> space), but I never had the chance to try them out. I had a workaround
> that made the problem go away for me, and other work priorities are
> preventing me from continuing to dig into the issue.
>
> My workarounds were to increase the initial heap size of my
> application (reducing the required garbage collections) and to turn on
> GC_IGNORE_GCJ_INFO (which I had to add to gcj's version of the
> collector, since it was added after the version I am using). Neither
> of these really "fixes" the problem, though; they just make it much
> less likely that I'll hit the problem (I haven't since then).
>
> regards,
> michael
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-19 19:30 Bryce McKinlay
2002-03-20 10:19 ` Michael Smith
@ 2002-03-20 12:13 ` Jeff Sturm
2002-03-20 14:42 ` Tom Tromey
` (2 more replies)
1 sibling, 3 replies; 36+ messages in thread
From: Jeff Sturm @ 2002-03-20 12:13 UTC (permalink / raw)
To: Bryce McKinlay; +Cc: java, Boehm, Hans
On Wed, 20 Mar 2002, Bryce McKinlay wrote:
> It runs fine without THREAD_LOCAL_ALLOC.
You're having better luck than I am, then. I did get random, infrequent
crashes in my application. But without THREAD_LOCAL_ALLOC I don't get
that far:
0x40c45363 in GC_mark_from (mark_stack_top=0x805b0e8, mark_stack=0x805b000,
    mark_stack_limit=0x8063000) at ../../../boehm-gc/mark.c:654
654         deferred = *limit;
(gdb) p limit
$1 = (word *) 0x86e90c4
the address isn't mapped:
(gdb) call GC_print_heap_sects()
Total heap size: 5672960
Section 0 from 0x806b000 to 0x807b000 2/16 blacklisted
Section 1 from 0x808b000 to 0x809b000 0/16 blacklisted
Section 2 from 0x809d000 to 0x80ad000 0/16 blacklisted
Section 3 from 0x80ad000 to 0x80be000 0/17 blacklisted
Section 4 from 0x80d9000 to 0x80f3000 0/26 blacklisted
Section 5 from 0x80f3000 to 0x8112000 0/31 blacklisted
Section 6 from 0x8131000 to 0x815a000 0/41 blacklisted
Section 7 from 0x815a000 to 0x8191000 0/55 blacklisted
Section 8 from 0x8191000 to 0x81fe000 0/109 blacklisted
Section 9 from 0x8268000 to 0x82d6000 0/110 blacklisted
Section 10 from 0x82f4000 to 0x8386000 1/146 blacklisted
Section 11 from 0x8386000 to 0x8449000 1/195 blacklisted
Section 12 from 0x8459000 to 0x855d000 0/260 blacklisted
Section 13 from 0x857e000 to 0x86d9000 15/347 blacklisted
(gdb) p GC_greatest_plausible_heap_addr
$2 = 0xaed9000
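The check gdb is doing there can be approximated in code against the two
bounds pointers the collector exports from gc.h. A minimal sketch (my
addition, not from the thread); note the bounds are deliberately loose,
so a hit only means a pointer is plausible, not that its page is mapped,
which is exactly the situation for 0x86e90c4 above (past the last heap
section at 0x86d9000, but below GC_greatest_plausible_heap_addr):

/* Sketch of the bounds check above. Conservative: a "yes" means only
 * that p is a plausible heap pointer, not that its page is mapped. */
#include "gc.h"

static int plausibly_in_heap(void *p)
{
    return p > GC_least_plausible_heap_addr
        && p < GC_greatest_plausible_heap_addr;
}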
I give up for now, other than to note that 3.1 isn't ready, at least for
my uses. I can still use 3.0.
Jeff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: GC failure w/ THREAD_LOCAL_ALLOC ?
2002-03-19 19:30 Bryce McKinlay
@ 2002-03-20 10:19 ` Michael Smith
2002-03-20 12:13 ` Jeff Sturm
1 sibling, 0 replies; 36+ messages in thread
From: Michael Smith @ 2002-03-20 10:19 UTC (permalink / raw)
To: Bryce McKinlay; +Cc: java, Boehm, Hans
Bryce McKinlay wrote:
> While testing thread local allocation on PowerPC, I ran into a problem
> which is also reproducible on x86. The attached stress-test-case
> GCTest.java will lock up with ~100% reproducibility with
> THREAD_LOCAL_ALLOC enabled. It runs fine without THREAD_LOCAL_ALLOC.
>
> What I am seeing in the debugger is most threads waiting in
> GC_suspend_handler, but one thread segfaulting in GC_mark_from.
> libjava's segv handler gets called and the collector is re-entered
> during the stack trace, causing the freeze.
I actually ran into this problem in my application 2 months ago (using
gcc version 3.1 20010911 (experimental)), and reported it to Hans. I
couldn't water down my application to create such a simple test case, so
tracking it down was somewhat difficult.
From the stack trace I provided back in January, Hans initially
responded with:
Hans Boehm wrote:
> I'm not terribly worried about the SIGSEGV getting turned into a
> deadlock. Such things seem to be largely unavoidable.
>
> I would like to understand where the SIGSEGV is coming from. Typically
> a failure here is caused by a bogus object descriptor. This may
> happen because something was overwritten by client code, or because
> there's an undiscovered bug in the GC, or in the gcj generated
> descriptor.
With some further pointers, it turns out there _was_ a bogus object
descriptor. At my last contact with Hans, he suspected the problem was
related to THREAD_LOCAL_ALLOC, but was unable to find any likely
problems when reviewing the code. Here's an excerpt:
Hans Boehm wrote:
> I spent a bit of time:
>
> - Staring at the thread-specific-storage implementation, and
>
> - adding some tests for thread-local allocation to gctest.
>
> The new tests failed to make the problem reproducible here.
>
> I cleaned up a few things. The only thing substantive I found was
> that specific.c could fail if one of the thread stacks ended up at the
> extreme high end of the address space, i.e. if 0xfffff000 is the
> address of a valid stack page. Are you configuring your kernel in
> some nonstandard way, e.g. to maximize virtual address space?
> Otherwise this seems unlikely to account for the problem, since that's
> normally kernel address space on Linux/X86, as I recall. (I vaguely
> recall that Mandrake Linux might do something strange in this area.)
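To make the suspected failure mode concrete, here is a guess at the
mechanism, not the actual specific.h code: if the quick thread id is
derived from the stack pointer's page number, and the reserved
"invalid" sentinel happens to coincide with the topmost page, a thread
whose stack sits at 0xfffff000 collides with the sentinel:

/* Hypothetical illustration -- not the real specific.h definitions.
 * If the per-thread cache key is the stack page, and the empty-slot
 * sentinel equals the topmost page's id, a thread whose stack lives
 * at 0xfffff000 can never be found in the cache. */
typedef unsigned long word;

#define PAGE_LOG     12
#define INVALID_QTID (((word)0xfffff000) >> PAGE_LOG)   /* == 0xfffff */
#define quick_thread_id(sp) (((word)(sp)) >> PAGE_LOG)

/* quick_thread_id(0xfffff123) == INVALID_QTID: the collision Hans
 * describes, which the reworked specific.h is said to avoid. */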
Hans sent me new versions of specific.c and specific.h to fix the
above-mentioned problem (thread stacks at the high end of the address space),
but I never had the chance to try them out. I had a workaround that
made the problem go away for me, and other work priorities are
preventing me from continuing to dig into the issue.
My workarounds were to increase the initial heap size of my application
(reducing the required garbage collections) and to turn on
GC_IGNORE_GCJ_INFO (which I had to add to gcj's version of the collector,
since it was added after the version I am using). Neither of these
really "fixes" the problem, though; they just make it much less likely
that I'll hit the problem (I haven't since then).
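For reference, the two workarounds look roughly like this from C. This
is a sketch (my addition), assuming a collector recent enough to honor
GC_IGNORE_GCJ_INFO, which is read from the environment when the gcj
allocator initializes; the 64 MB figure is an arbitrary example, and
GC_init/GC_expand_hp are the standard gc.h entry points:

/* Rough sketch of both workarounds. GC_IGNORE_GCJ_INFO must be set
 * before the collector initializes the gcj allocator. */
#include <stdlib.h>
#include "gc.h"

void apply_workarounds(void)
{
    /* Ignore gcj type descriptors; objects are then scanned
     * conservatively, sidestepping any bogus descriptor. */
    setenv("GC_IGNORE_GCJ_INFO", "1", 1);

    GC_init();

    /* Pre-grow the heap (64 MB is an arbitrary example) so collections
     * run far less often and the race is far less likely to be hit. */
    GC_expand_hp(64 * 1024 * 1024);
}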
regards,
michael
^ permalink raw reply [flat|nested] 36+ messages in thread
* GC failure w/ THREAD_LOCAL_ALLOC ?
@ 2002-03-19 19:30 Bryce McKinlay
2002-03-20 10:19 ` Michael Smith
2002-03-20 12:13 ` Jeff Sturm
0 siblings, 2 replies; 36+ messages in thread
From: Bryce McKinlay @ 2002-03-19 19:30 UTC (permalink / raw)
To: java; +Cc: Boehm, Hans
[-- Attachment #1: Type: text/plain, Size: 3437 bytes --]
While testing thread local allocation on PowerPC, I ran into a problem
which is also reproducible on x86. The attached stress-test-case
GCTest.java will lock up with ~100% reproducibility with
THREAD_LOCAL_ALLOC enabled. It runs fine without THREAD_LOCAL_ALLOC.
What I am seeing in the debugger is most threads waiting in
GC_suspend_handler, but one thread segfaulting in GC_mark_from.
libjava's segv handler gets called and the collector is re-entered
during the stack trace, causing the freeze.
Here's an example:
(gdb) thread 13
[Switching to thread 13 (Thread 11276 (LWP 25862))]#0 0x40632b85 in
    __sigsuspend (set=0x41d810d4) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
45      in ../sysdeps/unix/sysv/linux/sigsuspend.c
(gdb) bt
#0 0x40632b85 in __sigsuspend (set=0x41d810d4)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1 0x405f41c9 in __pthread_wait_for_restart_signal (self=0x41d81be0)
    at pthread.c:969
#2 0x405f5f09 in __pthread_alt_lock (lock=0x40533f88, self=0x0)
    at restart.h:34
#3 0x405f2d16 in __pthread_mutex_lock (mutex=0x40533f78) at mutex.c:120
#4 0x403ed4c7 in GC_lock () at ../../../boehm-gc/linux_threads.c:1603
#5 0x403edcfd in GC_malloc_atomic (lb=88) at ../../../boehm-gc/malloc.c:256
#6 0x403ebf47 in GC_local_malloc_atomic (bytes=88)
    at ../../../boehm-gc/linux_threads.c:367
#7 0x40223913 in _Jv_NewPrimArray (eltype=0x8091f50, count=80)
    at include/java-gc.h:53
#8 0x402583c6 in java::lang::Throwable::fillInStackTrace() (this=0x8092ff0)
    at ../../../libjava/gcj/array.h:91
#9 0x40222e32 in _Jv_ThrowSignal (throwable=0x50)
    at ../../../libjava/prims.cc:111
#10 0x40222e66 in catch_segv(int) () at ../../../libjava/prims.cc:121
#11 <signal handler called>
#12 0x403efb8e in GC_mark_from (mark_stack_top=0x40599d74,
    mark_stack=0x41d81944, mark_stack_limit=0x403e62ac)
    at ../../../boehm-gc/mark.c:654
#13 0x403ef423 in GC_mark_some (
    cold_gc_frame=0x41d81908 "¨\020\005\b£ö>@t\235Y@D\031ÃA¬b>@x\\>@ì\030¸Aì\030¸At\235Y@ì")
    at ../../../boehm-gc/mark.c:357
#14 0x403e660f in GC_stopped_mark (stop_func=0x403e5c78 <GC_never_stop_func>)
    at ../../../boehm-gc/alloc.c:489
#15 0x403e62ac in GC_try_to_collect_inner (stop_func=0x403e5c78 <GC_never_stop_func>)
    at ../../../boehm-gc/alloc.c:350
#16 0x403e6be3 in GC_try_to_collect (stop_func=0x403e5c78 <GC_never_stop_func>)
    at ../../../boehm-gc/alloc.c:735
#17 0x403e6c39 in GC_gcollect () at ../../../boehm-gc/alloc.c:746
#18 0x403e38b9 in _Jv_RunGC() () at ../../../libjava/boehm.cc:401
#19 0x40253d59 in java::lang::Runtime::gc() (this=0x8076fa0)
    at ../../../libjava/java/lang/natRuntime.cc:118
#20 0x40272e9f in java.lang.System.gc() ()
    at ../../../libjava/java/lang/System.java:129
#21 0x08049651 in GCTest.gc() (this=0x40599d74) at GCTest.java:141
#22 0x080498ec in GCTest.testObjArray() (this=0x8076f30) at GCTest.java:187
#23 0x0804939a in GCTest.run() (this=0x8076f30) at GCTest.java:104
#24 0x40273673 in java.lang.Thread.run() (this=0x40599d74)
    at ../../../libjava/java/lang/Thread.java:132
#25 0x40257edc in _Jv_ThreadRun(java::lang::Thread*) (thread=0x8076f30)
    at ../../../libjava/java/lang/natThread.cc:285
#26 0x403e43a0 in really_start (x=0x8076f30)
    at ../../../libjava/posix-threads.cc:375
#27 0x403ed1ea in GC_start_routine (arg=0x807bfe0)
    at ../../../boehm-gc/linux_threads.c:1369
#28 0x405f1c6f in pthread_start_thread (arg=0x41d81be0) at manager.c:284
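A stripped-down sketch of the hazard visible in this trace (my
addition, illustrative only -- not libjava's actual code): the thread
takes the allocator lock on the way into collection (frame #16), faults
while marking (frame #12), and the SIGSEGV handler's allocation (frame
#7) then blocks on the lock the thread itself already holds:

#include <pthread.h>
#include <signal.h>
#include <stddef.h>

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

static void *alloc(size_t n)
{
    (void)n;
    pthread_mutex_lock(&alloc_lock);   /* cf. GC_lock, frame #4 */
    /* ... allocate (marking may fault on a bad descriptor) ... */
    pthread_mutex_unlock(&alloc_lock);
    return NULL;
}

static void segv_handler(int sig)
{
    (void)sig;
    /* Building the Throwable for the stack trace allocates (frame #7),
     * blocking forever on the lock this thread already holds. */
    alloc(88);
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    pthread_mutex_lock(&alloc_lock);   /* collector holds the lock... */
    raise(SIGSEGV);                    /* ...then marking faults: hang */
    return 0;
}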
[-- Attachment #2: GCTest.java --]
[-- Type: text/plain, Size: 6091 bytes --]
/*
 * GCTest.java
 *
 * Test that spawns a lot of threads and has each thread allocate various
 * objects. Then the gc is invoked, and afterwards the objects are checked
 * to make sure they didn't get freed or munged. This test helped discover
 * two race conditions in the garbage collector.
 *
 * Courtesy Pat Tullmann (tullmann@cs.utah.edu)
 */
public class GCTest
    implements Runnable
{
    class GCTest_Object
    {
        public GCTest_Object(int id_, GCTest_Object next_, String name_)
        {
            id = id_;
            next = next_;
            name = name_;
        }

        public int id;
        public GCTest_Object next;
        public String name;

        public void finalize()
        {
            id = -1;
            next = null;
            name = null;
        }
    }

    private static int ct_ = 113;
    private static boolean exitOnFailure_ = false;
    private int id_;

    public GCTest(int id)
    {
        id_ = id;
        Thread th = new Thread(this);
        th.start();
    }

    public static void main(String[] args)
    {
        int i;
        int thCt;

        // when run as part of testsuite, set some defaults
        thCt = 45;
        ct_ = 60;

        if (args.length < 2)
        {
            // in interactive use make this true
            if (false) {
                System.out.println("Usage: GCTest "
                                   + "<thread count> <block size (count)>");
                System.exit(1);
            }
        } else {
            thCt = Integer.parseInt(args[0]);
            ct_ = Integer.parseInt(args[1]);
        }

        for (i = 0; i < thCt; i++)
            new GCTest(i);
    }

    public void run()
    {
        // Test various stack references
        try
        {
            testObj();
            // out("obj Success");
        }
        catch (Throwable t)
        {
            t.printStackTrace();
            failure("testObj: Caught exception: " + t.getMessage());
        }

        try
        {
            testPrimArray();
            // out("primarray Success");
        }
        catch (Throwable t)
        {
            t.printStackTrace();
            failure("testPrimArray: Caught exception: " + t.getMessage());
        }

        try
        {
            testObjArray();
            // out("objarray Success");
        }
        catch (Throwable t)
        {
            t.printStackTrace();
            failure("testObjArray: Caught exception: " + t.getMessage());
        }

        try
        {
            testArrayOfArray();
            // out("arrayofarray Success");
        }
        catch (Throwable t)
        {
            t.printStackTrace();
            failure("testArrayOfArray: Caught exception: " + t.getMessage());
        }

        try
        {
            testObjChain();
            // out("objChain Success");
        }
        catch (Throwable t)
        {
            t.printStackTrace();
            failure("testObjChain: Caught exception: " + t.getMessage());
        }

        out("Success");
    }

    public void gc()
    {
        if (true) {
            // out("invoking gc");
            System.gc();
            // Sleep to see if the finalizer gets invoked..
            try
            {
                Thread.sleep(1);
            }
            catch (Exception e)
            {
                out("sleep failure");
                System.exit(13);
            }
        }
    }

    public void testObj()
    {
        Integer i = new Integer(42);
        gc();
        if (i.intValue() != 42)
            failure("testObj");
    }

    public void testPrimArray()
    {
        int[] intArray = new int[ct_];
        int i;

        for (i = 0; i < ct_; i++)
            intArray[i] = i;
        gc();
        for (i = 0; i < ct_; i++)
            if (intArray[i] != i)
                failure("testPrimArray: wanted " +i+ "; got " +intArray[i]+ ".");
    }

    public void testObjArray()
    {
        String[] strs = new String[ct_];
        int i;

        for (i = 0; i < ct_; i++)
            strs[i] = new String(Integer.toString(i));
        gc();
        for (i = 0; i < ct_; i++)
        {
            String cmp = new String(Integer.toString(i));
            if (!strs[i].equals(cmp))
                failure("testObjArray: wanted " +cmp+ "; got " +strs[i]+ ".");
        }
    }

    public void testArrayOfArray()
    {
        int[][] intArray = new int[ct_][ct_];
        int[][] intArray2 = new int[ct_][];
        int i, j;

        // test 1
        for (i = 0; i < ct_; i++)
            for (j = 0; j < ct_; j++)
                intArray[i][j] = id_;
        gc();
        for (j = 0; j < ct_; j++)
            for (i = 0; i < ct_; i++)
                if (intArray[i][j] != id_)
                    failure("testArrayOfArray(1): " +intArray+ "[" +i+ "][" +j+ "]. Got " +intArray[i][j]+ "; expected " +id_+ ".");
        gc();

        // test 2
        intArray = new int[ct_][];
        for (i = 0; i < ct_; i++)
        {
            intArray[i] = intArray2[i] = new int[ct_];
            // out(intArray[i].toString());
            for (j = 0; j < ct_; j++)
                intArray[i][j] = id_;
        }
        gc();
        for (j = 0; j < ct_; j++) {
            if (intArray[j] != intArray2[j])
                failure("testArrayOfArray(3): " +intArray+ "[" +j+ "]. Got " +intArray[j]+ "; expected " +intArray2[j]+ ".");
            for (i = 0; i < ct_; i++)
                if (intArray[i][j] != id_)
                    failure("testArrayOfArray(2): " +intArray+ "[" +i+ "][" +j+ "]. Got " +intArray[i][j]+ "; expected " +id_+ ".");
        }
    }

    public void testObjChain()
    {
        GCTest_Object head, next;
        int i;

        head = new GCTest_Object(0, null, "0");
        next = head;
        for (i = 1; i < 100; i++)
        {
            next.next = new GCTest_Object(i, null, Integer.toString(i));
            next = next.next;
        }
        gc();
        next = head;
        for (i = 0; i < 99; i++)
        {
            if ((next.id != i)
                || (next.next == null)
                || (!next.name.equals(Integer.toString(i))))
                failure("testObjChain at " +i+ "(0x" +next.hashCode()+ "):"
                        + " id is " +next.id+ " (should be " +i+ ");"
                        + " name is " +next.name+ " (should be '" +i+ "').");
            next = next.next;
        }
        if ((next.id != 99)
            || (next.next != null)
            || (!next.name.equals("99")))
            failure("testObjChain at 99");
    }

    public void failure(String msg)
    {
        out("Failure: " +msg);
        if (exitOnFailure_)
            System.exit(11);
    }

    public void out(String msg)
    {
        synchronized (GCTest.class) {
            System.out.println("[" +id_+ "]: " +msg);
        }
    }
}
// Sort output
/* Expected Output:
[0]: Success
[10]: Success
[11]: Success
[12]: Success
[13]: Success
[14]: Success
[15]: Success
[16]: Success
[17]: Success
[18]: Success
[19]: Success
[1]: Success
[20]: Success
[21]: Success
[22]: Success
[23]: Success
[24]: Success
[25]: Success
[26]: Success
[27]: Success
[28]: Success
[29]: Success
[2]: Success
[30]: Success
[31]: Success
[32]: Success
[33]: Success
[34]: Success
[35]: Success
[36]: Success
[37]: Success
[38]: Success
[39]: Success
[3]: Success
[40]: Success
[41]: Success
[42]: Success
[43]: Success
[44]: Success
[4]: Success
[5]: Success
[6]: Success
[7]: Success
[8]: Success
[9]: Success
*/
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2002-04-01 16:53 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-28 18:53 GC failure w/ THREAD_LOCAL_ALLOC ? Boehm, Hans
2002-03-28 23:08 ` Tom Tromey
2002-03-29 14:05 ` Bryce McKinlay
2002-03-30 5:20 ` Jeff Sturm
2002-04-01 8:14 ` Michael Smith
-- strict thread matches above, loose matches on Subject: below --
2002-04-01 8:53 Boehm, Hans
2002-03-29 17:09 Boehm, Hans
2002-03-28 12:08 Boehm, Hans
2002-03-22 22:13 Boehm, Hans
2002-03-27 12:48 ` Michael Smith
2002-03-22 11:41 Boehm, Hans
2002-03-22 12:59 ` Michael Smith
2002-03-22 18:25 ` Jeff Sturm
2002-03-22 21:22 ` Jeff Sturm
2002-03-21 10:24 Boehm, Hans
2002-03-21 10:59 ` Jeff Sturm
2002-03-22 11:00 ` Michael Smith
2002-03-20 22:10 Boehm, Hans
2002-03-21 10:00 ` Jeff Sturm
2002-03-20 17:03 Boehm, Hans
2002-03-20 17:54 ` Bryce McKinlay
2002-03-20 20:35 ` Jeff Sturm
2002-03-20 22:58 ` Bryce McKinlay
2002-03-26 15:45 ` Tom Tromey
2002-03-26 15:57 ` Bryce McKinlay
2002-03-26 16:11 ` Tom Tromey
2002-03-26 19:49 ` Bryce McKinlay
2002-03-20 14:03 Boehm, Hans
2002-03-19 19:30 Bryce McKinlay
2002-03-20 10:19 ` Michael Smith
2002-03-20 12:13 ` Jeff Sturm
2002-03-20 14:42 ` Tom Tromey
2002-03-20 15:03 ` Jeff Sturm
2002-03-20 14:51 ` Bryce McKinlay
2002-03-20 15:07 ` Jeff Sturm
2002-03-20 19:32 ` Loren James Rittle