From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <archer-return-2240-listarch-archer=sourceware.org@sourceware.org>
Received: (qmail 11509 invoked by alias); 5 Oct 2010 17:31:23 -0000
Mailing-List: contact archer-help@sourceware.org; run by ezmlm
Sender: <archer@sourceware.org>
Precedence: bulk
List-Post: <mailto:archer@sourceware.org>
List-Help: <mailto:archer-help@sourceware.org>
List-Subscribe: <mailto:archer-subscribe@sourceware.org>
List-Id: <archer.sourceware.org>
Received: (qmail 11489 invoked by uid 22791); 5 Oct 2010 17:31:21 -0000
X-SWARE-Spam-Status: No, hits=-6.4 required=5.0
	tests=AWL,BAYES_00,RCVD_IN_DNSWL_HI,SPF_HELO_PASS,T_LOTS_OF_MONEY,T_RP_MATCHES_RCVD
X-Spam-Check-By: sourceware.org
Date: Tue, 05 Oct 2010 17:31:00 -0000
From: Oleg Nesterov <oleg@redhat.com>
To: archer@sourceware.org, utrace-devel@redhat.com
Subject: BUG: gdb && notification packets (Was: gdbstub initial code, v12)
Message-ID: <20101005172729.GA27882@redhat.com>
References: <20100930182320.GA17475@redhat.com> <20101004181053.GA30217@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20101004181053.GA30217@redhat.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
X-SW-Source: 2010-q4/txt/msg00002.txt.bz2

On 10/04, Oleg Nesterov wrote:
>
> While trying to understand what does this mean, I hit another bug
> in ugdb.

No, /usr/bin/gdb is buggy.

> A multithreaded tracee can "hang" if gdb simulates
> watchpoints with single-steps + mem-fetch/compare.

It doesn't, but gdb "forgets" about the pending "Stop:" notification,
see below.

> This never
> happens if the tracee is single-threaded.

This is clear. And this doesn't happen with the real gdbbserver.
Probably timing issues, with ugdb the tracee can report itself
as stopped much faster (not because ugdb is faster, but because
it sends the notification before the tracee actually stops).


Consider this simplified artifitial test-case:

	#include <stdio.h>
	#include <unistd.h>
	#include <pthread.h>

	struct {
		char x[128];
	} VAR;

	void *tfunc(void *arg)
	{
		for (;;)
			;
	}

	int main(void)
	{
		pthread_t thr;

		printf("pid: %d\n", getpid());

		pthread_create(&thr, NULL, tfunc, NULL);

		tfunc(NULL);

		return 0;
	}

It just creates 2 threads, each threads spins in the trivial endless loop.
VAR is the dummy variable for "watch", it is never changed. It should be big
enough to ensure can_use_hardware_watchpoint->default_region_ok_for_hw_watchpoint()
returns 0 and gdb falls back to simulating.

	(gdb) set target-async on
	(gdb) set non-stop
	(gdb) target extended-remote :2000
	(gdb) attach 1404
	(gdb) watch VAR
	(gdb) c -a

After that, sooner or later the tracee "hangs". gdb sleeps in poll(), both
traced threads sleep after reporting SIGTRAP (single-step).

The more or less "typical" transcript is:

	[... snip ...]
	=> s
	<= OK
	<= %Stop:T05thread:p57c.57d;
	=> vStopped
	<= T05thread:p57c.57c;
	=> vStopped
	<= OK
	=> g
	<= 50994242000000000000000000000000761f0a16677f000000000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> g
	<= 50994242000000000000000000000000761f0a16677f000000000000000000...
	=> s
	<= OK
	<= %Stop:T05thread:p57c.57d;
	=> Hgp57c.57c
	<= OK
	=> g
	<= 000000000000000030064000000000008162e115677f000000000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> Hcp57c.57c
	<= OK
	=> s#73$vStopped
	<= OK#9a$OK
	<= %Stop:T05thread:p57c.57c;
	=> Hgp57c.57d#ec$g
	<= OK#9a$50994242000000000000000000000000761f0a16677f000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> g
	<= 50994242000000000000000000000000761f0a16677f000000000000000000...
	=> Hcp57c.57d
	<= OK
	=> s#73$vStopped
	<= OK#9a$OK
	<= %Stop:T05thread:p57c.57d;
	=> Hgp57c.57c
	<= OK
	=> g
	<= 000000000000000030064000000000008162e115677f000000000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> Hcp57c.57c
	<= OK
	=> s#73$vStopped
	<= OK#9a$OK

Everything is fine so far. And this is how it works with the real gdbserver:
gdb always issues vStopped until it gets the final "OK". Like it should,
I guess.

So. At this point gdb knows that 57d is stopped, and it sends '$s' to 57c.

	<= %Stop:T05thread:p57c.57c;

57c reports the next stop, but gdb temporary "ignores" this notification
and plays with 57d. This is fine.

	=> Hgp57c.57d#ec$g
	<= OK#9a$50994242000000000000000000000000761f0a16677f000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> g
	<= 50994242000000000000000000000000761f0a16677f000000000000000000...
	=> Hcp57c.57d
	<= OK
	=> s#73$vStopped

Finally it does vStopped. But, by this time 57d is stopped again (note
that gdb sends '$s' before vStopped).

	<= OK#9a$T05thread:p57c.57d;

ugdb reports 57d as stopped.

_If_ gdb sent vStopped right now, he would get "OK" in return. But it
doesn't. Again, this is not the bug yet.

	=> Hgp57c.57c
	<= OK
	=> g
	<= 000000000000000030064000000000008162e115677f000000000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> Hcp57c.57c
	<= OK
	=> s
	<= OK

Now, 57c stops. But it can't report this stop. We are "inside" the
"vStopped" sequence, ugdb waits for gdb which should complete it
by sending the next vStopped packet which should find that 57c is
stopped.

>From the documentation:

	If the stub receives a `vStopped' packet and there are no additional
	stop events to report, the stub shall return an `OK' response.  At this
	point, if further stop events occur, the stub shall send a new stop
	reply notification,

IOW, 57c should not send the notification and it doesn't. But this confuses
gdb.

In short: gdb forgets that the last 'vStopped' didn't return 'OK' and
thus we probably have more stopped tracees but waits for the next
notification. ugdb doesn't send the notification because it shouldn't
do this until gdb sends the "final" vStopped according to the doc
above.

Another similar transcript:

	[... snip ...]
	=> Hcp589.589
	<= OK
	=> s
	<= OK#9a%Stop:T05thread:p589.589;
	=> vStopped
	<= OK
	=> g
	<= 0000000000000000300640000000000081621fd01f7f000000000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> s
	<= OK#9a%Stop:T05thread:p589.589;
	=> vStopped
	<= OK
	=> g
	<= 0000000000000000300640000000000081621fd01f7f000000000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> s#73$vStopped
	<= OK#9a$T05thread:p589.589;
	=> Hgp589.58a#c1$g
	<= OK#9a$50990141000000000000000000000000761f48d01f7f000000000000...
	=> m600a60,80
	<= 00000000000000000000000000000000000000000000000000000000000000...
	=> g
	<= 50990141000000000000000000000000761f48d01f7f000000000000000000...
	=> Hcp589.58a
	<= OK
	=> s
	<= OK

Almost the same.


Once again, whatever I did I failed to reproduce this hang with the
real gdbserver.

So I hacked gdb to send the similar sequence of packets instead, and
result is the same:

	getpkt ("Hcp589.589");  [no ack sent]
	putpkt ("$OK#9a"); [noack mode]
	getpkt ("s");  [no ack sent]
	putpkt ("$OK#9a"); [noack mode]
	pending stop replies: 1
	putpkt ("%Stop:T0506:f089f2deff7f0* ;07:f089f2deff7f0* ;10:d40540*';thread:p589.589;core:1;#94"); [notif]

	getpkt ("Hcp589.58a");  [no ack sent]
	putpkt ("$OK#9a"); [noack mode]

	getpkt ("s");  [no ack sent]
	putpkt ("$OK#9a"); [noack mode]
	pending stop replies: 2

	getpkt ("vStopped");  [no ack sent]
	vStopped: acking LWP 1417.1417
	putpkt ("$T0506:309101410*"00;07:309101410*"00;10:d40540*';thread:p589.58a;core:1;#24"); [noack mode]
	getpkt ("Hcp589.589");  [no ack sent]
	putpkt ("$OK#9a"); [noack mode]

	getpkt ("s");  [no ack sent]
	putpkt ("$OK#9a"); [noack mode]
	pending stop replies: 2

Like ugdb, gdbserver does NOT send the new notification, but waits for gdb
which should continue and complete the vStopped sequence.

So, I strongly believe gdb is buggy and should be fixed.

Oleg.