From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <cluster-cvs-return-11663-listarch-cluster-cvs=sources.redhat.com@sourceware.org>
Received: (qmail 19223 invoked by alias); 5 Aug 2009 19:12:35 -0000
Received: (qmail 19217 invoked by alias); 5 Aug 2009 19:12:35 -0000
X-SWARE-Spam-Status: No, hits=-2.0 required=5.0
	tests=AWL,BAYES_00,SPF_HELO_PASS
X-Spam-Status: No, hits=-2.0 required=5.0
	tests=AWL,BAYES_00,SPF_HELO_PASS
X-Spam-Check-By: sourceware.org
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	bastion2.fedora.phx.redhat.com
Subject: cluster: RHEL48 - dlm-kernel: ignore unlock reply of zero
To: cluster-cvs-relay@redhat.com
X-Project: Cluster Project
X-Git-Module: cluster.git
X-Git-Refname: refs/heads/RHEL48
X-Git-Reftype: branch
X-Git-Oldrev: ceb646b1850c6a3f45fdebd62af2e18122a58db9
X-Git-Newrev: 46d20cd6ace7ed9e35507ac823aa41999271b954
From: David Teigland <teigland@fedoraproject.org>
Message-Id: <20090805191216.90FF612028E@lists.fedorahosted.org>
Date: Wed, 05 Aug 2009 19:12:00 -0000
X-Scanned-By: MIMEDefang 2.58 on 172.16.52.254
Mailing-List: contact cluster-cvs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <cluster-cvs.sourceware.org>
List-Subscribe: <mailto:cluster-cvs-subscribe@sourceware.org>
List-Post: <mailto:cluster-cvs@sourceware.org>
List-Help: <mailto:cluster-cvs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: cluster-cvs-owner@sourceware.org
X-SW-Source: 2009-q3/txt/msg00160.txt.bz2

Gitweb:        http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=46d20cd6ace7ed9e35507ac823aa41999271b954
Commit:        46d20cd6ace7ed9e35507ac823aa41999271b954
Parent:        ceb646b1850c6a3f45fdebd62af2e18122a58db9
Author:        David Teigland <teigland@redhat.com>
AuthorDate:    Mon Apr 20 16:37:45 2009 -0500
Committer:     David Teigland <teigland@redhat.com>
CommitterDate: Wed Aug 5 14:02:35 2009 -0500

dlm-kernel: ignore unlock reply of zero

bz 495600

Studying bug 349001 and the patch for it seems to leave a small window for the
same problem to occur.  The symptoms seem consistent and it's the only
possibility I can come up with.

The problem that commit c92628dcc39e03a4e9eccc4fa76257c871e5ba00 aims to fix
is a grant message followed by a convert reply message for the same lock.  I
think the following sequence of events could still cause that to happen, even
though the patch closes the window most of the way.

dlm_recv thread
1. receive convert -- process_cluster_request/GDLM_REMCMD_CONVREQUEST
2.   lkb->lkb_request = freq
3.   dlm_convert_stage2()
4.     lkb can't be granted immediately, so it's put on the convert queue
6.   if (lkb->lkb_request != NULL)
9.     send convert reply

other thread
5. unlocks another lkb, finds lkb above can be granted, calls remote_grant()
7. lkb->lkb_request = NULL
8. send grant message

remote_grant() is supposed to prevent the convert reply from being sent at all
by setting lkb_request = NULL.  But, given the right race it doesn't work and
both grant and convert reply messages are sent (and sent in the bad order).

This isn't a general fix for the problem, but an additional work around.
We check right away upon getting a reply (before doing anything) whether
it's an unlock reply with status of 0, which shouldn't happen.  If it is,
we ignore the reply (which the previous work-around on the sender side failed
to prevent.)

Currently, we recognize this "condition which shouldn't happen" after we've
made changes (removed the lkb from the queue), making it too late to avoid
disrupting the proper handling of the real unlock reply.  This just moves the
check to earlier, where we can really effectively ignore the reply.

Also, add more info to the process_lockqueue_reply state 0 message to
help understand more about the cause.

Signed-off-by: David Teigland <teigland@redhat.com>
---
 dlm-kernel/src/dlm_internal.h |    3 ++-
 dlm-kernel/src/lockqueue.c    |   22 ++++++++++++++++++----
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/dlm-kernel/src/dlm_internal.h b/dlm-kernel/src/dlm_internal.h
index ff83a65..3400ee7 100644
--- a/dlm-kernel/src/dlm_internal.h
+++ b/dlm-kernel/src/dlm_internal.h
@@ -416,7 +416,8 @@ struct dlm_lkb {
 
 	struct list_head	lkb_lockqueue;	/* queue of locks waiting
 						   for remote reply */
-	int			lkb_lockqueue_state; /* reason on lockqueue */
+	int16_t			lkb_lockqueue_state; /* reason on lockqueue */
+	int16_t			lkb_prev_lq_state;
 	uint32_t		lkb_lockqueue_flags; /* as passed into
 							lock/unlock */
 	int			lkb_ownpid;	/* pid of lock owner */
diff --git a/dlm-kernel/src/lockqueue.c b/dlm-kernel/src/lockqueue.c
index b996e3a..d203e16 100644
--- a/dlm-kernel/src/lockqueue.c
+++ b/dlm-kernel/src/lockqueue.c
@@ -326,8 +326,19 @@ static void process_lockqueue_reply(struct dlm_lkb *lkb,
 	struct dlm_ls *ls = rsb->res_ls;
 	int oldstate, state = lkb->lkb_lockqueue_state;
 
-	if (state)
+	/* hack to work around bug
+	   https://bugzilla.redhat.com/show_bug.cgi?id=495600 */
+
+	if (state == GDLM_LQSTATE_WAIT_UNLOCK && reply->rl_status == 0) {
+		log_error(ls, "ignore zero unlock reply from %d %x %s",
+			  nodeid, lkb->lkb_id, lkb->lkb_resource->res_name);
+		return;
+	}
+
+	if (state) {
 		remove_from_lockqueue(lkb);
+		lkb->lkb_prev_lq_state = state;
+	}
 
 	switch (state) {
 	case GDLM_LQSTATE_WAIT_RSB:
@@ -565,14 +576,17 @@ static void process_lockqueue_reply(struct dlm_lkb *lkb,
 				   print_lkb(lkb););
 			queue_ast(lkb, AST_COMP | AST_DEL, 0);
 		} else {
-			log_error(ls, "cancel reply ret %d", reply->rl_status);
+			/* should not get here due to hack at top */
+			log_error(ls, "cancel reply ret %d %x",
+				  reply->rl_status, lkb->lkb_id);
 			queue_ast(lkb, AST_COMP, 0);
 		}
 		break;
 
 	default:
-		log_error(ls, "process_lockqueue_reply id %x state %d",
-		          lkb->lkb_id, state);
+		log_error(ls, "process_lockqueue_reply id %x state %d %d %d %d",
+		          lkb->lkb_id, state, lkb->lkb_prev_lq_state,
+			  reply->rl_status, lkb->lkb_retstatus);
 	}
 }