From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 19223 invoked by alias); 5 Aug 2009 19:12:35 -0000 Received: (qmail 19217 invoked by alias); 5 Aug 2009 19:12:35 -0000 X-SWARE-Spam-Status: No, hits=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_HELO_PASS X-Spam-Status: No, hits=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_HELO_PASS X-Spam-Check-By: sourceware.org X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bastion2.fedora.phx.redhat.com Subject: cluster: RHEL48 - dlm-kernel: ignore unlock reply of zero To: cluster-cvs-relay@redhat.com X-Project: Cluster Project X-Git-Module: cluster.git X-Git-Refname: refs/heads/RHEL48 X-Git-Reftype: branch X-Git-Oldrev: ceb646b1850c6a3f45fdebd62af2e18122a58db9 X-Git-Newrev: 46d20cd6ace7ed9e35507ac823aa41999271b954 From: David Teigland Message-Id: <20090805191216.90FF612028E@lists.fedorahosted.org> Date: Wed, 05 Aug 2009 19:12:00 -0000 X-Scanned-By: MIMEDefang 2.58 on 172.16.52.254 Mailing-List: contact cluster-cvs-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Post: List-Help: , Sender: cluster-cvs-owner@sourceware.org X-SW-Source: 2009-q3/txt/msg00160.txt.bz2 Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=46d20cd6ace7ed9e35507ac823aa41999271b954 Commit: 46d20cd6ace7ed9e35507ac823aa41999271b954 Parent: ceb646b1850c6a3f45fdebd62af2e18122a58db9 Author: David Teigland AuthorDate: Mon Apr 20 16:37:45 2009 -0500 Committer: David Teigland CommitterDate: Wed Aug 5 14:02:35 2009 -0500 dlm-kernel: ignore unlock reply of zero bz 495600 Studying bug 349001 and the patch for it seems to leave a small window for the same problem to occur. The symptoms seem consistent and it's the only possibility I can come up with. The problem that commit c92628dcc39e03a4e9eccc4fa76257c871e5ba00 aims to fix is a grant message followed by a convert reply message for the same lock. I think the following sequence of events could still cause that to happen, even though the patch closes the window most of the way. dlm_recv thread 1. receive convert -- process_cluster_request/GDLM_REMCMD_CONVREQUEST 2. lkb->lkb_request = freq 3. dlm_convert_stage2() 4. lkb can't be granted immediately, so it's put on the convert queue 6. if (lkb->lkb_request != NULL) 9. send convert reply other thread 5. unlocks another lkb, finds lkb above can be granted, calls remote_grant() 7. lkb->lkb_request = NULL 8. send grant message remote_grant() is supposed to prevent the convert reply from being sent at all by setting lkb_request = NULL. But, given the right race it doesn't work and both grant and convert reply messages are sent (and sent in the bad order). This isn't a general fix for the problem, but an additional work around. We check right away upon getting a reply (before doing anything) whether it's an unlock reply with status of 0, which shouldn't happen. If it is, we ignore the reply (which the previous work-around on the sender side failed to prevent.) Currently, we recognize this "condition which shouldn't happen" after we've made changes (removed the lkb from the queue), making it too late to avoid disrupting the proper handling of the real unlock reply. This just moves the check to earlier, where we can really effectively ignore the reply. Also, add more info to the process_lockqueue_reply state 0 message to help understand more about the cause. Signed-off-by: David Teigland --- dlm-kernel/src/dlm_internal.h | 3 ++- dlm-kernel/src/lockqueue.c | 22 ++++++++++++++++++---- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/dlm-kernel/src/dlm_internal.h b/dlm-kernel/src/dlm_internal.h index ff83a65..3400ee7 100644 --- a/dlm-kernel/src/dlm_internal.h +++ b/dlm-kernel/src/dlm_internal.h @@ -416,7 +416,8 @@ struct dlm_lkb { struct list_head lkb_lockqueue; /* queue of locks waiting for remote reply */ - int lkb_lockqueue_state; /* reason on lockqueue */ + int16_t lkb_lockqueue_state; /* reason on lockqueue */ + int16_t lkb_prev_lq_state; uint32_t lkb_lockqueue_flags; /* as passed into lock/unlock */ int lkb_ownpid; /* pid of lock owner */ diff --git a/dlm-kernel/src/lockqueue.c b/dlm-kernel/src/lockqueue.c index b996e3a..d203e16 100644 --- a/dlm-kernel/src/lockqueue.c +++ b/dlm-kernel/src/lockqueue.c @@ -326,8 +326,19 @@ static void process_lockqueue_reply(struct dlm_lkb *lkb, struct dlm_ls *ls = rsb->res_ls; int oldstate, state = lkb->lkb_lockqueue_state; - if (state) + /* hack to work around bug + https://bugzilla.redhat.com/show_bug.cgi?id=495600 */ + + if (state == GDLM_LQSTATE_WAIT_UNLOCK && reply->rl_status == 0) { + log_error(ls, "ignore zero unlock reply from %d %x %s", + nodeid, lkb->lkb_id, lkb->lkb_resource->res_name); + return; + } + + if (state) { remove_from_lockqueue(lkb); + lkb->lkb_prev_lq_state = state; + } switch (state) { case GDLM_LQSTATE_WAIT_RSB: @@ -565,14 +576,17 @@ static void process_lockqueue_reply(struct dlm_lkb *lkb, print_lkb(lkb);); queue_ast(lkb, AST_COMP | AST_DEL, 0); } else { - log_error(ls, "cancel reply ret %d", reply->rl_status); + /* should not get here due to hack at top */ + log_error(ls, "cancel reply ret %d %x", + reply->rl_status, lkb->lkb_id); queue_ast(lkb, AST_COMP, 0); } break; default: - log_error(ls, "process_lockqueue_reply id %x state %d", - lkb->lkb_id, state); + log_error(ls, "process_lockqueue_reply id %x state %d %d %d %d", + lkb->lkb_id, state, lkb->lkb_prev_lq_state, + reply->rl_status, lkb->lkb_retstatus); } }