public inbox for cluster-cvs@sourceware.org help / color / mirror / Atom feed
From: David Teigland <teigland@fedoraproject.org> To: cluster-cvs-relay@redhat.com Subject: cluster: RHEL48 - dlm-kernel: ignore unlock reply of zero Date: Wed, 05 Aug 2009 19:12:00 -0000 [thread overview] Message-ID: <20090805191216.90FF612028E@lists.fedorahosted.org> (raw) Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=46d20cd6ace7ed9e35507ac823aa41999271b954 Commit: 46d20cd6ace7ed9e35507ac823aa41999271b954 Parent: ceb646b1850c6a3f45fdebd62af2e18122a58db9 Author: David Teigland <teigland@redhat.com> AuthorDate: Mon Apr 20 16:37:45 2009 -0500 Committer: David Teigland <teigland@redhat.com> CommitterDate: Wed Aug 5 14:02:35 2009 -0500 dlm-kernel: ignore unlock reply of zero bz 495600 Studying bug 349001 and the patch for it seems to leave a small window for the same problem to occur. The symptoms seem consistent and it's the only possibility I can come up with. The problem that commit c92628dcc39e03a4e9eccc4fa76257c871e5ba00 aims to fix is a grant message followed by a convert reply message for the same lock. I think the following sequence of events could still cause that to happen, even though the patch closes the window most of the way. dlm_recv thread 1. receive convert -- process_cluster_request/GDLM_REMCMD_CONVREQUEST 2. lkb->lkb_request = freq 3. dlm_convert_stage2() 4. lkb can't be granted immediately, so it's put on the convert queue 6. if (lkb->lkb_request != NULL) 9. send convert reply other thread 5. unlocks another lkb, finds lkb above can be granted, calls remote_grant() 7. lkb->lkb_request = NULL 8. send grant message remote_grant() is supposed to prevent the convert reply from being sent at all by setting lkb_request = NULL. But, given the right race it doesn't work and both grant and convert reply messages are sent (and sent in the bad order). This isn't a general fix for the problem, but an additional work around. We check right away upon getting a reply (before doing anything) whether it's an unlock reply with status of 0, which shouldn't happen. If it is, we ignore the reply (which the previous work-around on the sender side failed to prevent.) Currently, we recognize this "condition which shouldn't happen" after we've made changes (removed the lkb from the queue), making it too late to avoid disrupting the proper handling of the real unlock reply. This just moves the check to earlier, where we can really effectively ignore the reply. Also, add more info to the process_lockqueue_reply state 0 message to help understand more about the cause. Signed-off-by: David Teigland <teigland@redhat.com> --- dlm-kernel/src/dlm_internal.h | 3 ++- dlm-kernel/src/lockqueue.c | 22 ++++++++++++++++++---- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/dlm-kernel/src/dlm_internal.h b/dlm-kernel/src/dlm_internal.h index ff83a65..3400ee7 100644 --- a/dlm-kernel/src/dlm_internal.h +++ b/dlm-kernel/src/dlm_internal.h @@ -416,7 +416,8 @@ struct dlm_lkb { struct list_head lkb_lockqueue; /* queue of locks waiting for remote reply */ - int lkb_lockqueue_state; /* reason on lockqueue */ + int16_t lkb_lockqueue_state; /* reason on lockqueue */ + int16_t lkb_prev_lq_state; uint32_t lkb_lockqueue_flags; /* as passed into lock/unlock */ int lkb_ownpid; /* pid of lock owner */ diff --git a/dlm-kernel/src/lockqueue.c b/dlm-kernel/src/lockqueue.c index b996e3a..d203e16 100644 --- a/dlm-kernel/src/lockqueue.c +++ b/dlm-kernel/src/lockqueue.c @@ -326,8 +326,19 @@ static void process_lockqueue_reply(struct dlm_lkb *lkb, struct dlm_ls *ls = rsb->res_ls; int oldstate, state = lkb->lkb_lockqueue_state; - if (state) + /* hack to work around bug + https://bugzilla.redhat.com/show_bug.cgi?id=495600 */ + + if (state == GDLM_LQSTATE_WAIT_UNLOCK && reply->rl_status == 0) { + log_error(ls, "ignore zero unlock reply from %d %x %s", + nodeid, lkb->lkb_id, lkb->lkb_resource->res_name); + return; + } + + if (state) { remove_from_lockqueue(lkb); + lkb->lkb_prev_lq_state = state; + } switch (state) { case GDLM_LQSTATE_WAIT_RSB: @@ -565,14 +576,17 @@ static void process_lockqueue_reply(struct dlm_lkb *lkb, print_lkb(lkb);); queue_ast(lkb, AST_COMP | AST_DEL, 0); } else { - log_error(ls, "cancel reply ret %d", reply->rl_status); + /* should not get here due to hack at top */ + log_error(ls, "cancel reply ret %d %x", + reply->rl_status, lkb->lkb_id); queue_ast(lkb, AST_COMP, 0); } break; default: - log_error(ls, "process_lockqueue_reply id %x state %d", - lkb->lkb_id, state); + log_error(ls, "process_lockqueue_reply id %x state %d %d %d %d", + lkb->lkb_id, state, lkb->lkb_prev_lq_state, + reply->rl_status, lkb->lkb_retstatus); } }
reply other threads:[~2009-08-05 19:12 UTC|newest] Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20090805191216.90FF612028E@lists.fedorahosted.org \ --to=teigland@fedoraproject.org \ --cc=cluster-cvs-relay@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).