From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 21856 invoked by alias); 29 Jul 2009 10:33:28 -0000
Received: (qmail 21850 invoked by alias); 29 Jul 2009 10:33:28 -0000
X-SWARE-Spam-Status: No, hits=-1.6 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_66,SPF_HELO_PASS
X-Spam-Status: No, hits=-1.6 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_66,SPF_HELO_PASS
X-Spam-Check-By: sourceware.org
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bastion2.fedora.phx.redhat.com
Subject: cluster: STABLE3 - cman: Fix a situation where cman could kill the wrong nodes
To: cluster-cvs-relay@redhat.com
X-Project: Cluster Project
X-Git-Module: cluster.git
X-Git-Refname: refs/heads/STABLE3
X-Git-Reftype: branch
X-Git-Oldrev: 1558f71870f78c2101d8ef0833c178d2f2d86f8d
X-Git-Newrev: aa2ea305eb1c7b706a8a3f81adb84f90ecd880d0
From: Christine Caulfield
Message-Id: <20090729103304.CDF08120346@lists.fedorahosted.org>
Date: Wed, 29 Jul 2009 10:33:00 -0000
X-Scanned-By: MIMEDefang 2.58 on 172.16.52.254
Mailing-List: contact cluster-cvs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: cluster-cvs-owner@sourceware.org
X-SW-Source: 2009-q3/txt/msg00123.txt.bz2

Gitweb:        http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=aa2ea305eb1c7b706a8a3f81adb84f90ecd880d0
Commit:        aa2ea305eb1c7b706a8a3f81adb84f90ecd880d0
Parent:        1558f71870f78c2101d8ef0833c178d2f2d86f8d
Author:        Christine Caulfield
AuthorDate:    Wed Jul 29 11:17:47 2009 +0100
Committer:     Christine Caulfield
CommitterDate: Wed Jul 29 11:17:47 2009 +0100

cman: Fix a situation where cman could kill the wrong nodes

hmm, how to describe this .... Hmmmm. OK let's try:

There were a couple of places in the cman code where the transition
message assumed that the node in question (either this node or the
sending node) was joining the cluster, rather than just sending its
current post-transition state. This was wrong.
It's a common problem we have with openais/corosync in that it always
merges clusters rather than joining from scratch so we need to detect
that in some way. The code in ais.c has a flag called 'first_trans'
which it sets when it first encounters another node in the cluster.
We should use this more often as it's really helpful.

So this is what we now do. The comments in the existing code make it
clear that it assumed the node was joining and not just part of an
existing transition, but the first_trans flag was not checked, so it
was fairly obvious what was going on.

So, now we check the first_trans flag in all places where the code
assumes that the node is joining a new cluster.

See ?

Signed-off-by: Christine Caulfield
---
 cman/daemon/commands.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/cman/daemon/commands.c b/cman/daemon/commands.c
index 23ea810..b7f7293 100644
--- a/cman/daemon/commands.c
+++ b/cman/daemon/commands.c
@@ -75,6 +75,7 @@ static struct corosync_api_v1 *corosync;
 static corosync_timer_handle_t ccsd_timer;
 static unsigned int wanted_config_version;
 static int config_error;
+static int local_first_trans;
 
 static corosync_timer_handle_t shutdown_timer;
 static struct connection *shutdown_con;
@@ -1719,6 +1720,7 @@ void send_transition_msg(int last_memb_count, int first_trans)
 	int len = sizeof(struct cl_transmsg);
 
 	we_are_a_cluster_member = 1;
+	local_first_trans = first_trans;
 
 	log_printf(LOGSYS_LEVEL_DEBUG, "memb: sending TRANSITION message. cluster_name = %s\n", cluster_name);
 	msg->cmd = CLUSTER_MSG_TRANSITION;
@@ -1889,9 +1891,9 @@ static void do_process_transition(int nodeid, char *data)
 		return;
 	}
 
-	/* If the remote node can see AISONLY nodes then we can't join as we don't
-	   know the full state */
-	if (msg->flags & NODE_FLAGS_SEESDISALLOWED && !have_disallowed()) {
+	/* If the remote node can see AISONLY nodes and we want to join,
+	   then we can't, as we don't know the full state */
+	if (local_first_trans && msg->flags & NODE_FLAGS_SEESDISALLOWED && !have_disallowed()) {
 		/* Must use syslog directly here or the message will never arrive */
 		syslog(LOG_CRIT, "CMAN: Joined a cluster with disallowed nodes. must die");
 		cman_finish();
@@ -1953,10 +1955,10 @@ static void do_process_transition(int nodeid, char *data)
 		add_ais_node(nodeid, incarnation, num_ais_nodes);
 	}
 
-	/* If the cluster already has some AISONLY nodes then we can't make
-	   sense of the membership. So the new node has to also be AISONLY
-	   until we are consistent again */
-	if (have_disallowed() && !node->us)
+	/* If the new node is joining and the existing cluster already has some AISONLY
+	   nodes then we can't make sense of the membership.
+	   So the new node has to also be AISONLY until we are consistent again */
+	if (msg->first_trans && !node->us && have_disallowed())
 		node->state = NODESTATE_AISONLY;
 
 	node->flags = msg->flags; /* This will clear the BEENDOWN flag of course */
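
Not part of the commit: the two guarded conditions in the patch above can
be read as a pair of small predicates. The sketch below is a simplified
model only. Every name in it (sketch_must_leave, sketch_sender_state,
enum sketch_state, sketch_self_check) is hypothetical and does not exist
in cman; the booleans stand in for local_first_trans, msg->first_trans,
the NODE_FLAGS_SEESDISALLOWED bit, node->us and have_disallowed(). The
real code only forces NODESTATE_AISONLY and leaves the state untouched
otherwise, which the sketch flattens into a two-value result.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; these are not the
 * real cman types or function names. */
enum sketch_state { SKETCH_MEMBER, SKETCH_AISONLY };

/* Local side: leave the cluster only if *we* are on our first
 * transition (i.e. joining), the sender can see disallowed nodes,
 * and we cannot see any ourselves. */
bool sketch_must_leave(bool local_first_trans,
		       bool sender_sees_disallowed,
		       bool we_have_disallowed)
{
	return local_first_trans && sender_sees_disallowed && !we_have_disallowed;
}

/* Remote side: demote the sender to AISONLY only if it is on its
 * first transition, it is not this node, and this node already has
 * disallowed nodes. */
enum sketch_state sketch_sender_state(bool sender_first_trans,
				      bool sender_is_us,
				      bool we_have_disallowed)
{
	if (sender_first_trans && !sender_is_us && we_have_disallowed)
		return SKETCH_AISONLY;
	return SKETCH_MEMBER;
}

/* Walks through the cases the commit message describes. */
void sketch_self_check(void)
{
	/* An established member receiving a transition message stays put;
	 * without the first_trans guard this case would have been fatal. */
	assert(!sketch_must_leave(false, true, false));
	/* A genuinely joining node that lacks the full state must leave. */
	assert(sketch_must_leave(true, true, false));

	/* An existing member re-sending its state is not demoted. */
	assert(sketch_sender_state(false, false, true) == SKETCH_MEMBER);
	/* A genuinely new node joining an inconsistent cluster is. */
	assert(sketch_sender_state(true, false, true) == SKETCH_AISONLY);
	/* This node never demotes itself. */
	assert(sketch_sender_state(true, true, true) == SKETCH_MEMBER);
}
```

In both predicates the first_trans term is the only thing that separates
the "joining" case from the "already a member, just reporting state" case,
which is the distinction the commit message says was missing.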