From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (qmail 28001 invoked by alias); 14 Mar 2007 04:28:32 -0000
Received: (qmail 27984 invoked by uid 9478); 14 Mar 2007 04:28:32 -0000
Date: Wed, 14 Mar 2007 04:28:00 -0000
Message-ID: <20070314042832.27983.qmail@sourceware.org>
From: jbrassow@sourceware.org
To: cluster-cvs@sources.redhat.com
Subject: cluster/cmirror-kernel/src dm-cmirror-client.c ...
Mailing-List: contact cluster-cvs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: 
List-Subscribe: 
List-Post: 
List-Help: 
Sender: cluster-cvs-owner@sourceware.org
X-SW-Source: 2007-q1/txt/msg00592.txt.bz2

CVSROOT:	/cvs/cluster
Module name:	cluster
Branch:		RHEL4
Changes by:	jbrassow@sourceware.org	2007-03-14 04:28:32

Modified files:
	cmirror-kernel/src: dm-cmirror-client.c dm-cmirror-server.c

Log message:
	Bug 231230: leg failure on cmirrors causes devices to be stuck in SUSPE...

	The problem here appears to be timeouts related to clvmd.  During
	failures under heavy load, clvmd commands (suspend/resume/
	activate/deactivate) can take a long time, and clvmd assumes too
	quickly that they have failed.  This leaves the fault handling
	half done.  Further calls to vgreduce (by hand or by dmeventd)
	will not help, because the _on-disk_ version of the meta-data is
	already consistent - that is, the faulty device has been removed.

	The most significant change in this patch is the removal of the
	'is_remote_recovering' function.  This function was designed to
	check whether a remote node was recovering a region, so that
	writes to that region could be delayed.  However, even with this
	function, it was possible for a remote node to begin recovery on
	a region _after_ the function was called but before the write
	(mark request) took place.  Because of this, the check is done
	during the mark request stage - rendering the call to
	'is_remote_recovering' meaningless.  Since the function serves no
	purpose, it has been removed.
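	The race described above can be illustrated with a small,
	self-contained model.  This is NOT the actual dm-cmirror code;
	the struct, field, and function names below are invented for the
	sketch, and the single-threaded "interleaving" stands in for what
	two cluster nodes would do concurrently:

	```c
	/* Hypothetical model of the check-then-write race - not dm-cmirror code. */
	#include <stdbool.h>
	#include <stdio.h>

	struct region {
	        bool recovering;        /* set when a (remote) node recovers the region */
	};

	/* Pre-check, analogous to the removed 'is_remote_recovering': its
	 * answer can go stale the instant it is returned. */
	static bool is_remote_recovering(const struct region *r)
	{
	        return r->recovering;
	}

	/* Mark request: the point where the write is actually admitted.
	 * Checking here closes the window, because the region state is
	 * examined at the same step that grants the write. */
	static bool mark_region(const struct region *r)
	{
	        if (r->recovering)
	                return false;   /* delay the write */
	        return true;            /* write may proceed */
	}

	int main(void)
	{
	        struct region r = { .recovering = false };

	        /* The pre-check passes ... */
	        bool precheck_said_safe = !is_remote_recovering(&r);

	        /* ... then a remote node begins recovery on the region ... */
	        r.recovering = true;

	        /* ... so only the check at mark-request time catches it. */
	        bool mark_granted = mark_region(&r);

	        printf("pre-check said safe: %s\n", precheck_said_safe ? "yes" : "no");
	        printf("mark request granted: %s\n", mark_granted ? "yes" : "no");
	        return 0;
	}
	```

	Because the mark-request check is sufficient on its own, the
	pre-check adds cost without adding safety - which is why the
	patch removes it.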
	The benefits of its removal are increased performance and a much
	faster (more than an order of magnitude) response during the
	mirror suspend process.  The faster suspend process leads to
	fewer clvmd timeouts and a reduced probability that bug 231230
	will be triggered.

	However, when a mirror device is reconfigured, the mirror
	sub-devices are removed.  This is done by activating them
	cluster-wide before their removal.  Under a high enough load
	during recovery, these operations can still take a long time -
	even though they are linear devices.  This, too, has the
	potential to cause clvmd to time out and trigger bug 231230.
	There is no cluster logging fix for this issue; the cause of the
	delay on the linear devices must still be determined.  A
	temporary work-around is to increase the timeout of clvmd
	(e.g. clvmd -t #).

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cmirror-kernel/src/dm-cmirror-client.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.1.2.40&r2=1.1.2.41
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cmirror-kernel/src/dm-cmirror-server.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.1.2.25&r2=1.1.2.26
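	For reference, the work-around mentioned in the log message would
	look roughly like the fragment below.  The -t option is the
	command timeout documented in clvmd(8); the value 180 is an
	arbitrary example, not a recommendation from this patch:

	```
	# Start clvmd with a longer command timeout (seconds) so that
	# slow suspend/resume/activate/deactivate operations during
	# recovery are less likely to be declared failed prematurely.
	clvmd -t 180
	```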