Date: Thu, 14 Jul 2011 17:01:00 -0000
From: jbrassow@sourceware.org
To: lvm-devel@redhat.com, lvm2-cvs@sourceware.org
Subject: LVM2/doc lvm2-raid.txt

CVSROOT:	/cvs/lvm2
Module name:	LVM2
Changes by:	jbrassow@sourceware.org	2011-07-14 17:01:00

Added files:
	doc            : lvm2-raid.txt

Log message:
	LVM2 RAID design doc

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/LVM2/doc/lvm2-raid.txt.diff?cvsroot=lvm2&r1=NONE&r2=1.1

/cvs/lvm2/LVM2/doc/lvm2-raid.txt,v  -->  standard output
revision 1.1

=======================
= LVM RAID Design Doc =
=======================

#############################
# Chapter 1: User-Interface #
#############################

***************** CREATING A RAID DEVICE ******************

01: lvcreate --type <RAID type> \
02:	[--regionsize <size>] \
03:	[-i/--stripes <#>] [-I/--stripesize <size>] \
04:	[-m/--mirrors <#>] \
05:	[--[min|max]recoveryrate <rate>] \
06:	[--stripecache <size>] \
07:	[--writemostly <devices>] \
08:	[--maxwritebehind <count>] \
09:	[[no]sync] \
10:	<the usual size/name/VG arguments> \
11:	[devices]

Line 01:
I don't intend for there to be shorthand options for specifying the
segment type. The available RAID types are:
	"raid0"    - Stripe [NOT IMPLEMENTED]
	"raid1"    - should replace DM Mirroring
	"raid10"   - striped mirrors [NOT IMPLEMENTED]
	"raid4"    - RAID4
	"raid5"    - Same as "raid5_ls" (same default as MD)
	"raid5_la" - RAID5 Rotating parity 0 with data continuation
	"raid5_ra" - RAID5 Rotating parity N with data continuation
	"raid5_ls" - RAID5 Rotating parity 0 with data restart
	"raid5_rs" - RAID5 Rotating parity N with data restart
	"raid6"    - Same as "raid6_zr"
	"raid6_zr" - RAID6 Rotating parity 0 with data restart
	"raid6_nr" - RAID6 Rotating parity N with data restart
	"raid6_nc" - RAID6 Rotating parity N with data continuation
The exception to 'no shorthand options' will be where the RAID implementations
can displace traditional targets. This is the case with 'mirror' and 'raid1'.
In these cases, a switch will exist in lvm.conf allowing the user to specify
which implementation they want. When this is in place, the segment type is
inferred from the argument - '-m', for example.

Line 02:
Region size is relevant for all RAID types. It defines the granularity at
which the bitmap will track the active areas of disk. The default is
currently 4MiB. I see no reason to change this unless it is a problem for
MD performance. MD does impose a restriction of 2^21 regions for a given
device, however. This means two things: 1) we should never need a metadata
area larger than 8kiB+sizeof(superblock)+bitmap_offset (IOW, pretty small)
and 2) the region size will have to be upwardly revised if the device is
larger than 8TiB (assuming defaults).

Line 03/04:
The '-m/--mirrors' option is only relevant to RAID1 and will be used just
like it is today for DM mirroring. For all other RAID types, -i/--stripes
and -I/--stripesize are relevant. The former specifies the number of data
devices that will be used for striping. For example, if the user specifies
'--type raid0 -i 3', then 3 devices are needed. If the user specifies
'--type raid6 -i 3', then 5 devices are needed. The -I/--stripesize may be
confusing to MD users, as they use the term "chunksize". I think they will
adapt without issue, and I don't wish to create a conflict with the term
"chunksize" that we use for snapshots.
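
To make the device-count arithmetic concrete, here is a minimal sketch (a
hypothetical helper, not code from the LVM tree) showing how the total
number of devices follows from the requested stripe count plus the parity
devices implied by the RAID type:

	/* Hypothetical illustration only - not part of the LVM codebase. */
	#include <assert.h>

	/* Total devices = data stripes + parity devices of the segment type. */
	static unsigned raid_total_devices(unsigned stripes, unsigned parity_devs)
	{
		return stripes + parity_devs;
	}

	static void device_count_examples(void)
	{
		assert(raid_total_devices(3, 0) == 3); /* raid0:   no parity  */
		assert(raid_total_devices(3, 1) == 4); /* raid4/5: one parity */
		assert(raid_total_devices(3, 2) == 5); /* raid6:   two parity */
	}
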
Line 05/06/07/08:
I'm still not clear on how to specify these options. Some are easier than
others. '--writemostly' is particularly hard because it involves specifying
which devices shall be 'write-mostly' and thus also have 'max-write-behind'
applied to them. It has been suggested that a '--readmostly'/'--readfavored'
or similar option could be introduced as a way to specify a primary disk vs.
specifying all the non-primary disks via '--writemostly'. I like this idea,
but haven't come up with a good name yet. Thus, these will remain
unimplemented until a future specification.

Line 09/10/11:
These are familiar.

Further creation-related ideas:
Today, you can specify '--type mirror' without an '-m/--mirrors' argument
being necessary. The number of devices defaults to two (and the log defaults
to 'disk'). A similar thing should happen with the RAID types. All of them
should default to having two data devices unless otherwise specified. This
would mean a total of 2 devices for RAID 0/1, 3 devices for RAID 4/5, and
4 devices for RAID 6/10.


***************** CONVERTING A RAID DEVICE ******************

01: lvconvert [--type <RAID type>] \
02:	[-R/--regionsize <size>] \
03:	[-i/--stripes <#>] [-I/--stripesize <size>] \
04:	[-m/--mirrors <#>] \
05:	[--splitmirrors <#>] \
06:	[--replace <sub_lv|device>] \
07:	[--[min|max]recoveryrate <rate>] \
08:	[--stripecache <size>] \
09:	[--writemostly <devices>] \
10:	[--maxwritebehind <count>] \
11:	vg/lv
12:	[devices]

lvconvert should work exactly as it does now when dealing with mirrors -
even if (when) we switch to MD RAID1. Of course, there are no plans to make
the presence of the metadata area configurable (e.g. --corelog). It will be
simple enough to detect whether the LV being up/down-converted is new or
old-style mirroring.

If we choose to use MD RAID0 as well, it will be possible to change the
number of stripes and the stripesize. It is therefore conceivable to see
something like 'lvconvert -i +1 vg/lv'.

Line 01:
It is possible to change the RAID type of an LV - even if that LV is already
a RAID device of a different type. For example, you could change from
RAID4 to RAID5, or RAID5 to RAID6.

Line 02/03/04/05:
These are familiar options - all of which would now be available as options
for change. (However, it'd be nice if we didn't have regionsize in there.
It's simple on the kernel side, but it is just an extra - often unnecessary -
parameter to many functions in the LVM codebase.)

Line 06:
This option allows the user to specify a sub_lv (e.g. a mirror image) or
a particular device for replacement. The device (or all the devices in
the sub_lv) will be removed and replaced with different devices from the
VG.

Line 07/08/09/10:
It should be possible to alter these parameters of a RAID device. As with
lvcreate, however, I'm not entirely certain how best to define some of these.
We don't need all the capabilities at once, though, so it isn't a pressing
issue.

Line 11:
The LV to operate on.

Line 12:
Devices that are to be used to satisfy the conversion request. If the
operation removes devices or splits a mirror, then the devices specified
form the list of candidates for removal. If the operation adds or replaces
devices, then the devices specified form the list of candidates for
allocation.



###############################################
# Chapter 2: LVM RAID internal representation #
###############################################

The internal representation is somewhat like mirroring, but with alterations
for the different metadata components. LVM mirroring has a single log LV,
but RAID will have one for each data device. Because of this, I've added a
new 'areas' list to 'struct lv_segment' - 'meta_areas'. There is exactly a
one-to-one relationship between 'areas' and 'meta_areas'. The 'areas' array
still holds the data sub-LVs (similar to mirroring), while the 'meta_areas'
array holds the metadata sub-LVs (akin to the mirroring log device).
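
As a rough sketch of that one-to-one pairing (hypothetical, simplified types -
the real 'struct lv_segment' in the LVM tree has many more members and does
not look exactly like this):

	/* Simplified, hypothetical rendering - not the actual LVM2 headers. */
	#include <stddef.h>
	#include <stdint.h>

	struct logical_volume;                       /* opaque for this sketch */

	struct lv_segment_sketch {
		uint32_t area_count;                 /* number of data devices */
		struct logical_volume **areas;       /* data sub-LVs:     <lv>_rimage_N */
		struct logical_volume **meta_areas;  /* metadata sub-LVs: <lv>_rmeta_N  */
	};

	/* areas[s] and meta_areas[s] always describe the same stripe 's'. */
	static struct logical_volume *
	metadata_lv_for(const struct lv_segment_sketch *seg, uint32_t s)
	{
		return (s < seg->area_count) ? seg->meta_areas[s] : NULL;
	}
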
The sub-LVs will be named '%s_rimage_%d' instead of '%s_mimage_%d' as it is
for mirroring, and '%s_rmeta_%d' instead of '%s_mlog'. Thus, you can imagine
an LV named 'foo' with the following layout:
foo
[foo's lv_segment]
|
|-> foo_rimage_0 (areas[0])
|	[foo_rimage_0's lv_segment]
|-> foo_rimage_1 (areas[1])
|	[foo_rimage_1's lv_segment]
|
|-> foo_rmeta_0 (meta_areas[0])
|	[foo_rmeta_0's lv_segment]
|-> foo_rmeta_1 (meta_areas[1])
|	[foo_rmeta_1's lv_segment]

LVM Meta-data format
--------------------
The RAID format will need to be able to store parameters that are unique to
RAID and unique to specific RAID sub-devices. It will be modeled after that
of mirroring.

Here is an example of the mirroring layout:
lv {
	id = "agL1vP-1B8Z-5vnB-41cS-lhBJ-Gcvz-dh3L3H"
	status = ["READ", "WRITE", "VISIBLE"]
	flags = []
	segment_count = 1

	segment1 {
		start_extent = 0
		extent_count = 125	# 500 Megabytes

		type = "mirror"
		mirror_count = 2
		mirror_log = "lv_mlog"
		region_size = 1024

		mirrors = [
			"lv_mimage_0", 0,
			"lv_mimage_1", 0
		]
	}
}

The real trick is dealing with the metadata devices. Mirroring has an entry,
'mirror_log', in the top-level segment. This won't work for RAID because
there is a one-to-one mapping between the data devices and the metadata
devices. The mirror devices are laid out in sub-device/le pairs. The 'le'
parameter is redundant since it will always be zero. So for RAID, I have
simply put the metadata and data devices in pairs, without the 'le'
parameter.

RAID metadata:
lv {
	id = "EnpqAM-5PEg-i9wB-5amn-P116-1T8k-nS3GfD"
	status = ["READ", "WRITE", "VISIBLE"]
	flags = []
	segment_count = 1

	segment1 {
		start_extent = 0
		extent_count = 125	# 500 Megabytes

		type = "raid1"
		device_count = 2
		region_size = 1024

		raids = [
			"lv_rmeta_0", "lv_rimage_0",
			"lv_rmeta_1", "lv_rimage_1",
		]
	}
}

The metadata must also be capable of representing the various tunables. We
already have a good example of one from mirroring: region_size.
'max_write_behind', 'stripe_cache', and '[min|max]_recovery_rate' could also
be handled in this way. However, 'write_mostly' cannot be handled in this
way, because it is a characteristic associated with the sub-LVs, not the
array as a whole. In these cases, the status field of the sub-LVs themselves
will hold these flags - the meaning being useful only in the larger context.

New Segment Type(s)
-------------------
I've created a new file, 'lib/raid/raid.c', that will handle the various
RAID types. While there will be a unique segment type for each RAID variant,
they will all share a common backend - the segtype_handler functions and
segtype->flags = SEG_RAID.

I'm also adding a new field to 'struct segment_type': parity_devs. For every
segment_type except RAID4/5/6, this will be 0. This field facilitates
allocation and size calculations. For example, the lvcreate for RAID5 would
look something like:
~> lvcreate --type raid5 -L 30G -i 3 -n my_raid5 my_vg
or
~> lvcreate --type raid5 -n my_raid5 my_vg /dev/sd[bcdef]1

In the former case, the stripe count (3) and device size are computed, and
then 'segtype->parity_devs' extra devices of the same size are allocated. In
the latter case, the number of PVs is determined and 'segtype->parity_devs'
is subtracted off to determine the number of stripes.

This should also work in the case of RAID10, and doing things in this manner
should not affect the way size is calculated via the area_multiple.
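
A minimal sketch of how 'segtype->parity_devs' might be consulted in the
second lvcreate form above (hypothetical names - the real code paths differ):

	/* Hypothetical illustration only - not the actual LVM functions. */

	struct segment_type_sketch {
		const char *name;
		unsigned parity_devs;  /* 0 for raid0/1/10, 1 for raid4/5, 2 for raid6 */
	};

	/*
	 * When an explicit PV list is given, the stripe count is whatever
	 * remains after the parity devices are subtracted. For example,
	 * '/dev/sd[bcdef]1' names 5 PVs; with raid5 (1 parity device) that
	 * yields 4 stripes, and with raid6 (2 parity devices) it yields 3.
	 */
	static unsigned stripes_from_pv_count(const struct segment_type_sketch *segtype,
					      unsigned pv_count)
	{
		return pv_count - segtype->parity_devs;
	}
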
Allocation
----------
When a RAID device is created, metadata LVs must be created along with the
data LVs that will ultimately compose the top-level RAID array. For the
foreseeable future, each metadata LV must reside on the same device as (or
at least on one of the devices that compose) its data LV. We use this
property to simplify the allocation process. Rather than allocating for the
data LVs and then asking for a small chunk of space on the same device (or
the other way around), we simply ask for the aggregate size of the data LV
plus the metadata LV. Once we have the space allocated, we divide it between
the metadata and data LVs. This also greatly simplifies the process of
finding parallel space for all the data LVs that will compose the RAID
array. When a RAID device is resized, we will not need to take the metadata
LV into account, because it will already be present.

Apart from the metadata areas, the other unique characteristic of RAID
devices is the parity device count. The number of parity devices does not
affect the calculation of size-per-device. The 'area_multiple' means nothing
here. The parity devices will simply be the same size as all the other
devices and will also require a metadata LV (i.e. they are treated no
differently than the other devices).

Therefore, to allocate space for RAID devices, we need to know two things:
1) how many parity devices are required and 2) whether an allocated area
needs to be split out for the metadata LVs after finding the space to fill
the request. We simply add these two fields to the 'alloc_handle' data
structure: 'parity_count' and 'alloc_and_split_meta'. These two fields are
set in '_alloc_init'. 'segtype->parity_devs' holds the number of parity
drives and can be copied directly to 'ah->parity_count', and
'alloc_and_split_meta' is set when a RAID segtype is detected and
'metadata_area_count' has been specified. With these two variables set, we
can calculate how many allocated areas we need. Also, the routines that find
the actual space now stop not when they have found ah->area_count areas, but
when they have found (ah->area_count + ah->parity_count) areas.
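
As a rough sketch of that bookkeeping (hypothetical structure and names - the
real 'alloc_handle' and allocation routines are considerably more involved):

	/* Hypothetical illustration only - not the actual allocation code. */
	#include <stdint.h>

	struct alloc_handle_sketch {
		uint32_t area_count;           /* data areas requested (stripes/mirrors) */
		uint32_t parity_count;         /* copied from segtype->parity_devs */
		int      alloc_and_split_meta; /* RAID segtype + metadata_area_count set */
		uint32_t metadata_extents;     /* extents carved off each area for rmeta */
	};

	/* Each allocated area must hold rimage and rmeta together; after the
	 * space is found, the leading 'metadata_extents' are split off for rmeta. */
	static uint32_t extents_needed_per_area(const struct alloc_handle_sketch *ah,
						uint32_t data_extents)
	{
		return data_extents +
		       (ah->alloc_and_split_meta ? ah->metadata_extents : 0);
	}

	/* The search for parallel space stops only once data + parity areas are found. */
	static int enough_areas_found(const struct alloc_handle_sketch *ah,
				      uint32_t found)
	{
		return found >= ah->area_count + ah->parity_count;
	}
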