41_Advanced_Storage_Setup
DRBD makes an excellent partner to SCST for creating highly available disk arrays and/or replicating data to a remote site for business continuity or disaster recovery purposes. ESOS includes a mostly vanilla DRBD install with the full set of user-land tools and DRBD functionality built into the kernel; we are currently using the 8.4 version of DRBD. The DRBD documentation is excellent, so we won't try to duplicate it here; use the official 8.4 DRBD documentation.
Since DRBD resources (starting/stopping) are typically handled by the cluster stack (Pacemaker + Corosync), the stand-alone DRBD service is disabled by default. To set up the DRBD resources on boot, edit the /etc/rc.conf file and set rc.drbd_enable to "YES".
We'll provide a brief example DRBD setup of (2) nodes with one dual-primary resource. Here is the /etc/drbd.d/global_common.conf configuration file we used on both ESOS nodes:
global {
usage-count no;
}
common {
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
startup {
degr-wfc-timeout 120;
outdated-wfc-timeout 2;
}
options {
on-no-data-accessible io-error;
}
disk {
on-io-error detach;
disk-barrier no;
disk-flushes no;
fencing resource-only;
al-extents 3389;
}
net {
protocol C;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
rr-conflict disconnect;
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 512k;
}
}
Here is the resource file (/etc/drbd.d/r0.res) used on each node:
resource r0 {
net {
allow-two-primaries;
}
on bill.mcc.edu {
device /dev/drbd0;
disk /dev/sda;
address 172.16.0.22:7788;
meta-disk internal;
}
on ben.mcc.edu {
device /dev/drbd0;
disk /dev/sda;
address 172.16.0.21:7788;
meta-disk internal;
}
}
Next, run the following on both hosts:
drbdadm create-md r0
drbdadm attach r0
drbdadm syncer r0
drbdadm connect r0
Now run the following on only one of the hosts (assuming empty disk):
drbdadm -- --overwrite-data-of-peer primary r0
The DRBD resource will now be synchronized to the other host; you can check the status with "cat /proc/drbd" (or check the status in the TUI: Back-End Storage -> DRBD Status).
You can now make the "secondary" host a primary:
drbdadm primary r0
You should now have a dual-primary DRBD resource available as /dev/drbd0 on both nodes. You can use this block device node as an SCST device, or use an additional storage management/provisioning layer on top of the DRBD resource (LVM2, software RAID, etc.), or whatever other advanced storage configuration you might dream up.
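As an illustrative sketch of the layered approach (the volume group name here is hypothetical), you could initialize the DRBD device for LVM like this; note that if the dual-primary resource is shared between cluster nodes, you'd also want clvmd running (see the LVM section below):
pvcreate /dev/drbd0
vgcreate drbd_vg0 /dev/drbd0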
The mdadm tool is provided with ESOS to manage Linux software RAID (md) arrays. Since there are so many good guides, articles, and other resources on using mdadm available on the Internet, we won't mention a specific link in this document; simply Google "mdadm howto" and you'll get a whole slew of them. We'll provide a couple of basic examples below. Possible ESOS storage configuration ideas that make use of software RAID include using RAID0 across two different hardware RAID controllers to possibly gain performance, or using software RAID1 across multiple hardware RAID controllers or SCSI disks to increase reliability.
Create a partition on each SCSI disk block device you'd like to use for software RAID ("Linux RAID Autodetect", type 'fd'). You can use either fdisk or parted to create the partitions (both are included with ESOS).
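For example, using parted non-interactively (assuming /dev/sda is one of the disks; repeat for each disk you want in the array, and adjust device names for your system):
parted -s /dev/sda mklabel msdos
parted -s /dev/sda mkpart primary 1MiB 100%
parted -s /dev/sda set 1 raid on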
To create a RAID0 (striped) volume with a 64 KiB chunk size on two disks, run the following command:
mdadm --create /dev/md0 --chunk=64 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
To create a RAID1 (mirrored) volume on two disks (a chunk size does not apply to mirrored arrays), run the following command:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
You can now use these new software RAID block devices as backing for a file system, or directly as a vdisk_blockio SCST device, or in conjunction with another storage management layer (eg, LVM2). The possibilities are limitless!
The Logical Volume Manager (LVM) in Linux is a time-tested, stable piece of software. It's used by default on installs of big-name Linux distributions and has proven its usefulness. In ESOS, LVM is a very helpful layer when used between the back-end storage devices and SCST targets. It allows one to partition and manage back-end storage easily, and it includes advanced features like snapshots. There is a lot of great information on using LVM2 on the web, so we definitely won't try to cover everything here. We'll provide a few brief examples below.
The clvmd daemon is also included with ESOS. This service is utilized when using LVM2 in a cluster; it prevents concurrent metadata updates from different nodes on shared storage (eg, using DRBD). The clvmd service is disabled by default. The clvmd daemon uses DLM for locking and requires a working Corosync cluster. To enable clvmd and dlm_controld, edit /etc/rc.conf and change the values for 'rc.dlm_enable' and 'rc.clvmd_enable' to 'YES'. After you are sure Corosync is running and you have quorum (check it with: corosync-cfgtool -s), start the services:
/etc/rc.d/rc.dlm start
/etc/rc.d/rc.clvmd start
To create a physical volume (PV) for LVM using an entire SCSI disk (/dev/sdc):
pvcreate -v /dev/sdc
Next you can create a volume group (VG) for LVM using the PV we created above:
vgcreate -v big_space_1 /dev/sdc
Now we can create a 500 GB logical volume (LV) called "small_vmfs_1" that can be used with SCST:
lvcreate -v -L 500G -n small_vmfs_1 big_space_1
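Since snapshots were mentioned above, here is a hedged example of taking a 50 GB snapshot of that logical volume (the snapshot name and size are only illustrative):
lvcreate -v -s -L 50G -n small_vmfs_1_snap big_space_1/small_vmfs_1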
These were a few very basic examples that should show you how to get started with LVM2; this software is very powerful and flexible and can be used in a number of different ways. Read up on the LVM2 documentation (and man pages)!
The LSI Logic "MegaCLI" utility is an install option with ESOS (during the installation script). The utility allows you to create/delete/modify volumes (logical drives) on your MegaRAID controller. Below are a few examples of using the MegaCLI tool (MegaCli64 for us). A nice handy cheat sheet is located here. Or for a very in-depth document, consult the user guide available on LSI Logic's web site.
Get adapter information for the first MegaRAID adapter (0):
MegaCli64 -AdpAllInfo -a0
A list of all the adapter's physical drives:
MegaCli64 -PDList -a0
All of the logical drives for the adapter:
MegaCli64 -LDInfo -Lall -a0
Delete logical drive 0:
MegaCli64 -CfgLdDel -L0 -a0
Create a new RAID5 logical drive, with three physical disks, adaptive read-ahead (ADRA) and write cache enabled (WB):
MegaCli64 -CfgLDAdd -R5[8:0,8:1,8:2] WB ADRA -a0
Create a new RAID0 logical drive, with two physical disks, no read-ahead (NORA) and write cache disabled (WT):
MegaCli64 -CfgLDAdd -R0[8:0,8:1] WT NORA -a0
After you have created your new volume(s), you will need to find and record the SCSI device node (eg, /dev/sdz). You can easily find this using lsscsi or by checking dmesg.
The arcconf utility is also an install option with ESOS for configuring Adaptec RAID controllers from inside the OS. This tool should work with most/all Adaptec SATA/SAS RAID controllers.
See all controller/drive/volume information for controller # 1:
arcconf GETCONFIG 1
Delete logical device (volume) 1 on controller 1:
arcconf DELETE 1 LOGICALDRIVE 1
Make a new RAID5 volume on controller 1 using three disks (channel 0, device numbers 2, 3, 4) with read cache enabled and write cache enabled:
arcconf CREATE 1 LOGICALDRIVE Rcache RON Wcache WB MAX 5 0 2 0 3 0 4
Make a new RAID0 volume on controller 1 using two disks (channel 0, device numbers 2, 3) with read cache disabled and write cache disabled:
arcconf CREATE 1 LOGICALDRIVE Rcache ROFF Wcache WT MAX 0 0 2 0 3
Once you've created a new volume on your Adaptec RAID controller, grab the SCSI device node (lsscsi works well) and continue with one of the target configuration sections below.
Other RAID controllers are supported, however, not all of them necessarily have a CLI tool for configuring volumes, adapter settings, etc. from inside of ESOS. See the 03_Supported_Hardware wiki page for a current list of supported controllers and possible corresponding CLI utilities.
You can still use these other controllers with ESOS; you will just need to configure your volumes / logical drives "outside" of ESOS -- via the controller's BIOS utility -- or seek out the documentation for using the CLI tools on your own.
If you find that your favorite "enterprise class" RAID controller is not supported by ESOS, please let us know on the esos-users Google Group. It would also be helpful to know if there are any CLI management tools that can be used to configure these controllers from inside the OS.
ESOS includes the mhVTL software (virtual tape library); this, combined with data de-duplication, makes an excellent traditional tape library (eg, DLT, LTO, etc.) replacement. You can use this virtual tape library on your Storage Area Network (SAN -> Fibre Channel, iSCSI, etc.). The mhVTL service is disabled by default -- to enable it, simply edit the /etc/rc.conf file and set 'rc.mhvtl_enable' to "YES".
The data storage location (mount point in the file system) is hard-coded in the software (/mnt/mhvtl). You'll need to create a new back-end storage device and file system, then mount it on /mnt/mhvtl and update the fstab file.
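A minimal sketch of preparing that backing file system, assuming /dev/sdb is a spare RAID volume or disk (the device node and file system type here are only illustrative -- substitute your own):
mkfs.ext4 /dev/sdb
mkdir -p /mnt/mhvtl
mount /dev/sdb /mnt/mhvtl
The matching /etc/fstab entry might look something like this:
/dev/sdb /mnt/mhvtl ext4 defaults 0 0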
If you are going to use lessfs for de-duplication as the mhVTL storage backing, you will want to use a separate persistent storage device (RAID volume) for the lessfs configuration data and database. You will then use the /mnt/mhvtl location for a separate underlying back-end storage file system, and then use the same location as the lessfs mount point. The key point here is that you need two separate persistent file systems for this setup with lessfs and mhVTL: one for the lessfs data, and one for the mhVTL data.
We'll present a very basic mhVTL configuration on this page. See the mhVTL page for additional setup/configuration information.
First, create the /etc/mhvtl/device.conf file:
VERSION: 3
Library: 10 CHANNEL: 0 TARGET: 1 LUN: 0
Vendor identification: SPECTRA
Product identification: PYTHON
Product revision level: 5500
Unit serial number: XYZZY_10
NAA: 10:22:33:44:ab:cd:ef:00
Drive: 11 CHANNEL: 0 TARGET: 1 LUN: 1
Library ID: 10 Slot: 1
Vendor identification: QUANTUM
Product identification: SDLT600
Product revision level: 5500
Unit serial number: XYZZY_11
NAA: 10:22:33:44:ab:cd:ef:01
VPD: b0 04 00 02 01 00
Drive: 12 CHANNEL: 0 TARGET: 1 LUN: 2
Library ID: 10 Slot: 2
Vendor identification: QUANTUM
Product identification: SDLT600
Product revision level: 5500
Unit serial number: XYZZY_12
NAA: 10:22:33:44:ab:cd:ef:02
VPD: b0 04 00 02 01 00
Next, create the /etc/mhvtl/library_contents.10 file:
VERSION: 2
Drive 1:
Drive 2:
Picker 1:
MAP 1:
MAP 2:
MAP 3:
MAP 4:
Slot 01: L10001S3
Slot 02: L10002S3
Slot 03: L10003S3
Slot 04: L10004S3
Slot 05: L10005S3
Slot 06: L10006S3
Slot 07: L10007S3
Slot 08: L10008S3
Slot 09: L10009S3
Slot 10: L10010S3
You can now start the mhVTL service in an ESOS shell:
/etc/rc.d/rc.mhvtl start
You should now have a mhVTL virtual tape library running on your ESOS storage server! You can check that the robot/drives are available with the lsscsi -g command. After the VTL is configured, you'll use the 51_Device_Configuration wiki page to create the corresponding SCST devices.
De-duplication in ESOS is handled by lessfs, a virtual file system for FUSE. The lessfs file system is mounted on top of a "normal" file system (eg, ext3, xfs, etc.) and provides de-duplication, with optional compression and encryption. You can then use this lessfs file system as your back-end storage for mhVTL or SCST FILEIO devices, seamlessly providing de-duplication.
A separate, unique lessfs configuration file is needed for each lessfs file system. One configuration file can not handle multiple lessfs instances. A database (local files, Berkeley DB) is used for each lessfs file system, which is configured using the lessfs configuration file.
The location for the lessfs database files and configuration file needs to be a persistent attached storage device on ESOS (eg, logical drive on local RAID controller). Do not use any locations on the esos_root file system (/) for storing lessfs configuration files, or databases!
A typical setup for lessfs looks like this:
- Create a new back-end storage file system using the TUI. Example mount point: /mnt/vdisks/test_fs_1
- You would then create your lessfs configuration file here: /mnt/vdisks/test_fs_1/lessfs.cfg
- Create these three directories: /mnt/vdisks/test_fs_1/mta /mnt/vdisks/test_fs_1/dta /mnt/vdisks/test_fs_1/data
Here is an example lessfs configuration file (/mnt/vdisks/test_fs_1/lessfs.cfg):
DEBUG=5
HASHNAME=MHASH_TIGER192
HASHLEN=24
BLOCKDATA_IO_TYPE=file_io
BLOCKDATA_PATH=/mnt/vdisks/test_fs_1/dta/blockdata.dta
META_PATH=/mnt/vdisks/test_fs_1/mta
META_BS=1048576
CACHESIZE=512
COMMIT_INTERVAL=10
LISTEN_IP=127.0.0.1
LISTEN_PORT=100
MAX_THREADS=16
DYNAMIC_DEFRAGMENTATION=on
COREDUMPSIZE=2560000000
SYNC_RELAX=0
BACKGROUND_DELETE=on
ENCRYPT_DATA=off
ENCRYPT_META=off
ENABLE_TRANSACTIONS=on
BLKSIZE=131072
COMPRESSION=snappy
Create the new lessfs file system:
mklessfs -f -c /mnt/vdisks/test_fs_1/lessfs.cfg
Now add this line to your /etc/fstab file; be sure this line is after your normal, back-end file system that lessfs sits on top of:
lessfs#/mnt/vdisks/test_fs_1/lessfs.cfg /mnt/vdisks/test_fs_1/data fuse defaults 0 0
You can now mount your lessfs file system:
mount /mnt/vdisks/test_fs_1/data
You now have a file system that supports inline data de-duplication and can be used for virtual disk files (vdisk_fileio). If you are going to use lessfs in conjunction with mhVTL, you should use a separate storage device for the lessfs configuration and metadata (database files) since the VTL path is static. See the lessfs web site for additional documentation and an explanation of the configuration parameters.
Several different block-level (layer) caching solutions exist in ESOS. These software options allow you to use some type of fast storage (eg, an SSD) as a caching device to improve the performance of some other lower-end (probably large) storage. These are similar to, or an alternative to, controller-side (hardware) caching options like MegaRAID CacheCade or Adaptec maxCache. At the time of writing, Enterprise Storage OS includes the following options:
- bcache http://bcache.evilpiepirate.org/
- dm-cache http://visa.cs.fiu.edu/tiki/dm-cache
- EnhanceIO https://github.com/stec-inc/EnhanceIO
- lvmcache https://sourceware.org/lvm2/
We'll attempt to give a brief setup example for each of these; please consult the project web sites above for additional information.
For bcache, you'll first need to identify a caching device (like an SSD drive, or array of SSDs) and a backing device (a slow hard drive, or RAID array of spinning disks). The steps go something like this:
- Create and register the caching device.
- Create and register the backing device.
- Attach the caching device to the backing device.
- You can now use the new bcache block device as any other block device in ESOS (eg, create a virtual disk file system for vdisk_fileio, use the raw block device for LVM, directly with vdisk_blockio, etc.).
Here is a real example of the commands for a bcache device; '/dev/sdc' is the caching device, '/dev/sdd' is the backing device, and the UUID comes from 'cset.uuid' in the bcache-super-show command:
make-bcache -C /dev/sdc
echo "/dev/sdc" > /sys/fs/bcache/register
make-bcache -B /dev/sdd
echo "/dev/sdd" > /sys/fs/bcache/register
bcache-super-show /dev/sdc
echo "6d4ab278-0844-4a50-8e74-87aeda4fd353" > /sys/block/sdd/bcache/attach
You should now have a /dev/bcacheX device node that you can use.
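If you want writeback caching instead of bcache's default writethrough mode, you can switch it via sysfs; a quick sketch, assuming the new device came up as bcache0:
echo writeback > /sys/block/bcache0/bcache/cache_mode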
Now we'll take a look at setting up dm-cache. Two different block devices (or segments) are needed for dm-cache: (1) for the cache metadata, and (1) for the cache regions. For this example setup, we created both on a single SSD-backed volume using LVM. More in-depth examples are available online; for the example shown below, the metadata size was not taken into account.
pvcreate /dev/sda
vgcreate ssd_vg /dev/sda
lvcreate -L 10G -n ssd_metadata ssd_vg
lvcreate -L 150G -n ssd_blocks ssd_vg
pvcreate /dev/sdb
vgcreate slow_disk_vg /dev/sdb
lvcreate -L 100G -n data_vol slow_disk_vg
blockdev --getsz /dev/mapper/slow_disk_vg-data_vol
dmsetup create cached_dev --table '0 209715200 cache /dev/mapper/ssd_vg-ssd_metadata /dev/mapper/ssd_vg-ssd_blocks /dev/mapper/slow_disk_vg-data_vol 512 1 writeback default 0'
You would now have a '/dev/mapper/cached_dev' device node that can be used for a partition table & file system, raw block device, etc. To make dm-cache devices persist across reboots, you'll need to enable the rc script (rc.dmcache) in the /etc/rc.conf file. Then you'll need to add the commands to create/destroy the dm-cache device(s) using files "/etc/dm-cache.start" and "/etc/dm-cache.stop"; below are examples following suit from above.
/etc/dm-cache.start:
dmsetup create cached_dev --table '0 209715200 cache /dev/mapper/ssd_vg-ssd_metadata /dev/mapper/ssd_vg-ssd_blocks /dev/mapper/slow_disk_vg-data_vol 512 1 writeback default 0'
dmsetup resume cached_dev
/etc/dm-cache.stop:
dmsetup suspend cached_dev
dmsetup remove cached_dev
The setup procedure for EnhanceIO cache devices is pretty clear-cut; the source or backing device (the device you want to "enhance") can already contain data and even have a mounted file system while adding/deleting a cache. The eio_cli tool that comes with ESOS is a special version that supports non-udev setups (like mdev in ESOS). EnhanceIO is disabled in ESOS by default; edit the /etc/rc.conf file and set 'rc.eio_enable' to "YES". Next, you'll need to set up your cache device using the eio_cli tool (be sure to always use the "-u" option to disable support for udev):
eio_cli create -u -d /dev/disk-by-id/SERIAL-B8CEA82A -s /dev/disk-by-id/SERIAL-A65CBA25 -m wb -c my_cache
Your backing/source device ("/dev/disk-by-id/SERIAL-B8CEA82A" in this example) is now enhanced! The configuration file that eio_cli and rc.eio use is located here: /etc/eio.conf
You can also use the LVM interface to device-mapper cache (dm-cache). Using it via this method is much simpler than setting up dm-cache directly (as shown above). First you'll need to make sure LVM is enabled at boot (set "rc.lvm2_enable" to "YES" in /etc/rc.conf).
For this lvmcache setup example, we'll be using (1) SSD SCSI disk (our cache), and (1) 7.2K NL SAS SCSI disk (our backing disk).
Make these SCSI disks into LVM PVs and add both devices to the same volume group (VG):
pvcreate -v /dev/sdb /dev/sdc
vgcreate -v VolumeGroup1 /dev/sdb /dev/sdc
You then need to create (3) logical volumes and allocate each one to a specific physical disk.
Create a logical volume to use as cache and assign it to the SSD disk (/dev/sdb):
lvcreate -L 950GB -n lv1_cache VolumeGroup1 /dev/sdb
Create a logical volume to use as the cache metadata and assign it to the SSD (/dev/sdb -- this needs to be about a 1000:1 split):
lvcreate -L 1GB -n lv1_cache_meta VolumeGroup1 /dev/sdb
Create a logical volume to use as the data disk and assign it to the SAS 7.2K NL disk (/dev/sdc):
lvcreate -L 2TB -n lv1_data VolumeGroup1 /dev/sdc
Now we need to convert the (2) cache volumes into a "cache pool" (this will add lv1_cache to a cache pool using lv1_cache_meta as the metadata):
lvconvert --type cache-pool --poolmetadata VolumeGroup1/lv1_cache_meta VolumeGroup1/lv1_cache
Finally, attach the cache pool to the data volume -- your volume will now be cached:
lvconvert --type cache --cachepool VolumeGroup1/lv1_cache VolumeGroup1/lv1_data
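To verify the cache pool is attached, listing all logical volumes (including the hidden internal ones) should show the cache pool and its data/metadata sub-volumes, for example:
lvs -a -o +devices VolumeGroup1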
How to "un-cache" a logical volume... all you need to do is remove the cache pool logical volume (LVM will then copy the unwritten data to the data drive then remove the cache and metadata volumes):
lvremove VolumeGroup1/lv1_cache
To add the cache back in, you will need to recreate the cache pool from scratch and assign it back to the logical volume.
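For reference, re-adding the cache is simply a repeat of the steps above (same example names and sizes as before):
lvcreate -L 950GB -n lv1_cache VolumeGroup1 /dev/sdb
lvcreate -L 1GB -n lv1_cache_meta VolumeGroup1 /dev/sdb
lvconvert --type cache-pool --poolmetadata VolumeGroup1/lv1_cache_meta VolumeGroup1/lv1_cache
lvconvert --type cache --cachepool VolumeGroup1/lv1_cache VolumeGroup1/lv1_data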
BTIER configuration should be similar to that of other Linux distributions. Decent articles on configuring BTIER are available online; skip past the building/installation part and start with configuring BTIER devices.
As an example configuration, assuming we have a Linux RAID (md) RAID1 volume that consists of two SSDs (md0) and a Linux RAID RAID5 volume that contains several SATA drives (md1), then we can create a BTIER device like this:
btier_setup -f /dev/md0:/dev/md1 -B -c
Please note: The "-c" flag is only used when the BTIER device is initially created. Using "-c" writes the initial metadata to the underlying disks. The system should now show a new device: /dev/sdtiera
Use this block device as you would normally when creating SCST devices. Add the following line to the "/etc/bttab" configuration file to make the BTIER device persist between reboots:
/dev/md0:/dev/md1
Special thanks to Riccardo Bicelli for creating a BTIER resource agent (RA) for use with Pacemaker. This RA is included in ESOS; here is his original post for the BTIER RA: http://think-brick.blogspot.it/2014/09/btier-resource-agents-for-pacemaker.html
Example usage in ESOS:
crm
cib new btier
configure primitive p_btier ocf:esos:btier \
params tier_devices="/dev/sda:/dev/sdb" \
device_name="mybtierdev01"
op monitor interval="10s"
configure show
cib commit btier
quit
In ESOS, you can use a Ceph RBD image as a back-end block device (mapped to). You can then treat this as a normal block device and use it with vdisk_blockio, or put a file system on it and use vdisk_fileio.
Edit the /etc/ceph/ceph.conf file and add your monitors (nodes) to the configuration file; this allows your Ceph cluster to be discovered. Make sure you have a mon_host line in /etc/ceph/ceph.conf that enumerates all Ceph monitors. Here is an example:
mon_host = 192.168.1.101,192.168.1.102,192.168.1.103
You will also need a client key ring file (/etc/ceph/ceph.client.keyring):
[client.admin]
key = AQC2WFlTYPvVHhAAuk1jxZ4u86EkMdeUyn6LYA==
Finally configure the /etc/ceph/rbdmap file (pool / image mappings):
rbd/disk01 id=client.admin,keyring=/etc/ceph/ceph.client.keyring
Edit the /etc/rc.conf file and set "rc.rbdmap_enable" to "YES" and then start it:
/etc/rc.d/rc.rbdmap start
If you get any error messages, check the kernel logs (dmesg). See this article if you have any "feature set mismatch" errors: http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client/
After you have setup/configured your advanced back-end storage, it will still appear as a block device, just as described in the basic back-storage wiki document.
With this logical block device, you can now create a file system on it and create virtual disk files, if desired. Follow the same steps as described in the 31_Basic_Back_End_Storage_Setup document for making file systems and adding virtual disk files, but with the advanced back-storage, you'll select your DRBD block device (eg, /dev/drbd0) or whatever advanced block device you configured.
TODO
You should now continue on to the 41_Hosts_and_Initiators wiki page which will guide you through configuring SCST security groups.