41_Advanced_Storage_Setup

Distributed Replicated Block Device (DRBD)

DRBD makes an excellent partner to SCST for creating highly available disk arrays and/or replicating data to a remote site for business continuity or disaster recovery purposes. ESOS includes a mostly vanilla DRBD install, with the full set of user-land tools and the DRBD functionality built into the kernel; we are currently using the 8.4 version of DRBD. The DRBD documentation is excellent, so we won't try to duplicate it here -- use the official DRBD 8.4 documentation.

Since starting/stopping DRBD resources is typically handled by the cluster stack (Pacemaker + Corosync), the stand-alone DRBD service is disabled by default. To set up/configure the DRBD resources on boot, edit the /etc/rc.conf file and set rc.drbd_enable to "YES".
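
As a minimal sketch, the relevant line in /etc/rc.conf would look like the following (match the exact quoting style of the existing entries in your file); the same pattern applies to the other rc.* services mentioned later on this page (multipathd, mhVTL, etc.):

rc.drbd_enable="YES"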

We'll provide a brief example DRBD setup of (2) nodes with one dual-primary resource. Here is the /etc/drbd.d/global_common.conf configuration file we used on both ESOS nodes:

global {
        usage-count no;
}
common {
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        startup {
                degr-wfc-timeout 120;
                outdated-wfc-timeout 2;
        }
        options {
                on-no-data-accessible io-error;
        }
        disk {
                on-io-error detach;
                disk-barrier no;
                disk-flushes no;
                fencing resource-only;
                al-extents 3389;
        }
        net {
                protocol C;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                rr-conflict disconnect;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }
}

Here is the resource file (/etc/drbd.d/r0.res) used on each node:

resource r0 {
        net {
                allow-two-primaries;
        }
        on bill.mcc.edu {
                device     /dev/drbd0;
                disk       /dev/sda;
                address    172.16.0.22:7788;
                meta-disk  internal;
        }
        on ben.mcc.edu {
                device    /dev/drbd0;
                disk      /dev/sda;
                address   172.16.0.21:7788;
                meta-disk internal;
        }
}

Next, run the following on both hosts:

drbdadm create-md r0
drbdadm attach r0
drbdadm syncer r0
drbdadm connect r0
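
As a side note, on DRBD 8.4 the attach/connect steps above can usually be performed in a single step; this is simply a convenience equivalent, not a change to the procedure:

drbdadm up r0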

Now run the following on only one of the hosts (assuming empty disk):

drbdadm -- --overwrite-data-of-peer primary r0

The DRBD resource will now be synchronized to the other host; you can check the status with "cat /proc/drbd" (or check the status in the TUI: Back-End Storage -> DRBD Status).

You can now make the "secondary" host a primary:

drbdadm primary r0

You should now have a dual-primary DRBD resource available as /dev/drbd0 on both nodes. You can use this block device node as an SCST device, or use an additional storage management/provisioning layer on top of the DRBD resource (LVM2, software RAID, etc.), or whatever other advanced storage configuration you might dream up.
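
For example, here is a minimal sketch of layering LVM2 on top of the DRBD resource; the volume group name, logical volume name, and size below are illustrative assumptions only:

pvcreate /dev/drbd0
vgcreate drbd_vg /dev/drbd0
lvcreate -L 100G -n vdisk1 drbd_vg

Keep in mind that with a dual-primary resource, safely activating the same LVM volumes on both nodes requires a cluster-aware LVM configuration (or activating them on only one node at a time).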


DM Multipath

Typically, DM (device mapper) multipath is used on the initiator side to handle multipath I/O (eg, multiple target interfaces): it picks the proper path for I/O, handles path failures, and can even round-robin I/O across multiple paths for performance.

On the target (ESOS) side, multipath is typically used for a couple of different reasons:

  • The ESOS host is used as a gateway between different SAN mediums. Say you have LUNs presented to an ESOS host via Fibre Channel, where ESOS is the Fibre Channel initiator and the volumes are visible as block devices. If multiple paths are presented, you'd run multipath-tools to coalesce the several paths into single logical devices. You could then use ESOS to present those devices to other initiators, say across iSCSI (or whatever SAN medium).
  • Another, more common use is if you have dual-domain SAS disks and/or two servers connected redundantly to SAS enclosures (redundant I/O modules, dual-domain SAS disks). This presents two block devices to ESOS for each physical SAS disk (one per path). You'd then use multipath-tools, which creates one logical block device per disk to use on the ESOS side.

The multipath-tools service is disabled by default -- to enable it, simply edit the /etc/rc.conf file and set 'rc.multipathd_enable' to "YES". Next, you'll need to create a multipath-tools configuration file ('/etc/multipath.conf'), which can vary based on your hardware configuration. At the very least, you'll typically want to exclude (blacklist) the ESOS boot device.

Here is an example '/etc/multipath.conf' configuration file that may work well for dual-domain SAS disks:

blacklist {
    device {
        vendor "SanDisk"
        product "Extreme"
    }
}

defaults {
    polling_interval 5
    path_selector "round-robin 0"
    path_grouping_policy failover
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --device=/dev/%n"
    prio const
    path_checker directio
    rr_min_io 1000
    rr_weight uniform
    failback manual
    no_path_retry fail
    user_friendly_names no
}

Our ESOS USB flash drive is a "SanDisk Extreme", so we blacklist that device in the configuration file, which prevents device-mapper (DM) logical devices from being created for it. You can now start multipath-tools (multipathd) with this command:

/etc/rc.d/rc.multipathd start

Check your logical device layout and path states with this command:

multipath -ll

There are many configuration options for multipath-tools. Search the web for examples, and tweak your configuration as needed for your hardware setup.


ZFS

Native ZFS on Linux is included with ESOS as a build-time option. Currently, the pre-built packages offered in the ESOS downloads area do not include ZFS support due to the controversy surrounding the ZFS license and the GPL. That said, Ubuntu recently made the decision to distribute ZFS with its Linux distribution; our project may follow suit in the future, but for now, you must enable ZFS when building ESOS from source ('--enable-zfs').

There are so many guides and examples available on the internet for using ZFS that we're not going to attempt to cover any of that in this wiki page. Again, ESOS employs the ZFS on Linux project, which provides a native port of ZFS to Linux. Search the web for ZFS administration information if you're not familiar with it, or check out this page, which has several links to other ZFS guides.
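
Strictly as an illustration (the pool name, dataset name, and device nodes below are assumptions, not ESOS conventions), a simple mirrored pool with a de-duplicated dataset might look like this:

zpool create tank mirror /dev/sdb /dev/sdc
zfs create -o dedup=on -o compression=lz4 tank/vdisks
zpool status tank

Note that ZFS de-duplication can consume a large amount of RAM; review the ZFS on Linux documentation before enabling it on sizable pools.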


Virtual Tape Library (VTL)

ESOS includes the mhVTL software (virtual tape library); this, combined with data de-duplication, makes an excellent replacement for a traditional tape library (eg, DLT, LTO, etc.). You can use this virtual tape library on your Storage Area Network (SAN -> Fibre Channel, iSCSI, etc.). The mhVTL service is disabled by default -- to enable it, simply edit the /etc/rc.conf file and set 'rc.mhvtl_enable' to "YES".

The data storage location (mount point in the file system) is hard coded in the software (/mnt/mhvtl). You'll need to create a new back-end storage device and file system, then mount it on /mnt/mhvtl and update the /etc/fstab file.
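
For example, an /etc/fstab entry for the mhVTL storage might look like the following; the device node (a logical drive on a local RAID controller here) and the xfs file system type are assumptions for illustration:

/dev/sdb1    /mnt/mhvtl    xfs    defaults    0 0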

If you are going to use lessfs for de-duplication as the mhVTL storage backing, you will want to use a separate persistent storage device (RAID volume) for the lessfs configuration data and database. You will then use the '/mnt/mhvtl' location for a separate underlying back-end storage file system, and use that same location as the lessfs mount point. The key point here is that you need two separate persistent file systems for this setup with lessfs and mhVTL: one for the lessfs data, and one for the mhVTL data.

We'll present a very basic mhVTL configuration on this page. See the mhVTL page for additional setup/configuration information.

First, create the /etc/mhvtl/device.conf file:

VERSION: 3

Library: 10 CHANNEL: 0 TARGET: 1 LUN: 0
 Vendor identification: SPECTRA
 Product identification: PYTHON
 Product revision level: 5500
 Unit serial number: XYZZY_10
 NAA: 10:22:33:44:ab:cd:ef:00

Drive: 11 CHANNEL: 0 TARGET: 1 LUN: 1
 Library ID: 10 Slot: 1
 Vendor identification: QUANTUM
 Product identification: SDLT600
 Product revision level: 5500
 Unit serial number: XYZZY_11
 NAA: 10:22:33:44:ab:cd:ef:01
 VPD: b0 04 00 02 01 00

Drive: 12 CHANNEL: 0 TARGET: 1 LUN: 2
 Library ID: 10 Slot: 2
 Vendor identification: QUANTUM
 Product identification: SDLT600
 Product revision level: 5500
 Unit serial number: XYZZY_12
 NAA: 10:22:33:44:ab:cd:ef:02
 VPD: b0 04 00 02 01 00

Next, create the /etc/mhvtl/library_contents.10 file:

VERSION: 2

Drive 1:
Drive 2:

Picker 1:

MAP 1:
MAP 2:
MAP 3:
MAP 4:

Slot 01: L10001S3
Slot 02: L10002S3
Slot 03: L10003S3
Slot 04: L10004S3
Slot 05: L10005S3
Slot 06: L10006S3
Slot 07: L10007S3
Slot 08: L10008S3
Slot 09: L10009S3
Slot 10: L10010S3

You can now start the mhVTL service in an ESOS shell:

/etc/rc.d/rc.mhvtl start

You should now have an mhVTL virtual tape library running on your ESOS storage server! You can check that the robot/drives are available with the lsscsi -g command. After the VTL is configured, use the 36_Devices_and_Mappings wiki page to create the corresponding SCST devices.


Inline Data De-duplication (Deprecated)

Using lessfs is deprecated in ESOS and may be removed in future releases. We recommend using ZFS if de-duplication is needed. The information below is provided for reference purposes only.

De-duplication in ESOS is handled by lessfs, a virtual file system for FUSE. The lessfs file system is mounted on top of a "normal" file system (eg, ext3, xfs, etc.) and provides compression and encryption in addition to de-duplication. You can then use this lessfs file system as your back-end storage for mhVTL or SCST FILEIO devices, and de-duplication happens seamlessly underneath.

A separate, unique lessfs configuration file is needed for each lessfs file system; one configuration file cannot handle multiple lessfs instances. A database (local files, Berkeley DB) is used for each lessfs file system, and it is configured via the lessfs configuration file.

The location for the lessfs database files and configuration file needs to be a persistent attached storage device on ESOS (eg, logical drive on local RAID controller). Do not use any locations on the esos_root file system (/) for storing lessfs configuration files, or databases!

A typical setup for lessfs looks like this:

  • Create a new back-end storage file system using the TUI. Example mount point: /mnt/vdisks/test_fs_1
  • You would then create your lessfs configuration file here: /mnt/vdisks/test_fs_1/lessfs.cfg
  • Create these three directories: /mnt/vdisks/test_fs_1/mta /mnt/vdisks/test_fs_1/dta /mnt/vdisks/test_fs_1/data
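
As a quick sketch of the last step above (using the example mount point), the three directories can be created like this:

mkdir -p /mnt/vdisks/test_fs_1/mta /mnt/vdisks/test_fs_1/dta /mnt/vdisks/test_fs_1/data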

Here is an example lessfs configuration file (/mnt/vdisks/test_fs_1/lessfs.cfg):

DEBUG=5
HASHNAME=MHASH_TIGER192
HASHLEN=24
BLOCKDATA_IO_TYPE=file_io
BLOCKDATA_PATH=/mnt/vdisks/test_fs_1/dta/blockdata.dta
META_PATH=/mnt/vdisks/test_fs_1/mta
META_BS=1048576
CACHESIZE=512
COMMIT_INTERVAL=10
LISTEN_IP=127.0.0.1
LISTEN_PORT=100
MAX_THREADS=16
DYNAMIC_DEFRAGMENTATION=on
COREDUMPSIZE=2560000000
SYNC_RELAX=0
BACKGROUND_DELETE=on
ENCRYPT_DATA=off
ENCRYPT_META=off
ENABLE_TRANSACTIONS=on
BLKSIZE=131072
COMPRESSION=snappy

Create the new lessfs file system:

mklessfs -f -c /mnt/vdisks/test_fs_1/lessfs.cfg

Now add this line to your /etc/fstab file; be sure it comes after the entry for the normal back-end file system that lessfs sits on top of:

lessfs#/mnt/vdisks/test_fs_1/lessfs.cfg /mnt/vdisks/test_fs_1/data fuse defaults 0 0

You can now mount your lessfs file system:

mount /mnt/vdisks/test_fs_1/data

You now have a file system that supports inline data de-duplication and can be used for virtual disk files (vdisk_fileio). If you are going to use lessfs in conjunction with mhVTL, you should use a separate storage device for the lessfs configuration and metadata (database files) since the VTL path is static. See the lessfs web site for additional documentation and an explanation of the configuration parameters.


Block Layer Caching

Several different block level (layer) caching solutions exist in ESOS. These software options allow you to use some type of fast storage (eg, an SSD) as a caching device to improve the performance of some other lower-end (probably large) storage. These are similar to, or an alternative to, controller-side (hardware) caching options like MegaRAID CacheCade or Adaptec maxCache. At the time of writing this, Enterprise Storage OS includes the following options:

  • bcache
  • dm-cache (device mapper cache)
  • lvmcache (the LVM interface to dm-cache)
  • EnhanceIO (deprecated)

We'll attempt to give a brief setup example for each of these; please consult each project's web site and documentation for additional information.

bcache

For bcache, you'll first need to identify a caching device (like an SSD drive, or array of SSDs) and a backing device (a slow hard drive, or RAID array of spinning disks). The steps go something like this:

  1. Create and register the caching device.
  2. Create and register the backing device.
  3. Attach the caching device to the backing device.
  4. You can now use the new bcache block device as any other block device in ESOS (eg, create a virtual disk file system for vdisk_fileio, use the raw block device for LVM, directly with vdisk_blockio, etc.).

Here is a real example of the commands for a bcache device; '/dev/sdc' is the caching device, '/dev/sdd' is the backing device, and the UUID comes from the 'cset.uuid' value in the bcache-super-show output:

make-bcache -C /dev/sdc
echo "/dev/sdc" > /sys/fs/bcache/register
make-bcache -B /dev/sdd
echo "/dev/sdd" > /sys/fs/bcache/register
bcache-super-show /dev/sdc
echo "6d4ab278-0844-4a50-8e74-87aeda4fd353" > /sys/block/sdd/bcache/attach

You should now have a /dev/bcacheX device node that you can use.
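
For example (assuming the new device came up as /dev/bcache0, and that you want a file system for vdisk_fileio rather than using the raw device with vdisk_blockio), you could create a file system on it:

mkfs.xfs /dev/bcache0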

dm-cache

Now we'll take a look at setting up dm-cache. Two different block devices (or segments) are needed for dm-cache: (1) for the cache metadata, and (1) for the cache regions. For this example setup, we created both on a single SSD-backed volume using LVM. See this article for a more in-depth example; for the example shown below, metadata size was not taken into account.

pvcreate /dev/sda
vgcreate ssd_vg /dev/sda
lvcreate -L 10G -n ssd_metadata ssd_vg
lvcreate -L 150G -n ssd_blocks ssd_vg
pvcreate /dev/sdb
vgcreate slow_disk_vg /dev/sdb
lvcreate -L 100G -n data_vol slow_disk_vg
blockdev --getsz /dev/mapper/slow_disk_vg-data_vol
dmsetup create cached_dev --table '0 209715200 cache /dev/mapper/ssd_vg-ssd_metadata /dev/mapper/ssd_vg-ssd_blocks /dev/mapper/slow_disk_vg-data_vol 512 1 writeback default 0'

You would now have a '/dev/mapper/cached_dev' device node that can be used for a partition table & file system, as a raw block device, etc. To make dm-cache devices persist across reboots, you'll need to enable the rc script (rc.dmcache) in the /etc/rc.conf file, and then add the commands to create/destroy the dm-cache device(s) to the "/etc/dm-cache.start" and "/etc/dm-cache.stop" files; below are examples following on from the commands above.

/etc/dm-cache.start:

dmsetup create cached_dev --table '0 209715200 cache /dev/mapper/ssd_vg-ssd_metadata /dev/mapper/ssd_vg-ssd_blocks /dev/mapper/slow_disk_vg-data_vol 512 1 writeback default 0'
dmsetup resume cached_dev

/etc/dm-cache.stop:

dmsetup suspend cached_dev
dmsetup remove cached_dev

EnhanceIO (Deprecated)

Using EnhanceIO is deprecated in ESOS and may be removed in future releases. We recommend using bcache or lvmcache if block layer caching is needed. The information below is provided for reference purposes only.

The setup procedure for EnhanceIO cache devices is pretty clear-cut; the source or backing device (the device you want to "enhance") can already contain data and even have a mounted file system while adding/deleting a cache. The eio_cli tool that comes with ESOS is a special version that supports non-udev setups (like mdev in ESOS). EnhanceIO is disabled in ESOS by default; edit the /etc/rc.conf file and set 'rc.eio_enable' to "YES". Next, you'll need to set up your cache device using the eio_cli tool (be sure to always use the "-u" option to disable support for udev):

eio_cli create -u -d /dev/disk-by-id/SERIAL-B8CEA82A -s /dev/disk-by-id/SERIAL-A65CBA25 -m wb -c my_cache

Your backing/source device ("/dev/disk-by-id/SERIAL-B8CEA82A" in this example) is now enhanced! The configuration file that eio_cli and rc.eio use is located here: /etc/eio.conf

lvmcache

You can also use the LVM interface to device-mapper cache (dm-cache). Using dm-cache via LVM is much simpler than setting it up directly with dmsetup as shown above. First, you'll need to make sure LVM is enabled at boot (set "rc.lvm2_enable" to "YES" in /etc/rc.conf).

For this lvmcache setup example, we'll be using (1) SSD SCSI disk (our cache), and (1) 7.2K NL SAS SCSI disk (our backing disk).

Make these SCSI disks into LVM PVs and add both devices to the same volume group (VG):

pvcreate -v /dev/sdb /dev/sdc
vgcreate -v VolumeGroup1 /dev/sdb /dev/sdc

You then need to create (3) logical volumes, each allocated to a specific physical disk.

Create a logical volume to use as cache and assign it to the SSD disk (/dev/sdb):

lvcreate -L 950GB -n lv1_cache VolumeGroup1 /dev/sdb

Create a logical volume to use as the cache metadata and assign it to the SSD (/dev/sdb -- this needs to be about a 1000:1 split):

lvcreate -L 1GB -n lv1_cache_meta VolumeGroup1 /dev/sdb

Create a logical volume to use as the data disk and assign it to the SAS 7.2K NL disk (/dev/sdc):

lvcreate -L 2TB -n lv1_data VolumeGroup1 /dev/sdc

Now we need to convert the (2) cache volumes into a "cache pool" (this will add lv1_cache to a cache pool using lv1_cache_meta as the metadata):

lvconvert --type cache-pool --poolmetadata VolumeGroup1/lv1_cache_meta VolumeGroup1/lv1_cache

Finally, attach the cache pool to the data volume -- your volume will now be cached:

lvconvert --type cache --cachepool VolumeGroup1/lv1_cache VolumeGroup1/lv1_data
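
To confirm the cache pool is attached (a quick check using the volume group name from above), list the logical volumes along with their backing devices:

lvs -a -o +devices VolumeGroup1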

How to "un-cache" a logical volume... all you need to do is remove the cache pool logical volume (LVM will then copy the unwritten data to the data drive then remove the cache and metadata volumes):

lvremove VolumeGroup1/lv1_cache

To add the cache back in, you will need to recreate the cache pool from scratch and assign it back to the logical volume.


Automatic Tiered Block Devices - BTIER (Deprecated)

Using BTIER is deprecated in ESOS and may be removed in future releases. We recommend using back-end storage that fits the needs of your application. The information below is provided for reference purposes only.

BTIER configuration in ESOS should be similar to that of other Linux distributions. There is a decent article here on configuring BTIER; skip past the building/installation part and start with the section on configuring BTIER devices.

As an example configuration, assume we have a Linux RAID (md) RAID1 volume consisting of two SSDs (md0) and a Linux RAID RAID5 volume containing several SATA drives (md1); we can then create a BTIER device like this:

btier_setup -f /dev/md0:/dev/md1 -B -c

Please note: The "-c" flag is only used when the BTIER device is initially created. Using "-c" writes the initial metadata to the underlying disks. The system should now show a new device: /dev/sdtiera

Use this block device as you would normally when creating SCST devices. Add the following line to the "/etc/bttab" configuration file to make the BTIER device persist between reboots:

/dev/md0:/dev/md1

Ceph RBD Mapping

In ESOS, you can map a Ceph RBD image and use it as a back-end block device. You can then treat this as a normal block device and use it with vdisk_blockio, or put a file system on it and use vdisk_fileio.

Edit the /etc/ceph/ceph.conf file and add your monitors (nodes) so your Ceph cluster can be discovered; make sure the mon_host line enumerates all of the Ceph monitors. Here is an example:

mon_host = 192.168.1.101,192.168.1.102,192.168.1.103

You will also need a client key ring file (/etc/ceph/ceph.client.keyring):

[client.admin]
        key = AQC2WFlTYPvVHhAAuk1jxZ4u86EkMdeUyn6LYA==

Finally configure the /etc/ceph/rbdmap file (pool / image mappings):

rbd/disk01     id=client.admin,keyring=/etc/ceph/ceph.client.keyring

Edit the /etc/rc.conf file and set "rc.rbdmap_enable" to "YES" and then start it:

/etc/rc.d/rc.rbdmap start
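
If the mapping succeeds, a new RBD block device node should appear (the exact device naming below is an assumption); a quick way to check is:

ls -l /dev/rbd*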

If you get any error messages, check the kernel logs (dmesg). See this article if you have any "feature set mismatch" errors: http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client/


File Systems & Virtual Disk Files

After you have set up/configured your advanced back-end storage, it will still appear as a block device, just as described in the basic back-end storage wiki document.

With this logical block device, you can now create a file system on it and add virtual disk files, if desired. Follow the same steps as described in the 34_File_Systems_Configuration document for making file systems and adding virtual disk files, but with advanced back-end storage, you'll select your DRBD block device (eg, /dev/drbd0) or whatever other advanced block device you configured.