DirectIO support
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs to be aligned to the backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and the
block device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.
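
To illustrate the first point from an application's perspective, here is a
minimal sketch of the alignment check a caller might perform before issuing
an O_DIRECT request. The helper name dio_is_aligned and the choice to pass
the sector size as a parameter are assumptions made for illustration; they
are not part of XFS, ZFS, or any kernel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns true when the buffer address, the file
 * offset and the length are all multiples of sector_size. sector_size
 * must be a nonzero power of two (e.g. 512 or 4096).
 */
static bool
dio_is_aligned(const void *buf, off_t offset, size_t len, size_t sector_size)
{
	uintptr_t mask;

	/* Reject zero and non-power-of-two sector sizes. */
	if (sector_size == 0 || (sector_size & (sector_size - 1)) != 0)
		return (false);

	mask = (uintptr_t)sector_size - 1;
	return (((uintptr_t)buf & mask) == 0 &&
	    ((uintptr_t)offset & mask) == 0 &&
	    ((uintptr_t)len & mask) == 0);
}
```

A caller would typically obtain a suitably aligned buffer from
posix_memalign() and round the offset and length itself before the IO.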

The first is not possible in ZFS because there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encounter another hurdle.
Specifically, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior, while for
checksums, this would imply we write incorrect checksums to disk. It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations. However, it is unclear if it is wise for a filesystem
driver to do this.
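
The checksum hazard can be sketched in plain userland C. The example below
is only an illustration under stated assumptions: toy_checksum is a
simplified Fletcher-style sum standing in for a real checksum (it is not
ZFS's fletcher4), and snapshot_and_checksum is a hypothetical name for the
"copy first, then checksum the copy" approach that the userland to kernel
copy provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified Fletcher-style checksum; a stand-in, not ZFS's fletcher4. */
static uint64_t
toy_checksum(const uint8_t *data, size_t len)
{
	uint64_t a = 0, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a += data[i];
		b += a;
	}
	return ((b << 32) | (a & 0xffffffffULL));
}

/*
 * Copy the caller's buffer into private memory and checksum the copy.
 * Later writes to user_buf no longer affect priv_buf, so the returned
 * checksum keeps matching the data that would be written out.
 */
static uint64_t
snapshot_and_checksum(const uint8_t *user_buf, uint8_t *priv_buf, size_t len)
{
	memcpy(priv_buf, user_buf, len);
	return (toy_checksum(priv_buf, len));
}
```

If the checksum were computed over user_buf directly and another thread
changed it before the write completed, the stored checksum and the stored
data could disagree, which is the failure described above.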

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Other platforms
implement varying subsets of the XFS semantics. FreeBSD does not
implement #1 and might not implement others (not checked). Mac OS X does
not implement O_DIRECT, but it does implement F_NOCACHE, which is
similar to #2 in that it prevents new data from being cached. AIX
relaxes #3 by only committing the file data to disk. Metadata updates
that are required when an operation makes the file larger are
asynchronous unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.
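
For context, this is roughly how a userland process copes with filesystems
that reject O_DIRECT outright. On Linux, open(2) fails with EINVAL when the
filesystem does not support the flag, so portable programs often retry
without it. The sketch below shows that pattern; the function name
open_prefer_direct is hypothetical, and defining O_DIRECT to 0 where the
platform lacks it is an assumption made so the example compiles elsewhere.

```c
#define _GNU_SOURCE	/* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef O_DIRECT
#define O_DIRECT 0	/* assumption: treat O_DIRECT as a no-op if undefined */
#endif

/*
 * Hypothetical helper: try to open with O_DIRECT and fall back to a
 * buffered open when the filesystem rejects the flag with EINVAL.
 * Any other error is returned to the caller unchanged.
 */
static int
open_prefer_direct(const char *path, int flags, mode_t mode)
{
	int fd = open(path, flags | O_DIRECT, mode);

	if (fd == -1 && errno == EINVAL)
		fd = open(path, flags, mode);
	return (fd);
}
```

Treating DirectIO as a hint inside the filesystem means callers that lack
such a fallback can still open their files, while callers that have one
simply never take the retry path.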

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need to
be finished and we would need to utilize the page tables to make
operations on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <[email protected]>
ryao committed Jul 23, 2015
1 parent 3b79cef commit 2a751aa
Showing 3 changed files with 100 additions and 0 deletions.
72 changes: 72 additions & 0 deletions config/kernel-vfs-direct_IO.m4
@@ -0,0 +1,72 @@
dnl #
dnl # Linux 4.1.x API change
dnl #
AC_DEFUN([ZFS_AC_KERNEL_VFS_DIRECT_IO],
[AC_MSG_CHECKING([whether fops->direct_IO() uses iov_iter without rw])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
ssize_t test_direct_IO(struct kiocb *kiocb,
struct iov_iter *iter, loff_t offset)
{ return 0; }
static const struct address_space_operations
fops __attribute__ ((unused)) = {
.direct_IO = test_direct_IO,
};
],[
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_VFS_DIRECT_IO_ITER, 1,
[fops->direct_IO() uses iov_iter without rw])
],[
AC_MSG_RESULT(no)
dnl #
dnl # Linux 3.16.x API change
dnl #
[AC_MSG_CHECKING([whether fops->direct_IO() uses iov_iter with rw])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
ssize_t test_direct_IO(int rw, struct kiocb *kiocb,
struct iov_iter *iter, loff_t offset)
{ return 0; }
static const struct address_space_operations
fops __attribute__ ((unused)) = {
.direct_IO = test_direct_IO,
};
],[
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_VFS_DIRECT_IO_ITER_RW, 1,
[fops->direct_IO() uses iov_iter with rw])
],[
AC_MSG_RESULT(no)
dnl #
dnl # Ancient Linux API (predates git)
dnl #
[AC_MSG_CHECKING([whether fops->direct_IO() uses iovec])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
ssize_t test_direct_IO(int rw,
struct kiocb *kiocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs)
{ return 0; }
static const struct address_space_operations
fops __attribute__ ((unused)) = {
.direct_IO = test_direct_IO,
};
],[
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_VFS_DIRECT_IO_IOVEC, 1,
[fops->direct_IO() uses iovec])
],[
AC_MSG_ERROR(no)
])
])
])
])
1 change: 1 addition & 0 deletions config/kernel.m4
@@ -100,6 +100,7 @@ AC_DEFUN([ZFS_AC_CONFIG_KERNEL], [
ZFS_AC_KERNEL_LSEEK_EXECUTE
ZFS_AC_KERNEL_VFS_ITERATE
ZFS_AC_KERNEL_VFS_RW_ITERATE
ZFS_AC_KERNEL_VFS_DIRECT_IO
AS_IF([test "$LINUX_OBJ" != "$LINUX"], [
KERNELMAKE_PARAMS="$KERNELMAKE_PARAMS O=$LINUX_OBJ"
27 changes: 27 additions & 0 deletions module/zfs/zpl_file.c
@@ -396,6 +396,32 @@ zpl_aio_write(struct kiocb *kiocb, const struct iovec *iovp,
}
#endif /* HAVE_VFS_RW_ITERATE */

/*
 * DirectIO is treated as a hint: every kernel prototype variant is
 * funneled into the common read/write paths.
 */
static ssize_t
#if defined(HAVE_VFS_DIRECT_IO_IOVEC)
zpl_direct_IO(int rw, struct kiocb *kiocb, const struct iovec *iovp,
    loff_t pos, unsigned long nr_segs)
{
#elif defined(HAVE_VFS_DIRECT_IO_ITER_RW)
zpl_direct_IO(int rw, struct kiocb *kiocb, struct iov_iter *iter,
    loff_t pos)
{
	const struct iovec *iovp = iter->iov;
	unsigned long nr_segs = iter->nr_segs;
#elif defined(HAVE_VFS_DIRECT_IO_ITER)
zpl_direct_IO(struct kiocb *kiocb, struct iov_iter *iter, loff_t pos)
{
	/* Newer kernels dropped the rw argument; derive it from the iter. */
	int rw = iov_iter_rw(iter);
	const struct iovec *iovp = iter->iov;
	unsigned long nr_segs = iter->nr_segs;
#else
#error "No function prototype found for DirectIO"
#endif
	if (rw == WRITE)
		return (zpl_iter_write_common(kiocb, iovp, nr_segs,
		    kiocb->ki_nbytes));
	else
		return (zpl_iter_read_common(kiocb, iovp, nr_segs,
		    kiocb->ki_nbytes));
}

static loff_t
zpl_llseek(struct file *filp, loff_t offset, int whence)
{
@@ -799,6 +825,7 @@ const struct address_space_operations zpl_address_space_operations = {
.readpage = zpl_readpage,
.writepage = zpl_writepage,
.writepages = zpl_writepages,
.direct_IO = zpl_direct_IO,
};

const struct file_operations zpl_file_operations = {