Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX for database workloads. Its purpose was to allow the database to bypass the page and buffer caches to prevent unnecessary IO operations (e.g. readahead) while preventing contention for system memory between the database and kernel caches. Unfortunately, the semantics were never defined in any standard. The semantics of O_DIRECT in XFS in Linux are as follows: 1. O_DIRECT requires IOs be aligned to backing device's sector size. 2. O_DIRECT performs unbuffered IO operations between user memory and block device (DMA when the block device is physical hardware). 3. O_DIRECT implies O_DSYNC. 4. O_DIRECT disables any locking that would serialize IO operations. The first is not possible in ZFS beause there is no backing device in the general case. The second is not possible in ZFS in the presence of compression because that prevents us from doing DMA from user pages. If we relax the requirement in the case of compression, we encunter another hurdle. In specific, avoiding the userland to kernel copy risks other userland threads modifying buffers during compression and checksum computations. For compressed data, this would cause undefined behavior while for checksums, this would imply we write incorrect checksums to disk. It would be possible to avoid those issues if we modify the page tables to make any changes by userland to memory trigger page faults and perform CoW operations. However, it is unclear if it is wise for a filesystem driver to do this. The third is doable, but we would need to make ZIL perform indirect logging to avoid writing the data twice. The fourth is already done for all IO in ZFS. Other Linux filesystems such as ext4 do not follow #3. Other platforms implement varying subsets of the XFS semantics. FreeBSD does not implement #1 and might not implement others (not checked). Mac OS X does not implement O_DIRECT, but it does implement F_NOCACHE, which is similiar to #2 in that it prevents new data from being cached. AIX relaxes #3 by only committing the file data to disk. Metadata updates required should the operations make the file larger are asynchronous unless O_DSYNC is specified. On Solaris and Illumos, there is a library function called directio(3C) that allows userspace to provide a hint to the filesystem that DirectIO is useful, but the filesystem is free to ignore it. The semantics are also entirely a filesystem decision. Those that do not implement it return ENOTTY. Given the lack of standardization and ZFS' heritage, one solution to provide compatibility with userland processes that expect DirectIO is to treat DirectIO as a hint that we ignore. This can be done trivially by implementing a shim that maps aops->direct_IO to AIO. There is also already code in ZoL for bypassing the page cache when O_DIRECT is specified, but it has been inert until now. If it turns out that it is acceptable for a filesystem driver to interact with the page tables, the scatter-gather list work will need be finished and we would need to utilize the page tables to make operations on the userland pages safe. References: http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html https://blogs.oracle.com/roch/entry/zfs_and_directio https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics https://illumos.org/man/3c/directio https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html Signed-off-by: Richard Yao <[email protected]>
- Loading branch information