Openzfs compressedarc abd patchset WIP #5009
Conversation
@ironMann It looks like your work is against David Chen's version of ABD. The version we have here is against the Delphix-OS version worked on by Dan Kimmel. I'll look to see what it's trying to return with that size parameter, but it's possible it doesn't make sense with the other ABD implementation.
@ironMann that slight change to abd_get_offset() looks entirely reasonable to me. I can definitely see how it would make adding ABD to the vectorized raidz simpler. Let's get @dankimmel's thoughts on this too, to make sure it's not going to cause problems elsewhere. I should add that it would be awesome to get an updated version of your raidz-abd branch against this PR. That's still a big piece we need which is missing. This change still needs more polish to pass all the testing, but that's being worked on.
@dpquigl To clarify the size parameter: When a new overlay abd is created with
@ironMann: neat, this optimization looks really cool! @dpquigl is right, the new ABD code is a significant departure from the version you patched. The iterator functions have all been rewritten in the new version to use

Why do you need the size stored for offset ABDs -- just because it would be easier to track the lengths of the segments to compare? It shouldn't be hard to add that, just trying to understand the use case better, since I think you could achieve the same thing by passing the length to
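For illustration, here is a minimal standalone sketch (hypothetical names and struct layout, not the PR's actual ABD code) of what recording the size on an offset ABD might look like:

#include <stdlib.h>

/* Hypothetical, simplified stand-in for the real abd_t. */
typedef struct ex_abd {
	struct ex_abd	*abd_parent;	/* NULL for a top-level ABD */
	size_t		abd_offset;	/* offset into the parent's data */
	size_t		abd_size;	/* logical size of this view */
} ex_abd_t;

/*
 * Create a view into 'sabd' starting at 'off' and spanning 'size'
 * bytes. Storing abd_size on the view is the small change under
 * discussion: consumers such as the vectorized raidz code can then
 * compare segment lengths without tracking them externally.
 */
static ex_abd_t *
ex_abd_get_offset(ex_abd_t *sabd, size_t off, size_t size)
{
	ex_abd_t *abd = malloc(sizeof (*abd));

	abd->abd_parent = sabd;
	abd->abd_offset = off;
	abd->abd_size = size;
	return (abd);
}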
Authored by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Ported by: David Quigley <[email protected]>

This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on disk is compressed, then the b_pdata will match the block on disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata be. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks, as the arc_buf_t's are considered overhead and are always uncompressed.

When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is in use it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining its validity.

OpenZFS Issue: https://www.illumos.org/issues/6950
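To make the read path above concrete, here is a hedged, self-contained sketch. The struct layouts and the decompress() helper are stand-ins, not the actual arc code; the branching mirrors the behavior the commit message describes:

#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the structures described above. */
typedef struct arc_buf {
	struct arc_buf	*b_next;	/* other uncompressed copies */
	void		*b_data;	/* uncompressed data */
} arc_buf_t;

typedef struct arc_buf_hdr {
	void		*b_pdata;	/* physical (maybe compressed) block */
	int		b_compressed;	/* nonzero if b_pdata is compressed */
	size_t		b_lsize;	/* logical (uncompressed) size */
	arc_buf_t	*b_buf;		/* list of uncompressed arc_buf_t's */
} arc_buf_hdr_t;

/* Stand-in for the real decompression routine. */
extern void decompress(const void *src, void *dst, size_t lsize);

/*
 * Sketch of the read path: hand the consumer an uncompressed
 * arc_buf_t, either by copying an existing one or by decompressing
 * the hdr's b_pdata.
 */
static arc_buf_t *
arc_buf_for_read(arc_buf_hdr_t *hdr)
{
	arc_buf_t *buf = malloc(sizeof (*buf));

	buf->b_data = malloc(hdr->b_lsize);
	if (hdr->b_buf != NULL) {
		/* Cheaper: copy from an existing uncompressed buffer. */
		memcpy(buf->b_data, hdr->b_buf->b_data, hdr->b_lsize);
	} else if (hdr->b_compressed) {
		/* First consumer: decompress the physical block. */
		decompress(hdr->b_pdata, buf->b_data, hdr->b_lsize);
	} else {
		/* b_pdata is already uncompressed. */
		memcpy(buf->b_data, hdr->b_pdata, hdr->b_lsize);
	}
	buf->b_next = hdr->b_buf;
	hdr->b_buf = buf;
	return (buf);
}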
@dankimmel @behlendorf I've ported raidz onto this abd stack, PR #5020. 2d660f3 is the adaptation of ABD needed for the rest of the RAIDZ code. There are a few minor changes, and two special iterator methods.
I'm already using

For iterators (created from offset ABDs) in
#define kpm_enable 1
#define abd_alloc_chunk() \
	((struct page *)kmem_alloc(PAGESIZE, KM_SLEEP))
#define abd_free_chunk(chunk)	kmem_free(chunk, PAGESIZE)
Prior to this change PAGESIZE wasn't used extensively in userspace, so it was implemented in lib/libspl/include/sys/param.h just as sysconf(_SC_PAGESIZE). That's not going to work well if we use PAGESIZE everywhere. We're either going to want to implement PAGESIZE differently, or alternately preserve zfs_abd_chunk_size and initialize it once in abd_init().
Potentially we could also preserve the kmem cache chunk case for use solely by user space. That would help minimize our delta with upstream.
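One possible shape for that suggestion, as a sketch only (abd_init() and zfs_abd_chunk_size are named in the discussion above, but this exact initialization is illustrative):

#ifndef _KERNEL
#include <unistd.h>
#endif

/* Chunk size used by the scatter ABD allocator. */
static size_t zfs_abd_chunk_size;

void
abd_init(void)
{
#ifdef _KERNEL
	/* In the kernel, the page size is a compile-time constant. */
	zfs_abd_chunk_size = PAGESIZE;
#else
	/* In user space, ask the running system exactly once. */
	zfs_abd_chunk_size = (size_t)sysconf(_SC_PAGESIZE);
#endif
}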
@behlendorf Good point. How reliant is userspace on the actual running system's PAGE_SIZE? Can we get it in the configure script and define PAGE_SIZE then?

I'm now more worried about the kernel's PAGESIZE value using the build system's kernel value. PAGE_SHIFT is configurable on some arches, e.g. arm64 and ia64.
Getting PAGE_SIZE wrong in user space shouldn't be terrible. I could see it perhaps causing some memory alignment issues, but I suspect that would be the worst of it. That said, I've never tried.

On the kernel side we should always be using PAGE_SIZE from the kernel headers. I don't recall why we ever added the #ifdef around PAGESIZE; it seems like we'd want it to fail if PAGESIZE was ever defined.
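A sketch of the fail-loudly behavior described here (illustrative; the actual header may be arranged differently):

/*
 * Sketch: in kernel builds, insist on the kernel's PAGE_SIZE and
 * break the build if PAGESIZE has already been defined by another
 * header, rather than silently guarding it with #ifdef.
 */
#ifdef PAGESIZE
#error "PAGESIZE unexpectedly defined; use the kernel's PAGE_SIZE"
#else
#define	PAGESIZE	PAGE_SIZE
#endif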
Potential PAGE_SHIFT discrepancies will be resolved by dkms. I was thrown off by a colleague's driver debugging story involving two arm64 systems with different page sizes.

Yeah, I agree that this is such a special case that adding them to

Heh. I see what you're saying now.

The
@@ -577,7 +577,6 @@ extern int spa_get_stats(const char *pool, nvlist_t **config, char *altroot,
	size_t buflen);
extern int spa_create(const char *pool, nvlist_t *config, nvlist_t *props,
	nvlist_t *zplprops);
extern int spa_import_rootpool(char *devpath, char *devid);
Good find, we accidentally missed removing this declaration.
In receive_object() a new tx is created in order to create an object. While the tx is assigned and not yet committed, direct reclaim occurs via the kmem_cache_alloc() in dnode_create(). This results in an inode being evicted which needs to be removed. The removal process creates another new tx which can't be assigned to the closed txg, and the receive_writer thread blocks waiting for the next txg. However, since it's still holding a tx open in the previous txg, the entire txg_sync process deadlocks.

This is prevented by calling spl_fstrans_mark(), which disables direct reclaim in all memory allocations performed by this process. This covers the offending kmem_cache_alloc() and any other kmem_alloc()'s which occur in the critical region. An analogous situation can occur in arc_write_ready() and is resolved the same way.

Finally, abd_alloc_chunk() is changed to use kmem_flags_convert(), which detects if the PF_FSTRANS flag is set for the process. When set, the __GFP_IO and __GFP_FS bits are cleared to prevent reclaim.

INFO: task receive_writer:25773 blocked for more than 120 seconds.
Call Trace:
[<ffffffff8163b809>] schedule+0x29/0x70
[<ffffffffa04ba1ad>] cv_wait_common+0x10d/0x130 [spl]
[<ffffffffa04ba1e5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa062f3f3>] txg_wait_open+0xe3/0x1b0 [zfs]
[<ffffffffa05d53a5>] dmu_tx_wait+0x465/0x4d0 [zfs]
[<ffffffffa05d54d9>] dmu_tx_assign+0xc9/0x700 [zfs]
[<ffffffffa0662826>] zfs_rmnode+0x136/0x440 [zfs]
[<ffffffffa06892f8>] zfs_zinactive+0xd8/0x140 [zfs]
[<ffffffffa067f5ab>] zfs_inactive+0x8b/0x300 [zfs]
[<ffffffffa06a3773>] zpl_evict_inode+0x43/0xa0 [zfs]
[<ffffffff811fa6e7>] evict+0xa7/0x170
[<ffffffff811fa7ee>] dispose_list+0x3e/0x50
[<ffffffff811fb6c3>] prune_icache_sb+0x163/0x320
[<ffffffff811e12d3>] prune_super+0x143/0x170
[<ffffffff8117c8f5>] shrink_slab+0x165/0x300
[<ffffffff8117fa72>] do_try_to_free_pages+0x3c2/0x4e0
[<ffffffff8117fc8c>] try_to_free_pages+0xfc/0x180
[<ffffffff81173988>] __alloc_pages_nodemask+0x808/0xba0
[<ffffffff811b49ea>] alloc_pages_current+0xaa/0x170
[<ffffffff811bef45>] new_slab+0x275/0x300
[<ffffffff8163306c>] __slab_alloc+0x315/0x48f
[<ffffffff811c1583>] kmem_cache_alloc+0x193/0x1d0
[<ffffffffa04b4339>] spl_kmem_cache_alloc+0x99/0x150 [spl]
[<ffffffffa05d7864>] dnode_create+0x34/0x280 [zfs]
[<ffffffffa05d9ae0>] dnode_hold_impl+0x7e0/0x9d0 [zfs]
[<ffffffffa05c3292>] dmu_object_claim_dnsize+0xa2/0x180 [zfs]
[<ffffffffa05ce1bd>] receive_object.isra.8+0x3ed/0x400 [zfs]
[<ffffffffa05ce578>] receive_writer_thread+0x2a8/0x760 [zfs]
[<ffffffffa04b4f01>] thread_generic_wrapper+0x71/0x80 [spl]
[<ffffffff810a5aef>] kthread+0xcf/0xe0

Signed-off-by: Brian Behlendorf <[email protected]>
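In outline, the fix looks like the following sketch. spl_fstrans_mark()/spl_fstrans_unmark() are the SPL interfaces named in the commit message; the surrounding function and include path are illustrative:

#include <sys/kmem.h>	/* SPL: fstrans_cookie_t, spl_fstrans_mark() */

static void
receive_critical_section(void)
{
	/*
	 * Mark this task PF_FSTRANS so that kmem_flags_convert()
	 * clears __GFP_IO/__GFP_FS on every allocation below; direct
	 * reclaim then cannot recurse into inode eviction while our
	 * tx is still open.
	 */
	fstrans_cookie_t cookie = spl_fstrans_mark();

	/* ... dmu_tx_assign(), dnode_create(), etc. run here ... */

	spl_fstrans_unmark(cookie);
}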
The user space implementation of cv_timedwait_hires() was always passing a relative time to pthread_cond_timedwait() when an absolute time is expected. This was accidentally introduced in commit 206971d. Replace two magic values with their corresponding preprocessor macro.

Signed-off-by: Brian Behlendorf <[email protected]>
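As a standalone illustration of the bug class (not the libzpool code itself), converting a relative timeout into the absolute deadline that pthread_cond_timedwait() requires looks like this:

#include <pthread.h>
#include <time.h>

#define	NSEC_PER_SEC	1000000000L

/* Wait on 'cv' for a *relative* timeout of sec/nsec. */
static int
wait_relative(pthread_cond_t *cv, pthread_mutex_t *mp,
    time_t sec, long nsec)
{
	struct timespec ts;

	/* pthread_cond_timedwait() takes an absolute deadline... */
	(void) clock_gettime(CLOCK_REALTIME, &ts);

	/* ...so add the relative delta to the current time. */
	ts.tv_sec += sec;
	ts.tv_nsec += nsec;
	if (ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_sec++;
		ts.tv_nsec -= NSEC_PER_SEC;
	}
	return (pthread_cond_timedwait(cv, mp, &ts));
}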
Signed-off-by: Isaac Huang <[email protected]>
Closing. To be replaced by a new rebased PR with these changes now that compressed ARC and compressed send/recv have been merged.
This is a WIP PR for the ABD work. It follows the path outlined by Dan Kimmel at Delphix and contains multiple features, including compressed arc, compressed send/recv, a reworking of arc internals, and the ABD code with some Linux-specific patches.