
Implement uncached prefetch #14243

Merged: 1 commit merged into openzfs:master on Jan 5, 2023

Conversation

amotin (Member) commented Nov 30, 2022

Before this change the primarycache property was handled only at the dbuf layer, controlling the dbuf cache and, through a hack, ARC evictions. Since the speculative prefetcher is implemented at the ARC level, it had to be disabled for uncacheable buffers, because otherwise falsely prefetched and never-read buffers would stay cached in ARC.

This change gives the ARC knowledge of uncacheable buffers. The flag is passed to arc_read() and arc_write() and stored in the ARC header. When remove_reference() drops the last reference on the ARC header, it can either destroy the header immediately or, if it is marked as prefetch, put it into the new arc_uncached state. That state is scanned every second, evicting stale buffers that were never demand-read.
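
For illustration, a minimal sketch of that reference-drop decision in plain C; the struct, field names, and helper below are simplified stand-ins for this description, not the actual OpenZFS ARC code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { STATE_CACHED, STATE_UNCACHED } hdr_state_t;

typedef struct arc_hdr_sketch {
	uint32_t	refcount;
	bool		uncached;	/* caller asked not to cache this buffer */
	bool		prefetch;	/* created by speculative prefetch, not demand read */
	hdr_state_t	state;
} arc_hdr_sketch_t;

/* Free the header's buffers and remove it from the hash table (elided). */
static void
hdr_destroy_sketch(arc_hdr_sketch_t *hdr)
{
	(void) hdr;
}

/* Drop one reference and decide what to do with an uncacheable header. */
static void
remove_reference_sketch(arc_hdr_sketch_t *hdr)
{
	if (--hdr->refcount != 0)
		return;			/* still referenced elsewhere */

	if (!hdr->uncached)
		return;			/* cacheable: normal MRU/MFU handling (elided) */

	if (hdr->prefetch) {
		/*
		 * Prefetched but never demand-read: park it in the
		 * "uncached" state, where a once-a-second scan evicts it
		 * if it is still sitting there unread (stale).
		 */
		hdr->state = STATE_UNCACHED;
	} else {
		/* Demand-read and uncacheable: destroy it right away. */
		hdr_destroy_sketch(hdr);
	}
}
```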

To handle short or misaligned reads, this change tracks at the dbuf layer buffers that were read from the beginning but not to the end. It is assumed that such buffers may receive further reads, and so they are kept in the dbuf cache. If a following read reaches the end of such a buffer, it is evicted immediately. Otherwise it follows the regular dbuf cache eviction and is evicted from the ARC at the same moment. Since the dbuf layer does not know the actual file size, this logic is not applied to the last buffers of dnodes, which are always evicted as before.
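
A rough sketch of that partial-read tracking, again with hypothetical names and a simplified struct rather than the real dbuf code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct dbuf_sketch {
	uint64_t	size;		/* block size in bytes */
	bool		uncacheable;	/* not covered by primarycache */
	bool		partial_read;	/* read from offset 0 but not to the end */
} dbuf_sketch_t;

/* Drop the buffer from the dbuf cache (and with it, from the ARC); elided. */
static void
dbuf_evict_sketch(dbuf_sketch_t *db)
{
	(void) db;
}

/* Called after a read of [off, off + len) within this block. */
static void
dbuf_note_read_sketch(dbuf_sketch_t *db, uint64_t off, uint64_t len,
    bool last_block_of_dnode)
{
	if (!db->uncacheable)
		return;			/* normal caching rules apply */

	if (last_block_of_dnode) {
		/* The file size is unknown at this layer: evict as before. */
		dbuf_evict_sketch(db);
		return;
	}

	if (off == 0 && off + len < db->size) {
		/* Short read from the start: more reads are likely, keep it. */
		db->partial_read = true;
	} else if (off + len >= db->size) {
		/* The read reached the end of the block: nothing more is
		 * expected, so evict immediately. */
		db->partial_read = false;
		dbuf_evict_sketch(db);
	}
	/*
	 * Any other pattern just ages out of the dbuf cache normally, and
	 * the ARC copy goes away at the same moment.
	 */
}
```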

Since uncacheable buffers should no longer stay in the ARC for long, this patch also tries to optimize I/O by allocating ARC physical buffers as linear, which allows buffer sharing. This avoids one of the two memory copies for uncompressed data on both reads and writes, at the cost of somewhat higher KVA usage in the prefetch case. When decompression is needed, sharing is impossible, but an extra memory copy is still avoided, since decompression requires a linear source buffer anyway.
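
A sketch of that layout decision; the buffer struct and allocator below are illustrative stand-ins, not the real ABD API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct pbuf_sketch {
	void	*data;		/* one contiguous mapping when linear */
	size_t	 size;
	bool	 linear;
} pbuf_sketch_t;

/* Allocate a physical buffer; the scatter (page-chunk) path is elided. */
static pbuf_sketch_t *
pbuf_alloc_sketch(size_t size, bool linear)
{
	pbuf_sketch_t *b = calloc(1, sizeof (*b));

	if (b == NULL)
		return (NULL);
	b->size = size;
	b->linear = linear;
	if (linear)
		b->data = malloc(size);
	return (b);
}

/*
 * Pick the layout for a new ARC physical buffer.  Uncacheable buffers are
 * short-lived, so the extra kernel virtual address space of a linear
 * mapping is a small price for (a) sharing the buffer with the dbuf layer
 * when the block is stored uncompressed (no copy at all), and (b) feeding
 * the decompressor directly when it is compressed, since decompression
 * needs a linear source anyway (one copy avoided).
 */
static pbuf_sketch_t *
arc_alloc_pbuf_sketch(size_t psize, bool uncacheable, bool scatter_enabled)
{
	return (pbuf_alloc_sketch(psize, uncacheable || !scatter_enabled));
}
```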

With the combination of enabled prefetch and the avoided memory copy, this change improves sequential single-threaded read speed from a wide NVMe pool from 2049 to 3932 MiB/s. During writes the profiler shows a 22% reduction in unhalted CPU cycles at the same throughput of 3653 MiB/s.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


amotin (Member, Author) commented Nov 30, 2022

I think this may nicely fill the gap between regular cached I/O and the zero-copy O_DIRECT of #10018, handling cases where full zero-copy is not possible, or where the user cannot give up prefetch and writeback.

@amotin amotin added the Status: Code Review Needed (Ready for review and testing) and Type: Performance (Performance improvement or performance problem) labels Dec 1, 2022
@amotin amotin requested review from grwilson and ahrens December 2, 2022 17:47
amotin (Member, Author) commented Dec 9, 2022

As discussed at the latest monthly meeting, I've enabled linear buffer allocation for data that needs to be decompressed. While those buffers cannot be shared, since the decompression code requires linear buffers, this still avoids an extra memory copy. My tests show a read speed improvement for 50% LZ4-compressed 128KB blocks from 2900 MB/s to 3400 MB/s.

ghost left a comment

This seems good to me!

amotin (Member, Author) commented Dec 20, 2022

Just added some more comments and rebased.

@amotin amotin force-pushed the uncached branch 2 times, most recently from 60567c5 to 4ca49ca on December 20, 2022 20:13
behlendorf (Contributor) left a comment

Looks good, and has worked as expected in my local testing. I am a little concerned about always allocating uncacheable buffers as linear. But this should only be an issue if they were allowed to accumulate, and as you pointed out they're short-lived, so this should be okay.

@behlendorf behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label Dec 20, 2022
@amotin amotin force-pushed the uncached branch 4 times, most recently from 44178ac to f9647b9 on December 22, 2022 15:20
behlendorf (Contributor) commented Dec 22, 2022

Performance results of primarycache=none and primarycache=metadata running on a striped NVMe pool look very good.

[image: benchmark results chart]

amotin (Member, Author) commented Dec 22, 2022

Rebased it onto the new master after the #14123 merge.

grwilson (Member) left a comment

Can you update arcstat to get the new "uncached" stats?

amotin (Member, Author) commented Dec 23, 2022

Can you update arcstat to get the new "uncached" stats?

Yes, it will need an update after both PRs.

amotin (Member, Author) commented Dec 24, 2022

Can you update arcstat to get the new "uncached" stats?

Done.

amotin (Member, Author) commented Dec 27, 2022

I've added one more chunk for dmu_tx_check_ioerr() to avoid a double block read during short/misaligned uncached writes, as described in #14333. It keeps uncacheable L0 blocks read by dmu_tx_check_ioerr() in the dbuf cache until they are read again by dmu_buf_will_dirty(). For higher-level blocks this trick may not work, leaving uncacheable metadata in cache, since they do not always seem to be read explicitly later to clear the flag. But that should be less critical, since dmu_buf_hold_array_by_dnode() should be able to handle indirect read errors, and hopefully not many users run primarycache=none with a random write workload.
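
For illustration, a hypothetical sketch of that idea; the function and field names here are made up, while the real code paths are dmu_tx_check_ioerr() and dmu_buf_will_dirty():

```c
#include <stdbool.h>

typedef struct dbuf_sketch {
	bool	uncacheable;		/* not covered by primarycache */
	bool	keep_for_pending_write;	/* read by the tx check, reuse soon */
} dbuf_sketch_t;

/* Roughly what happens while validating a short/misaligned write in the tx. */
static void
tx_check_ioerr_sketch(dbuf_sketch_t *db)
{
	/* ... read the existing block and check for I/O errors ... */
	if (db->uncacheable) {
		/*
		 * Keep the block in the dbuf cache so the upcoming
		 * dmu_buf_will_dirty() can reuse it instead of reading
		 * it from disk a second time.
		 */
		db->keep_for_pending_write = true;
	}
}

/* Roughly what happens when the write path finally dirties the block. */
static void
buf_will_dirty_sketch(dbuf_sketch_t *db)
{
	if (db->keep_for_pending_write) {
		/*
		 * The second consumer has arrived; clear the flag so the
		 * buffer follows the normal uncached eviction from here on.
		 */
		db->keep_for_pending_write = false;
	}
	/* ... mark the buffer dirty and merge in the new data ... */
}
```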

@amotin amotin mentioned this pull request Dec 29, 2022
@mmaybee mmaybee merged commit ed2f7ba into openzfs:master Jan 5, 2023
behlendorf pushed a commit that referenced this pull request Jan 5, 2023
Recent ARC commits added new statistic counters, such as iohits,
the uncached state, etc.  Represent those.  Also, some of the
previously reported numbers were confusing or even made no sense.
Clean up and restructure the existing reports.

Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:  Alexander Motin <[email protected]>
Sponsored by:   iXsystems, Inc.
Issue #14115 
Issue #14123
Issue #14243 
Closes #14320
shodanshok (Contributor) commented

Thanks for implementing that feature, which will be very useful on fast NVMe devices!

One question: since uncached buffers are linearly allocated, does this mean increased memory fragmentation despite zfs_abd_scatter_enabled? Anything to be aware of?

Thanks.

amotin (Member, Author) commented Jan 7, 2023

One question: since uncached buffers are linearly allocated, does this mean increased memory fragmentation despite zfs_abd_scatter_enabled? Anything to be aware of?

If the data are not compressed, then only for the amount of prefetched but not yet read data, and for no more than one second. For compressed data it may add up to another dbuf cache's worth, which is 3% of ARC by default. So I don't think it should cause problems. The dbuf cache and all dirty data are always linear, so this is not a dramatic change.

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.

This change gives the ARC knowledge about uncacheable buffers
via  arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state. 
That state is scanned every second, evicting stale buffers that were
not demand read.

This change also tracks dbufs that were read from the beginning,
but not to the end.  It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following read
reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction.  Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.

Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#14243
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
Recent ARC commits added new statistic counters, such as iohits,
the uncached state, etc.  Represent those.  Also, some of the
previously reported numbers were confusing or even made no sense.
Clean up and restructure the existing reports.

Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:  Alexander Motin <[email protected]>
Sponsored by:   iXsystems, Inc.
Issue openzfs#14115 
Issue openzfs#14123
Issue openzfs#14243 
Closes openzfs#14320
@amotin amotin deleted the uncached branch October 9, 2023 13:23
behlendorf added a commit that referenced this pull request Oct 13, 2023
New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613,
  #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <[email protected]>
@amotin amotin mentioned this pull request Jul 2, 2024