DLPX-86682 DOSE Migration: Evacuate data blocks based on their block boundaries (openzfs#1021)

= Problem

The current data-evacuation design of migration breaks up every segment in the indirect mapping into multiple 512-byte segments on the destination vdev, which is the object store. Frees and reads of those blocks are therefore split into multiple I/Os, hurting our CPU usage and I/O throughput. Moreover, ingesting all these 512-byte blocks into the zettacache induces unnecessary overhead in some of its subsystems, such as the SlabAllocator and index merging.

A side issue that is also fixed in this PR is the sync-write semantics for hybrid pools (object-store vdev + normal vdevs). Currently our VMs drop ZIL writes when they could simply be satisfied by normal-class vdevs.

= This Patch

Initiates a pool-wide scan that records the block boundaries of all the blocks belonging to the device we want to remove. These block boundaries are then used to issue ZIOs of the exact block size to the object store, avoiding the 512-byte split issue and resulting in an object-store vdev layout similar to that of a pure object-based pool. The block boundaries are kept in memory in a B-Tree and persisted in a spacemap. The B-Tree is later used during the creation of the indirect mappings to issue ZIOs of the right block size.

A new on-disk feature flag is created for hybrid pools (which are the first step of migration). A feature flag for the agent is also introduced, since we changed some protocol semantics for zero-length writes (see the code for more details and the note below).

A side change here is that we now allow ZIL writes to normal-class vdevs in hybrid pools. Another side change is the introduction of `zpool wait -i bb_scan`, which waits for the pool-wide scan that precedes removal; this was implemented for testing the feature in the ZTS. Running `zpool wait -i removal` waits for both the scan and the actual removal.
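As a rough illustration of how the recorded boundaries are used (a minimal Python sketch with hypothetical names; the real code is an in-kernel B-Tree keyed by vdev offset and persisted in a spacemap):

```python
import bisect

class BlockBoundaries:
    """Stand-in for the B-Tree built by the pool-wide scan: a sorted set of
    (start offset, size) records, one per block on the removing vdev."""

    def __init__(self):
        self.starts = []   # sorted block start offsets
        self.sizes = {}    # start offset -> block size in bytes

    def record(self, start, size):
        """Scan callback: remember one block's boundaries."""
        bisect.insort(self.starts, start)
        self.sizes[start] = size

    def block_at(self, offset):
        """Find the block containing `offset`, so the copy ZIO can be issued
        with the block's exact size instead of 512-byte fragments."""
        i = bisect.bisect_right(self.starts, offset) - 1
        start = self.starts[i]
        size = self.sizes[start]
        assert start <= offset < start + size
        return start, size

bb = BlockBoundaries()
bb.record(0, 2048)        # a 2KB block at offset 0
bb.record(2048, 16384)    # a 16KB block right after it
print(bb.block_at(3000))  # -> (2048, 16384): one 16KB ZIO, not 32 x 512B
```

The lookup during indirect-mapping creation only needs the predecessor of an offset in sorted order, which is why a B-Tree (here approximated with `bisect` over a sorted list) fits the access pattern.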
= Testing

* New tests have been added to the test suite to exercise this feature.
* We now have green zoa_kill stress tests from QA.

= Misc Details About Code & Future Work

Zero-length writes: To maintain the offset-to-BlockID translations in our indirect mapping for all allocated blocks, we submit a write to the object store with the contents of the block being copied, specifying the same size as the block, and then submit zero-length writes for every other block ID covered by that segment. For example, when we copy a 2KB block to the object store that translates to BlockID X, we submit the 2KB write with the contents to BlockID X and then submit 3 zero-length writes to X+1, X+2, and X+3. These zero-length writes are something we had to explicitly add support for in the object agent - specifically, allowing DataObjects with no blocks to be flushed to the object store. Zero-length writes can still induce CPU and bandwidth overhead in the kernel-to-agent communication, hurting our removal performance; we could further optimize them in future releases. Bug: https://delphix.atlassian.net/browse/DLPX-85983

Memory limit for removal: The B-Tree used for the pool-wide scan, which exists until the end of the removal, can be quite expensive in terms of RAM. Starting a migration/removal that ends up running the system out of memory would be a very bad scenario, so we perform a memory-limit check before starting such an operation. Unfortunately, we can't tell exactly how much RAM this tree will consume, because its size depends on the sizes of the blocks of the removing device, and ZFS currently keeps no useful block statistics. Thus we make some assumptions that are implemented in the code as tunables.
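As a rough illustration of such a heuristic check (the tunable names and values below are hypothetical, not the ones in the code):

```python
def removal_memory_check(allocated_bytes, total_ram_bytes,
                         assumed_avg_block_size=8 * 1024,  # tunable guess
                         bytes_per_btree_entry=48,         # tunable guess
                         max_ram_fraction=0.25):           # tunable guess
    """Estimate the B-Tree footprint for the removing device and refuse to
    start the removal if it could consume too much of system RAM.

    Since ZFS keeps no per-block statistics, the entry count is derived
    from an assumed average block size exposed as a tunable."""
    est_entries = allocated_bytes // assumed_avg_block_size
    est_mem = est_entries * bytes_per_btree_entry
    return est_mem <= total_ram_bytes * max_ram_fraction

# A 1TB device on a 64GB system: ~134M entries * 48B ~= 6GB, under the cap.
print(removal_memory_check(1 << 40, 64 << 30))  # -> True
```

The point of the tunables is exactly this uncertainty: if a device turns out to be dominated by small blocks, the assumed average can be lowered to make the check more conservative.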
@sumedhbala-delphix helped me with this by creating an Excel spreadsheet listing all the devices we have from phone-home data and the memory of the systems they belong to (reference: https://docs.google.com/spreadsheets/d/1VMWRtNdQZ2EWoxLdI5Gpyrpoz_NZiYDzgciLIebrHSw/edit?usp=sharing). Almost 1% of our customers do not have enough memory for such an operation, and the majority of that 1% are below our recommended memory size for customers (64GB). With that in mind, our goal is to ship the current heuristic memory-limit check in 14.0 and improve our memory consumption and memory checks in future releases (see https://delphix.atlassian.net/browse/DLPX-87127).

Zettacache ingestion: Currently all data sent to the object store by the removal operation is ingested into the zettacache as normal writes. Depending on the workload, this could disrupt the cached-read performance of some of our customers. Even though there isn't a silver bullet for this problem, it could be helpful in the future to introduce some kind of tunable or filtering for those writes: https://delphix.atlassian.net/browse/DLPX-85502

Block boundary persistence: We currently use a spacemap, which has space overhead (i.e. fields we don't use, plus debug entries). It would be nice to have an on-disk structure that is basically a Vec: https://delphix.atlassian.net/browse/DLPX-87128
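A sketch of that Vec-like layout (the record format here is hypothetical): a flat array of fixed-size (offset, size) records, with none of the spacemap's unused fields or debug entries.

```python
import struct

# Little-endian u64 offset + u64 size: 16 bytes per boundary record.
RECORD = struct.Struct("<QQ")

def encode_boundaries(boundaries):
    """Serialize (offset, size) pairs into a flat byte buffer."""
    return b"".join(RECORD.pack(off, size) for off, size in boundaries)

def decode_boundaries(buf):
    """Deserialize the flat buffer back into (offset, size) pairs."""
    return [RECORD.unpack_from(buf, i)
            for i in range(0, len(buf), RECORD.size)]

recs = [(0, 2048), (2048, 16384)]
buf = encode_boundaries(recs)
print(len(buf))  # -> 32: exactly 16 bytes per record, no per-entry overhead
assert decode_boundaries(buf) == recs
```

Because the scan emits boundaries in offset order, such an append-only array would also round-trip back into the in-memory B-Tree without any sorting on load.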
48 changed files with 2,297 additions and 490 deletions.