
Compaction of Arrow data structures #22

Closed
jlefevre opened this issue Jun 18, 2019 · 1 comment · Fixed by #70
@jlefevre (Member)
This project will develop object class methods that merge (or conversely split) formatted data partitions within an object. Self-contained partitions are written (appended) to objects, so over time an object may contain a sequence of independent formatted data structures. A compaction request will invoke this method, which iterates over the data structures and combines (or splits) them into a single larger data structure representing the complete data partition. In essence, this method performs a read-modify-write operation on an object's local data.

Task 1: A given object may contain a sequence of serialized Arrow data structures on disk. This task will read 2 or more Arrow structures and combine them into a single Arrow structure. Essentially this can be thought of as compaction, which is needed since objects may contain multiple smaller Arrow structs after some number of row inserts over time. The inverse of compaction -- i.e., splitting 1 Arrow struct into 2 -- is also part of this task.

Assumptions: Each Arrow struct has a known number of entries (rows), and the combined (or separated) struct will have a target max_rows or size. If the new expected size would be larger than the target max_rows or size, do not perform the compaction or splitting.

Task 2: The current physical off/len of each data struct (Arrow in this case) is stored in omap as an idx_fb_entry struct. After combining the 2 Arrow structs into one struct, the idx_fb_entries for the old structs must be removed. Also note that the max fb_seq_number is stored in omap and needs to be updated as well. This value is always set during the idx creation process; it indicates the current (i.e., max) number of data structs in the object, given that the object contains a sequence of structs. For example, if there were 2 Arrow structs, the previous fb_seq_num should have been 2, and after combining it should be set to 1. Alternatively, for safety, the fb_seq_num key could be removed from omap and set later, outside of the compaction task, for instance in optional task #3 below.
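The Task 2 bookkeeping can be sketched with a plain dict standing in for the object's omap. The key names mirror the issue's idx_fb_entry / fb_seq_num, but the exact key format and function name are assumptions for illustration only.

```python
def update_omap_after_compaction(omap, old_seq_nums):
    """Remove idx_fb_entry keys for the merged structs and reset fb_seq_num.

    `omap` is a dict standing in for the object's omap; the real code
    would use the cls omap operations instead.
    """
    for seq in old_seq_nums:
        # Hypothetical key format for the per-struct physical index entry.
        omap.pop("idx_fb_entry.%d" % seq, None)
    # After merging N structs into one, the object holds a single struct.
    # Alternatively, delete fb_seq_num here and let a later step (e.g. the
    # index rebuild in task #3) set it, as the issue suggests for safety.
    omap["fb_seq_num"] = 1
    return omap
```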

Task 3: Optionally, a new physical idx entry (off/len) should be added to omap for the new combined struct. The idx_entry is normally built here, by looping over all bufferlists in an object. That functionality needs to be duplicated, or preferably that code should first be refactored to support building the idx_fb_entry separately from building the index data entries (opening issue #21 for that).
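The off/len computation at the core of Task 3 amounts to walking the object's buffers and recording each struct's physical position. A minimal sketch, with the loop over bufferlists abstracted as a list of buffer lengths (the function and field names are illustrative, not the refactored API):

```python
def build_idx_fb_entries(buffer_lengths):
    """Return (off, len) pairs for a sequence of serialized structs.

    Mirrors the loop over an object's bufferlists: each entry's offset
    is the running sum of the lengths that precede it.
    """
    entries, off = [], 0
    for length in buffer_lengths:
        entries.append((off, length))
        off += length
    return entries
```

After compaction there is only one struct, so this yields a single entry starting at offset 0.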

@ivotron ivotron transferred this issue from uccross/skyhookdm-ceph May 31, 2020
@jlefevre
Copy link
Member Author

The compaction will focus on reading a sequence of tables within an object, converting them to a single table, and then writing it back using cls_cxx_replace(); this will look similar to the method in transform_db_op().

@jlefevre jlefevre linked a pull request Aug 31, 2020 that will close this issue