
Compaction of Arrow data structures #22

Closed
jlefevre opened this issue Jun 18, 2019 · 1 comment · Fixed by #70
@jlefevre (Member)
This project will develop object class methods that merge (or conversely split) formatted data partitions within an object. Self-contained partitions are written (appended) to objects, so over time an object may contain a sequence of independent formatted data structures. A compaction request will invoke this method, which iterates over the data structures and combines (or splits) them into a single larger data structure representing the complete data partition. In essence, this method performs a read-modify-write operation on an object's local data.

Task 1: A given object may contain a sequence of serialized Arrow data structures on disk. This task will read 2 or more Arrow structures and combine them into a single Arrow structure. Essentially this can be thought of as compaction, which is needed since objects may contain multiple smaller Arrow structs after some number of row inserts over time. The inverse of compaction -- i.e., splitting 1 Arrow struct into 2 -- is also part of this task.

Assumptions: Each Arrow struct has a known number of entries (rows), and the combined (or separated) struct will have a target max_rows or size. If the new expected size would be larger than the target max_rows or size, do not perform the compaction or splitting.

Task 2: The current physical off/len of each data struct (Arrow in this case) is stored in omap as an idx_fb_entry struct. After combining the 2 Arrow structs into one struct, the idx_fb_entries for the old structs must be removed. Also note that the max fb_seq_number is stored in omap and needs to be updated as well. This value is always set during the idx creation process; it indicates the current (i.e., max) number of data structs in the object, given that the object contains a sequence of structs. For example, if there were 2 Arrow structs, the previous fb_seq_num should have been 2, and after combining it should be set to 1. Alternatively, for safety, the fb_seq_num key could be removed from omap and set later, outside of the compaction task, for instance in optional task #3 below.
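The Task 2 bookkeeping can be sketched with a plain dict standing in for the object's omap. The key names mirror the issue's idx_fb_entry / fb_seq_num, but the exact key format and function name are assumptions for illustration only.

```python
def update_omap_after_compaction(omap, old_seq_nums):
    """Remove idx_fb_entry keys for the merged structs and reset fb_seq_num.

    `omap` is a dict standing in for the object's omap; the real code
    would use the cls omap operations instead.
    """
    for seq in old_seq_nums:
        # Hypothetical key format for the per-struct physical index entry.
        omap.pop("idx_fb_entry.%d" % seq, None)
    # After merging N structs into one, the object holds a single struct.
    # Alternatively, delete fb_seq_num here and let a later step (e.g. the
    # index rebuild in task #3) set it, as the issue suggests for safety.
    omap["fb_seq_num"] = 1
    return omap
```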

Task 3: Optionally, a new physical idx entry (off/len) should be added to omap for the new combined struct. The idx_entry is normally built here, by looping over all bufferlists in an object. That functionality needs to be duplicated, or preferably that code should first be refactored to support building the idx_fb_entry separately from building the index data entries (opening issue #21 for that).
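The off/len computation at the core of Task 3 amounts to walking the object's buffers and recording each struct's physical position. A minimal sketch, with the loop over bufferlists abstracted as a list of buffer lengths (the function and field names are illustrative, not the refactored API):

```python
def build_idx_fb_entries(buffer_lengths):
    """Return (off, len) pairs for a sequence of serialized structs.

    Mirrors the loop over an object's bufferlists: each entry's offset
    is the running sum of the lengths that precede it.
    """
    entries, off = [], 0
    for length in buffer_lengths:
        entries.append((off, length))
        off += length
    return entries
```

After compaction there is only one struct, so this yields a single entry starting at offset 0.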

@ivotron ivotron transferred this issue from uccross/skyhookdm-ceph May 31, 2020
@jlefevre
Copy link
Member Author

The compaction will focus on reading a sequence of tables within an object, converting them to a single table, and then writing it back using cls_cxx_replace(); this will look similar to the method in transform_db_op().

@jlefevre jlefevre linked a pull request Aug 31, 2020 that will close this issue