Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delayed column opening during merge #2132

Merged
merged 4 commits into from
Aug 21, 2023
Merged

delayed column opening during merge #2132

merged 4 commits into from
Aug 21, 2023

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Jul 21, 2023

This is the first part of addressing quickwit-oss/quickwit#3633
Instead of loading all Column into memory for the merge, only the current column_name, column_category group is loaded.

Second part would be to replace mmap calls with file reads, for better control on releasing memory

@codecov-commenter
Copy link

Codecov Report

Merging #2132 (1bfbfda) into main (5fafe4b) will decrease coverage by 0.02%.
The diff coverage is 88.83%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main    #2132      +/-   ##
==========================================
- Coverage   94.37%   94.36%   -0.02%     
==========================================
  Files         321      319       -2     
  Lines       60821    60874      +53     
==========================================
+ Hits        57401    57442      +41     
- Misses       3420     3432      +12     
Impacted Files Coverage Δ
common/src/file_slice.rs 72.38% <0.00%> (-9.34%) ⬇️
src/lib.rs 99.05% <ø> (ø)
columnar/src/columnar/merge/mod.rs 94.81% <99.00%> (+0.73%) ⬆️
columnar/src/columnar/merge/tests.rs 98.48% <100.00%> (+0.04%) ⬆️
columnar/src/columnar/reader/mod.rs 79.41% <100.00%> (+1.15%) ⬆️
columnar/src/dynamic_column.rs 70.52% <100.00%> (ø)
common/src/group_by.rs 100.00% <100.00%> (ø)
src/core/index_meta.rs 96.14% <100.00%> (ø)
src/store/compressors.rs 97.70% <100.00%> (+5.51%) ⬆️
src/store/decompressors.rs 97.82% <100.00%> (-0.39%) ⬇️
... and 1 more

... and 4 files with indirect coverage changes

@PSeitz PSeitz force-pushed the columnar_merge2 branch 3 times, most recently from edcda2d to 9c07eb8 Compare July 24, 2023 02:04
@PSeitz PSeitz requested a review from fulmicoton July 24, 2023 02:17
This is the first part of addressing #3633
Instead of loading all Column into memory for the merge, only the current column_name
group is loaded. This can be done since the sstable streams the columns lexicographically.
@PSeitz PSeitz force-pushed the columnar_merge2 branch from 9c07eb8 to c207fcf Compare July 24, 2023 03:39
continue;
/// Iterates over the columns of the columnar readers, grouped by column name.
/// Key functionality is that `open` of the Columns is done lazy per group.
fn group_columns_for_merge_iter<'a>(
Copy link
Collaborator

@fulmicoton fulmicoton Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is much more complicated. Could we have kept the previous code, and worked of DynamicColumnHandle instead of DynamicColumn and gotten the same benefit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The columns need to be loaded to access things like min max values. The alternative would be to change the code so we can access only the metadata with something like open_metadata

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can lighten the work of group_columns_for_merge and change it so that it does not:

  • filter out columns that end up being empty.
  • identify the resulting ColumnType.

It would just stops at returning a

fn group_columns_for_merge(
    columnar_readers: &[&ColumnarReader],
    required_columns: &[(String, ColumnType)],
    merge_row_order: &MergeRowOrder,
) -> io::Result<BTreeMap<(String, ColumnTypeCategory), Vec<Option<DynamicColumnHandle>>>> {

On the callsite, we would then iterate through these handles, and open the DynamicColumn,
and do the conversion of the (ColumnTypeCategory, Vec<Option<DynamicColumnHandle>>) into a
(ColumnType, Vec<Option<DynamicHandle>>) where all of the columns that are Some contain at least one element.

The rest is then similar to what exists today.
The code should not be more complicated than it is today.

One tricky part to consider is how to deal with required columns.
If all columns for this (String, ColumnTypeCategory) are empty (following deletes), whether or not we want a column depends on whether the column is marked as required or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the high cardinality num column case, the BTreeMap variant may cost more than 100MB heap allocation, which is why I opted to the Iterator version.
From the code complexity the main difference is the mixing of the required columns into the (sorted by column name) Column stream instead of simply inserting them into the BTreeMap

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too complex. Can you go with the BTreeMap solution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to a BtreeMap

@PSeitz PSeitz requested a review from fulmicoton August 13, 2023 21:19
@PSeitz PSeitz changed the title lazy columnar merge lazy column opening during merge Aug 14, 2023
@PSeitz PSeitz changed the title lazy column opening during merge delayed column opening during merge Aug 14, 2023
@PSeitz PSeitz merged commit 59460c7 into main Aug 21, 2023
@PSeitz PSeitz deleted the columnar_merge2 branch August 21, 2023 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants