Optimize compaction strategy to avoid merging SST files without overlapping time ranges #3416
We compact small files with a large file. e.g.
[diagram omitted]
becomes
[diagram omitted]
We might stop compacting a window if the output file is too large. However, we need to choose files carefully, since we remove deleted keys during compaction, and that is only safe when all of the overlapping files are included.
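For illustration, a minimal sketch of such a size cap, using hypothetical names rather than the actual strategy code: keep adding the smallest files to a window's inputs until the estimated output would exceed a limit.

```rust
/// Pick input files (by size) for one window, capping the estimated output.
/// Hypothetical helper, not the actual compaction picker.
fn select_until_limit(mut sizes: Vec<u64>, max_output: u64) -> Vec<u64> {
    sizes.sort_unstable(); // prefer small files first
    let mut picked = Vec::new();
    let mut total = 0u64;
    for s in sizes {
        // Stop once the estimated output would exceed the cap, but always
        // take at least one file so the window still makes progress.
        if total + s > max_output && !picked.is_empty() {
            break;
        }
        total += s;
        picked.push(s);
    }
    picked
}
```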
For active windows, we enforce max_sorted_runs (maybe applied to both L0 and L1) and max_files (applied to all files in the active window, or just to L0 in the active window?). If the current number of runs exceeds that value, we find the cheapest way to achieve the goal; for example, for an active window with max_sorted_runs=1, we would pick the merge that reaches a single sorted run at the lowest cost (see the sketch below). For inactive windows, consider that sometimes there will be out-of-order insertions. When an active window shifts to an inactive window, we check whether there is only one sorted run in both L0 and L1 (this does not require another routine; every compaction should check every time window in the current implementation).
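Below is a hedged sketch of the "cheapest way" selection, assuming a hypothetical `SortedRun` type: merging k runs into one reduces the run count by k - 1, so the rewritten bytes are minimized by merging the smallest runs.

```rust
/// One sorted run inside a time window (hypothetical type).
#[derive(Debug, Clone)]
struct SortedRun {
    /// Total size in bytes of the SST files forming this run.
    size: u64,
}

/// To go from n runs down to `max_sorted_runs`, we must merge
/// n - max_sorted_runs + 1 runs into one; choosing the smallest
/// runs minimizes the bytes rewritten.
fn pick_runs_to_merge(mut runs: Vec<SortedRun>, max_sorted_runs: usize) -> Vec<SortedRun> {
    if runs.len() <= max_sorted_runs {
        return Vec::new(); // already within the limit; nothing to compact
    }
    runs.sort_by_key(|r| r.size);
    let to_merge = runs.len() - max_sorted_runs + 1;
    runs.into_iter().take(to_merge).collect()
}

fn main() {
    let runs = vec![
        SortedRun { size: 4 << 30 },  // a 4 GiB run, left untouched
        SortedRun { size: 10 << 20 }, // 10 MiB
        SortedRun { size: 12 << 20 }, // 12 MiB
        SortedRun { size: 8 << 20 },  // 8 MiB
    ];
    // With max_sorted_runs = 2, the three smallest runs are merged into one.
    let picked = pick_runs_to_merge(runs, 2);
    println!("merging {} runs", picked.len());
}
```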
We should ensure all overlapping files in the same time window are compacted into one sorted run, since we only have 2 levels and always remove delete tombstones after compaction. Otherwise, we should add a flag to the merge reader to control whether to remove delete tombstones: we only filter deleted items when we are sure all overlapping files are being compacted.
Just like Cassandra's size-tiered strategy, tombstones are only dropped when all overlapping files are involved in the current compaction.
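To make the rule concrete, here is a minimal sketch (hypothetical `FileMeta` type and `can_filter_deleted` helper, not the actual mito2 API) of the check that decides whether deletion markers may be dropped for a given set of inputs:

```rust
/// Metadata for one SST file (hypothetical type).
#[derive(Clone, Copy)]
struct FileMeta {
    id: u64,
    time_range: (i64, i64), // inclusive min/max timestamps
}

fn overlaps(a: (i64, i64), b: (i64, i64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Returns the value for a `filter_deleted`-style flag: true only when the
/// inputs already contain every file they overlap with, so dropping a DELETE
/// tombstone cannot resurrect a row living in a file outside this compaction.
fn can_filter_deleted(inputs: &[FileMeta], all_files: &[FileMeta]) -> bool {
    all_files.iter().all(|f| {
        inputs.iter().any(|i| i.id == f.id) // f participates in the compaction,
            || inputs.iter().all(|i| !overlaps(i.time_range, f.time_range)) // or never overlaps it
    })
}
```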
I also found a potential issue: compaction jobs might block flush jobs. When using object stores, a compaction job with lots of input files results in a large number of file purge tasks. Deleting an object in an object store is slower than deleting a file on local disk. These purge tasks use the same background scheduler as flush jobs and slow them down.
What's more, compaction jobs and flush jobs also share the same background worker pool and job limit.
I created PR #3621 for this. For flush and compaction jobs, I wonder if we should use dedicated schedulers.
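As an illustration only (an assumed design sketch, not the code in PR #3621), giving each job kind its own queue and worker means purge tasks queued behind compactions can no longer delay flushes:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send>;

/// Spawn a dedicated worker with its own queue for one kind of job.
fn spawn_worker(name: &'static str) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::Builder::new()
        .name(name.into())
        .spawn(move || {
            // Run jobs until every sender is dropped.
            for job in rx {
                job();
            }
        })
        .unwrap();
    tx
}

fn main() {
    let flush = spawn_worker("flush"); // flushes never wait on compactions
    let compaction = spawn_worker("compaction"); // compaction and purge get their own queue
    compaction
        .send(Box::new(|| println!("compact SSTs, then purge inputs")))
        .unwrap();
    flush.send(Box::new(|| println!("flush memtable"))).unwrap();
    thread::sleep(Duration::from_millis(100)); // let the workers run (demo only)
}
```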
Great job!
@v0y4g3r Assigned a doc task to you: GreptimeTeam/docs#1009
I'll add a separate doc section to describe how compaction works in v0.9.
What type of enhancement is this?
Performance
What does the enhancement do?
With the introduction of the new merge tree memtable, SST files in L0 can be as large as several GiB. In that case, merging L0 files regardless of their size can be expensive while doing little for read performance. We should leverage a "tiered" merging manner that skips large files, as well as files that do not overlap with others in terms of time ranges (see the sketch below).
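A hedged sketch of such a picker, using hypothetical names and thresholds rather than the actual implementation: only L0 files that are below a size cap and overlap at least one other candidate in time are selected.

```rust
/// One SST file candidate (hypothetical type).
#[derive(Clone, Copy)]
struct SstFile {
    size: u64,
    time_range: (i64, i64), // inclusive min/max timestamps
}

fn overlaps(a: (i64, i64), b: (i64, i64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

fn pick_l0_inputs(files: &[SstFile], max_input_size: u64) -> Vec<SstFile> {
    files
        .iter()
        .enumerate()
        // Skip multi-GiB L0 files produced by the merge tree memtable.
        .filter(|(_, f)| f.size <= max_input_size)
        // Skip files whose time range overlaps no other file: merging them
        // would rewrite bytes without reducing read amplification.
        .filter(|(i, f)| {
            files
                .iter()
                .enumerate()
                .any(|(j, g)| *i != j && overlaps(f.time_range, g.time_range))
        })
        .map(|(_, f)| *f)
        .collect()
}
```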
Tasks
- filter_deleted option to avoid removing deletion markers #3707

Implementation challenges
No response