Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LogMergePolicy knob del_docs_percentage_before_merge #1238

Merged
merged 10 commits into from
Dec 20, 2021
2 changes: 1 addition & 1 deletion src/indexer/index_writer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -974,7 +974,7 @@ mod tests {
assert_eq!(
format!("{:?}", index_writer.get_merge_policy()),
"LogMergePolicy { min_num_segments: 8, max_docs_before_merge: 10000000, min_layer_size: 10000, \
level_log_size: 0.75 }"
level_log_size: 0.75, max_del_docs_pct: 100 }"
);
shikhar marked this conversation as resolved.
Show resolved Hide resolved
let merge_policy = Box::new(NoMergePolicy::default());
index_writer.set_merge_policy(merge_policy);
Expand Down
33 changes: 28 additions & 5 deletions src/indexer/log_merge_policy.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ const DEFAULT_LEVEL_LOG_SIZE: f64 = 0.75;
const DEFAULT_MIN_LAYER_SIZE: u32 = 10_000;
const DEFAULT_MIN_NUM_SEGMENTS_IN_MERGE: usize = 8;
const DEFAULT_MAX_DOCS_BEFORE_MERGE: usize = 10_000_000;
const DEFAULT_MAX_DEL_DOCS_PCT: u8 = 100;
Copy link
Collaborator Author

@shikhar shikhar Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene default for a similar knob is 33% https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L91

100% is effectively a no-op over the current policy. Lowering it to 33% causes some tests to fail, probably worth working through it though, if the approach makes sense!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should lower it, not sure if 33% is not too early though.


/// `LogMergePolicy` tries to merge segments that have a similar number of
/// documents.
Expand All @@ -17,6 +18,7 @@ pub struct LogMergePolicy {
max_docs_before_merge: usize,
min_layer_size: u32,
level_log_size: f64,
max_del_docs_pct: u8,
}

impl LogMergePolicy {
Expand Down Expand Up @@ -52,19 +54,34 @@ impl LogMergePolicy {
pub fn set_level_log_size(&mut self, level_log_size: f64) {
self.level_log_size = level_log_size;
}

/// Set the maximum percentage of deleted documents in a segment to
/// tolerate, if it is exceeded by any segment at a log level, a merge
/// will be triggered for it.
///
/// If there is a single segment at a level, we effectively end up expunging
/// deleted documents from it.
pub fn set_max_del_docs_pct(&mut self, max_del_docs_pct: u8) {
assert!(max_del_docs_pct <= 100);
self.max_del_docs_pct = max_del_docs_pct;
}
}

fn deletes_pct(segment: &SegmentMeta) -> u8 {
(segment.num_deleted_docs() as u64 * 100 / segment.max_doc() as u64) as u8
}

impl MergePolicy for LogMergePolicy {
fn compute_merge_candidates(&self, segments: &[SegmentMeta]) -> Vec<MergeCandidate> {
let mut size_sorted_segments = segments
let size_sorted_segments = segments
.iter()
.filter(|segment_meta| segment_meta.num_docs() <= (self.max_docs_before_merge as u32))
.filter(|seg| seg.num_docs() <= (self.max_docs_before_merge as u32))
.sorted_by_key(|seg| std::cmp::Reverse(seg.max_doc()))
.collect::<Vec<&SegmentMeta>>();

if size_sorted_segments.len() <= 1 {
if size_sorted_segments.is_empty() {
return vec![];
}
size_sorted_segments.sort_by_key(|seg| std::cmp::Reverse(seg.num_docs()));

let mut current_max_log_size = f64::MAX;
let mut levels = vec![];
Expand All @@ -82,7 +99,12 @@ impl MergePolicy for LogMergePolicy {

levels
.iter()
.filter(|level| level.len() >= self.min_num_segments)
.filter(|level| {
level.len() >= self.min_num_segments
|| level
.iter()
.any(|segment| deletes_pct(segment) > self.max_del_docs_pct)
shikhar marked this conversation as resolved.
Show resolved Hide resolved
})
.map(|segments| MergeCandidate(segments.iter().map(|&seg| seg.id()).collect()))
.collect()
}
Expand All @@ -95,6 +117,7 @@ impl Default for LogMergePolicy {
max_docs_before_merge: DEFAULT_MAX_DOCS_BEFORE_MERGE,
min_layer_size: DEFAULT_MIN_LAYER_SIZE,
level_log_size: DEFAULT_LEVEL_LOG_SIZE,
max_del_docs_pct: DEFAULT_MAX_DEL_DOCS_PCT,
}
}
}
Expand Down
2 changes: 1 addition & 1 deletion src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,7 @@ pub use self::docset::{DocSet, TERMINATED};
pub use crate::core::{Executor, SegmentComponent};
pub use crate::core::{
Index, IndexBuilder, IndexMeta, IndexSettings, IndexSortByField, Order, Searcher, Segment,
SegmentId, SegmentMeta,
SegmentId, SegmentMeta, SegmentMetaInventory,
};
pub use crate::core::{InvertedIndexReader, SegmentReader};
pub use crate::directory::Directory;
Expand Down