Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge bug for objects with position enabled and including at least one number #2251

Closed
fulmicoton opened this issue Nov 14, 2023 · 3 comments · Fixed by #2253
Closed

Merge bug for objects with position enabled and including at least one number #2251

fulmicoton opened this issue Nov 14, 2023 · 3 comments · Fixed by #2253

Comments

@fulmicoton
Copy link
Collaborator

fulmicoton commented Nov 14, 2023

minimal reproducing test.

#[test]
fn test_positions_dynamic_mode() {
    let mut schema_builder = Schema::builder();
    let field = schema_builder.add_json_field("dynamic", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());
    let mut writer: IndexWriter = index.writer_for_tests().unwrap();
    let mut merge_policy = LogMergePolicy::default();
    merge_policy.set_min_num_segments(2);
    writer.set_merge_policy(Box::new(merge_policy));
    // Here a string would work.
    let doc_json = r#"{"tenant_id":75}"#;
    let vals = serde_json::from_str(doc_json).unwrap();
    let mut doc = TantivyDocument::default();
    doc.add_object(field, vals);
    writer.add_document(doc.clone()).unwrap();
    writer.commit().unwrap();
    writer.add_document(doc.clone()).unwrap();
    writer.commit().unwrap();
    writer.wait_merging_threads().unwrap();
    let reader = index.reader().unwrap();
    assert_eq!(reader.searcher().segment_readers().len(), 1);
}
@fulmicoton
Copy link
Collaborator Author

I git bisected the problem

0d4589219bcf8b4f10326e599c23ac102d28e42b is the first bad commit
commit 0d4589219bcf8b4f10326e599c23ac102d28e42b
Author: trinity-1686a <[email protected]>
Date:   Fri Oct 20 16:58:26 2023 +0200

    encode some part of posting list as -1 instead of direct values (#2185)

    * add support for delta-1 encoding posting list

    * encode term frequency minus one

    * don't emit tf for json integer terms

    * make skipreader not pub(crate) mutable

@fulmicoton
Copy link
Collaborator Author

other problem..
the spec for term freq is shaky.

for numbers, we do not record the term freq on sgement writing
but then we record a term freq of 0 after merging.

@PSeitz
Copy link
Contributor

PSeitz commented Nov 14, 2023

We should extend test_operation_strategy in index_writer to also include mixed types

fulmicoton added a commit that referenced this issue Nov 14, 2023
In JSON Object field the presence of term frequencies depend on the
field.
Typically, a string with postiions indexed will have positions
while numbers won't.

The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.

It is given by the presence of extra information in the skip info, or
the lack of term freqs after decoding vint blocks.

Before, after writing a segment, we would encode the segment correctly
(without any term freq for number in json object field).
However during merge, we would get the default term freq=1 value.
(this is default in the absence of encoded term freqs)

The merger would then proceed and attempt to decode 1 position when
there are in fact none.

This PR requires to explictly tell the posting serialize whether
term frequencies should be serialized for each new term.

Closes #2251
fulmicoton added a commit that referenced this issue Nov 14, 2023
In JSON Object field the presence of term frequencies depend on the
field.
Typically, a string with postiions indexed will have positions
while numbers won't.

The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.

It is given by the presence of extra information in the skip info, or
the lack of term freqs after decoding vint blocks.

Before, after writing a segment, we would encode the segment correctly
(without any term freq for number in json object field).
However during merge, we would get the default term freq=1 value.
(this is default in the absence of encoded term freqs)

The merger would then proceed and attempt to decode 1 position when
there are in fact none.

This PR requires to explictly tell the posting serialize whether
term frequencies should be serialized for each new term.

Closes #2251
PSeitz pushed a commit that referenced this issue Apr 10, 2024
In JSON Object field the presence of term frequencies depend on the
field.
Typically, a string with postiions indexed will have positions
while numbers won't.

The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.

It is given by the presence of extra information in the skip info, or
the lack of term freqs after decoding vint blocks.

Before, after writing a segment, we would encode the segment correctly
(without any term freq for number in json object field).
However during merge, we would get the default term freq=1 value.
(this is default in the absence of encoded term freqs)

The merger would then proceed and attempt to decode 1 position when
there are in fact none.

This PR requires to explictly tell the posting serialize whether
term frequencies should be serialized for each new term.

Closes #2251
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants