
change format for store to make it faster with small documents #1569

Merged: 3 commits, Oct 4, 2022

Conversation

@trinity-1686a (Contributor) commented Sep 29, 2022

fix #1552

I've used a limit dependent on block_size, under the assumption that someone setting this to high values knows what they are doing and is probably storing large enough documents, but maybe setting a constant value would be safer? The current settings make blocks of at most 1093 docs when asked for 16_384 B.

In a micro-benchmark with 10_000_000 docs getting merged, remap_and_write has gone from 97s to 16s. Hardcoding a lower value for the doc limit (512 instead of the computed 1092) makes it go even faster (10s), but I don't know whether such a value could be harmful in other use cases.

@PSeitz (Contributor) commented Sep 29, 2022

Sorry, for benchmarks you should use #1565, the performance bugfix for the docstore rewrite. It's not merged yet.

I would prefer a constant. If the value is too low, we will hurt compression, so I would be relatively conservative about that value (performance/compression tradeoff). In Quickwit we use quite large blocks of 1MB. On the hdfs dataset that's ~3000 documents per block.

@fulmicoton (Collaborator) commented Sep 29, 2022

I prefer a constant too. 128?

@trinity-1686a (Contributor, Author)

I'm not sure we are talking about the same thing when talking about a constant.
Currently this PR makes it so that blocks get terminated either when they're full, or when they would be full if they were storing documents of 15 bytes (which is the weight of a document with a single u64 field).
With Tantivy default settings, that's 1093 docs per block max; with Quickwit it's around 70k.
Should I understand that we want blocks of 128 docs max, or that we should assume a small document is 128 bytes, so Tantivy would store 128 docs per block max, and Quickwit 8192?

@fulmicoton (Collaborator) commented Sep 30, 2022

So we had several problems.

The main one was the read amplification when producing the first segment of an index sorted by a field.
In that case, all docs are first written to a temp store, and once we know the permutation to apply, we do one get per doc.
The cache was useless due to the random access.

With small docs, we ended up decompressing the same block as many times as the number of docs in that block.
In #1565 Pascal made it so that for this temp store the docstore is uncompressed, and each doc gets its own individual block.

The remaining problem is, generally speaking, that if we have tiny documents, we have to pay for what looks like a weird linear lookup every time we fetch a doc. That linear lookup is rather silly because of the way the data is laid out.

In merges in particular, we end up fetching each doc one time. The block decompression cache is saving the day, but we still have to do this linear lookup.

Two questions for you:
Do we have an actual problem?

In other words, for tiny docs, can this weird linear lookup be non-negligible?
One place where this might become important is for a user that is trying to stream docs as fast as possible, possibly matching a query (you can test with doc_id in 0..max_doc).

Several users already do something like that. This is important for quickwit too.

Can you create a small bench with

  • an index with small docs,
  • a for loop on 0..max_doc that gets each document?

If your PR helps a lot, I think we should consider changing the layout of the block.

Instead of

Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
 ...

Make it

|offset_index| Doc | Doc | Doc | Doc |

The offset_index could just be all of the doc offsets within the block, each encoded over 4 bytes... That's a pre-compression overhead of 3 bytes per doc that I think we can afford.

What do you think?
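To make the proposal concrete, here is a rough, illustrative sketch (not the actual tantivy code) of how a reader could locate doc `i` inside a decompressed block that starts with a 4-byte-per-doc offset index. The function name, the `num_docs` parameter, and the choice of payload-relative little-endian offsets are assumptions made for the sketch:

```rust
// Illustrative sketch only, not the actual tantivy code.
// Assumed block layout: |offset_index| Doc | Doc | ... |, where the offset
// index holds one little-endian u32 per doc, relative to the start of the
// payload area. `num_docs` is assumed to be known from the block metadata.
fn doc_range(block: &[u8], num_docs: usize, i: usize) -> std::ops::Range<usize> {
    let index_len = num_docs * 4;
    let read_u32 = |pos: usize| {
        u32::from_le_bytes([block[pos], block[pos + 1], block[pos + 2], block[pos + 3]]) as usize
    };
    // Two direct reads replace the linear vint scan over all preceding docs.
    let start = index_len + read_u32(i * 4);
    let end = if i + 1 < num_docs {
        index_len + read_u32((i + 1) * 4)
    } else {
        block.len()
    };
    start..end
}
```

With a layout like this, fetching doc `i` no longer requires decoding the lengths of every doc that precedes it in the block.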

@fulmicoton (Collaborator)

> Should I understand we want blocks of 128 docs max

I meant blocks of 128 documents maximum.

@PSeitz (Contributor) commented Sep 30, 2022

While multiple decompressions (fixed by #1565) can be a problem in principle, I never observed it in my tests. Since most documents were in a few blocks, I did hit the cache. The issue was the linear lookup.

128 is probably too low and would hurt compression. E.g. with the hdfs dataset and 1MB blocks that would be 3000 docs.

I think changing the layout is a good idea: same compression, but faster random access. The offset index could be converted to a bitpacked index to remove the linear search. Then we could also discard the docs limit.
Bitpacking is not ideal for very different document lengths, but if we have very large documents, then the stored length is dwarfed anyway.

@trinity-1686a (Contributor, Author) commented Sep 30, 2022

> Two questions for you:
> Do we have an actual problem?
> In other words, for tiny docs, can this weird linear lookup be non-negligible?

I benched the doc store (no index, just the store), reading 256k docs out of 128m, with block size 1MB:

| bench (docs per block) | total duration | amortized per doc |
| --- | --- | --- |
| unlimited docs per block (=512k) | 159.39s | 608µs |
| assume 15B doc (=70k) | 23.33s | 89.0µs |
| assume 128B doc (=8k) | 5.33s | 20.3µs |
| 128 docs per block | 2.71s | 10.3µs |

I haven't benchmarked in context (with an index) yet, but 608µs per lookup seems quite high.
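For reference, a rough sketch of the streaming loop fulmicoton described earlier (iterate doc ids in 0..max_doc and get each document), assuming the Searcher/DocAddress API tantivy exposed around this release; `stream_all_docs` is a hypothetical name, and this is not the harness behind the table above:

```rust
use tantivy::{DocAddress, Index};

// Fetch every stored document, one get per doc id, segment by segment.
fn stream_all_docs(index: &Index) -> tantivy::Result<()> {
    let reader = index.reader()?;
    let searcher = reader.searcher();
    for (segment_ord, segment_reader) in searcher.segment_readers().iter().enumerate() {
        for doc_id in 0..segment_reader.max_doc() {
            // Each fetch goes through the doc store: find the block, decompress
            // it (or hit the cache), then locate the doc inside the block.
            let _doc = searcher.doc(DocAddress::new(segment_ord as u32, doc_id))?;
        }
    }
    Ok(())
}
```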

it makes writing the block faster due to one less memcopy
@trinity-1686a trinity-1686a changed the title limit number of small documents in single block change format for store to make it faster with small documents Oct 3, 2022
@trinity-1686a (Contributor, Author)

I've kept the PR since the changes follow the discussion that happened here, but the solution implemented is totally different from the original one, @fulmicoton @PSeitz.

I've gone with

| Doc | Doc | Doc | Doc | offset_index |

instead of

|offset_index| Doc | Doc | Doc | Doc |

because the latter caused a slight performance regression on the writing side: data needed to be copied to an intermediary buffer before being sent to the compressor.
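A minimal sketch of why the trailing index avoids that copy; the struct and method names here are hypothetical, and the real flush logic is visible in the review snippet further down:

```rust
// Hypothetical sketch of the write side with a trailing index: doc bytes are
// appended straight to the block buffer, and only the small offset table plus
// its length are appended when the block is closed.
struct BlockWriter {
    current_block: Vec<u8>,
    doc_offsets: Vec<u32>,
}

impl BlockWriter {
    fn add_doc(&mut self, doc_bytes: &[u8]) {
        // Record where this doc starts, then append its payload directly;
        // nothing precedes the docs, so no intermediary buffer is needed.
        self.doc_offsets.push(self.current_block.len() as u32);
        self.current_block.extend_from_slice(doc_bytes);
    }

    fn close_block(&mut self) {
        // Append the offset index, then the number of entries, at the very end.
        for offset in &self.doc_offsets {
            self.current_block.extend_from_slice(&offset.to_le_bytes());
        }
        self.current_block
            .extend_from_slice(&(self.doc_offsets.len() as u32).to_le_bytes());
        self.doc_offsets.clear();
    }
}
```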

Comment on lines 316 to 319
Ok(Range {
start: start_pos,
end: end_pos,
})
Collaborator

nitpick

Suggested change (use a range expression instead of constructing Range explicitly):

Ok(start_pos..end_pos)

@fulmicoton (Collaborator) left a comment

It is also cleaner that way.

@fulmicoton (Collaborator)

@trinity-1686a awesome job

let index_start = block.len() - (index_len + 1) * size_of_u32;
let index = &block[index_start..index_start + index_len * size_of_u32];

let start_pos = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;
Contributor

Suggested change (rename start_pos to start_offset):

let start_offset = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;

imo _offset is a better name for byte offsets

Comment on lines +81 to +88
let size_of_u32 = std::mem::size_of::<u32>();
self.current_block
.reserve((self.doc_pos.len() + 1) * size_of_u32);

for pos in self.doc_pos.iter() {
pos.serialize(&mut self.current_block)?;
}
(self.doc_pos.len() as u32).serialize(&mut self.current_block)?;
Contributor

I would bitpack them here, so the cost per doc is not fixed at 4 bytes but depends on the block size (e.g. 3 bytes for 2MB blocks).

Usage is very simple, see bitpacked.rs
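For illustration only, a hand-rolled sketch of the bitpacking idea (the PR itself would rely on tantivy's bitpacked.rs rather than this code): each offset is written with just enough bits for the largest offset in the block, so a 2MB block needs roughly 21 bits (about 3 bytes) per doc instead of a full 4-byte u32.

```rust
// Illustrative, LSB-first bitpacking of block-local offsets.
fn pack_offsets(offsets: &[u32], out: &mut Vec<u8>) {
    let max = offsets.iter().copied().max().unwrap_or(0) as u64;
    let num_bits = (64 - max.leading_zeros()).max(1) as u64;
    let mut acc: u64 = 0; // bit accumulator
    let mut filled: u64 = 0; // number of valid bits currently in `acc`
    for &offset in offsets {
        acc |= (offset as u64) << filled;
        filled += num_bits;
        while filled >= 8 {
            out.push((acc & 0xFF) as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    if filled > 0 {
        out.push((acc & 0xFF) as u8); // flush the last partial byte
    }
    out.push(num_bits as u8); // record the bit width so a reader can unpack
}
```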

Collaborator

Let's stick to this.

@trinity-1686a trinity-1686a merged commit 5945dbf into main Oct 4, 2022
@trinity-1686a trinity-1686a deleted the issue-1552 branch October 4, 2022 07:58
Successfully merging this pull request may close: Doc blocks get too large if no fields are stored (#1552)