change format for store to make it faster with small documents #1569
Conversation
Sorry, you should use the performance bugfix from the docstore rewrite #1565 for benchmarks; it's not merged yet. I would prefer a constant. If the value is too low, we will hurt compression, so I would be relatively conservative about that value (performance/compression tradeoff). In quickwit we use quite large blocks of 1MB; on the hdfs dataset that's ~3000 documents per block.
I prefer a constant too. 128?
I'm not sure we are talking about the same thing when talking about a constant.
So we had several problems. The main one was the read amplification when producing the first segment with an index sorted by a field: with small docs, we ended up decompressing the same block as many times as the number of docs in that block. The remaining problem is, generally speaking, that with tiny documents we have to pay for what looks like a weird linear lookup. In merges in particular, we end up fetching each doc one time; the block decompression cache is saving the day, but we still have to do this linear lookup. Two questions for you. First, in other words, for tiny docs, can this weird linear lookup be non-negligible? Several users already do something like that, and this is important for quickwit too. Second, can you create a small bench with
If your PR helps a lot, I think we should consider changing the layout of the block. Instead of
Make it
The offset_index could just be all of the doc offsets within the block, each encoded over 4 bytes... That's a pre-compression overhead of 3 bytes that I think we can afford. What do you think?
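As a rough sketch of what that layout could look like (the function names below are hypothetical, not the PR's actual code): doc payloads first, then one u32 start offset per doc, then the doc count as a trailing u32. With that trailing index, looking up a doc inside a block becomes a direct offset read instead of a linear scan over per-doc lengths.

// Hypothetical block layout: [doc payloads][u32 start offset per doc][doc count as u32].
fn write_block(docs: &[&[u8]], out: &mut Vec<u8>) {
    // Assumes `out` is empty, so offsets are relative to the start of the block.
    let mut offsets: Vec<u32> = Vec::with_capacity(docs.len());
    let mut start = 0u32;
    for doc in docs {
        offsets.push(start);
        out.extend_from_slice(doc);
        start += doc.len() as u32;
    }
    for offset in &offsets {
        out.extend_from_slice(&offset.to_le_bytes());
    }
    out.extend_from_slice(&(docs.len() as u32).to_le_bytes());
}

// Random access to the doc_pos-th document: two u32 reads, no linear scan.
fn doc_range(block: &[u8], doc_pos: usize) -> std::ops::Range<usize> {
    let read_u32 = |at: usize| {
        u32::from_le_bytes([block[at], block[at + 1], block[at + 2], block[at + 3]]) as usize
    };
    let num_docs = read_u32(block.len() - 4);
    let index_start = block.len() - (num_docs + 1) * 4;
    let start = read_u32(index_start + doc_pos * 4);
    let end = if doc_pos + 1 < num_docs {
        read_u32(index_start + (doc_pos + 1) * 4)
    } else {
        index_start
    };
    start..end
}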
I meant blocks of 128 documents maximum.
While multiple decompression (fixed by #1565) can be a problem in principle, I never observed it in my tests: since most documents were in a few blocks, I did hit the cache. The issue was the linear lookup. 128 is probably too low and would hurt compression; e.g. with the hdfs dataset and 1MB blocks that would be 3000 docs. I think changing the layout is a good idea: same compression, but faster random access. The offset index could be converted to a bitpacked index to remove the linear search. Then we could also discard the docs limit.
I benched the doc store (no index, just the store), reading 256k docs out of 128m, with block size 1MB:
I haven't benchmarked in context (with an index) yet, but 608µs per lookup seems quite high.
it makes writing the block faster due to one less memcopy
Force-pushed from e641dd5 to 8ac4887
I've kept the PR since the changes follow the discussion that happened here, but the solution implemented is totally different from before @fulmicoton @PSeitz. I've gone with
instead of
because the latter caused a slight performance regression on the writing side: data needed to be copied to an intermediate buffer before being sent to the compressor.
src/store/reader.rs
Outdated
Ok(Range {
    start: start_pos,
    end: end_pos,
})
nitpick
Suggested change:
- Ok(Range {
-     start: start_pos,
-     end: end_pos,
- })
+ Ok(start_pos..end_pos)
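For context, a..b in Rust is just syntactic sugar for Range { start: a, end: b }, so Ok(start_pos..end_pos) builds exactly the same value with less noise.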
It is also cleaner that way.
@trinity-1686a awesome job
src/store/reader.rs
Outdated
let index_start = block.len() - (index_len + 1) * size_of_u32;
let index = &block[index_start..index_start + index_len * size_of_u32];

let start_pos = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;
Suggested change:
- let start_pos = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;
+ let start_offset = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;
imo _offset is a better name for byte offsets
let size_of_u32 = std::mem::size_of::<u32>();
self.current_block
    .reserve((self.doc_pos.len() + 1) * size_of_u32);

for pos in self.doc_pos.iter() {
    pos.serialize(&mut self.current_block)?;
}
(self.doc_pos.len() as u32).serialize(&mut self.current_block)?;
I would bitpack them here, so the cost per doc is not fixed to 4 bytes but depends on the block size (e.g. 3 bytes for 2MB blocks).
Usage is very simple, see bitpacked.rs
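For illustration only, a standalone sketch of that idea (the function names are made up here; the real code would reuse the helpers in bitpacked.rs). With num_bits derived from the block size, a 2MB block needs 21 bits per offset, i.e. roughly 3 bytes per doc instead of a full u32, and lookups stay constant-time:

// Sketch of bitpacking the per-doc start offsets; assumes num_bits <= 32.
fn pack_offsets(offsets: &[u32], num_bits: u32) -> Vec<u8> {
    let total_bits = offsets.len() as u64 * num_bits as u64;
    let mut out = vec![0u8; ((total_bits + 7) / 8) as usize];
    for (i, &offset) in offsets.iter().enumerate() {
        for b in 0..num_bits as u64 {
            if offset & (1u32 << b) != 0 {
                let bit = i as u64 * num_bits as u64 + b;
                out[(bit / 8) as usize] |= 1 << (bit % 8);
            }
        }
    }
    out
}

// Reads back the idx-th offset; still a constant-time lookup per doc.
fn unpack_offset(packed: &[u8], idx: usize, num_bits: u32) -> u32 {
    let mut value = 0u32;
    for b in 0..num_bits as u64 {
        let bit = idx as u64 * num_bits as u64 + b;
        if packed[(bit / 8) as usize] & (1 << (bit % 8)) != 0 {
            value |= 1u32 << b;
        }
    }
    value
}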
Let's stick to this.
fix #1552
I've used a limit dependent on block_size, under the assumption that someone setting this to high values knows what they are doing and is probably storing large enough documents, but maybe setting a constant value would be safer? The current settings make blocks of 1093 docs at most when asked for 16_384B. In a micro-benchmark with 10_000_000 docs getting merged, remap_and_write has gone from 97s to 16s. Hardcoding a lower value for the doc limit (512 instead of the 1092 computed) makes it go even faster (10s), but I don't know whether such a value could be harmful in other use cases.
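Purely for illustration, one way such a block-size-dependent limit could look; the divisor below is only a guess chosen to roughly reproduce the numbers above (16_384 / 15 ≈ 1092) and may not match what the PR actually computes:

// Hypothetical: cap the number of docs per block in proportion to the block size.
const ASSUMED_BYTES_PER_DOC: usize = 15; // assumption, not the PR's actual constant

fn max_docs_per_block(block_size: usize) -> usize {
    (block_size / ASSUMED_BYTES_PER_DOC).max(1)
}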