
change format for store to make it faster with small documents #1569

Merged: 3 commits, Oct 4, 2022

Conversation

@trinity-1686a (Contributor) commented Sep 29, 2022

fix #1552

I've used a limit dependent on block_size, under the assumption that someone setting this to high values knows what they are doing and is probably storing large enough documents, but maybe setting a constant value would be safer? The current settings make blocks of at most 1093 docs when asked for 16_384 B.

In a micro-benchmark with 10_000_000 docs getting merged, remap_and_write has gone from 97s to 16s. Hardcoding a lower value for the doc limit (512 instead of the computed 1092) makes it go even faster (10s), but I don't know whether such a value could be harmful in other use cases.

@PSeitz (Contributor) commented Sep 29, 2022

Sorry, for benchmarks you should use #1565, the performance bugfix for the docstore rewrite. It's not merged yet.

I would prefer a constant. If the value is too low, we will hurt compression, so I would be relatively conservative about that value (performance/compression tradeoff). In Quickwit we use quite large blocks of 1MB. On the hdfs dataset that's ~3000 documents per block.

@fulmicoton (Collaborator) commented Sep 29, 2022

I prefer a constant too. 128?

@trinity-1686a (Contributor, Author)

I'm not sure we are talking about the same thing when talking about a constant.
Currently this PR makes it so that blocks get terminated either when they're full, or when they would be full if they were storing documents of 15 bytes (which is the weight of a document with a single u64 field).
With Tantivy default settings, that's 1093 docs per block max; with Quickwit it's around 70k.
Should I understand that we want blocks of 128 docs max, or that we should assume a small document is 128 bytes, so Tantivy would store 128 docs per block max, and Quickwit 8192?

@fulmicoton (Collaborator) commented Sep 30, 2022

So we had several problems.

The main one was the read amplification when producing the first segment of an index sorted by a field.
In that case, all docs are first written to a temp store, and once we know the permutation to apply, we do one get per doc.
The cache was useless due to the random access.

With small docs, we ended up decompressing the same block as many times as the number of docs in that block.
In #1565 Pascal made it so that for this temp store the docstore is uncompressed, and each doc gets its own individual block.

The remaining problem is, generally speaking, that if we have tiny documents, we have to pay for what looks like a weird linear lookup every time we fetch a doc. That linear lookup is rather silly because of the way the data is laid out.

In merges in particular, we end up fetching each doc one time. The block decompression cache is saving the day, but we still have to do this linear lookup.

Two questions for you:
Do we have an actual problem?

In other words, for tiny docs, can this weird linear lookup be non-negligible?
One place where this might become important is for a user that is trying to stream docs as fast as possible, possibly matching a query (you can test with doc_id in 0..max_doc).

Several users already do something like that. This is important for quickwit too.

Can you create a small bench with

  • an index with small docs,
  • a for loop on 0..max_doc that gets each document?

If your PR helps a lot, I think we should consider changing the layout of the block.

Instead of

Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
Doc Length (vint) | Doc payload
 ...

Make it

|offset_index| Doc | Doc | Doc | Doc |

The offset_index could just be all of the doc offsets within the block, each encoded over 4 bytes... That's a pre-compression overhead of 3 bytes per doc that I think we can afford.

What do you think?
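To make the proposal concrete, here is a rough, illustrative sketch (not the actual tantivy code) of how a reader could locate doc `i` inside a decompressed block that starts with a 4-byte-per-doc offset index. The function name, the `num_docs` parameter, and the choice of payload-relative little-endian offsets are assumptions made for the sketch:

```rust
// Illustrative sketch only, not the actual tantivy code.
// Assumed block layout: |offset_index| Doc | Doc | ... |, where the offset
// index holds one little-endian u32 per doc, relative to the start of the
// payload area. `num_docs` is assumed to be known from the block metadata.
fn doc_range(block: &[u8], num_docs: usize, i: usize) -> std::ops::Range<usize> {
    let index_len = num_docs * 4;
    let read_u32 = |pos: usize| {
        u32::from_le_bytes([block[pos], block[pos + 1], block[pos + 2], block[pos + 3]]) as usize
    };
    // Two direct reads replace the linear vint scan over all preceding docs.
    let start = index_len + read_u32(i * 4);
    let end = if i + 1 < num_docs {
        index_len + read_u32((i + 1) * 4)
    } else {
        block.len()
    };
    start..end
}
```

With a layout like this, fetching doc `i` no longer requires decoding the lengths of every doc that precedes it in the block.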

@fulmicoton (Collaborator)

> Should I understand we want blocks of 128 docs max

I meant blocks of 128 documents maximum.

@PSeitz (Contributor) commented Sep 30, 2022

While multiple decompressions (fixed by #1565) can be a problem in principle, I never observed it in my tests. Since most documents were in a few blocks, I did hit the cache. The issue was the linear lookup.

128 is probably too low and would hurt compression. E.g. with the hdfs dataset and 1MB blocks that would be 3000 docs.

I think changing the layout is a good idea: same compression, but faster random access. The offset index could be converted to a bitpacked index to remove the linear search. Then we could also discard the docs limit.
Bitpacking is not ideal for very different document lengths, but if we have very large documents, then the stored length is dwarfed anyway.

@trinity-1686a (Contributor, Author) commented Sep 30, 2022

> Two questions for you:
> Do we have an actual problem?
> In other words, for tiny docs, can this weird linear lookup be non-negligible?

I benched the doc store (no index, just the store), reading 256k docs out of 128m, with block size 1MB:

| bench (docs per block) | total duration | amortized per doc |
| --- | --- | --- |
| unlimited docs per block (=512k) | 159.39s | 608µs |
| assume 15B doc (=70k) | 23.33s | 89.0µs |
| assume 128B doc (=8k) | 5.33s | 20.3µs |
| 128 docs per block | 2.71s | 10.3µs |

I haven't benchmarked in context (with an index) yet, but 608µs per lookup seems quite high.
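For reference, a rough sketch of the streaming loop fulmicoton described earlier (iterate doc ids in 0..max_doc and get each document), assuming the Searcher/DocAddress API tantivy exposed around this release; `stream_all_docs` is a hypothetical name, and this is not the harness behind the table above:

```rust
use tantivy::{DocAddress, Index};

// Fetch every stored document, one get per doc id, segment by segment.
fn stream_all_docs(index: &Index) -> tantivy::Result<()> {
    let reader = index.reader()?;
    let searcher = reader.searcher();
    for (segment_ord, segment_reader) in searcher.segment_readers().iter().enumerate() {
        for doc_id in 0..segment_reader.max_doc() {
            // Each fetch goes through the doc store: find the block, decompress
            // it (or hit the cache), then locate the doc inside the block.
            let _doc = searcher.doc(DocAddress::new(segment_ord as u32, doc_id))?;
        }
    }
    Ok(())
}
```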

it makes writing the block faster due to one less memcopy
@trinity-1686a trinity-1686a changed the title limit number of small documents in single block change format for store to make it faster with small documents Oct 3, 2022
@trinity-1686a (Contributor, Author)

I've kept the PR since the changes follow the discussion that happened here, but the solution implemented is totally different from the original one, @fulmicoton @PSeitz.

I've gone with

| Doc | Doc | Doc | Doc | offset_index |

instead of

|offset_index| Doc | Doc | Doc | Doc |

because the latter caused a slight performance regression on the writing side: data needed to be copied to an intermediary buffer before being sent to the compressor.
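A minimal sketch of why the trailing index avoids that copy; the struct and method names here are hypothetical, and the real flush logic is visible in the review snippet further down:

```rust
// Hypothetical sketch of the write side with a trailing index: doc bytes are
// appended straight to the block buffer, and only the small offset table plus
// its length are appended when the block is closed.
struct BlockWriter {
    current_block: Vec<u8>,
    doc_offsets: Vec<u32>,
}

impl BlockWriter {
    fn add_doc(&mut self, doc_bytes: &[u8]) {
        // Record where this doc starts, then append its payload directly;
        // nothing precedes the docs, so no intermediary buffer is needed.
        self.doc_offsets.push(self.current_block.len() as u32);
        self.current_block.extend_from_slice(doc_bytes);
    }

    fn close_block(&mut self) {
        // Append the offset index, then the number of entries, at the very end.
        for offset in &self.doc_offsets {
            self.current_block.extend_from_slice(&offset.to_le_bytes());
        }
        self.current_block
            .extend_from_slice(&(self.doc_offsets.len() as u32).to_le_bytes());
        self.doc_offsets.clear();
    }
}
```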

Comment on lines 316 to 319
Ok(Range {
start: start_pos,
end: end_pos,
})
Collaborator

nitpick

Suggested change (use a range expression instead of constructing Range explicitly):

Ok(start_pos..end_pos)

@fulmicoton (Collaborator) left a comment

It is also cleaner that way.

@fulmicoton (Collaborator)

@trinity-1686a awesome job

let index_start = block.len() - (index_len + 1) * size_of_u32;
let index = &block[index_start..index_start + index_len * size_of_u32];

let start_pos = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;
Contributor

Suggested change (rename start_pos to start_offset):

let start_offset = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;

imo _offset is a better name for byte offsets

Comment on lines +81 to +88
let size_of_u32 = std::mem::size_of::<u32>();
self.current_block
.reserve((self.doc_pos.len() + 1) * size_of_u32);

for pos in self.doc_pos.iter() {
pos.serialize(&mut self.current_block)?;
}
(self.doc_pos.len() as u32).serialize(&mut self.current_block)?;
Contributor

I would bitpack them here, so the cost per doc is not fixed at 4 bytes but depends on the block size (e.g. 3 bytes for 2MB blocks).

Usage is very simple, see bitpacked.rs
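For illustration only, a hand-rolled sketch of the bitpacking idea (the PR itself would rely on tantivy's bitpacked.rs rather than this code): each offset is written with just enough bits for the largest offset in the block, so a 2MB block needs roughly 21 bits (about 3 bytes) per doc instead of a full 4-byte u32.

```rust
// Illustrative, LSB-first bitpacking of block-local offsets.
fn pack_offsets(offsets: &[u32], out: &mut Vec<u8>) {
    let max = offsets.iter().copied().max().unwrap_or(0) as u64;
    let num_bits = (64 - max.leading_zeros()).max(1) as u64;
    let mut acc: u64 = 0; // bit accumulator
    let mut filled: u64 = 0; // number of valid bits currently in `acc`
    for &offset in offsets {
        acc |= (offset as u64) << filled;
        filled += num_bits;
        while filled >= 8 {
            out.push((acc & 0xFF) as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    if filled > 0 {
        out.push((acc & 0xFF) as u8); // flush the last partial byte
    }
    out.push(num_bits as u8); // record the bit width so a reader can unpack
}
```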

Collaborator

Let's stick to this.

@trinity-1686a trinity-1686a merged commit 5945dbf into main Oct 4, 2022
@trinity-1686a trinity-1686a deleted the issue-1552 branch October 4, 2022 07:58
Successfully merging this pull request may close: Doc blocks get too large if no fields are stored (#1552)