About range query performance questions. #2266
Can you retest without setting the field fast?
@PSeitz Thank you very much for your reply. I use
Here is some information about the test case:

```rust
BooleanQuery::new(vec![
    (Occur::Must, Box::new(range_query)),
    (Occur::Must, Box::new(text_query)),
]);
```

Here is the test result:
range query answer size:10000, range_duration:280ms
text query answer size:2574, range_duration:0ms
boolean query answer size:5, range_duration:333ms
range query answer size:10000, range_duration:31ms
text query answer size:2574, range_duration:0ms
boolean query answer size:5, range_duration:30ms

One more thing I want to figure out. Here is my minimum test case:

```rust
extern crate tantivy;

use std::fs;
use std::path::Path;
use std::time::Instant;

use serde::{Deserialize, Serialize};
use tantivy::collector::Count;
use tantivy::query::BooleanQuery;
use tantivy::query::QueryParser;
use tantivy::query::RangeQuery;
use tantivy::query_grammar::Occur;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::ReloadPolicy;

#[derive(Deserialize, Serialize)]
struct Item {
    id: u64,
    title: String,
    body: String,
}

fn recreate_or_load_index(index_path: &Path, recreate: bool, schema: &Schema) -> tantivy::Result<Index> {
    if recreate {
        if index_path.exists() {
            let _ = fs::remove_dir_all(index_path);
        }
        let _ = fs::create_dir_all(index_path);
        let mut index = Index::create_in_dir(index_path, schema.clone()).expect("create index error");
        let _ = index.set_default_multithread_executor();
        Ok(index)
    } else {
        let mut index = Index::open_in_dir(index_path).expect("open index error");
        let _ = index.set_default_multithread_executor();
        Ok(index)
    }
}

fn index_from_json_file(index: &Index, json_file_path: &Path, schema: &Schema) -> tantivy::Result<()> {
    // Create index_writer and set merge policy.
    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024 * 8).unwrap();
    let mut merge_policy = tantivy::merge_policy::LogMergePolicy::default();
    merge_policy.set_max_docs_before_merge(100_000);
    merge_policy.set_min_num_segments(4);
    index_writer.set_merge_policy(Box::new(merge_policy));

    let row_id = schema.get_field("row_id").unwrap();
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    // Read datasets.
    let json_content = fs::read_to_string(json_file_path).expect("file should be read");
    let documents: Vec<Item> = serde_json::from_str(&json_content)?;

    // Index data from json.
    for doc in documents {
        let mut temp = Document::default();
        temp.add_u64(row_id, doc.id);
        temp.add_text(title, doc.title);
        temp.add_text(body, doc.body);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit()?;

    // Merge segments and release merging threads.
    let segment_ids = index
        .searchable_segment_ids()
        .expect("Searchable segments failed.");
    if segment_ids.len() > 1 {
        index_writer.merge(&segment_ids).wait()?;
        index_writer.wait_merging_threads()?;
    }
    Ok(())
}

fn main() -> tantivy::Result<()> {
    // Tantivy index files local path.
    let index_path = Path::new("/home/mochix/workspace_github/tantivy_demo/index_path");
    // Datasets, you can download it from AWS S3: wget https://myscale-example-datasets.s3.amazonaws.com/wiki_560w.json
    let json_file_path = Path::new("/home/mochix/workspace_github/tantivy_demo/wiki_560w.json");
    // Set to true to recreate the tantivy index.
    let recreate_index = true;

    let mut schema_builder = Schema::builder();
    schema_builder.add_u64_field("row_id", FAST);
    schema_builder.add_text_field("title", TEXT);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    let index = recreate_or_load_index(index_path, recreate_index, &schema)?;
    if recreate_index {
        let _ = index_from_json_file(&index, json_file_path, &schema);
    }

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let searcher = reader.searcher();

    let range_query_start = Instant::now();
    let range_query = RangeQuery::new_u64("row_id".to_string(), 80000..90000);
    let range_size = searcher.search(&range_query, &Count).expect("range query error");
    let range_query_duration = range_query_start.elapsed().as_millis();
    println!("range query answer size:{:?}, range_duration:{:?}ms", range_size, range_query_duration);

    let text_query_start = Instant::now();
    let query_parser = QueryParser::for_index(&index, vec![schema.get_field("body").unwrap()]);
    let text_query = query_parser.parse_query("volunteer")?;
    let text_ans_size = searcher.search(&text_query, &Count)?;
    let text_query_duration = text_query_start.elapsed().as_millis();
    println!("text query answer size:{:?}, range_duration:{:?}ms", text_ans_size, text_query_duration);

    let boolean_query_start = Instant::now();
    let boolean_query = BooleanQuery::new(vec![
        (Occur::Must, Box::new(range_query)),
        (Occur::Must, Box::new(text_query)),
    ]);
    let boolean_ans_size = searcher.search(&boolean_query, &Count)?;
    let boolean_query_duration = boolean_query_start.elapsed().as_millis();
    println!("boolean query answer size:{:?}, range_duration:{:?}ms", boolean_ans_size, boolean_query_duration);
    Ok(())
}
```
These numbers seem off. Are you running in release mode? What CPU are you on?
They run at the same time, but with INDEXED, a data structure is created beforehand for efficient intersection.
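To illustrate the difference, here is a toy model of my own (not tantivy's actual data structures): with only a columnar (FAST-like) layout, a range filter must scan every document's value, while an indexed (INDEXED-like) layout precomputes a value-to-docids mapping at index time, so a range lookup only touches postings for values inside the range.

```rust
use std::collections::BTreeMap;

/// Columnar (FAST-like) layout: docid -> value. A range filter scans all docs.
fn scan_range(column: &[u64], lo: u64, hi: u64) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        .filter(|(_, &v)| lo <= v && v < hi)
        .map(|(doc, _)| doc as u32)
        .collect()
}

/// Inverted (INDEXED-like) layout: value -> sorted docids, built once at index time.
fn build_inverted(column: &[u64]) -> BTreeMap<u64, Vec<u32>> {
    let mut inverted: BTreeMap<u64, Vec<u32>> = BTreeMap::new();
    for (doc, &v) in column.iter().enumerate() {
        inverted.entry(v).or_default().push(doc as u32);
    }
    inverted
}

/// A range lookup only visits postings for values inside [lo, hi).
fn index_range(inverted: &BTreeMap<u64, Vec<u32>>, lo: u64, hi: u64) -> Vec<u32> {
    let mut hits: Vec<u32> = inverted.range(lo..hi).flat_map(|(_, d)| d.clone()).collect();
    hits.sort_unstable();
    hits
}

fn main() {
    let column = vec![5, 12, 7, 19, 3, 15]; // docid -> row_id value
    let scan = scan_range(&column, 10, 20);
    let idx = index_range(&build_inverted(&column), 10, 20);
    assert_eq!(scan, idx);
    println!("docs in [10, 20): {:?}", scan); // [1, 3, 5]
}
```

Both paths return the same docids; the difference is how much data must be touched per query, which is why INDEXED helps the intersection case.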
Where can I find out whether this test case is running in release or debug mode?

```toml
[package]
name = "tantivy_demo"
version = "0.1.0"
edition = "2018"

[dependencies]
tantivy = "0.21.1"
tempfile = "3.2"
serde = { version = "1.0.127", features = ["derive"] }
serde_json = "1.0.66"

[[bin]]
name = "tantivy_demo"
path = "src/main.rs"
```

Here is my cpu info:

❯ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 6900HX with Radeon Graphics
CPU family: 25
Model: 68
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
Frequency boost: enabled
CPU max MHz: 4933.8862
CPU min MHz: 1600.0000
BogoMIPS: 6587.56
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmpe
rf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topo
ext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveo
pt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 4 MiB (8 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
You just need to add |
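Presumably this refers to cargo's release profile: by default, `cargo build` and `cargo run` produce an unoptimized debug binary, which easily accounts for a 10x slowdown in benchmarks. A minimal sketch of the commands involved:

```
# Build with optimizations enabled (output goes to target/release/).
cargo build --release
./target/release/tantivy_demo

# Or build and run in one step:
cargo run --release
```

A quick way to tell which mode a binary was built in is simply which directory it lives in: `target/debug/` vs `target/release/`.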
Here is the test result in release mode:

```
❯ ./target/release/tantivy_demo
range query answer size:10000, range_duration:5ms
text query answer size:2574, range_duration:0ms
boolean query answer size:5, range_duration:5ms
```
What about with |
@PSeitz Here is the result:

```
❯ ./target/release/tantivy_demo
range query answer size:10000, range_duration:29ms
text query answer size:2574, range_duration:0ms
boolean query answer size:5, range_duration:26ms
```

Based on these release-mode results, could you offer some suggestions? The integration of Tantivy into ClickHouse imposes stringent requirements on search latency, which is quite critical in our process.
I've realized that by using a custom `Collector`, the latency can be greatly reduced. The schema:

```rust
schema_builder.add_u64_field("row_id", FAST | INDEXED);
schema_builder.add_text_field("title", TEXT);
schema_builder.add_text_field("body", TEXT);
```

And the code for the custom `Collector`:

```rust
use tantivy::columnar::Column;
use tantivy::collector::{Collector, SegmentCollector};
use tantivy::{Score, SegmentReader};

pub struct RowIdCollector {
    pub row_id_field: String,
}

impl RowIdCollector {
    pub fn with_field(row_id_field: String) -> RowIdCollector {
        RowIdCollector { row_id_field }
    }
}

impl Collector for RowIdCollector {
    // The collector's result type: the row_ids of all matching documents.
    type Fruit = Vec<u64>;
    type Child = RowIdSegmentCollector;

    fn for_segment(
        &self,
        _segment_local_id: u32,
        segment_reader: &SegmentReader,
    ) -> tantivy::Result<Self::Child> {
        let row_id_reader = segment_reader.fast_fields().u64(&self.row_id_field)?;
        Ok(RowIdSegmentCollector {
            row_id_reader,
            row_ids: Vec::new(),
        })
    }

    fn requires_scoring(&self) -> bool {
        // This collector does not care about scores.
        false
    }

    fn merge_fruits(&self, segment_row_ids: Vec<Self::Fruit>) -> tantivy::Result<Self::Fruit> {
        Ok(segment_row_ids.into_iter().flatten().collect())
    }
}

pub struct RowIdSegmentCollector {
    row_id_reader: Column,
    row_ids: Vec<u64>,
}

impl SegmentCollector for RowIdSegmentCollector {
    type Fruit = Vec<u64>;

    fn collect(&mut self, doc: u32, _score: Score) {
        // Since the values are single-valued, we could also call
        // `first_or_default_col` on the column and fetch single values.
        for value in self.row_id_reader.values_for_doc(doc) {
            self.row_ids.push(value);
        }
    }

    fn harvest(self) -> <Self as SegmentCollector>::Fruit {
        self.row_ids
    }
}
```

And here is the test case result in release mode:

```
❯ ./target/release/tantivy_demo
range query answer size:10000, range_query_duration: 24.703 ms
text query answer size:2574, text_query_duration: 0.118 ms
boolean query answer size:5, boolean_query_duration: 11.758 ms
row_id answer size:2574, row_id_collector_query_duration: 0.777 ms
```

I am interested in knowing how to modify this custom `Collector` further. Based on yesterday's experience, using
That's not how it works: it's one lookup in the inverted index, which then returns the docs.
Range Query with ... The issue here is that ... E.g. if DocSet1 (the Term DocSet) has docids ...

DocIds for TermQuery: hit:2693 size:2574
What are the requirements?
The code for range queries on the fast fields ... The biggest gain on this type of query would probably come from sorting the index by row_id and then exploiting that sorting in the range query.
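A rough sketch of why sorting helps (my own illustration, not tantivy code): if docids are assigned in row_id order, then a row_id range corresponds to one contiguous docid interval, which two binary searches can locate without scanning every value.

```rust
/// Assumes `sorted_row_ids[doc]` is non-decreasing in `doc`
/// (i.e. the index was sorted by row_id at build time).
/// Returns the half-open docid range whose row_ids fall in [lo, hi).
fn docid_range_for(sorted_row_ids: &[u64], lo: u64, hi: u64) -> std::ops::Range<usize> {
    // First doc with row_id >= lo.
    let start = sorted_row_ids.partition_point(|&v| v < lo);
    // First doc with row_id >= hi.
    let end = sorted_row_ids.partition_point(|&v| v < hi);
    start..end
}

fn main() {
    let row_ids = vec![3, 5, 7, 12, 15, 19, 23]; // docid -> row_id, sorted
    let docs = docid_range_for(&row_ids, 10, 20);
    println!("matching docids: {:?}", docs); // 3..6, i.e. docs 3, 4, 5
}
```

With this, a range filter becomes two O(log n) lookups plus a contiguous docid interval, which is also a much friendlier shape for intersecting with a term DocSet.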
Thank you for your patient explanation; now I understand the scanning method of the FAST index. The tantivy index schema structure looks like:

```rust
let mut schema_builder = Schema::builder();
schema_builder.add_u64_field("row_id", FAST | INDEXED);
schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
```

This FFI search function takes three important parameters. It looks like:

```rust
pub struct IndexR {
    pub path: String,
    pub index: Index,
    pub reader: IndexReader,
}

#[no_mangle]
pub extern "C" fn tantivy_search_in_range(ir: *mut IndexR, query_ptr: *const c_char, lrange: u64, rrange: u64) -> bool {
    // ...
}
```

The function returns a boolean value. 🎯 Everything became clear after that.

1️⃣ ➡️ Initially, I tried the following, more traditional, logic to determine whether any matching document's row_id falls within the range:

```rust
let docs_results = searcher.search(&query, &Docs::with_limit(limit as usize)).expect("failed to search");
for (_score, doc_address) in docs_results {
    let row_id_column = &row_id_columns[doc_address.segment_ord as usize];
    let row_id_value = row_id_column.values_for_doc(doc_address.doc_id).next().unwrap_or(0);
    if lrange <= row_id_value && row_id_value <= rrange {
        return true;
    }
}
false
```

2️⃣ ➡️ Later, I planned to optimize this FFI interface. The goal of the optimization is to reduce latency, so I thought of combining a range query with the text query:

```rust
let range_query = RangeQuery::new_u64("row_id".to_string(), lrange..rrange);
let query_parser = QueryParser::for_index(&index, vec![schema.get_field("text").unwrap()]);
let text_query = query_parser.parse_query("volunteer")?;
let boolean_query = BooleanQuery::new(vec![
    (Occur::Must, Box::new(range_query)),
    (Occur::Must, Box::new(text_query)),
]);
let boolean_ans_size = searcher.search(&boolean_query, &Count)?;
if boolean_ans_size != 0 {
    return true;
}
false
```

However, the latency when using a range query was still relatively high, as shown in the previous results I posted. Executing a text search alone takes far less time.

3️⃣ ➡️ Therefore, I started to consider whether it would be possible to do this work in a custom `Collector`:

```rust
let row_id_hit_size = searcher.search(&text_query, &RowIdCollector::with_field("row_id".to_string(), lrange, rrange))?;
if row_id_hit_size.unwrap().get_hit_count() != 0 {
    return true;
}
false
```

The custom `Collector` looks like:

```rust
use tantivy::columnar::Column;
use tantivy::collector::{Collector, SegmentCollector};
use tantivy::{Score, SegmentReader};

#[derive(Default)]
pub struct RowIdRangeHit {
    hit: bool,
    hit_count: u64,
}

impl RowIdRangeHit {
    pub fn new() -> RowIdRangeHit {
        RowIdRangeHit {
            hit: false,
            hit_count: 0,
        }
    }

    pub fn get_hit_count(&self) -> u64 {
        self.hit_count
    }

    pub fn mark_hit(&mut self) {
        self.hit_count += 1;
        self.hit = true;
    }

    pub fn is_hit(&self) -> bool {
        self.hit
    }
}

pub struct RowIdCollector {
    pub row_id_field: String,
    pub lrange: u64,
    pub rrange: u64,
}

impl RowIdCollector {
    pub fn with_field(row_id_field: String, lrange: u64, rrange: u64) -> RowIdCollector {
        RowIdCollector { row_id_field, lrange, rrange }
    }
}

impl Collector for RowIdCollector {
    type Fruit = Option<RowIdRangeHit>;
    type Child = RowIdSegmentCollector;

    fn for_segment(
        &self,
        _segment_local_id: u32,
        segment_reader: &SegmentReader,
    ) -> tantivy::Result<Self::Child> {
        let row_id_reader = segment_reader.fast_fields().u64(&self.row_id_field)?;
        Ok(RowIdSegmentCollector {
            row_id_reader,
            row_id_range_hit: RowIdRangeHit::new(),
            lrange: self.lrange,
            rrange: self.rrange,
        })
    }

    fn requires_scoring(&self) -> bool {
        // This collector does not care about scores.
        false
    }

    fn merge_fruits(&self, segment_row_ids: Vec<Self::Fruit>) -> tantivy::Result<Self::Fruit> {
        let mut row_id_range_hit = RowIdRangeHit::new();
        for segment_row_id_range_hit in segment_row_ids.into_iter().flatten() {
            if segment_row_id_range_hit.is_hit() {
                row_id_range_hit.hit = true;
                row_id_range_hit.hit_count += segment_row_id_range_hit.get_hit_count();
            }
        }
        Ok(Some(row_id_range_hit))
    }
}

pub struct RowIdSegmentCollector {
    row_id_reader: Column,
    row_id_range_hit: RowIdRangeHit,
    pub lrange: u64,
    pub rrange: u64,
}

impl SegmentCollector for RowIdSegmentCollector {
    type Fruit = Option<RowIdRangeHit>;

    fn collect(&mut self, doc: u32, _score: Score) {
        for row_id in self.row_id_reader.values_for_doc(doc) {
            if row_id >= self.lrange && row_id <= self.rrange {
                self.row_id_range_hit.mark_hit();
            }
        }
    }

    fn harvest(self) -> <Self as SegmentCollector>::Fruit {
        Some(self.row_id_range_hit)
    }
}
```

This approach is currently the most effective method I have found for minimizing latency. I tested this custom `Collector` in the test code mentioned earlier (using the row_id field with a range):

```
range query answer size:10000, range_query_duration: 15.182 ms
text query answer size:2574, text_query_duration: 0.126 ms
boolean query answer size:5, boolean_query_duration: 12.160 ms
row_id hit count size:5, row_id_collector_query_duration: 0.717 ms
```

I'm wondering about changing the indexing method of row_id. Here is another test result in release mode:

```
❯ ./target/release/tantivy_demo
range query answer size:10000, range_query_duration: 53.900 ms
text query answer size:1796537, text_query_duration: 3.695 ms
boolean query answer size:3109, boolean_query_duration: 11.944 ms
row_id hit count size:3109, row_id_collector_query_duration: 11.278 ms
```

So, if I change the indexing method of ...

Thank you for your patient reading. 💗
I meant the "stringent requirements regarding search latency"
Similar to the range_query, it's slower for large ranges.

**DocSets and Intersection**

The problem can be simplified to two DocSets, each of which hits a list of ...
The intersection would result in two ... On its own, DocSet1 is bad for the ... Currently tantivy always uses the columnar storage with ... Your collector manually implements the intersection. It alleviates the problem that ...

**Early Exit**

Your use case is also special in that you would want to early-exit the query after the first hit. There is no such mechanism currently, but it could be added by changing how the query is driven with some parameter. There are several methods, but one of them is: https://github.com/quickwit-oss/tantivy/blob/main/src/query/weight.rs#L26
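The intersection described above can be sketched as a merge of two sorted docid lists; since this use case only needs a boolean answer, the merge can also early-exit on the first common docid. This is my own simplification, not tantivy's actual DocSet code:

```rust
use std::cmp::Ordering;

/// Returns true as soon as the two sorted docid lists share one docid.
/// Early-exits instead of materializing the full intersection.
fn intersects(a: &[u32], b: &[u32]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,       // advance the side with the smaller docid
            Ordering::Greater => j += 1,
            Ordering::Equal => return true, // first common docid: stop immediately
        }
    }
    false
}

fn main() {
    let term_docs = vec![2, 7, 9, 42];      // e.g. docids matching the text query
    let range_docs = vec![1, 3, 9, 10, 11]; // e.g. docids inside the row_id range
    println!("{}", intersects(&term_docs, &range_docs)); // true (both contain docid 9)
}
```

The cost is driven by how far the cursors must advance before the first match, which is why a small, selective DocSet on one side makes the whole query cheap.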
Recently, I have been integrating tantivy into ClickHouse. Here is the schema in tantivy:
And this is my FFI function in Rust; it consumes 0.99s.
If only text search is applied, it consumes only 0.2s.
In fact, I want to filter the row_id using lrange and rrange first, and then do a text search based on the filtered results. I thought this would speed up the query, but now it looks like range queries are too slow.
The data set I used was the wiki abstract (5 million documents), with each text containing about 200 words.
Could you please give me some advice on how to get a more efficient query? 😨