Bug in quickwit search stream StorageDirectory only supports async reads #1366

Closed
fulmicoton opened this issue May 5, 2022 · 32 comments · Fixed by #1384
Labels
bug Something isn't working

Comments

@fulmicoton

fulmicoton commented May 5, 2022

Copy pasted from #1357 (reply in thread)
I am able to ingest data into Quickwit and search it. However, when I search using the curl command, I get an async-read error.
What could be going wrong here?
heena@Clickhouse1:~/quickwit-v0.2.1$ ./quickwit index search --index hackernews_5 --query Ambulance
2022-05-04T13:25:27.169Z ERROR quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:25:27.171Z ERROR quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
{
  "numHits": 1,
  "hits": [
    {
      "by": [ "sgk284" ],
      "id": [ 2923885 ],
      "kids": [ 2923989, 2925247, 2924320, 2925442, 2924224, 2923994, 2924209, 2924702, 2925235, 2925010, 2924319, 2924638, 2925781, 2923943, 2924298 ],
      "score": [ 622 ],
      "text": [ "" ],
      "time": [ 1314251037 ],
      "title": [ "Icon Ambulance" ],
      "type": [ "story" ],
      "url": [ "https://plus.google.com/107117483540235115863/posts/gcSStkKxXTw" ]
    }
  ],
  "elapsedTimeMicros": 77324,
  "errors": [
    "SplitSearchError { error: \"Internal error: An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: \"ccf34dbac4614904b1124b751756dab8.term\"'.\", split_id: \"01G26NHMCV1BAP61AS006H7A75\", retryable_error: true }",
    "SplitSearchError { error: \"Internal error: An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: \"6dc68fd1122c44a985ccf5348907c5f8.term\"'.\", split_id: \"01G26NK8YX0DM4YSVH6J9YD1GN\", retryable_error: true }"
  ]
}
The output with curl command to search the same keyword.
heena@Clickhouse1:~/quickwit-v0.2.1$ curl "http://0.0.0.0:7280/api/v1/hackernews_5/search/stream?query=Ambulance&outputFormat=csv&fastField=id"
curl: (18) transfer closed with outstanding read data remaining
heena@Clickhouse1:~/quickwit-v0.2.1$

Attached are the console logs from when I ran the commands. They might be helpful.

2022-05-04T13:24:03.927Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T13:24:13.927Z  INFO quickwit_serve::rest: search_stream index_id=hackernews_5 request=SearchStreamRequestQueryString { query: "google", search_fields: None, start_timestamp: None, end_timestamp: None, fast_field: "id", output_format: ClickHouseRowBinary, partition_by_field: None }
2022-05-04T13:24:13.927Z  INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T13:24:13.968Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:13.969Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:13.970Z  INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T13:24:13.972Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T13:24:14.006Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:14.006Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:14.007Z ERROR quickwit_serve::rest: Error when streaming search results. error=Internal error: `Internal error: `An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: "ccf34dbac4614904b1124b751756dab8.term"'`.`.
2022-05-04T13:24:14.009Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T13:24:49.399Z  INFO quickwit_serve::rest: search_stream index_id=hackernews_5 request=SearchStreamRequestQueryString { query: "google.com", search_fields: None, start_timestamp: None, end_timestamp: None, fast_field: "id", output_format: Csv, partition_by_field: None }
2022-05-04T13:24:49.400Z  INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T13:24:49.442Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:49.442Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:49.443Z  INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T13:24:49.452Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T13:24:49.494Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:49.495Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:24:49.496Z ERROR quickwit_serve::rest: Error when streaming search results. error=Internal error: `Internal error: `An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: "ccf34dbac4614904b1124b751756dab8.term"'`.`.
2022-05-04T13:24:49.503Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T13:26:29.659Z  INFO quickwit_serve::rest: search_stream index_id=hackernews_5 request=SearchStreamRequestQueryString { query: "Ambulance", search_fields: None, start_timestamp: None, end_timestamp: None, fast_field: "id", output_format: Csv, partition_by_field: None }
2022-05-04T13:26:29.661Z  INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T13:26:29.705Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:26:29.706Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:26:29.707Z  INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T13:26:29.713Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T13:26:29.756Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:26:29.757Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T13:26:29.757Z ERROR quickwit_serve::rest: Error when streaming search results. error=Internal error: `Internal error: `An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: "ccf34dbac4614904b1124b751756dab8.term"'`.`.
2022-05-04T13:26:29.761Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
fulmicoton added the bug label May 5, 2022
@fulmicoton

@HeenaBansal2009 Can you add the first line of stdout? It contains the SHA-1 of the version you used. Is it main or the released 0.2.1?

Also, did you use S3 or local file storage?

@HeenaBansal2009

@fulmicoton, I used local file storage. I am using 0.2.1.

heena@Clickhouse1:~/quickwit-v0.2.1$ ./quickwit --version
Quickwit 0.2.1 (commit-hash: a857636)
heena@Clickhouse1:~/quickwit-v0.2.1$

Pasting the logs here.

heena@Clickhouse1:~/quickwit-v0.2.1$ ./quickwit service run searcher
2022-05-04T20:07:19.736Z INFO quickwit: version="0.2.1" commit="a857636"
2022-05-04T20:07:19.747Z WARN quickwit_config::config: Seed list is empty.
2022-05-04T20:07:19.747Z INFO quickwit_cli: Loaded Quickwit config. config_uri=file:///home/heena/quickwit-v0.2.1/config/quickwit.yaml config=QuickwitConfig { version: 0, node_id: "node-patient-lLKH", listen_address: "127.0.0.1", rest_listen_port: 7280, peer_seeds: [], data_dir_path: "./qwdata", metastore_uri: "file:///home/heena/quickwit-v0.2.1/qwdata/indexes", default_index_root_uri: "file:///home/heena/quickwit-v0.2.1/qwdata/indexes", indexer_config: IndexerConfig { split_store_max_num_bytes: Byte(100000000000), split_store_max_num_splits: 1000 }, searcher_config: SearcherConfig { fast_field_cache_capacity: Byte(1000000000), split_footer_cache_capacity: Byte(500000000), max_num_concurrent_split_streams: 100 }, storage_config: None }


Connectivity checklist
✔ metastore

2022-05-04T20:07:19.755Z INFO quickwit_cluster::cluster: Create new cluster. node_id="node-patient-lLKH" listen_addr=127.0.0.1:7280
2022-05-04T20:07:19.761Z INFO quickwit_serve: Searcher ready to accept requests at http://127.0.0.1:7280/
2022-05-04T20:07:19.762Z INFO quickwit_serve::rest: Starting REST service. rest_addr=127.0.0.1:7280
2022-05-04T20:07:19.763Z INFO quickwit_serve::grpc: Start gRPC service. grpc_addr=127.0.0.1:7281
2022-05-04T20:07:24.970Z INFO quickwit_serve::rest: search_stream index_id=hackernews_5 request=SearchStreamRequestQueryString { query: "Ambulance", search_fields: None, start_timestamp: None, end_timestamp: None, fast_field: "id", output_format: Csv, partition_by_field: None }
2022-05-04T20:07:24.984Z INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T20:07:24.991Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:07:24.997Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:07:25.005Z INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T20:07:25.006Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:07:25.006Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:07:25.007Z ERROR quickwit_serve::rest: Error when streaming search results. error=Internal error: Internal error: An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: "6dc68fd1122c44a985ccf5348907c5f8.term"'..
2022-05-04T20:07:25.107Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T20:07:25.117Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T20:08:20.383Z INFO quickwit_serve::rest: search_stream index_id=hackernews_5 request=SearchStreamRequestQueryString { query: "Ambulance", search_fields: None, start_timestamp: None, end_timestamp: None, fast_field: "id", output_format: Csv, partition_by_field: None }
2022-05-04T20:08:20.384Z INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T20:08:20.433Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:08:20.433Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:08:20.434Z INFO search_adapter:leaf_search_stream: quickwit_search::service: leaf_search index="hackernews_5" splits=[SplitIdAndFooterOffsets { split_id: "01G26NHEB10T2DX37288EKX0SJ", split_footer_start: 270323695, split_footer_end: 278910648 }, SplitIdAndFooterOffsets { split_id: "01G26NHMCV1BAP61AS006H7A75", split_footer_start: 2678183120, split_footer_end: 2678792526 }, SplitIdAndFooterOffsets { split_id: "01G26NK8YX0DM4YSVH6J9YD1GN", split_footer_start: 349970236, split_footer_end: 350048435 }]
2022-05-04T20:08:20.448Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
2022-05-04T20:08:20.476Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:08:20.476Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:08:20.477Z ERROR quickwit_serve::rest: Error when streaming search results. error=Internal error: Internal error: An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: "ccf34dbac4614904b1124b751756dab8.term"'..
2022-05-04T20:08:20.481Z ERROR search_adapter:leaf_search_stream:leaf_search_stream: quickwit_search::search_stream::leaf: Failed to send leaf search stream result. Stop sending. Cause: channel closed
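
For anyone triaging a similar log, the failing splits and files in the output above can be pulled out mechanically. The following is a minimal sketch: the regular expression is written only against the error lines pasted in this thread, and `failing_files_by_split` is a hypothetical helper name, not part of Quickwit.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Pattern derived from the warmup error lines in this thread (an assumption:
# other Quickwit versions may format these lines differently).
ERROR_RE = re.compile(
    r'leaf_search_stream_single_split\{split_id=(?P<split>[0-9A-Z]+)\}:warmup: '
    r'quickwit_directories::storage_directory: path="(?P<path>[^"]+)" '
    r'msg="Unsupported operation\. StorageDirectory only supports async reads"'
)


def failing_files_by_split(log_text: str) -> Dict[str, Set[str]]:
    """Map each split id to the set of file paths that hit the async-read error."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            result[match.group("split")].add(match.group("path"))
    return dict(result)


# Two lines copied from the searcher log above.
SAMPLE = '''
2022-05-04T20:07:24.991Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NK8YX0DM4YSVH6J9YD1GN}:warmup: quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
2022-05-04T20:07:24.997Z ERROR search_adapter:leaf_search_stream:leaf_search_stream:leaf_search_stream_single_split{split_id=01G26NHMCV1BAP61AS006H7A75}:warmup: quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads"
'''

if __name__ == "__main__":
    for split_id, paths in sorted(failing_files_by_split(SAMPLE).items()):
        print(split_id, sorted(paths))
```

Run against the full log, this makes it easy to see that the same two splits fail every time, each on its `.term` file, while the third split listed in `leaf_search` never appears in an error line.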

@fulmicoton

Thank you, this is super helpful!

@fulmicoton

@HeenaBansal2009 this is quite mysterious. Can you provide even more logs, in particular what happened before the error?
Don't worry about being verbose. You can attach a file if necessary.

Setting RUST_LOG=debug might help too.

@fulmicoton

Flushing the status of my investigation.

I have not been able to reproduce this yet. The logs suggest that everything is working more or less correctly, except that the warmup phase did not work properly for two splits.
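
To make the symptom easier to reason about, here is a toy model of how an incomplete warmup could surface as an "only supports async reads" error. It is a sketch under assumptions, not Quickwit's code: the classes `AsyncOnlyStorage` and `WarmedDirectory` and the file names are hypothetical, and the model simply assumes that synchronous reads are answered from a cache that an earlier asynchronous warmup is expected to fill.

```python
import asyncio


class AsyncOnlyStorage:
    """Hypothetical storage backend that can only be read asynchronously."""

    def __init__(self, files):
        self._files = files

    async def get_all(self, path):
        await asyncio.sleep(0)  # stand-in for real asynchronous I/O
        return self._files[path]


class WarmedDirectory:
    """Answers synchronous reads only from a cache filled by an async warmup."""

    def __init__(self, storage):
        self._storage = storage
        self._cache = {}

    async def warmup(self, paths):
        # Fetch every file the query is expected to touch, ahead of time.
        for path in paths:
            self._cache[path] = await self._storage.get_all(path)

    def read_sync(self, path):
        # A file that warmup skipped cannot be served synchronously.
        if path not in self._cache:
            raise OSError(f"Unsupported operation. Only async reads are supported: {path!r}")
        return self._cache[path]


async def demo():
    storage = AsyncOnlyStorage({"a.term": b"terms", "a.fast": b"fast"})
    directory = WarmedDirectory(storage)
    await directory.warmup(["a.fast"])  # "a.term" is deliberately left out
    warmed = directory.read_sync("a.fast")
    try:
        directory.read_sync("a.term")
        missed = False
    except OSError:
        missed = True
    return warmed, missed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model the error is not about the storage backend itself. It only says that a synchronous read reached a file the warmup did not cover, which matches the `:warmup:` span in the error lines.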

@fulmicoton

@HeenaBansal2009

Also, can you share the contents of the meta.json that you obtain after running:

./quickwit split extract --index hackernews_5 --split 01G26NK8YX0DM4YSVH6J9YD1GN --target-dir .

as well as the current index configuration.

If you don't have your config file, you can send the qwdata/indexes/hackernews_5/metastore.json file.

My current suspicion is a bug triggered by a change of schema.
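
One way to check the schema-change suspicion once both files are available is to compare the field list recorded in the split's meta.json with the doc mapping in metastore.json. This is a minimal sketch: `schema_diff` is a hypothetical helper, the key names (`schema`, `index.doc_mapping.field_mappings`, `name`, `type`) are taken from the files pasted in this thread, and the sample data at the bottom is invented for illustration. Type names are not guaranteed to use the same vocabulary in both files, so a type difference is a hint to look closer, not proof of a schema change.

```python
import json


def schema_diff(split_meta: dict, metastore: dict) -> dict:
    """Compare field names and declared types between a split and the index doc mapping."""
    split_fields = {f["name"]: f["type"] for f in split_meta["schema"]}
    mapping_fields = {
        f["name"]: f["type"]
        for f in metastore["index"]["doc_mapping"]["field_mappings"]
    }
    common = set(split_fields) & set(mapping_fields)
    return {
        "only_in_split": sorted(set(split_fields) - set(mapping_fields)),
        "only_in_mapping": sorted(set(mapping_fields) - set(split_fields)),
        "type_differs": sorted(
            name for name in common if split_fields[name] != mapping_fields[name]
        ),
    }


# Invented sample data with the same shape as the real files.
SAMPLE_SPLIT_META = {
    "schema": [
        {"name": "id", "type": "u64"},
        {"name": "title", "type": "text"},
        {"name": "old_field", "type": "text"},
    ]
}
SAMPLE_METASTORE = {
    "index": {
        "doc_mapping": {
            "field_mappings": [
                {"name": "id", "type": "i64"},
                {"name": "title", "type": "text"},
            ]
        }
    }
}

if __name__ == "__main__":
    # With real files, load them first, for example:
    #   split_meta = json.load(open("meta.json"))
    #   metastore = json.load(open("metastore.json"))
    print(json.dumps(schema_diff(SAMPLE_SPLIT_META, SAMPLE_METASTORE), indent=2))
```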

@HeenaBansal2009

meta.json

{
  "index_settings": { "docstore_compression": "lz4" },
  "segments": [
    { "segment_id": "6dc68fd1-122c-44a9-85cc-f5348907c5f8", "max_doc": 421980, "deletes": null }
  ],
  "schema": [
    { "name": "id", "type": "u64", "options": { "indexed": true, "fieldnorms": false, "fast": "single", "stored": true } },
    { "name": "type", "type": "text", "options": { "indexing": { "record": "basic", "fieldnorms": true, "tokenizer": "raw" }, "stored": true } },
    { "name": "by", "type": "text", "options": { "indexing": { "record": "position", "fieldnorms": true, "tokenizer": "default" }, "stored": true } },
    { "name": "time", "type": "i64", "options": { "indexed": true, "fieldnorms": false, "fast": "single", "stored": true } },
    { "name": "text", "type": "text", "options": { "indexing": { "record": "position", "fieldnorms": true, "tokenizer": "raw" }, "stored": true } },
    { "name": "kids", "type": "i64", "options": { "indexed": true, "fieldnorms": false, "fast": "multi", "stored": true } },
    { "name": "url", "type": "text", "options": { "indexing": { "record": "position", "fieldnorms": true, "tokenizer": "default" }, "stored": true } },
    { "name": "score", "type": "i64", "options": { "indexed": true, "fieldnorms": false, "fast": "single", "stored": true } },
    { "name": "title", "type": "text", "options": { "indexing": { "record": "position", "fieldnorms": true, "tokenizer": "default" }, "stored": true } }
  ],
  "opstamp": 421981
}
metastore.json
`
{
"version": "0",
"index": {
"version": "1",
"index_id": "hackernews_5",
"index_uri": "file:///home/heena/quickwit-v0.2.1/qwdata/indexes/hackernews_5",
"checkpoint": {
".cli-ingest-source": {}
},
"doc_mapping": {
"field_mappings": [
{
"name": "id",
"type": "u64",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "type",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "raw",
"record": "basic"
},
{
"name": "by",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "default",
"record": "position"
},
{
"name": "time",
"type": "i64",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "text",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "raw",
"record": "position"
},
{
"name": "kids",
"type": "array",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "url",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "default",
"record": "position"
},
{
"name": "score",
"type": "i64",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "title",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "default",
"record": "position"
}
],
"tag_fields": [],
"store_source": false
},
"indexing_settings": {
"timestamp_field": "time",
"commit_timeout_secs": 60,
"split_num_docs_target": 10000000,
"merge_enabled": true,
"merge_policy": {
"demux_factor": 8,
"merge_factor": 10,
"max_merge_factor": 12
},
"resources": {
"num_threads": 1,
"heap_size": 2000000000
}
},
"search_settings": {
"default_search_fields": [
"title",
"by",
"text",
"url"
]
},
"create_timestamp": 1651637961,
"update_timestamp": 1651638973
},
"splits": [
{
"split_state": "Published",
"update_timestamp": 1651638912,
"version": "1",
"split_id": "01G26NHEB10T2DX37288EKX0SJ",
"num_docs": 327347,
"size_in_bytes": 432879779,
"time_range": {
"start": 1314110908,
"end": 1593532186
},
"create_timestamp": 1651638908,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 270323695,
"end": 278910648
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N674ASKNNVG8DQKDYKRJM",
"num_docs": 387410,
"size_in_bytes": 502431085,
"time_range": {
"start": 1253159097,
"end": 1584476765
},
"create_timestamp": 1651638541,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 312636135,
"end": 322041208
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26MQHW2BPQGPJJK3V44TJVJ",
"num_docs": 568590,
"size_in_bytes": 719620809,
"time_range": {
"start": 1160418111,
"end": 1510673702
},
"create_timestamp": 1651638064,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 438230039,
"end": 438328080
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26MYWBR1YCXASA89KZ8F8GW",
"num_docs": 411217,
"size_in_bytes": 519948004,
"time_range": {
"start": 1160418111,
"end": 1509492103
},
"create_timestamp": 1651638300,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 317238052,
"end": 326780516
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N81RVV4BNM0C2MWER0Z78",
"num_docs": 478959,
"size_in_bytes": 625017918,
"time_range": {
"start": 1292822357,
"end": 1587103597
},
"create_timestamp": 1651638603,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 388813589,
"end": 388900337
}
},
{
"split_state": "Published",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26NHMCV1BAP61AS006H7A75",
"num_docs": 3678305,
"size_in_bytes": 4736677883,
"time_range": {
"start": 1160418111,
"end": 1591921222
},
"create_timestamp": 1651638906,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 2678183120,
"end": 2678792526
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N0Q0X5PDV52YA6EKKBGFF",
"num_docs": 271265,
"size_in_bytes": 344296679,
"time_range": {
"start": 1216780920,
"end": 1511405528
},
"create_timestamp": 1651638358,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 212666409,
"end": 218978775
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N2HPB28PTYA8TDTKRP8D6",
"num_docs": 279867,
"size_in_bytes": 361323563,
"time_range": {
"start": 1230934379,
"end": 1513359201
},
"create_timestamp": 1651638419,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 225477171,
"end": 232727176
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N4CBC75E4TY1HNZMS7J9C",
"num_docs": 396428,
"size_in_bytes": 511474431,
"time_range": {
"start": 1241269791,
"end": 1516223095
},
"create_timestamp": 1651638481,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 321116808,
"end": 331051249
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N9WDRQMEM9FQ68ZH04Y4G",
"num_docs": 307345,
"size_in_bytes": 400861275,
"time_range": {
"start": 1299809877,
"end": 1588657284
},
"create_timestamp": 1651638660,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 250380683,
"end": 257950016
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26NDS1DD9ZRN5GQB1F8HKPK",
"num_docs": 331782,
"size_in_bytes": 429910307,
"time_range": {
"start": 1306025856,
"end": 1590737498
},
"create_timestamp": 1651638788,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 270652419,
"end": 278863516
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26NFKNBXHCCJXPZSTNHBBY8",
"num_docs": 245442,
"size_in_bytes": 321793812,
"time_range": {
"start": 1310760863,
"end": 1591921222
},
"create_timestamp": 1651638846,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 200656781,
"end": 206628263
}
},
{
"split_state": "Published",
"update_timestamp": 1651638973,
"version": "1",
"split_id": "01G26NK8YX0DM4YSVH6J9YD1GN",
"num_docs": 421980,
"size_in_bytes": 559498837,
"time_range": {
"start": 1318868981,
"end": 1595630349
},
"create_timestamp": 1651638968,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 349970236,
"end": 350048435
}
}
]
}

{
"version": "0",
"index": {
"version": "1",
"index_id": "hackernews_5",
"index_uri": "file:///home/heena/quickwit-v0.2.1/qwdata/indexes/hackernews_5",
"checkpoint": {
".cli-ingest-source": {}
},
"doc_mapping": {
"field_mappings": [
{
"name": "id",
"type": "u64",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "type",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "raw",
"record": "basic"
},
{
"name": "by",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "default",
"record": "position"
},
{
"name": "time",
"type": "i64",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "text",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "raw",
"record": "position"
},
{
"name": "kids",
"type": "array",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "url",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "default",
"record": "position"
},
{
"name": "score",
"type": "i64",
"stored": true,
"fast": true,
"indexed": true
},
{
"name": "title",
"type": "text",
"stored": true,
"fast": false,
"tokenizer": "default",
"record": "position"
}
],
"tag_fields": [],
"store_source": false
},
"indexing_settings": {
"timestamp_field": "time",
"commit_timeout_secs": 60,
"split_num_docs_target": 10000000,
"merge_enabled": true,
"merge_policy": {
"demux_factor": 8,
"merge_factor": 10,
"max_merge_factor": 12
},
"resources": {
"num_threads": 1,
"heap_size": 2000000000
}
},
"search_settings": {
"default_search_fields": [
"title",
"by",
"text",
"url"
]
},
"create_timestamp": 1651637961,
"update_timestamp": 1651638973
},
"splits": [
{
"split_state": "Published",
"update_timestamp": 1651638912,
"version": "1",
"split_id": "01G26NHEB10T2DX37288EKX0SJ",
"num_docs": 327347,
"size_in_bytes": 432879779,
"time_range": {
"start": 1314110908,
"end": 1593532186
},
"create_timestamp": 1651638908,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 270323695,
"end": 278910648
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N674ASKNNVG8DQKDYKRJM",
"num_docs": 387410,
"size_in_bytes": 502431085,
"time_range": {
"start": 1253159097,
"end": 1584476765
},
"create_timestamp": 1651638541,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 312636135,
"end": 322041208
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26MQHW2BPQGPJJK3V44TJVJ",
"num_docs": 568590,
"size_in_bytes": 719620809,
"time_range": {
"start": 1160418111,
"end": 1510673702
},
"create_timestamp": 1651638064,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 438230039,
"end": 438328080
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26MYWBR1YCXASA89KZ8F8GW",
"num_docs": 411217,
"size_in_bytes": 519948004,
"time_range": {
"start": 1160418111,
"end": 1509492103
},
"create_timestamp": 1651638300,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 317238052,
"end": 326780516
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N81RVV4BNM0C2MWER0Z78",
"num_docs": 478959,
"size_in_bytes": 625017918,
"time_range": {
"start": 1292822357,
"end": 1587103597
},
"create_timestamp": 1651638603,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 388813589,
"end": 388900337
}
},
{
"split_state": "Published",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26NHMCV1BAP61AS006H7A75",
"num_docs": 3678305,
"size_in_bytes": 4736677883,
"time_range": {
"start": 1160418111,
"end": 1591921222
},
"create_timestamp": 1651638906,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 2678183120,
"end": 2678792526
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N0Q0X5PDV52YA6EKKBGFF",
"num_docs": 271265,
"size_in_bytes": 344296679,
"time_range": {
"start": 1216780920,
"end": 1511405528
},
"create_timestamp": 1651638358,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 212666409,
"end": 218978775
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N2HPB28PTYA8TDTKRP8D6",
"num_docs": 279867,
"size_in_bytes": 361323563,
"time_range": {
"start": 1230934379,
"end": 1513359201
},
"create_timestamp": 1651638419,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 225477171,
"end": 232727176
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N4CBC75E4TY1HNZMS7J9C",
"num_docs": 396428,
"size_in_bytes": 511474431,
"time_range": {
"start": 1241269791,
"end": 1516223095
},
"create_timestamp": 1651638481,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 321116808,
"end": 331051249
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26N9WDRQMEM9FQ68ZH04Y4G",
"num_docs": 307345,
"size_in_bytes": 400861275,
"time_range": {
"start": 1299809877,
"end": 1588657284
},
"create_timestamp": 1651638660,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 250380683,
"end": 257950016
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26NDS1DD9ZRN5GQB1F8HKPK",
"num_docs": 331782,
"size_in_bytes": 429910307,
"time_range": {
"start": 1306025856,
"end": 1590737498
},
"create_timestamp": 1651638788,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 270652419,
"end": 278863516
}
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1651638945,
"version": "1",
"split_id": "01G26NFKNBXHCCJXPZSTNHBBY8",
"num_docs": 245442,
"size_in_bytes": 321793812,
"time_range": {
"start": 1310760863,
"end": 1591921222
},
"create_timestamp": 1651638846,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 200656781,
"end": 206628263
}
},
{
"split_state": "Published",
"update_timestamp": 1651638973,
"version": "1",
"split_id": "01G26NK8YX0DM4YSVH6J9YD1GN",
"num_docs": 421980,
"size_in_bytes": 559498837,
"time_range": {
"start": 1318868981,
"end": 1595630349
},
"create_timestamp": 1651638968,
"tags": [],
"demux_num_ops": 0,
"footer_offsets": {
"start": 349970236,
"end": 350048435
}
}
]
}
`

@HeenaBansal2009

@fmassot, I am unable to attach JSON files here, so I pasted the content.
This is my index configuration file:

#
# Index config file for gh-archive dataset.
#
version: 0

index_id:  hackernews_5

doc_mapping:
  store_source: false
  field_mappings:
    - name: id
      type: u64
      fast: true
    - name: type
      type: text
      tokenizer: raw
    - name: by
      type: text
      tokenizer: default
      record: position
    - name: time
      type: i64
      stored: true
      indexed: true
      fast: true
    - name: text
      type: text
      tokenizer: raw
      record: position
    - name: kids
      type: array<i64>
    - name: url
      type: text
      tokenizer: default
      record: position
    - name: score
      type: i64
      fast: true
    - name: title
      type: text
      tokenizer: default
      record: position

indexing_settings:
  timestamp_field: time

search_settings:
  default_search_fields: [title,by,text,url ]

@HeenaBansal2009

RUST_LOG=debug

Where can I set this log level? I don't see anything about it in the help output. Just wondering if I am putting it in the right place, or do you want me to add it somewhere else?

/quickwit service run searcher RUST_LOG=debug 
error: Found argument 'RUST_LOG=debug' which wasn't expected, or isn't valid in this context

USAGE:
    quickwit service run searcher [OPTIONS] --config <CONFIG>

For more information try --help

@fulmicoton
Contributor Author

Ah sorry. This is controlled by an environment variable.
If you are on Linux or macOS, the following should work.

export RUST_LOG=debug

@HeenaBansal2009

Ah sorry. This is controlled by an environment variable. If you are on Linux or macOS, the following should work.

export RUST_LOG=debug

Please see attached.
Quickwit extended logs.docx

@fulmicoton
Contributor Author

I did not find a change in schema. This might be a bug in tantivy's sstable based dictionary. I see that you set the raw tokenizer on the text field which I assume can be large.

There might be a strange edge case hitting on super large tokens that breaks our sstable dictionary in some way.
I have enough information to try to properly reproduce, I think.

Two last questions:

  • Did you index the entire hackernews dataset?
  • Do you observe the bug for any keyword or only specific ones?

In particular, can you test a keyword that happens early in the alphabet, like "ace"?

@fulmicoton
Contributor Author

And thank you so much for your patience and effort!

@HeenaBansal2009

I did not find a change in schema. This might be a bug in tantivy's sstable based dictionary. I see that you set the raw tokenizer on the text field which I assume can be large.

There might be a strange edge case hitting on super large tokens that breaks our sstable dictionary in some way. I have enough information to try to properly reproduce, I think.

Two last questions:

  • Did you index the entire hackernews dataset?
    Yes, I ingested the complete hackernews dataset.
  • Do you observe the bug for any keyword or only specific ones?

In particular, can you test a keyword that happens early in the alphabet, like "ace"?
I tested with "ace" and "google", I am getting the same error.
heena@Clickhouse1:~/quickwit-v0.2.1$ export QW_CONFIG=./config/quickwit.yaml
heena@Clickhouse1:~/quickwit-v0.2.1$ curl "http://0.0.0.0:7280/api/v1/hackernews_5/search/stream?query=ace&outputFormat=csv&fastField=id"
curl: (18) transfer closed with outstanding read data remaining
heena@Clickhouse1:~/quickwit-v0.2.1$ curl "http://0.0.0.0:7280/api/v1/hackernews_5/search/stream?query=google&outputFormat=csv&fastField=id"
curl: (18) transfer closed with outstanding read data remaining
heena@Clickhouse1:~/quickwit-v0.2.1$

@HeenaBansal2009

HeenaBansal2009 commented May 5, 2022

@fulmicoton, let me see if I can share the hackernews data. I am not sure I can upload such a large file to GitHub.
I downloaded the data from ClickHouse/ClickHouse#29693 (comment), converted it from the native format to JSON using ClickHouse, and then fed it to Quickwit.

If you are comfortable executing queries with quickwit, I can share the queries to get the hackernews data into JSON format with the dateTime field value converted to i64.

@fulmicoton
Contributor Author

Sharing 01G26NHMCV1BAP61AS006H7A75.split would be the most helpful actually.

@fulmicoton
Contributor Author

@HeenaBansal2009 I am able to reproduce! Thank you. I'll let you know when it is fixed.

@HeenaBansal2009

HeenaBansal2009 commented May 5, 2022

@fulmicoton Great! Thanks, I am waiting to hear back.
If you don't mind, once you have debugged it, I would like to know the root cause as well.
Thanks for your patience and support here.

@fulmicoton
Contributor Author

@HeenaBansal2009 In the meantime, I was able to identify that the problem is related to the "raw" tokenizer on the text field.
You probably want the default tokenizer there.

    - name: text
      type: text
      record: position

This will require you to reindex your dataset.

@fulmicoton
Contributor Author

fulmicoton commented May 5, 2022

OK, the bug is fully understood. The sstable index was not added to the hotdirectory because we have a filter that excludes chunks larger than 10 MB.

Because of the raw tokenizer, the split had an unusual number of unique tokens, but most importantly unusually large "last terms" per block. The slice containing the sstable index ended up being 20 MB large and was therefore dismissed.
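
To make the failure mode concrete, here is a minimal Python sketch of a size filter that silently drops a slice the searcher cannot work without. It is not Quickwit's code: `select_for_hotcache`, `always_keep`, and the `fieldnorms` slice with its size are made-up names and values for illustration. Only the 10 MB threshold and the roughly 20 MB sstable index come from the explanation above.

```python
# Illustrative sketch only: names and structure are hypothetical, not Quickwit's code.
MAX_CHUNK_BYTES = 10_000_000  # the 10 MB threshold mentioned above


def select_for_hotcache(slices, always_keep=frozenset()):
    """Return the names of the slices that get cached.

    `slices` maps a slice name to its size in bytes. Slices listed in
    `always_keep` bypass the size filter.
    """
    return {
        name
        for name, size in slices.items()
        if size <= MAX_CHUNK_BYTES or name in always_keep
    }


slices = {
    "sstable_index": 20_000_000,  # unusually large because of the raw tokenizer
    "fieldnorms": 400_000,        # made-up small slice for contrast
}

# Size filter alone: the required sstable index is dropped.
before_fix = select_for_hotcache(slices)

# Exempting the required slice from the filter keeps it available.
after_fix = select_for_hotcache(slices, always_keep={"sstable_index"})

print(sorted(before_fix))
print(sorted(after_fix))
```

With the filter alone, the searcher would presumably have to fetch the missing slice through another path, which is consistent with the "only supports async reads" error reported in this issue.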

@HeenaBansal2009

@fulmicoton, great. I am keeping an eye on the fix.
However, if this situation arises when the text field uses the raw tokenizer, do you think changing the tokenizer type for the text field would avoid the error, so that I can use curl?
Currently, my task is to measure the performance of Quickwit with a large data set against my proprietary setup.

@fulmicoton
Contributor Author

@HeenaBansal2009

I am not sure I understand.

The bug is... an actual bug and it will be fixed in Quickwit 0.3 which is meant to be released next week.
We take bugs very seriously :)

Now it triggers only when we have large splits with very large tokens which is most likely not something useful.

For a text with several words in it, that is on average longer than 30 characters, you will always want to use a tokenizer.
The raw tokenizer is more for small fields like a user id (c3ke8sa0A), or a severity (INFO, WARNING, etc).
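
As a rough illustration of the difference, the Python sketch below contrasts the two behaviours. `raw_tokenize` and `default_like_tokenize` are hypothetical helpers: the second one only approximates a default tokenizer by splitting on non-alphanumeric characters and lowercasing, and it is not tantivy's implementation.

```python
import re


def raw_tokenize(text):
    # The whole field value becomes one single term, unchanged.
    return [text]


def default_like_tokenize(text):
    # Approximation: split on non-alphanumeric characters, then lowercase.
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


title = "Icon Ambulance"   # title of the hit shown earlier in this issue
user_id = "c3ke8sa0A"      # short identifier, a good fit for raw

raw_terms = raw_tokenize(title)
default_terms = default_like_tokenize(title)

print(raw_terms)
print(default_terms)
print(raw_tokenize(user_id))
```

With the raw tokenizer, the only term indexed for the title is the full string, so a query for a single word such as Ambulance has nothing to match. With the tokenized variant, each word is indexed as its own term.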

For your performance experiment, do you want us to have a look at the schema and make sure that everything looks fine?

fulmicoton added a commit that referenced this issue May 6, 2022
Before it could be skipped if it was greater than 10mb,
but Quickwit does not work at all if it not there.

Such size was uncommon and due to a misconfiguration.
We are mitigating the size problem independently.

Closes #1366
@HeenaBansal2009

@fulmicoton ,
In my use case, the text/payload field will always contain strings longer than 30 characters, like the one below, and we are planning to keep it not stored but tokenized/indexed.

For this field, I am planning the configuration to be:
name: payload
type: text
tokenizer: raw
record: position
stored: false

Since the payload is longer than 30 characters, do you recommend setting the tokenizer to "default" instead? If it is default, IPs and other valuable info will be tokenized and cannot be searched without a phrase query,
and search could be impacted. Please correct me if my understanding is wrong.

Secondly, our frequent search queries look like query=VPN AUTHENTICATED USER.
Does it need to be a phrase query as well in Quickwit 0.2.1?

Please suggest/recommend whether changing the index configuration for this field can make a difference in performance, given the payload type we have.

{"event_id": "123e4567-e89b-12d3-a456-426614174000", "payload": "1331901000.000000 CHEt7z3AzG4gyCNgci 192.168.202.79 50465 192.168.229.251 80 1 HEAD 192.168.229.251 /DEASLog02.nsf - Mozilla/5.0 (compatible; Nmap Scripting Engine; http://nmap.org/book/nse.html) 0 0 404 Not Found - - - (empty) - - - - - - -"}

@HeenaBansal2009

@HeenaBansal2009

I am not sure I understand.

The bug is... an actual bug and it will be fixed in Quickwit 0.3 which is meant to be released next week. We take bugs very seriously :)
That's great! I really appreciate your efforts here.

@fulmicoton
Contributor Author

fulmicoton commented May 6, 2022

Oops I replied too fast.

Yes you should probably tokenize your text.

You are correct, with the default tokenizer the IP will have to be searched as a phrase query, which is not great.
raw is simply not an option. With raw, your search will return 0 documents.
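
To show why a phrase query is needed for an IP once the field is tokenized, here is a small Python sketch of a positional inverted index. All helper names are hypothetical, and the tokenizer is a simplified stand-in (split on non-alphanumeric characters, lowercase), not the actual default tokenizer.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Simplified stand-in for a default tokenizer.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_positional_index(docs):
    # term -> doc id -> list of positions where the term occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def all_terms_search(index, query):
    # Docs containing every term, in any order and at any position.
    doc_sets = [set(index.get(t, {})) for t in tokenize(query)]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase_search(index, phrase):
    # Docs where the terms occur at consecutive positions.
    terms = tokenize(phrase)
    hits = set()
    if not terms:
        return hits
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(
                start + i in index.get(t, {}).get(doc_id, [])
                for i, t in enumerate(terms)
            ):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "192.168.202.79 50465 192.168.229.251 80 1 HEAD",  # from the sample payload
    2: "79 202 168 192 scrambled",                        # made-up counterexample
}
index = build_positional_index(docs)

loose_hits = all_terms_search(index, "192.168.202.79")
phrase_hits = phrase_search(index, "192.168.202.79")
print(sorted(loose_hits), sorted(phrase_hits))
```

The IP is split into four numeric terms, so a query that only requires those terms also matches the scrambled document. The phrase match, which checks consecutive positions, keeps only the document that really contains the address. This is also why the field needs `record: position`.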

Ideally we should work on having a tokenizer that suits your use case.

We have a ticket related to that: #1143

In the meantime, can you give the default tokenizer + phrase query solution a try?

Also, if you care a lot about performance or if you never plan to target a specific field, it might be faster to concatenate by, title, and body and treat them as a single field.

@HeenaBansal2009

Thanks @fulmicoton for your input. Yes, #1143 absolutely matches our criteria for FT search.
I am looking forward to the fix.

@HeenaBansal2009

HeenaBansal2009 commented May 9, 2022

@fulmicoton, I must say Quickwit data ingestion is much faster than Elasticsearch. I am impressed. :-)

A quick question:
After indexing a JSON file into Quickwit, I see metrics like 'Indexed 8649189 documents in 9.80min.'
I can only see the number of documents indexed. Is there any way I can find out how many rows were inserted into Quickwit, so that I can prepare performance metrics based on the number of rows inserted and the speed?

@fmassot
Contributor

fmassot commented May 9, 2022

I'm not sure I understand what you mean by "rows", but in Quickwit there are several data structures where we store information:

  • we have a doc store which is a row-oriented storage (document id => document stored values)
  • we have a columnar storage called fastfield where we store the document field values in a continuous manner
  • we have an inverted index (very simplistic view is term -> term info -> list of doc IDs that contains the term)
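
A toy Python sketch of these three structures may help. It is a deliberately simplified picture, not how Quickwit or tantivy lay data out on disk. The first document reuses values from the hit shown earlier in this issue, and the second one is made up.

```python
docs = [
    {"id": 2923885, "title": "Icon Ambulance", "score": 622},
    {"id": 1, "title": "Ambulance siren", "score": 10},  # made-up document
]

# Doc store: row-oriented, internal doc id -> stored values of the document.
doc_store = dict(enumerate(docs))

# Fast fields: column-oriented, one array of values per field.
fast_fields = {field: [d[field] for d in docs] for field in ("id", "score")}

# Inverted index: term -> list of internal doc ids containing the term.
inverted_index = {}
for doc_id, doc in doc_store.items():
    for term in doc["title"].lower().split():
        inverted_index.setdefault(term, []).append(doc_id)

print(doc_store[0]["title"])
print(fast_fields["score"])
print(inverted_index["ambulance"])
```

In this picture every document is one row in the doc store, one entry per column in the fast fields, and one posting per term it contains.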

@fmassot
Contributor

fmassot commented May 9, 2022

As for indexing speed, we have seen an ingestion throughput of 40 MB/s on small servers (4vCPU) with local SSD on the same kind of events you have.

@HeenaBansal2009

For ES, we got metrics like:
28737558/28737558[1:23:43<00:00/5721.98docs/s] "index_time_in_millis": 28737558,(38m:12s)

I am looking for something similar for Quickwit. Quickwit inserts the same data file in almost 1/10 of the time.
I see the metrics on the console in terms of docs (which can differ depending on the implementation/partitioning of the data).
I am looking for the number of rows/records inserted.

Like in a database: the number of rows inserted compared to the number of records available in the data file.

@fmassot
Contributor

fmassot commented May 9, 2022

Mmm, maybe a command like quickwit index describe --index wikipedia --config ./config/quickwit.yaml is what you are looking for? See the docs here: https://quickwit.io/docs/reference/cli#index-describe

You can only view the number of "published documents" there and the size of the index. "published" means that it is ready for search.
Another way to get the number of documents is to run a match-all query *.

@guilload
Member

guilload commented May 9, 2022

In Quickwit jargon, 1 doc = 1 document = 1 record = 1 row.

quickwit index ingest outputs the number of docs ingested for the file.
quickwit index describe outputs the total number of docs in the index.

Does that help?
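
For a cross-check against the source file, a minimal Python sketch like the one below counts the records in the input, assuming the file is newline-delimited JSON with one record per line. `count_records` is a hypothetical helper, not part of Quickwit, and the temporary file stands in for the real dataset.

```python
import json
import os
import tempfile


def count_records(path):
    """Count non-empty lines in a newline-delimited JSON file."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # fail loudly on a malformed record
                count += 1
    return count


# Tiny made-up file standing in for the real dataset.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"id": 1, "title": "first"}\n')
    tmp.write('{"id": 2, "title": "second"}\n')
    sample_path = tmp.name

record_count = count_records(sample_path)
os.remove(sample_path)
print(record_count)
```

If every record is accepted at ingestion, this count should equal the number of docs reported by `quickwit index ingest` for that file.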
