Bug in quickwit search stream StorageDirectory only supports async reads
#1366
Comments
@HeenaBansal2009 Can you add the first line of stdout? It contains the sha1 of the version you used. Is it main or the released 0.2.1? Also, did you use S3 or the local file storage? |
@fulmicoton, I used the local file storage. I am using 0.2.1.
Pasting the logs here . ``heena@Clickhouse1:~/quickwit-v0.2.1$ ./quickwit service run searcher Connectivity checklist 2022-05-04T20:07:19.755Z INFO quickwit_cluster::cluster: Create new cluster. node_id="node-patient-lLKH" listen_addr=127.0.0.1:7280 |
thank you this is super helpful! |
@HeenaBansal2009 this is quite mysterious. Can you provide even more logs? What happened before the error... RUST_LOG=debug might help too. |
Flushing the status of my investigation. I have not been able to reproduce yet. The logs suggest everything is working somewhat correctly, but the warmup phase did not work properly for two splits. |
Also can you share the content of the meta.json that you obtain after running: ./quickwit split extract --index hackernews_5 --split 01G26NK8YX0DM4YSVH6J9YD1GN --target-dir . as well as the current index configuration. If you don't have your config file, you can send the qwdata/indexes/hackernews_5/metastore.json file. My current suspicion is a bug triggered by a change of schema. |
meta.json
{ |
@fmassot, I am unable to attach JSON files here, so I pasted the content.
|
Where can I set this level of logs? I don't see anything in the help output. Just wondering if I am putting it in the right place, or do you want me to add it somewhere else?
|
Ah sorry. This is controlled by an environment variable.
|
Please see attached. |
I did not find a change in schema. This might be a bug in tantivy's sstable-based dictionary. I see that you set the raw tokenizer on the text field, which I assume can be large. There might be a strange edge case triggered by very large tokens that breaks our sstable dictionary in some way. One last question:
In particular, can you test a keyword that happens early in the alphabet, like "ace"? |
And thank you so much for your patience and effort! |
|
@fulmicoton, let me see if I can share the hackernews data. I am not sure I can add such a huge file on GitHub. If you are comfortable executing queries with Quickwit, I can share the queries to get the hackernews data into JSON format with the dateTime field value changed to i64. |
Sharing |
@HeenaBansal2009 I am able to reproduce! Thank you. I'll let you know when it is fixed. |
@fulmicoton Great! Thanks, I am waiting to hear back. |
@HeenaBansal2009 In the meantime, I was able to identify that the problem is related to the "raw" tokenizer on the text field.
This will require you to reindex your dataset. |
OK, the bug is fully understood. The sstable index was not added to the hotdirectory because we have a filter that excludes chunks larger than 10 megabytes. Because of the raw tokenizer, the split had an unusual number of unique tokens and, most importantly, unusually large "last terms" per block. The slice containing the sstable index ended up being 20 MB and was therefore dismissed. |
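To illustrate the failure mode described above, here is a minimal sketch of a size-only cache filter next to one that always keeps slices search cannot work without. It is not Quickwit's actual code: the `Slice` type, the `required` flag, the slice names, and both function names are hypothetical; only the 10 MB threshold and the 20 MB sstable index size come from the discussion.

```python
from dataclasses import dataclass

# Threshold taken from the thread ("chunks larger than 10 megabytes").
MAX_CACHED_SLICE_BYTES = 10_000_000


@dataclass(frozen=True)
class Slice:
    name: str        # illustrative label, not a real Quickwit file name
    num_bytes: int
    required: bool   # hypothetical flag: search cannot proceed without it


def select_size_only(slices):
    """Size-only filter: drops every slice above the threshold."""
    return [s for s in slices if s.num_bytes <= MAX_CACHED_SLICE_BYTES]


def select_keep_required(slices):
    """Keeps required slices whatever their size; filters only optional ones."""
    return [
        s for s in slices
        if s.required or s.num_bytes <= MAX_CACHED_SLICE_BYTES
    ]


slices = [
    Slice("sstable_index", 20_000_000, required=True),
    Slice("small_optional", 1_000_000, required=False),
    Slice("large_optional", 50_000_000, required=False),
]

size_only_names = [s.name for s in select_size_only(slices)]
keep_required_names = [s.name for s in select_keep_required(slices)]

print(size_only_names)      # the 20 MB sstable index is silently dropped
print(keep_required_names)  # the sstable index survives despite its size
```

With the size-only filter the 20 MB index never reaches the cache, which matches the symptom in the thread: the search later falls back to a synchronous read that the storage directory does not support.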
@fulmicoton, great. I am keeping an eye on the fix. |
I am not sure I understand. The bug is... an actual bug, and it will be fixed in Quickwit 0.3, which is meant to be released next week. Now, it triggers only when we have large splits with very large tokens, which is most likely not a useful configuration. For a text with several words in it that is on average longer than 30 characters, you will always want to use a tokenizer. For your performance experiment, do you want us to have a look at the schema and make sure that everything looks fine? |
Previously, it could be skipped if it was larger than 10 MB, but Quickwit does not work at all if it is not there. Such a size was uncommon and due to a misconfiguration. We are mitigating the size problem independently. Closes #1366
@fulmicoton, for this field, I am planning the configuration to be like: Since the payload is more than 30 characters, do you recommend setting the tokenizer to "default" instead? If it is default, IPs and other valuable info will be tokenized and cannot be searched without a phrase query. Secondly, our frequent search queries look like query=VPN AUTHENTICATED USER. Please advise whether changing the index configuration for this field can make a difference in performance, given the payload type we have.
|
|
Oops, I replied too fast. Yes, you should probably tokenize your text. You are correct: with the default tokenizer, the IP will have to be searched as a phrase query, which is not great. Ideally we should work on having a tokenizer that suits your use case. We have a ticket related to that: #1143. In the meantime, can you give the default tokenizer + phrase query solution a try? Also, if you care a lot about performance or if you never plan to target a specific field, it might be faster to concatenate by, title, and body and treat them as a single field. |
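The trade-off between the raw tokenizer and the default tokenizer can be sketched as follows. This is only an approximation for illustration: `simple_tokenize` assumes the default tokenizer splits on non-alphanumeric characters and lowercases, `contains_phrase` is a hypothetical stand-in for a phrase query, and the payload string is made up.

```python
import re


def raw_tokenize(text):
    """Raw tokenizer idea: the whole field value becomes a single token."""
    return [text]


def simple_tokenize(text):
    """Assumed default behavior: split on non-alphanumerics, then lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def contains_phrase(tokens, phrase_tokens):
    """Hypothetical phrase match: phrase tokens must appear consecutively."""
    n = len(phrase_tokens)
    return any(
        tokens[i:i + n] == phrase_tokens
        for i in range(len(tokens) - n + 1)
    )


payload = "VPN AUTHENTICATED USER from 10.0.0.12"  # made-up example payload

raw_tokens = raw_tokenize(payload)
default_tokens = simple_tokenize(payload)

# With the raw tokenizer, a single keyword never matches the one giant token.
keyword_hits_raw = "vpn" in raw_tokens

# With the default tokenizer, the IP is split into four tokens, so it has to
# be searched as a phrase rather than as a single term.
ip_as_phrase = contains_phrase(default_tokens, simple_tokenize("10.0.0.12"))
words_as_phrase = contains_phrase(
    default_tokens, simple_tokenize("VPN AUTHENTICATED USER")
)

print(default_tokens)
print(keyword_hits_raw, ip_as_phrase, words_as_phrase)
```

Under these assumptions, the IP address and the multi-word query both remain searchable after tokenization, as long as they are issued as phrase queries.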
Thanks @fulmicoton for your inputs. Yes absolutely #1143 will match our criteria for FT search. |
@fulmicoton, I must say Quickwit data ingestion is much faster than Elasticsearch. I am impressed. :-) A quick question: |
I'm not sure I understand what you mean by "rows", but in Quickwit you have several data structures where we store information:
|
As for indexing speed, we have seen an ingestion throughput of 40 MB/s on small servers (4vCPU) with local SSD on the same kind of events you have. |
For ES, we got metrics like: I am looking for something similar in Quickwit. Quickwit inserts the same data file in almost one tenth of the time. In a database, for example, we can compare the number of rows inserted with the number of records available in the data file. |
Mmm, maybe a command like You can only view the number of "published documents" there and the size of the index. "published" means that is ready for search. |
In Quickwit jargon, 1 doc = 1 document = 1 record = 1 row. Does that help? |
Copy pasted from #1357 (reply in thread)
I am able to ingest data in Quickwit and search. However, when I search using the curl command, I get a read async error.
What could be going wrong here?
heena@Clickhouse1:~/quickwit-v0.2.1$ ./quickwit index search --index hackernews_5 --query Ambulance 2022-05-04T13:25:27.169Z ERROR quickwit_directories::storage_directory: path="6dc68fd1122c44a985ccf5348907c5f8.term" msg="Unsupported operation. StorageDirectory only supports async reads" 2022-05-04T13:25:27.171Z ERROR quickwit_directories::storage_directory: path="ccf34dbac4614904b1124b751756dab8.term" msg="Unsupported operation. StorageDirectory only supports async reads" { "numHits": 1, "hits": [ { "by": [ "sgk284" ], "id": [ 2923885 ], "kids": [ 2923989, 2925247, 2924320, 2925442, 2924224, 2923994, 2924209, 2924702, 2925235, 2925010, 2924319, 2924638, 2925781, 2923943, 2924298 ], "score": [ 622 ], "text": [ "" ], "time": [ 1314251037 ], "title": [ "Icon Ambulance" ], "type": [ "story" ], "url": [ "https://plus.google.com/107117483540235115863/posts/gcSStkKxXTw" ] } ], "elapsedTimeMicros": 77324, "errors": [ "SplitSearchError { error: \"Internal error:
An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: \"ccf34dbac4614904b1124b751756dab8.term\"'.\", split_id: \"01G26NHMCV1BAP61AS006H7A75\", retryable_error: true }", "SplitSearchError { error: \"Internal error:
An IO error occurred: 'Unsupported operation. StorageDirectory only supports async reads: \"6dc68fd1122c44a985ccf5348907c5f8.term\"'.\", split_id: \"01G26NK8YX0DM4YSVH6J9YD1GN\", retryable_error: true }" ] }
The output of the curl command searching for the same keyword:
heena@Clickhouse1:~/quickwit-v0.2.1$ curl "http://0.0.0.0:7280/api/v1/hackernews_5/search/stream?query=Ambulance&outputFormat=csv&fastField=id" curl: (18) transfer closed with outstanding read data remaining heena@Clickhouse1:~/quickwit-v0.2.1$
I attached the console logs from running these commands. This might be helpful.