Question about running extract_docs_from_index.py #37
Comments
I am also having problems running extract_docs_from_index.py. Could you please release the document.tsv files?
We cannot release the dataset directly because of the data usage agreement. However, I could provide a script that builds the file from the ir-datasets package, if that would help. Note that for this to work, you would need the original dataset source files. Let me know if this is something you'd want.
The error says that the index was created with a newer Lucene version than the current software supports. I think you should be able to add a codecs JAR to your CLASSPATH to work around this. Here's one that might work: https://github.com/Georgetown-IR-Lab/OpenNIR/blob/master/onir/resources/lucene-backward-codecs-8.0.0.jar You'll probably need to add it to the classpath here: https://github.com/Georgetown-IR-Lab/cedr/blob/master/cedr/extract_docs_from_index.py#L25 Let me know if this helps!
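For anyone trying this, here is a rough sketch of appending a JAR to a classpath from Python. It assumes the script configures the JVM through pyjnius (`jnius_config`), which is not confirmed in this thread; `build_classpath`, `configure_jvm`, and the JAR paths in the usage note are hypothetical names for illustration only.

```python
def build_classpath(existing_entries, codecs_jar):
    """Return the existing classpath entries plus the codecs JAR, without duplicates."""
    entries = list(existing_entries)
    if codecs_jar not in entries:
        entries.append(codecs_jar)
    return entries


def configure_jvm(entries):
    """Apply the classpath through pyjnius.

    Assumption: the script uses pyjnius. This must be called before
    `import jnius`, because the classpath cannot change once the JVM is up.
    """
    import jnius_config  # imported lazily so build_classpath works without pyjnius

    jnius_config.set_classpath(*entries)
```

Usage would look something like `configure_jvm(build_classpath(["path/to/existing.jar"], "path/to/lucene-backward-codecs-8.0.0.jar"))`, with both paths replaced by wherever the JARs live on your machine.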
Thanks very much! I have resolved the problem, but I still have a question: how did you obtain the "train_pairs" in your study?
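Hello, could I see a sample of the document.tsv file you produced? I don't have the complete file, so I don't know what format the data should be processed into.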
Hi, can you share the files under your index-robust04-20191213 folder, please?
I am trying to run extract_docs_from_index.py with the command below, using the pre-built index provided by Pyserini:
awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py lucene index-robust04-20191213/ > data/robust/documents.tsv
but I get an error:
I have not changed any code in the file.
My Java version is:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
Is this the correct Java version?
Could you give me some advice on this error?
Thanks a lot!
I indexed the Robust04 document files myself and ran extract_docs_from_index.py successfully!
Then I checked the document.tsv file with pandas and found that it contains 73855 records. I don't know how many records there should be, and I would appreciate it if you could tell me the correct number.
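One way to sanity-check the record count without knowing the "official" number is to compare it against the distinct document IDs in the run files, since the third column of those files is what the command above pipes into the script. The sketch below does that with the standard library only. It assumes the TSV stores the document ID in a tab-separated column whose position you pass as `id_column` (the default of 1 is a guess at a type/id/text layout, so adjust it to match your file); all function names are hypothetical.

```python
def read_run_docids(paths):
    """Collect distinct doc IDs from the third whitespace-separated column of each run file."""
    docids = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if len(cols) >= 3:
                    docids.add(cols[2])
    return docids


def read_tsv_docids(path, id_column=1):
    """Return the doc ID of every record in the TSV, in file order.

    Assumption: IDs live in the tab-separated column `id_column`.
    """
    docids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) > id_column:
                docids.append(cols[id_column])
    return docids


def compare(run_docids, tsv_docids):
    """Summarise how the TSV records line up with the IDs requested by the run files."""
    tsv_set = set(tsv_docids)
    return {
        "expected": len(run_docids),                 # distinct IDs in the run files
        "records": len(tsv_docids),                  # rows found in the TSV
        "missing": sorted(run_docids - tsv_set),     # requested but not extracted
        "duplicates": len(tsv_docids) - len(tsv_set),
    }
```

For example, `compare(read_run_docids(["data/robust/f1.run"]), read_tsv_docids("data/robust/documents.tsv"))` (the run file name is a placeholder) reports whether any requested documents are missing or repeated. If "missing" is empty and "duplicates" is zero, the record count matches what the run files ask for.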