Question about running extract_docs_from_index.py #37

Open
yiyaxiaozhi opened this issue Mar 20, 2021 · 7 comments

yiyaxiaozhi commented Mar 20, 2021

I tried to run extract_docs_from_index.py with the following command, using the pre-built index provided by Pyserini:
awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py lucene index-robust04-20191213/ > data/robust/documents.tsv

but I get an error:
[image: screenshot of the error]
I have not changed any code in the file.

My Java version is:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
Do I have the correct Java version?

Could you give me some advice on this error?
Thanks a lot!


Update: I indexed the Robust04 document files myself, and extract_docs_from_index.py then ran successfully!
I checked the resulting documents.tsv file with pandas and found 73855 records. I don't know how many records there should be, so I would appreciate it if you could tell me the correct number.
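
One way to sanity-check a count like this: the extraction command feeds the third column of every .run file into the script, so the number of distinct IDs in that column is a natural upper bound on the number of documents. Below is a minimal sketch of that comparison. It assumes the document ID sits in the second tab-separated column of documents.tsv (an assumption about the output layout, adjustable via id_column), and the function names are hypothetical. It splits on tabs by hand because pandas.read_csv treats double quotes as quote characters by default, which can merge rows of free text unless quoting=csv.QUOTE_NONE is passed.

```python
def unique_run_doc_ids(run_paths):
    """Distinct values of the third whitespace-separated column,
    mirroring `awk '{print $3}'` over the run files."""
    doc_ids = set()
    for path in run_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if len(cols) >= 3:
                    doc_ids.add(cols[2])
    return doc_ids


def tsv_doc_ids(tsv_path, id_column=1):
    """Document IDs from a tab-separated file, with no quote handling.

    id_column=1 assumes rows look like `doc<TAB>id<TAB>text`; that layout
    is an assumption, so adjust it to match your file.
    """
    doc_ids = []
    # newline="\n" keeps a stray carriage return inside a document
    # from being treated as a row break.
    with open(tsv_path, encoding="utf-8", newline="\n") as f:
        for line in f:
            cols = line.rstrip("\r\n").split("\t")
            if len(cols) > id_column:
                doc_ids.append(cols[id_column])
    return doc_ids


def compare_counts(run_paths, tsv_path, id_column=1):
    """Summarise how the extracted file lines up with the run files."""
    expected = unique_run_doc_ids(run_paths)
    found = tsv_doc_ids(tsv_path, id_column)
    return {
        "expected": len(expected),            # distinct IDs requested
        "rows": len(found),                   # rows in the TSV
        "unique_found": len(set(found)),      # distinct IDs in the TSV
        "missing": sorted(expected - set(found)),
    }
```

It could be called as compare_counts(glob.glob("data/robust/*.run"), "data/robust/documents.tsv"). A non-empty "missing" list would point at documents the index could not supply.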


WHU-ZQH commented Apr 23, 2021

I am also having problems running extract_docs_from_index.py. Could you please release the documents.tsv file?

@seanmacavaney
Contributor

We cannot release the dataset directly due to the data usage agreement. However, I could provide a script that builds the file from the ir-datasets package, if that would help? Note that for this to work, you would need the original dataset source files.

Let me know if this is something you'd want.
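
For readers wondering what such a script might look like, here is a minimal sketch and not the maintainer's actual script. It assumes the ir_datasets package is installed, that the dataset ID is "trec-robust04", that the original source files are already available to ir_datasets, and that the target layout is one `doc<TAB>id<TAB>text` row per document. The function names and the keep_ids parameter are hypothetical.

```python
def clean(text):
    """Collapse tabs, newlines and repeated spaces so that each
    document stays on a single TSV row."""
    return " ".join(text.split())


def write_documents_tsv(docs, out, keep_ids=None):
    """Write `doc<TAB>id<TAB>text` rows for objects exposing
    .doc_id and .text; returns the number of rows written.

    keep_ids (optional set) restricts output to the listed IDs,
    e.g. the IDs that appear in the run files.
    """
    count = 0
    for doc in docs:
        if keep_ids is not None and doc.doc_id not in keep_ids:
            continue
        out.write(f"doc\t{doc.doc_id}\t{clean(doc.text)}\n")
        count += 1
    return count


def build_from_ir_datasets(dataset_id="trec-robust04",
                           out_path="documents.tsv",
                           keep_ids=None):
    """Assumed entry point: stream documents from ir_datasets to a TSV."""
    import ir_datasets  # third-party; imported lazily on purpose

    dataset = ir_datasets.load(dataset_id)
    with open(out_path, "w", encoding="utf-8") as out:
        return write_documents_tsv(dataset.docs_iter(), out, keep_ids)
```

The writer is kept separate from the ir_datasets call so that it can be exercised with any iterable of objects that have doc_id and text attributes.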


WHU-ZQH commented Apr 23, 2021

I have the same problem as you. Could you please tell me how to solve it?

@seanmacavaney
Contributor

The error says that the index was created with a newer Lucene version than the current software supports. I think you should be able to add a codecs JAR to your CLASSPATH to overcome this. Here's one that might work: https://github.com/Georgetown-IR-Lab/OpenNIR/blob/master/onir/resources/lucene-backward-codecs-8.0.0.jar

You'll probably need to add it to the classpath here: https://github.com/Georgetown-IR-Lab/cedr/blob/master/cedr/extract_docs_from_index.py#L25

Let me know if this helps!
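
A minimal sketch of what adding the JAR might look like, assuming the script configures its JVM through pyjnius (jnius_config) before importing jnius. That is an assumption about the script, and the "bin/anserini.jar" path in the usage note is hypothetical. Only the lucene-backward-codecs-8.0.0.jar file name comes from the link above.

```python
import os


def build_classpath(base_entries, extra_jars):
    """Return the base classpath entries followed by the extra JARs,
    keeping the original order and dropping duplicates."""
    entries = []
    for entry in list(base_entries) + list(extra_jars):
        if entry not in entries:
            entries.append(entry)
    return entries


def as_env_value(entries):
    """Join entries the way the CLASSPATH environment variable expects
    (':' on Linux/macOS, ';' on Windows)."""
    return os.pathsep.join(entries)


def configure_jvm(entries):
    """Apply the classpath via pyjnius.

    This must run before `import jnius`, because the classpath cannot
    be changed once the JVM has started.
    """
    missing = [e for e in entries if not os.path.exists(e)]
    if missing:
        raise FileNotFoundError(f"classpath entries not found: {missing}")
    import jnius_config  # third-party (pyjnius); imported lazily
    jnius_config.set_classpath(*entries)
```

For example, configure_jvm(build_classpath(["bin/anserini.jar"], ["lucene-backward-codecs-8.0.0.jar"])) would put both JARs on the classpath, provided both files exist at those paths.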


WHU-ZQH commented Apr 27, 2021

Thanks very much! That resolved the problem. I still have one question, though: how did you obtain the "train_pairs" used in your study?


Akakaala commented Jan 5, 2022

Hello, could I see a sample of the documents.tsv file you produced? I do not have the complete file, so I don't know what format the data should end up in.


yysirs commented May 31, 2022

Hi, could you please share the files under your index-robust04-20191213 folder?
