Question about running extract_docs_from_index.py #37

yiyaxiaozhi · 2021-03-20T12:48:23Z

I tried to run extract_docs_from_index.py with the command below, using the prebuilt index provided by Pyserini:
awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py lucene index-robust04-20191213/ > data/robust/documents.tsv

but I get an error:

I have not changed any code in the file.

My Java version is:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
Do I have the correct Java version?

Could you give some advice on this error?
Thanks a lot!

I indexed the Robust04 document files myself and ran extract_docs_from_index.py successfully!
I then checked the documents.tsv file with pandas and found 73855 records. I don't know how many records there should be, and I would appreciate it if you could tell me the correct number.
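
One way to sanity-check the record count is to compare the document IDs requested from the run files with the IDs that actually ended up in the output file. The following is a minimal sketch, not the repository's own code. It assumes the run files hold the document ID in the third whitespace-separated field (as the `awk '{print $3}'` command implies) and that the TSV holds the ID in the second tab-separated column; the latter is an assumption, so adjust `id_column` to match your file. All function names are hypothetical.

```python
def run_doc_ids(run_lines):
    """Collect the third whitespace-separated field of each run line,
    mirroring what `awk '{print $3}'` feeds to the extraction script."""
    ids = set()
    for line in run_lines:
        fields = line.split()
        if len(fields) >= 3:
            ids.add(fields[2])
    return ids


def tsv_doc_ids(tsv_lines, id_column=1):
    """Collect document IDs from tab-separated lines.
    id_column=1 is an assumption about the documents.tsv layout."""
    ids = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > id_column:
            ids.add(cols[id_column])
    return ids


def missing_doc_ids(run_lines, tsv_lines, id_column=1):
    """Return IDs requested by the run files but absent from the TSV."""
    return run_doc_ids(run_lines) - tsv_doc_ids(tsv_lines, id_column)
```

Both helpers only iterate over lines, so open file handles can be passed directly. If `missing_doc_ids` comes back empty, every document ID named in the run files has a record in the output.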

WHU-ZQH · 2021-04-23T11:54:32Z

I also have problems running extract_docs_from_index.py. Could you please release the documents.tsv files?

seanmacavaney · 2021-04-23T11:57:20Z

We cannot release the dataset directly due to the data usage agreement. However, I could provide a script that builds the file from the ir-datasets package, if that would help? Note that for this to work, you would need the original dataset source files.

Let me know if this is something you'd want.
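
For readers who want to try this in the meantime, here is a rough sketch of the formatting half of such a script. It is not the script offered above. It assumes each output record is a single line of the form `doc`, document ID, text, separated by tabs, which is an assumption about the expected layout. `lookup_text` is a hypothetical callable that returns the text for a document ID, or `None` if the ID is unknown; it could be a thin wrapper around the ir-datasets document store, with the text field depending on the dataset.

```python
import re


def to_tsv_line(doc_id, text):
    """Format one record. Runs of whitespace (tabs, newlines) collapse to
    single spaces so the record stays on one line. The leading 'doc' marker
    is an assumed layout, not a confirmed one."""
    clean = re.sub(r"\s+", " ", text).strip()
    return f"doc\t{doc_id}\t{clean}"


def build_tsv_lines(doc_ids, lookup_text):
    """Yield one TSV line per unique ID, in first-seen order.
    IDs for which lookup_text returns None are skipped."""
    seen = set()
    for doc_id in doc_ids:
        if doc_id in seen:
            continue
        seen.add(doc_id)
        text = lookup_text(doc_id)
        if text is not None:
            yield to_tsv_line(doc_id, text)
```

Deduplicating the IDs matters because the same document usually appears in the run files for several queries.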

WHU-ZQH · 2021-04-23T14:14:35Z

I have the same problem as you. Could you please tell me how you solved it?

seanmacavaney · 2021-04-23T14:20:36Z

The error says that the index was created with a newer Lucene version than the current software supports. I think you should be able to add a codecs JAR to your CLASSPATH to overcome this. Here's one that might work: https://github.com/Georgetown-IR-Lab/OpenNIR/blob/master/onir/resources/lucene-backward-codecs-8.0.0.jar

You'll probably need to add it to the classpath here: https://github.com/Georgetown-IR-Lab/cedr/blob/master/cedr/extract_docs_from_index.py#L25
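
As a rough illustration of the classpath change, the sketch below assumes the script configures the JVM through pyjnius, whose `jnius_config.set_classpath` has to be called before `jnius` is imported. The helper names are hypothetical, and the JAR paths you pass depend on where the files live on your machine.

```python
def classpath_entries(*jars):
    """Return the JAR paths in their given order, dropping duplicates."""
    entries = []
    for jar in jars:
        if jar not in entries:
            entries.append(jar)
    return entries


def configure_classpath(*jars):
    """Register JARs with pyjnius.

    Call this before `import jnius`: pyjnius refuses classpath changes
    once the JVM is running.
    """
    import jnius_config  # assumption: the script uses pyjnius

    jnius_config.set_classpath(*classpath_entries(*jars))
```

In practice this means passing both the JAR the script already loads and the downloaded lucene-backward-codecs-8.0.0.jar to `configure_classpath`, in place of the existing classpath call.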

Let me know if this helps!

WHU-ZQH · 2021-04-27T01:40:39Z

Thanks very much! I have resolved the problem, but I still have a question: how did you obtain the "train_pairs" in your study?

Akakaala · 2022-01-05T08:19:07Z

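Hello, could I see a sample of the documents.tsv file you produced? I don't have the complete file, so I don't know what format the data should be processed into.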

yysirs · 2022-05-31T12:27:32Z

Hi, can you share the files under your index-robust04-20191213 folder, please?
