Cord-19 index and result #1153

Closed
zkt12 opened this issue May 6, 2020 · 6 comments · Fixed by #1165

Comments


zkt12 commented May 6, 2020

I followed the instructions exactly to build the CORD-19 index and run retrieval. The nDCG@10 (query-udel, round 1, 04-10) on the full-text index is 0.4996, whereas the reported value is 0.5407. Why?
Also, when I build the paragraph index (05-01), only 1.72M documents are indexed, whereas the reported count is 1.76M.

Member

lintool commented May 6, 2020

As a first step to debugging, why don't you start with the pre-built indexes first?
We also provide the 05-01 index pre-built.

After that, we need more details... what OS, Java version, etc.
It'd be helpful if you copy and pasted the exact commands to be clear.

Author

zkt12 commented May 6, 2020

I tried to get the pre-built indexes, but Dropbox seems to be unavailable from mainland China, even with a VPN.

So I built the indexes like this:

DATE=2020-05-01
DATA_DIR=./cord19-"${DATE}"
mkdir "${DATA_DIR}"

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/comm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/noncomm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/custom_license.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/biorxiv_medrxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/arxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/metadata.csv -P "${DATA_DIR}"
ls "${DATA_DIR}"/*.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"

sh target/appassembler/bin/IndexCollection \
  -collection Cord19ParagraphCollection -generator Cord19Generator \
  -threads 8 -input "${DATA_DIR}" \
  -index "${DATA_DIR}"/lucene-index-cord19-paragraph-"${DATE}" \
  -storePositions -storeDocvectors -storeContents -storeRaw -optimize > log.cord19-paragraph.${DATE}.txt

and ran retrieval like this:

target/appassembler/bin/SearchCollection -index lucene-index-covid-paragraph-2020-05-01 \
  -topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1-udel.xml -topicfield query -removedups -strip_segment_id \
  -bm25 -output runs/run.covid-r1.paragraph.query-udel.bm25.txt

My OS is Ubuntu 16.04, with Java 11.0.6.

Member

lintool commented May 6, 2020

Well, you're trying to evaluate retrieval against the 5/1 corpus using qrels from the 4/10 corpus, so of course your numbers are going to be lower... TREC-COVID round 1 is against the 4/10 corpus, so to replicate results you'll have to use that.
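To illustrate the point about mismatched qrels, here is a minimal Python sketch of nDCG@10 for a single topic. It is not trec_eval and not Anserini code; the docids, relevance grades, and rankings are hypothetical, and it assumes the common convention that a document with no judgment counts as non-relevant.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one topic; docids absent from qrels get gain 0 (assumption)."""
    gains = [qrels.get(docid, 0) for docid in ranking]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments, pooled from the older corpus snapshot.
qrels = {"d1": 2, "d2": 1, "d3": 1}

# Ranking over the older snapshot: judged documents occupy the top ranks.
old_run = ["d1", "d2", "d3", "d4"]
# Ranking over the newer snapshot: unjudged documents (n1, n2) displace judged ones.
new_run = ["n1", "d1", "n2", "d2", "d3"]

old_score = ndcg_at_k(old_run, qrels)
new_score = ndcg_at_k(new_run, qrels)
print(f"older snapshot: {old_score:.4f}")
print(f"newer snapshot: {new_score:.4f}")
```

In this toy case the second ranking scores lower only because the newly added documents have no judgments, whether or not they are actually relevant.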

Author

zkt12 commented May 6, 2020

Sorry, I didn't make it clear.
I've built the abstract, full-text, and paragraph indexes for both the 04-10 and 05-01 corpora. There are two discrepancies from the reported numbers:

  1. When I run retrieval on the 04-10 full-text index, the nDCG@10 is 0.4996.
  2. When I build the 05-01 paragraph index, only 1.72M documents are indexed.

Member

lintool commented May 6, 2020

Our indexes are mirrored here: https://git.uwaterloo.ca/jimmylin/cord19-indexes

Can you try with pre-built indexes?

Member

lintool commented May 6, 2020

I think I understand the issue now. After the construction of the pre-built indexes, I manually did some data cleaning to blacklist a few outlier documents, see: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/Cord19Generator.java#L95

For example: 37491d1

Note that this was done independently of search results; in fact you can see from the commit id that my manual cleaning predates the release of the round 1 results.

Thus, if you use the latest HEAD to go back and index the corpus from 4/10, you'd get slightly different document counts. This changes term and document statistics slightly... apparently enough to have an impact on the effectiveness. Small changes though.

Hope this clarifies things.

To be clear, this explains the following point from the original report:

"And, when I build the paragraph index (05-01), only 1.72m are indexed, which is claimed 1.76m."
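The earlier remark that dropping a few documents changes term and document statistics can be made concrete with a toy Python sketch. It uses the IDF form from Lucene's BM25Similarity, ln(1 + (N - n + 0.5) / (n + 0.5)). The four-document collection and the choice of which document to drop are hypothetical; this is not the actual blacklist in Cord19Generator.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """IDF in the form used by Lucene's BM25Similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def collection_stats(collection):
    """Return (idf per term, average document length) for a toy collection."""
    n_docs = len(collection)
    vocab = {term for terms in collection.values() for term in terms}
    idf = {
        term: bm25_idf(n_docs, sum(term in terms for terms in collection.values()))
        for term in sorted(vocab)
    }
    avg_len = sum(len(terms) for terms in collection.values()) / n_docs
    return idf, avg_len

# Hypothetical toy collection; "c" stands in for an outlier document.
docs = {
    "a": ["virus", "origin"],
    "b": ["virus", "transmission"],
    "c": ["virus", "virus", "table", "table", "table"],
    "d": ["transmission", "masks"],
}

full, full_avg_len = collection_stats(docs)
cleaned, cleaned_avg_len = collection_stats(
    {docid: terms for docid, terms in docs.items() if docid != "c"}
)

print(f"idf(virus): {full['virus']:.4f} -> {cleaned['virus']:.4f}")
print(f"avg doc length: {full_avg_len:.2f} -> {cleaned_avg_len:.2f}")
```

Both the IDF of a term and the average document length feed into BM25 scores, so removing even one document shifts the scores of the documents that remain. In a collection with over a million paragraphs the shift per document is far smaller than in this toy example, which fits the "small changes" described above.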
