New MS MARCO (V1) doc regressions #1721

Closed
lintool opened this issue Jan 6, 2022 · 3 comments · Fixed by #1723

lintool commented Jan 6, 2022

More work on the reproducibility issue described in https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-doc2query-details.md

There will be a forthcoming PR swapping in new "ground truth" for MS MARCO (V1) doc regressions. The segmentation has been fixed, and expansion now uses JSON-formatted data.

For reference, these are the final source ground truth corpora:

# msmarco-doc
$ wc /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc/docs.jsonl
    3213835  3633917727 23321565733 /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc/docs.jsonl

$ md5sum /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc/docs.jsonl
c97a4c4a3f5e7c20df2783691cc7028e  /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc/docs.jsonl

# msmarco-doc-segmented
$ wc /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented/docs.jsonl
   20545677  4205249629 28040486489 /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented/docs.jsonl

$ md5sum /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented/docs.jsonl
2a07583e377a95574223efd69f2877c5  /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented/docs.jsonl

# msmarco-doc-docTTTTTquery
$ wc /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_d2q-t5/docs.jsonl
    3213835  5009978479 30998099395 /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_d2q-t5/docs.jsonl

$ md5sum /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_d2q-t5/docs.jsonl
3b6e1b27edf390a6eb4155d19a4b7630  /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_d2q-t5/docs.jsonl

# msmarco-doc-segmented-docTTTTTquery
$ wc /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented_d2q-t5_10/docs.jsonl
   20545677  5563966525 35734339979 /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented_d2q-t5_10/docs.jsonl

$ md5sum /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented_d2q-t5_10/docs.jsonl
c24261af16575b4527916676f542b793  /store/scratch/rpradeep/msmarco-v1/collections/msmarco_v1_doc_segmented_d2q-t5_10/docs.jsonl

These were generated by @ronakice on orca.

@ronakice please confirm this is correct.
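
For anyone re-checking these values on another machine, here is a minimal sketch of the same `wc` / `md5sum` check in Python, using only the standard library. The helper names `md5_of_file`, `count_lines`, and `verify_corpus` are hypothetical and not part of Anserini; the path, digest, and line count are supplied by the caller from the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks (like md5sum)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, matching the first column of wc."""
    total = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def verify_corpus(path, expected_md5, expected_lines):
    """Return True only if both the checksum and the line count match."""
    return (md5_of_file(path) == expected_md5
            and count_lines(path) == expected_lines)
```

For example, `verify_corpus(path, "c97a4c4a3f5e7c20df2783691cc7028e", 3213835)` would check a local copy of the msmarco-doc `docs.jsonl` against the values recorded above. Reading in chunks keeps memory use flat, which matters for files in the tens of gigabytes.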

ronakice commented Jan 7, 2022

Yes, all these look correct to me!

lintool commented Jan 7, 2022

I've split into separate files and repackaged as follows, on orca:

$ ls -l /store/collections/msmarco/*.tar
-rw-r--r-- 1 jimmylin jimmylin 10088448000 Jan  7 12:45 /store/collections/msmarco/msmarco-doc-docTTTTTquery.tar
-rw-r--r-- 1 jimmylin jimmylin  7821808128 Jan  7 10:58 /store/collections/msmarco/msmarco-doc-segmented-docTTTTTquery.tar
-rw-r--r-- 1 jimmylin jimmylin  6222943744 Jan  7 11:31 /store/collections/msmarco/msmarco-doc-segmented.tar
-rw-r--r-- 1 jimmylin jimmylin  8464093803 Jan  7 12:15 /store/collections/msmarco/msmarco-doc.tar

$ md5sum /store/collections/msmarco/*.tar
2f2debe5478cbf034e9c19603003060f  /store/collections/msmarco/msmarco-doc-docTTTTTquery.tar
c26e9ff07bf2ad9f377e5373b520fa04  /store/collections/msmarco/msmarco-doc-segmented-docTTTTTquery.tar
d46d4cf3fb47b6dfc50b37463dabe0a2  /store/collections/msmarco/msmarco-doc-segmented.tar
2d36ae5e632a4b75d633dcb5c5a87b82  /store/collections/msmarco/msmarco-doc.tar

Recording the checksums for future reference.

I verified the packaging by uncompressing each collection and then running:

$ cat docs-* | md5

The resulting checksums match those of the single-file corpora above.
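
The same check can be sketched in Python for environments without an `md5` binary. This is an illustration under stated assumptions, not the procedure actually used: the helper name `md5_of_concatenation` is hypothetical, the split files are assumed to be uncompressed and to match a `docs-*` pattern, and the files are hashed in Python's sorted order, which can differ from shell glob order under some locales.

```python
import glob
import hashlib


def md5_of_concatenation(pattern, chunk_size=1 << 20):
    """Hash all files matching pattern as if concatenated in sorted order."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    digest = hashlib.md5()
    for path in paths:
        with open(path, "rb") as f:
            # Feed each file into the same digest, so the result equals
            # the MD5 of the files joined end to end.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_concatenation("docs-*")` inside an unpacked collection directory should then reproduce the single-file checksum recorded earlier, provided the split preserved every byte and the ordering assumption holds.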

lintool commented Jan 7, 2022

Okay, final step: I've confirmed that the regressions run successfully on orca:

python src/main/python/run_regression.py --index --verify --search --regression msmarco-doc >& logs/log.msmarco-doc &
python src/main/python/run_regression.py --index --verify --search --regression msmarco-doc-docTTTTTquery >& logs/log.msmarco-doc-docTTTTTquery &
python src/main/python/run_regression.py --index --verify --search --regression msmarco-doc-segmented >& logs/log.msmarco-doc-segmented &
python src/main/python/run_regression.py --index --verify --search --regression msmarco-doc-segmented-docTTTTTquery >& logs/log.msmarco-doc-segmented-docTTTTTquery &

python src/main/python/run_regression.py --verify --search --regression dl19-doc
python src/main/python/run_regression.py --verify --search --regression dl19-doc-docTTTTTquery
python src/main/python/run_regression.py --verify --search --regression dl19-doc-segmented
python src/main/python/run_regression.py --verify --search --regression dl19-doc-segmented-docTTTTTquery

python src/main/python/run_regression.py --verify --search --regression dl20-doc
python src/main/python/run_regression.py --verify --search --regression dl20-doc-docTTTTTquery
python src/main/python/run_regression.py --verify --search --regression dl20-doc-segmented
python src/main/python/run_regression.py --verify --search --regression dl20-doc-segmented-docTTTTTquery

In the first block, we're actually building the indexes (hence `--index`). The DL19 and DL20 regressions reuse those same indexes, so the second and third blocks only perform retrieval.
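
The pattern in the three blocks above (index once per collection, then reuse the index for DL19 and DL20) can be sketched as a small command generator. This is only an illustration: the function name `regression_commands` is hypothetical, it uses only the flags shown in the commands above, and it omits the log redirection and backgrounding.

```python
import shlex

# The four collection variants named in the commands above.
VARIANTS = [
    "doc",
    "doc-docTTTTTquery",
    "doc-segmented",
    "doc-segmented-docTTTTTquery",
]


def regression_commands():
    """Build one argv list per regression; only msmarco-* runs build an index."""
    commands = []
    for prefix in ("msmarco", "dl19", "dl20"):
        for variant in VARIANTS:
            cmd = ["python", "src/main/python/run_regression.py"]
            if prefix == "msmarco":
                cmd.append("--index")  # DL19/DL20 reuse these indexes
            cmd += ["--verify", "--search", "--regression", f"{prefix}-{variant}"]
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in regression_commands():
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch side-effect free; each argv list could be passed to `subprocess.run` once the msmarco-* runs have finished building their indexes.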
