Thanks for the report. I'm not able to reproduce it when following the instructions provided by the software:
Specifically:
When requesting scoreddocs of msmarco-passage/dev/small, I get the following message as it starts downloading:
$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
...
If I stop it there, download the file to the specified location, and re-run, it works without a hitch:
$ curl https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz > /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
188714 Q0 1000052 0 0.0 run
1082792 Q0 1000084 0 0.0 run
995526 Q0 1000094 0 0.0 run
199776 Q0 1000115 0 0.0 run
660957 Q0 1000115 0 0.0 run
...
(The same would happen if using the Python API, rather than the CLI.)
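For reference, a minimal sketch of the Python API equivalent of the CLI call above. It assumes ir_datasets is installed; the helper name first_scoreddocs and the default of 5 records are just for illustration and are not part of the library.

```python
from itertools import islice


def first_scoreddocs(dataset_id="msmarco-passage/dev/small", n=5):
    """Return the first n scored documents as (query_id, doc_id, score) tuples."""
    # Imported here so that defining this helper does not require the package.
    import ir_datasets

    dataset = ir_datasets.load(dataset_id)
    # Iterating scoreddocs triggers the download, or reuses a file already
    # placed (or symlinked) at the location given in the [INFO] message.
    scoreddocs = islice(dataset.scoreddocs_iter(), n)
    return [(sd.query_id, sd.doc_id, sd.score) for sd in scoreddocs]
```

Calling first_scoreddocs() should show the same kind of [INFO] message about the download location the first time it runs.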
It looks like above you're doing more of the extraction yourself, which I generally would not advise. First, it means that the downloads are not verified, so if there was a problem downloading the data, you may inadvertently be working with an incomplete or incorrect set of the data. Second, you may not perform the same pre-processing stages as the software, which can cause problems.
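To illustrate what verifying a download means in general, here is a generic sketch that compares a file's hash against a known value. This is not the library's own verification code; the function names, the choice of SHA-256, and the expected hash are all assumptions for the example.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest equals the expected hex string."""
    return file_digest(path, algorithm) == expected_hex.lower()
```

A truncated or corrupted download would produce a different digest, which is the kind of problem that goes unnoticed when files are extracted by hand.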
In most cases, I'd suggest just letting the software download the files automatically for you. It really only makes sense to copy/symlink them if you already have a copy and don't want to bother waiting for the download. And when you do this, it's best to follow the instructions given by the software about where to place the files.
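If you do already have a local copy, one platform-neutral way to place it is Python's os.symlink, which also works on Windows when the account is allowed to create symlinks (for example with Developer Mode enabled or when running as administrator). The sketch below is untested on Windows; link_or_copy is a hypothetical helper, and target_path should be the exact path printed in the [INFO] message.

```python
import os
import shutil


def link_or_copy(existing_file, target_path):
    """Make target_path point at existing_file; copy if symlinks are unavailable."""
    if os.path.lexists(target_path):
        # Never overwrite something the software (or you) already placed there.
        raise FileExistsError(target_path)
    os.makedirs(os.path.dirname(os.path.abspath(target_path)), exist_ok=True)
    try:
        os.symlink(os.path.abspath(existing_file), target_path)
        return "symlink"
    except (OSError, NotImplementedError):
        # For example, Windows without the privilege to create symlinks.
        shutil.copyfile(existing_file, target_path)
        return "copy"
```

Falling back to a copy costs disk space but leaves the file where the software expects it either way.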
In this case, TarExtract transparently performs gzip decompression, in addition to extracting the file. It then performs additional processing via ExtractQidPid to convert the file into a standard file format.
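As a rough illustration of those two stages (not the actual TarExtract or ExtractQidPid code), the sketch below reads a member straight out of a .tar.gz and keeps only the query and passage IDs. It assumes the member is named top1000.dev and is tab-separated with the query ID and passage ID in the first two columns; the function names are made up, and the output line simply mirrors the format shown above.

```python
import io
import tarfile


def iter_qid_pid(tar_gz_file, member_name="top1000.dev"):
    """Yield (query_id, passage_id) pairs from a TSV member of a .tar.gz archive."""
    # Mode "r:gz" handles gzip decompression and tar reading in one step,
    # so there is no intermediate .tar file on disk.
    with tarfile.open(fileobj=tar_gz_file, mode="r:gz") as archive:
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file in the archive")
        for line in io.TextIOWrapper(member, encoding="utf-8"):
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 2:
                yield cols[0], cols[1]


def to_run_line(query_id, passage_id):
    """Format a pair like the exported lines above (rank 0, score 0.0)."""
    return f"{query_id} Q0 {passage_id} 0 0.0 run"
```

Decompressing only the gzip layer by hand leaves top1000.dev inside top1000.dev.tar, which is the structure described in the report.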
Also appreciate if there is a symlink tutorial for Windows user. Looking at the bugs I experienced, I guess this library is primarily written in (and for) Linux OS.
There's a GitHub Action that runs tests for Windows, but I don't have a Windows machine myself to test stuff on, nor am I experienced enough with Windows systems in general to provide advice. I appreciate the reports to help improve Windows support, and would welcome contributions that improve the experience on Windows!
Describe the bug
In msmarco_passage.py lines 199-204, the dev/small dataset was:
I took a look at the structure in collectionandqueries.tar.gz and it matches what is stated above. However, the structure is different for top1000.dev.tar.gz:
In the downloaded tar.gz file, there was no dev/scoreddocs, and the top1000.dev was kept within top1000.dev.tar instead.
Affected dataset(s)
msmarco-passage
To Reproduce
Steps to reproduce the behavior:
top1000.dev.tar.gz from here
Expected behavior
Following what was stated at msmarco_passage.py line 203, I would expect the following structure:
or
Additional context
Also appreciate if there is a symlink tutorial for Windows user. Looking at the bugs I experienced, I guess this library is primarily written in (and for) Linux OS.