File structure stated in msmarco_passage.py is not aligned with downloaded top1000.dev.tar.gz #209

yuenherny · 2022-09-03T08:19:33Z

Describe the bug
In msmarco_passage.py, lines 199-204, the dev/small dataset is defined as:

subsets['dev/small'] = Dataset(
        collection,
        TsvQueries(Cache(TarExtract(dlc['collectionandqueries'], 'queries.dev.small.tsv'), base_path/'dev/small/queries.tsv'), namespace='msmarco', lang='en'),
        TrecQrels(Cache(TarExtract(dlc['collectionandqueries'], 'qrels.dev.small.tsv'), base_path/'dev/small/qrels'), QRELS_DEFS),
        TrecScoredDocs(Cache(ExtractQidPid(TarExtract(dlc['dev/scoreddocs'], 'top1000.dev')), base_path/'dev/ms.run')),
    )

I looked at the structure of collectionandqueries.tar.gz, and it matches what is stated above. However, the structure is different for top1000.dev.tar.gz:

top1000.dev.tar.gz
|-- top1000.dev.tar
     |-- top1000.dev

In the downloaded tar.gz file, there is no dev/scoreddocs; instead, top1000.dev sits inside top1000.dev.tar.

Affected dataset(s)

msmarco-passage

To Reproduce
Steps to reproduce the behavior:

1. Download top1000.dev.tar.gz from here.
2. Open the file in 7-Zip or a similar tool.

Expected behavior
Based on line 203 of msmarco_passage.py, I would expect the following structure:

top1000.dev.tar.gz
|-- dev
     |-- scoreddocs.tar
          |-- top1000.dev

or

top1000.dev.tar.gz
|-- dev/scoreddocs.tar
     |-- top1000.dev

Additional context
I would also appreciate a symlink tutorial for Windows users. Judging by the bugs I have run into, I guess this library is written primarily on (and for) Linux.


seanmacavaney · 2022-09-03T08:42:56Z

Thanks for the report. I'm not able to reproduce it when following the instructions provided by the software:

Specifically:

When requesting scoreddocs of msmarco-passage/dev/small, I get the following message as it starts downloading:

$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
...

If I stop it there, download the file to the specified location, and re-run, it works without a hitch:

$ curl https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz > /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
188714 Q0 1000052 0 0.0 run
1082792 Q0 1000084 0 0.0 run
995526 Q0 1000094 0 0.0 run
199776 Q0 1000115 0 0.0 run
660957 Q0 1000115 0 0.0 run
...

(The same would happen if using the Python API, rather than the CLI.)
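As a rough illustration of the Python API route, the sketch below loads the same dataset ID and prints the first few scored docs. The helper name `print_first_scoreddocs` and its `limit` parameter are made up for this example; `ir_datasets.load` and `scoreddocs_iter()` are the library calls it relies on, and the package must be installed for the function to run.

```python
from itertools import islice


def print_first_scoreddocs(dataset_id="msmarco-passage/dev/small", limit=5):
    """Print the first few scored docs of a dataset via the Python API."""
    # Imported here so that defining this helper does not require the package.
    import ir_datasets

    dataset = ir_datasets.load(dataset_id)
    # Iterating the scoreddocs makes the library fetch and process the
    # source file if it is not already present locally.
    for scoreddoc in islice(dataset.scoreddocs_iter(), limit):
        print(scoreddoc.query_id, scoreddoc.doc_id, scoreddoc.score)
```

Calling `print_first_scoreddocs()` should behave like the CLI command above, including the [INFO] message about where a local copy can be placed.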

It looks like above you're doing more of the extraction yourself, which I generally would not advise. First, it means that the downloads are not verified, so if there was a problem downloading the data, you may inadvertently be working with an incomplete or incorrect set of the data. Second, you may not perform the same pre-processing stages as the software, which can cause problems.
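To make the verification point concrete, here is a minimal sketch of checking a manually downloaded file against a known digest. It assumes an MD5 digest; the hash algorithm and expected value that the software actually uses are not stated in this thread, so `expected_md5` is a placeholder to be supplied by the reader.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_complete(path, expected_md5):
    """True if the file exists and its MD5 matches the expected digest."""
    path = Path(path)
    return path.is_file() and md5_of_file(path) == expected_md5.lower()
```

A truncated or corrupted download would fail this check, which is the kind of problem that goes unnoticed when the files are unpacked by hand.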

In most cases, I'd suggest just letting the software download the files automatically for you. It really only makes sense to copy/symlink them if you already have a copy and don't want to bother waiting for the download. And when you do this, it's best to follow the instructions given by the software about where to place the files.

In this case, TarExtract transparently performs gzip decompression, in addition to extracting the file. It then performs additional processing via ExtractQidPid to convert the file into a standard file format.
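The two layers can be illustrated with the standard library alone. The sketch below is not the library's implementation: it builds a tiny stand-in archive in memory (the rows are made up, assuming the four tab-separated columns qid, pid, query text, passage text), shows that gunzipping yields a plain tar stream (likely what an archive viewer displays as top1000.dev.tar), and then keeps only the first two columns, roughly in the spirit of ExtractQidPid. Note that 'dev/scoreddocs' in the dataset definition appears to be a lookup key for the download rather than a path inside the archive.

```python
import gzip
import io
import tarfile

# Build a tiny stand-in for top1000.dev.tar.gz in memory (made-up rows).
rows = (
    b"188714\t1000052\tsome query\tsome passage\n"
    b"1082792\t1000084\tanother query\tanother passage\n"
)
tar_bytes = io.BytesIO()
with tarfile.open(fileobj=tar_bytes, mode="w") as tar:
    info = tarfile.TarInfo(name="top1000.dev")
    info.size = len(rows)
    tar.addfile(info, io.BytesIO(rows))
targz_bytes = gzip.compress(tar_bytes.getvalue())

# Layer 1: the gzip layer wraps a single, uncompressed tar stream.
inner_tar = gzip.decompress(targz_bytes)

# Layer 2: mode "r:gz" handles both layers, so the member is addressed directly.
with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode="r:gz") as tar:
    member_names = tar.getnames()
    extracted = tar.extractfile("top1000.dev").read()


def extract_qid_pid(data):
    """Keep only the first two tab-separated columns of each line."""
    pairs = []
    for line in data.splitlines():
        qid, pid, _query, _passage = line.split(b"\t")
        pairs.append((qid.decode(), pid.decode()))
    return pairs


pairs = extract_qid_pid(extracted)
print(member_names, pairs)
```

Seen this way, the archive contains top1000.dev at its top level, which matches the 'top1000.dev' argument passed to TarExtract.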

> I would also appreciate a symlink tutorial for Windows users. Judging by the bugs I have run into, I guess this library is written primarily on (and for) Linux.

There's a GitHub Action that runs tests on Windows, but I don't have a Windows machine myself to test things on, nor am I experienced enough with Windows systems in general to provide advice. I appreciate the reports to help improve Windows support, and would welcome contributions that improve the experience on Windows!
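For anyone placing a local copy by hand, one possible cross-platform approach is sketched below. It is untested against this library and only uses the standard library: it tries to create a symlink and falls back to a plain copy when that fails, which can happen on Windows when the process lacks the privilege to create symlinks (for example without Developer Mode or administrator rights). The helper name `link_or_copy` is hypothetical; the target should be the path printed in the [INFO] message.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(source, target):
    """Make `target` point at `source`: symlink if possible, else copy."""
    source, target = Path(source).resolve(), Path(target)
    if target.exists() or target.is_symlink():
        # Refuse to overwrite anything that is already in place.
        raise FileExistsError(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.symlink(source, target)
        return "symlink"
    except OSError:
        # Symlink creation was not permitted or not supported; copy instead.
        shutil.copy2(source, target)
        return "copy"
```

The return value only reports which strategy was used; either way the file ends up readable at the target path.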

yuenherny added the bug Something isn't working label Sep 3, 2022

seanmacavaney mentioned this issue Sep 3, 2022

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs #208

Open
