File structure stated in msmarco_passage.py is not aligned with downloaded top1000.dev.tar.gz #209

Open
yuenherny opened this issue Sep 3, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@yuenherny

Describe the bug
In msmarco_passage.py, lines 199-204, the dev/small dataset is defined as:

subsets['dev/small'] = Dataset(
        collection,
        TsvQueries(Cache(TarExtract(dlc['collectionandqueries'], 'queries.dev.small.tsv'), base_path/'dev/small/queries.tsv'), namespace='msmarco', lang='en'),
        TrecQrels(Cache(TarExtract(dlc['collectionandqueries'], 'qrels.dev.small.tsv'), base_path/'dev/small/qrels'), QRELS_DEFS),
        TrecScoredDocs(Cache(ExtractQidPid(TarExtract(dlc['dev/scoreddocs'], 'top1000.dev')), base_path/'dev/ms.run')),
    )

I looked at the structure of collectionandqueries.tar.gz, and it matches what is stated above. However, the structure of top1000.dev.tar.gz is different:

top1000.dev.tar.gz
|-- top1000.dev.tar
     |-- top1000.dev

In the downloaded tar.gz file there is no dev/scoreddocs entry; top1000.dev sits inside top1000.dev.tar instead.
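For anyone who wants to check the archive layout without a GUI tool, here is a minimal sketch using Python's standard `tarfile` module. It builds a tiny in-memory stand-in archive rather than reading the real download; `list_members` and `make_demo_archive` are hypothetical helper names, not part of ir_datasets. Note that archive viewers such as 7-Zip show a .tar.gz as two layers (a gzip stream wrapping a .tar), whereas `tarfile` reads both layers in one step.

```python
import io
import tarfile


def list_members(fileobj):
    """Return the member names of a (possibly compressed) tar archive."""
    # Mode "r:*" lets tarfile detect gzip/bz2/xz compression automatically.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        return archive.getnames()


def make_demo_archive():
    """Build a tiny in-memory .tar.gz holding one file named 'top1000.dev'."""
    payload = b"demo\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="top1000.dev")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


print(list_members(make_demo_archive()))  # ['top1000.dev']
```

To inspect a real file, open it in binary mode and pass the handle, for example `list_members(open("top1000.dev.tar.gz", "rb"))`.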

Affected dataset(s)

  • msmarco-passage

To Reproduce
Steps to reproduce the behavior:

  1. Download top1000.dev.tar.gz from here
  2. Open the file in 7-Zip or a similar archive tool.

Expected behavior
Based on line 203 of msmarco_passage.py, I would expect the following structure:

top1000.dev.tar.gz
|-- dev
     |-- scoreddocs.tar
          |-- top1000.dev

or

top1000.dev.tar.gz
|-- dev/scoreddocs.tar
     |-- top1000.dev

Additional context
I would also appreciate a symlink tutorial for Windows users. Judging by the bugs I have run into, I guess this library is primarily written on (and for) Linux.

@yuenherny added the bug label on Sep 3, 2022
@seanmacavaney
Collaborator

Thanks for the report. I'm not able to reproduce it when following the instructions provided by the software. Specifically:

When requesting scoreddocs of msmarco-passage/dev/small, I get the following message as it starts downloading:

$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
...

If I stop it there, download the file to the specified location, and re-run, it works without a hitch:

$ curl https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz > /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
188714 Q0 1000052 0 0.0 run
1082792 Q0 1000084 0 0.0 run
995526 Q0 1000094 0 0.0 run
199776 Q0 1000115 0 0.0 run
660957 Q0 1000115 0 0.0 run
...

(The same would happen if using the Python API, rather than the CLI.)
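As a rough sketch of the Python API route, the snippet below loads the same dataset and collects the first few scored documents. `first_scoreddocs` is a hypothetical helper name; `ir_datasets.load` and `scoreddocs_iter` are the library's documented entry points. The import is done lazily inside the function so the definition itself does not require the package to be installed.

```python
from itertools import islice


def first_scoreddocs(dataset_id="msmarco-passage/dev/small", limit=5):
    """Return the first `limit` (query_id, doc_id, score) tuples of a dataset."""
    # Imported here so that defining the function does not need ir_datasets.
    import ir_datasets

    dataset = ir_datasets.load(dataset_id)
    # scoreddocs_iter() yields records with query_id, doc_id and score fields.
    return [
        (scored.query_id, scored.doc_id, scored.score)
        for scored in islice(dataset.scoreddocs_iter(), limit)
    ]
```

Calling `first_scoreddocs()` triggers the download on first use, just as the CLI does.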

It looks like you're doing more of the extraction yourself above, which I generally would not advise. First, it means that the downloads are not verified, so if there was a problem downloading the data, you may inadvertently be working with an incomplete or incorrect copy of the data. Second, you may not perform the same pre-processing stages as the software, which can cause problems.
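To illustrate what verifying a download can look like in general, here is a minimal sketch that hashes a file in chunks and compares the digest with an expected value. The thread does not say how the software verifies its downloads, so the choice of SHA-256 and the helper names `file_digest_hex` and `is_intact` are assumptions for illustration only.

```python
import hashlib


def file_digest_hex(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large downloads fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex, algorithm="sha256"):
    """Return True when the file's digest matches the expected hex digest."""
    return file_digest_hex(path, algorithm) == expected_hex.lower()
```

A manually downloaded file that was truncated mid-transfer would fail a check like this, which is the failure mode described above.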

In most cases, I'd suggest just letting the software download the files automatically for you. It really only makes sense to copy/symlink them if you already have a copy and don't want to bother waiting for the download. And when you do this, it's best to follow the instructions given by the software about where to place the files.

In this case, TarExtract transparently performs gzip decompression, in addition to extracting the file. It then performs additional processing via ExtractQidPid to convert the file into a standard file format.
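The two stages described above can be approximated with the standard library as follows. This is not the library's implementation of TarExtract or ExtractQidPid, only a sketch of the idea; it assumes each line of top1000.dev holds tab-separated fields whose first two are the query ID and passage ID. `extract_qid_pid` and `make_demo_archive` are hypothetical names, and the query and passage text in the demo is placeholder data.

```python
import io
import tarfile


def extract_qid_pid(archive_fileobj, member_name):
    """Yield (qid, pid) pairs from a tab-separated member of a tar archive."""
    # "r:*" handles the gzip layer and the tar layer in a single step.
    with tarfile.open(fileobj=archive_fileobj, mode="r:*") as archive:
        member = archive.extractfile(member_name)
        for raw_line in member:
            fields = raw_line.decode("utf8").rstrip("\n").split("\t")
            # Assumed layout: qid, pid, then text fields that are dropped.
            yield fields[0], fields[1]


def make_demo_archive():
    """Build an in-memory .tar.gz with one placeholder line of data."""
    payload = b"188714\t1000052\tplaceholder query\tplaceholder passage\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="top1000.dev")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


print(list(extract_qid_pid(make_demo_archive(), "top1000.dev")))
# [('188714', '1000052')]
```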

"Also appreciate if there is a symlink tutorial for Windows user. Looking at the bugs I experienced, I guess this library is primarily written in (and for) Linux OS."

There's a GitHub Action that runs tests on Windows, but I don't have a Windows machine myself to test things on, nor am I experienced enough with Windows systems in general to provide advice. I appreciate the reports to help improve Windows support, and would welcome contributions that improve the experience on Windows!
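In the meantime, a platform-neutral way to place an existing copy at the location the software prints is sketched below. It is untested against ir_datasets on Windows; `link_or_copy` is a hypothetical helper. On Windows, `os.symlink` needs Developer Mode or the symlink privilege, so the sketch falls back to copying when the link cannot be created.

```python
import os
import shutil


def link_or_copy(source, target):
    """Place `source` at `target` as a symlink, or as a copy if that fails.

    Returns "symlink" or "copy" to report which method was used.
    """
    target_dir = os.path.dirname(os.path.abspath(target))
    os.makedirs(target_dir, exist_ok=True)
    try:
        os.symlink(os.path.abspath(source), target)
        return "symlink"
    except (OSError, NotImplementedError):
        # Raised when links are not permitted (or target exists): copy instead.
        shutil.copy2(source, target)
        return "copy"
```

Here `target` would be the path shown in the "[INFO] If you have a local copy of ..." message, and `source` the file you already downloaded.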
