data #43

Open
wangxinzhe123 opened this issue Mar 28, 2022 · 6 comments
Comments

@wangxinzhe123

I want to run this code on other datasets. How can I get .run and .pair files similar to those in /data?

@seanmacavaney
Contributor

Hi, the format of these files is described here: https://github.com/Georgetown-IR-Lab/cedr#getting-started

In short, training pairs are sampled from lines like [query-id] [doc-id] and run files are the standard TREC run format: [query-id] 0 [doc-id] [rank] [score] [runtag]. The latter can be the output of various retrieval systems, and the former can just be sampled from run files (depending on what you want to train with).
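As a rough illustration of the run format described above, here is a minimal Python sketch that writes and reads lines of the form `[query-id] 0 [doc-id] [rank] [score] [runtag]`. The helper names `format_run_line` and `parse_run` are hypothetical and not part of CEDR, and the sketch assumes fields are separated by whitespace.

```python
from collections import defaultdict


def format_run_line(query_id, doc_id, rank, score, run_tag):
    """Build one TREC run line: [query-id] 0 [doc-id] [rank] [score] [runtag]."""
    return f"{query_id} 0 {doc_id} {rank} {score} {run_tag}"


def parse_run(lines):
    """Group run lines by query id (assumes whitespace-separated fields)."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _, doc_id, rank, score, _ = parts
        run[query_id].append((doc_id, int(rank), float(score)))
    return dict(run)


# Example with made-up ids and scores.
example_lines = [
    format_run_line("q1", "d3", 1, 12.5, "bm25"),
    format_run_line("q1", "d7", 2, 11.0, "bm25"),
]
example_run = parse_run(example_lines)
```

In practice the lines would come from whichever retrieval system produced the initial ranking, written one per line to a `.run` file.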

@wangxinzhe123
Author

Do the .run and .pair files need to be built manually, or are they generated automatically by running some program?

@cmacdonald
Contributor

There is also an integration plugin for CEDR using PyTerrier - see
https://github.com/terrierteam/pyterrier_bert#cedr-usage
(though it's a little more dated compared to other PyTerrier plugins now)

@seanmacavaney
Contributor

@wangxinzhe123 -- ultimately how you construct these files depends on your experimental setup. The main questions are:

  1. What results do you want CEDR to re-rank?
  2. What data do you want CEDR to sample as training data?
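One possible answer to the second question, sketched under assumptions: take the top-k documents per query from an existing run file and write them out as `[query-id] [doc-id]` pairs. The function name `top_k_pairs` and the cutoff `k` are hypothetical, and the sketch assumes ranks in the run file start at 1; whether top-k sampling suits a given experiment is a judgment call.

```python
def top_k_pairs(run_lines, k):
    """Return '[query-id] [doc-id]' strings for documents ranked within the top k.

    Assumes TREC run lines with whitespace-separated fields and 1-based ranks.
    """
    pairs = []
    for line in run_lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _, doc_id, rank, _, _ = parts
        if int(rank) <= k:
            pairs.append(f"{query_id} {doc_id}")
    return pairs


# Example with made-up ids and scores.
example_run_lines = [
    "q1 0 d3 1 12.5 bm25",
    "q1 0 d7 2 11.0 bm25",
    "q1 0 d9 3 9.8 bm25",
    "q2 0 d4 1 8.1 bm25",
]
example_pairs = top_k_pairs(example_run_lines, k=2)
```

Writing `"\n".join(example_pairs)` to a file would give a pairs file in the `[query-id] [doc-id]` layout mentioned earlier in the thread.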

@wangxinzhe123
Author

Excuse me, can you provide the index file containing the IndriBuildIndex parameters?

@seanmacavaney
Contributor

That again depends on what experiment you're running -- especially since you mention that you're running it with different datasets.

Since you brought up Indri, here's documentation on it: https://sourceforge.net/p/lemur/wiki/IndriBuildIndex%20Parameters/

I'm not very familiar with Indri, however. I'm happy to help out using PyTerrier though -- especially if you provide some details on what you're trying to do. Here's the documentation on indexing: https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html


3 participants