Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement rf #1236

Merged
merged 24 commits into from
Jun 3, 2020
Merged

Implement rf #1236

merged 24 commits into from
Jun 3, 2020

Conversation

lukuang
Copy link
Contributor

@lukuang lukuang commented May 28, 2020

In this PR I implemented relevance feedback.

@lukuang lukuang requested a review from lintool May 28, 2020 00:15
@codecov
Copy link

codecov bot commented May 28, 2020

Codecov Report

Merging #1236 into master will decrease coverage by 0.40%.
The diff coverage is 14.01%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1236      +/-   ##
============================================
- Coverage     48.18%   47.78%   -0.41%     
- Complexity      732      742      +10     
============================================
  Files           147      147              
  Lines          8563     8683     +120     
  Branches       1217     1244      +27     
============================================
+ Hits           4126     4149      +23     
- Misses         4097     4185      +88     
- Partials        340      349       +9     
Impacted Files Coverage Δ Complexity Δ
...ain/java/io/anserini/rerank/lib/AxiomReranker.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...n/java/io/anserini/rerank/lib/BM25PrfReranker.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
.../main/java/io/anserini/rerank/ScoredDocuments.java 15.58% <3.33%> (-7.82%) 4.00 <1.00> (+1.00) ⬇️
...main/java/io/anserini/search/SearchCollection.java 42.49% <13.79%> (-7.51%) 38.00 <0.00> (+2.00) ⬇️
.../main/java/io/anserini/rerank/lib/Rm3Reranker.java 52.47% <50.00%> (-1.79%) 9.00 <0.00> (ø)
src/main/java/io/anserini/search/SearchArgs.java 100.00% <100.00%> (ø) 9.00 <0.00> (ø)
...java/io/anserini/ltr/feature/CountBigramPairs.java 89.61% <0.00%> (+19.48%) 33.00% <0.00%> (+7.00%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f454cc2...2cac99a. Read the comment docs.

@@ -126,4 +131,36 @@ public static ScoredDocuments fromESDocs(SearchHits rs) {

return scoredDocs;
}

public static ScoredDocuments fromRelDocs(Map<String, Integer> queryRelDocs, IndexReader reader) throws IOException {
ScoredDocuments scoredDocs = new ScoredDocuments();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about fromQrels? and queryRelDocs -> qrels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have Rel here and other places to indicate that during the process there are only relevant documents. I think it would be better to keep that in place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need non-relevant documents for BM25?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but RF only affects the feedback methods such as rm3, and we only need relevant documents for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So shouldn't we load all the qrels, and each technique can ignore the parts they don't need? E.g., RM3 ignores all the rel grade = 0.

In terms of future proofing, we probably want to add an option of what's rel in the future?
E.g., in multi-grade, just 2 is rel, or both 1+2 are rel?

So loading all qrels seems more general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sg.

FeatureVector f = new FeatureVector();

Set<String> vocab = new HashSet<>();
int numdocs = docs.documents.length < fbDocs ? docs.documents.length : fbDocs;
int numdocs;
if (useRf){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after )

if (useRf){
numdocs = docs.documents.length;
}
else{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else {

@@ -101,6 +101,9 @@
"the top documents from the initial round ranking.")
public int rerankcutoff = 50;

@Option(name = "-rfQrels", metaVar = "[file]", usage = "qrels file used for relevance feedback")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just -qrels? Is rf redundant?
Dunno... up for discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add rf for two reasons:

  • it is future proof, in case there is another need for qrels
  • it is more meaningful and shows what it is for.

But I do not have a strong opinion on this. If you think it is unnecessary, I will change it to -qrels

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: future proofing, then -rf.qrels? Just like how we have -bm25.b? I'll leave the choice to you...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel -rfQrels is better since we do not have -bm25 corresponding to -bm25.b?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the discussion above, -rf.qrels would be better here since we will have the relevance grade parameter in the future.

@@ -134,7 +142,8 @@
final private String runTag;

private SearcherThread(IndexReader reader, SortedMap<K, Map<String, String>> topics, TaggedSimilarity taggedSimilarity,
RerankerCascade cascade, String outputPath, String runTag) {
RerankerCascade cascade, Map<String, ScoredDocuments> relScoredDocs, String outputPath,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relScoredDocs -> qrels

@@ -403,6 +439,41 @@ public void close() throws IOException {
return cascades;
}

private void readRelDocsFromQrels(String qrels) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just loadQrels?

if (!Files.exists(qrelsFilePath) || !Files.isRegularFile(qrelsFilePath) || !Files.isReadable(qrelsFilePath)) {
throw new IllegalArgumentException("Qrels file : " + qrelsFilePath + " does not exist or is not a (readable) file.");
}
Map<String, Map<String, Integer>> relDocs = new HashMap<String, Map<String, Integer>> ();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new HashMap<>() will do?

Map<String, Map<String, Integer>> relDocs = new HashMap<String, Map<String, Integer>> ();
InputStream fin = Files.newInputStream(Paths.get(qrels), StandardOpenOption.READ);
BufferedInputStream in = new BufferedInputStream(fin);
BufferedReader bRdr = new BufferedReader(new InputStreamReader(in));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we generally user reader instead of bRdr.

@lintool
Copy link
Member

lintool commented Jun 3, 2020

All regressions passed. Getting ready to merge!

@lintool lintool merged commit 59395b4 into castorini:master Jun 3, 2020
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022
* Adding prebuilt indexes regressions for HC4

* Updating Index Stats

* Adding HC4 Test Cases

* fix hc4 readme
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants