Implement rf #1236

lukuang · 2020-05-28T00:15:33Z

In this PR I implemented relevance feedback.

codecov · 2020-05-28T00:21:08Z

Codecov Report

Merging #1236 into master will decrease coverage by 0.40%.
The diff coverage is 14.01%.

@@             Coverage Diff              @@
##             master    #1236      +/-   ##
============================================
- Coverage     48.18%   47.78%   -0.41%     
- Complexity      732      742      +10     
============================================
  Files           147      147              
  Lines          8563     8683     +120     
  Branches       1217     1244      +27     
============================================
+ Hits           4126     4149      +23     
- Misses         4097     4185      +88     
- Partials        340      349       +9

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...ain/java/io/anserini/rerank/lib/AxiomReranker.java | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| ...n/java/io/anserini/rerank/lib/BM25PrfReranker.java | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| .../main/java/io/anserini/rerank/ScoredDocuments.java | `15.58% <3.33%> (-7.82%)` | `4.00 <1.00> (+1.00)` | ⬇️ |
| ...main/java/io/anserini/search/SearchCollection.java | `42.49% <13.79%> (-7.51%)` | `38.00 <0.00> (+2.00)` | ⬇️ |
| .../main/java/io/anserini/rerank/lib/Rm3Reranker.java | `52.47% <50.00%> (-1.79%)` | `9.00 <0.00> (ø)` | |
| src/main/java/io/anserini/search/SearchArgs.java | `100.00% <100.00%> (ø)` | `9.00 <0.00> (ø)` | |
| ...java/io/anserini/ltr/feature/CountBigramPairs.java | `89.61% <0.00%> (+19.48%)` | `33.00% <0.00%> (+7.00%)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f454cc2...2cac99a.

lintool · 2020-06-01T00:41:03Z

src/main/java/io/anserini/rerank/ScoredDocuments.java

@@ -126,4 +131,36 @@ public static ScoredDocuments fromESDocs(SearchHits rs) {

    return scoredDocs;
  }
+
+  public static ScoredDocuments fromRelDocs(Map<String, Integer> queryRelDocs, IndexReader reader) throws IOException {
+    ScoredDocuments scoredDocs = new ScoredDocuments();


How about fromQrels? and queryRelDocs -> qrels?

I used Rel here and in other places to indicate that only relevant documents are involved at this stage. I think it would be better to keep that naming?

Don't you need non-relevant documents for BM25?

Yes, but RF only affects the feedback methods such as rm3, and we only need relevant documents for that?

So shouldn't we load all the qrels, and each technique can ignore the parts they don't need? E.g., RM3 ignores all the rel grade = 0.

In terms of future proofing, we probably want to add an option of what's rel in the future?
E.g., in multi-grade, just 2 is rel, or both 1+2 are rel?

So loading all qrels seems more general?
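A rough sketch of that idea, assuming the qrels are already in memory as a topic -> (docid -> grade) map: every judgment is kept, and each technique picks its own threshold for what counts as relevant. The class name `QrelsFilter`, the method `relevantDocids`, and the `minGrade` parameter are hypothetical and are not taken from the PR.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Anserini: the full qrels stay loaded,
// and the caller decides which grades count as relevant.
final class QrelsFilter {
  private QrelsFilter() {}

  // Returns the docids judged for topicId whose grade is at least minGrade.
  static Set<String> relevantDocids(Map<String, Map<String, Integer>> qrels,
                                    String topicId, int minGrade) {
    Set<String> relevant = new LinkedHashSet<>();
    Map<String, Integer> judgments = qrels.get(topicId);
    if (judgments == null) {
      return relevant; // no judgments for this topic
    }
    for (Map.Entry<String, Integer> entry : judgments.entrySet()) {
      if (entry.getValue() >= minGrade) {
        relevant.add(entry.getKey());
      }
    }
    return relevant;
  }
}
```

With this shape, an RM3-style consumer could call `relevantDocids(qrels, topic, 1)` to drop grade 0, a stricter multi-grade setup could pass 2, and a method that also needs non-relevant documents could read the map directly.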

lintool · 2020-06-01T00:41:55Z

src/main/java/io/anserini/rerank/lib/Rm3Reranker.java

    FeatureVector f = new FeatureVector();

    Set<String> vocab = new HashSet<>();
-    int numdocs = docs.documents.length < fbDocs ? docs.documents.length : fbDocs;
+    int numdocs;
+    if (useRf){


space after )

lintool · 2020-06-01T00:42:04Z

src/main/java/io/anserini/rerank/lib/Rm3Reranker.java

+    if (useRf){
+      numdocs = docs.documents.length;
+    }
+    else{


lintool · 2020-06-01T00:42:50Z

src/main/java/io/anserini/search/SearchArgs.java

@@ -101,6 +101,9 @@
      "the top documents from the initial round ranking.")
  public int rerankcutoff = 50;

+  @Option(name = "-rfQrels", metaVar = "[file]", usage = "qrels file used for relevance feedback")


How about just -qrels? Is rf redundant?
Dunno... up for discussion.

I added rf for two reasons:

1. It is future-proof, in case qrels are needed for something else later.
2. It is more meaningful and shows what the option is for.

But I do not have a strong opinion on this. If you think it is unnecessary, I will change it to -qrels.

re: future proofing, then -rf.qrels? Just like how we have -bm25.b? I'll leave the choice to you...

I feel -rfQrels is better since we do not have -bm25 corresponding to -bm25.b?

According to the discussion above, -rf.qrels would be better here since we will have the relevance grade parameter in the future.

lintool · 2020-06-01T00:43:18Z

src/main/java/io/anserini/search/SearchCollection.java

@@ -134,7 +142,8 @@
    final private String runTag;

    private SearcherThread(IndexReader reader, SortedMap<K, Map<String, String>> topics, TaggedSimilarity taggedSimilarity,
-                           RerankerCascade cascade, String outputPath, String runTag) {
+                           RerankerCascade cascade, Map<String, ScoredDocuments> relScoredDocs, String outputPath, 


relScoredDocs -> qrels

lintool · 2020-06-01T00:43:46Z

src/main/java/io/anserini/search/SearchCollection.java

@@ -403,6 +439,41 @@ public void close() throws IOException {
    return cascades;
  }

+  private void readRelDocsFromQrels(String qrels) throws IOException {


just loadQrels?

lintool · 2020-06-01T00:44:12Z

src/main/java/io/anserini/search/SearchCollection.java

+    if (!Files.exists(qrelsFilePath) || !Files.isRegularFile(qrelsFilePath) || !Files.isReadable(qrelsFilePath)) {
+        throw new IllegalArgumentException("Qrels file : " + qrelsFilePath + " does not exist or is not a (readable) file.");
+    }
+    Map<String, Map<String, Integer>> relDocs = new HashMap<String, Map<String, Integer>> ();


new HashMap<>() will do?

lintool · 2020-06-01T00:44:42Z

src/main/java/io/anserini/search/SearchCollection.java

+    Map<String, Map<String, Integer>> relDocs = new HashMap<String, Map<String, Integer>> ();
+    InputStream fin = Files.newInputStream(Paths.get(qrels), StandardOpenOption.READ);
+    BufferedInputStream in = new BufferedInputStream(fin);
+    BufferedReader bRdr = new BufferedReader(new InputStreamReader(in));


I think we generally use reader instead of bRdr.
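Putting the suggestions on this method together (the `loadQrels` name, the diamond operator, and a variable called `reader`), a minimal sketch might look like the following. It assumes the standard four-column TREC qrels layout (topic, iteration, docid, grade) and skips lines that do not fit it. The class name `QrelsLoader` and the skip-on-malformed behavior are assumptions, not what the PR does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the code merged in this PR.
final class QrelsLoader {
  private QrelsLoader() {}

  // Reads a qrels file into topic -> (docid -> grade), keeping every grade.
  static Map<String, Map<String, Integer>> loadQrels(Path qrelsPath) throws IOException {
    if (!Files.isRegularFile(qrelsPath) || !Files.isReadable(qrelsPath)) {
      throw new IllegalArgumentException(
          "Qrels file " + qrelsPath + " does not exist or is not a readable file.");
    }
    Map<String, Map<String, Integer>> qrels = new HashMap<>();
    // try-with-resources closes the reader even if parsing fails.
    try (BufferedReader reader = Files.newBufferedReader(qrelsPath, StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 4) {
          continue; // assumed policy: skip blank or malformed lines
        }
        qrels.computeIfAbsent(cols[0], k -> new HashMap<>())
             .put(cols[2], Integer.parseInt(cols[3]));
      }
    }
    return qrels;
  }
}
```

`Files.newBufferedReader` replaces the three nested stream wrappers, and the try-with-resources block means there is no separate close call to forget.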

lintool · 2020-06-03T15:15:11Z

All regressions passed. Getting ready to merge!

lukuang added 7 commits May 27, 2020 12:20

initial changes

c0e36c8

initial implementation of rf

e0cbd2b

fix bug

d1174e6

fix typo and add better tags for RF

dc94493

minor bug fix

604c430

fix logic bug of when there are no relevant documents

41f5c21

fix typo

0bba119

lukuang requested a review from lintool May 28, 2020 00:15

lukuang added 11 commits May 27, 2020 22:41

fix bug

f1ed01f

initial changes

c31537c

initial implementation of rf

c535233

fix bug

26ab923

fix typo and add better tags for RF

38003ad

minor bug fix

f539cae

fix logic bug of when there are no relevant documents

00175ff

fix typo

4abb0ae

fix bug

345e494

Merge branch 'implement_rf' of github.com:lukuang/Anserini into imple…

3cfa0f2

…ment_rf

Merge branch 'master' into implement_rf

e1fadfc

lintool requested changes Jun 1, 2020

lukuang and others added 6 commits May 31, 2020 21:27

minor fixes to address CR

ae33440

rename the argument

0885285

address CR for naming and for future support of relevance grade& add …

46fac2a

…warning for missing documents

fix bug on BM25prf side of docid mismatch

d480afa

improve logic

146a86f

Merge branch 'master' into implement_rf

2cac99a

lintool merged commit 59395b4 into castorini:master Jun 3, 2020

crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022

Adding HC4 bindings and pre-built indexes (castorini#1236)

3bdeaea

* Adding prebuilt indexes regressions for HC4 * Updating Index Stats * Adding HC4 Test Cases * fix hc4 readme
