-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement rf #1236
Implement rf #1236
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1236 +/- ##
============================================
- Coverage 48.18% 47.78% -0.41%
- Complexity 732 742 +10
============================================
Files 147 147
Lines 8563 8683 +120
Branches 1217 1244 +27
============================================
+ Hits 4126 4149 +23
- Misses 4097 4185 +88
- Partials 340 349 +9
Continue to review full report at Codecov.
|
@@ -126,4 +131,36 @@ public static ScoredDocuments fromESDocs(SearchHits rs) { | |||
|
|||
return scoredDocs; | |||
} | |||
|
|||
public static ScoredDocuments fromRelDocs(Map<String, Integer> queryRelDocs, IndexReader reader) throws IOException { | |||
ScoredDocuments scoredDocs = new ScoredDocuments(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about fromQrels
? and queryRelDocs
-> qrels
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have Rel
here and other places to indicate that during the process there are only relevant documents. I think it would be better to keep that in place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need non-relevant documents for BM25?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but RF only affects the feedback methods such as rm3, and we only need relevant documents for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So shouldn't we load all the qrels, and each technique can ignore the parts they don't need? E.g., RM3 ignores all the rel grade = 0.
In terms of future proofing, we probably want to add an option of what's rel in the future?
E.g., in multi-grade, just 2 is rel, or both 1+2 are rel?
So loading all qrels seems more general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sg.
FeatureVector f = new FeatureVector(); | ||
|
||
Set<String> vocab = new HashSet<>(); | ||
int numdocs = docs.documents.length < fbDocs ? docs.documents.length : fbDocs; | ||
int numdocs; | ||
if (useRf){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after )
if (useRf){ | ||
numdocs = docs.documents.length; | ||
} | ||
else{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else {
@@ -101,6 +101,9 @@ | |||
"the top documents from the initial round ranking.") | |||
public int rerankcutoff = 50; | |||
|
|||
@Option(name = "-rfQrels", metaVar = "[file]", usage = "qrels file used for relevance feedback") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just -qrels
? Is rf redundant?
Dunno... up for discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add rf
for two reasons:
- it is future proof, in case there is another need for qrels
- it is more meaningful and shows what it is for.
But I do not have a strong opinion on this. If you think it is unnecessary, I will change it to -qrels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: future proofing, then -rf.qrels
? Just like how we have -bm25.b
? I'll leave the choice to you...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel -rfQrels
is better since we do not have -bm25
corresponding to -bm25.b
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the discussion above, -rf.qrels
would be better here since we will have the relevance grade parameter in the future.
@@ -134,7 +142,8 @@ | |||
final private String runTag; | |||
|
|||
private SearcherThread(IndexReader reader, SortedMap<K, Map<String, String>> topics, TaggedSimilarity taggedSimilarity, | |||
RerankerCascade cascade, String outputPath, String runTag) { | |||
RerankerCascade cascade, Map<String, ScoredDocuments> relScoredDocs, String outputPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relScoredDocs
-> qrels
@@ -403,6 +439,41 @@ public void close() throws IOException { | |||
return cascades; | |||
} | |||
|
|||
private void readRelDocsFromQrels(String qrels) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just loadQrels
?
if (!Files.exists(qrelsFilePath) || !Files.isRegularFile(qrelsFilePath) || !Files.isReadable(qrelsFilePath)) { | ||
throw new IllegalArgumentException("Qrels file : " + qrelsFilePath + " does not exist or is not a (readable) file."); | ||
} | ||
Map<String, Map<String, Integer>> relDocs = new HashMap<String, Map<String, Integer>> (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new HashMap<>()
will do?
Map<String, Map<String, Integer>> relDocs = new HashMap<String, Map<String, Integer>> (); | ||
InputStream fin = Files.newInputStream(Paths.get(qrels), StandardOpenOption.READ); | ||
BufferedInputStream in = new BufferedInputStream(fin); | ||
BufferedReader bRdr = new BufferedReader(new InputStreamReader(in)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we generally user reader
instead of bRdr
.
All regressions passed. Getting ready to merge! |
* Adding prebuilt indexes regressions for HC4 * Updating Index Stats * Adding HC4 Test Cases * fix hc4 readme
In this PR I implemented relevance feedback.