Refactor to enable cleaner Pyserini bindings (#2584)
+ dense searchers `batch_search`: change method signature to take queries, then qids, to be consistent with `SimpleSearcher` and `SimpleImpactSearcher`
+ dense searchers: refactor `ThreadPoolExecutor` to use try-with-resources, see #2579
lintool authored Sep 6, 2024
1 parent 0660fbf commit 05613e3
Showing 21 changed files with 450 additions and 169 deletions.
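The try-with-resources refactor leans on `ExecutorService` implementing `AutoCloseable` (JDK 19+): the implicit `close()` at the end of the try block shuts the pool down and waits for submitted tasks to finish, replacing the manual `shutdown()`/`awaitTermination()` loop. A minimal sketch of the pattern, using a hypothetical `search` stand-in rather than Anserini's actual searcher:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchSketch {
  // Hypothetical stand-in for per-query work; not Anserini's actual search call.
  static String search(String query) {
    return "results for " + query;
  }

  public static void main(String[] args) {
    String[] queries = {"q1", "q2", "q3"};
    Map<String, String> results = new ConcurrentSkipListMap<>();

    // close() is called implicitly when the try block exits; it calls
    // shutdown() and blocks until all submitted tasks have completed.
    try (ExecutorService executor = Executors.newFixedThreadPool(4)) {
      for (String q : queries) {
        executor.execute(() -> results.put(q, search(q)));
      }
    }

    // All tasks have completed by this point, so this prints 3.
    System.out.println(results.size());
  }
}
```

Because `close()` blocks until the pool drains, the results map is guaranteed complete once the try block exits, with no explicit termination loop in the happy path.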
7 changes: 4 additions & 3 deletions docs/fatjar-regressions/fatjar-regressions-v0.37.0.md
@@ -36,19 +36,20 @@ curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v2.1-doc/search?quer

The json results are the same as the output of the `-outputRerankerRequests` option in `SearchCollection`, described below for TREC 2024 RAG.
Use the `hits` parameter to specify the number of hits to return, e.g., `hits=1000` to return the top 1000 hits.
Switch to `msmarco-v2.1-doc-segmented` in the route to query the segmented docs instead.
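For programmatic access, the request URL follows directly from the `curl` example above; a sketch that assembles it (the query text here is a hypothetical example, and the actual HTTP call via `java.net.http.HttpClient` is left out):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RestQuery {
  public static void main(String[] args) {
    String index = "msmarco-v2.1-doc";        // or msmarco-v2.1-doc-segmented
    String query = "how do turtles breathe";  // hypothetical query
    int hits = 1000;                          // number of hits to return

    // URL-encode the query before splicing it into the route.
    URI uri = URI.create(String.format(
        "http://localhost:8081/api/v1.0/indexes/%s/search?query=%s&hits=%d",
        index, URLEncoder.encode(query, StandardCharsets.UTF_8), hits));

    System.out.println(uri);
    // Pass the URI to java.net.http.HttpClient to fetch the JSON results.
  }
}
```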

Details of the built-in webapp and REST API can be found [here](../rest-api.md).

## TREC 2024 RAG

For the TREC 2024 RAG Track, we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants).
For the [TREC 2024 RAG Track](https://trec-rag.github.io/), we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants).

❗ Beware, you need lots of space to run these experiments.
The `msmarco-v2.1-doc` prebuilt index is 63 GB uncompressed.
The `msmarco-v2.1-doc-segmented` prebuilt index is 84 GB uncompressed.
Both indexes will be downloaded automatically.

This release of Anserini comes with the test topic for the TREC 2024 RAG track (`-topics rag24.test`).
This release of Anserini comes with bindings for the test topics for the TREC 2024 RAG track (`-topics rag24.test`).
To generate jsonl output containing the raw documents that can be reranked and further processed, use the `-outputRerankerRequests` option to specify an output file.
For example:

@@ -61,7 +62,7 @@ java -cp $ANSERINI_JAR io.anserini.search.SearchCollection \
-outputRerankerRequests $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl
```

And the output looks something like:
And the output looks something like (pipe through `jq` to pretty-print):

```bash
$ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl | jq
@@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.gz \
-topicReader JsonStringVector \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.txt \
-hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
@@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.tsv.gz \
-topicReader TsvString \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-bioasq.test.txt \
-encoder BgeBaseEn15 -hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-encoder BgeBaseEn15 -hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
@@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.gz \
-topicReader JsonStringVector \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-bioasq.test.bge-base-en-v1.5.jsonl.txt \
-hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
@@ -58,7 +58,7 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-bioasq.test.tsv.gz \
-topicReader TsvString \
-output runs/run.beir-v1.0.0-bioasq.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-bioasq.test.txt \
-encoder BgeBaseEn15 -hits 1000 -efSearch 1000 -removeQuery -threads 16 &
-encoder BgeBaseEn15 -hits 1000 -efSearch 2000 -removeQuery -threads 16 &
```

Evaluation can be performed using `trec_eval`:
164 changes: 163 additions & 1 deletion src/main/java/io/anserini/index/IndexInfo.java

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/main/java/io/anserini/search/BaseSearchArgs.java
@@ -21,7 +21,7 @@
/**
* This is the base class that holds common arguments for configuring searchers. Note that, explicitly, there are no
* arguments that are specific to the retrieval implementation (e.g., for HNSW searchers), and that there are no
* arguments that define queries and outputs (which are to be defined by sub-classes that may call the searcher in
* arguments that define queries and outputs (which are to be defined by subclasses that may call the searcher in
* different ways).
*/
public class BaseSearchArgs {
132 changes: 88 additions & 44 deletions src/main/java/io/anserini/search/FlatDenseSearcher.java
@@ -129,78 +129,122 @@ public FlatDenseSearcher(Args args) {
}
}

public SortedMap<K, ScoredDoc[]> batch_search(List<K> qids, List<String> queries, int hits) {
/**
* Searches the collection in batch using multiple threads.
*
* @param queries list of queries
* @param qids list of unique query ids
* @param k number of hits
* @param threads number of threads
* @return a map of query id to search results
*/
public SortedMap<K, ScoredDoc[]> batch_search(List<String> queries, List<K> qids, int k, int threads) {
final SortedMap<K, ScoredDoc[]> results = new ConcurrentSkipListMap<>();
final ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(args.threads);
final AtomicInteger cnt = new AtomicInteger();

final long start = System.nanoTime();
assert qids.size() == queries.size();
for (int i=0; i<qids.size(); i++) {
K qid = qids.get(i);
String queryString = queries.get(i);

// This is the per-query execution, in parallel.
executor.execute(() -> {
try {
results.put(qid, search(qid, queryString, hits));
} catch (IOException e) {
throw new CompletionException(e);
}

int n = cnt.incrementAndGet();
if (n % 100 == 0) {
LOG.info(String.format("%d queries processed", n));
}
});
}

executor.shutdown();
try(ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(threads)) {
assert qids.size() == queries.size();
for (int i = 0; i < qids.size(); i++) {
K qid = qids.get(i);
String queryString = queries.get(i);

try {
// Wait for existing tasks to terminate.
while (!executor.awaitTermination(1, TimeUnit.MINUTES));
} catch (InterruptedException ie) {
// (Re-)Cancel if current thread also interrupted.
executor.shutdownNow();
// Preserve interrupt status.
Thread.currentThread().interrupt();
// This is the per-query execution, in parallel.
executor.execute(() -> {
try {
results.put(qid, search(qid, queryString, k));
} catch (IOException e) {
throw new CompletionException(e);
}

int n = cnt.incrementAndGet();
if (n % 100 == 0) {
LOG.info(String.format("%d queries processed", n));
}
});
}

executor.shutdown();

try {
// Wait for existing tasks to terminate.
while (!executor.awaitTermination(1, TimeUnit.MINUTES));
} catch (InterruptedException ie) {
// (Re-)Cancel if current thread also interrupted.
executor.shutdownNow();
// Preserve interrupt status.
Thread.currentThread().interrupt();
}
}
final long durationMillis = TimeUnit.MILLISECONDS.convert(System.nanoTime() - start, TimeUnit.NANOSECONDS);

LOG.info(queries.size() + " queries processed in " +
DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss") +
LOG.info("{} queries processed in {}{}", queries.size(),
DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss"),
String.format(" = ~%.2f q/s", queries.size() / (durationMillis / 1000.0)));

return results;
}

public ScoredDoc[] search(float[] queryFloat, int hits) throws IOException {
return search(null, queryFloat, hits);
/**
* Searches the collection with a query vector.
*
* @param query query vector
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(float[] query, int k) throws IOException {
return search(null, query, k);
}

public ScoredDoc[] search(@Nullable K qid, float[] queryFloat, int hits) throws IOException {
KnnFloatVectorQuery query = new KnnFloatVectorQuery(Constants.VECTOR, queryFloat, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(query, hits, BREAK_SCORE_TIES_BY_DOCID, true);
/**
* Searches the collection with a query vector.
*
* @param qid query id
* @param query query vector
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(@Nullable K qid, float[] query, int k) throws IOException {
KnnFloatVectorQuery vectorQuery = new KnnFloatVectorQuery(Constants.VECTOR, query, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(vectorQuery, k, BREAK_SCORE_TIES_BY_DOCID, true);

return super.processLuceneTopDocs(qid, topDocs);
}

public ScoredDoc[] search(String queryString, int hits) throws IOException {
return search(null, queryString, hits);
/**
* Searches the collection with a string query that will be encoded by the underlying encoder.
*
* @param query query
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(String query, int k) throws IOException {
return search(null, query, k);
}

public ScoredDoc[] search(@Nullable K qid, String queryString, int hits) throws IOException {
/**
* Searches the collection with a string query that will be encoded by the underlying encoder.
*
* @param qid query id
* @param query query
* @param k number of hits
* @return array of search results
* @throws IOException if error encountered during search
*/
public ScoredDoc[] search(@Nullable K qid, String query, int k) throws IOException {
if (encoder != null) {
try {
return search(qid, encoder.encode(queryString), hits);
return search(qid, encoder.encode(query), k);
} catch (OrtException e) {
throw new RuntimeException("Error encoding query.");
}
}

KnnFloatVectorQuery query = generator.buildQuery(Constants.VECTOR, queryString, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(query, hits, BREAK_SCORE_TIES_BY_DOCID, true);
KnnFloatVectorQuery vectorQuery = generator.buildQuery(Constants.VECTOR, query, DUMMY_EF_SEARCH);
TopDocs topDocs = getIndexSearcher().search(vectorQuery, k, BREAK_SCORE_TIES_BY_DOCID, true);

return super.processLuceneTopDocs(qid, topDocs);
}
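The reordered `batch_search` signature takes queries first, then qids, matching `SimpleSearcher` and `SimpleImpactSearcher`. A sketch against a stub, purely to illustrate the argument order; this is not Anserini's implementation:

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class SignatureSketch {
  // Stub mirroring only the new argument order: queries, qids, k, threads.
  static SortedMap<String, String[]> batch_search(
      List<String> queries, List<String> qids, int k, int threads) {
    assert queries.size() == qids.size();
    SortedMap<String, String[]> results = new TreeMap<>();
    for (int i = 0; i < qids.size(); i++) {
      // One fake hit per query, keyed by qid, as the real method keys by qid.
      results.put(qids.get(i), new String[] {"hit for: " + queries.get(i)});
    }
    return results;
  }

  public static void main(String[] args) {
    // Queries before qids, consistent with SimpleSearcher.
    SortedMap<String, String[]> results =
        batch_search(List.of("query one", "query two"), List.of("q1", "q2"), 10, 4);
    System.out.println(results.firstKey());  // prints q1
  }
}
```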
