More onboarding doc updates #2151

Merged: 11 commits, Jul 21, 2023
18 changes: 16 additions & 2 deletions docs/experiments-msmarco-passage.md
@@ -14,6 +14,17 @@ If you're a Waterloo student traversing the [onboarding path](https://github.com
+ Be able to evaluate the retrieved results above.
+ Understand the MRR metric.

What's Anserini?
Well, it's the repo that you're in right now.
Anserini is a toolkit (in Java) for reproducible information retrieval research built on the [Lucene search library](https://lucene.apache.org/).
Lucene also provides the core components of the popular [Elasticsearch](https://www.elastic.co/) platform.

Think of it this way: Lucene provides a "kit of parts".
Elasticsearch provides an "assembly of parts" targeted at production search applications, with a REST-centric API.
Anserini provides an alternative way of composing the same core components together, targeted at information retrieval researchers.
By building on Lucene, Anserini aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications.
That is, most things done with Anserini can be "translated" into Elasticsearch quite easily.

## Data Prep

In this guide, we're just going through the mechanical steps of data prep.
@@ -263,8 +274,9 @@ We can find the MRR@10 for `qid` 1048585 above:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \
  collections/msmarco-passage/qrels.dev.small.trec \
  runs/run.msmarco-passage.dev.small.trec | grep 1048585

recip_rank 1048585 1.0000
```
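
To make the `recip_rank` number above concrete, here is a minimal Python sketch of reciprocal rank with a cutoff of 10, averaged over queries. It is not how `trec_eval` is implemented; the function names and the toy qrels/run lines are made up for illustration, and tie-breaking between equal scores is simplified. It assumes the usual TREC formats (`qid 0 docid relevance` for qrels, `qid Q0 docid rank score tag` for runs).

```python
from collections import defaultdict


def load_qrels(lines):
    """Parse TREC qrels lines ('qid 0 docid relevance') into {qid: set of relevant docids}."""
    relevant = defaultdict(set)
    for line in lines:
        qid, _, docid, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(docid)
    return dict(relevant)


def load_run(lines):
    """Parse TREC run lines ('qid Q0 docid rank score tag') into {qid: docids, best first}."""
    scored = defaultdict(list)
    for line in lines:
        qid, _, docid, _rank, score, _tag = line.split()
        scored[qid].append((float(score), docid))
    # Order by descending score; ties keep input order (a simplification).
    return {
        qid: [docid for _, docid in sorted(pairs, key=lambda p: -p[0])]
        for qid, pairs in scored.items()
    }


def reciprocal_rank(ranking, relevant_docs, cutoff=10):
    """Return 1/rank of the first relevant document within the cutoff, else 0."""
    for position, docid in enumerate(ranking[:cutoff], start=1):
        if docid in relevant_docs:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(run, qrels, cutoff=10):
    """Average reciprocal rank over every query in qrels; missing queries score 0."""
    if not qrels:
        return 0.0
    total = sum(
        reciprocal_rank(run.get(qid, []), docs, cutoff) for qid, docs in qrels.items()
    )
    return total / len(qrels)


# Toy data (hypothetical ids, not from MS MARCO).
toy_qrels = ["q1 0 d3 1", "q2 0 d9 1"]
toy_run = [
    "q1 Q0 d3 1 12.5 toy",
    "q1 Q0 d4 2 11.0 toy",
    "q2 Q0 d7 1 9.1 toy",
    "q2 Q0 d9 2 8.7 toy",
]
qrels = load_qrels(toy_qrels)
run = load_run(toy_run)
print(reciprocal_rank(run["q1"], qrels["q1"]))  # relevant doc at rank 1
print(mean_reciprocal_rank(run, qrels))         # average of 1/1 and 1/2
```

In the toy data, `q1` has its relevant passage at rank 1 (reciprocal rank 1.0, like `qid` 1048585 above) and `q2` has it at rank 2 (0.5), so the mean is 0.75.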

@@ -280,6 +292,8 @@ In short, it's complicated.
At this time, look back through the learning outcomes again and make sure you're good.
As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here).

Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: write the date as `yyyy-mm-dd`, make sure the commit id you cite is on the main trunk of Anserini, and use the first 7 hexadecimal characters of that commit id as the link anchor text.
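
As a rough illustration of those formatting rules, the sketch below builds one log line from a full commit id. The exact wording of the entry is an assumption (copy the existing entries on the page rather than this template), and the function names, user name, and commit id are hypothetical; only the `yyyy-mm-dd` date and the 7-character prefix come from the instructions above.

```python
import re
from datetime import date


def short_commit(commit_id):
    """Return the 7-character prefix of a full 40-character hexadecimal commit id."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_id):
        raise ValueError(f"not a full commit id: {commit_id!r}")
    return commit_id[:7]


def log_entry(user, commit_id, when):
    """Format one log line; the sentence template here is an assumed example."""
    short = short_commit(commit_id)
    commit_url = f"https://github.com/castorini/anserini/commit/{commit_id}"
    return (
        f"+ Results reproduced by [@{user}](https://github.com/{user}) "
        f"on {when.isoformat()} (commit [`{short}`]({commit_url}))"
    )


# Hypothetical user and commit id, for illustration only.
entry = log_entry(
    "example-user",
    "0123456789abcdef0123456789abcdef01234567",
    date(2023, 7, 21),
)
print(entry)
```

Note that the sketch cannot check whether the commit is on the main trunk; that still has to be verified against the repository.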

## BM25 Tuning

This section is **not** part of the onboarding path, so feel free to skip.
7 changes: 5 additions & 2 deletions docs/start-here.md
@@ -20,10 +20,11 @@ What's the problem we're trying to solve?

This is the definition I typically give:

> Given an information need expressed as a query _q_, the text retrieval task is to return a ranked list of _k_ texts {_d<sub>1</sub>_, _d<sub>2</sub>_ ... _d<sub>k</sub>_} from an arbitrarily large but finite collection
of texts _C_ = {_d<sub>i</sub>_} that maximizes a metric of interest, for example, nDCG, AP, etc.

This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, the top-_k_ document retrieval problem, etc.
In most contexts, "ranking" and "retrieval" are used interchangeably.
Basically, this is what _search_ (i.e., information retrieval) is all about.
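
The shape of the task can be sketched in a few lines of Python: score every text in the collection against the query and keep the _k_ best. This is only an illustration under stated assumptions: the term-overlap scoring function, the function names, and the toy collection are all hypothetical, and a real system ranks with a model such as BM25 and judges the resulting list with a metric such as nDCG or AP.

```python
import heapq


def term_overlap(query, text):
    """Toy scoring function: count the distinct query terms that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_top_k(query, collection, score, k=10):
    """Return the k highest-scoring (docid, score) pairs from the collection, best first."""
    scored = ((docid, score(query, text)) for docid, text in collection.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy collection C (hypothetical texts, not from any real corpus).
toy_collection = {
    "d1": "lucene is a search library",
    "d2": "anserini builds on lucene for retrieval research",
    "d3": "cats sleep most of the day",
}
results = retrieve_top_k("lucene retrieval research", toy_collection, term_overlap, k=2)
print(results)
```

Here `d2` matches all three query terms and `d1` matches one, so the ranked list for _k_ = 2 is `d2` followed by `d1`.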

Let's try to unpack the definition a bit.
@@ -268,5 +269,7 @@ By now you should be able to connect the concepts we introduced to how they mani
From here, you're ready to try reproducing the [BM25 Baselines for MS MARCO Passage Ranking](experiments-msmarco-passage.md).

Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: write the date as `yyyy-mm-dd`, make sure the commit id you cite is on the main trunk of Anserini, and use the first 7 hexadecimal characters of that commit id as the link anchor text.

## Reproduction Log[*](reproducibility.md)