From 4b8f051c25992a5d87ecf8d30d45a93aff17abc4 Mon Sep 17 00:00:00 2001
From: Jimmy Lin
Date: Fri, 21 Jul 2023 09:45:24 -0400
Subject: [PATCH] More onboarding doc updates (#2151)

---
 docs/experiments-msmarco-passage.md | 18 ++++++++++++++++--
 docs/start-here.md                  |  7 +++++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md
index 5a3e72266b..1ad1ce260a 100644
--- a/docs/experiments-msmarco-passage.md
+++ b/docs/experiments-msmarco-passage.md
@@ -14,6 +14,17 @@ If you're a Waterloo student traversing the [onboarding path](https://github.com
 + Be able to evaluate the retrieved results above.
 + Understand the MRR metric.

+What's Anserini?
+Well, it's the repo that you're in right now.
+Anserini is a toolkit (in Java) for reproducible information retrieval research built on the [Lucene search library](https://lucene.apache.org/).
+The Lucene search library provides components of the popular [Elasticsearch](https://www.elastic.co/) platform.
+
+Think of it this way: Lucene provides a "kit of parts".
+Elasticsearch provides an "assembly of parts" targeted to production search applications, with a REST-centric API.
+Anserini provides an alternative way of composing the same core components together, targeted at information retrieval researchers.
+By building on Lucene, Anserini aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications.
+That is, most things done with Anserini can be "translated" into Elasticsearch quite easily.
+
 ## Data Prep

 In this guide, we're just going through the mechanical steps of data prep.
@@ -263,8 +274,9 @@ We can find the MRR@10 for `qid` 1048585 above:

 ```bash
 $ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \
- collections/msmarco-passage/qrels.dev.small.trec \
- runs/run.msmarco-passage.dev.small.trec | grep 1048585
+ collections/msmarco-passage/qrels.dev.small.trec \
+ runs/run.msmarco-passage.dev.small.trec | grep 1048585
+
 recip_rank 1048585 1.0000
 ```

@@ -280,6 +292,8 @@ In short, it's complicated.

 At this time, look back through the learning outcomes again and make sure you're good.
 As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here).
+Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text.
+
 ## BM25 Tuning

 This section is **not** part of the onboarding path, so feel free to skip.
diff --git a/docs/start-here.md b/docs/start-here.md
index a8a8dd121a..8af9480e57 100644
--- a/docs/start-here.md
+++ b/docs/start-here.md
@@ -20,10 +20,11 @@ What's the problem we're trying to solve?

 This is the definition I typically give:

-> Given an information need expressed as a query _q_, the text ranking task is to return a ranked list of _k_ texts {_d1_, _d2_ ... _dk_} from an arbitrarily large but finite collection
+> Given an information need expressed as a query _q_, the text retrieval task is to return a ranked list of _k_ texts {_d1_, _d2_ ... _dk_} from an arbitrarily large but finite collection
 of texts _C_ = {_di_} that maximizes a metric of interest, for example, nDCG, AP, etc.

-This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, etc.
+This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, the top-_k_ document retrieval problem, etc.
+In most contexts, "ranking" and "retrieval" are used interchangeably.
 Basically, this is what _search_ (i.e., information retrieval) is all about.

 Let's try to unpack the definition a bit.
@@ -276,5 +277,7 @@ By now you should be able to connect the concepts we introduced to how they mani

 From here, you're now ready to proceed to try and reproduce the [BM25 Baselines for MS MARCO Passage Ranking ](experiments-msmarco-passage.md).

+Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text.
+
 ## Reproduction Log[*](reproducibility.md)
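
Note on the `recip_rank` check touched in `docs/experiments-msmarco-passage.md`: for readers who want to see what that `trec_eval` invocation measures, here is a minimal Python sketch of reciprocal rank at a cutoff of 10 (the `-M 10 -m recip_rank` part), averaged over every query in the qrels (the `-c` part). It is not code from Anserini or `trec_eval`. The function names and the toy qids and docids are made up for illustration. It also assumes the run's rank column is trustworthy, whereas `trec_eval` orders results by score.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map qid -> set of relevant docids from TREC qrels lines ('qid 0 docid rel')."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, docid, rel = parts
        if int(rel) > 0:
            relevant[qid].add(docid)
    return relevant


def parse_run(lines):
    """Map qid -> docids in rank order from TREC run lines ('qid Q0 docid rank score tag').

    Assumption: ordering follows the rank column (trec_eval itself sorts by score).
    """
    hits = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _, docid, rank, _, _ = parts
        hits[qid].append((int(rank), docid))
    return {qid: [docid for _, docid in sorted(pairs)] for qid, pairs in hits.items()}


def reciprocal_rank(ranking, relevant, cutoff=10):
    """Return 1/position of the first relevant docid within the cutoff, else 0.0."""
    for position, docid in enumerate(ranking[:cutoff], start=1):
        if docid in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(run, qrels, cutoff=10):
    """Average reciprocal rank over all qrels queries; queries missing from the run score 0."""
    if not qrels:
        return 0.0
    total = sum(reciprocal_rank(run.get(qid, []), rel, cutoff) for qid, rel in qrels.items())
    return total / len(qrels)


if __name__ == "__main__":
    # Toy data (hypothetical ids): q1 has its relevant doc at rank 2, q2 never retrieves one.
    qrels = parse_qrels(["q1 0 d3 1", "q2 0 d9 1"])
    run = parse_run([
        "q1 Q0 d1 1 9.0 demo",
        "q1 Q0 d3 2 8.0 demo",
        "q2 Q0 d5 1 7.0 demo",
    ])
    print(reciprocal_rank(run["q1"], qrels["q1"]))  # 0.5
    print(mean_reciprocal_rank(run, qrels))         # 0.25
```

A per-query value of `1.0000`, as in the `recip_rank 1048585 1.0000` line of the patch, corresponds to the first relevant passage sitting at rank 1.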