From 8c989964e73a2044553e7bb638dd5f7ebed2ea53 Mon Sep 17 00:00:00 2001 From: lintool Date: Thu, 20 Jul 2023 10:21:14 -0400 Subject: [PATCH 01/10] initial docs --- docs/start-here.md | 244 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 244 insertions(+) create mode 100644 docs/start-here.md diff --git a/docs/start-here.md b/docs/start-here.md new file mode 100644 index 0000000000..9566079cd4 --- /dev/null +++ b/docs/start-here.md @@ -0,0 +1,244 @@ +# Anserini: Start Here + +This page provides the entry point for an introduction to information retrieval (i.e., search). +It also serves as an [onboarding path](https://github.com/lintool/guide/blob/master/ura.md) for University of Waterloo undergraduates who wish to join my research group. + +As a high-level tip for anyone going through these exercises: try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell). + +**Learning outcomes:** + ++ Understand the definition of the retrieval problem in terms of the core concepts of queries, collections, and relevance. ++ Understand at a high level how retrieval systems are evaluated with queries and relevance judgments. ++ Download the MS MARCO passage ranking test collection and perform basic manipulations on the downloaded files. ++ Connect the contents of specific files in the download package to the concepts referenced above. + +## The Retrieval Problem + +Let's start at the top: +What's the problem we're trying to solve? + +This is the definition I typically give: + +> Given an information need expressed as a query _q_, the text ranking task is to return a ranked list of _k_ texts {_d1_, _d2_ ... _dk_} from an arbitrarily large but finite collection +of texts _C_ = {_di_} that maximizes a metric of interest, for example, nDCG, AP, etc. + +This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, etc. +Basically, this is what _search_ (i.e., information retrieval) is all about. + +Let's try to unpack the definition a bit. + +A **"query"** is a representation of an information need (i.e., the reason you're looking for information in the first place) that serves as the input to a retrieval system. + +The **"collection"** is what you're retrieving from (i.e., searching). +Some people say "corpus" (plural, "corpora", not "corpuses"), but the terms are used interchangeably. +A "collection" or "corpus" comprises "documents". +In standard parlance, a "document" is used generically to refer to any discrete information object that can be retrieved. +We call them "documents" even though in reality they may be webpages, passages, PDFs, Powerpoint slides, Excel spreadsheets, or even images, audio, or video. + +The output of retrieval is a ranking of documents (i.e., a ranked list, or just a sorted list). +Documents are identified by unique ids, and so a ranking is simply a list of ids. +The document contents can serve as input to downstream processing, e.g., fed into the prompt of a large language model as part of retrieval augmentation or generative question answering. + +**Relevance** is perhaps the most important concept in information retrieval. +The literature about relevance goes back at least fifty years and the notion is (surprisingly) difficult to pin down precisely. 
+However, at an intuitive level, relevance is a relationship between an information need and a document, i.e., is this document relevant to this information need? +Something like, "does this document contain the information I'm looking for?" + +Sweeping away a lot of complexity... relevance can be binary (i.e., not relevant, relevant) or "graded" (think Likert scale, e.g., not relevant, relevant, highly relevant). + +## The Evaluation Problem + +How do you know if a retrieval system is returning good results? +How do you know if _this_ retrieval system is better than _that_ retrieval system? + +Well, imagine if you had a list of "real-world" queries, and someone told you which documents were relevant to which queries. +Amazingly, these artifacts exist, and they're called **relevance judgments**, **qrels**. +Conceptually, they're triples along these lines: + +``` +q1 doc23 0 +q1 doc452 1 +q1 doc536 0 +q2 doc97 0 +... +``` + +That is, for `q1`, `doc23` is not relevant, `doc452` is relevant, and `doc536` is not relevant; for `q2`, `doc97` is not relevant... + +Now, given a set of queries, you feed each query into your retrieval system and get back a ranked list of document ids. + +The final thing you'd need is a **metric** that quantifies the "goodness" of the ranked list. +One easy to understand metric is precision at 10, often written P@10. +It simply means: of the top 10 documents, what fraction are relevant according to your qrels? +For a query, if five of them are relevant, you get a score of 0.5; if nine of them are relevant, you get a score of 0.9. +You compute per query, and then average across all queries. + +Information retrieval researchers have dozens of metrics, but a detailed explanation of each isn't important right now... +just recognize that _all_ metrics are imperfect, but they try to capture different aspects of the quality of a ranked list in terms of containing relevant documents. +For nearly all metrics, though, higher is better. + +So now with a metric, we have the ability to measure (i.e., quantify) the quality of a system's output. +And with that, we have the ability to compare the quality of two different systems or two model variants. + +And with that, we have the ability to iterate and build better retrieval systems (e.g., with machine learning). +Oversimplifying (of course), information retrieval is all about making that metric go up. + +Oh, where do these magical qrels come from? +Well, that's the story for another day... + +## MS MARCO + +Bringing together everything we've discussed so far, a test collection consists of three elements: + ++ a collection (or corpus) of documents ++ a set of queries ++ qrels, which tell us what documents are relevant to what queries + +Here, we're going to introduce the MS MARCO passage ranking test collection. + +In these instructions we're going to use Anserini's root directory as the working directory. +First, we need to download and extract the data: + +```bash +mkdir collections/msmarco-passage + +wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage + +# Alternative mirror: +# wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/collectionandqueries.tar.gz -P collections/msmarco-passage + +tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage +``` + +To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b18952c1386cd4564ba2ae69`. 
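A quick way to verify the download without eyeballing checksums is to recompute the MD5 yourself. Here is a minimal sketch in Python; the file path and expected checksum are the ones quoted above, and on the command line `md5sum` (Linux) or `md5` (macOS) does the same job:

```python
import hashlib

# Recompute the MD5 of the downloaded tarball and compare it against the
# expected checksum quoted above.
expected = "31644046b18952c1386cd4564ba2ae69"
path = "collections/msmarco-passage/collectionandqueries.tar.gz"

md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
        md5.update(chunk)

print(md5.hexdigest())
assert md5.hexdigest() == expected, "Checksum mismatch: re-download the tarball."
```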
+ +If you peak inside the collection: + +```bash +head collections/msmarco-passage/collection.tsv +``` + +You'll see that `collection.tsv` contains the documents that we're searching. +Note that generically we call them "documents" but in truth they are passages; we'll use the terms interchangeably. + +Each line represents a passage: +the first column contains a unique identifier for the passage (called the `docid`) and the second column contains the text of the passage itself. + +Next, we need to do a bit of data munging to get the MS MARCO tsv collection into something Anserini can easily work with, which is a jsonl format (which have one json object per line): + +```bash +python tools/scripts/msmarco/convert_collection_to_jsonl.py \ + --collection-path collections/msmarco-passage/collection.tsv \ + --output-folder collections/msmarco-passage/collection_jsonl +``` + +The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines). +You'll get something like this: + +```bash +$ wc collections/msmarco-passage/collection_jsonl/* + 1000000 58716381 374524070 collections/msmarco-passage/collection_jsonl/docs00.json + 1000000 59072018 377845773 collections/msmarco-passage/collection_jsonl/docs01.json + 1000000 58895092 375856044 collections/msmarco-passage/collection_jsonl/docs02.json + 1000000 59277129 377452947 collections/msmarco-passage/collection_jsonl/docs03.json + 1000000 59408028 378277584 collections/msmarco-passage/collection_jsonl/docs04.json + 1000000 60659246 383758389 collections/msmarco-passage/collection_jsonl/docs05.json + 1000000 63196730 400184520 collections/msmarco-passage/collection_jsonl/docs06.json + 1000000 56920456 364726419 collections/msmarco-passage/collection_jsonl/docs07.json + 841823 47767342 306155721 collections/msmarco-passage/collection_jsonl/docs08.json + 8841823 523912422 3338781467 total +``` + +As an aside, data munging along these lines is a very common data preparation step. +Collections rarely come in _exactly_ the format that your tools except, so you'll be frequently writing lots of small scripts that munge data along these lines. + +Similarly, we'll have to do a bit of data munging of the queries and the qrels also. +We're going to retain only the queries that are in the qrels file: + +```bash +python tools/scripts/msmarco/filter_queries.py \ + --qrels collections/msmarco-passage/qrels.dev.small.tsv \ + --queries collections/msmarco-passage/queries.dev.tsv \ + --output collections/msmarco-passage/queries.dev.small.tsv +``` + +The output queries file `collections/msmarco-passage/queries.dev.small.tsv` should contain 6980 lines. + +Check out the contents of the queries file: + +```bash +$ head collections/msmarco-passage/queries.dev.small.tsv +1048585 what is paula deen's brother +2 Androgen receptor define +524332 treating tension headaches without medication +1048642 what is paranoid sc +524447 treatment of varicose veins in legs +786674 what is prime rate in canada +1048876 who plays young dr mallard on ncis +1048917 what is operating system misconfiguration +786786 what is priority pass +524699 tricare service number +``` + +These are the queries in the development set of the MS MARCO passage ranking test collection. +The first field is a unique identifier for the query (called the `qid`) and the second column is the query itself. 
+These queries are taken from Bing search logs, so they're "realistic" web queries in that they may be ambiguous, contain typos, etc. + +Okay, let's now cross reference the `qid` with the relvance judgments, i.e., the qrels file: + +```bash +$ grep 1048585 collections/msmarco-passage/qrels.dev.small.tsv +1048585 0 7187158 1 +``` + +The above is the standard format for qrels file. +The first column is the `qid`; +the second column is (almost) always 0 (it's a historical artifact dating back decades); +the third column is a `docid`; +the fourth colum provides the relevance judgment itself. +In this case, 0 means "not relevant" and 1 means "relevant". +So, this entry says that the document with id 7187158 is relevant to the query with id 1048585. + +Well, how do we get the actual contents of document 1048585? +The simplest way is to grep through the collection itself: + +```bash +$ grep 7187158 collections/msmarco-passage/collection.tsv +7187158 Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba's… Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba's… +``` + +We see here that, indeed, the passage above is relevant to the query (i.e., provides information that answers the question). + +How big is the MS MARCO passage ranking test collection, btw? +Well, we've just seen that there are 6980 training queries. +For those, we have 7437 relevance judgments: + +``` +$ wc collections/msmarco-passage/qrels.dev.small.tsv +7437 29748 143300 collections/msmarco-passage/qrels.dev.small.tsv +```` + +This means that we have only about one relevance judgments per query. +We call these **sparse judgments**, in cases where we have relatively few relevance judgments per query. +In other cases, where we have many relevance judgments per query, we call those **dense judgments**. +There are important implications when using sparse vs. dense judgments, but that's for another time... + +This is just looking at the development set. +Now let's look at the training set: + +```bash +% wc collections/msmarco-passage/qrels.train.tsv +532761 2131044 10589532 collections/msmarco-passage/qrels.train.tsv +``` + +Wow, there are over 532k relevance judgments in MS MARCO! +(Yes, that's big!) +It's sufficient... for example... to train neural networks (transformers) to perform retrieval! +And, indeed, MS MARCO is perhaps the most common starting point for work in building neural retrieval models. +But that's for some other time.... + +Okay, go back and look at the learning outcomes at the top of this page. +By now you should be able to connect the concepts introduced to how they manifest in the MS MARCO passage ranking test collection. + +## Reproduction Log[*](reproducibility.md) + From f325468714de96f02a85ef09cf09f90ba8ee9596 Mon Sep 17 00:00:00 2001 From: lintool Date: Thu, 20 Jul 2023 13:09:00 -0400 Subject: [PATCH 02/10] Updates. --- docs/experiments-msmarco-passage.md | 225 ++++++++++++++-------------- docs/start-here.md | 57 ++++--- 2 files changed, 145 insertions(+), 137 deletions(-) diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 2ea1c2817c..592f6bc39d 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -2,26 +2,25 @@ This page contains instructions for running BM25 baselines on the [MS MARCO *passage* ranking task](https://microsoft.github.io/msmarco/). 
Note that there is a separate [MS MARCO *document* ranking task](experiments-msmarco-doc.md). +This exercise will require a machine with >8 GB RAM and >15 GB free disk space . -This exercise will require a machine with >8 GB RAM and at least 15 GB free disk space . +If you're a Waterloo student traversing the [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), [start here](start-here.md +). -If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) of joining my research group, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blinding copying and pasting commands into a shell). -In particular, you'll want to pay attention to the "What's going on here?" sections. - -
-What's going on here? - -As a really high level summary: in the MS MARCO passage ranking task, you're given a bunch of passages to search and a bunch of queries. -The system's task is to return the best passages for each query (i.e., passages that are relevant). - -Note that "the things you're searching" are called documents (in the generic sense), even though they're actually passages (extracted from web pages) in this case. -You could be search web pages, PDFs, Excel spreadsheets, and even podcasts. -Information retrieval researchers refer to these all as "documents". -
+**Learning outcomes**, building on previous steps in the onboarding path: ++ Be able to use Anserini to index the MS MARCO passage collection. ++ Be able to use Anserini to search the MS MARCO passage collection with the dev queries. ++ Be able to evaluate the retrieved results above. ++ Understand the MRR metric. ## Data Prep +In this guide, we're just going through the mechanical steps of data prep. +To better understand what you're actually doing, go through the [start here](start-here.md +) guide. +The guide contains the same exact instructions, but provide more detailed explanations. + We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO passage dataset: @@ -38,21 +37,6 @@ tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/ To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b18952c1386cd4564ba2ae69`. -
-What's going on here? - -If you peak inside the collection: - -```bash -head collections/msmarco-passage/collection.tsv -``` - -You'll see that `collection.tsv` contains the passages that we're searching. -Each line represents a passage: -the first column contains a unique identifier for the passage (called the `docid`) and the second column contains the text of the passage itself. - -
- Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line): ```bash @@ -63,10 +47,33 @@ python tools/scripts/msmarco/convert_collection_to_jsonl.py \ The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines). +We need to do a bit of data munging on the queries as well. +There are queries in the dev set that don't have relevance judgments. +Let's discard them: + +```bash +python tools/scripts/msmarco/filter_queries.py \ + --qrels collections/msmarco-passage/qrels.dev.small.tsv \ + --queries collections/msmarco-passage/queries.dev.tsv \ + --output collections/msmarco-passage/queries.dev.small.tsv +``` + +The output queries file `collections/msmarco-passage/queries.dev.small.tsv` should contain 6980 lines. ## Indexing -We can now index these docs as a `JsonCollection` using Anserini: +In building a retrieval system, there are generally two phases: + ++ In the **indexing** phase, an indexer takes the document collection (i.e., corpus) and builds an index, which is a data structure that supports effcient retrieval. ++ In the **retrieval** (or **search**) phase, the retrieval system returns a ranked list given a query _q_, with the aid of the index constructed in the previous phase. + +(There's also a training phase when we start to discuss models that _learn_ from data, but we're not there yet.) + +Given a (static) document collection, indexing only needs to be performed once, and hence there are fewer constraints on latency, throughput, and other aspects of performance (just needs to be "reasonable"). +On the other hand, retrieval needs to be fast, i.e., low latency, high throughput, etc. + +With the data prep above, we can now index the MS MARCO passage collection in `collections/msmarco-passage/collection_jsonl`. +We index these docs as a `JsonCollection` (a specification of how documents are encoded) using Anserini: ```bash target/appassembler/bin/IndexCollection \ @@ -77,48 +84,16 @@ target/appassembler/bin/IndexCollection \ -threads 9 -storePositions -storeDocvectors -storeRaw ``` +In this case, Lucene creates what is known as an **inverted index**. + Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes. ## Retrieval -Since queries of the set are too many (+100k), it would take a long time to retrieve all of them. To speed this up, we use only the queries that are in the qrels file: - -```bash -python tools/scripts/msmarco/filter_queries.py \ - --qrels collections/msmarco-passage/qrels.dev.small.tsv \ - --queries collections/msmarco-passage/queries.dev.tsv \ - --output collections/msmarco-passage/queries.dev.small.tsv -``` - -The output queries file should contain 6980 lines. - -
-What's going on here? - -Check out the contents of the queries file: - -```bash -$ head collections/msmarco-passage/queries.dev.small.tsv -1048585 what is paula deen's brother -2 Androgen receptor define -524332 treating tension headaches without medication -1048642 what is paranoid sc -524447 treatment of varicose veins in legs -786674 what is prime rate in canada -1048876 who plays young dr mallard on ncis -1048917 what is operating system misconfiguration -786786 what is priority pass -524699 tricare service number -``` - -These are the queries we're going to feed to the search engine. -The first field is a unique identifier for the query (called the `qid`) and the second column is the query itself. -These queries are taken from Bing search logs, so they're "realistic" web queries in that they may be ambiguous, contain typos, etc. -
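To make the indexing phase described above a bit more concrete, here is a toy sketch of the core idea behind an inverted index: a mapping from each term to the documents that contain it, which is what makes looking up query terms fast. This is purely illustrative (the tiny "corpus" below is made up); Lucene's actual index structures are far more sophisticated, storing positions, document vectors, compressed postings, and more.

```python
from collections import defaultdict

# Toy corpus: docid -> text (in MS MARCO these would be the passages).
docs = {
    "d1": "paula deen and her brother",
    "d2": "treating tension headaches without medication",
    "d3": "paula deen cooking show",
}

# Build the inverted index: term -> set of docids containing that term.
index = defaultdict(set)
for docid, text in docs.items():
    for term in text.lower().split():  # real systems tokenize and stem more carefully
        index[term].add(docid)

# A bag-of-words query is then answered by looking up each query term:
for term in "paula deen".split():
    print(term, sorted(index[term]))
```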
- -We can now perform a retrieval run using this smaller set of queries: +In the above step, we've built the inverted index. +Now we can now perform a retrieval run using queries we've prepared: ```bash target/appassembler/bin/SearchCollection \ @@ -130,6 +105,14 @@ target/appassembler/bin/SearchCollection \ -bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000 ``` +This is the **retrieval** (or **search**) phase. +We're performing retrieval _in batch_, on a set of queries. + +Retrieval here uses a "bag-of-words" model known as **BM25**. +A "bag of words" model just means that documents are scored based on the matching of query terms (i.e., words) that appear in the documents, without regard to the structure of the document, the order of the words, etc. +BM25 is perhaps the most popular bag-of-words retrieval model; it's the default in the popular [Elasticsearch](https://www.elastic.co/) platform. +We'll discuss retrieval models in much more detail later. + The above command uses BM25 with tuned parameters `k1=0.82`, `b=0.68`. The option `-hits` specifies the number of documents per query to be retrieved. Thus, the output file should have approximately 6980 × 1000 = 6.9M lines. @@ -138,13 +121,12 @@ Retrieval speed will vary by machine: On a reasonably modern desktop with an SSD, with four threads (as specified above), the run takes a couple of minutes. Adjust the parallelism by changing the `-parallelism` argument. -
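To give some intuition for what the BM25 scoring above actually computes, here is a sketch of the textbook BM25 formula: for each query term that appears in the document, an IDF weight is multiplied by a saturated term-frequency component controlled by `k1` and `b`. This is only the classic formulation for illustration; Lucene's built-in BM25 differs in some details, and the collection statistics in the example are invented, so don't expect these scores to match Anserini's output.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=0.82, b=0.68):
    """Textbook BM25 score of one document for a bag-of-words query.

    doc_freqs maps a term to the number of documents in the collection
    that contain it.
    """
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)   # term frequency in this document
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)  # document frequency in the collection
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with length normalization (k1 and b).
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

# Worked example with made-up statistics; the query is one of the dev queries.
print(bm25_score(
    query_terms="what is prime rate in canada".split(),
    doc_terms="the prime rate in canada is set by the bank of canada".split(),
    doc_freqs={"prime": 120000, "rate": 300000, "canada": 250000,
               "what": 2000000, "is": 6000000, "in": 6500000},
    num_docs=8841823,     # size of the MS MARCO passage collection
    avg_doc_len=56.0,     # assumed average passage length, for illustration only
))
```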
-What's going on here? +Congratulations, you've performed your first **retrieval run**! -Congratulations, you've performed your first retrieval run! - -You feed a search engine a bunch of queries, and the retrieval run is the output of the search engine. -For each query, the search engine gives back a ranked list of results (i.e., a list of hits). +Recap of what you've done: +You've fed the retrieval system a bunch of queries and the retrieval run is the output. +For each query, the retrieval system produced a ranked list of results (i.e., a list of hits). +The retrieval run contains the ranked lists for all queries you fed to it. Let's take a look: @@ -174,13 +156,23 @@ $ grep 7187158 collections/msmarco-passage/collection.tsv 7187158 Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba's… Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba's†``` -In this case, the hit seems relevant. -That is, it answers the query. -So here, the search engine did well. +In this case, the document (hit) seems relevant. +That is, it contains information that addresses the information need. +So here, the retrieval system "did well". +Remember that this document was indeed marked relevant in the qrels, as we saw in the [start here](start-here.md +) guide. -Note that this particular passage is a bit dirty (garbage characters, dups, etc.)... but that's pretty much a fact of life when you're dealing with the web. +As an additional sanity check, run the following: -
+```bash +$ cut -f 1 runs/run.msmarco-passage.dev.small.tsv | uniq | wc + 6980 6980 51039 +``` + +This tells us that there are 6980 unique tokens in the first column of the run file. +Since the first column indicates the `qid`, it means that the file contains ranked lists for 6980 queries, which checks out. + +## Evaluation Finally, we can evaluate the retrieved documents using this the official MS MARCO evaluation script: @@ -198,34 +190,12 @@ QueriesRanked: 6980 ##################### ``` -
-What's going on here? - -So how do we know if a search engine is any good? -One method is manual examination, which is what we did above. -That is, we actually looked at the results by hand. - -Obviously, this isn't scalable if we want to evaluate lots of queries... -If only someone told us which documents were relevant to which queries... - -Well, someone has! (Specifically, human editors hired by Microsoft Bing in this case.) -These are captured in what are known as relevance judgments. -Take a look: - -```bash -$ grep 1048585 collections/msmarco-passage/qrels.dev.small.tsv -1048585 0 7187158 1 -``` - -This says that `docid` 7187158 is relevant to `qid` 1048585, which confirms our intuition above. -The file is in what is known as the qrels format. -You can ignore the second column. -The fourth column "1", says that the `docid` is relevant. -In some cases (though not here), that column might say "0", i.e., that the `docid` is _not_ relevant. +(Yea, the number of digits of precision is a bit... excessive) -With relevance judgments (qrels), we can now automatically evaluate the search engine output (i.e., the run). -The final ingredient we need is a metric (i.e., how to score). +Remember from the [start here](start-here.md +) guide that with relevance judgments (qrels), we can automatically evaluate the retrieval system output (i.e., the run). +The final ingredient is a metric, i.e., how to quantify the "quality" of a ranked list. Here, we're using a metric called MRR, or mean reciprocal rank. The idea is quite simple: We look at where the relevant `docid` appears. @@ -238,7 +208,6 @@ If the relevant `docid` doesn't appear in the top 10, then the system gets a sco That's the score of a query. We take the average of the scores across all queries (6980 in this case), and we arrive at the score for the entire run. -
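If you want to convince yourself that the number above isn't magic, here is a rough sketch of computing MRR@10 directly from the qrels and the run file. It assumes the run file is tab-separated with `qid`, `docid`, and rank in the first three columns (peek at it with `head` to confirm); the official evaluation script remains the authoritative implementation.

```python
from collections import defaultdict

# Relevant docids per query, from the qrels (format: qid, 0, docid, judgment).
relevant = defaultdict(set)
with open("collections/msmarco-passage/qrels.dev.small.tsv") as f:
    for line in f:
        qid, _, docid, judgment = line.split()
        if int(judgment) > 0:
            relevant[qid].add(docid)

# Ranked list per query, from the run file (assumed columns: qid, docid, rank).
ranking = defaultdict(list)
with open("runs/run.msmarco-passage.dev.small.tsv") as f:
    for line in f:
        qid, docid, rank = line.split()
        ranking[qid].append((int(rank), docid))

# MRR@10: reciprocal rank of the first relevant hit, zero if none in the top 10.
total = 0.0
for qid, ranked in ranking.items():
    for rank, docid in sorted(ranked)[:10]:
        if docid in relevant[qid]:
            total += 1.0 / rank
            break

print(total / len(ranking))  # should land close to the MRR@10 reported above
```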
You can find this run on the [MS MARCO Passage Ranking Leaderboard](https://microsoft.github.io/msmarco/) as the entry named "BM25 (Lucene8, tuned)", dated 2019/06/26. So you've just reproduced (part of) a leaderboard submission! @@ -260,7 +229,8 @@ And run the `trec_eval` tool: ```bash tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \ - collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec + collections/msmarco-passage/qrels.dev.small.trec \ + runs/run.msmarco-passage.dev.small.trec ``` The output should be: @@ -270,22 +240,49 @@ map all 0.1957 recall_1000 all 0.8573 ``` -Average precision and recall@1000 are the two metrics we care about the most. +In many retrieval applications, average precision and recall@1000 are the two metrics we care about the most. -
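Average precision and recall@1000 haven't been defined in this guide yet. To make them concrete, here is a small sketch of both for a single query, given a ranked list of docids and the set of relevant docids; `trec_eval` is the reference implementation, so treat this only as a statement of the definitions.

```python
def average_precision(ranked_docids, relevant, k=1000):
    """Sum of precision@i at each rank i holding a relevant document,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for i, docid in enumerate(ranked_docids[:k], start=1):
        if docid in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def recall_at_k(ranked_docids, relevant, k=1000):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked_docids[:k]) & relevant) / len(relevant) if relevant else 0.0

# Toy example: one relevant document, retrieved at rank 3.
ranked = ["d7", "d2", "d9", "d4"]
print(average_precision(ranked, {"d9"}))  # 0.333...
print(recall_at_k(ranked, {"d9"}))        # 1.0
```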
-What's going on here? +You can use `trec_eval` to compute the MRR@10 also, which gives results identical to above (just fewer digits of precision): -Don't worry so much about the details here for now. -The tl;dr is that there are different formats for run files and lots of different metrics you can compute. -`trec_eval` is a standard tool used by information retrieval researchers. +``` +tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank \ + collections/msmarco-passage/qrels.dev.small.trec \ + runs/run.msmarco-passage.dev.small.trec +``` + +It's a different command-line incantation of `trec_eval` to compute MRR@10. +And if you add `-q`, the tool will spit out the MRR@10 _per query_ (for all 6908 queries, in addition to the final average). + +``` +tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \ + collections/msmarco-passage/qrels.dev.small.trec \ + runs/run.msmarco-passage.dev.small.trec +``` -In fact, researchers have been trying to answer the question "how do we know if a search result is good and how do we measure it" for over half a century... -and the question still has not been fully resolved. +We can find the MRR@10 for `qid` 1048585 above: + +```bash +$ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \ + collections/msmarco-passage/qrels.dev.small.trec \ + runs/run.msmarco-passage.dev.small.trec | grep 1048585 +recip_rank 1048585 1.0000 +``` + +This is consistent with the example we worked through above. +At this point, make sure that the connections between a query, the relevance judgments for a query, the ranked list, and the metric (MRR@10) are clear in your mind. +Work through a few more examples (take another query, look at its qrels and ranked list, and compute its MRR@10 by hand) to convince yourself that you understand what's going on. + +The tl;dr is that there are different formats for run files and lots of different metrics you can compute. +`trec_eval` is a standard tool used by information retrieval researchers (which has many command-line options that you'll slowly learn over time). +Researchers have been trying to answer the question "how do we know if a search result is good and how do we measure it" for over half a century... and the question still has not been fully resolved. In short, it's complicated. -
+ +At this time, look back through the learning outcomes again and make sure you're good. ## BM25 Tuning +This section is **not** part of the onboarding path, so feel free to skip. + Note that this figure differs slightly from the value reported in [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375), which uses the Anserini (system-wide) default of `k1=0.9`, `b=0.4`. Tuning was accomplished with `tools/scripts/msmarco/tune_bm25.py`, using the queries found [here](https://github.com/castorini/Anserini-data/tree/master/MSMARCO); the basic approach is grid search of parameter values in tenth increments. diff --git a/docs/start-here.md b/docs/start-here.md index 9566079cd4..02877d464c 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -1,9 +1,10 @@ # Anserini: Start Here This page provides the entry point for an introduction to information retrieval (i.e., search). -It also serves as an [onboarding path](https://github.com/lintool/guide/blob/master/ura.md) for University of Waterloo undergraduates who wish to join my research group. +It also serves as an [onboarding path](https://github.com/lintool/guide/blob/master/ura.md) for University of Waterloo undergraduate (and graduate) students who wish to join my research group. As a high-level tip for anyone going through these exercises: try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell). +By this, I mean, actually _read_ the surrounding explanations, understand the purpose of the commands, and use this guide as a springboard for additional explorations (for example, dig deeper into the code). **Learning outcomes:** @@ -40,7 +41,7 @@ Documents are identified by unique ids, and so a ranking is simply a list of ids The document contents can serve as input to downstream processing, e.g., fed into the prompt of a large language model as part of retrieval augmentation or generative question answering. **Relevance** is perhaps the most important concept in information retrieval. -The literature about relevance goes back at least fifty years and the notion is (surprisingly) difficult to pin down precisely. +The literature on relevance goes back at least fifty years and the notion is (surprisingly) difficult to pin down precisely. However, at an intuitive level, relevance is a relationship between an information need and a document, i.e., is this document relevant to this information need? Something like, "does this document contain the information I'm looking for?" @@ -52,7 +53,7 @@ How do you know if a retrieval system is returning good results? How do you know if _this_ retrieval system is better than _that_ retrieval system? Well, imagine if you had a list of "real-world" queries, and someone told you which documents were relevant to which queries. -Amazingly, these artifacts exist, and they're called **relevance judgments**, **qrels**. +Amazingly, these artifacts exist, and they're called **relevance judgments** or **qrels**. Conceptually, they're triples along these lines: ``` @@ -67,11 +68,11 @@ That is, for `q1`, `doc23` is not relevant, `doc452` is relevant, and `doc536` i Now, given a set of queries, you feed each query into your retrieval system and get back a ranked list of document ids. -The final thing you'd need is a **metric** that quantifies the "goodness" of the ranked list. -One easy to understand metric is precision at 10, often written P@10. 
+The final thing you need is a **metric** that quantifies the "goodness" of the ranked list. +One easy-to-understand metric is precision at 10, often written P@10. It simply means: of the top 10 documents, what fraction are relevant according to your qrels? For a query, if five of them are relevant, you get a score of 0.5; if nine of them are relevant, you get a score of 0.9. -You compute per query, and then average across all queries. +You compute P@10 per query, and then average across all queries. Information retrieval researchers have dozens of metrics, but a detailed explanation of each isn't important right now... just recognize that _all_ metrics are imperfect, but they try to capture different aspects of the quality of a ranked list in terms of containing relevant documents. @@ -80,10 +81,10 @@ For nearly all metrics, though, higher is better. So now with a metric, we have the ability to measure (i.e., quantify) the quality of a system's output. And with that, we have the ability to compare the quality of two different systems or two model variants. -And with that, we have the ability to iterate and build better retrieval systems (e.g., with machine learning). -Oversimplifying (of course), information retrieval is all about making that metric go up. +With a metric, we have the ability to iterate incrementally and build better retrieval systems (e.g., with machine learning). +Oversimplifying (of course), information retrieval is all about making the metric go up. -Oh, where do these magical qrels come from? +Oh, where do these magical relevance judgments (qrels) come from? Well, that's the story for another day... ## MS MARCO @@ -92,11 +93,13 @@ Bringing together everything we've discussed so far, a test collection consists + a collection (or corpus) of documents + a set of queries -+ qrels, which tell us what documents are relevant to what queries ++ relevance judgments (or qrels), which tell us which documents are relevant to which queries -Here, we're going to introduce the MS MARCO passage ranking test collection. +Here, we're going to introduce the [MS MARCO passage ranking test collection](https://microsoft.github.io/msmarco/). In these instructions we're going to use Anserini's root directory as the working directory. +Assuming you've cloned the repo already... + First, we need to download and extract the data: ```bash @@ -124,7 +127,7 @@ Note that generically we call them "documents" but in truth they are passages; w Each line represents a passage: the first column contains a unique identifier for the passage (called the `docid`) and the second column contains the text of the passage itself. -Next, we need to do a bit of data munging to get the MS MARCO tsv collection into something Anserini can easily work with, which is a jsonl format (which have one json object per line): +Next, we need to do a bit of data munging to get the collection into something Anserini can easily work with, which is a jsonl format (where we have one json object per line): ```bash python tools/scripts/msmarco/convert_collection_to_jsonl.py \ @@ -133,7 +136,8 @@ python tools/scripts/msmarco/convert_collection_to_jsonl.py \ ``` The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines). -You'll get something like this: +Look inside a file to see the json format we use. 
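One quick way to do that is to parse the first line of one of the generated files and print it. Note that the field names are whatever the conversion script emits, so this sketch deliberately doesn't assume them; just look at what comes out.

```python
import json

# Each line of the generated files is one json object representing a passage.
with open("collections/msmarco-passage/collection_jsonl/docs00.json") as f:
    doc = json.loads(f.readline())

print(sorted(doc.keys()))  # the fields produced by the conversion script
print(doc)                 # one passage, as Anserini will see it at indexing time
```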
+The entire collection is now something like this: ```bash $ wc collections/msmarco-passage/collection_jsonl/* @@ -149,10 +153,10 @@ $ wc collections/msmarco-passage/collection_jsonl/* 8841823 523912422 3338781467 total ``` -As an aside, data munging along these lines is a very common data preparation step. -Collections rarely come in _exactly_ the format that your tools except, so you'll be frequently writing lots of small scripts that munge data along these lines. +As an aside, data munging along these lines is a very common data preparation operation. +Collections rarely come in _exactly_ the format that your tools except, so you'll be frequently writing lots of small scripts that munge data to convert from one format to another. -Similarly, we'll have to do a bit of data munging of the queries and the qrels also. +Similarly, we'll also have to do a bit of data munging of the queries and the qrels. We're going to retain only the queries that are in the qrels file: ```bash @@ -163,8 +167,9 @@ python tools/scripts/msmarco/filter_queries.py \ ``` The output queries file `collections/msmarco-passage/queries.dev.small.tsv` should contain 6980 lines. +Verify with `wc`. -Check out the contents of the queries file: +Check out its contents: ```bash $ head collections/msmarco-passage/queries.dev.small.tsv @@ -191,7 +196,7 @@ $ grep 1048585 collections/msmarco-passage/qrels.dev.small.tsv 1048585 0 7187158 1 ``` -The above is the standard format for qrels file. +The above is the standard format for a qrels file. The first column is the `qid`; the second column is (almost) always 0 (it's a historical artifact dating back decades); the third column is a `docid`; @@ -208,6 +213,9 @@ $ grep 7187158 collections/msmarco-passage/collection.tsv ``` We see here that, indeed, the passage above is relevant to the query (i.e., provides information that answers the question). +Note that this particular passage is a bit dirty (garbage characters, dups, etc.)... but that's pretty much a fact of life when you're dealing with the web. + +Before proceeding, try the same thing with a few more queries: map the queries to the relevance judgments to the actual documents. How big is the MS MARCO passage ranking test collection, btw? Well, we've just seen that there are 6980 training queries. @@ -219,8 +227,8 @@ $ wc collections/msmarco-passage/qrels.dev.small.tsv ```` This means that we have only about one relevance judgments per query. -We call these **sparse judgments**, in cases where we have relatively few relevance judgments per query. -In other cases, where we have many relevance judgments per query, we call those **dense judgments**. +We call these **sparse judgments**, i.e., where we have relatively few relevance judgments per query (here, just about one relevance judgment per query). +In other cases, where we have many relevance judgments per query (potentially hundreds or even more), we call those **dense judgments**. There are important implications when using sparse vs. dense judgments, but that's for another time... This is just looking at the development set. @@ -231,14 +239,17 @@ Now let's look at the training set: 532761 2131044 10589532 collections/msmarco-passage/qrels.train.tsv ``` -Wow, there are over 532k relevance judgments in MS MARCO! +Wow, there are over 532k relevance judgments in the dataset! (Yes, that's big!) -It's sufficient... for example... to train neural networks (transformers) to perform retrieval! +It's sufficient... for example... 
to _train_ neural networks (transformers) to perform retrieval! And, indeed, MS MARCO is perhaps the most common starting point for work in building neural retrieval models. But that's for some other time.... Okay, go back and look at the learning outcomes at the top of this page. -By now you should be able to connect the concepts introduced to how they manifest in the MS MARCO passage ranking test collection. +By now you should be able to connect the concepts we introduced to how they manifest in the MS MARCO passage ranking test collection. + +From here, you're now ready to proceed to try and reproduce the [BM25 Baselines for MS MARCO Passage Ranking +](experiments-msmarco-passage.md). ## Reproduction Log[*](reproducibility.md) From 1300b1802a03ce4c80a441e854a605f1a138c9bb Mon Sep 17 00:00:00 2001 From: lintool Date: Thu, 20 Jul 2023 13:32:54 -0400 Subject: [PATCH 03/10] Updates. --- docs/experiments-msmarco-doc.md | 6 ++---- docs/experiments-msmarco-passage.md | 3 ++- docs/start-here.md | 25 +++++++++++++++++++++---- 3 files changed, 25 insertions(+), 9 deletions(-) diff --git a/docs/experiments-msmarco-doc.md b/docs/experiments-msmarco-doc.md index 2696c69aa9..a6ef5b61e5 100644 --- a/docs/experiments-msmarco-doc.md +++ b/docs/experiments-msmarco-doc.md @@ -3,10 +3,8 @@ This page contains instructions for running BM25 baselines on the [MS MARCO *document* ranking task](https://microsoft.github.io/msmarco/). Note that there is a separate [MS MARCO *passage* ranking task](experiments-msmarco-passage.md). -This exercise will require a machine with >8 GB RAM and at least 40 GB free disk space. - -If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) of joining my research group, make sure you do the [passage ranking exercise](experiments-msmarco-passage.md) first. -Similarly, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blinding copying and pasting commands into a shell). +As of July 2023, this exercise has been removed form the Waterloo students [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), which [starts here](start-here.md +). ## Data Prep diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 592f6bc39d..5a3e72266b 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -7,7 +7,7 @@ This exercise will require a machine with >8 GB RAM and >15 GB free disk space . If you're a Waterloo student traversing the [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), [start here](start-here.md ). -**Learning outcomes**, building on previous steps in the onboarding path: +**Learning outcomes** for this guide, building on previous steps in the onboarding path: + Be able to use Anserini to index the MS MARCO passage collection. + Be able to use Anserini to search the MS MARCO passage collection with the dev queries. @@ -278,6 +278,7 @@ Researchers have been trying to answer the question "how do we know if a search In short, it's complicated. At this time, look back through the learning outcomes again and make sure you're good. +As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here). 
## BM25 Tuning diff --git a/docs/start-here.md b/docs/start-here.md index 02877d464c..f5e66adef8 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -6,7 +6,7 @@ It also serves as an [onboarding path](https://github.com/lintool/guide/blob/mas As a high-level tip for anyone going through these exercises: try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell). By this, I mean, actually _read_ the surrounding explanations, understand the purpose of the commands, and use this guide as a springboard for additional explorations (for example, dig deeper into the code). -**Learning outcomes:** +**Learning outcomes** for this guide: + Understand the definition of the retrieval problem in terms of the core concepts of queries, collections, and relevance. + Understand at a high level how retrieval systems are evaluated with queries and relevance judgments. @@ -87,7 +87,24 @@ Oversimplifying (of course), information retrieval is all about making the metri Oh, where do these magical relevance judgments (qrels) come from? Well, that's the story for another day... -## MS MARCO +
+Additional readings... + +This is a very high-level summary of core concepts in information retrieval. +More nuanced explanations are presented in our book [Pretrained Transformers for Text Ranking: BERT and Beyond]( +https://link.springer.com/book/10.1007/978-3-031-02181-7). +If you can access the book (e.g., via your university), then please do, since it helps get our page views up. +However, if you're paywalled, a pre-publication version is available [on arXiv](https://arxiv.org/abs/2010.06467) for free. + +The parts you'll want to read are Section 1.1 "Text Ranking Problems" and all of Chapter 2 "Setting the Stage". + +When should you do these readings? +That's a good question: +If you absolutely want to know more _right now_, then go for it. +Otherwise, I think it's probably okay to continue along the onboarding path... although you'll need to circle back and get a deeper understanding of these concepts if you want to get into information retrieval research "more seriously". +
+ +## A Tour of MS MARCO Bringing together everything we've discussed so far, a test collection consists of three elements: @@ -221,7 +238,7 @@ How big is the MS MARCO passage ranking test collection, btw? Well, we've just seen that there are 6980 training queries. For those, we have 7437 relevance judgments: -``` +```bash $ wc collections/msmarco-passage/qrels.dev.small.tsv 7437 29748 143300 collections/msmarco-passage/qrels.dev.small.tsv ```` @@ -235,7 +252,7 @@ This is just looking at the development set. Now let's look at the training set: ```bash -% wc collections/msmarco-passage/qrels.train.tsv +$ wc collections/msmarco-passage/qrels.train.tsv 532761 2131044 10589532 collections/msmarco-passage/qrels.train.tsv ``` From dad1d44c76a90bc71da29bfead06a7d7c0d86085 Mon Sep 17 00:00:00 2001 From: lintool Date: Thu, 20 Jul 2023 15:16:25 -0400 Subject: [PATCH 04/10] Refactoring onboarding docs. --- docs/experiments-msmarco-passage.md | 18 ++++++++++++++++-- docs/start-here.md | 2 ++ 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 5a3e72266b..4ae0edc3d4 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -14,6 +14,17 @@ If you're a Waterloo student traversing the [onboarding path](https://github.com + Be able to evaluate the retrieved results above. + Understand the MRR metric. +What's Anserini? +Well, it's the repo that you're in right now. +Anserini is a toolkit (in Java) for reproducible information retrieval research built on the [Luence search library](https://lucene.apache.org/). +The Lucene search library provides components of the popular [Elasticsearch](https://www.elastic.co/) platform. + +Think of it this way: Lucene provides a "kit of parts". +Elasticsearch provides "assembly of parts" targeted to production search applications, with a REST-centric API. +Anserini provides an alternative way of composing the same core components together, targeted at information retrieval researchers. +By building on Lucene, Anserini aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications. +That is, most things done with Anserini can be "translated" into Elasticsearch quite easily. + ## Data Prep In this guide, we're just going through the mechanical steps of data prep. @@ -263,8 +274,9 @@ We can find the MRR@10 for `qid` 1048585 above: ```bash $ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \ - collections/msmarco-passage/qrels.dev.small.trec \ - runs/run.msmarco-passage.dev.small.trec | grep 1048585 + collections/msmarco-passage/qrels.dev.small.trec \ + runs/run.msmarco-passage.dev.small.trec | grep 1048585 + recip_rank 1048585 1.0000 ``` @@ -280,6 +292,8 @@ In short, it's complicated. At this time, look back through the learning outcomes again and make sure you're good. As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here). +Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use a 7-hexadecimal prefix for the link anchor text. + ## BM25 Tuning This section is **not** part of the onboarding path, so feel free to skip. 
diff --git a/docs/start-here.md b/docs/start-here.md index f5e66adef8..0db5ad9608 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -268,5 +268,7 @@ By now you should be able to connect the concepts we introduced to how they mani From here, you're now ready to proceed to try and reproduce the [BM25 Baselines for MS MARCO Passage Ranking ](experiments-msmarco-passage.md). +Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use a 7-hexadecimal prefix for the link anchor text. + ## Reproduction Log[*](reproducibility.md) From e246942fce129517acd816a121c544972c46f76d Mon Sep 17 00:00:00 2001 From: lintool Date: Thu, 20 Jul 2023 15:39:31 -0400 Subject: [PATCH 05/10] Updates. --- docs/start-here.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/start-here.md b/docs/start-here.md index 0db5ad9608..874eb326eb 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -23,7 +23,8 @@ This is the definition I typically give: > Given an information need expressed as a query _q_, the text ranking task is to return a ranked list of _k_ texts {_d1_, _d2_ ... _dk_} from an arbitrarily large but finite collection of texts _C_ = {_di_} that maximizes a metric of interest, for example, nDCG, AP, etc. -This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, etc. +This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, the top-_k_ document retrieval problem, etc. +In most contexts, "ranking" and "retrieval" is used interchangeably. Basically, this is what _search_ (i.e., information retrieval) is all about. Let's try to unpack the definition a bit. From cd1c4c0c6c4eda77cb0fb1c3c34035224afc354f Mon Sep 17 00:00:00 2001 From: lintool Date: Thu, 20 Jul 2023 15:41:41 -0400 Subject: [PATCH 06/10] Fixed merge conflicts. --- docs/experiments-msmarco-passage.md | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 002bb58129..4ae0edc3d4 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -8,7 +8,6 @@ If you're a Waterloo student traversing the [onboarding path](https://github.com ). **Learning outcomes** for this guide, building on previous steps in the onboarding path: -<<<<<<< HEAD + Be able to use Anserini to index the MS MARCO passage collection. + Be able to use Anserini to search the MS MARCO passage collection with the dev queries. @@ -25,13 +24,6 @@ Elasticsearch provides "assembly of parts" targeted to production search applica Anserini provides an alternative way of composing the same core components together, targeted at information retrieval researchers. By building on Lucene, Anserini aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications. That is, most things done with Anserini can be "translated" into Elasticsearch quite easily. -======= - -+ Be able to use Anserini to index the MS MARCO passage collection. -+ Be able to use Anserini to search the MS MARCO passage collection with the dev queries. -+ Be able to evaluate the retrieved results above. -+ Understand the MRR metric. 
->>>>>>> master ## Data Prep @@ -282,14 +274,9 @@ We can find the MRR@10 for `qid` 1048585 above: ```bash $ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \ -<<<<<<< HEAD collections/msmarco-passage/qrels.dev.small.trec \ runs/run.msmarco-passage.dev.small.trec | grep 1048585 -======= - collections/msmarco-passage/qrels.dev.small.trec \ - runs/run.msmarco-passage.dev.small.trec | grep 1048585 ->>>>>>> master recip_rank 1048585 1.0000 ``` @@ -304,11 +291,8 @@ In short, it's complicated. At this time, look back through the learning outcomes again and make sure you're good. As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here). -<<<<<<< HEAD Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use a 7-hexadecimal prefix for the link anchor text. -======= ->>>>>>> master ## BM25 Tuning From 591aaad870baad7e3509760e9250daaa7fb86a76 Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Fri, 21 Jul 2023 09:31:56 -0400 Subject: [PATCH 07/10] Update docs/start-here.md Co-authored-by: Sahel Sharify --- docs/start-here.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start-here.md b/docs/start-here.md index 874eb326eb..a994ec9c6b 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -24,7 +24,7 @@ This is the definition I typically give: of texts _C_ = {_di_} that maximizes a metric of interest, for example, nDCG, AP, etc. This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, the top-_k_ document retrieval problem, etc. -In most contexts, "ranking" and "retrieval" is used interchangeably. +In most contexts, "ranking" and "retrieval" are used interchangeably. Basically, this is what _search_ (i.e., information retrieval) is all about. Let's try to unpack the definition a bit. From dc016f683b8a199313650c38f76b1c6e9a5ddc24 Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Fri, 21 Jul 2023 09:34:23 -0400 Subject: [PATCH 08/10] Update start-here.md --- docs/start-here.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start-here.md b/docs/start-here.md index a994ec9c6b..10a4e9527b 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -20,7 +20,7 @@ What's the problem we're trying to solve? This is the definition I typically give: -> Given an information need expressed as a query _q_, the text ranking task is to return a ranked list of _k_ texts {_d1_, _d2_ ... _dk_} from an arbitrarily large but finite collection +> Given an information need expressed as a query _q_, the text retrieval task is to return a ranked list of _k_ texts {_d1_, _d2_ ... _dk_} from an arbitrarily large but finite collection of texts _C_ = {_di_} that maximizes a metric of interest, for example, nDCG, AP, etc. This problem has been given various names, e.g., the search problem, the information retrieval problem, the text ranking problem, the top-_k_ document retrieval problem, etc. 
From 3957845dc08950e8da6c36247b0c36abc52ac3ee Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Fri, 21 Jul 2023 09:34:49 -0400 Subject: [PATCH 09/10] Update docs/start-here.md Co-authored-by: Sahel Sharify --- docs/start-here.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start-here.md b/docs/start-here.md index 10a4e9527b..096625982b 100644 --- a/docs/start-here.md +++ b/docs/start-here.md @@ -269,7 +269,7 @@ By now you should be able to connect the concepts we introduced to how they mani From here, you're now ready to proceed to try and reproduce the [BM25 Baselines for MS MARCO Passage Ranking ](experiments-msmarco-passage.md). -Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use a 7-hexadecimal prefix for the link anchor text. +Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text. ## Reproduction Log[*](reproducibility.md) From e2c3dd0813e47e728a2e64289943e3c54821afe5 Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Fri, 21 Jul 2023 09:36:01 -0400 Subject: [PATCH 10/10] Update docs/experiments-msmarco-passage.md Co-authored-by: Sahel Sharify --- docs/experiments-msmarco-passage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 4ae0edc3d4..1ad1ce260a 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -292,7 +292,7 @@ In short, it's complicated. At this time, look back through the learning outcomes again and make sure you're good. As a next step in the onboarding path, you basically [do the same thing again in Python with Pyserini](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) (as opposed to Java with Anserini here). -Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use a 7-hexadecimal prefix for the link anchor text. +Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text. ## BM25 Tuning