From 0e08dcef48480f1d39827e8fac4aad281e46b460 Mon Sep 17 00:00:00 2001
From: "ECE @ UWaterloo"
Date: Wed, 27 Nov 2024 20:45:48 -0500
Subject: [PATCH 1/3] Update experiments-20newsgroups.md

---
 docs/experiments-20newsgroups.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/experiments-20newsgroups.md b/docs/experiments-20newsgroups.md
index 5e4339a7e0..13a02e73e9 100644
--- a/docs/experiments-20newsgroups.md
+++ b/docs/experiments-20newsgroups.md
@@ -82,4 +82,5 @@ For convenience, we also provide pre-built indexes above.
 
 ## Reproduction Log[*](reproducibility.md)
 
 + Results reproduced by [@stephaniewhoo](http://github.com/stephaniewhoo) on 2020-11-24 (commit [`b7f1f08`](https://github.com/castorini/anserini/commit/b7f1f08689014159c1d5b2c9b9905b363af1cbbf))
++ Results reproduced by [@b8zhong](http://github.com/b8zhong) on 2024-11-27 (commit [`a5e6771`](https://github.com/castorini/anserini/commit/a5e6771a0aedcfb1c394e345636236d536c8c57d))

From d5bcd4ae835312905de451cfaebb7036b5fbe586 Mon Sep 17 00:00:00 2001
From: "ECE @ UWaterloo"
Date: Thu, 28 Nov 2024 09:03:06 -0500
Subject: [PATCH 2/3] Update 20newsgroups instructions

Since sh target/appassembler/bin/IndexCollection is outdated, use the fatjar directly
---
 docs/experiments-20newsgroups.md | 53 +++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/docs/experiments-20newsgroups.md b/docs/experiments-20newsgroups.md
index 13a02e73e9..5d349a311c 100644
--- a/docs/experiments-20newsgroups.md
+++ b/docs/experiments-20newsgroups.md
@@ -41,34 +41,57 @@ Now you should see the train and test splits merged into one folder in `collecti
 To index train and test together:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-  -input collections/20newsgroups/20news-bydate \
-  -index indexes/lucene-index.20newsgroups.all \
-  -generator DefaultLuceneDocumentGenerator -threads 2 \
-  -storePositions -storeDocvectors -storeRaw -optimize
+java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
+  -collection TwentyNewsgroupsCollection \
+  -input collections/20newsgroups/20news-bydate \
+  -index indexes/lucene-index.20newsgroups.all \
+  -generator DefaultLuceneDocumentGenerator -threads 2 \
+  -storePositions -storeDocvectors -storeRaw -optimize
 ```
 
 To index just the train set:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-  -input collections/20newsgroups/20news-bydate-train \
-  -index indexes/lucene-index.20newsgroups.train \
-  -generator DefaultLuceneDocumentGenerator -threads 2 \
-  -storePositions -storeDocvectors -storeRaw -optimize
+java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
+  -collection TwentyNewsgroupsCollection \
+  -input collections/20newsgroups/20news-bydate-train \
+  -index indexes/lucene-index.20newsgroups.train \
+  -generator DefaultLuceneDocumentGenerator -threads 2 \
+  -storePositions -storeDocvectors -storeRaw -optimize
 ```
 
 To index just the test set:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-  -input collections/20newsgroups/20news-bydate-test \
-  -index indexes/lucene-index.20newsgroups.test \
-  -generator DefaultLuceneDocumentGenerator -threads 2 \
-  -storePositions -storeDocvectors -storeRaw -optimize
+java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
+  -collection TwentyNewsgroupsCollection \
+  -input collections/20newsgroups/20news-bydate-test \
+  -index indexes/lucene-index.20newsgroups.test \
+  -generator DefaultLuceneDocumentGenerator -threads 2 \
+  -storePositions -storeDocvectors -storeRaw -optimize
 ```
 
 Indexing should take just a few seconds.
+
+You can check the document count (for train and test together, or train/test individually) with:
+
+```bash
+java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexReaderUtils \
+  -index indexes/lucene-index.20newsgroups.all \
+  -stats
+```
+
+Which should output:
+
+```
+Index statistics
+----------------
+documents: 18846
+documents (non-empty): 18846
+unique terms: 165633
+total terms: 4219956
+```
+
 For reference, the number of docs indexed should be exactly as follows:
 
 | | # of docs | pre-built index |

From 8a0f6aafafe08babe18262a3df0cc4a9d9d8eba8 Mon Sep 17 00:00:00 2001
From: "ECE @ UWaterloo"
Date: Thu, 28 Nov 2024 10:07:32 -0500
Subject: [PATCH 3/3] From direct fatjar reference to bin/run.sh

---
 docs/experiments-20newsgroups.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/experiments-20newsgroups.md b/docs/experiments-20newsgroups.md
index 5d349a311c..2211184a0e 100644
--- a/docs/experiments-20newsgroups.md
+++ b/docs/experiments-20newsgroups.md
@@ -41,7 +41,7 @@ Now you should see the train and test splits merged into one folder in `collecti
 To index train and test together:
 
 ```bash
-java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
+bin/run.sh io.anserini.index.IndexCollection \
   -collection TwentyNewsgroupsCollection \
   -input collections/20newsgroups/20news-bydate \
   -index indexes/lucene-index.20newsgroups.all \
@@ -52,7 +52,7 @@ java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
 To index just the train set:
 
 ```bash
-java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
+bin/run.sh io.anserini.index.IndexCollection \
   -collection TwentyNewsgroupsCollection \
   -input collections/20newsgroups/20news-bydate-train \
   -index indexes/lucene-index.20newsgroups.train \
@@ -63,7 +63,7 @@ java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
 To index just the test set:
 
 ```bash
-java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexCollection \
+bin/run.sh io.anserini.index.IndexCollection \
   -collection TwentyNewsgroupsCollection \
   -input collections/20newsgroups/20news-bydate-test \
   -index indexes/lucene-index.20newsgroups.test \
@@ -76,7 +76,7 @@ Indexing should take just a few seconds.
 You can check the document count (for train and test together, or train/test individually) with:
 
 ```bash
-java -cp target/anserini-*-fatjar.jar io.anserini.index.IndexReaderUtils \
+bin/run.sh io.anserini.index.IndexReaderUtils \
   -index indexes/lucene-index.20newsgroups.all \
   -stats
 ```
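
Reviewer note: after PATCH 3/3 the three indexing commands differ only in the input directory suffix and the index name. The sketch below is one possible way to generate all three from a single helper; it is not part of the patch. The function name `index_split` and the `DRY_RUN` variable are hypothetical (neither is an Anserini option), while every flag and path is copied from the commands in the diff. By default it only prints the commands, so it can be checked without a built Anserini checkout.

```bash
#!/bin/sh
# Sketch only: build the three IndexCollection commands from the patch.
# index_split and DRY_RUN are hypothetical names, not part of Anserini.
set -eu

index_split() {
  suffix="$1"   # "", "-train", or "-test" (appended to 20news-bydate)
  name="$2"     # all, train, or test (appended to the index name)

  # Flags and paths below are taken verbatim from the patched docs.
  set -- bin/run.sh io.anserini.index.IndexCollection \
    -collection TwentyNewsgroupsCollection \
    -input "collections/20newsgroups/20news-bydate${suffix}" \
    -index "indexes/lucene-index.20newsgroups.${name}" \
    -generator DefaultLuceneDocumentGenerator -threads 2 \
    -storePositions -storeDocvectors -storeRaw -optimize

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): print the command on one line instead of running it.
    echo "$*"
  else
    "$@"
  fi
}

index_split ""       all
index_split "-train" train
index_split "-test"  test
```

Running it with `DRY_RUN=0` from the repository root would execute the commands instead of printing them, assuming `bin/run.sh` exists as introduced in PATCH 3/3.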