Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core18 (Washington Post) regression failure #658

Closed
lintool opened this issue May 31, 2019 · 4 comments
Closed

Core18 (Washington Post) regression failure #658

lintool opened this issue May 31, 2019 · 4 comments
Assignees

Comments

@lintool
Copy link
Member

lintool commented May 31, 2019

@emmileaf reported a regression failure in #654 but exposing here at top level so this doesn't get buried.

@emmileaf can you copy and paste the regression log in this issue?

@emmileaf
Copy link
Member

Previous index:

BM25
map                     all     0.2491
P_30                    all     0.3580

Index statistics
----------------
documents:             595037
documents (non-empty): 595037
unique terms:          840929
total terms:           317882653
stored fields:
  article_url (indexOption: DOCS, hasVectors: false)
  author (indexOption: DOCS, hasVectors: false)
  contents (indexOption: DOCS_AND_FREQS_AND_POSITIONS, hasVectors: true)
  fullCaption (indexOption: DOCS, hasVectors: false)
  id (indexOption: DOCS, hasVectors: false)
  kicker (indexOption: DOCS, hasVectors: false)
  title (indexOption: DOCS, hasVectors: false)

Current index:

BM25
map                     all     0.2444 
P_30                    all     0.3520 

Index statistics
----------------
documents:             595037
documents (non-empty): 585493
unique terms:          817578
total terms:           304740621
stored fields:
  article_url (indexOption: DOCS, hasVectors: false)
  author (indexOption: DOCS, hasVectors: false)
  contents (indexOption: DOCS_AND_FREQS_AND_POSITIONS, hasVectors: true)
  id (indexOption: DOCS, hasVectors: false)
  title (indexOption: DOCS, hasVectors: false)

@ryan-clancy
Copy link
Contributor

ryan-clancy commented May 31, 2019

I reverted the commit (c4ab6bf) and removed the Optional wrapper around the title, but there appears to be some documents that have no title and no paragraph content. When unwrapping the Optional for the title and adding it to contents, we now have additional empty documents (where before it likely indexed Optional[], the empty Optional string value):

2019-05-31 01:48:05,053 - regression_test - INFO - ==========Verifying Index==========
2019-05-31 01:48:05,053 - regression_test - INFO - [Index]: lucene-index.core18.pos+docvectors+rawdocs
2019-05-31 01:48:05,829 - regression_test - INFO - documents: 595037
documents (non-empty): expected=595037, actual=594483
Traceback (most recent call last):
File "src/main/python/run_regression.py", line 237, in <module>
verify_index(yaml_data, args.index, args.dry_run)
File "src/main/python/run_regression.py", line 122, in verify_index
assert value == yaml_data['index_stats'][stat]
AssertionError

How do we want to handle this case? Should we add the kicker to the contents field as well here? When adding the kicker, there is only 7 empty documents:

2019-05-31 10:01:14,399 - regression_test - INFO - ==========Verifying Index==========
2019-05-31 10:01:14,400 - regression_test - INFO - [Index]: lucene-index.core18.pos+docvectors+rawdocs
2019-05-31 10:01:16,321 - regression_test - INFO - documents:             595037
documents (non-empty): expected=595037, actual=595030
Traceback (most recent call last):
  File "src/main/python/run_regression.py", line 237, in <module>
    verify_index(yaml_data, args.index, args.dry_run)
  File "src/main/python/run_regression.py", line 122, in verify_index
    assert value == yaml_data['index_stats'][stat]
AssertionError

and the regressions change to:

2019-05-31 10:04:26,680 - regression_test - INFO - ==========Verifying Results==========
2019-05-31 10:04:26,813 - regression_test - ERROR - !!!!!{"actual": 0.2495, "collection": "core18", "expected": 0.2491, "metric": "map", "model": "bm25", "topic": "All Topics"}!!!!!
2019-05-31 10:04:26,936 - regression_test - ERROR - !!!!!{"actual": 0.3567, "collection": "core18", "expected": 0.358, "metric": "p30", "model": "bm25", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,045 - regression_test - ERROR - !!!!!{"actual": 0.3136, "collection": "core18", "expected": 0.3147, "metric": "map", "model": "bm25+rm3", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,149 - regression_test - ERROR - !!!!!{"actual": 0.42, "collection": "core18", "expected": 0.4193, "metric": "p30", "model": "bm25+rm3", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,263 - regression_test - ERROR - !!!!!{"actual": 0.292, "collection": "core18", "expected": 0.2921, "metric": "map", "model": "bm25+ax", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,369 - regression_test - ERROR - !!!!!{"actual": 0.4027, "collection": "core18", "expected": 0.4007, "metric": "p30", "model": "bm25+ax", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,483 - regression_test - ERROR - !!!!!{"actual": 0.2526, "collection": "core18", "expected": 0.2522, "metric": "map", "model": "ql", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,597 - regression_test - ERROR - !!!!!{"actual": 0.3653, "collection": "core18", "expected": 0.3627, "metric": "p30", "model": "ql", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,709 - regression_test - ERROR - !!!!!{"actual": 0.3073, "collection": "core18", "expected": 0.3064, "metric": "map", "model": "ql+rm3", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,822 - regression_test - ERROR - !!!!!{"actual": 0.4, "collection": "core18", "expected": 0.4007, "metric": "p30", "model": "ql+rm3", "topic": "All Topics"}!!!!!
2019-05-31 10:04:27,952 - regression_test - ERROR - !!!!!{"actual": 0.2966, "collection": "core18", "expected": 0.2975, "metric": "map", "model": "ql+ax", "topic": "All Topics"}!!!!!
2019-05-31 10:04:28,060 - regression_test - ERROR - !!!!!{"actual": 0.406, "collection": "core18", "expected": 0.4073, "metric": "p30", "model": "ql+ax", "topic": "All Topics"}!!!!!

Some increase, some decrease - never by much.

Couple of options:

  • Accept the change above, updating regression numbers.
  • Add something else to the contents to prevent empty docs
  • Just leave the Optional[] wrapper as it was before

@lintool @emmileaf Thoughts?

@emmileaf
Copy link
Member

The first option and removing the Optional[] wrapper makes sense to me 🙂

Regarding the empty docs...I think some of them are actually somewhat empty in content, such as:

{"id": "58dae7e6-276b-11e2-b2a0-ae18d6159439",
 "article_url": "https://www.washingtonpost.com/2012/11/05/58dae7e6-276b-11e2-b2a0-ae18d6159439_story.html", 
"title": null, 
"author": "", 
"published_date": 1352902639000, 
"contents": [{"content": null, "mime": "text/plain", "type": "kicker"}, 
{"content": null, "mime": "text/plain", "type": "title"}, 
{"content": 1352902639000, "mime": "text/plain", "type": "date"}], 
"type": "article", "source": "The Washington Post"}
{"id": "5587b340-515e-11e7-91eb-9611861a988f", 
"article_url": "https://www.washingtonpost.com/classic-apps/2017/06/14/5587b340-515e-11e7-91eb-9611861a988f_story.html", 
"title": null, 
"author": "Kevin Schaul", 
"published_date": 1497487125000, 
"contents": [{"content": null, "mime": "text/plain", "type": "kicker"}, 
{"content": null, "mime": "text/plain", "type": "title"}, 
{"content": "By Kevin Schaul", "mime": "text/plain", "type": "byline"}, 
{"content": 1497487125000, "mime": "text/plain", "type": "date"}, 
{"role": "Graphics editor", "type": "author_info", "name": "Kevin Schaul", "bio": "Kevin Schaul is a graphics editor at the Washington Post."}],
 "type": "article", 
"source": "The Washington Post"}

If we want to have minimal empty docs by adding to content, perhaps we can try author or some text form of published_date? (though this would change the regression numbers as well, and not sure if it would make too much difference/improvement)

@lintool
Copy link
Member Author

lintool commented May 31, 2019

I think @r-clancy 's fix lg - the AP/P30 differences are noise, I think...

Should we add the kicker to the contents field as well here? When adding the kicker, there is only 7 empty documents:

Let's try this and see and see what we get in terms of regression numbers?

lintool added a commit that referenced this issue Jun 1, 2019
crystina-z pushed a commit to crystina-z/anserini that referenced this issue Oct 28, 2022
* add tct_colbert-v2 integration test

* add distilbert_tasb integration test

Co-authored-by: Lin Jack <[email protected]>
Co-authored-by: justram <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants