Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

search: add analyzer for title search #3033

Merged
merged 1 commit into from
Feb 5, 2021
Merged

search: add analyzer for title search #3033

merged 1 commit into from
Feb 5, 2021

Conversation

ParthS007
Copy link
Member

@ParthS007 ParthS007 commented Jan 28, 2021

addresses #2930

@ParthS007
Copy link
Member Author

  1. Tokens generated for: /BTau/Run2010B-Apr21ReReco-v1/AOD
  • /AOD
  • /BTau
  • /BTau/Run2010B-Apr21ReReco-v1/AOD
  • /Run2010B-Apr21ReReco-v1
  1. Query for JetMETTauMonitor: /api/records/?q=%22JetMETTauMonitor%22 gives 3 hits:
  • recid: 79
  • recid: 6
  • Documentation: "cms-guide-for-research"

"type": "pattern_capture",
"preserve_original": true,
"patterns": [
"(/[a-zA-Z-_0-9]+)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot "close" the referenced issue, since this PR does not touch forward slashes. We can use "addresses" though.

BTW note that we have slashes in other fields, e.g. description. They would need the same treatment, so more fields would be treated the same WRT forward slashes.

But I'm not really sure what practical difference this PR makes against production. They seem to give usually the same results, which is good... But what search exactly did you compare against production system that this pattern capture addresses?

For example, I noticed a possible problem for DOIs. The DOI values also contain slashes. It is interesting to compare a search like /api/records/?q=10.7483 which gives thousands of records on PROD but only dozens with this PR. I think we have about 2,000 DOIs minted, so we should find more here? Note the dot...

Copy link
Member Author

@ParthS007 ParthS007 Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, addresses is more apt here. 👍

  2. I have to add the other fields in the mappings with this type of analyzer, searching of / without quotes will work only when we are done with query parsing on python side before sending it to ES.

  3. After testing both environments in my machine, I see their is a difference. Here I tested only with cms-primary-datasets:

For example:

  • q -> /Btau gives error on both.

  • q -> //Btau gives 4 results on prod in local env with 3 rec
    -> //Btau gives 4 results on dev in local env with 3 rec but the order is different

  1. doi was not present in the previous mapping and were mapped by default settings in ES. Currently I have set it to keyword so exact search will work similar to what is their in zenodo search.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WRT 4, the worry is that 10.7483 will find much less records than it should... The default mapping seems to work better there. Ideally, searching for 10.7483 should find all records that have DOI minted (of the form 10.7483/OPENDATA.CMS.HWS8.QPHJ and the like). The default mapping does it better, so I guess we need to address this as part of this PR.

BTW looking for exact DOI is working locally for me (does not lead to query error), but returning too much stuff:

doi-search

@ParthS007 ParthS007 changed the title search: add analyzer for exact title search search: add analyzer for title search Feb 5, 2021
@ParthS007 ParthS007 requested review from tiborsimko and removed request for diegodelemos February 5, 2021 15:45
Copy link
Member

@tiborsimko tiborsimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, works well locally

@tiborsimko tiborsimko merged commit f6c8d7f into cernopendata:master Feb 5, 2021
@ParthS007 ParthS007 deleted the 2930 branch February 18, 2021 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants