-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
search: add analyzer for title search #3033
Conversation
|
"type": "pattern_capture", | ||
"preserve_original": true, | ||
"patterns": [ | ||
"(/[a-zA-Z-_0-9]+)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot "close" the referenced issue, since this PR does not touch forward slashes. We can use "addresses" though.
BTW note that we have slashes in other fields, e.g. description. They would need the same treatment, so more fields would be treated the same WRT forward slashes.
But I'm not really sure what practical difference this PR makes against production. They seem to give usually the same results, which is good... But what search exactly did you compare against production system that this pattern capture addresses?
For example, I noticed a possible problem for DOIs. The DOI values also contain slashes. It is interesting to compare a search like /api/records/?q=10.7483
which gives thousands of records on PROD but only dozens with this PR. I think we have about 2,000 DOIs minted, so we should find more here? Note the dot...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Yes,
addresses
is more apt here. 👍 -
I have to add the other fields in the mappings with this type of analyzer, searching of
/
without quotes will work only when we are done with query parsing on python side before sending it to ES. -
After testing both environments in my machine, I see their is a difference. Here I tested only with
cms-primary-datasets
:
For example:
-
q ->
/Btau
gives error on both. -
q ->
//Btau
gives 4 results on prod in local env with 3 rec
->//Btau
gives 4 results on dev in local env with 3 rec but the order is different
doi
was not present in the previous mapping and were mapped by default settings in ES. Currently I have set it tokeyword
so exact search will work similar to what is their in zenodo search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WRT 4, the worry is that 10.7483 will find much less records than it should... The default mapping seems to work better there. Ideally, searching for 10.7483
should find all records that have DOI minted (of the form 10.7483/OPENDATA.CMS.HWS8.QPHJ
and the like). The default mapping does it better, so I guess we need to address this as part of this PR.
BTW looking for exact DOI is working locally for me (does not lead to query error), but returning too much stuff:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, works well locally
addresses #2930