-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Escaping backslashes; blacklisting gene name recognition #272
Conversation
jvwong
commented
Nov 30, 2017
- The backslashes have to be escaped due to 'pathway-commons' npm package escape of backslashes.
- The 'blacklist' was present in pathways-search but removed, but I put it back because hgnSymbols contains stupid gene symbols alternative names (e.g. 'cell')
@@ -16,15 +16,21 @@ const processPhrase = (phrase, collection) => { | |||
'refseq' | |||
]; | |||
|
|||
// Do NOT recognize these gene symbols | |||
const symbolNameBlackList = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blacklist is checked in here:
https://github.com/PathwayCommons/app-ui/blob/development/src/server/pathway-commons/search/hgnc.js#L4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it not work without this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch - then for that file, line 15 should be AND rather than OR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -5,7 +5,7 @@ const getHGNCData = require('./hgnc'); | |||
const sanitize = (s) => { | |||
// Escape (with '\'), to treat them literally, symbols, such as '*', ':', or space, | |||
// which otherwise play special roles in a Lucene query string. | |||
return s.replace(/([\!\*\+\-\&\|\(\)\[\]\{\}\^\~\?\:\/\\"\s])/g, '\\$1') | |||
return s.replace(/([\!\*\+\-\&\|\(\)\[\]\{\}\^\~\?\:\/\\"\s])/g, '\\\\$1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so could you give an example of what the query will look like after this?
the goal is that backslashes should be added to the query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For input 'mu-type opioid' the 'queries' would be the array:
[ '(name:mu\\\\-type\\\\ opioid) OR (name:*mu\\\\-type\\\\ opioid*) OR (name:*mu\\\\-type* AND name:*opioid*)',
'(name:*mu\\\\-type* OR name:*opioid*)',
'mu-type opioid' ]
which, in turn, the 'pathway-common' npm package search function code interprets as:
['(name:mu\\-type\\ opioid) OR (name:*mu\\-type\\ opioid*) OR (name:*mu\\-type* AND name:*opioid*)',
...
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if the pathway-commons npm package does this already, should we just not use sanitize?
Does it work if you just make sanitize return the original string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with line 8 change - should stay '\\$1'
, which e.g., replaces:
- ' ' with '\ '
- '\' with '\\'
- ':' with '\:'
- ' ' with '\ \ '
- etc.
which was correct
(and pathway-commons js client should not do any other tricks but URL-encoding query parameters...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then it should be changed in the pathway-commons module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe there is some weird behaviour where qs.stringify and the escapeLucene functions are both being run multiple times on the query... maybe it might make sense to move away from using the pathway-commons npm package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Max and Igor's proposed new string:
s.replace(/([\/\\!*+-&":|()\[\]{}\^\~\?\s])/g, '\\$1')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manfred lib changes:
- add escape spaces for escapeLucene
- do not escapeLucene if toggle is undefined
https://github.com/PathwayCommons/cpath2-client/blob/master/src/search.js#L102
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
app changes:
- needs to send toggle to the lib to determine if the toggle should be set to true or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've (with @maxkfranz) got an idea; see PathwayCommons/cpath2-client#12.
I fixed the hgnc filtering bug here 5defbdd |