Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaping backslashes; blacklisting gene name recognition #272

Closed
wants to merge 2 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions src/server/pathway-commons/search/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ const getHGNCData = require('./hgnc');
const sanitize = (s) => {
// Escape (with '\'), to treat them literally, symbols, such as '*', ':', or space,
// which otherwise play special roles in a Lucene query string.
return s.replace(/([\!\*\+\-\&\|\(\)\[\]\{\}\^\~\?\:\/\\"\s])/g, '\\$1')
return s.replace(/([\!\*\+\-\&\|\(\)\[\]\{\}\^\~\?\:\/\\"\s])/g, '\\\\$1')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so could you give an example of what the query will look like after this?
the goal is that backslashes should be added to the query?

Copy link
Member Author

@jvwong jvwong Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For input 'mu-type opioid' the 'queries' would be the array:

[ '(name:mu\\\\-type\\\\ opioid) OR (name:*mu\\\\-type\\\\ opioid*) OR (name:*mu\\\\-type* AND name:*opioid*)',
  '(name:*mu\\\\-type* OR name:*opioid*)',
  'mu-type opioid' ]

which, in turn, the 'pathway-common' npm package search function code interprets as:
['(name:mu\\-type\\ opioid) OR (name:*mu\\-type\\ opioid*) OR (name:*mu\\-type* AND name:*opioid*)',
...
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the pathway-commons npm package does this already, should we just not use sanitize?

Does it work if you just make sanitize return the original string?

Copy link
Member

@IgorRodchenkov IgorRodchenkov Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with line 8 change - should stay '\\$1', which e.g., replaces:


- ' ' with '\ '
- '\' with '\\'
- ':' with '\:'
- '  ' with '\ \ '
- etc.

which was correct
(and pathway-commons js client should not do any other tricks but URL-encoding query parameters...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then it should be changed in the pathway-commons module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe there is some weird behaviour where qs.stringify and the escapeLucene functions are both being run multiple times on the query... maybe it might make sense to move away from using the pathway-commons npm package.

Copy link
Member

@d2fong d2fong Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Max and Igor's proposed new string:

s.replace(/([\/\\!*+-&":|()\[\]{}\^\~\?\s])/g, '\\$1')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manfred lib changes:

  • add escape spaces for escapeLucene
  • do not escapeLucene if toggle is undefined

https://github.com/PathwayCommons/cpath2-client/blob/master/src/search.js#L102

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

app changes:

  • needs to send toggle to the lib to determine if the toggle should be set to true or not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've (with @maxkfranz) got an idea; see PathwayCommons/cpath2-client#12.

};

const processPhrase = (phrase, collection) => {
Expand All @@ -16,15 +16,21 @@ const processPhrase = (phrase, collection) => {
'refseq'
];

// Do NOT recognize these gene symbols
const symbolNameBlackList = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it not work without this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch - then for that file, line 15 should be AND rather than OR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'CELL'
];

const tokens = phrase.split(/\s+/g);

return tokens
.map(token => {
//if symbol is recognized by at least one source
const blackListed = symbolNameBlackList.indexOf(token.toUpperCase()) > -1;
const recognized = sourceList.some(source => utilities.sourceCheck(source, token))
|| collection.has(token.toUpperCase());
const sanitized = sanitize(token);
return recognized ? ( 'xrefid:' + sanitized ) : ( 'name:' + '*' + sanitized + '*' );
return !blackListed && recognized ? ( 'xrefid:' + sanitized ) : ( 'name:' + '*' + sanitized + '*' );
});
};

Expand Down