-
Notifications
You must be signed in to change notification settings - Fork 636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DYN-7365 Updating Lucene Search Algorithm #15473
DYN-7365 Updating Lucene Search Algorithm #15473
Conversation
Now when indexing node.Name if has empty space then will be removed. Added two different types of Wildqueries using wildcards and changed weights so we will be getting better results. Removed the Search by Category by empty space because it was polluting the other results. Also refactored the Search code removing redundant code and not used code.
Updating the Weight for the Fields: Description, SearchKeywords due that when searching a term like "combine" there are several nodes that contain the word "combine" in the SearchKeywords and those nodes are above than the expected node (List.Combine). Also I've added several test that validates the search for specific nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the ticket for this pull request: https://jira.autodesk.com/browse/DYN-7365
|
||
/// <summary> | ||
/// Search tags matching weight | ||
/// </summary> | ||
internal static int SearchTagsWeight = 6; | ||
internal static int SearchTagsWeight = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you speak to the reason to lower search tag weight here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the main problem that currently we have in Lucene Search is that we are getting some not related nodes in the search results ( you will notice it when checking that the SearchTerm is not in the Name or Category). The reason is due to the Description and SearchKeyword fields contain the SearchTerm and are affecting in some way the results.
I've found some cases like for example when you search for "combine", you will find some nodes (like Union, Difference and Concat - see image below) which has that word in the SearchKeyword/Description fields but those nodes are above nodes like CombinePath or ByCombinedTSpline which have the word "combine" in the Name. The logical thinking says that if the Name field has more weight than the Description/SearchKeywords fields then the nodes with the SearchTerm in the Name should appear upper but there is a term called Similarity(idf) in Lucene that is breaking that logic and basically we can say that is a process hapenning during indexing/quering that calculates a factor that is generated based in the repetitions of the word ("combine") in a specific Field (inside a Document), so right now the only way to fight against the Similarity is by decreasing the Weight for some fields, in this case that why I decreased the SearchTagsWeight and SearchDescriptionWeight values.
Probably in the future we will need to reimplement the Lucene algorithm that calculates the idf factor based in the repetitions of the word and the field in which was found. For example if we find that the word "point" is found in a hundred of Documents specifically in the field SearchKeywords we decrease the factor by 2 or by 4 but if is found in the field Name we don't alter the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with some comments, also do we need to update our wiki about this newer approach?
isWildcard = false; | ||
termText = searchTerm; | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I am getting this right, the more the boostOffset the less the boost will be, since we are subtracting the offset from the set weight for that field.
That means, num
> num*
> *num
> *num*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you are right. For example if you search for "number" just for the Name field we will be creating the next reg-ex Wildcardqueries (thinking that LuceneConfig.SearchNameWeight = 10 and LuceneConfig.WildcardsSearchNameWeight = 7):
Name:number - 10
Name:number* - 6
Name:*number - 5
Name:number - 4
Of course, this can be updated if you disagree with the assigned weight
From the GIF, when searching for |
Well in this case is a targered search, with typing "list.create" we are saying the we want a node which has "create" in the Name and belongs to the "List" category, for the nodes below that one we don't have too much control. |
Changing enums to be internal (instead of public) and adding more comments for making ideas more clear.
Yes, the wiki needs to be update to reflect changes in the weight and also for describing the Wildcardqueries added with reg exp |
UI Smoke TestsTest: success. 11 passed, 0 failed. |
@zeusongit PTAL again and let me know if you are good with the latest changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one comment still not clear
Thanks, let's catch up tomorrow with @RobertGlobant20 to discuss before merging |
@RobertGlobant20 Can you look at the latest regressions: |
I've updated two tests due that now that the Search algorithm was updated the number of results or the results orders are changing.
Moving piece of code before switch
With the latest updates in the Search algorithm now when searching for "list create" the List.Create node is at the second position, so I had to modify the test according to the results.
Purpose
Re-factoring parts of the Search algorithm and adding more unit tests.
Now when indexing node.Name if has empty space then will be removed (so when the user enter a search criteria with empty space will be matching more precisely the nodes indexed).
Added two different types of WildQueries using wildcards and changed weights so we will be getting better results.
Removed the Search by Category by empty space because it was polluting the other results we should re-consider if really is needed and if is needed we need to make a huge testing of several cases.
Also refactored the Search code removing redundant code, not used code and moved some code to other methods making the code more readable.
I've updated the Weight for the Fields: Description, SearchKeywords due that when searching a term like "combine" there are several nodes that contain the word "combine" in the SearchKeywords and those nodes are above than the expected node (List.Combine).
Also I've added several test that validates the search for the next specific nodes:
"point at parameter"
"select model element"
"family name"
"string"
"list.create" "list create"
"combine"
"translate"
Declarations
Check these if you believe they are true
*.resx
filesRelease Notes
Re-factoring parts of the Search algorithm and adding more unit tests.
Reviewers
@QilongTang @zeusongit
FYIs