Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DYN-7365 Updating Lucene Search Algorithm #15473

Conversation

RobertGlobant20
Copy link
Contributor

@RobertGlobant20 RobertGlobant20 commented Sep 9, 2024

Purpose

Re-factoring parts of the Search algorithm and adding more unit tests.

  • Now when indexing node.Name if has empty space then will be removed (so when the user enter a search criteria with empty space will be matching more precisely the nodes indexed).

  • Added two different types of WildQueries using wildcards and changed weights so we will be getting better results.

  • Removed the Search by Category by empty space because it was polluting the other results we should re-consider if really is needed and if is needed we need to make a huge testing of several cases.

  • Also refactored the Search code removing redundant code, not used code and moved some code to other methods making the code more readable.

  • I've updated the Weight for the Fields: Description, SearchKeywords due that when searching a term like "combine" there are several nodes that contain the word "combine" in the SearchKeywords and those nodes are above than the expected node (List.Combine).

  • Also I've added several test that validates the search for the next specific nodes:

"point at parameter"
"select model element"
"family name"
"string"
"list.create" "list create"
"combine"
"translate"

Declarations

Check these if you believe they are true

  • The codebase is in a better state after this PR
  • Is documented according to the standards
  • The level of testing this PR includes is appropriate
  • User facing strings, if any, are extracted into *.resx files
  • All tests pass using the self-service CI.
  • Snapshot of UI changes, if any.
  • Changes to the API follow Semantic Versioning and are documented in the API Changes document.
  • This PR modifies some build requirements and the readme is updated
  • This PR contains no files larger than 50 MB

Release Notes

Re-factoring parts of the Search algorithm and adding more unit tests.

Reviewers

@QilongTang @zeusongit

FYIs

Now when indexing node.Name if has empty space then will be removed.
Added two different types of Wildqueries using wildcards and changed weights so we will be getting better results.
Removed the Search by Category by empty space because it was polluting the other results.
Also refactored the Search code removing redundant code and not used code.
Updating the Weight for the Fields: Description, SearchKeywords due that when searching a term like "combine" there are several nodes that contain the word "combine" in the SearchKeywords and those nodes are above than the expected node (List.Combine).

Also I've added several test that validates the search for specific nodes.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the ticket for this pull request: https://jira.autodesk.com/browse/DYN-7365

@RobertGlobant20
Copy link
Contributor Author

This is a GIF showing the test cases described in the Jira task.
Revit_50sk8eemp9

@QilongTang QilongTang added this to the 3.4 milestone Sep 12, 2024

/// <summary>
/// Search tags matching weight
/// </summary>
internal static int SearchTagsWeight = 6;
internal static int SearchTagsWeight = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you speak to the reason to lower search tag weight here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the main problem that currently we have in Lucene Search is that we are getting some not related nodes in the search results ( you will notice it when checking that the SearchTerm is not in the Name or Category). The reason is due to the Description and SearchKeyword fields contain the SearchTerm and are affecting in some way the results.

I've found some cases like for example when you search for "combine", you will find some nodes (like Union, Difference and Concat - see image below) which has that word in the SearchKeyword/Description fields but those nodes are above nodes like CombinePath or ByCombinedTSpline which have the word "combine" in the Name. The logical thinking says that if the Name field has more weight than the Description/SearchKeywords fields then the nodes with the SearchTerm in the Name should appear upper but there is a term called Similarity(idf) in Lucene that is breaking that logic and basically we can say that is a process hapenning during indexing/quering that calculates a factor that is generated based in the repetitions of the word ("combine") in a specific Field (inside a Document), so right now the only way to fight against the Similarity is by decreasing the Weight for some fields, in this case that why I decreased the SearchTagsWeight and SearchDescriptionWeight values.

Probably in the future we will need to reimplement the Lucene algorithm that calculates the idf factor based in the repetitions of the word and the field in which was found. For example if we find that the word "point" is found in a hundred of Documents specifically in the field SearchKeywords we decrease the factor by 2 or by 4 but if is found in the field Name we don't alter the value.

image

Copy link
Contributor

@QilongTang QilongTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some comments, also do we need to update our wiki about this newer approach?

isWildcard = false;
termText = searchTerm;
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I am getting this right, the more the boostOffset the less the boost will be, since we are subtracting the offset from the set weight for that field.
That means, num > num* > *num > *num* ?

Copy link
Contributor Author

@RobertGlobant20 RobertGlobant20 Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right. For example if you search for "number" just for the Name field we will be creating the next reg-ex Wildcardqueries (thinking that LuceneConfig.SearchNameWeight = 10 and LuceneConfig.WildcardsSearchNameWeight = 7):

Name:number - 10
Name:number* - 6
Name:*number - 5
Name:number - 4

Of course, this can be updated if you disagree with the assigned weight

@zeusongit
Copy link
Contributor

From the GIF, when searching for List.Create the Range and Sequence node is at second and third place while there are other nodes with List in their name below it, I thought lowering the tag and description weight should fix that, isn't it?
The same case can also be seen in 3.3, so what changed here?

@RobertGlobant20
Copy link
Contributor Author

From the GIF, when searching for List.Create the Range and Sequence node is at second and third place while there are other nodes with List in their name below it, I thought lowering the tag and description weight should fix that, isn't it? The same case can also be seen in 3.3, so what changed here?

Well in this case is a targered search, with typing "list.create" we are saying the we want a node which has "create" in the Name and belongs to the "List" category, for the nodes below that one we don't have too much control.
The nodes Range and Sequence belong to the List category as all the nodes listed below (as you can see in the tester app all has the same score except List.Create). Lowering the tag and description weight helps when searching nodes with empty space in which one of the words has several nodes with similar Name (e.g. "point at" which has several nodes with the word "point" in the Name ).

image

Changing enums to be internal (instead of public) and adding more comments for making ideas more clear.
@RobertGlobant20
Copy link
Contributor Author

LGTM with some comments, also do we need to update our wiki about this newer approach?

Yes, the wiki needs to be update to reflect changes in the weight and also for describing the Wildcardqueries added with reg exp

Copy link

github-actions bot commented Sep 13, 2024

UI Smoke Tests

Test: success. 11 passed, 0 failed.
TestComplete Test Result
Workflow Run: UI Smoke Tests
Check: UI Smoke Tests

@QilongTang
Copy link
Contributor

@zeusongit PTAL again and let me know if you are good with the latest changes

Copy link
Contributor

@zeusongit zeusongit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment still not clear

@QilongTang
Copy link
Contributor

LGTM with one comment still not clear

Thanks, let's catch up tomorrow with @RobertGlobant20 to discuss before merging

I've updated two tests due that now that the Search algorithm was updated the number of results or the results orders are changing.
Moving piece of code before switch
With the latest updates in the Search algorithm now when searching for "list create" the List.Create node is at the second position, so I had to modify the test according to the results.
@QilongTang QilongTang merged commit f1b9cc7 into DynamoDS:master Sep 18, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants