DYN-7365 Updating Lucene Search Algorithm #15473

RobertGlobant20 · 2024-09-09T22:46:28Z

Purpose

Re-factoring parts of the Search algorithm and adding more unit tests.

When indexing node.Name, any whitespace is now removed, so a search term that contains spaces matches the indexed nodes more precisely.
Added two different types of wildcard queries and changed the weights so that we get better results.
Removed the search by category for terms containing whitespace because it was polluting the other results. We should reconsider whether it is really needed; if it is, we need to test it extensively across several cases.
Also refactored the search code: removed redundant and unused code, and moved some code into other methods to make it more readable.
I've updated the weights for the Description and SearchKeywords fields because, when searching for a term like "combine", several nodes that contain the word "combine" in SearchKeywords were ranked above the expected node (List.Combine).
I've also added several tests that validate the search for the following specific nodes:

"point at parameter"
"select model element"
"family name"
"string"
"list.create" "list create"
"combine"
"translate"

Declarations

Check these if you believe they are true

The codebase is in a better state after this PR
Is documented according to the standards
The level of testing this PR includes is appropriate
User facing strings, if any, are extracted into *.resx files
All tests pass using the self-service CI.
Snapshot of UI changes, if any.
Changes to the API follow Semantic Versioning and are documented in the API Changes document.
This PR modifies some build requirements and the readme is updated
This PR contains no files larger than 50 MB

Release Notes

Re-factoring parts of the Search algorithm and adding more unit tests.

Reviewers

@QilongTang @zeusongit

FYIs

When indexing node.Name, any whitespace is now removed. Added two different types of wildcard queries and changed the weights so that we get better results. Removed the search by category for terms containing whitespace because it was polluting the other results. Also refactored the search code, removing redundant and unused code.

Updated the weights for the Description and SearchKeywords fields because, when searching for a term like "combine", several nodes that contain the word "combine" in SearchKeywords were ranked above the expected node (List.Combine). I've also added several tests that validate the search for specific nodes.
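
As an illustration of the whitespace handling described above, here is a minimal Python sketch (not the Dynamo C# code). It assumes that both the indexed node name and the search term have their whitespace stripped and are lower-cased before comparison; `normalize`, `build_name_index`, and `lookup` are hypothetical helper names, and the node names are only examples.

```python
# Hypothetical sketch: strip whitespace from node names at index time and
# from the search term at query time, so "point at parameter" lines up with
# a name indexed as "pointatparameter". Lower-casing is an assumption.

def normalize(text: str) -> str:
    """Remove all whitespace and lower-case the text."""
    return "".join(text.split()).lower()


def build_name_index(node_names):
    """Map each normalized name to the original node name."""
    return {normalize(name): name for name in node_names}


def lookup(index, search_term):
    """Return the node whose normalized name equals the normalized term."""
    return index.get(normalize(search_term))


# Example node names (illustrative only).
index = build_name_index(
    ["Point At Parameter", "Select Model Element", "Family Name"]
)
match = lookup(index, "point at parameter")
```

With this normalization, a term typed with or without spaces resolves to the same indexed key, which is the behaviour the description aims for.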

github-actions

See the ticket for this pull request: https://jira.autodesk.com/browse/DYN-7365

RobertGlobant20 · 2024-09-09T22:47:09Z

This is a GIF showing the test cases described in the Jira task.

QilongTang · 2024-09-12T15:13:46Z

src/DynamoCore/Configuration/LuceneConfig.cs


        /// <summary>
        /// Search tags matching weight
        /// </summary>
-        internal static int SearchTagsWeight = 6;
+        internal static int SearchTagsWeight = 4;


Can you speak to the reason to lower search tag weight here?

RobertGlobant20 · 2024-09-13

Well, the main problem we currently have in Lucene Search is that we get some unrelated nodes in the search results (you will notice it when the search term is in neither the Name nor the Category). The reason is that the Description and SearchKeywords fields contain the search term and affect the results.

I've found some cases. For example, when you search for "combine", you will find some nodes (like Union, Difference and Concat; see the image below) that have that word in the SearchKeywords/Description fields, yet those nodes rank above nodes like CombinePath or ByCombinedTSpline, which have the word "combine" in the Name. Logically, if the Name field has more weight than the Description/SearchKeywords fields, then nodes with the search term in the Name should rank higher. However, Lucene's Similarity includes an idf term that breaks that logic: during indexing/querying it computes a factor based on the repetitions of the word ("combine") in a specific field (inside a document). Right now the only way to counter the Similarity is to decrease the weight of some fields, which is why I decreased the SearchTagsWeight and SearchDescriptionWeight values.

In the future we will probably need to reimplement the Lucene algorithm that calculates the idf factor so that it takes into account both the repetitions of the word and the field in which it was found. For example, if the word "point" is found in a hundred documents, specifically in the SearchKeywords field, we would decrease the factor by 2 or by 4, but if it is found in the Name field we would not alter the value.
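
The interaction between field weights and the similarity factor can be sketched numerically. The Python snippet below is a simplified approximation of classic TF-IDF scoring (boost × sqrt(tf) × idf², ignoring length norms and query normalization), not Lucene's actual implementation. The document counts are invented for illustration; the Name weight of 10 is the value assumed later in this thread, and 6 and 4 are the old and new SearchTagsWeight values from the diff.

```python
import math

# Simplified, hypothetical scoring model. All document counts are made up.

def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Textbook classic idf: rarer terms in a field get a larger factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_score(boost: float, term_freq: int, num_docs: int, doc_freq: int) -> float:
    """Approximate contribution of one field match to the document score."""
    return boost * math.sqrt(term_freq) * classic_idf(num_docs, doc_freq) ** 2


NUM_DOCS = 1000  # hypothetical index size

# Assume "combine" occurs in the Name field of 200 documents (common there)
# but in the SearchKeywords field of only 60 documents (rarer there).
name_score = field_score(boost=10, term_freq=1, num_docs=NUM_DOCS, doc_freq=200)
tags_score_old = field_score(boost=6, term_freq=1, num_docs=NUM_DOCS, doc_freq=60)
tags_score_new = field_score(boost=4, term_freq=1, num_docs=NUM_DOCS, doc_freq=60)

print(f"Name match (weight 10):          {name_score:.1f}")
print(f"SearchKeywords match (weight 6): {tags_score_old:.1f}")
print(f"SearchKeywords match (weight 4): {tags_score_new:.1f}")
```

Under these assumed counts, the SearchKeywords match outranks the Name match at weight 6 and falls below it at weight 4, which is the kind of effect the weight change is meant to achieve. With other counts the lower weight may still not be enough, which is consistent with the suggestion to revisit the idf calculation later.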

src/DynamoCore/Utilities/LuceneSearchUtility.cs

QilongTang

LGTM with some comments, also do we need to update our wiki about this newer approach?

zeusongit · 2024-09-12T17:07:47Z

src/DynamoCore/Utilities/LuceneSearchUtility.cs

+                    isWildcard = false;
+                    termText = searchTerm;
+                    break;
+            }


So if I am getting this right, the more the boostOffset the less the boost will be, since we are subtracting the offset from the set weight for that field.
That means, num > num* > *num > *num* ?

RobertGlobant20 · 2024-09-13

Yes, you are right. For example, if you search for "number", then for the Name field alone we create the following wildcard queries (assuming LuceneConfig.SearchNameWeight = 10 and LuceneConfig.WildcardsSearchNameWeight = 7):

Name:number - 10
Name:number* - 6
Name:*number - 5
Name:*number* - 4

Of course, this can be updated if you disagree with the assigned weight
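
The weights in the reply above can be reproduced with a short Python sketch. It assumes the exact query keeps the full field weight while each wildcard variant takes the wildcard weight minus a boost offset of 1, 2, or 3; `name_queries` is a hypothetical helper, not a function from the Dynamo code base.

```python
# Hypothetical sketch of the boost-offset scheme discussed above.
SEARCH_NAME_WEIGHT = 10           # value assumed in the reply above
WILDCARDS_SEARCH_NAME_WEIGHT = 7  # value assumed in the reply above


def name_queries(term,
                 exact_weight=SEARCH_NAME_WEIGHT,
                 wildcard_weight=WILDCARDS_SEARCH_NAME_WEIGHT):
    """Return (query string, boost) pairs for the Name field."""
    # The exact match keeps the full field weight.
    queries = [(f"Name:{term}", exact_weight)]
    # Prefix, suffix, and "contains" patterns; a larger offset means a lower boost.
    wildcard_patterns = [f"{term}*", f"*{term}", f"*{term}*"]
    for boost_offset, pattern in enumerate(wildcard_patterns, start=1):
        queries.append((f"Name:{pattern}", wildcard_weight - boost_offset))
    return queries


queries = name_queries("number")
for query, boost in queries:
    print(f"{query} - {boost}")
```

Keeping the offsets in one ordered list makes the ranking num > num* > *num > *num* explicit and easy to adjust if the assigned weights change.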

zeusongit · 2024-09-12T17:10:48Z

From the GIF, when searching for List.Create, the Range and Sequence nodes are in second and third place, while other nodes with List in their name are below them. I thought lowering the tag and description weights should fix that, shouldn't it?
The same case can also be seen in 3.3, so what changed here?

RobertGlobant20 · 2024-09-13T20:29:18Z

From the GIF, when searching for List.Create, the Range and Sequence nodes are in second and third place, while other nodes with List in their name are below them. I thought lowering the tag and description weights should fix that, shouldn't it? The same case can also be seen in 3.3, so what changed here?

Well, this case is a targeted search: by typing "list.create" we are saying that we want a node which has "create" in the Name and belongs to the "List" category. For the nodes below that one we don't have much control.
The Range and Sequence nodes belong to the List category, as do all the nodes listed below them (as you can see in the tester app, they all have the same score except List.Create). Lowering the tag and description weights helps when searching with a term that contains spaces and one of the words matches several nodes with a similar Name (e.g. "point at", where several nodes have the word "point" in the Name).
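
The tie described above can be illustrated with a small Python sketch. It assumes a targeted term is split at the first dot into a category part and a name part, and that a node's score is the sum of a category weight and a name weight; the weights, helper names, and the node list are hypothetical and do not come from the Dynamo code.

```python
# Hypothetical weights, chosen only to show the shape of the ranking.
CATEGORY_WEIGHT = 5
NAME_WEIGHT = 10


def split_targeted_term(term):
    """Split "list.create" into ("list", "create"); no dot means no category."""
    category, dot, name = term.partition(".")
    if not dot:
        return None, term
    return category, name


def score(node_category, node_name, term):
    """Add the category weight and the name weight for each part that matches."""
    category, name = split_targeted_term(term.lower())
    total = 0
    if category is not None and node_category.lower() == category:
        total += CATEGORY_WEIGHT
    if name and name in node_name.lower():
        total += NAME_WEIGHT
    return total


# Illustrative nodes as (category, name) pairs.
nodes = [("List", "Create"), ("List", "Range"),
         ("List", "Sequence"), ("Geometry", "Translate")]
scores = {f"{c}.{n}": score(c, n, "list.create") for c, n in nodes}
```

In this model only List.Create matches both parts, while every other node in the List category receives the same category-only score, so their relative order is not controlled by the query.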

Changing enums to be internal (instead of public) and adding more comments for making ideas more clear.

RobertGlobant20 · 2024-09-13T21:34:04Z

LGTM with some comments, also do we need to update our wiki about this newer approach?

Yes, the wiki needs to be updated to reflect the changes in the weights and to describe the wildcard queries that were added.

github-actions · 2024-09-13T21:54:06Z

UI Smoke Tests

Test: success. 11 passed, 0 failed.
TestComplete Test Result
Workflow Run: UI Smoke Tests
Check: UI Smoke Tests

QilongTang · 2024-09-17T17:47:13Z

@zeusongit PTAL again and let me know if you are good with the latest changes

zeusongit

LGTM with one comment still not clear

QilongTang · 2024-09-17T19:21:25Z

LGTM with one comment still not clear

Thanks, let's catch up tomorrow with @RobertGlobant20 to discuss before merging

QilongTang · 2024-09-17T19:59:06Z

@RobertGlobant20 Can you look at the latest regressions:
DynamoCoreWpfTests.NodeAutoCompleteSearchTests.SearchNodeAutocompletionSuggestions
Dynamo.Tests.SearchSideEffects.LuceneSearchNodesOrderingValidation

I've updated two tests because, now that the Search algorithm has been updated, the number of results or the order of the results has changed.

Moving piece of code before switch

With the latest updates to the Search algorithm, when searching for "list create" the List.Create node is now in second position, so I had to modify the test to match the results.

RobertGlobant20 added 3 commits September 4, 2024 13:24

Merge branch 'master' into DYN-7365-Updating-LuceneSearch-Algorithm

8a2b7bf

RobertGlobant20 requested review from QilongTang and zeusongit September 9, 2024 22:46

github-actions bot reviewed Sep 9, 2024

QilongTang added this to the 3.4 milestone Sep 12, 2024

QilongTang reviewed Sep 12, 2024

src/DynamoCore/Utilities/LuceneSearchUtility.cs (outdated)

QilongTang reviewed Sep 12, 2024

src/DynamoCore/Utilities/LuceneSearchUtility.cs (outdated)

QilongTang reviewed Sep 12, 2024

src/DynamoCore/Utilities/LuceneSearchUtility.cs

QilongTang reviewed Sep 12, 2024

zeusongit reviewed Sep 12, 2024

src/DynamoCore/Utilities/LuceneSearchUtility.cs (outdated)

zeusongit reviewed Sep 12, 2024

src/DynamoCore/Utilities/LuceneSearchUtility.cs (outdated)

DYN-7365 Updating Lucene Search Algorithm

40697ac

Changing enums to be internal (instead of public) and adding more comments for making ideas more clear.

QilongTang approved these changes Sep 17, 2024

QilongTang requested a review from zeusongit September 17, 2024 17:46

zeusongit approved these changes Sep 17, 2024

RobertGlobant20 added 3 commits September 17, 2024 15:45

DYN-7365 Updating Lucene Search Algorithm

83b1052

I've updated two tests because, now that the Search algorithm has been updated, the number of results or the order of the results has changed.

DYN-7365 Updating Lucene Search Algorithm Code Review

31ad32b

Moving piece of code before switch

DYN-7365 Updating Lucene Search Algorithm - Regression

cd52202

With the latest updates to the Search algorithm, when searching for "list create" the List.Create node is now in second position, so I had to modify the test to match the results.

QilongTang merged commit f1b9cc7 into DynamoDS:master Sep 18, 2024
24 checks passed
