Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DYN-6061 search missing nodes #14258

Merged
merged 2 commits into from
Aug 15, 2023
Merged

DYN-6061 search missing nodes #14258

merged 2 commits into from
Aug 15, 2023

Conversation

RobertGlobant20
Copy link
Contributor

Purpose

Fixing missing nodes in Lucene nodes search
Due that we were using a StandardAnalyzer for Lucene searching we were not able to find some nodes in the results like "+", "*", "And". So I had to implement a Custom Analyzer so we will be using a specific Tokenizer that supports those special characters in the search term.
Also I've added a Unit Test that validates if a set of specific nodes are found or not.

Declarations

Check these if you believe they are true

  • The codebase is in a better state after this PR
  • Is documented according to the standards
  • The level of testing this PR includes is appropriate
  • User facing strings, if any, are extracted into *.resx files
  • All tests pass using the self-service CI.
  • Snapshot of UI changes, if any.
  • Changes to the API follow Semantic Versioning and are documented in the API Changes document.
  • This PR modifies some build requirements and the readme is updated
  • This PR contains no files larger than 50 MB

Release Notes

Fixing missing nodes in Lucene nodes search

Reviewers

@QilongTang @reddyashish

FYIs

Due that we were using a StandardAnalyzer for Lucene searching  we were not able to find some nodes in the results like "+", "*", "And". So I had to implement a Custom Analyzer so we will be using a specific Tokenizer that supports those special characters in the search term.
Also I've added a Unit Test that validates if a set of specific nodes are found or not.
@RobertGlobant20
Copy link
Contributor Author

Screenshot showing the result before my fix and after my fix:
image

@QilongTang QilongTang added this to the 2.19.0 milestone Aug 14, 2023
case "pt-BR":
return new BrazilianAnalyzer(LuceneConfig.LuceneNetVersion);
case "ru-RU":
return new RussianAnalyzer(LuceneConfig.LuceneNetVersion);
default:
return new StandardAnalyzer(LuceneConfig.LuceneNetVersion);
return new LuceneCustomAnalyzer(LuceneConfig.LuceneNetVersion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @RobertGlobant20 Why do we apply LuceneCustomAnalyzer only on certain cases? How about other languages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem is happening due that we are using the StandarAnalyzer (this use english by default), that removes some the english words like "a", "an", "and", "are", "as", "at", "be", "but", "by", "And", "Or", "If" and also some specific symbols like "+", "*", so when we execute a query those words/symbols are removed (or escaped) and the node won't listed in the results.

Not sure if for other languages this change will apply or not but we can test if is needed (e.g. if the language is Chinese Simplified probably will remove “并且”、“或者”、“如果” and I guess that the nodes "And", "Or", "If" are not translated (do they?) so there is not problem but we need to check with "+", "-", "*" nodes.

For doing this test we need to switch to Dynamo Chinese language and then create a package with several nodes in Chinese and then search the nodes "+", "And", "*" and also for the nodes in "Chinese", I will test this case tomorrow to see the behavior.

Let me know your thoughts about it.
Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, make sense

@QilongTang QilongTang merged commit 06beb16 into DynamoDS:master Aug 15, 2023
@QilongTang
Copy link
Contributor

@RobertGlobant20 Would you cherry-pick this?

QilongTang pushed a commit that referenced this pull request Aug 15, 2023
Due that we were using a StandardAnalyzer for Lucene searching  we were not able to find some nodes in the results like "+", "*", "And". So I had to implement a Custom Analyzer so we will be using a specific Tokenizer that supports those special characters in the search term.
Also I've added a Unit Test that validates if a set of specific nodes are found or not.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants