Deprecate and remove camel-case nGram and edgeNGram tokenizers #50862

cbuescher · 2020-01-10T16:45:15Z

We already deprecated and removed the camel-case versions of the nGram and edgeNGram filters
a while ago and we should do the same with the nGram and edgeNGram tokenizers.
This PR deprecates the use of these names in favour of ngram and edge_ngram in 7
and disallows usage in new indices starting with 8. The deprecation part will be
backported to 7.6.

Closes #50561

elasticmachine · 2020-01-10T16:45:18Z

Pinging @elastic/es-search (:Search/Analysis)

mayya-sharipova · 2020-01-10T20:20:30Z

@cbuescher Is the intent of this PR to forbid creating an index with the nGram tokenizer?

If it is, then it doesn't seem to work. With your patch, I can still successfully create an index with the nGram tokenizer:

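# Assumes a local node listening on the default port 9200
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d '
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{
               "type":"custom",
               "tokenizer":"nGram"
            }
         }
      }
   },
   "mappings":{
      "properties":{
         "title":{
            "type":"text",
            "analyzer":"my_analyzer"
         }
      }
   }
}'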

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

cbuescher · 2020-01-13T09:32:21Z

Is the intent of this PR to forbid creating an index with the nGram tokenizer?

Yes, but because analyzers and the tokenizers they contain are lazy-loaded, we can't catch this merely when the analysis settings are defined. However, the warnings/errors should be triggered as soon as the defined analyzer is used, either through the "_analyze" API or by indexing something into the field.
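To illustrate the lazy-loading point, here is a minimal, self-contained sketch of a name check that only runs on first use. It is not the Elasticsearch implementation: the class `LazyTokenizerCheck`, its methods, the warning list, and the message texts are all hypothetical. Only the name pairs (nGram/ngram, edgeNGram/edge_ngram) and the "warn in 7, reject in 8" behaviour come from this PR's description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
class LazyTokenizerCheck {
    // Deprecated camel-case tokenizer names and their replacements, as named in this PR.
    static final Map<String, String> RENAMED = Map.of("nGram", "ngram", "edgeNGram", "edge_ngram");

    // Collected deprecation warnings (a stand-in for a real deprecation logger).
    static final List<String> WARNINGS = new ArrayList<>();

    // Resolve a tokenizer name for an index created on the given major version.
    static String resolve(String name, int indexCreatedMajor) {
        String replacement = RENAMED.get(name);
        if (replacement == null) {
            return name; // not a deprecated name, nothing to do
        }
        if (indexCreatedMajor >= 8) {
            // New indices starting with 8: the camel-case name is rejected.
            throw new IllegalArgumentException(
                "The [" + name + "] tokenizer name was removed, use [" + replacement + "] instead");
        }
        // Indices created on 7: accept the name but record a deprecation warning.
        WARNINGS.add("The [" + name + "] tokenizer name is deprecated, use [" + replacement + "] instead");
        return replacement;
    }

    // Defining an analyzer only stores a supplier; the name check runs on first use.
    static Supplier<String> define(String name, int indexCreatedMajor) {
        return () -> resolve(name, indexCreatedMajor);
    }

    public static void main(String[] args) {
        Supplier<String> onSeven = define("nGram", 7);
        Supplier<String> onEight = define("edgeNGram", 8); // definition itself succeeds
        System.out.println("warnings after definition: " + WARNINGS.size());
        System.out.println("resolved on 7: " + onSeven.get());
        System.out.println("warnings after first use: " + WARNINGS.size());
        try {
            onEight.get(); // the error only surfaces here, on first use
        } catch (IllegalArgumentException e) {
            System.out.println("rejected on 8: " + e.getMessage());
        }
    }
}
```

In this sketch the definition step never fails, which mirrors the index creation above being acknowledged; the warning or exception appears only once the supplier is invoked.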

mayya-sharipova · 2020-01-13T22:27:23Z

@cbuescher Thanks for the clarification, it makes sense now.

mayya-sharipova · 2020-01-13T22:28:23Z

...analysis-common/src/test/java/org/elasticsearch/analysis/common/EdgeNGramTokenizerTests.java

        {
-            try (IndexAnalyzers indexAnalyzers = buildAnalyzers(Version.CURRENT, "edgeNGram")) {
+            try (IndexAnalyzers indexAnalyzers = buildAnalyzers(
+                    VersionUtils.randomVersionBetween(random(), Version.V_7_3_0, VersionUtils.getPreviousVersion(Version.CURRENT)),


Instead of Version.CURRENT, should we use a specific version to make sure the test doesn't fail after v8.0?

mayya-sharipova

@cbuescher thanks, makes sense

cbuescher · 2020-01-14T08:37:37Z

@mayya-sharipova thanks for reviewing. I'm going to wait for #50908 to be merged, since that might change this PR and its tests slightly, and then rework it from there. If the changes are big, I might ping you again to take another look.

cbuescher · 2020-01-14T13:49:32Z

@elasticmachine update branch

elasticmachine · 2020-01-14T13:49:34Z

merge conflict between base and head

cbuescher · 2020-01-14T15:35:21Z

@mayya-sharipova with #50908 merged, your previous example now also warns/throws as soon as the custom analyzer is defined.

romseygeek · 2020-01-14T15:38:14Z

...alysis-common/src/test/java/org/elasticsearch/analysis/common/CommonAnalysisPluginTests.java

+        try (CommonAnalysisPlugin commonAnalysisPlugin = new CommonAnalysisPlugin()) {
+            Map<String, TokenizerFactory> tokenizers = createTestAnalysis(
+                    IndexSettingsModule.newIndexSettings("index", settings), settings, commonAnalysisPlugin).tokenizer;
+            TokenizerFactory tokenizerFactory = tokenizers.get(deprecatedName);


You shouldn't need to pull a factory and call create to get warnings anymore; they will be emitted when createTestAnalysis is called.

This tests prebuilt tokenizers directly, though. When they are used directly via the "_analyze" endpoint, for example, they don't appear in any analyzer, and createTestAnalysis doesn't trigger the warning if nothing in the analysis settings refers to them.

edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to elastic#50376, elastic#50862, elastic#43568

cbuescher added :Search Relevance/Analysis How text is split into tokens v8.0.0 v7.6.0 labels Jan 10, 2020

Adapt EdgeNGramTokenizerTests

fdf34e3

mayya-sharipova self-requested a review January 10, 2020 20:21

Merge branch 'master' of github.com:elastic/elasticsearch into fix-50561

d3a1240

mayya-sharipova reviewed Jan 13, 2020

mayya-sharipova approved these changes Jan 13, 2020

iter

53e25df

Christoph Büscher added 3 commits January 14, 2020 15:04

Merge branch 'master' into fix-50561

1532142

iter

bf9f861

Moving migration note from mappings -> analysis

1369e17

romseygeek reviewed Jan 14, 2020

cbuescher merged commit 9a4357a into elastic:master Jan 14, 2020

cbuescher added the >deprecation label Jan 14, 2020

cbuescher mentioned this pull request Jan 14, 2020

Deprecate and remove camel-case nGram and edgeNGram tokenizers (#50862) #50991

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

javanna mentioned this pull request Sep 17, 2024

Remove deprecations and 7.x related code from analysis common #113009

Merged
