This is a port of spinscale's Elasticsearch Langdetect ingest plugin. The code was migrated using my ElasticSearch to OpenSearch Migration Scripts.

It uses the langdetect library to try to detect the language of the text in a field.
Note that OpenSearch nowadays has native support for language detection via the inference ingest processor; see the documentation for more details.
Install the release matching your OpenSearch version:

| OpenSearch | Command |
|---|---|
| 1.1.0 | `bin/opensearch-plugin install https://github.com/aparo/opensearch-ingest-langdetect/releases/download/1.1.0/ingest-langdetect-1.1.0.zip` |
| 1.2.0 | `bin/opensearch-plugin install https://github.com/aparo/opensearch-ingest-langdetect/releases/download/1.2.0/ingest-langdetect-1.2.0.zip` |
| 1.2.2 | `bin/opensearch-plugin install https://github.com/aparo/opensearch-ingest-langdetect/releases/download/1.2.2/ingest-langdetect-1.2.2.zip` |
| 1.2.3 | `bin/opensearch-plugin install https://github.com/aparo/opensearch-ingest-langdetect/releases/download/1.2.3/ingest-langdetect-1.2.3.zip` |
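After installing (and restarting the node, which is required for plugins to be picked up), you can verify the installation by listing the installed plugins; the plugin should show up as `ingest-langdetect`:

```
GET _cat/plugins
```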
Create an ingest pipeline that uses the langdetect processor:

```
PUT _ingest/pipeline/langdetect-pipeline
{
  "description": "A pipeline to do whatever",
  "processors": [
    {
      "langdetect" : {
        "field" : "my_field",
        "target_field" : "language"
      }
    }
  ]
}
```
Index a document through the pipeline and retrieve it:

```
PUT /my-index/_doc/1?pipeline=langdetect-pipeline
{
  "my_field" : "This is hopefully an english text, that will be detected."
}

GET /my-index/_doc/1
```

The `_source` of the response is expected to look like this:

```
{
  "my_field" : "This is hopefully an english text, that will be detected.",
  "language": "en"
}
```
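You can also try the pipeline without indexing anything by using the standard ingest simulate API (the result described below is what one would expect, not a captured response):

```
POST _ingest/pipeline/langdetect-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "my_field": "Das hier ist ein deutscher Text."
      }
    }
  ]
}
```

The simulated document should come back with `"language": "de"` added to its `_source`.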
You can also use the detected language to index the text into language-specific fields, each with its own analyzer:
```
PUT _ingest/pipeline/langdetect-analyzer-pipeline
{
  "description": "A pipeline to index data into language specific analyzers",
  "processors": [
    {
      "langdetect": {
        "field": "my_field",
        "target_field": "lang"
      }
    },
    {
      "script": {
        "source": "ctx.language = [:]; ctx.language[ctx.lang] = ctx.remove('my_field')"
      }
    }
  ]
}
```
```
PUT documents
{
  "mappings": {
    "properties" : {
      "language": {
        "properties": {
          "de" : {
            "type": "text",
            "analyzer": "german"
          },
          "en" : {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}
```
```
PUT /documents/_doc/1?pipeline=langdetect-analyzer-pipeline
{
  "my_field" : "This is an english text"
}

PUT /documents/_doc/2?pipeline=langdetect-analyzer-pipeline
{
  "my_field" : "Das hier ist ein deutscher Text."
}

GET documents/_doc/1
GET documents/_doc/2
```
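Assuming detection succeeds, the `_source` of the first document should look like this (derived from the script above rather than captured from a running cluster):

```
{
  "lang" : "en",
  "language" : {
    "en" : "This is an english text"
  }
}
```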
| Parameter | Use |
|---|---|
| `field` | Field name to read the content from |
| `target_field` | Field name to write the detected language to |
| `max_length` | Maximum amount of text to read; expects a byte size value like `1mb`, defaults to `10kb` |
| `ignore_missing` | If `true`, a missing source field does not throw an exception; expects a boolean value, defaults to `false` |
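Putting the optional parameters together, a processor configuration might look like this (a sketch with illustrative values):

```
{
  "langdetect" : {
    "field" : "my_field",
    "target_field" : "language",
    "max_length" : "1mb",
    "ignore_missing" : true
  }
}
```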
In order to install this plugin from source, you need to create a zip distribution first by running

```
gradle clean check
```

This will produce a zip file in `build/distributions`.

After building the zip file, you can install it like this:

```
bin/opensearch-plugin install file:///path/to/ingest-langdetect/build/distributions/ingest-langdetect-0.0.1-SNAPSHOT.zip
```
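You can then confirm that the plugin was installed with the standard plugin CLI:

```
bin/opensearch-plugin list
```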
In order to cope with the Java security manager, a special factory is used to load the language profiles from the classpath; see the SecureDetectorFactory class. This implementation also avoids jsonic, so that no reflection is needed when loading the languages.