[7.x] URI parts ingest processor #66191

Merged

danhermann
Contributor

Adds a new uri_parts processor that decomposes a URI into its constituent parts. E.g.:

POST _ingest/pipeline/_simulate?verbose
{
  "pipeline": {
    "processors": [
      {
        "uri_parts": {
          "field": "uri_field",
          "target_field": "url",
          "keep_original": true,
          "remove_if_successful": true
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "uri_field": "http://user:pw@www.google.com:80/blarg.gif#ref"
      }
    }
  ]
}

results in:

"processor_type" : "uri_parts",
"status" : "success",
"doc" : {
  "_index" : "_index",
  "_id" : "_id",
  "_source" : {
    "url" : {
      "path" : "/blarg.gif",
      "fragment" : "ref",
      "extension" : "gif",
      "password" : "pw",
      "original" : "http://user:pw@www.google.com:80/blarg.gif#ref",
      "scheme" : "http",
      "port" : 80,
      "user_info" : "user:pw",
      "domain" : "www.google.com",
      "username" : "user"
    }
  }
}

The processor relies on the java.net.URI class to parse the URI and attempts to map the parts onto ECS fields. Some ECS fields are not part of the URI spec; the table below shows how each field is handled:

| URI Parts Processor | ECS | java.net.URI | Comments |
| --- | --- | --- | --- |
| domain | url.domain | getHost() | |
| extension | url.extension | | This is not part of the URI spec and is manually parsed out of the path element on a best-effort basis if a "." exists in the path |
| fragment | url.fragment | getFragment() | |
| | url.full | | |
| original | url.original | | The processor includes an option to retain the original URL |
| password | url.password | | The URI spec defines an "authority" field but does not define either username or password, though they are commonly presented with the username:password convention. The username and password fields are parsed out of the user_info field on a best-effort basis if a ":" exists. |
| path | url.path | getPath() | |
| port | url.port | getPort() | |
| query | url.query | getQuery() | |
| | url.registered_domain | | |
| scheme | url.scheme | getScheme() | |
| | url.top_level_domain | | |
| username | url.username | | See comment on password above |
| user_info | | getUserInfo() | Corresponds to the "authority" field of the URI spec without the domain |
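
As a rough illustration of the mapping in the table, the sketch below parses a string with java.net.URI and fills a map keyed by the processor's field names. It is not the processor's actual code: the class name UriPartsSketch and the helpers parse and putIfPresent are hypothetical, and the best-effort rules for extension, username, and password are assumptions based only on the comments in the table.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the field mapping; names are not from the PR.
final class UriPartsSketch {

    static Map<String, Object> parse(String value) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(value);
        Map<String, Object> parts = new LinkedHashMap<>();

        // Always kept here; the real processor makes this optional (keep_original).
        parts.put("original", value);

        // Fields that map directly onto java.net.URI getters.
        putIfPresent(parts, "scheme", uri.getScheme());
        putIfPresent(parts, "domain", uri.getHost());
        if (uri.getPort() != -1) {            // getPort() returns -1 when undefined
            parts.put("port", uri.getPort());
        }
        putIfPresent(parts, "query", uri.getQuery());
        putIfPresent(parts, "fragment", uri.getFragment());

        // extension: assumed best-effort rule, text after the last '.' in the path.
        String path = uri.getPath();
        putIfPresent(parts, "path", path);
        if (path != null) {
            int dot = path.lastIndexOf('.');
            if (dot >= 0 && dot < path.length() - 1) {
                parts.put("extension", path.substring(dot + 1));
            }
        }

        // username/password: assumed best-effort split of user_info at the first ':'.
        String userInfo = uri.getUserInfo();
        putIfPresent(parts, "user_info", userInfo);
        if (userInfo != null) {
            int colon = userInfo.indexOf(':');
            if (colon >= 0) {
                parts.put("username", userInfo.substring(0, colon));
                parts.put("password", userInfo.substring(colon + 1));
            }
        }
        return parts;
    }

    private static void putIfPresent(Map<String, Object> parts, String key, Object value) {
        if (value != null) {
            parts.put(key, value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("http://user:pw@www.google.com:80/blarg.gif#ref"));
    }
}
```

The sketch leaves url.full, url.registered_domain, and url.top_level_domain unset, matching the empty processor cells in the table.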

Also introduces a new module for ingest processors.

Closes #57481

Backport of #65150

@danhermann added the >feature, :Data Management/Ingest Node, backport, and v7.11.0 labels on Dec 10, 2020
@elasticmachine added the Team:Data Management label on Dec 10, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann merged commit 0a67f1b into elastic:7.x on Dec 10, 2020
@danhermann deleted the backport_7x_65150_uri_parts_processor branch on December 10, 2020 22:25