http.response.status_code as long instead of keyword or integer #564

Open
GuillaumeDuf opened this issue Sep 17, 2019 · 19 comments
@GuillaumeDuf

Hello,
I did not find any discussion on this topic, so I am asking the question.
In ECS, http.response.status_code is mapped as long. Why?

Using an integer has the advantage of taking up less space, but we are not supposed to do arithmetic operations (sum/avg...) on an HTTP status because all codes are between 100 and 599.

Using a keyword would allow running an aggregation without specifying a null_value.
The keyword type would also allow range queries if necessary (they would be in alphanumeric order).
ECS source code:

  - name: response.status_code
    format: string
    level: extended
    type: long
    description: >
      HTTP response status code.
    example: 404
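
A minimal Python sketch of the range-query point: since every HTTP status code has exactly three digits, plain string comparison orders them the same way numeric comparison does, so a keyword range from "500" to "599" would select the same codes as the numeric range. The helper `in_keyword_range` and the sample codes are assumptions made up for illustration, and Python string comparison only approximates how a keyword range compares terms.

```python
# Sketch: string ordering vs. numeric ordering for three-digit status codes.
# in_keyword_range is a hypothetical helper, not part of ECS or Elasticsearch.


def in_keyword_range(code: str, low: str, high: str) -> bool:
    """Inclusive range check using plain string (lexicographic) comparison."""
    return low <= code <= high


# Sample codes chosen for illustration only.
codes = [200, 301, 404, 500, 503]

# Numeric range, as it would work on a long/integer field.
numeric_5xx = [c for c in codes if 500 <= c <= 599]

# The same range expressed on string values, as it would work on a keyword field.
keyword_5xx = [c for c in codes if in_keyword_range(str(c), "500", "599")]

# Both orderings agree for every pair of codes between 100 and 599,
# because all of them have the same number of digits.
same_for_all_codes = all(
    (a < b) == (str(a) < str(b))
    for a in range(100, 600)
    for b in range(100, 600)
)
```

The agreement depends on the fixed width: `in_keyword_range("1000", "100", "599")` is also true, so a four-digit value would slip into a three-digit range.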
@GuillaumeDuf GuillaumeDuf changed the title http.response.status_code as long instead of keyword http.response.status_code as long instead of keyword or integer Sep 17, 2019
@webmat
Contributor

webmat commented Sep 23, 2019

You're right that codes like HTTP status codes aren't meant to have arithmetic operations done on them. We're making them keyword as much as possible in general. For HTTP we went with long because it was so well established that this wasn't going to change, and nobody would try to map another type of code to it (as opposed to event.code, which may be used to contain numeric or alphanum codes).

As for null_value, this hadn't come up in the discussion for HTTP. Can you expand on the problem you see here?

@andrewthad
Contributor

For HTTP we went with long because it was so well established that this wasn't going to change, and nobody would try to map another type of code to it.

Then why is network.iana_number a keyword instead of a long?

@webmat
Contributor

webmat commented Oct 3, 2019

@andrewthad Oh no, you win! You found an inconsistency!

Kidding, there's lots of inconsistencies in ECS 😂 We're doing our best to avoid them, but additional eyes on the PRs are welcome. Seems like you have

[GIF: Laser Eyes]

Please let us know if either network.iana_number being keyword or response.status_code being long is causing a specific problem. We'll consider addressing those if that's the case.

If the questions were just being posed out of curiosity, that's fine too. But please close the issue, once the curiosity is satisfied :-)

@andrewthad
Contributor

That's a wonderful gif. The types aren't causing any problems at the moment. I'm currently using ECS for semi-structured logs in a SIEM I'm working on, and when I pass records between services, I'm trying to make sure their representation is compact and supports efficient testing against common predicates. Numbers are better for this kind of thing. So I currently deviate from the standard in a few ways, mostly by making things numbers that are supposed to be text.

A thought that I've had is that it might be useful if the types weren't just Elasticsearch's types. Meaning that it could be useful to have a layer above ES types that maps down onto ES types. Something like:

  • 8-bit unsigned number -> long
  • 16-bit unsigned number -> long
  • 64-bit signed -> long
  • 64-bit unsigned identifier (number that only supports equality) -> keyword
  • term -> keyword

That's just a vague sketch, but then ECS could become more useful as an efficient representation of data for applications other than Elasticsearch.
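
One way to picture such a layer is a small lookup from abstract types to Elasticsearch types. The sketch below is only an illustration of the list above: the `AbstractType` enum, the `to_es_mapping` helper, and the assignment of abstract types to the two example fields are all hypothetical and not part of ECS.

```python
from enum import Enum
from typing import Dict


class AbstractType(Enum):
    """Hypothetical storage-agnostic types, following the list above."""

    U8 = "8-bit unsigned number"
    U16 = "16-bit unsigned number"
    I64 = "64-bit signed"
    U64_ID = "64-bit unsigned identifier"  # only supports equality
    TERM = "term"


# Mapping from the abstract layer down onto Elasticsearch field types.
ES_TYPE: Dict[AbstractType, str] = {
    AbstractType.U8: "long",
    AbstractType.U16: "long",
    AbstractType.I64: "long",
    AbstractType.U64_ID: "keyword",
    AbstractType.TERM: "keyword",
}


def to_es_mapping(fields: Dict[str, AbstractType]) -> dict:
    """Build an Elasticsearch-style mappings body from abstract field types."""
    return {
        "properties": {
            name: {"type": ES_TYPE[abstract]} for name, abstract in fields.items()
        }
    }


# Assumed type assignments, for illustration only.
mapping = to_es_mapping(
    {
        "http.response.status_code": AbstractType.U16,
        "event.code": AbstractType.TERM,
    }
)
```

A non-Elasticsearch consumer could read the same abstract types and pick its own compact representation, for example a 16-bit integer for the status code.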

@webmat
Contributor

webmat commented Oct 3, 2019

Yes, that's something we could consider. I'll run that by a few other folks who may feel the same way, to get their take on this.

@webmat webmat added the discuss label Oct 3, 2019
@GuillaumeDuf
Author

Having the status code as an integer causes problems for machine learning jobs, because you can only use these fields as a "metric field", not as a field to partition the analysis by...

@webmat
Contributor

webmat commented Oct 29, 2020

Thanks for chiming in again. This had fallen through the cracks of the floor :-)

I took a note on our list of possible breaking changes to fix at the next major (#839), and I'm also checking with colleagues on the ML team on what the possibilities are.

@webmat
Contributor

webmat commented Oct 29, 2020

You're correct that the main ML UI doesn't expose numeric fields as being available to partition by.

But if you go to the JSON editor in the advanced section, you can use them as such.
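
As a rough sketch of what such a job definition could look like when written by hand, the snippet below builds a job body with the numeric field as the partition field. The job description, bucket span, and choice of the `count` function are assumptions for illustration; the key names follow the anomaly detection job API, and whether a given stack version accepts a numeric partition field should be checked against that version.

```python
import json

# Hypothetical job body, e.g. for PUT _ml/anomaly_detectors/<job_id>
# or for pasting into the JSON editor of the advanced job wizard.
job_body = {
    "description": "Event rate per HTTP status code (illustrative sketch)",
    "analysis_config": {
        "bucket_span": "15m",  # assumed value
        "detectors": [
            {
                "function": "count",
                # The numeric (long) field used as the partition field.
                "partition_field_name": "http.response.status_code",
            }
        ],
        "influencers": ["http.response.status_code"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Serialized form, ready to paste or send as a request body.
job_json = json.dumps(job_body, indent=2)
```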

@djptek
Contributor

djptek commented Mar 4, 2021

There is a nice way to partition numeric fields (up to 100, which exceeds the 41 HTTP response codes) to generate a histogram and/or other visualizations using Lens; see #839 (comment)

@djptek
Contributor

djptek commented Mar 4, 2021

Here's an example of partition using programmatically generated response codes (long) for every possible value using Lens

image

@djptek
Contributor

djptek commented Jul 13, 2021

Closing due to lack of update, please feel free to reopen

@djptek djptek closed this as completed Jul 13, 2021
@iamhowardtheduck

I would like to re-open this issue, as I have many customers who require the use of keyword for this error code.

@blookot

blookot commented May 25, 2022

@webmat I'm quite surprised we need to discuss this issue.
How was the status code defined as long in the first place...?
The status code must be a keyword. It's an issue throughout Kibana as well as ML, as mentioned before.
Please update ECS.
Even the logs from the Kibana sample dataset set a "response.keyword" field.

If only we could add a keyword nested field under a long...
Or if we had a chance to define a runtime field converting the long to string...
But neither will work.

Thank you in advance.

@blookot

blookot commented May 25, 2022

I'm adding a quick erratum: the runtime field to convert to keyword is:

PUT /test-mapping/_mapping
{
  "runtime": {
    "statuscode_keyword": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['statuscode'].value.toString());"
      }
    }
  }
}

Still this won't scale

@djptek
Contributor

djptek commented May 25, 2022

Hi @blookot @iamhowardtheduck, this is a long-running discussion, which predates ECS. For the most recent history, please see the top of this post. There are certainly arguments for both alternatives. Also, there is an existing user base whose historical and current data is linked to the type which has been applied.

If you need to run keyword analytics on this field, then rather than a runtime field perhaps consider:

  • using Painless in an ingest node to create a keyword representation of the status code in a new field as the target for your analytics
  • if space is at a premium, you also have the mapping parameters index and store
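
A minimal sketch of the first option, assuming a script processor in an ingest pipeline. The target field name `http.response.status_code_keyword`, the pipeline description, and the Python helper `emulate_pipeline` are hypothetical; the helper only mimics the Painless logic locally and is not an Elasticsearch API.

```python
from typing import Any, Dict

# Painless source for the script processor: copy the numeric code into a
# new field as a string. The target field name is an assumption.
PAINLESS_SOURCE = (
    "ctx.http.response.status_code_keyword = "
    "ctx.http.response.status_code.toString();"
)

# Hypothetical body for PUT _ingest/pipeline/<pipeline_id>.
pipeline_body: Dict[str, Any] = {
    "description": "Add a keyword copy of the HTTP status code (sketch)",
    "processors": [
        {
            "script": {
                "lang": "painless",
                # Skip documents that carry no status code.
                "if": "ctx.http?.response?.status_code != null",
                "source": PAINLESS_SOURCE,
            }
        }
    ],
}


def emulate_pipeline(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Local stand-in for the processor above, for reasoning about its effect."""
    code = doc.get("http", {}).get("response", {}).get("status_code")
    if code is not None:
        doc["http"]["response"]["status_code_keyword"] = str(code)
    return doc


example = emulate_pipeline({"http": {"response": {"status_code": 404}}})
```

The new field would still need a keyword mapping in the index template, and existing documents would only gain it after being reprocessed.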

@djptek
Contributor

djptek commented May 25, 2022

@blookot @iamhowardtheduck as to the root of all this, that's really down to whoever decided to encode what so far turned out to be 41 discrete values using 3 digits in 1991 or thereabouts, I'm sure they had their reasons

For the record, I would generally expect to map any emerging field with the suffix "code" to keyword, assuming minimal code and/or historical business data that I would need to rewrite/reprocess.

@blookot

blookot commented Jun 1, 2022

Hi @djptek
thank you for your answers.
You mean I could add the "index" mapping parameter to the http.response.status_code field, and that would make it aggregatable (in visualizations and ML wizards)?
Which means index mappings (set from beats) could be customized to solve this...
Or simply add a second http.response.status_code_keyword field?

@djptek
Contributor

djptek commented Jun 2, 2022

Hi @blookot

add a second http.response.status_code_keyword field

this would work - you'd need to add e.g. some painless in the index pipeline to populate that field &/or do a reindex to add this to your legacy data if required

@BBQigniter

This really is strange - I had a look at https://github.com/elastic/integrations/blob/main/packages/nginx/kibana/ml_module/nginx-Logs-ml.json and wanted to rebuild the status_code_rate_nginx ML-job for one of our own indices that keeps to ECS, only to figure out that I cannot use http.response.status_code in the "Split field" drop-down for a "Multi-metric" job.

Having a quick look at the pipeline config from the integration and https://docs.elastic.co/en/integrations/nginx, it seems it's also ingested as long, so how would the included ML job work? Is it possible to create such jobs anyway via the API if you know what you are doing?


7 participants