Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize DNS related fields across entities #797

Closed
wants to merge 6 commits into from
Closed

Normalize DNS related fields across entities #797

wants to merge 6 commits into from

Conversation

dainperkins
Copy link
Contributor

@dainperkins dainperkins commented Mar 23, 2020

Adding fields to standardize domain name representations & features for:
dns, client, destination, host, server, source fields.

Closes #728

Adding fields to standardize domain name representations & features for:
dns, client, destination, host, server, source fields.
added PR #in Changelog.md
added country code & hostname to dns. and added country code to c/d/s/s/h
@webmat
Copy link
Contributor

webmat commented Mar 23, 2020

Tip: I added Closes #728 to the end of the PR text. This will automatically close the related issue when we merge this. Please do this, when a PR addresses an issue in full. When it addresses part of an issue, just mentioning the issue number in the description or a comment is useful as well, as it links the two together.

added answers features, and sorted name fields
Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Dain! Here's a first round of review.

I've made a few comments on the "client" field set, which apply to the equivalent fields in the other field sets as well.

@@ -16,7 +16,7 @@ Thanks, you're awesome :-) -->

#### Added

* Added `search.*` fields #729
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add back this line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so for future clarification - when I update from source, I should just add a line to CHANGELOG.next.md?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just add your line and don't touch the rest :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks the conflicts with the CHANGELOG.next are still outstanding.

CHANGELOG.next.md Show resolved Hide resolved
CHANGELOG.next.md Outdated Show resolved Hide resolved
Comment on lines 129 to 136
- name: country_code
level: extended
type: keyword
short: The country code portion of the domain.
description: >
The country code portion of the domain.

example: uk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should go all the way to adding country_code.

I understand the value, but even just parsing out the eTLD and the registered_domain reliably puts a high burden on implementations, IMO.

Asking them to also figure out the country code (only when appropriate, so not when it's a ".com", ".org" or ".beer") opens up other cans of worm: what about the region? Being from Quebec Canada I'm thinking of our beloved ".qc.ca" tld and the like.

@andrewkroh What do you think? Does the public suffix list help fill this country code (aka it's not much more of a burden)? Or does it not help with country code?

If we decide to keep it, we should at least say in the description that it should be filled only when there's a country code :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like parsing issues shouldn't be a consideration in terms of the availability of accurate sub fields. I'll update the description to say it should only be filled in when there is a country code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The country code TLDs are a subset of all TLDs. And we already have a top_level_domain field so this will be duplicated. If is useful to know that a TLD is a country code then maybe that's a boolean flag like is_country_code to indicate the top_level_domain is a country.

Similarly I had thought other attributes of the domain name could be flags. Like whether it contains puny codes with is_internationalized (elastic/beats#5809 (comment)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewkroh that leaves me wondering 2 things (and I'm serious... not just pondering to be annoying, which I may have been known to do on occasion :)

  1. where do we draw the line for cutting up aggregate things (like a fqdn) into its component parts?

tld is part of a fqdn - and we duplicates it, but cc (and apparently ?region code? thanks for that Quebec!) are also part of a tld... should we be cutting a fqdn up into all of its component parts? or just the major ones? what criteria should we use for "major"?

I can't see visualizing cc against tld, but I could potentially see utilizing cc as a feature, or as an influencer / bucket for analysis (region code should result in the creators immediate relegation to their own personal purgatory as far as I am concerned :)

  1. I love the "is_internationalized" flag idea, and think theres probably some others we could consider as atomic features (digit ratio, number of special characters, entropy, levenstein distance to similar popular domains, subdomain count, subdomain length, etc.) but, well where do we draw the line with that too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added is_internationalized, removed country_code and added to the tld description ti indicate its the spot to tld/cctld

schemas/client.yml Show resolved Hide resolved
@@ -59,19 +59,35 @@
description: >
Client domain.

- name: name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this field's relationship to client.domain?

Copy link
Contributor Author

@dainperkins dainperkins Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

domain is listed under host as being e.g. the windows domain (e.g. netbios name - "CONTOSO"); name is used elsewhere (dns.question.name) as being specific to the domain name...

Up until digging thru this I had expected to lobby for fqdn, but based on the fact that its next to impossible to determine from information gathered about a dns request if a name is partially or fully qualified, I went with the ECS dns fields as the baseline for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@webmat - did I answer the question? I'm not 100% sure I understood your original question (as I looked at the other references)

the subdomain field should contain "sub2.sub1", with no trailing period.
example: east

- name: hostname
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this field's relationship to client.subdomain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider it to be the first in a multi-level name in a p/fqdn

Thinking about dns analysis, I might like to be able to analyze the host portion of a name (various features could be useful) as well as the [subdomains] as it stands (host + subdomains)

trying to avoid thinking about features etc., and just thinking about e.g. show me every domain name (requested or accessed) for a particular domain, and bucket by subdomain (assuming first name is a host designation - which even if its a subdomain with an a record, its still effectively a specific host...

schemas/dns.yml Outdated
@@ -159,6 +179,66 @@
be the original `question.name` repeated.
example: www.google.com

- name: answers.registered_domain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I would add all of these fields in the DNS answers array of object.

@MikePaquette Do you think we have a need for that?

@leehinman & @andrewkroh What do you think about implementing this on DNS answers? Sounds to me like adding this here may be problematic, considering the different kinds of DNS answers we can get, or even just for performance reasons: a query for a domain name can yield many CNAMEs as an answer, so worst case this could be multiple public suffix list lookups per DNS answer event. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at a packetbeat dns request there's 1 request for www.microsoft.com, which gets answered with 3 CNAMES and an A record. (reference below). As the A record is effectively a recursive, automatic request - it seems like the name should be parsed out and treated like a straight up dns.question??

{
"data": "www.microsoft.com-c-3.edgekey.net",
"name": "www.microsoft.com",
"type": "CNAME",
"class": "IN",
"ttl": "2726"
},
{
"data": "www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net",
"name": "www.microsoft.com-c-3.edgekey.net",
"type": "CNAME",
"ttl": "21204",
"class": "IN"
},
{
"data": "e13678.dspb.akamaiedge.net",
"name": "www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net",
"type": "CNAME",
"ttl": "900",
"class": "IN"
},
{
"data": "23.33.58.14",
"name": "e13678.dspb.akamaiedge.net",
"type": "A",
"class": "IN",
"ttl": "20"
}

@ebeahan ebeahan added the review label Jul 8, 2020
@ebeahan
Copy link
Member

ebeahan commented Aug 4, 2020

@dainperkins Have you had a chance to take a look at the feedback here? We would like to get this added in for 1.6.

@ebeahan ebeahan added 1.6.0 ready Issues we'd like to address in the future. labels Aug 4, 2020
removed country code
added is_internationalized
added further qualification for subdomain, tld
replicated updates to client, server, source, destination, url, host, and dns.question
removed answer domain breakdowns (for e.g. cnames)
@dainperkins
Copy link
Contributor Author

dainperkins commented Aug 4, 2020

@ebeahan,
hadn't heard again from anyone but just pushed updates including single example domains, pulled country code fields and beefed up examples to show tld/reg'd domain as being the appropriate place (e.g. co.uk, mydomain.co.uk)

Added is_internationalized (boolean) and hopefully fixed the changelog.md

I also pulled the breakdown for answers, I'm not 100% sure thats the right way to go, but its easy enough to add back in.

description: >
The subdomain portion of a fully qualified domain name includes all of the names except
the host name under the registered_domain. In a partially qualified domain, or if the
the qualification level of the full name cannot be determined, subdomain contains all of
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure of excluding the "hostname" from the subdomain, or assuming it should populate the hostname field.

What happens when the the "hostname" is not the hostname of the machine as with most publicly facing URLs?

For example mail.google.com is unlikely to be a machine whose actual hostname is "mail". It could be more than one machine load balanced where an individual host responded, or more than one subdomain hosted by the same machine. Is this what ..or if the qualification level of the full name cannot be determined is supposed to cover? There seem to be different schools of thought on whether the leftmost part of a domain counts as a hostname, or always a subdomain, or sometimes one or the other depending on the context that can't be determined from just looking at the url.

This is particularly important in fields set groups outside url and dns (like client and server) where you might be recording the machines actual hostname based on network traffic and not the url, but copying from url.domain to server.domain. If you drop mail from the subdomain but have hostname as the real internal IP based one the information is inconsistent.

Copy link
Contributor Author

@dainperkins dainperkins Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this one is annoyingly specific to a number of criteria...

FQDN/PQDN - I am actually using PQDN wrong in the docs (mail vs. mail.google.com, or specifically an a record for a subdomain - mail.google.com where there is an A or CNAME record for the subdomain, but also hosts in mail.google.com) I'll fix that first. I'm not 100% sure it matters, but will check and revise if necessary. (Oddly a pqdn - e.g. looking up sharepoint, and having an added domain suffix, will typically show up as an FQDN in the DNS lookup, but may not in the Uri (https://sharepoint should do a lookup on sharepoint.contoso.com for example)

  • Locality (local hostname vs. host portion of a fqdn)
  • Record Type (CName(s) vs A record)
  • Truth (dns "host"/"A Record" that resolves to 1 or multiple hosts (could be lb, could be load balancing, etc)

I was trying to adjust, with minimal breaks, dns info across multiple fields and I think the valid concerns come down to frame of reference.

Maybe as you said there needs to be some delineation between dns related events (url, dns lookup) and device information (literal system hostname for client/server - potentially pulling from source/dest all together)


In the case of fully and partially qualified domain names it is recommended to parse out
the additional fields of top_level_domain, registered_domain, subdomain, and hostname
as they are available.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would whether this is a domain name or a custom name be determined? By whether the name contains a dot?

Perhaps "name" should work like "address" and you copy the best available name from more specific fields, or have a custom user supplied one? It feels somewhat overloaded here, it's a shame domain is used for other purposes in some field set groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one comes down to processing - I'm honestly not sure what the results would look like for a partial domain in the logstash tld filter (will test and see)

host portion of a fully qualified domain name.

For example, the hostname portion of "www.east.mydomain.co.uk " is "www".
example: www
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure www is a good example of a hostname, especially outside of the url field set. See my comments on subdomain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like you said, relates back to the subdomain, but I can go back and replace www with something less obviously a virtual name for the sake of the example, I think that makes sense.

Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dainperkins for revisiting so quickly! Just a couple of observations, but I think we'll probably need to perform a rebase on your branch to fix some of the predicted merge conflicts.

@@ -16,7 +16,7 @@ Thanks, you're awesome :-) -->

#### Added

* Added `search.*` fields #729
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks the conflicts with the CHANGELOG.next are still outstanding.

example: www
example: east

- name: is_internationalized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of is_internationalized, and I think Andrew's use case from elastic/beats#5809 makes a lot of sense. Do we have a mechanism that will be able to populate this field today or in the near future?

Just wonder if we should perhaps hold it back until we have a more immediate use. Going along with the internationalized domain name use case, it'd be more useful to a user if the punycode-to-unicode translation was captured as well as setting the is_internationalized flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found a simple regex to check ^(.+[-_.])??xn--), contains ".xn--" in the registered / tld would also do it

From a machine learning standpoint I think the unicode translation would definitely be interesting (looks like apple, but its not!) but the is_internationalized flag seems easy enough to implement.

@webmat webmat added 1.7.0 and removed 1.6.0 labels Aug 12, 2020
@dainperkins
Copy link
Contributor Author

Per ECS Team discussion - closing this PR and opening 2 new PRs to separate:

  1. DNS specifics (related to dns name queries observed or logged)
  2. Host specifics (related to internal host naming conventions - logical name, system hostname, dns/directory name)

@ebeahan
Copy link
Member

ebeahan commented Nov 17, 2020

@dainperkins Is the plan still to split this into two new PRs as mentioned above? If so, can this PR be closed?

@dainperkins
Copy link
Contributor Author

@ebeahan I think its probably for the best to close this one, and I'll tackle the individual items in a future one.

I do think normalizing the dns domain fields throughout would be a good thing, and probably one of those potentially breaking changes to be considered for 2.0

@dainperkins
Copy link
Contributor Author

  • closing to split into multiple, scope limited PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: ECS ready Issues we'd like to address in the future. review RFC:candidate Team: ECS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standardize Domain representations
7 participants