-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize DNS related fields across entities #797
Conversation
Adding fields to standardize domain name representations & features for: dns, client, destination, host, server, source fields.
added PR #in Changelog.md
added country code & hostname to dns. and added country code to c/d/s/s/h
Tip: I added |
added answers features, and sorted name fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Dain! Here's a first round of review.
I've made a few comments on the "client" field set, which apply to the equivalent fields in the other field sets as well.
@@ -16,7 +16,7 @@ Thanks, you're awesome :-) --> | |||
|
|||
#### Added | |||
|
|||
* Added `search.*` fields #729 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add back this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so for future clarification - when I update from source, I should just add a line to CHANGELOG.next.md?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just add your line and don't touch the rest :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks the conflicts with the CHANGELOG.next
are still outstanding.
schemas/client.yml
Outdated
- name: country_code | ||
level: extended | ||
type: keyword | ||
short: The country code portion of the domain. | ||
description: > | ||
The country code portion of the domain. | ||
|
||
example: uk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should go all the way to adding country_code
.
I understand the value, but even just parsing out the eTLD and the registered_domain reliably puts a high burden on implementations, IMO.
Asking them to also figure out the country code (only when appropriate, so not when it's a ".com", ".org" or ".beer") opens up other cans of worm: what about the region? Being from Quebec Canada I'm thinking of our beloved ".qc.ca" tld and the like.
@andrewkroh What do you think? Does the public suffix list help fill this country code (aka it's not much more of a burden)? Or does it not help with country code?
If we decide to keep it, we should at least say in the description that it should be filled only when there's a country code :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like parsing issues shouldn't be a consideration in terms of the availability of accurate sub fields. I'll update the description to say it should only be filled in when there is a country code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The country code TLDs are a subset of all TLDs. And we already have a top_level_domain
field so this will be duplicated. If is useful to know that a TLD is a country code then maybe that's a boolean flag like is_country_code
to indicate the top_level_domain
is a country.
Similarly I had thought other attributes of the domain name could be flags. Like whether it contains puny codes with is_internationalized
(elastic/beats#5809 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewkroh that leaves me wondering 2 things (and I'm serious... not just pondering to be annoying, which I may have been known to do on occasion :)
- where do we draw the line for cutting up aggregate things (like a fqdn) into its component parts?
tld is part of a fqdn - and we duplicates it, but cc (and apparently ?region code? thanks for that Quebec!) are also part of a tld... should we be cutting a fqdn up into all of its component parts? or just the major ones? what criteria should we use for "major"?
I can't see visualizing cc against tld, but I could potentially see utilizing cc as a feature, or as an influencer / bucket for analysis (region code should result in the creators immediate relegation to their own personal purgatory as far as I am concerned :)
- I love the "is_internationalized" flag idea, and think theres probably some others we could consider as atomic features (digit ratio, number of special characters, entropy, levenstein distance to similar popular domains, subdomain count, subdomain length, etc.) but, well where do we draw the line with that too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added is_internationalized, removed country_code and added to the tld description ti indicate its the spot to tld/cctld
@@ -59,19 +59,35 @@ | |||
description: > | |||
Client domain. | |||
|
|||
- name: name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this field's relationship to client.domain
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
domain is listed under host as being e.g. the windows domain (e.g. netbios name - "CONTOSO"); name is used elsewhere (dns.question.name) as being specific to the domain name...
Up until digging thru this I had expected to lobby for fqdn, but based on the fact that its next to impossible to determine from information gathered about a dns request if a name is partially or fully qualified, I went with the ECS dns fields as the baseline for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@webmat - did I answer the question? I'm not 100% sure I understood your original question (as I looked at the other references)
schemas/client.yml
Outdated
the subdomain field should contain "sub2.sub1", with no trailing period. | ||
example: east | ||
|
||
- name: hostname |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this field's relationship to client.subdomain
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider it to be the first in a multi-level name in a p/fqdn
Thinking about dns analysis, I might like to be able to analyze the host portion of a name (various features could be useful) as well as the [subdomains] as it stands (host + subdomains)
trying to avoid thinking about features etc., and just thinking about e.g. show me every domain name (requested or accessed) for a particular domain, and bucket by subdomain (assuming first name is a host designation - which even if its a subdomain with an a record, its still effectively a specific host...
schemas/dns.yml
Outdated
@@ -159,6 +179,66 @@ | |||
be the original `question.name` repeated. | |||
example: www.google.com | |||
|
|||
- name: answers.registered_domain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I would add all of these fields in the DNS answers array of object.
@MikePaquette Do you think we have a need for that?
@leehinman & @andrewkroh What do you think about implementing this on DNS answers? Sounds to me like adding this here may be problematic, considering the different kinds of DNS answers we can get, or even just for performance reasons: a query for a domain name can yield many CNAMEs as an answer, so worst case this could be multiple public suffix list lookups per DNS answer event. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at a packetbeat dns request there's 1 request for www.microsoft.com, which gets answered with 3 CNAMES and an A record. (reference below). As the A record is effectively a recursive, automatic request - it seems like the name should be parsed out and treated like a straight up dns.question??
{
"data": "www.microsoft.com-c-3.edgekey.net",
"name": "www.microsoft.com",
"type": "CNAME",
"class": "IN",
"ttl": "2726"
},
{
"data": "www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net",
"name": "www.microsoft.com-c-3.edgekey.net",
"type": "CNAME",
"ttl": "21204",
"class": "IN"
},
{
"data": "e13678.dspb.akamaiedge.net",
"name": "www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net",
"type": "CNAME",
"ttl": "900",
"class": "IN"
},
{
"data": "23.33.58.14",
"name": "e13678.dspb.akamaiedge.net",
"type": "A",
"class": "IN",
"ttl": "20"
}
@dainperkins Have you had a chance to take a look at the feedback here? We would like to get this added in for 1.6. |
removed country code added is_internationalized added further qualification for subdomain, tld replicated updates to client, server, source, destination, url, host, and dns.question
removed answer domain breakdowns (for e.g. cnames)
@ebeahan, Added is_internationalized (boolean) and hopefully fixed the changelog.md I also pulled the breakdown for answers, I'm not 100% sure thats the right way to go, but its easy enough to add back in. |
description: > | ||
The subdomain portion of a fully qualified domain name includes all of the names except | ||
the host name under the registered_domain. In a partially qualified domain, or if the | ||
the qualification level of the full name cannot be determined, subdomain contains all of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure of excluding the "hostname" from the subdomain, or assuming it should populate the hostname field.
What happens when the the "hostname" is not the hostname of the machine as with most publicly facing URLs?
For example mail.google.com is unlikely to be a machine whose actual hostname is "mail". It could be more than one machine load balanced where an individual host responded, or more than one subdomain hosted by the same machine. Is this what ..or if the qualification level of the full name cannot be determined
is supposed to cover? There seem to be different schools of thought on whether the leftmost part of a domain counts as a hostname, or always a subdomain, or sometimes one or the other depending on the context that can't be determined from just looking at the url.
This is particularly important in fields set groups outside url and dns (like client and server) where you might be recording the machines actual hostname based on network traffic and not the url, but copying from url.domain to server.domain. If you drop mail from the subdomain but have hostname as the real internal IP based one the information is inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this one is annoyingly specific to a number of criteria...
FQDN/PQDN - I am actually using PQDN wrong in the docs (mail vs. mail.google.com, or specifically an a record for a subdomain - mail.google.com where there is an A or CNAME record for the subdomain, but also hosts in mail.google.com) I'll fix that first. I'm not 100% sure it matters, but will check and revise if necessary. (Oddly a pqdn - e.g. looking up sharepoint, and having an added domain suffix, will typically show up as an FQDN in the DNS lookup, but may not in the Uri (https://sharepoint should do a lookup on sharepoint.contoso.com for example)
- Locality (local hostname vs. host portion of a fqdn)
- Record Type (CName(s) vs A record)
- Truth (dns "host"/"A Record" that resolves to 1 or multiple hosts (could be lb, could be load balancing, etc)
I was trying to adjust, with minimal breaks, dns info across multiple fields and I think the valid concerns come down to frame of reference.
Maybe as you said there needs to be some delineation between dns related events (url, dns lookup) and device information (literal system hostname for client/server - potentially pulling from source/dest all together)
|
||
In the case of fully and partially qualified domain names it is recommended to parse out | ||
the additional fields of top_level_domain, registered_domain, subdomain, and hostname | ||
as they are available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would whether this is a domain name or a custom name be determined? By whether the name contains a dot?
Perhaps "name" should work like "address" and you copy the best available name from more specific fields, or have a custom user supplied one? It feels somewhat overloaded here, it's a shame domain is used for other purposes in some field set groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one comes down to processing - I'm honestly not sure what the results would look like for a partial domain in the logstash tld filter (will test and see)
host portion of a fully qualified domain name. | ||
|
||
For example, the hostname portion of "www.east.mydomain.co.uk " is "www". | ||
example: www |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure www is a good example of a hostname, especially outside of the url field set. See my comments on subdomain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like you said, relates back to the subdomain, but I can go back and replace www with something less obviously a virtual name for the sake of the example, I think that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dainperkins for revisiting so quickly! Just a couple of observations, but I think we'll probably need to perform a rebase
on your branch to fix some of the predicted merge conflicts.
@@ -16,7 +16,7 @@ Thanks, you're awesome :-) --> | |||
|
|||
#### Added | |||
|
|||
* Added `search.*` fields #729 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks the conflicts with the CHANGELOG.next
are still outstanding.
example: www | ||
example: east | ||
|
||
- name: is_internationalized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of is_internationalized
, and I think Andrew's use case from elastic/beats#5809 makes a lot of sense. Do we have a mechanism that will be able to populate this field today or in the near future?
Just wonder if we should perhaps hold it back until we have a more immediate use. Going along with the internationalized domain name use case, it'd be more useful to a user if the punycode-to-unicode translation was captured as well as setting the is_internationalized
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
found a simple regex to check ^(.+[-_.])??xn--), contains ".xn--" in the registered / tld would also do it
From a machine learning standpoint I think the unicode translation would definitely be interesting (looks like apple, but its not!) but the is_internationalized flag seems easy enough to implement.
Per ECS Team discussion - closing this PR and opening 2 new PRs to separate:
|
@dainperkins Is the plan still to split this into two new PRs as mentioned above? If so, can this PR be closed? |
@ebeahan I think its probably for the best to close this one, and I'll tackle the individual items in a future one. I do think normalizing the dns domain fields throughout would be a good thing, and probably one of those potentially breaking changes to be considered for 2.0 |
|
Adding fields to standardize domain name representations & features for:
dns, client, destination, host, server, source fields.
Closes #728