
Proposal: Introduce connection prefix, move source / destination #51

Closed

Contributor

@ruflin ruflin commented Jul 17, 2018

There have recently been several discussions around source, destination and connection, especially in #9. My conclusion is that source and destination normally belong to a connection, and that we are actually missing a connection prefix. Also, some information from network, like forward_ip, belongs more to a connection than to network.

An additional change I made to source and destination is that they both now contain a host prefix. All the fields in source and destination also exist in host, so the host prefix can be reused here too. This makes ECS very predictable: every time host.* shows up, it contains the same fields. Source and destination could also contain additional data such as the location; see #50 for more details.

The connection fields now look as follows:

| Field | Description | Type |
|---|---|---|
| `connection.destination.host.ip` | IP address of the destination. Can be one or multiple IPv4 or IPv6 addresses. | ip |
| `connection.destination.host.name` | Hostname of the destination. | keyword |
| `connection.destination.host.port` | Port of the destination. | long |
| `connection.destination.host.mac` | MAC address of the destination. | keyword |
| `connection.destination.host.domain` | Destination domain. | keyword |
| `connection.destination.host.subdomain` | Destination subdomain. | keyword |
| `connection.source.host.ip` | IP address of the source. Can be one or multiple IPv4 or IPv6 addresses. | ip |
| `connection.source.host.name` | Hostname of the source. | keyword |
| `connection.source.host.port` | Port of the source. | long |
| `connection.source.host.mac` | MAC address of the source. | keyword |
| `connection.source.host.domain` | Source domain. | keyword |
| `connection.source.host.subdomain` | Source subdomain. | keyword |
| `connection.direction` | Direction of the network traffic. Recommended values are: inbound, outbound, unknown. | keyword |
| `connection.forwarded_ip` | Host IP address when the source IP address is the proxy. | ip |

I opened a PR instead of an issue because it lets us discuss the high-level parts in comments and the details directly in the code.

@ruflin
Contributor Author

ruflin commented Jul 17, 2018

As discussed in #9, there are also cases where the host from source and destination should end up in one field. For these cases the copy_to feature could be used. Here is a small example:

PUT ecs
{
  "mappings": {
    "_doc": {
      "properties": {
        "connection.source.host.name": {
          "type": "keyword",
          "copy_to": "host.name" 
        },
        "connection.destination.host.name": {
          "type": "keyword",
          "copy_to": "host.name" 
        },
        "host.name": {
          "type": "keyword"
        }
      }
    }
  }
}

PUT ecs/_doc/1
{
  "connection.source.host.name": "elastic.co",
  "connection.destination.host.name": "ruflin.com"
}

GET ecs/_search
{
  "query": {
    "match": {
      "host.name": { 
        "query": "ruflin.com"
      }
    }
  }
}

The host.name field can now be used to query for the host name from both source and destination.
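
The mapping above repeats the same copy_to entry for each prefixed field. A minimal Python sketch of how such a request body could be generated follows. The helper name `build_copy_to_mapping` is hypothetical, and the code only builds the body as a dict; it does not talk to Elasticsearch.

```python
import json


def build_copy_to_mapping(leaf, prefixes, field_type="keyword"):
    """Build a mapping body where every `<prefix>.<leaf>` field is copied to `<leaf>`.

    Hypothetical helper for illustration; it mirrors the shape of the
    `PUT ecs` request body shown above.
    """
    # The shared target field itself has no copy_to entry.
    properties = {leaf: {"type": field_type}}
    for prefix in prefixes:
        properties[f"{prefix}.{leaf}"] = {"type": field_type, "copy_to": leaf}
    return {"mappings": {"_doc": {"properties": properties}}}


mapping = build_copy_to_mapping(
    "host.name", ["connection.source", "connection.destination"]
)
print(json.dumps(mapping, indent=2))
```

Generating the body this way keeps the source and destination entries in sync if more shared fields (for example host.ip) were to be added later.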

@ruflin ruflin requested review from webmat and andrewkroh July 25, 2018 08:20
@urso

urso commented Jul 25, 2018

As connection.forwarded_ip is for a proxy, I wonder if we want to be able to add more fields to forward? E.g. host, location, datacenter, domain...

@strawgate

This is a pretty fundamental change to the schema -- should we be planning that this will be included if we are hoping to conform to this schema? Do we know when this might be approved or merged?

@praseodym
Contributor

I quite dislike how this pull request introduces vastly longer field names for mostly superfluous categorisation. In terms of daily usability I much prefer source.ip over connection.source.host.ip.

@ruflin
Contributor Author

ruflin commented Jul 26, 2018

@urso Interesting idea. So you are basically saying a proxy is also a host with additional info? Is forward_ip always coming from a proxy?

@strawgate There is no conclusion on this topic yet, which is why it's here for discussion. The outcome is not clear yet.

@praseodym Can you share a bit more background on the problem of long field names? If it is the typing, I wonder how much the auto complete in newer Kibana versions solves this issue?

@urso

urso commented Jul 26, 2018

> Interesting idea. So you are basically saying a proxy is also a host with additional info? Is forward_ip always coming from a proxy?

I have no idea where forwarded_ip comes from, what the use case is, and so on. The description mentions a proxy, so I wondered why a proxy is not allowed to have a host and other similar settings. Why do source and destination have a rich structure while only one field is reserved for the proxy?

In the schema, the source field can be either a proxy or the actual source. Does the presence of forward_ip change the meaning of the source IP? Why not source.origin_ip? It's just that something doesn't feel consistent.

Checking the proposal again, I wonder if connection and <X>.host carry redundant information. I understand we used to have source.host, which implicitly states that source is a network-based endpoint. But by introducing a connection namespace, the schema already implies that source/destination are network-based endpoints. How about
connection.source.mac, connection.source.ip, connection.source.port, connection.source.hostname, connection.source.fqdn... ?

@spartan782

I think that there should be a consideration for tools that do not log Source and Destination. I propose there should be 2 different fields. One for tools that use source and destination, and another for tools that are sessionized. For example, the tool bro uses Originating and Responding because it keeps track and logs each conversation rather than individual packets.

@dcode
Contributor

dcode commented Aug 2, 2018

I like the concept for connection oriented tools, but can't we just put it under network instead of connection?

  1. It's shorter.
  2. network is already a prefix
  3. Where this really pays off is searching across network-oriented logs (i.e. bro, netflow, suricata, etc), and process oriented logs that contain information gleaned from the operating system socket listings.
  4. I'm a huge fan of a catch-all hosts field. One thing we do in RockNSM^1 today is add a pivot for @meta.related_ips, which is a list of all IPs that fall into a given event. This makes pivoting across related logs dead simple for analysts. I would propose a top-level prefix of related which enables pivoting across multiple types (e.g. related.ip, related.hostname, related.mac, etc). This would enable a top-level semantic that makes sense for any sort of relation, based upon type. Kibana + Elastic handles the list of keywords/atomic types really well for purposes of filtering/pivoting.
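
The related.ip pivot from point 4 could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `collect_related_ips` and the sample event are made up, and it is not how RockNSM actually populates @meta.related_ips.

```python
import ipaddress


def collect_related_ips(event):
    """Return a sorted list of every distinct IP address found anywhere in `event`."""
    found = set()

    def walk(value):
        if isinstance(value, dict):
            for child in value.values():
                walk(child)
        elif isinstance(value, (list, tuple)):
            for child in value:
                walk(child)
        elif isinstance(value, str):
            try:
                found.add(str(ipaddress.ip_address(value)))
            except ValueError:
                pass  # not an IP address, ignore it

    walk(event)
    return sorted(found)


# Hypothetical event; the field names follow the current ECS prefixes.
event = {
    "source": {"ip": "10.0.0.5", "port": 51234},
    "destination": {"ip": "192.0.2.10", "port": 25},
    "network": {"forwarded_ip": ["198.51.100.7"]},
    "message": "connection established",
}
event["related"] = {"ip": collect_related_ips(event)}
```

Because the walk ignores field names, the same pivot field is filled in regardless of which prefix an IP originally appeared under.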

As for @spartan782's comment, the way Bro actually does this today is that, for the connection log, the origin is marked as the host that started the conversation. At the TCP/IP level this doesn't really make a difference. In the protocol-specific logs, the origin is the host that started that particular protocol conversation. SMTP is one example where this nuance is important. Host A may connect to Host B using TCP on port 25 (keeping it simple). Host A is the TCP/IP originating host. However, upon connection, Host B initiates the SMTP protocol by sending multiple emails for further processing/routing. At the SMTP protocol layer, Host B is actually the originating host. FTP-DATA connections are similar in that the directions of the TCP connections are usually opposite of who initiated the transfer.

I propose that those cases are actually annotated in the protocol-specific prefixes (i.e. smtp.source or smtp.origin <-- this is also semantically valid terminology with respect to the protocol itself.)

^1 Migration to ECS is underway for RockNSM, so this is a timely topic.

@webmat
Contributor

webmat commented Aug 3, 2018

Proxy IPs come from reverse proxies. The reason it's an array is that there can be more than one reverse proxy in front of the application logging the event. For example: Cloudflare => NGINX => application. Your list of forwarded IPs would include Cloudflare's edge node, then your NGINX load balancer's. You can have more than two, of course, just add in Varnish and Apache running PHP-fpm as the "application".

I love the idea of building a plain array of all seen IPs for a given event, for ease of pivoting. This would not only help catch situations where a proxy is compromised, and that's the hostile entity, but would also simplify pivoting for situations where we have potentially hostile IPs in "source" as well as in "destination" IPs, like DNS (see this discussion for more context).

@ruflin
Contributor Author

ruflin commented Aug 7, 2018

++ on having a place for all ip addresses (and other fields). My idea here so far is that the higher up in ECS, the more generic it is. As an example:

connection.source.host.ip: Contains only the source ip
connection.host.ip: Can contain source, destination ip
host.ip: Can contain all IPs appearing in the event.

@strawgate

strawgate commented Aug 8, 2018

host.ip containing all ip addresses from the event seems confusing.

The norm is to have things like src_ip and dst_ip. The current ECS makes that source.ip and destination.ip; this proposal now makes it connection.destination.host.ip and connection.source.host.ip, and I'm not entirely sure there is a benefit to this.

Fields that relate to a host are vast (architecture, OS, timezone, etc.), whereas fields that relate to a host that is part of a connection witnessed by a router, switch, or firewall are minimal. So I'm not exactly sure why prefixing each field with the object type is useful here, and stuff like this:

connection.source.host.ip: Contains only the source ip
connection.destination.host.ip: Contains only the destination ip
connection.host.ip: Can contain source, destination ip

doesn't really help with the confusion.

@webmat
Contributor

webmat commented Aug 8, 2018

I'm not sure I see the benefit of adding host. under connection.source and connection.destination. When I see host, I understand it as the server or process that generated the event or the log.

In a connection scenario, the process only knows host details about its own side of the connection, and not the other side. This means the bulk of the host details will actually shift around between source and destination. Two examples:

Application handling inbound requests:

connection.source.host.ip: A remote IP

connection.destination.host.ip: My server's IP
connection.destination.host.name: My server's hostname
connection.destination.host.id: ...
connection.destination.host.timezone: ...

Application calling out to an external system:

connection.source.host.ip: My server's IP
connection.source.host.name: My server's hostname
connection.source.host.id: ...
connection.source.host.timezone: ...

connection.destination.host.ip: A remote IP

This is what I mean by the host details shifting around.

I'm ok with having connection.source and connection.destination, but I think host. belongs outside of there. I like host at the top level, actually. It holds a few pieces of information I'm used to having on virtually all of my log events. So, keeping host outside of connection, we always have this shape of event, regardless of direction.

connection.source.ip: IP of initiator
connection.destination.ip: IP of queried service
host.name: My hostname
host.id: ...
host.timezone: ...
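
A rough Python sketch of the event shape above, where only the two IPs swap places with the direction and host stays at the top level. The helper `build_event` and all sample values are hypothetical and only illustrate the idea.

```python
def build_event(direction, local_ip, remote_ip, host):
    """Place the local and remote IPs by direction; keep host details at the top level."""
    if direction == "inbound":
        source_ip, destination_ip = remote_ip, local_ip
    elif direction == "outbound":
        source_ip, destination_ip = local_ip, remote_ip
    else:
        raise ValueError(f"unsupported direction: {direction!r}")
    return {
        "connection": {
            "direction": direction,
            "source": {"ip": source_ip},
            "destination": {"ip": destination_ip},
        },
        # Same location for both directions: details of the host logging the event.
        "host": dict(host),
    }


# Hypothetical host details and addresses.
my_host = {"name": "web-01", "id": "abc123", "timezone": "UTC"}
inbound = build_event("inbound", "192.0.2.10", "203.0.113.9", my_host)
outbound = build_event("outbound", "192.0.2.10", "203.0.113.9", my_host)
```

In both events the host block is identical, so a query on host.name does not need to know the direction of the connection.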

@MikePaquette
Contributor

@ruflin thanks for this PR. Clearly a great topic and a needed discussion, as it's generated a lot of sub-topics!

Here's my $0.02

  • I am not in favor of creating a new top-level namespace/object/prefix for connection.*
  • I am not in favor of re-using the host.* object anywhere except as a top level namespace/object/prefix.
  • I am in favor of using the existing network.* namespace/object/prefix for flow or connection-related fields.

@strawgate #51 (comment) I agree, this would be a big change, and I prefer to work through any shortcoming with the current set of namespaces/objects/prefixes.

@praseodym #51 (comment) I agree that vastly longer names without significant value will detract from two key ECS benefits, Ease of Recall and Ease of Deduction, and should therefore be avoided.

@urso #51 (comment) the network.forwarded_ip field definition may need some improvement. The original intent was to populate this field with the IP address(es) of network entity(ies) (e.g., proxies) forwarding network traffic associated with an event, when the source.ip is extracted from a field such as the x-forwarded-for HTTP header. Since the x-forwarded-for header contains both the "client" IP and the list of other proxies that have forwarded it, the network.forwarded_ip field would hold a list of IP addresses of all the proxies that may have forwarded this network traffic.
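
The x-forwarded-for handling described above could be sketched in Python as follows. This is a minimal illustration, assuming the common convention that the first entry of the header is the client and the remaining entries are proxies; the function name `split_forwarded_for` and the sample addresses are hypothetical. Note that the header is supplied by the client side and entries that carry a port are not handled here.

```python
import ipaddress


def split_forwarded_for(header_value):
    """Split an X-Forwarded-For value into (client_ip, proxy_ips)."""
    entries = [part.strip() for part in header_value.split(",") if part.strip()]
    if not entries:
        raise ValueError("empty X-Forwarded-For header")
    # Validate every entry; ip_address raises ValueError on malformed input.
    addresses = [str(ipaddress.ip_address(entry)) for entry in entries]
    return addresses[0], addresses[1:]


client_ip, proxies = split_forwarded_for("203.0.113.9, 198.51.100.7, 192.0.2.10")
event = {
    "source": {"ip": client_ip},
    "network": {"forwarded_ip": proxies},
}
```

With this split, source.ip holds the address taken from the header and network.forwarded_ip holds the list of forwarding entities, matching the intent stated above.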

@spartan782 #51 (comment) The source.* and destination.* namespaces/objects/prefixes are indeed defined in ECS to cover packet-level, session/connection-level, and application-level events, even when those events do not use the names "source" and "destination" to refer to their participants, or don't contain source and destination fields. The only shortcoming with this approach occurs when you need to have multiple levels in the same document, as highlighted by Rob Cowart in #9 (comment). I have a proposal for fixing this in the network.* namespace/object/prefix by adding a few fields to be used only in that case. More details soon.

@dcode #51 (comment) +1 to keeping the connection-related fields in the network.* namespace/object/prefix. Also, I am working on a mapping table (sorry not code) of the bro conn.log fields to ECS. Would love to compare this to your mapping. Stay tuned.

@strawgate #51 (comment). Agreed with your point that details (fields) relating to source and destinations in a network event will be fewer than those relating to a host in a host event. This was a key factor in originally choosing host.* , source.* , and destination.* as distinct top-level namespaces/objects/prefixes in ECS.

@webmat #51 (comment) Agreed, thanks.

@ruflin
Contributor Author

ruflin commented Aug 15, 2018

@webmat Having host inside source and destination would not remove it from the top level. I see host.* like a struct in Golang that can be reused in many places.

@webmat webmat mentioned this pull request Sep 18, 2018
26 tasks
@webmat
Contributor

webmat commented Oct 25, 2018

@ruflin Are you ok if we close this? I think it's clear we're not going to move in this direction after all :-)

@ruflin
Contributor Author

ruflin commented Oct 26, 2018

I don't think it's something we should do for 1.0 of ECS, but it's still something I think we should do in the long term to support more complex connection data. Based on the recent changes, the initial PR will need updating, but my proposal to have a connection object still stands. I suggest keeping this open but putting it on hold for now.

@ruflin
Contributor Author

ruflin commented Dec 7, 2018

Now that we are introducing also server / client, let's close this for now. I still like the idea of a connection though ;-)


8 participants