Guidance on anonymization/pseudonymization #68

loekvangool · 2018-08-06T09:40:15Z

I'd like to propose that ECS adds guidance for anonymization and pseudonymization. Some thoughts:

Definitions

anonymization: Irreversible data obfuscation.
pseudonymization: Reversible data obfuscation.

PII model
The NIST 800-122 publication on PII identifies levels of personally identifiable information:

High (4): publication has severe/catastrophic effects
Medium (3): publication has serious adverse effects
Low (2): publication has limited adverse effects
Public (1): not part of PII, but describes non-personal data

Typically, if one is allowed to see PII level X, one can also see PII levels < X (Air Force One uses the same method: walk freely towards the rear, but never walk forward of your own seat). We could also imagine putting pii_<level> as a pre- or postfix in field names to easily manage Field Level Security, because it supports access based on wildcards (*).
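
A minimal sketch of the "level X may also see levels below X" rule, assuming field names carry a pii<level> postfix. The helper name grant_patterns and the default separator are hypothetical; the returned patterns are meant for the grant list of a field-level security role.

```python
def grant_patterns(max_level, separator="_"):
    """Return wildcard patterns for every PII level up to and including max_level.

    Assumes fields are named <field><separator>pii<level>, with levels
    1 (public) to 4 (high) as listed above.
    """
    if not 1 <= max_level <= 4:
        raise ValueError("PII level must be between 1 and 4")
    return [f"*{separator}pii{level}" for level in range(1, max_level + 1)]


print(grant_patterns(2))  # ['*_pii1', '*_pii2']
```

A user cleared for level 2 would then get both patterns in one role, instead of one role per level.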

Varying levels of obfuscation
We should also recognize that various versions of the same field can (and should) exist in harmony. Perhaps the Dutch postal code system is a good example:

postalcode: 1234AB

The system is set up so that each character to the right is adding more precision to the location.

Perhaps in Elasticsearch this becomes:

customer.postalcode.raw: 1234AB
customer.postalcode.city: 12
customer.postalcode.obfuscated: E32DB25A9BAAA6AF655FE65A861C9BD35AF1868229E0E9D738236B4500626AFB

Or, implementing PII:

customer.postalcode.pii4: 1234AB <-- perhaps enough to identify the customer
customer.postalcode.pii2: 12 <-- not enough to identify the customer
customer.postalcode.pii1: E32DB25A9BAAA6AF655FE65A861C9BD35AF1868229E0E9D738236B4500626AFB <-- not enough to identify the customer, but based on PII 4 data hence we can bucket customers of the same street without knowing which street it is.

The above would allow various users to access the postal code at an appropriate level for their usage (in case Business Analytics, for example, uses non-PII 3 or 4 data only due to laws on personal data like GDPR).
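
As a rough illustration, the three representations could be derived before indexing along these lines. This is a sketch under assumptions: the thread does not say which hash produced the obfuscated value, so HMAC-SHA-256 with a secret key is my choice here, and the function name and key are hypothetical.

```python
import hashlib
import hmac


def postalcode_variants(postalcode, key):
    """Derive the three example representations of a Dutch postal code.

    key is a secret (bytes) for the keyed hash; keeping it constant keeps
    equal postal codes in the same bucket.
    """
    normalized = postalcode.replace(" ", "").upper()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest().upper()
    return {
        "customer.postalcode.pii4": normalized,      # full precision
        "customer.postalcode.pii2": normalized[:2],  # coarse region only
        "customer.postalcode.pii1": digest,          # opaque bucket key
    }


fields = postalcode_variants("1234 ab", key=b"example-secret")
print(fields["customer.postalcode.pii2"])  # 12
```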


ruflin · 2018-08-07T08:01:15Z

Thanks for filing this issue with all the details. I assume by default the "original" field would always contain the original object?

If we have a field a and a field b and want to use field security, would the fields have to be called pii1_a and pii1_b so we could give someone access to all pii1_* fields? It would be nice if we could have a_pii1 instead, so that a user with access to pii1 and pii2 could start typing a and see an autocomplete for all the fields available to them.

loekvangool · 2018-08-07T19:53:15Z

I assume by default the "original" field would always contain the original object?

We have to allow for that, while also allowing the user to remove the original data. Depends on the business needs and applicable compliance rules.

If we have a field a and a field b and want to use field security, would the fields have to be called pii1_a and pii1_b so we could give someone access to all pii1_* fields? It would be nice if we could have a_pii1 instead, so that a user with access to pii1 and pii2 could start typing a and see an autocomplete for all the fields available to them.

It can be a prefix or a postfix, does not make a difference w.r.t. how much effort it takes to create the Field Level Security rules. I also like postfix more because of the reasons you mentioned.

ruflin · 2018-08-08T08:45:52Z

Is there an easy way to do postfix field level security? Only found prefix so far: https://www.elastic.co/guide/en/elastic-stack-overview/current/field-level-security.html

loekvangool · 2018-08-08T09:11:48Z

I tested it real quick, seems to work just fine, luckily :).

PUT flstest

PUT flstest/doc/1
{
  "name.pii1": "secret",
  "name.pii2": "fd9adsfSADF"
}

GET flstest/doc/1

POST /_xpack/security/role/flsrole
{
  "indices": [
    {
      "names": [
        "flstest"
      ],
      "privileges": [
        "read"
      ],
      "field_security": {
        "grant": [
          "*.pii2"
        ]
      }
    }
  ]
}

POST /_xpack/security/user/flsuser
{
  "password": "s3cr3t",
  "roles": [
    "flsrole",
    "kibana_user"
  ]
}

POST /_xpack/security/role_mapping/flsmapping
{
  "roles": [
    "flsrole"
  ],
  "enabled": true,
  "rules": {
    "field": {
      "username": "flsuser"
    }
  },
  "metadata": {
    "version": 1
  }
}

As flsuser:

GET flstest/doc/1

Response:

{
  "_index": "flstest",
  "_type": "doc",
  "_id": "1",
  "_version": 3,
  "found": true,
  "_source": {
    "name.pii2": "fd9adsfSADF"
  }
}

DELETE /_xpack/security/role_mapping/flsmapping
DELETE /_xpack/security/role/flsrole
DELETE /_xpack/security/user/flsuser
DELETE flstest
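
To reason about which fields a grant list exposes without a cluster at hand, a rough local approximation may help. It uses Python's fnmatch, which is an assumption on my side: it is not Elasticsearch's matcher, and only simple * patterns like the one in the test above should be expected to behave the same way. The function name visible_source is hypothetical.

```python
from fnmatch import fnmatchcase


def visible_source(source, grants):
    """Keep only the top-level fields whose name matches at least one grant pattern."""
    return {
        name: value
        for name, value in source.items()
        if any(fnmatchcase(name, pattern) for pattern in grants)
    }


doc = {"name.pii1": "secret", "name.pii2": "fd9adsfSADF"}
print(visible_source(doc, ["*.pii2"]))  # {'name.pii2': 'fd9adsfSADF'}
```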

ruflin · 2018-08-09T09:10:16Z

Thanks for testing. Good to hear postfixes also work. I assume it doesn't make a difference if it is a.pii2 or a_pii2?

loekvangool · 2018-08-09T09:26:57Z

Getting the same results on _pii2, yes.

ruflin · 2018-08-09T12:44:06Z

Nice. So now we should tackle the bigger question: How do we fit this into ECS? So far ECS purely focuses on the field definitions. I definitely see how this makes sense "on top of ECS" and should be part of it. But where do we put it and how do we describe it?

@MikePaquette Ideas?

MikePaquette · 2018-09-12T22:19:39Z

@ruflin @loekvangool Sorry this has lagged so long without comment.

I wonder if this could be an "ECS Extension for Personal Data" that we can define, rather than defining the actual fields? I definitely like the postfix approach better than prefix for the reasons you've agreed upon.

Fields with the *.pii1 postfix could fit under our definition of ECS Extended fields, although we would not define them all. Having multiple levels is a great idea, as it can satisfy NIST and also GDPR, depending on the jurisdiction of the instantiation.

If this is acceptable, I could fold this into a big 'ol PR on the README.md

loekvangool · 2018-09-13T04:12:55Z

@MikePaquette sounds good. Reading my own post again, I think there is one additional item:

We should consider adding guidance on naming conventions for anonymized fields. Anonymized data cannot be reversed and should not have a constant salt, hence referential integrity is lost, and indexing should probably be disabled on them.
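
The effect of a constant versus a non-constant salt on referential integrity can be sketched as follows. This is only an illustration: the function names and the key are hypothetical, and SHA-256 is my choice rather than anything agreed in this thread.

```python
import hashlib
import hmac
import secrets


def hash_constant_salt(value, key):
    """Keyed hash with a constant key: equal inputs give equal outputs."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_random_salt(value):
    """Hash with a fresh random salt that is thrown away afterwards."""
    salt = secrets.token_bytes(16)
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


key = b"example-secret"
# Same input, same output: documents can still be joined or bucketed.
print(hash_constant_salt("10.0.0.1", key) == hash_constant_salt("10.0.0.1", key))
# Same input, unrelated outputs: nothing to join or search on.
print(hash_random_salt("10.0.0.1") == hash_random_salt("10.0.0.1"))
```

With the random salt, two events from the same address no longer share a value, which is why indexing such a field buys little.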

MikePaquette · 2018-09-13T11:31:54Z

@loekvangool I think we agree that ECS should not attempt to specify what constitutes anonymization, but rather it should define a convention for storing pseudonymized and/or anonymized data in fields with a defined set of postfixes and a defined order of increasing anonymity.

So are you saying that the fourth level (assuming we define 4 of them) would be defined as anonymized? And thus the index mapping template should not bother to index fields with this fourth postfix?

loekvangool · 2018-09-13T12:38:41Z

@MikePaquette I agree we should not take a stand in that discussion, such a complex area with opinions on definitions etc.

We also can't say that PII4 is anonymized. It might as well be non-personal data altogether.

No, I guess I'm proposing to add a line like "if you anonymize data, and there is no consistent hashing used, append .anonymized to the field name before .piiX". We could cover index: false with a dynamic mapping template.

It doesn't take long to see that this line of thinking could lead to multiple flags being added to the field names leading to the usage of wildcards on both sides in the dynamic mapping (such as *.pii1*) which is not great and I have no clever way to fix this right now. The only thing I can think of to prevent this is to add the attributes to the field definition of the mapping:

"source.ip.pii2.anonymized": {
  "type": "ip",
  "index": false
}

vs:

"source.ip": {
  "type": "ip",
  "pii": "2",
  "anonymized": true
}

MikePaquette · 2018-09-13T14:02:17Z

Thanks @loekvangool I see. I am wondering if this (special treatment for anonymized fields) might be an optimization that is too complex, or beyond the scope of ECS. I will attempt to document this concept for the README.md, and would welcome your feedback.

loekvangool · 2018-09-13T14:18:47Z

Sounds good

ruflin · 2018-11-05T11:19:55Z

A similar topic was raised in a discussion around GDPR. I think one thing that ECS can do here is coming up with a convention for how to name such fields.

My current suggestion would be a _postfix to be used. This has several advantages:

Original source field does not need to change
It works on all fields
It does not conflict with the object syntax
Works with field level security (see test from Loek)
Makes it possible to apply dynamic types based on a pattern.

Taking source.ip as an example, this would become:

source.ip_pii2
source.ip_yourpostfix

I would normally expect the fields with the postfix to be a keyword and not necessarily the same type as the original field. For example, an IP address that is hashed is not an IP anymore.
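
A sketch of what "dynamic types based on a pattern" could look like, built as a Python dict that serializes to the mappings part of an index template. The template names and the list of postfixes are assumptions; one single-wildcard pattern per postfix is used to sidestep the double-wildcard question.

```python
import json


def keyword_template(postfix):
    """Build one dynamic template mapping string fields ending in _<postfix> to keyword."""
    return {
        f"{postfix}_as_keyword": {
            "match_mapping_type": "string",
            "match": f"*_{postfix}",
            "mapping": {"type": "keyword"},
        }
    }


mappings = {
    "dynamic_templates": [keyword_template(p) for p in ("pii1", "pii2", "pii3", "pii4")]
}
print(json.dumps(mappings, indent=2))
```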

webmat · 2018-11-05T15:00:26Z

Just a clarification on Elasticsearch mappings (I can't dive deep into this anonymization question, on the eve of the Beta1 release of ECS :-) ).

A given field that's already defined in the mapping (such as source.ip) cannot have other sub-fields defined under it. Multi-fields looks like it's doing that (e.g. myfield vs myfield.keyword), but a multi-field is actually the same information being indexed differently by Elasticsearch, as different data types. The type of information we're talking about here (the various types of anonymization) would not be done by Elasticsearch itself, it would have to be done in advance of sending to Elasticsearch.

In other words, they will need to be in distinct fields (not nested under another one), like those @ruflin is suggesting. Concretely: cannot be source.ip.pii1, it would have to be source.ip_pii1.
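
The conflict can be modelled in a few lines. This is a simplified sketch of the leaf-versus-object clash, not how the Elasticsearch mapper is implemented; the function name expand_dotted is hypothetical.

```python
def expand_dotted(flat):
    """Expand {"a.b": value} into nested dicts.

    Raises ValueError when a name is used both as a leaf value and as an object.
    """
    root = {}
    for path, value in flat.items():
        node = root
        *parents, leaf = path.split(".")
        for part in parents:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{path!r}: {part!r} already holds a value")
            node = child
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{path!r}: {leaf!r} is already an object")
        node[leaf] = value
    return root


# Works: the postfixed field is a sibling of source.ip.
print(expand_dotted({"source.ip": "10.0.0.1", "source.ip_pii1": "ab12"}))

# Fails: source.ip cannot be a value and a parent at the same time.
try:
    expand_dotted({"source.ip": "10.0.0.1", "source.ip.pii1": "ab12"})
except ValueError as err:
    print("conflict:", err)
```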

loekvangool · 2018-11-26T12:00:46Z

Agreed, otherwise it will not be possible. Probably good to include a warning in the guidance to make sure people don't make this error.

loekvangool · 2018-11-26T12:18:14Z

Maybe good to note that there are (at least) two topics:

Introducing a postfix for hashed data (be it reversible or not), and that likely can be assumed to be a keyword field.
The PII naming system, which is agnostic about obfuscation and is intended to help build easy and fail-proof authorization rules.

The way PII is defined (again looking at https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-122.pdf), it seems that a hashed version of a field should lead to a lower PII score because it's less harmful if publicized. But that determination is subjective and in the hands of the CISO. I conclude source.ip_pii2 can still have an obfuscated version source.ip_pii2_obfuscated for users that don't think that it's PII3 and thus we could have 2 postfixes. It seems useful since we can still create dynamic mapping rules and authorization rules for *_pii2*[1] as well as *_obfuscated. Thoughts?
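
For tooling that has to consume such names, a small parser for the two postfixes might look like this. It is a sketch that assumes the order _pii<level> then _obfuscated and levels 1 to 4; parse_field is a hypothetical name.

```python
import re

# Optional _pii<level>, then optional _obfuscated, both anchored at the end of the name.
_POSTFIXES = re.compile(
    r"^(?P<base>.+?)(?:_pii(?P<level>[1-4]))?(?P<obfuscated>_obfuscated)?$"
)


def parse_field(name):
    """Split a field name into (base name, PII level or None, obfuscated flag)."""
    match = _POSTFIXES.match(name)
    if match is None:
        raise ValueError("empty field name")
    level = match.group("level")
    return (
        match.group("base"),
        int(level) if level else None,
        match.group("obfuscated") is not None,
    )


print(parse_field("source.ip_pii2_obfuscated"))  # ('source.ip', 2, True)
```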

[1] Note to self: Maybe check the performance on double wildcards...

ruflin · 2018-11-27T07:15:20Z

It's definitely nice if all the above is possible, as it gives the creator of the data a lot of flexibility. As you mentioned, we should definitely test double wildcards. I wonder how much we need to define on the ECS side and how much we leave to the user. To get started, just defining that _* should be used would probably be enough.

loekvangool · 2019-03-12T16:09:10Z

Since changing the name of fields is inherently destructive to ECS's advantages (which is, in essence, the predictability of field names), I would still opt for the introduction of field metadata to implement PII:

"properties": {
  "source.ip": {
    "type": "ip",
    "pii": 2,
    "scrambled": true
  }
}

Or:

"properties": {
  "source.ip": {
    "type": "ip",
    "metadata": {
      "pii": 2,
      "scrambled": true
    }
  }
}

The second example is on-par with the way we do User Metadata today:

POST /_xpack/security/user/jacknich
{
  "password" : "j@rV1s",
  "roles" : [ "admin", "other_role1" ],
  "full_name" : "Jack Nicholson",
  "email" : "jacknich@example.com",
  "metadata" : {
    "intelligence" : 7
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-put-user.html

ruflin · 2019-03-13T08:03:52Z

I might miss something here. Is metadata a feature in Elasticsearch or is just an object?

loekvangool · 2019-03-13T22:14:11Z

Good point @ruflin those examples were a bit ambiguous, I changed them. I do mean adding field metadata to Mappings, which then should be usable in Security:

POST /_xpack/security/role/filebeat_pii2
{
  "indices": [
    {
      "names": [ "filebeat-*" ],
      "privileges": ["all"],
      "field_security" : {
        "grant" : [ { "metadata": { "pii": 2 } } ] // grant access to all PII level 2 fields
      }
    }
  ]
}
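
Until something like that exists, the same idea could be approximated on the client side by resolving the metadata into an explicit list of field names for the grant list. A sketch only: the metadata key on a field definition is the proposal from this thread and not something I have verified in Elasticsearch, the function name is hypothetical, and the "at most max_level" comparison follows the rule from the first post.

```python
def fields_up_to_level(properties, max_level):
    """List fields whose (hypothetical) metadata.pii value is at most max_level.

    Fields without a pii entry are left out, so unlabeled data is not exposed.
    """
    granted = []
    for name, definition in sorted(properties.items()):
        pii = definition.get("metadata", {}).get("pii")
        if pii is not None and int(pii) <= max_level:
            granted.append(name)
    return granted


properties = {
    "source.ip": {"type": "ip", "metadata": {"pii": 2, "scrambled": True}},
    "user.name": {"type": "keyword", "metadata": {"pii": 4}},
    "event.action": {"type": "keyword", "metadata": {"pii": 1}},
    "message": {"type": "text"},
}
print(fields_up_to_level(properties, 2))  # ['event.action', 'source.ip']
```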

ruflin · 2019-03-15T07:59:14Z

Were you thinking of metadata per field, something like this? elastic/elasticsearch#33267

loekvangool · 2019-03-15T09:13:27Z

Yes that's it.

jiriatipteldotorg · 2019-04-10T05:54:28Z

Wouldn't the anonymization/encryption be useful not just outside ECS but even outside Elasticsearch?
The scenario I have in mind is SaaS, where users need to know the data in the database can't be seen by anyone else. They can already supply encrypted data to ES, have it stored in encrypted form, retrieve it, and decrypt it in the browser.

loekvangool · 2019-04-10T07:52:55Z

So if I understand you correctly, you're basically talking about storing obfuscated data in Elasticsearch and making it reversible to the original data in an external tool? Can you elaborate on what's stopping you from doing that in Elasticsearch today?

jiriatipteldotorg · 2019-04-10T07:57:49Z

Yes. I don't think there is anything preventing me from doing that in ES. The question was whether folks may be interested in having that as the answer to the anonymization problem stated above. If it would take some work in ELK, then only in the K part.

loekvangool added the discuss label Aug 6, 2018

webmat mentioned this issue Sep 18, 2018

Getting ECS to 1.0 #115

Closed

26 tasks

ruflin mentioned this issue Oct 31, 2018

Post GA tasks #160

Closed

22 tasks
