-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WH] Avoid adding UserPersonalData to the event bus #66
Comments
Leaving this ungroomed until we get more feedback on the possibility of dropping the personal data from the events. |
The event bus team is looking for further input on how to proceed with signals such as enrollments which contain the user's sensitive data. In our current version of the event bus, we pass in signals which can sometimes contain user information. Most signals have this information in a UserData field, which in turn contains a For the time being we can just send an error message when handling signals with UserPersonalData since this is sensitive data that should not be serialized, but we are hoping to generate a discussion on how to proceed with this field for the future, since we intend to remove all personal data that is passed. A list of some possible solutions is included below: Drop data before the serialize/deserialize process and fill it in with empty data, this fix will essentially “wipe” all traces of the user’s pii when the signals are passed to the event bus silently. Error when we are expecting the UserPersonalData field, this solution will basically make it so the serializer and deserializer do not work with the userpersonaldata field by throwing an error, and as such, mark these signals as not for use with the event bus if it contains user pii. This is a temporary solution and as more signals are added in, we would have to continuously update which signals contain UserPersonalSata and mark them. Make UserPersonalData an optional field across all current signals, this solution will allow the signals to continue being serialized when userpersonaldata is not passed in, although it may be tricky to determine serialization for signals that have this field optional yet still contain the user’s personal data. For future signals we would need a way to make it so UserPersonalData is optional or document it for cases where it is not optional |
Is the idea that PII should not go over the event bus at all, or that we want to be very explicit when your service needs PII and opt-out by default? |
If the desire is to be explicit, we could define two different event signals for the same logical thing (one with PII and one without), give them both proper names with some convention to indicate PII, and emit both. No deserialization surprises, and the number of events with PII shouldn't be that high (right?). |
I do believe that the intention is to not have PII go to the event bus at all, PII is something that I don't believe any signal explicitly "needs" to perform their specific function. Although this solution of defining two signals, an optional PII and a required PII, doesn't seem to be a bad idea in terms of making sure there aren't any deserialization surprises. |
I like the explicitness of PII messages and non-PII messages, then people can choose to hook up the ones they want and there's no surprises about data they thought was there but was actually missing because your particular flavor doesn't consume PII. If the object doesn't have PII in the first place, and that's clear, then you don't have to solve for "how will people deal with the PII being missing some of the time when they try to access it?" This raises another question for me though. Does this imply that we will need configuration to choose which events turn into messages on the bus and which don't? |
This might be straightforward if it's a binary config setting as to whether PII messages get sent over the bus or not. I hope we can avoid anything more granular than that. I do want to understand the reasoning behind this a bit better though. Is there a particular regulatory compliance issue with sending PII in this manner that this is meant to address? A specific voiced concern from the security working group or some other body? Or is this because we just believe that it's always prudent to minimize exposure on this front...? |
From @feanil:
Yes. I think events need to be explicitly configured (potentially using a decorator?) to be turned into a message on the bus. Ideally, this information would make it out to the documentation for the events. |
Given that, I still have the understanding that user retirement continues to be a requirement and that permanent storage of immutable PII data would not be ok. |
I don't think infinite data retention should be one of our goals. It would be a goal if we were using this as the record of truth and deriving all state from it, but the event bus as envisioned is more of a notification and update propagation mechanism, with canonical state still stored in Django models. Kafka's design will keep messages around for quite some time, but I think the services should treat that state as ephemeral in principle, and only lean on the history for things like recovering from service downtime. If a service needs to have the full history of all the events of a certain type, it can copy it to some other place.
That's a possibility, but I think that no matter what, a service consumes PII data must be required to process retirement events as well, so that it can dispose of any PII it may have copied over.
I'm actually kind of surprised by the stance that only the LMS would hold this data. Don't we already have copying of basic user information (like names) to various services so they can do things like generate report files with names in them or display paginated lists of users without making a crazy number of calls to the LMS?
Fair point. The current event is I think that it's actually fine to have
I think that it's actually okay for other services to emit the same data structure, as long as it's in a different event. For instance, say there's an event when someone posts on the forum. I think it's reasonable for the (There's probably another version of this where we lean into edx-ace and keep the periphery more ignorant of user details, but as far as I know, nobody's really pushing that work forward at the moment.)
I think that as long as we keep the identity service oriented events in a separate package (a peer to |
@ormsbee: Thanks again for your thoughts.
Agreed. I was just trying to understand why this was even an issue in my mind. I had a partial-decision in my head from past discussions, with no ADR to explain how we got there, and with the partial-decision no longer making sense to me. So, we'll be looking to see if short data retention is a secure enough solution.
Yes, we do this now. However, there was a push to minimize/stop this in order to minimize the spread of PII and make the overall plaform more secure. There is always a trade-off between convenience and security. The convenient thing to do here is just to copy the PII across and let people copy only what they must into the service. This provides some additional PII exposure inside Kafka, but not for all users across all time. I can side with convenience, but would like a second opinion from those who lean towards security. For example, I think one team was asked to jump through a lot of hoops to avoid copying this data, for better or for worse. I'd need to dig in to this to provide more details.
I can accept this. I do think it will take proper education for people to understand that this potentially-stale data should be used with caution. |
If anyone is interested in reviewing the ADR summarizing this discussion, I've linked it in the pull request here: #73, we'll close this issue once the PR lands. |
Some of the events specified in openedx-events contain PII. One clear example is the UserPersonalData class.
AC:
The text was updated successfully, but these errors were encountered: