This repository has been archived by the owner on May 6, 2020. It is now read-only.

Delete media nobody has access to any more #166

Open
lampholder opened this issue May 11, 2018 · 23 comments

Comments

@lampholder
Member

No description provided.

@turt2live
Member

For the implementation it would be nice if this went through the media provider structure that already exists.

In an ideal world the feature could live independently of synapse, although I question if that's possible.

Also, encrypted media will appear as "unreferenced" and we'll have to be careful not to delete it.

@Half-Shot
Member

Half-Shot commented May 11, 2018

You'd surely need to start tagging media with room IDs to be absolutely certain it is not referenced anywhere else? Remember that media can also be inline, so it might not appear in m.image/m.video/m.audio.

Also the above suggestion would weaken encryption somewhat by adding yet more metadata, so you'd need to be very careful with this.

@richvdh
Member

richvdh commented May 14, 2018

How do we know when nobody has access to it any more?

@richvdh
Member

richvdh commented May 14, 2018

(I guess that's what @turt2live and @Half-Shot are saying. but generally: I don't really know what this means)

@lampholder
Member Author

lampholder commented Jun 4, 2018

It might be that we need to tag media items with event ids such that, when we process the erasure of the event id we can simply also erase the corresponding media at the same time. If this is the case, we can probably populate the missing event ids for existing media in the repo by crawling through all the accessible message history.
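To illustrate the crawling idea, here is a minimal sketch in Python. It assumes events are available as plain dicts with `event_id` and `content` keys, and that an mxc URI looks like `mxc://<server-name>/<media-id>`; the regular expression and the function names `find_mxc_uris` and `index_media_by_event` are hypothetical, not part of any existing media repo or homeserver code. It scans every string in the event content so that inline references are picked up as well.

```python
import re
from collections import defaultdict

# Assumed shape of an mxc URI: mxc://<server-name>/<media-id>.
# The character classes below are an illustrative guess, not a spec grammar.
MXC_PATTERN = re.compile(r"mxc://[A-Za-z0-9.\-:\[\]]+/[A-Za-z0-9_\-]+")


def find_mxc_uris(value):
    """Recursively collect mxc URIs from any string inside an event's content."""
    if isinstance(value, str):
        return set(MXC_PATTERN.findall(value))
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        found = set()
        for item in value:
            found |= find_mxc_uris(item)
        return found
    return set()


def index_media_by_event(events):
    """Map each mxc URI to the ids of the events that reference it, in crawl order."""
    index = defaultdict(list)
    for event in events:
        for uri in sorted(find_mxc_uris(event.get("content", {}))):
            index[uri].append(event["event_id"])
    return dict(index)


# Made-up sample history: one image sent twice and one inline image.
events = [
    {"event_id": "$1", "type": "m.room.message",
     "content": {"msgtype": "m.image", "url": "mxc://example.org/abc123"}},
    {"event_id": "$2", "type": "m.room.message",
     "content": {"msgtype": "m.text", "body": "hi",
                 "formatted_body": '<img src="mxc://example.org/emoji1">'}},
    {"event_id": "$3", "type": "m.room.message",
     "content": {"msgtype": "m.image", "url": "mxc://example.org/abc123"}},
]
index = index_media_by_event(events)
```

Any media item in the repo that is absent from `index` after a full crawl is "untied" in the sense used above; the sketch says nothing about why it is untied.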

Encrypted events will not be available to crawl in this way, but that might be fine because:

  1. I think we can assert that any media item in the media repo that cannot be tied to an event must have been created in association with an encrypted event, and therefore is itself encrypted
  2. Historical encrypted media need not necessarily be managed in the same way as unencrypted media, because the users' management of encryption keys essentially gives them the control they need

We probably also need to think about whether we need special provision for media repos forwarding a GDPR Art. 17 request somehow. My gut feeling is that this should not be handled by the media repo - the media repo should just delete media it is asked to by the homeserver when the homeserver wants to delete a media event.

Finally, this has some implications for recycling mxcs (inasmuch as that would have to be discouraged if media weren't just to disappear by surprise). I think there's a very strong case for discouraging the recycling of mxcs, though.

@Half-Shot
Member

Half-Shot commented Jun 4, 2018

I think we can assert that any media item in the media repo that cannot be tied to an event must have been created in association with an encrypted event, and therefore is itself encrypted

There are a few reasons why this might not be the case:

  • Misc things for bridges (e.g. bridging custom emojis from Discord->Matrix involves us uploading some media items for inline images).
  • Uploading things simply to store them on a server, e.g. I had a script that stored screenshots on my Matrix HS.

Though for the former, you could probably grep events (at great expense of performance) for any mentions in the body. The latter arguably isn't really a good use of Matrix, but it is still valid.

I guess what I'm saying is unless you can absolutely prove the media isn't being used by anything then it's best not to delete it.

An alternative might just be retention policies based on the last access date? If you know that:

  • The media isn't owned by any room, building on the tagging idea.
  • The media hasn't been accessed in X days.
  • There isn't some kind of special exemption (e.g. avatars)
Then it should be safe to remove it.
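The three conditions above can be sketched as a single predicate. This is a rough illustration only: the `MediaRecord` type, its fields, and the `is_removable` name are hypothetical, and it assumes the repo already tracks room tags, a last-access timestamp, and an exemption flag for things like avatars.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MediaRecord:
    mxc: str
    room_ids: frozenset       # rooms known to reference this media (the tagging idea)
    last_accessed: datetime   # last time the media was downloaded
    exempt: bool = False      # special exemption, e.g. avatars


def is_removable(record, now, max_idle):
    """Return True only if every retention condition holds."""
    if record.exempt:
        return False
    if record.room_ids:
        return False
    return now - record.last_accessed > max_idle


# Made-up records evaluated against a 90-day idle limit.
now = datetime(2018, 6, 4, tzinfo=timezone.utc)
limit = timedelta(days=90)
stale = MediaRecord("mxc://example.org/a", frozenset(), now - timedelta(days=120))
recent = MediaRecord("mxc://example.org/b", frozenset(), now - timedelta(days=5))
in_room = MediaRecord("mxc://example.org/c", frozenset({"!room:example.org"}),
                      now - timedelta(days=120))
avatar = MediaRecord("mxc://example.org/d", frozenset(),
                     now - timedelta(days=120), exempt=True)
```

Only `stale` passes all three checks here; the other records each fail exactly one condition.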

I think there's a very strong case for discouraging the recycling of mxcs

I'd be shocked if there was ever a reason to recycle mxcs.

@turt2live
Member

I think we can assert that any media item in the media repo that cannot be tied to an event must have been created in association with an encrypted event, and therefore is itself encrypted

User avatars can't really fit into this: it'll be a 1:many relationship because whatever crawler we use would pick up more than one event referencing the media. If the avatar gets redacted in a room, that shouldn't delete it.

There's also the case of people using the media repo directly for whatever reason, and not referencing it in matrix. For instance, the IRC bridge auto-pastebins long messages via the media repo.


Federated media is a bit harder: who has a copy of the media? It may not be possible to guarantee that all servers in the room have the media, and it may have been cached by parties not in the room as well. It may be acceptable to just forward a bulk delete request (because individual requests would be bad) to other servers, specing that they are required to honour it with the known caveat that we can't force another server to delete something.

See also:

On the not-GDPR front: deleting media that has been redacted is another hard problem to solve due to forwarded events, the person redacting may not belong to the origin server, etc.

@Half-Shot
Member

@turt2live I understood this issue to be about locally removing media rather than nuking it for everyone. From a space saving perspective? Even from a GDPR pov do we care what others store?

@lampholder
Member Author

I guess what I'm saying is unless you can absolutely prove the media isn't being used by anything then it's best not to delete it.

This might be a philosophical question. Is the matrix media repo a place to store media, or a place to store media in support of events in matrix rooms? Also there are probably different versions of 'best' here - under GDPR it might be that, if we couldn't answer when asked why we have something, perhaps we shouldn't have it :\

User avatars can't really fit into this: it'll be a 1:many relationship because whatever crawler we use would pick up more than one event referencing the media. If the avatar gets redacted in a room, that shouldn't delete it.

Gah, I'd forgotten about avatars - would we be able to do some smart inspection of the event types to filter those though? And I was kinda thinking we'd have to handle the 1:many relationship anyway (to handle event forwarding/other random mxc recycling) - would it be tractable to associate the mxc with the 'first' event referencing it?
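As a rough sketch of the 'first event' idea with event-type filtering: the code below assumes events are dicts carrying `type`, `origin_server_ts`, and a top-level `url` in `content`, and it treats `m.room.member` and `m.room.avatar` as the avatar-bearing types to skip. The helper name and the choice of timestamp ordering are assumptions for illustration, not a settled design.

```python
# Event types assumed to carry avatars rather than "media-creating" messages.
AVATAR_EVENT_TYPES = {"m.room.member", "m.room.avatar"}


def first_referencing_event(events, mxc_uri):
    """Return the id of the earliest non-avatar event whose content url is mxc_uri."""
    candidates = [
        event for event in events
        if event["type"] not in AVATAR_EVENT_TYPES
        and event.get("content", {}).get("url") == mxc_uri
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda event: event["origin_server_ts"])["event_id"]


# Made-up history: an avatar event, the original post, and a later forward.
events = [
    {"event_id": "$avatar", "type": "m.room.avatar", "origin_server_ts": 50,
     "content": {"url": "mxc://example.org/cat"}},
    {"event_id": "$forward", "type": "m.room.message", "origin_server_ts": 200,
     "content": {"msgtype": "m.image", "url": "mxc://example.org/cat"}},
    {"event_id": "$original", "type": "m.room.message", "origin_server_ts": 100,
     "content": {"msgtype": "m.image", "url": "mxc://example.org/cat"}},
]
owner_event = first_referencing_event(events, "mxc://example.org/cat")
```

Ordering by `origin_server_ts` is only one possible definition of 'first'; a real implementation would have to decide how to order events across rooms and servers.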

On the not-GDPR front: deleting media that has been redacted is another hard problem to solve due to forwarded events, the person redacting may not belong to the origin server, etc.

I'm thinking we'd want to reconsider the 'forwarded event' idea (although if all media is created with a reference to an associated event id going forward this wouldn't necessarily be a problem since you could identify when an event was not the media-creating event).

Also I forgot to capture the complexity that associating all new media with an event id might be convoluted since you'd need to upload the media to get the mxc to put into the event to get the id to put into the media repo...

@turt2live
Member

@turt2live I understood this issue to be about locally removing media rather than nuking it for everyone. From a space saving perspective? Even from a GDPR pov do we care what others store?

@Half-Shot I'd imagine as part of GDPR best effort should be applied to try and erase the user's existence.

The complication of proving ownership of media could just be "server name matches" with a signed request (see also: https://github.com/matrix-org/matrix-doc/issues/701#issuecomment-394121896)
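A minimal sketch of the "server name matches" check, assuming an mxc URI has the form `mxc://<server-name>/<media-id>` and that the requesting server's name has already been authenticated by verifying the request signature (that step is not shown). The function names are hypothetical.

```python
from urllib.parse import urlparse


def parse_mxc(uri):
    """Split an mxc URI into (server_name, media_id), raising ValueError if malformed."""
    parsed = urlparse(uri)
    media_id = parsed.path.lstrip("/")
    if parsed.scheme != "mxc" or not parsed.netloc or not media_id or "/" in media_id:
        raise ValueError(f"not a valid mxc URI: {uri!r}")
    return parsed.netloc, media_id


def may_request_deletion(requesting_server, uri):
    """Allow deletion only when the media's origin is the (already verified) requester."""
    origin, _media_id = parse_mxc(uri)
    return origin == requesting_server


allowed = may_request_deletion("example.org", "mxc://example.org/abc123")
denied = may_request_deletion("evil.example", "mxc://example.org/abc123")
```

This only establishes which server the media originated from; it does not say which user on that server uploaded it.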

would it be tractable to associate the mxc with the 'first' event referencing it?

Depends entirely on how you'd want to consider it. If someone forwards an image someone else sent - who owns that image? Both people can probably be considered the "owner" of the media and therefore linked to it, despite the original uploader being the only one associated. If the second person wanted to be forgotten, should that image be deleted?

There's also the special case of stickers (and probably a ton of other stuff): surely if someone deletes their account then we shouldn't go around deleting stickers (because mxc reuse).

Is the matrix media repo a place to store media, or a place to store media in support of events in matrix rooms?

"Yes" is kinda the answer, unfortunately. Because the repo is so generic some people use it as a CDN while others (probably most) use it as intended: for matrix. There's a couple people out there that (for some reason) host their entire website off the media repo. Even I'm personally going the direction of using the media repo to give bots avatars on my website (instead of having the media duplicated everywhere).


One possible solution (that falls apart quickly with the right to erasure) for the redacting media side is to just let the client deal with it. The client would send a DELETE to the media repo alongside the redact to the homeserver. This has the concern of proving ownership (or authority) to delete the media, but it does mean that the user has the option to hard-delete encrypted media (at least from their server).

@turt2live
Member

fwiw, the advice I got from unpaid lawyer irl friends was it might be best to try and associate media with users rather than events. The redacting problem can probably be pushed further down the line, despite my attempts to solve it alongside gdpr.

(the legal advice is what drove this btw: t2bot/matrix-media-repo#96)

@Half-Shot
Member

Actually, I'm kinda surprised the local homeserver doesn't log who uploaded it given we require the access_token anyway.

@richvdh
Member

richvdh commented Jun 5, 2018

We do track the user that uploaded a given bit of media. The problem we have is that we've decided not to automatically delete all content when a user asks to be erased (cf https://matrix.org/blog/2018/05/08/gdpr-compliance-in-matrix/) so that information doesn't help us.

@richvdh
Member

richvdh commented Jun 5, 2018

[though I wonder if this is a rather dangerous situation - it works for messages because we will restrict access to users who were in the room, but there is nothing stopping the url for a bit of media being available somewhere outside of an event, and public for everyone to see, despite the uploader having asked to be erased]

@lampholder
Member Author

For messages, the homeserver replicates the 'email experience', so users can always see the messages that were sent to them even after the sender exercises their right to erasure.

For simplicity of reasoning, it would be great if the media repo could replicate the same experience. But today it can't, 'cause it has no concept of ACLs or visibility - if you have the URL, you have the data blob.

All we know from the media repo is that a given piece of content was uploaded by a given user. I don't think the media repo exposes this information at all (if I'm reading the docs right).

As described in my comment the other day, if we associate a media item with the event id of the event that posted it, this supports our deleting the media at the point at which we're erasing the event from our database (the point at which no active matrix user can claim visibility of that message). By itself, this does not address @richvdh's point - we'd still be serving the media to anyone with the URL, which could (easily) have been shared out of band.

We could more radically overhaul the media repo to both make it aware of event ids and make it validate the requester's right to see that media by piping the event id back through synapse. This isn't going to happen quickly, though, without a lot of collateral damage. At the minimum we'd need to:

  • figure out how we handle avatars/stickers/any other stuff I've forgotten
  • do spec proposal + non-trivial time to mature
  • advertise the change + identify a migration strategy for all the (ab)use cases that the current behaviour supports

If we don't do the above, then we have a range of options between, I think, two extremes. When Alice GDPR17s herself:

  1. we delete all media uploaded by her and pass on a request to all federated homeservers to please do the same
  2. we draw the parallel that having the unique URL is the same as having the content, so in posting a media event with a reference to the mxc, Alice has transferred that media from herself to anyone who receives it (via whatever means). We delete nothing.

@richvdh
Member

richvdh commented Jun 5, 2018

we draw the parallel that having the unique URL is the same as having the content, so in posting a media event with a reference to the mxc, Alice has transferred that media from herself to anyone who receives it (via whatever means). We delete nothing.

which might be fine if the average user had the first clue what an event with an mxc was. We go to a lot of effort to make it easy just to send a cat gif to a room - Alice has no reason to realise that there is a whole separate media repo, and can expect that, having realised her error, become a dog person, and requested erasure, we won't continue to serve incriminating cat evidence.

@lampholder
Member Author

If someone forwards an image someone else sent - who owns that image? Both people can probably be considered the "owner" of the media and therefore linked to it, despite the original uploader being the only one associated. If the second person wanted to be forgotten, should that image be deleted?

I think this is really just a case of making a decision (by which I mean I think there's a tractable technical impl regardless of which conclusion we draw as to the philosophical or legal ownership of the media content).

Personally, I'd like to keep it simple and say "you didn't forward the media, you forwarded a reference to that media, and the media repo's contract (to leave the media a given mxc refers to intact) is only with the uploader, nobody else".

If we're explicit about this, and as builders of Riot.im give users features that are not likely to fall foul of this, then I think we can have much simpler lives by dissuading the recycling of mxcs generally.

@richvdh
Member

richvdh commented Jun 5, 2018

I don't think the media repo exposes this information at all (if I'm reading the docs right).

correct, afaik.

@lampholder
Member Author

which might be fine if the average user had the first clue what an event with an mxc was. We go to a lot of effort to make it easy just to send a cat gif to a room - Alice has no reason to realise that there is a whole separate media repo, and can expect that, having realised her error, become a dog person, and requested erasure, we won't continue to serve incriminating cat evidence.

Certainly whatever we decide, we can do some good work to clarify the situation by adding some additional UX to the media upload (of course, that only really helps with Riot.im, unless we do something really weird to the media upload API).

@lampholder
Member Author

We can of course treat the two services as entirely distinct, with distinct erasure policies.

With media items' being associated with the user id, we could give the user a 'media control panel' they can use to see all the media they've put in a given repo, to then erase it (or submit an erase request that we honour after n whenevers). They could then choose to erase all media on account deactivation (with the similar warnings about what that will do to other users' experience of the service).

Another idea - we could enhance the media repo so that media has an expiry date, which is advertised in the UX when the expiry date is close, and which "media owners" can choose to reduce to 30 days from today if they like.
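A small sketch of how the "reduce to 30 days from today" rule might behave, assuming each media item carries an optional expiry timestamp. The `shorten_expiry` name and the choice never to lengthen an existing expiry are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Shortest notice period an owner may request, per the 30-day idea above.
MIN_NOTICE = timedelta(days=30)


def shorten_expiry(current_expiry, now, notice=MIN_NOTICE):
    """Return the new expiry after an owner asks for the earliest allowed date."""
    earliest_allowed = now + notice
    if current_expiry is None:
        # Media with no expiry gets one at the end of the notice period.
        return earliest_allowed
    # Never push an expiry later than it already was.
    return min(current_expiry, earliest_allowed)


now = datetime(2018, 6, 5, tzinfo=timezone.utc)
far_future = shorten_expiry(datetime(2020, 1, 1, tzinfo=timezone.utc), now)
already_soon = shorten_expiry(now + timedelta(days=7), now)
no_expiry = shorten_expiry(None, now)
```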

@turt2live
Member

fwiw, the concept of "who can see this media" starts to overlap with https://github.com/matrix-org/synapse/issues/2150

@turt2live
Member

I'm not personally a fan of expiration dates on media for various reasons. Primarily, it makes backlogging hard (as some people will set insane expiration times), and searching the room's history can become a useless effort. Obviously if the user decides to erase all their media then these points still apply, however that's more of an acceptable risk to me than having a 30, 60, or whatever day expiration.

Linking events to media has the further concern of leaking metadata in encrypted rooms.

I somewhat suspect whatever gets chosen as a way to identify who can access media (https://github.com/matrix-org/synapse/issues/2150) will end up also being able to identify who owns the media. From there it's just a matter of when someone asks to be erased, the media repo runs DELETE FROM media WHERE owner = '@travis:t2l.io'
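For illustration, the erasure query above could be run as a parameterised statement. This sketch uses Python's built-in `sqlite3` with an in-memory database; the `media` table and `owner` column come from the query above, while the `mxc` column, the sample rows, and the `erase_media_for` helper are assumptions and do not reflect any real media repo schema.

```python
import sqlite3

# In-memory stand-in for the media repo's database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (mxc TEXT PRIMARY KEY, owner TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO media (mxc, owner) VALUES (?, ?)",
    [
        ("mxc://t2l.io/a", "@travis:t2l.io"),
        ("mxc://t2l.io/b", "@travis:t2l.io"),
        ("mxc://example.org/c", "@alice:example.org"),
    ],
)
conn.commit()


def erase_media_for(connection, owner):
    """Delete every media row owned by `owner` and return how many rows were removed."""
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute("DELETE FROM media WHERE owner = ?", (owner,))
    return cursor.rowcount


deleted = erase_media_for(conn, "@travis:t2l.io")
remaining = conn.execute("SELECT mxc, owner FROM media").fetchall()
```

Binding the user id as a parameter rather than interpolating it into the SQL string avoids injection through crafted user ids. The sketch only removes database rows; deleting the stored files and any remote copies is a separate problem.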

@ara4n
Member

ara4n commented Jun 7, 2018

I've written up a proposal for this at https://github.com/matrix-org/matrix-doc/issues/701


6 participants