Proposal: Only connect to writers #36
Comments
Forgive me if I am wrong, but would this not seriously hinder any attempt at establishing a supporting network of peers that help host data? Given that multi-writer support relies on one hypercore feed per writer, and hyperdb's internal implementation contains many iterations over the set of writer feeds (e.g. getting the head), increasing the number of secure sources would mean eventually hitting performance bottlenecks. I am not too aware of what the privacy issues are, though - is it primarily IP sniffing, or is it also related to bad actors poisoning the network? |
The burden would be on the owner(s) to ensure availability. If someone wants to help, the owner adds them as a writer, and then the new writer seeds.
The issue is that currently by using Dat you are exposing your IP:PORT to the network, and it is trivial for anyone else with the Dat key to see what bytes you have and/or are downloading. It's not an issue for 'private dats' but is an issue for e.g. hosting public websites on Dat. More background: https://blog.datproject.org/2016/12/12/reader-privacy-on-the-p2p-web/ Re: hypercore, I was under the impression that there was a concept of a top level set of writer keys that can be queried in a performant way. E.g. something like |
I totally get the privacy implications, but this will likely hurt the overall strength of the network and will make it harder for non-tech-savvy people to make use of the features. One of the main selling points, for me, was that the more peers were accessing the data, the more resilient the network would be and there would be less of a load on the initial sources of the data. With this, that functionality goes away by default and there's more mental burden on casual users to keep their data online. I think that adding support for some sort of mixnet into the protocol would be a better way forward for preserving IP address privacy in that it will be easier to make things "just work". |
> One of the main selling points, for me, was that the more peers were
> accessing the data, the more resilient the network would be and there would
> be less of a load on the initial sources of the data.
I'm basically arguing there's a middle ground where, compared to the web
today, we can still improve resiliency while not being regressive in terms
of privacy.
> With this, that functionality goes away by default and there's more
> mental burden on casual users to keep their data online.
I personally don't think Dat itself ensures your data will be persistently
available today; we have always envisioned the need for a persistent
server. Dat just makes it so your server can be any set of devices.
> I think that adding support for some sort of mixnet into the protocol
> would be a better way forward for preserving IP address privacy in that it
> will be easier to make things "just work".
The issue I have with this is that there are different considerations when
building an anonymization system than when building a privacy system. To
add an anonymization layer on top of Dat (such as Tor) would mean
inheriting a very complex set of tradeoffs, including severe bandwidth
constraints, which sort of defeat the purpose of Dat (at least the design I
intended).
My proposal here is intended to be more like the way HTTPS works in terms
of privacy. It's not anonymous, but you only have to reveal what you're
reading to the server operator. The other nice aspect is Dat will still be
fast and easy to use.
I also have a number of unsettling, unresolved moral issues relating to
anonymization. There have been hate groups who have looked into using Dat,
and I fear that if we hand them anonymization tools on a silver platter we
would be directly enabling their use cases. This is why I tend to view
privacy and anonymization as totally separate challenges, and I'm not sure
anonymization belongs in Dat (and it would surely take lots of novel
engineering to properly support it from the ground up while still meeting
our throughput requirements).
|
Would it be possible to make this opt-in per dat rather than the default for all dats? Like, allowing users to choose whether they want more privacy or more resiliency. You mentioned "opt in to seed", but you can't opt into seeding if you're not a trusted host since nobody would attempt to connect to you. Does that mean that only trusted hosts can opt into seeding? One of the use-cases I have is a social media platform where you seed the data for all of the people you're following, kinda like SSB. That way if you have a decent sized community, you're more likely to be online at the same time as somebody to share updates to the content. Could that still work? Also, is dat still peer to peer if you have centralized hosts for replication? If those trusted hosts are blocked by a network or overloaded, there's now no way to get access to that data. How does this interact with sharing content over MDNS? If I'm using an offline-first chat, and somebody sends me a link, I now can't get access to that data from the person that sent it unless they're a "trusted" peer, right? That would make dat a lot less useful for collaboration without internet. |
The opt in could work for the seeder or the downloader. e.g. If you did
It depends on the privacy guarantees you want to provide to your users... but SSB pubs are a good model to think about: if the pub is a writer, then everyone can just connect to the pub. If the pub isn't available for some reason, the app can ask the user if it's ok to try to connect to potentially untrustworthy sources in order to access content.
Dat would act the same way HTTPS acts today, except there would be a set of writers that would be trusted, rather than 1 host. So it would be more resilient than HTTPS but by no means would I describe dat as a censorship resistance tool (which is also a difficult problem related to anonymity).
The IP:PORT would just be a local one, so it could get signed by the writer key and would work the same as internet discovery. |
Thanks for the clarification @maxogden. I remember reading this a while ago, and was not sure if there were other concerns that this proposal was addressing too.
In terms of performance, it's not the checking of the authorised keys that I think would become a problem but potentially the use of HyperDB. @mafintosh definitely has more of an understanding of this, but my understanding is that navigating the trie structures of hyperdb will become less performant as writers increase. For example, getHeads iterates over every writer - https://github.com/mafintosh/hyperdb/blob/master/index.js#L243-L245 - and this is not the only function to do so. If you have to add a writer in order to add an authorised seeder, it becomes a limit. It would work fine for a small number of writers, but would not scale well. |
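To make the scaling concern concrete, here is a rough sketch (not the actual hyperdb code; `head()` is a made-up helper) of why resolving the current heads costs one lookup per authorized writer feed:

```js
// Simplified illustration of the O(#writers) pattern referenced above.
// `writers` stands for the list of authorized writer feeds; head() is a
// hypothetical helper returning a feed's latest node.
function getHeads (writers, cb) {
  const heads = new Array(writers.length)
  let missing = writers.length
  if (missing === 0) return cb(null, heads)
  writers.forEach(function (writer, i) {
    writer.head(function (err, node) { // one round of work per writer feed
      if (err) return cb(err)
      heads[i] = node
      if (--missing === 0) cb(null, heads) // every writer must be consulted
    })
  })
}
```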
Ahh I see what you mean. I'm not sure the number of writers/seeders is something that needs to be scaled up very high; probably supporting <100 would be fine for most use cases. I can't imagine a use case where you would need that many trusted hosts - you might as well just run in the opt-in mode at that point. Other limits would probably be hit first, such as the memory overhead of creating that many discovery-swarm instances. Implementation-wise, I would imagine one could cache the set of seeds as separate HyperDB key/values, making sure to sync the hyperdb writer keys into your seeds list when they are added or removed, but also allowing for manual management of non-writer seeds. I imagine this would avoid the potential perf issues you mention. |
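A minimal sketch of that caching idea, assuming hyperdb's put/list/authorize API; the `/seeds/` prefix and the exact node shapes are assumptions for illustration, not a defined scheme:

```js
const hyperdb = require('hyperdb')

const db = hyperdb('./my.db', { valueEncoding: 'utf-8' })

// keep preferred seeders under one prefix so a client can fetch the whole
// list with a single prefix scan instead of traversing the trie
function addSeeder (publicKeyHex, cb) {
  db.put('/seeds/' + publicKeyHex, publicKeyHex, cb)
}

// writers get authorized as usual and are mirrored into the seed list
function addWriter (publicKeyHex, cb) {
  db.authorize(Buffer.from(publicKeyHex, 'hex'), function (err) {
    if (err) return cb(err)
    addSeeder(publicKeyHex, cb)
  })
}

function listSeeders (cb) {
  db.list('/seeds/', function (err, nodes) {
    if (err) return cb(err)
    // node shape is simplified here; real hyperdb results may include
    // conflicting values that need resolving
    cb(null, nodes.map(function (node) { return node.value }))
  })
}
```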
That's really surprising to hear given the homepage mentions that Dat is geared to be peer-to-peer and you yourself retweeted something where Dat was being used to circumvent censorship by other platforms. I've been trying to sell people in my area on the p2p aspect of Dat. Are there no alternatives for improving privacy that don't limit p2p connections? |
I should probably add an 'RTs aren't endorsements' disclaimer to my personal account then. I've definitely never described Dat as a censorship resistance, anonymity, or piracy tool, or advocated for its use as such. Being p2p does not imply any of the above features. In my opinion we must strive for better privacy than the existing p2p ecosystems out there that disregard it. Part of the principles of the modern web is building privacy into protocols in the post-mass-surveillance era.
I'm not saying Dat isn't p2p or that we should get rid of p2p, just that if you ignore the privacy problems with unrestricted p2p connections you are throwing user privacy out the window. Two alternatives I have looked into are instructing users to use VPNs, or running Dat over Tor. |
I'm 100% in agreement about the privacy focus, but I believe that limiting the p2p aspect and limiting who can seed will be worse for the network in the long run. Right now there might not be that many people making use of Dat, and we're not seeing that much load. But if it's going to replace HTTP on the web, having it scale with the number of peers will make adoption smoother. I don't think that trusting writers is enough to guarantee privacy. That's the model we have now in the web, and it's really not good enough. If an adversary is monitoring who is accessing a given dat, they can analyze your traffic to see if you're connecting to one of the writers for it. It also makes it easier to DoS the writers for a given Dat to take the data down. Lastly, it means that all peers are going to be connecting to the writers rather than being mixed up in a network, so the writers now have more ability to analyze all the peers in the network. This change prevents random peers in the network from analyzing what you're reading, but it sacrifices network resilience and scalability and doesn't prevent malicious actors that know the writers' IPs (or have control of the writers) from analyzing your traffic. Even though I agree with the concerns you have about anonymity being misused by bad actors, mixnets seem to be the safest way to prevent malicious actors from analyzing your traffic and tracing users back to an IP. i2p would be a good way to go because, unlike Tor, it forces all nodes to participate in routing, which has resulted in better speeds for long-running nodes. What threat model are you concerned with when you talk about user privacy? Personally, I think that nobody should be able to trace peers accessing a dat, not even the writers. Anything less will expose vulnerable users. |
I'm specifically concerned about reader privacy. Consider if Wikipedia switches to Dat. On Dat, if a user is reading a Wikipedia article, they have no guarantee that what they are reading is private. Today with HTTPS they have to trust Wikimedia to keep their privacy safe. If someone is monitoring that user's traffic, they can only see that the user is on Wikipedia; they can't tell which article they are reading. With Dat today, the entire world can watch whatever bytes of whatever files you are reading or sharing, which we should all agree is a bad default for privacy. I'm just advocating for a safer default in Dat that works like the web works today but doesn't change what Dat is. You are advocating for a different safe default, and I respect and understand your position.
HTTPS accomplishes this as well, as long as a malicious actor is not in control of the server. For cases like the Snowden mass surveillance revelations (ISP level MITMs etc), HTTPS security protects many many people compared to HTTP. For cases where an attacker specifically targets you, or you get subpoenaed to relinquish your server logs, you can't trust the server any more and HTTPS doesn't protect you from them.
I have not personally benchmarked i2p implementations in node and also have not seen anyone else do so, but the only numbers I can find on this are that i2p maxes out at 200kb/s per socket. Much faster than the average Tor socket, yes. But this would still mean making Dat about 150 times slower per socket than the UTP/TCP sockets we use today (30MB/s).
This means Dat would become an anonymity and censorship resistance tool in addition to a peer to peer filesystem. I am not opposed to all of these things being supported by Dat. The question up for debate seems to be what functionality gets turned on by default, as they all have significant tradeoffs.
On an engineering level, I don't think it would be possible to bolt on an anonymity transport layer to Dat and make it anywhere close to fast without lots of effort. For example Tor relies on chained TCP sockets that hop through their routed network to an exit node. To have p2p connections work for >50% of users you need UDP for hole punching (TCP simultaneous open has about 1/3rd the success rate of UDP hole punching last time I checked). So we can't even attempt to use the Tor protocol over our own hybrid p2p network to increase speed, because it has no message-based network API. So IMO running Dat over Tor isn't really peer to peer, because all of your connections between potentially fast hosts have to get routed through Tor TCP chains to exit nodes, so nearly all of the bandwidth advantages you get from direct p2p connections go away. I'd argue this would be more of a degradation in user experience for most people than restricting downloads to writers only. And having users opt in to being exit nodes puts them at even more risk than running Dat or BitTorrent on a 'naked' connection today. I believe i2p is similarly high level to Tor, making it impossible to do hole punching.
For me, the goal of Dat has always been to take the web we have today and 1) put users in control of their data, allowing them to sync their data offline or to other places on the web, which directly combats the vendor lock-in we have today with fb/twitter/google etc, 2) make bandwidth cheaper by allowing for more distributed network topologies for distributing content than what we have today, with everyone paying by the MB to CDNs and google/amazon cloud to host their data to millions of users from a handful of data centers, when many of those members of the network might have fiber upload, and 3) use content addressability and signing on web content to improve the state of web archiving and permanence.
If we can achieve those goals without degrading the privacy of today's HTTPS web I think that would be a huge upgrade to what we have now. But I guess I am dubious that tunneling all Dat connections over an onion routing network will be able to achieve no. 2 above, as it seems to inherently throw away the huge peer to peer bandwidth advantages. There is also the separate issue of the 'silver platter' above, which weighs heavily on my conscience in light of the recent political climate. As an aside, I really appreciate the discussion so far from all the participants in this thread. |
My 2 cents --
I think there's general consensus that reader privacy is important. The debate is about the mechanisms and their tradeoffs. |
Related, I found this really great paper from TU Delft (maybe @mafintosh knows the researchers...) that discusses an approach at reimplementing Tor over UDP for higher bandwidth connections. The takeaway for me is that to make a Tor/Dat hybrid fast we would need to not only reimplement Tor from the ground up to be message-based like they did (and then get it audited etc), but we'd also need some sort of reputation system like the BarterCast thing they mention to ensure attackers can't sybil with slow nodes to kill throughput.
We have talked about these things before during Dat's development and it's a very, very hard problem to balance the advantages of gossiping metadata for better peer routing against the latency and bandwidth costs of the gossip protocol itself. |
@pfrazee Just to clarify the use case here, would it be accurate to liken this approach to a VPN? E.g. you have a publicly accessible server somewhere that you tunnel your traffic through, thereby obscuring your IP and making your connections always work? If so, with this approach, in order to protect their reader privacy, a user would have to acquire a proxy and configure their Dat client to use it, so I'd classify it as an "opt-in" mechanism. |
What do you think about allowing dats to opt into whitelists but keep them open by default? A lot of users' data isn't Wikipedia, and casual users are likely going to have dats for stuff like their fritter profile which they won't have backed up on a cloud provider. Also, I don't think that "write access" is the best way to signify that a peer can be trusted to replicate data, since I wouldn't want to give something like hashbase write access despite trusting it to seed my data. It's been discussed before that there should be a "web of trust" for dats. Maybe on top of writers, a dat could have a public key used for "signing" ids of peers that are allowed to seed. That way, you could opt into only allowing your dat to be seeded by trusted parties by adding the public key and making sure that the trusted parties have some sort of token signed by the key. It doesn't mean they can write, necessarily, but it means they can be trusted for hosting the data. I also like the idea of using hypercore-proxy as a sort of VPN for users to hide their origins from sources. |
This is what I meant by the "manual management of non-writer seeds" discussion above. IMO "web of trust" is a great way to describe this functionality. You as the original author of the Dat are constructing a set of trusted nodes. If users trust you, they trust who you trust. When you mark a set of keys as "preferred seeders", clients are taking your word that that set of preferred seeders will respect their privacy. A client can choose to disregard your preferred seeders and venture beyond to anyone else who has a copy. But to get good privacy in the network, IMO, there needs to be 1) general adoption of this "preferred seeders" whitelist option by dat creators, making it easy to use and understand, and 2) a default in the clients that opts the user in to using it. Maybe a middle ground we could start with would be: If a Dat author enables "privacy" mode, then clients respect it by only connecting to the seed list that the author specifies. But if a Dat author does nothing, it continues to work like it does today, with no public privacy. This is similar to the "HTTPS everywhere" debate that's been happening. Beaker could even mirror HTTPS privacy policies such as requiring "privacy" mode to allow JS access to the webcam or other privacy-sensitive APIs in web apps.
I also want to voice my support for this feature, but also recognize that despite what VPN companies say about security and privacy, it is an incredibly shady industry, and also costs $ and time to set up (meaning only savvy users will use it). Edit: sorry, meant to quote this too:
I think that could work to start, given that if the author opts in, it opts the user in as well. The user can still explicitly opt out if they want. |
Re @maxogden
To me (as a lurker on this repo up to now -- hi everyone 👋), restricting sources to only the writers of a Dat kind of sounds like a large trade-off at the expense of point 2). How will bandwidth get cheaper for a publisher if their peers are going to be the main data sources for the majority of users? Will this not drive publishers to solutions similar to the situation today, with large data centers that are necessary to handle the load? (Inadvertently, at the same time this could compromise integrity as @RangerMauve noted, since the publisher would then need to give write access to the hoster, as I understand it.) PS: I also find this particularly interesting. It's one of the major problems. |
@maxogden that's accurate. Hypercore proxies are somewhat similar to a peer whitelist, except that when your proxy doesn't already have a dat, you can command it to go get it. About the proposal: I think what you're suggesting is that we reduce the seeder-set by having the owner authorize their seeders with a signed whitelist. (Based on my quick skim of the proposal) the owner could authorize multiple seeders without giving away control of the site. It's not so much that "writers" are seeding; rather, it's peers that are appointed by the writer. My take:
(Reading your most recent response now --) I don't see any harm in adding the ability for the DHT to have "preferred seeders", and then a client could choose to limit their connections to those peers. Any time I'm unsure if a solution is right, I prefer to use an approach that's easy to discard if it fails. |
So what about the following knobs: For creators
For consumers
|
@RangerMauve I think for this proposal to work, it has to be constrained to the discovery/DHT. So, the whitelist would be exposed by merit of having signed peers in the dht. |
The idea of using 'writers' is to automate this task. Rather than requiring separate management of seeders and writers, I figure we can adopt the default of simply combining them, but allowing separate management if desired. If it's a default, it's more likely to be used (e.g. it's the security mantra of: if it's not on by default, nobody will use it).
This is a general problem we need to solve anyway (like how keybase does multi device management)
This proposal is designed to sit between two extremes. One extreme is HTTPS today where you have 1 authority, usually one server (or more if the SSL cert holder load balances to other hosts, but that's hard to set up and usually it's one owner providing a service, and a different delegated trust model). The other extreme is unrestricted P2P, where anyone can upload, but which suffers from the privacy issues above. This proposal means rather than 1 host, you can curate a distributed web of trust to share the load. It's a kind of distributed load balancer. So it still offers significant horizontal scalability and bandwidth commodification over single-host hosting. And it can still "fall back" to unrestricted p2p if users opt out of privacy.
I don't think we can leak any metadata either: even if you are just downloading 1 file, you only need to get the metadata corresponding to that file, so the metadata request leaks your reader privacy as well.
In addition to this, would it work for everyone to automatically copy any writers to this list as well? Or are there specific objections to that? |
@pfrazee So, instead of using the usual …? How would a DHT-based approach work with MDNS / DNS-discovery in general? IMO, having it be part of hyperdrive / hyperdb would make it easier to understand for users since it'd be like "authorizing" for a write, but with fewer privileges. |
Clarification: The only metadata I think you should be able to get from untrusted sources is the list of writer keys. Also, I would be OK with an opt-in option for dat creators that they have to turn on for the repository to run in 'public privacy' mode. But once on, it automatically adds new writers to the preferred seeds list. |
I was under the impression that peers always downloaded the full metadata hypercore in order to get the latest changes (regardless of sparse mode) and that hyperdb authorization was part of its metadata
I'm 100% behind that. I don't see why someone would have write access but without the ability to seed. |
Yeah that's sensible, I just wanted to surface that fact.
True, but in this case you're exacerbating the problem because you lose the option for a lost-key archive to persist in a readonly form.
Fair enough!
@RangerMauve I'm 99% sure we're going to move away from the mainline DHT permanently and create our own so that we can add features and fix issues (like the key length truncation). That said, I'm not well-versed in the details of those implementations.
It's actually possible to download the metadata in sparse mode and use pointers within the metadata to download only what's needed. But either way, with reader privacy, you wouldn't want to download any metadata at all prior to choosing the peers you wish to communicate with. |
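For reference, a tiny sketch of hypercore's sparse mode, where blocks are fetched only on demand; `datKey` and the block index are placeholders, and actually downloading anything of course requires replicating with a peer:

```js
const hypercore = require('hypercore')

const datKey = Buffer.alloc(32) // placeholder: the dat's public key

// with sparse: true, nothing is downloaded until a block is requested
const metadata = hypercore('./metadata', datKey, { sparse: true })

metadata.ready(function () {
  // asking for block 42 pulls just that block (plus the hash-tree nodes
  // needed to verify it) from whichever peer we are replicating with
  metadata.get(42, function (err, block) {
    if (err) throw err
    console.log('metadata block 42:', block)
  })
})
```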
@mafintosh and I agreed to this like a year or two ago but have not gotten around to it. There are indeed a number of issues with the mainline dht that deserve more discussion in another thread. Fun fact, 3 years ago now we both flew down to California to visit Juan from IPFS specifically to try to use their DHT, and could never find a way to integrate it, so we added bittorrent-dht instead. We really didn't ever want to use Mainline DHT but it was the only thing we could get working, and has been in there ever since.
Yes I think this will become the default in the future, especially for name resolution/DNS-like use cases like NPM on Dat.
If I'm understanding correctly, couldn't you just "peg" your client version to an older trusted version if you disagree with the writes one of the other keyholders has made? |
Yes we could switch the discovery-channel API pretty easily to only allow announcing buffers, and switch the underlying mechanism to BEP44 for bittorrent-dht. MDNS and dns-discovery already support buffers. |
How do you know which metadata to download in order to know which peers are authorized? At the moment, adding a writer to hyperdb appends a block at the end. Though you could find all the other feed IDs from that. Alternatively, what data are you publishing on the DHT that can be trusted to have been created by a writer of the archive? Something like <id, the id signed by key holder, ip, port>? It would be impossible to detect whether a whitelist should be used or not if there's no additional metadata being downloaded somewhere. |
Maybe the DHT could hold |
Above I describe a possible implementation where you store preferred peers as a separate set of hyperdb keys, making it easy to just grab those keys without doing complicated traversals.
In the original post I suggest having a separate swarm for every key. So the original swarm for writer 1 gets created as usual. Then you request the set of preferred seeds and create a swarm for each one. If you get none back you would not use the "privacy" mode. So I think the discovery payload just needs to be |
Sorry for not replying! Went into weekend mode shortly after you posted that. 😅
I think pfrazee was really into the idea of keeping this information at the discovery-swarm level in order to avoid even connecting to untrusted peers in the first place. Having it be part of the protocol will avoid revealing metadata about which data you're looking at, but it will reveal that you are looking at the dat. I think it will also complicate the protocol somewhat if there's a set of keys that are separate from the rest of the hyperdb.
I'm not sure what the benefit is in having separate swarms here. Anybody could start announcing on a given swarm key and you wouldn't know if they're legitimate or not until you connected to them. Plus, this would add a lot of overhead per dat. I think that the DHT approach wouldn't require too many changes and would also scale for different data formats without changing the replication protocol. The flow I'm thinking is:
Some problems with this approach that I can think of right away:
Edit: Also, I was talking to @mafintosh about having a fully encrypted DHT where peer IDs used "hash cash" on top of their public keys in order to make it expensive to generate IDs to participate in the DHT so that sybil attacks would take a lot more energy, and making sure communication is encrypted. Twitter thread about it |
@RangerMauve @pfrazee so just to summarize, there are two approaches proposed so far:
Did I get that right? If so, seems like putting it all into the DHT protocol is more risky, because there is no "optional extension" mechanism there, it's just a buffer, so we'd have to support backwards compat on that protobuf if we ever change the scheme. Using a minimal signed payload in the DHT and then using hypercore protocol extension for the rest of the scheme seems like a better way to have something we can discard if it fails. |
@mafintosh Totally agree that modifying the DHT will be a lot of effort. The reason I'm more into it, though, is that you can hide IP addresses more easily since you'd need to perform sybil attacks on the DHT near the discovery key to find the IPs rather than just announcing that you have the data. Some questions:
|
I'm assuming the trust model is that you trust all of the preferred peers (same trust model as trusting all writers, which is why writers should automatically become preferred peers). So as long as an existing writer signed the data you receive, you can trust it.
It's OK to know an IP is accessing a Dat swarm, it's just not OK to know what parts of it they are uploading or downloading.
Edit:
Querying lots of channels should be OK; the only real overhead, since it's all stateless, is in the JS object memory footprint area, which is very optimizable. I misspoke last week when I mentioned multiple swarms, I meant to say multiple channels. |
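A sketch of the "one discovery channel per trusted key" idea using discovery-swarm's join/connection API; the `seederChannel()` derivation is a made-up placeholder, not a defined part of the protocol:

```js
const swarm = require('discovery-swarm')
const crypto = require('crypto')

const sw = swarm()
sw.listen(3282)

// hypothetical: derive a per-seeder channel name so you only announce and
// look up alongside the trusted seeder set
function seederChannel (seederKeyHex) {
  return crypto.createHash('sha256')
    .update('dat-seeder:' + seederKeyHex)
    .digest()
}

function joinTrustedSeeders (seederKeys) {
  for (const key of seederKeys) sw.join(seederChannel(key))
}

sw.on('connection', function (socket, info) {
  // only peers announcing on one of the trusted channels show up here
})
```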
Cool, so then the empty peer list message should be signed by the creator and peers should make sure to save it for when they handshake? I guess each new writer/seeder will need to have a signed message of a previous writer and the entire chain will need to be transferred. I get what you mean about the IP security. Sybil attacks are pretty annoying. |
Yea I was thinking if it's stored in hyperdb then that ensures it's saved, and that would also provide a mechanism to replicate those keys and sign them (and verify the signatures).
A good @mafintosh question, I believe he told me once an existing writer can add a new writer and then a new random peer would have a mechanism to check if what they received from someone was signed by one of the writers. Maybe we could piggy back on that (not sure about implementation there).
Agreed, sounds like you have lots of cool ideas for DHT improvements, maybe a new thread to discuss improving the DHT and anonymizing IPs etc would be worthwhile. |
Will it be a separate hyperdb from the main one? Or will it be under a prefix?
The way adding new writers works, AFAIK, is that an existing writer adds another writer's key to the … This means that getting the list of writers requires getting the latest block from each writer's feed. I suppose the flow could look like:
Originally I thought this would be problematic if the remote peer attempted to hide blocks from you that contained information about writers and seeders, but if they do that they will be setting the upper limit on which blocks the user will bother fetching. Regardless, I think the flag for setting the dat as "requiring trusted seeders" should be set in the first few blocks of the feed. Preferably in the header. The question now is, should we limit only hyperdb-based data structures to have the ability for additional privacy, or would it also be useful to have something for raw hypercores? |
Was thinking easiest way would just be a prefixed set of keys
I was thinking rather than a flag you'd use a new protocol message like
First you'd need to add the key as a writer, then copy it to your seeders array. I think above we agreed writers get copied in, but the seeders list should be separately managed so that you can have a non-writer seeder etc.
This is where you'd send a 'WANT-PEERS' message, and based on the response you'd know the latest seeders and that implies the answer to whether the dat requires trusted peers. I imagine in the future some of those steps can be removed if writers gets cached. I also imagine @mafintosh would come up with a better mechanism to store the list of seeders and ensure their integrity is checked. But it sounds like we're in general agreement about the API. Maybe we need another thread with a more concrete proposal now. |
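To make the message exchange concrete, here is a hedged sketch of what a WANT-PEERS / PEERS pair could carry, expressed with the protocol-buffers module; the message and field names are assumptions for discussion, not part of any DEP:

```js
const protobuf = require('protocol-buffers')

const messages = protobuf(`
  message WantPeers {
    // intentionally empty: just asks the remote for the preferred-seeder list
  }
  message Peers {
    repeated bytes publicKeys = 1; // preferred seeder/writer public keys
    required bytes signature  = 2; // signature by an existing writer over the list
  }
`)

// encode/decode like any other protocol-buffers message
const wire = messages.Peers.encode({ publicKeys: [], signature: Buffer.alloc(64) })
const decoded = messages.Peers.decode(wire)
```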
In echo of this conversation (specifically with regards to GDPR), I was hoping to use DAT for a project to distribute academic research/materials (also intended to be offline accessible/searchable with a local sqlite db) for an institution based in Germany. We ran into some legal issues where, with regards to GDPR, if we were to utilize DAT, we would be unable to explicitly provide identity or assign responsibility to other peers in the swarm who would have access to users' IP addresses. Having read the post from 2016 at https://blog.datproject.org/2016/12/12/reader-privacy-on-the-p2p-web/ it appears that there would need to be a GDPR-approved registry of trusted peers to allow for something like this, but am I correct in understanding that a lot of this is not yet possible and would also still require a number of pieces to fall into place, such as client support or an update to the protocol? It's a shame because, clearly as is frequently mentioned, P2P trades off to some extent with privacy, but GDPR has made it almost impossible to utilize DAT for a project it is so well suited for, and we are likely forced back to a centralized server or group of servers to host this data. |
@alancwoo We're working on (optional) authenticated connections in dat which would make it possible to whitelist who is allowed to connect and replicate the data. I'd expect it to land sometime during 2020. |
You can do this today. If you are building a custom tool you could already pass in a whitelisted group of peers through discovery-swarm's options (…). If you want to use the command-line tool, there is some low-hanging fruit here to allow it to accept some options for whitelisting peers; a related issue is here: dat-ecosystem/dat#1082... PRs totally welcome |
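A rough sketch of the "custom tool" route, assuming discovery-swarm exposes a whitelist-style option for restricting which peer addresses it will connect to (verify the exact option name against the discovery-swarm docs for the version you use):

```js
const swarm = require('discovery-swarm')

const sw = swarm({
  // assumed option: only connect to peers whose address is in this list
  whitelist: ['203.0.113.10', '203.0.113.11']
})

sw.listen(3282)
sw.join('my-dat-discovery-key') // placeholder channel name
sw.on('connection', function (socket, info) {
  console.log('connected to a whitelisted peer')
})
```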
Can you point to the section of the GDPR you are referring to? Generally speaking, in DAT there is no private data being communicated. Technically speaking: the clients join a DHT network which is - much like a router - a public network. In this public network the clients send a public request for data and entries in the network that have this data forward it. There is - however - no private request or data processing happening. I would be really curious - in detail - what you are referring to. |
@martinheidegger the person clearly stated that ip addresses are the issue |
I would be grateful for more references. We have to deal with GDPR as well and we have not found any issues regarding the IP addresses. Maybe we are overlooking something but I can't find the context in which the use of IP addresses in DAT would violate the GDPR. |
Within GDPR, the EU includes IP addresses as “Personal Identifiable Information” potentially subject to privacy laws. There are a number of articles about this. |
@alancwoo mentions:
That seems to mean that private, personally identifiable data is leaked or distributed to unknown third parties. The GDPR does cover the collection of IP addresses as private data, but I don't think this applies here. In any router software, IPs and their packets need to be stored and passed on through packets to other routers as part of the protocol, without the ability to look inside the packet content (given it's HTTPS). That is pretty much how DAT works if I am not mistaken. If data were exclusively shared between two parties, DAT would just need to encrypt it, and any router in-between would not see its content. Which is what ciphercore does. Another angle I could see is that IPs are used to track what person has interest in which resource, which - if stored - would give a means of tracking insights. But as @alancwoo mentions this is covered by law (you may not do that). I am very interested in that subject and I would really like to have a reference to the actual issue. |
@martinheidegger our legal advisor mentioned:
So I think the issue is that GDPR regulates IPs as private data and thus any entity who is able to capture this information needs to be made legally identifiable to the user. As this is, from what I can imagine, just the nature of a peer network, the legal advisor mentioned the possibility of whitelisting identified/trusted peers, and perhaps running our own discovery server, but then I feel the issue would still remain: any peers on the network at the same time could expose IPs to one another and thus be in violation of GDPR. I think the advisor's final solution is to protect the read key behind a registration form that forces the users to agree to a Privacy Policy/ToS and provide identification, so that people who technically have the read key have fulfilled this identification requirement, but I still wonder, because people can still pass the key around and so on? |
Following up from some twitter discussion a couple weeks back. I propose changing the default discovery algorithm in three ways in order to improve default privacy:
In other words, turn off p2p mode by default, except for a set of 'trusted hosts'. To simplify things, I propose we define the 'trusted hosts' as any writer. This is a simple default that can be overridden by settings (e.g. to specify a set of trusted non-writer hosts).
The way I envision discovery working in this new scheme is something like:
This changes the privacy expectation to match HTTPS: Users need only trust the 'owner' of the content they are requesting to keep their server logs secure. The key difference is that instead of one DNS record being considered the owner, the entire set of Dat writers (and their corresponding IP:PORT pairs) is considered trustworthy.
Again, this is just a proposed default for any Dat client. An option to run in 'unrestricted p2p mode' is easily added.
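As a minimal sketch of that default (all names here are assumptions, not an API), a client would only dial peers whose discovery records are signed by a trusted key, unless the user explicitly opts back into unrestricted p2p:

```js
// dat.writerKeys, candidate.signedBy and opts.trustedHosts are hypothetical
// names used only to illustrate the decision being proposed.
function shouldConnect (candidate, dat, opts = {}) {
  if (opts.unrestrictedP2P) return true // explicit opt-out of the private default
  // trusted hosts default to the dat's writers, plus any configured extras
  const trusted = new Set(dat.writerKeys.concat(opts.trustedHosts || []))
  return Boolean(candidate.signedBy) && trusted.has(candidate.signedBy)
}
```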
This would probably be a breaking change, since there could be existing dat schemes out there that rely on non-writers re-seeding content.
DEP wise, there would need to be a mechanism added to sign DHT payloads.
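One possible shape for that mechanism, sketched with bittorrent-dht's BEP44 mutable put and sodium-native signatures; the payload layout (host/port JSON) is an assumption for illustration, not a proposed wire format:

```js
const DHT = require('bittorrent-dht')
const sodium = require('sodium-native')

// bittorrent-dht needs a verify function to accept mutable (signed) items
const dht = new DHT({ verify: sodium.crypto_sign_verify_detached })

// generate an ed25519 key pair standing in for a writer's key
const publicKey = Buffer.alloc(sodium.crypto_sign_PUBLICKEYBYTES)
const secretKey = Buffer.alloc(sodium.crypto_sign_SECRETKEYBYTES)
sodium.crypto_sign_keypair(publicKey, secretKey)

function announceSigned (host, port, cb) {
  const value = Buffer.from(JSON.stringify({ host, port }))
  dht.put({
    k: publicKey, // the writer's public key readers already trust
    v: value,     // the ip:port payload they can verify against it
    seq: Math.floor(Date.now() / 1000),
    sign: function (buf) {
      // detached ed25519 signature over the DHT's encoding of the record
      const sig = Buffer.alloc(sodium.crypto_sign_BYTES)
      sodium.crypto_sign_detached(sig, buf, secretKey)
      return sig
    }
  }, cb)
}
```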