This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

fast and simple IPFS pubsub on the application layer #108

Open
hackergrrl opened this issue Feb 5, 2016 · 19 comments

@hackergrrl

proposal

integrating pubsub into IPFS core is going to take time and effort to do it properly. this proposes a small CLI tool (or, trivially, a streaming module) that enables decentralized (but not distributed) pubsub on the application layer.

goals

this prioritizes ease of integration (via a simple api) and speed above all else.

this is for any IPFS application that wishes /ipns were fast enough for realtime use (that is: getting the HEAD of a frequently updating merkle dag). I wrote this with applications like orbit vaguely in mind.

while this isn't a fully distributed solution, it has a number of benefits over an ad-hoc central server approach.

CLI: publisher usage

$ my-periodic-data-generator | pubsub -p [topic] [publisher-addr [publisher-addr [...]]]
/ipfs/Qmfoobar/topic

the output is a single multiaddress that can be considered a subscription token.
the publisher shares this topic token as widely as they wish

the command expects an input stream of messages to be published to subscribers

CLI: subscriber usage

$ pubsub -s /ipfs/Qmfoobar/topic
/ipfs/quuxbax1
/ipfs/quuxbax2
/ipfs/quuxbax3
/ipfs/quuxbax4
...

the subscriber receives published data as /ipfs addresses, which are output as a
stream of newline-delimited addresses. subscribers can then resolve each one to
see the data that was published

an optional optimization for higher-trust environments could be to permit the
publisher to not only include the above /ipfs address but also inline the
content. subscribers could then verify the data asynchronously

network topology

publishers form a fully connected graph with each other

subscribers are each connected to one of the publishers (uniform dist)

this topology is not strictly necessary, since no connections are "privileged"
-- they all just replicate the topic dag quickly to 'subscribers'

^ said differently: anyone can be a publisher in the sense that they disseminate
the topic dag to others, but only publishers listed in the topic metadata can
add to the topic dag

subscription token/key/identifier format

a topic is defined by a multiaddress (probably /ipfs or /ipns). it resolves to a
list of newline-delimited multiaddresses of peers

/ip4/104.131.131.82/tcp/4001/ipfs/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ
/ip4/162.243.248.213/tcp/4001/ipfs/QmSoLueR4xBeUbY9WZ9xGUUxunbKWcrNFTDAadQJmocnWm

if stored on /ipfs, it forms an immutable contract of publishers for that topic

if stored on /ipns, it forms a mutable contract of publishers

since each publisher's public key is listed, it means all published messages
can be verified by subscribers

message publishing

format

this should be fully free-form (anything the publisher wants): newline delimited
text, json, protocol buffers, et cetera. let's not force e.g. IPLD on people who
just want p2p pubsub and IPFS happens to be the underlying transport

this document proposes a JSON structure to be used that enables a "topic merkle
dag" -- it permits linking to the previous heads of the topic dag, giving
integrity benefits:

json structure

something IPLD-esque, e.g.

{
  "prevHeads": [
    "/ipfs/Qmprevhash",
    "/ipfs/Qmotherhash"
  ],
  "data": "<some textual, json, or base64 binary data>"
}

where prevHeads is a list of the heads of the current topic dag. ('head'
refers to each node in the dag that has no links pointing to it)
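To make the 'head' definition concrete, here is a minimal Python sketch (the function name and data shapes are illustrative, not part of the proposal) of how a node could maintain the head set as messages arrive:

```python
# Illustrative sketch: tracking the head set of a topic dag. A new message's
# "prevHeads" entries gain an incoming link and so stop being heads; the new
# message itself has no links pointing to it, so it becomes a head.

def apply_message(heads, msg_hash, prev_heads):
    """Update the set of dag heads after a message is published.

    heads: set of /ipfs addresses with no links pointing to them
    msg_hash: address of the newly published message
    prev_heads: the new message's "prevHeads" list
    """
    heads.difference_update(prev_heads)  # these are now linked-to, no longer heads
    heads.add(msg_hash)                  # the new message is the latest head
    return heads

heads = {"/ipfs/Qmprevhash", "/ipfs/Qmotherhash"}
apply_message(heads, "/ipfs/Qmnewhash", ["/ipfs/Qmprevhash", "/ipfs/Qmotherhash"])
# heads is now {"/ipfs/Qmnewhash"}
```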

summary: nice properties

  • simple, easy unix-style commands with a trivial api that enables pubsub
  • decentralized (not distributed)
  • having a running daemon is an optimization, but not a requirement (gateway
    fallback, message inlining)
  • all data and metadata is stored in ipfs
  • the set of publishers can be chosen to be mutable or immutable by the topic creator
  • unique namespace of topics
  • publisher public keys are part of the topic metadata, so all published
    messages can be verified
  • flexible network topologies are possible using pure publishers,
    store-and-forward subscribers, and pure subscribers

questions

  • how will subscribers connect to publishers (NAT, etc)?
    • can we reuse the existing go-ipfs NAT holepunching logic? maybe even reuse
      existing peer connections if we're already connected to a publisher?
  • could this work on the web too?
    • it could: connect to publishers via websockets or webrtc and use a (or
      the) public ipfs gateway for resolving messages
@jbenet
Member

jbenet commented Feb 5, 2016

Tons of this SGTM! 👍 -- agree on pushing for getting something that just works now, and moving to generalizing / scalability / doing it right later.

it would be nice to be able to expose this service through ipfs itself. we had a proposal for pluggable transports and protocols earlier, where you can register a protocol handler with an ipfs node, and get a pipe of all the data coming through. this may be a good way to leverage existing connections and so on. but maybe it's a P2 need.

@eelcocramer

👍

@hackergrrl
Author

@jbenet thanks for the comments! (are you referring to this example?)

This was referenced Feb 6, 2016
@haadcode
Member

haadcode commented Feb 9, 2016

Great proposal to get things moving with pubsub! It's a highly anticipated feature and I'm always for getting something working as soon as possible.

Some thoughts and questions:

I agree with @jbenet that the command should be exposed in ipfs, i.e. ipfs pubsub -p <topic> and ipfs pubsub -s <topic>. The sooner we have it as part of ipfs, the easier it becomes to mock its usage on the application level as well as expose it via bindings (js-ipfs-api)

publishers form a fully connected graph with each other
subscribers are each connected to one of the publishers (uniform dist)

In a chat (orbit), everyone's a publisher AND a subscriber within a chat room (topic). With the proposed solution, would it work so that only some members (of a chat room) are publishers while the rest are subscribers? How, on the application level, should a developer decide who becomes a publisher and who remains a subscriber? Or would it work so that there are dedicated nodes in the network that form the "publishing" service and all chat clients are subscribers? Either way, there needs to be a way for a large number of peers to be able to publish to a topic, regardless of the network topology and roles. Perhaps a subscriber (in terms of the role in the network) can send a message to one of the publishers, who then publishes the message on behalf of the subscriber?

subscription token/key/identifier format

How would the "list of publisher addresses" be handled if it's an immutable ipfs hash? What happens, in terms of the list, when a new publisher joins or leaves the network? If it's saved in IPNS, in addition to perf problems, we're limited to one node handling the updates to the list (per-node pubkey publishing in ipns). Can we use the keystore feature to distribute the IPNS keys so that many peers can publish to the same IPNS pubkey? I suppose my question here is: who keeps track of the publishers per topic?

this document proposes a JSON structure to be used that enables a "topic merkle
dag" -- it permits linking to the previous heads of the topic dag, giving
integrity benefits:

We can do this already:
ipfs object put <payload> --> QmFoo
ipfs object patch add-link QmFoo previousHead QmPreviousHead --> QmNewHead

{
  Links: [
    { "Name": "previousHead", "Hash": "QmPreviousHead" }
  ],
  Data: "QmFoo"
}

As an additional feature, and as a performance optimization, we can add "skip lists" to the Links which point to the previousHead at 10 (heads ago), 20, 50, 100, etc. so that each topic intrinsically contains the full history of that topic. This feature is not usually present in traditional pubsub systems but would allow us to take the pubsub feature closer to a message queue / event log (e.g. Kafka).

where prevHeads is a list of the heads of the current topic dag. ('head'
refers to each node in the dag that has no links pointing to it)

I don't understand this part. You intend to have multiple heads per topic? Shouldn't there be just one?

@haadcode
Member

haadcode commented Feb 9, 2016

cc @whyrusleeping you mentioned you had some ideas for pubsub?

@haadcode
Member

haadcode commented Feb 9, 2016

Earlier discussion on pubsub for the reference: #64

@hackergrrl
Author

Thanks for the comments @haadcode!

I agree with @jbenet that the command should be exposed in ipfs, ie. ipfs pubsub -p and ipfs pubsub -s

I disagree. The ipfs command is already very large and does a wide array of things. Further bloating a single tool is undesirable compared to building other individual, specialized tools that do one thing well. We need to have a lower divide between "privileged core ipfs tooling" and "the userland community of ipfs tools": pubsub is not an inherent piece of core IPFS (yet, at least), so I think it's best suited for its own tool.

publishers form a fully connected graph with each other; subscribers are each connected to one of the publishers (uniform dist)

In a chat (orbit), everyone's a publisher AND a subscriber within a chat room (topic).

This approach isn't going to help you much, then. 😿 The use case I had in mind was a small set of privileged publishers providing to a larger set of unprivileged subscribers. The sort of case where individuals or small orgs control their own publish channel for authoritative data over the head of their data dag on the IPFS network.

Either way, there needs to be a way to have a large number of peers be able publish to a topic, regardless of the network topology and roles. Perhaps a subscriber (in terms of the role in the network) can send a message to one of the publishers who then publishes the message on behalf of the subscriber?

That could work! Individual publishers could choose peers they trust, and permit them to use their own privileged publisher node as a relay. This could actually work as an orthogonal layer on top of this model.

However, it may be worth reconsidering this problem again for your use case specifically. Making a list of all the pubsub use cases we can think of could help us model the solution better, too.

subscription token/key/identifier format

We can do this already: [very reasonable examples]

You're right; scratch that.

As an additional feature, and as a performance optimization, we can add "skip lists" to the Links which point to the previousHead at 10 (heads ago), 20, 50, 100, etc. so that each topic intrinsically contains the full history of that topic. This feature is not usually present in traditional pubsub systems but would allow us to take the pubsub feature closer to a message queue / event log (e.g. Kafka).

Humour me: what does this optimize vs each message storing its previous heads? Traversal?

where prevHeads is a list of the heads of the current topic dag. ('head' refers to each node in the dag that has no links pointing to it)

I don't understand this part. You intend to have multiple heads per topic? Shouldn't there be just one?

In an append-only log CRDT you can always have forks, where e.g. two publishers publish a new head, but both point at the same old head. These forks are acceptable, but new publishers should link to all of the previous heads to indicate causality (i.e. that THIS head came temporally after ALL of these other heads).

@haadcode
Member

I disagree. The ipfs command is already very large and does a wide array of things. Further bloating a single tool is undesirable compared to building other individual, specialized tools that do one thing well. We need to have a lower divide between "privileged core ipfs tooling" and "the userland community of ipfs tools": pubsub is not an inherent piece of core IPFS (yet, at least), so I think it's best suited for its own tool.

Fair points. My strongest reason to put pubsub into ipfs would be to have a quick integration with all the existing bindings. If it's a separate binary, we'll have to modify the binding libs heavily (or write separate bindings for pubsub) to make it work whereas if it was in ipfs we'd only need to plug in the commands.

This approach isn't going to help you much, then. 😿 The use case I had in mind was a small set of privileged publishers providing to a larger set of unprivileged subscribers. The sort of case where individuals or small orgs control their own publish channel for authoritative data over the head of their data dag on the IPFS network.

Could your solution still work if it was indeed a set of "privileged" nodes acting as publishers in the network and subscribers can communicate to them to publish? A service within a network. This would still be an acceptable solution to get started, imo, it's decentralized at least. With the upcoming private networks, small orgs can set it up within their network by providing a publisher node, and in the public ipfs network anyone can provide a publisher node. What do you think?

Humour me: what does this optimize vs each message storing its previous heads? Traversal?

Yes, purely a traversal optimization. But a helpful one where you need to, say, "warm up the cache" with the full history of a topic. With skip lists you can parallelize the traversal. Each new head would still point to the immediate next head, these would be additional pointers.

where prevHeads is a list of the heads of the current topic dag. ('head' refers to each node in the dag that has no links pointing to it)

I don't understand this part. You intend to have multiple heads per topic? Shouldn't there be just one?

In an append-only log CRDT you can always have forks, where e.g. two publishers publish a new head, but both point at the same old head. These forks are acceptable, but new publishers should link to all of the previous heads to indicate causality (i.e. that THIS head came temporally after ALL of these other heads).

I see. So in case there are two new heads A and B, both pointing to an old head C, what does the next new head, D, look like? D points to A and B? How can the publisher know that it needs to link to both of them instead of just "the most recent" (say A was posted before B)? Can A and B ever point to different old heads?

@Kubuxu
Member

Kubuxu commented Feb 10, 2016

The list structure with only one link to the previous entry is very inefficient when it comes to fetching the past. If there is only one link back, it means that we need to wait for the resolution of one entry, read it, wait for the resolution of the second, read it, and so on. It would be much better, for a node that wants to know the history (not every node needs to know the entire past), to include exponential links to the past (2, 4, 8, 16, ... entries before).

This means that a lookup for the Nth node in the past is no longer O(N) but in the worst case O(log N) and in the best case O(1). This also allows fetching a big chunk of history concurrently, not sequentially as in the case of storing only a link to the previous entry.
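A rough Python sketch of the idea (purely illustrative, not go-ipfs code): if each entry carries back-links at exponentially growing distances, walking to the Nth predecessor takes a number of hops proportional to log N rather than N.

```python
# Illustrative sketch of exponential back-links: entry i links to entries
# i-1, i-2, i-4, i-8, ... so any predecessor is reachable in O(log N) hops.

def back_links(index):
    """Indices that the entry at `index` links back to: index-1, index-2, index-4, ..."""
    links, step = [], 1
    while index - step >= 0:
        links.append(index - step)
        step *= 2
    return links

def hops_to(start, target):
    """Walk from entry `start` back to entry `target`, greedily taking
    the longest back-link that does not overshoot the target."""
    hops, cur = 0, start
    while cur > target:
        cur = min(l for l in back_links(cur) if l >= target)
        hops += 1
    return hops

# With only single links, reaching 1000 entries back costs 1000 hops;
# with exponential links it costs one hop per set bit of the distance.
print(hops_to(1000, 0))  # 6 hops instead of 1000
```

The links are extra pointers on top of the ordinary link to the immediate predecessor, so sequential traversal still works unchanged.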

@haadcode
Member

If it helps, and the pubsub (publisher?) implementation is not dependent on the message order, we could try to think of it without ordering (leaving ordering to the application to handle) and see what would be the simplest way to "write/read messages to/from a topic"?

@hackergrrl
Author

@haadcode @Kubuxu: Point well received re: skip lists. This makes a lot of
sense: needing to walk backwards in lockstep fashion would make for very slow
traversal. 👍

Could your solution still work if it was indeed a set of "privileged" nodes
acting as publishers in the network and subscribers can communicate to them
to publish? A service within a network. This would still be an acceptable
solution to get started, imo, it's decentralized at least. With the upcoming
private networks, small orgs can set it up within their network by providing
a publisher node, and in the public ipfs network anyone can provide a
publisher node. What do you think?

I think that's a good idea. Like you said it still relies on a central
authority, but the publishers themselves can be trivially decentralized.

I'm still not sure whether this is something that needs to be a part of core
pubsub, or if it should be layered on top. I'm leaning toward keeping pubsub
simple and letting this abstraction be layered on top if/when needed.

I see. So in case there are two new heads A and B, both pointing to an old
head C, what does the next new head, D, look like? D points to A and B? How
can the publisher know that it needs to link to both of them instead of just
"the most recent" (say A was posted before B)? Can A and B ever point to
different old heads?

A node doesn't need to know or make sure it has all of the heads before
publishing -- it just makes sure its new head links back to all of the heads it
does know about. This makes the merkle links into causal links: given a
published message, you can determine with certainty which other messages it
definitely came after.

E.g. from your example, D points to A and B if the publishing peer knows about
A and B. Other heads could be out there somewhere that D's publisher doesn't
know about, for all we know. But D linking to A and B means "hey, A and B
definitely happened before D".
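A small Python sketch of this causality check (the dag dict and the follows helper are hypothetical, for illustration only):

```python
# Illustrative sketch: prevHeads links as causal links. follows(x, y) is True
# when message x transitively links back to y, i.e. y definitely happened
# before x. Concurrent messages follow each other in neither direction.

dag = {
    "C": [],          # the old head
    "A": ["C"],       # A and B were published concurrently on top of C
    "B": ["C"],
    "D": ["A", "B"],  # D's publisher knew about both heads A and B
}

def follows(x, y):
    """True if y is reachable from x via prevHeads links."""
    stack = list(dag[x])
    while stack:
        node = stack.pop()
        if node == y:
            return True
        stack.extend(dag[node])
    return False

print(follows("D", "C"))  # True: C definitely happened before D
print(follows("A", "B"))  # False: A and B are concurrent
```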

If it helps, and the pubsub (publisher?) implementation is not dependent on
the message order, we could try to think of it without (and leave ordering to
the application to handle) and see what would be the simplest way to
"write/read messages to/from a topic"?

Agreed. Don't we get this by just dropping the prevHeads field from the JSON
structure? Adding something like e.g. a sequence number onto the structure
would give messages an absolute ordering. (We could do this on the IPRS layer
too, which supports notions of record validity and ordering.)

@qxotk

qxotk commented May 20, 2016

I am very interested in the #pubsub area within IPFS.

Does anyone have a reference to any of the time-tested research papers on the topic?

Is there one I could start with and then follow the end-notes to other important papers?

@hackergrrl
Author

hackergrrl commented May 20, 2016

@jmsmcfrlnd you might want to talk to @nicola, who is working on a gossip implementation (JS) based on a paper, which could be a great cornerstone for a gossip-based many-to-many pubsub implementation. Also check out @whyrusleeping's work on one-to-many pubsub (Go), which is pretty far along! -- found here.

@nicola
Member

nicola commented May 20, 2016

thanks for the mention @noffle, I would underline that a general purpose pubsub would probably not be based on gossip, but instead should try to reuse the dht itself. (will update you on this)

@MasterJames

If you simulated a quorum, one might choose how many peers are needed to quantify a write.
So all those in the pub list have to maintain connections to subs with websockets, but managed with redundancy. (I like to think of citizens and no masters or slaves, but here we're talking about multiple randomly selected masters functioning as a quorum. [Each has a standby or two.]) Two pubs per sub: during notifications, random statistical latency ensures faster notifications while the 2nd confirms validity (or 4 are needed). Both a time delay and/or other options are used to update the master subscription's M-DAG ID, but basically it's just a record for new subs joining. The quorum would function like a shared FS but more like Redis.io on top, with deep JS (maybe BJSON) object linking.
It could still transport GraphQL and/or diff changes per subscriber.
I guess I'm saying IPFS would work to keep a list of "citizens", connect them, and agree on how to share the load, but not write new changes other than to store stats periodically.
Also consider latency-paired quorum master children for exponential notification dissemination.

@MasterJames

MasterJames commented May 30, 2016

Maybe this bit about Redis is relevant:
https://github.com/ipfs/go-ipfs/blob/master/Godeps/_workspace/src/github.com/ipfs/go-datastore/redis/redis.go
Also, is it less critical if there's mutability with ipfs mfs, maybe? Is this mfs idea still not available? Where can I learn more?
https://godoc.org/github.com/jbenet/go-datastore
Code samples all show .List() with totally different names and results:
https://godoc.org/github.com/jbenet/go-datastore#Key.Instance
The big question, which is not clear in the docs: does 'Put' put mutably?
[No camel case for Go? Functions don't start with capitals in my world?! Never did. Should not.]
I'm not seeing the level of mutability I need clearly enough. Will check back later. Good luck.

@daviddias
Member

To everyone following this thread, check out the latest Tutorial published by @pgte on how to use PubSub to create Real-Time apps over IPFS

@daviddias
Member

Also, in case you missed it, PubSub has been here for a while, see the spec at https://github.com/ipfs/interface-ipfs-core/tree/master/API/pubsub

@vmx
Member

vmx commented May 18, 2018

@daviddias daviddias added the topic/libp2p Topic libp2p label Jul 7, 2018