Merge pull request #170 from ipfs/feature/new-ipfs-explainer

New IPFS explainer
ipfs-inactive · Aug 30, 2019 · 3baf16e · 3baf16e
2 parents d8e061e + 864a892
commit 3baf16e

Showing 9 changed files with 132 additions and 33 deletions.
diff --git a/content/_index.md b/content/_index.md
@@ -5,19 +5,15 @@ title: IPFS Documentation
 
 Welcome to the IPFS documentation portal! Whether you’re just learning about IPFS or are looking for detailed reference information, this is the place to start. You might have noticed that IPFS is a project with a big scope — and a *lot* of different tools, sites, and code.
 
-Here's an overview of what you'll find in our documentation:
+Here’s an overview of what you’ll find in our documentation:
 
 ## Introduction
 
 Head over to the [introduction](/introduction) section to learn about the basics of IPFS. There are also instructions on how to install IPFS, and tips on basic IPFS usage.
 
 ## Guides
 
-IPFS is a complex system that hopes to change how we use the internet, so it comes with many new concepts. The guides section has an overview of major [concepts](/guides/concepts) in IPFS (including terms and ideas associated with distributed file systems generally), and guides for specific IPFS use cases. The examples section is home to a number of [basic examples](/guides/examples) of ways to use the IPFS ecosystem, including:
-
-* A simple [how-to on pinning](/guides/examples/pinning)
-* Instructions for [making your own IPFS service](/guides/examples/api/service/readme)
-* A guide to [hosting your website](/guides/examples/websites)
+IPFS is a system that hopes to change how we use the internet, so it comes with many new concepts. The guides section has an overview of major [concepts](/guides/concepts) in IPFS (including terms and ideas associated with distributed file systems generally), guides for specific IPFS use cases, and example projects demonstrating various ways to use the IPFS ecosystem.
 
 For detailed guidance on select topics, try out the interactive tutorials at [ProtoSchool](https://proto.school). You can learn about the decentralized web by solving code challenges.
 

diff --git a/content/guides/concepts/merkle-DAG.md b/content/guides/concepts/merkle-DAG.md
@@ -0,0 +1,29 @@
+---
+title: "Merkle-DAGs"
+menu:
+    guides:
+        parent: concepts
+---
+
+A _Directed Acyclic Graph_ (DAG) is a type of graph in which edges have direction and cycles are not allowed. For example, a linked list like _A→B→C_ is an instance of a DAG where _A_ references _B_ and so on. We say that _B_ is _a child_ or _a descendant of A_, and that _node A has a link to B_. Conversely, _A_ is a _parent of B_. We call nodes that are not children of any other node in the DAG _root nodes_.
+
+A Merkle-DAG is a DAG where each node has an identifier and this is the result of hashing the node’s contents — any opaque payload carried by the node and the list of identifiers of its children — using a cryptographic hash function like SHA256. This brings some important considerations:
+
+  1. Merkle-DAGs can only be constructed from the leaves, that is, from nodes without children. Parents are added after children because the children’s identifiers must be computed in advance to be able to link them.
+  1. Every node in a Merkle-DAG is the root of a (sub)Merkle-DAG itself, and this subgraph is _contained_ in the parent DAG[9].
+  1. Merkle-DAG nodes are _immutable_. Any change in a node would alter its identifier and thus affect all of its ancestors in the DAG, essentially creating a different DAG. Take a look at [this helpful illustration using bananas](https://media.consensys.net/ever-wonder-how-merkle-trees-work-c2f8b7100ed3) from our friends at Consensys.
+
+Identifying a data object (like a Merkle-DAG node) by the value of its hash is referred to as _content addressing_. Thus, we call the node identifier a _Content Identifier_, or CID.
+
+For example, the previous linked list, assuming that the payload of each node is just the CID of its descendant, would be: _A=Hash(B)→B=Hash(C)→C=Hash(∅)_. The properties of the hash function ensure that no cycles can exist when creating Merkle-DAGs[10].
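
The linked-list example can be sketched in a few lines of Python. This is a minimal illustration, not how IPFS computes real CIDs: the `node_id` helper is hypothetical, the identifiers are plain SHA-256 hex digests, and every payload is assumed to be empty so that only the links contribute to the hash.

```python
import hashlib
from typing import Sequence


def node_id(payload: bytes, child_ids: Sequence[str]) -> str:
    """Toy identifier: SHA-256 over the payload and the children's identifiers."""
    h = hashlib.sha256()
    # Length-prefix the payload so it cannot be confused with the child list.
    h.update(len(payload).to_bytes(8, "big"))
    h.update(payload)
    for child in child_ids:
        h.update(bytes.fromhex(child))
    return h.hexdigest()


# The list A -> B -> C has to be built from the leaf upward,
# because each parent needs its child's identifier first.
c = node_id(b"", [])
b = node_id(b"", [c])
a = node_id(b"", [b])

# Changing the leaf changes every ancestor's identifier: a different DAG.
c_changed = node_id(b"new payload", [])
b_changed = node_id(b"", [c_changed])
a_changed = node_id(b"", [b_changed])
```

Comparing `a` with `a_changed` shows the immutability property from the list above: the edit happened in the leaf, yet the root identifier is different too.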
+
+Merkle-DAGs are _self-verified_ structures. The CID of a node is univocally linked to the contents of its payload and those of all its descendants. Thus two nodes with the same CID univocally represent exactly the same DAG. This will be a key property to efficiently sync Merkle-CRDTs without having to copy the full DAG, as exploited by systems like IPFS. Merkle-DAGs are very widely used. Source control systems like Git [11] and others [6] use them to efficiently store the repository history, in a way that enables de-duplicating the objects and detecting conflicts between branches.
+
+_Excerpted from the Merkle-CRDT draft paper by @hsanjuan, @haadcode, and @pgte. Available: https://hector.link/presentations/merkle-crdts/merkle-crdts.pdf_
+
+
+### Footnotes
+
+[6] Merkle-DAGs are similar to Merkle Trees [20] but there are no balance requirements and every node can carry a payload. In DAGs, several branches can re-converge or, in other words, a node can have several parents.
+
+[10] Hash functions are one way functions. Creating a cycle should then be impossibly difficult, unless some weakness is discovered and exploited.
diff --git a/content/introduction/assets/ipfs_stack-apps.png b/content/introduction/assets/ipfs_stack-apps.png
diff --git a/content/introduction/assets/ipfs_stack-data.png b/content/introduction/assets/ipfs_stack-data.png
diff --git a/content/introduction/assets/ipfs_stack-exchange_routing.png b/content/introduction/assets/ipfs_stack-exchange_routing.png
diff --git a/content/introduction/assets/ipfs_stack.png b/content/introduction/assets/ipfs_stack.png
diff --git a/content/introduction/how-ipfs-works.md b/content/introduction/how-ipfs-works.md
@@ -0,0 +1,75 @@
+---
+title: How IPFS Works
+weight: 2
+---
+
+IPFS is a peer-to-peer (p2p) storage network. Content is accessible through peers that might relay information or store it (or do both), and those peers can be located anywhere in the world. IPFS knows how to find what you ask for by its content address, rather than where it is.
+
+## There are three important things to understand about IPFS
+
+Let’s first look at _content addressing_ and how that content is _linked together_. This “middle” part of the IPFS stack is what connects the ecosystem together; everything is built on being able to find content via linked, unique identifiers.
+
+### 1 \\ Content addressing and linked data
+
+IPFS uses _content addressing_ to identify content by what’s in it, rather than by where it’s located. Looking for an item by content is something you already do all the time. For example, when you look for a book in the library, you often ask for it by the title; that’s content addressing because you’re asking for **what** it is. If you were using location addressing to find that book, you’d ask for it by **where** it is: “I want the book that’s on the second floor, first stack, third shelf from the bottom, four books from the left.” If someone moved that book, you’d be out of luck!
+
+It’s the same on the internet and on your computer. Right now, content is found by location, such as…
+
+- `https://en.wikipedia.org/wiki/Aardvark`
+- `/Users/Alice/Documents/term_paper.doc`
+- `C:\Users\Joe\My Documents\project_sprint_presentation.ppt`
+
+By contrast, every piece of content that uses the IPFS protocol has a [*content identifier*]({{<relref "guides/concepts/cid.md">}}), or CID, that is its *hash*. The hash is unique to the content that it came from, even though it may look short compared to the original content. _If hashes are new to you, check out [the concept guide on hashes]({{<relref "guides/concepts/hashes.md">}}) for an introduction._
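
To make the difference from location addressing concrete, here is a minimal sketch of a content-addressed store. It is an illustration only: the `put` and `get` helpers and the in-memory `store` are hypothetical, and a bare SHA-256 hex digest stands in for a real CID.

```python
import hashlib

# Toy in-memory store: the key is derived from the content itself.
store = {}


def put(content: bytes) -> str:
    """Store content under its own hash and return that address."""
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address


def get(address: str) -> bytes:
    """Fetch content by address and check that it still matches its hash."""
    content = store[address]
    if hashlib.sha256(content).hexdigest() != address:
        raise ValueError("content does not match its address")
    return content


address = put(b"All about aardvarks")
same_address = put(b"All about aardvarks")  # same content, same address
```

Because the address depends only on the bytes, it does not matter which machine or folder the content lives in, and storing the same content twice does not create a second entry.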
+
+Content addressing through hashes has become a widely-used means of connecting data in distributed systems, from the commits that back your code to the blockchains that run cryptocurrencies. However, the underlying data structures in these systems are not necessarily interoperable.
+
+This is where the [IPLD project](https://ipld.io/) comes in. **Hashes identify content, and IPLD translates between data structures**. Since different distributed systems structure their data in different ways, IPLD provides libraries for combining pluggable modules (parsers for each possible type of IPLD node) to resolve a path, selector, or query across many linked nodes (allowing you to explore data regardless of the underlying protocol). IPLD provides a way to translate between content-addressable data structures: “Oh you use Git-style, no worries, I can follow those links. Oh you use Ethereum, I got you, I can follow those links too!”
+
+The IPFS protocol uses IPLD to get from raw content to an IPFS address. IPFS has its own preferences and conventions about how data should be broken up into a DAG (more on DAGs below!); IPLD links content on the IPFS network together using those conventions.
+
+**Everything else in the IPFS ecosystem builds on top of this core concept: linked, addressable content is the fundamental connecting element that makes the rest work.**
+
+### 2 \\ IPFS turns files into DAGs
+
+IPFS and many other distributed systems take advantage of a data structure called [directed acyclic graphs](https://en.wikipedia.org/wiki/Directed_acyclic_graph), or DAGs. Specifically, they use _Merkle-DAGs_, which are DAGs where each node has an identifier that is a hash of the node’s contents. Sound familiar? This refers back to the _CID_ concept that we covered in the previous section. Another way to look at this CID-linked-data concept: identifying a data object (like a Merkle-DAG node) by the value of its hash is _content addressing_. _(Check out [the concept guide on Merkle-DAGs]({{<relref "guides/concepts/merkle-DAG.md">}}) for a more in-depth treatment of this topic.)_
+
+IPFS uses a Merkle-DAG that is optimized for representing directories and files, but you can structure a Merkle-DAG in many different ways. For example, Git uses a Merkle-DAG that has many versions of your repo inside of it.
+
+To build a Merkle-DAG representation of your content, IPFS often first splits it into _blocks_. Splitting it into blocks means that different parts of the file can come from different sources, and be authenticated quickly. (If you've ever used BitTorrent, you may have noticed that when you download a file, BitTorrent can fetch it from multiple peers at once; this is the same idea.)
+
+Merkle-DAGs are a bit of a [“turtles all the way down”](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Turtles_all_the_way_down.html) scenario; that is, **everything** has a CID. You’ve got a file that has a CID. What if there are several files in a folder? That folder has a CID, and that CID contains the CIDs of the files underneath. In turn, those files are made up of blocks, and each of those blocks has a CID. You can see how a file system on your computer could be represented as a DAG. You can also see, hopefully, how Merkle-DAG graphs start to form. For a visual exploration of this concept, take a look at our [IPLD Explorer](https://explore.ipld.io/#/explore/QmSnuWmxptJZdLJpKRarxBMS2Ju2oANVrgbr2xWbie9b2D).
+
+Another useful feature of Merkle-DAGs and breaking content into blocks is that if you have two similar files, they can share parts of the Merkle-DAG; i.e., parts of different Merkle-DAGs can reference the same data. For example, if you update a website, only the files that changed will get new content addresses. Your old version and your new version can refer to the same blocks for everything else. This can make transferring versions of large datasets (such as genomics research or weather data) more efficient because you only need to transfer the parts that are new or have changed instead of creating entirely new files each time.
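
The sharing of blocks between two versions can be sketched as follows. The assumptions are loud ones: blocks are cut at a fixed size of 8 bytes purely for readability, block identifiers are plain SHA-256 digests rather than real CIDs, and `split_into_blocks` and `block_ids` are hypothetical helpers, not IPFS APIs.

```python
import hashlib


def split_into_blocks(data: bytes, block_size: int = 8) -> list:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_ids(data: bytes, block_size: int = 8) -> list:
    """Hash every block to get its toy identifier."""
    return [hashlib.sha256(block).hexdigest()
            for block in split_into_blocks(data, block_size)]


old_version = b"AAAAAAAABBBBBBBBCCCCCCCC"
new_version = b"AAAAAAAAXXXXXXXXCCCCCCCC"  # only the middle block changed

old_ids = block_ids(old_version)
new_ids = block_ids(new_version)

# Blocks the receiver already holds do not need to be sent again.
known = set(old_ids)
to_transfer = [cid for cid in new_ids if cid not in known]
```

Here `to_transfer` holds a single identifier, because the first and last blocks hash to the same values in both versions.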
+
+
+### 3 \\ The DHT
+
+So, to recap, IPFS lets you give CIDs to content, and link that content together in a Merkle-DAG using IPLD. Now let’s move on to the last piece: how you find and move content.
+
+To find which peers are hosting the content you’re after (_discovery_), IPFS uses a [_distributed hash table_](https://en.wikipedia.org/wiki/Distributed_hash_table), or DHT. A hash table is a database of keys to values. A _distributed_ hash table is one where the table is split across all the peers in a distributed network. To find content, you ask these peers.
+
+The <a href="https://libp2p.io/">libp2p project</a> is the part of the IPFS ecosystem that provides the DHT and handles peers connecting and talking to each other. (Note that, as with IPLD, libp2p can also be used as a tool for other distributed systems, not just IPFS.)
+
+Once you know where your content is (i.e., which peer or peers are storing each of the blocks that make up the content you’re after), you use the DHT **again** to find the current location of those peers (_routing_). So, in order to get to content, you use libp2p to query the DHT twice.
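
The two lookups can be pictured with two plain dictionaries standing in for the distributed table. This is only a toy model under stated assumptions: a real DHT spreads these records across many peers, and the CID string, peer names, addresses, and the `find_providers`, `find_peer`, and `locate` helpers are all made up for illustration.

```python
# Lookup 1 (discovery): which peers provide a given CID?
provider_records = {
    "example-cid": ["peer-A", "peer-B"],
}

# Lookup 2 (routing): where can each of those peers be reached right now?
peer_records = {
    "peer-A": "/ip4/192.0.2.1/tcp/4001",
    "peer-B": "/ip4/192.0.2.2/tcp/4001",
}


def find_providers(cid: str) -> list:
    """First query: content identifier -> peer identifiers."""
    return provider_records.get(cid, [])


def find_peer(peer_id: str):
    """Second query: peer identifier -> current address (None if unknown)."""
    return peer_records.get(peer_id)


def locate(cid: str) -> dict:
    """Chain the two queries to get from a CID to reachable addresses."""
    return {peer_id: find_peer(peer_id) for peer_id in find_providers(cid)}


locations = locate("example-cid")
```

Keeping the two tables separate reflects the point made above: a peer's address can change without touching the record of what that peer provides.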
+
+You’ve discovered your content, and you’ve found the current location(s) of that content — now you need to connect to that content and get it (_exchange_). To request blocks from and send blocks to other peers, IPFS currently uses a module called [_Bitswap_](https://github.com/ipfs/specs/tree/master/bitswap). Bitswap allows you to connect to the peer or peers that have the content you want, send them your _wantlist_ (a list of all the blocks you're interested in), and have them send you the blocks you requested. Once those blocks arrive, you can verify them by hashing their content to get CIDs. (These CIDs also allow you to deduplicate blocks if needed.)
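
The wantlist exchange and the verification step can be sketched like this. It is not the Bitswap wire protocol: the `Peer` and `TamperingPeer` classes and the `fetch` function are hypothetical, and SHA-256 hex digests again stand in for CIDs.

```python
import hashlib


def cid_of(block: bytes) -> str:
    """Toy stand-in for a CID: the SHA-256 hex digest of the block."""
    return hashlib.sha256(block).hexdigest()


class Peer:
    """A peer that holds blocks and answers wantlists honestly."""

    def __init__(self, blocks):
        self.blockstore = {cid_of(block): block for block in blocks}

    def serve(self, wantlist):
        return [self.blockstore[cid] for cid in wantlist if cid in self.blockstore]


class TamperingPeer(Peer):
    """A peer that answers every request with the wrong bytes."""

    def serve(self, wantlist):
        return [b"garbage" for _ in wantlist]


def fetch(wantlist, peer):
    """Send a wantlist, then keep only blocks whose hash was actually wanted."""
    received = {}
    for block in peer.serve(wantlist):
        cid = cid_of(block)
        if cid in wantlist:  # verification: re-hash and compare
            received[cid] = block  # keying by CID also deduplicates
    return received


blocks = [b"first block", b"second block"]
wantlist = [cid_of(block) for block in blocks]

from_honest_peer = fetch(wantlist, Peer(blocks))
from_tampering_peer = fetch(wantlist, TamperingPeer(blocks))
```

The requester never has to trust the sender: a block either hashes to an identifier on the wantlist or it is dropped.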
+
+There are [other content replication protocols under discussion](https://github.com/ipfs/camp/blob/master/DEEP_DIVES/24-replication-protocol.md) as well, the most developed of which is [_Graphsync_](https://github.com/ipld/specs/blob/master/block-layer/graphsync/graphsync.md). There's also a proposal under discussion to [extend the Bitswap protocol](https://github.com/ipfs/go-bitswap/issues/186) to add functionality around requests and responses.
+
+#### A note on libp2p
+
+What makes libp2p especially useful for peer-to-peer connections is _connection multiplexing_. Traditionally, every service in a system would open a different connection to remotely communicate with other services of the same kind. Using IPFS, you open just one connection, and you multiplex everything on that. For everything your peers need to talk to each other about, you send a little bit of each thing, and the other end knows how to sort those chunks where they belong.
+
+This is useful because connections are usually hard to set up and expensive to maintain. With multiplexing, once you have that connection, you can do whatever you need on it.
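
The idea of sending "a little bit of each thing" over one connection can be sketched as tagged frames. This is a conceptual toy, not any of libp2p's actual stream multiplexers: the `multiplex` and `demultiplex` helpers, the stream names, and the 4-byte chunk size are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import zip_longest


def multiplex(streams: dict, chunk_size: int = 4) -> list:
    """Interleave chunks from several streams into one sequence of tagged frames."""
    per_stream = [
        [(stream_id, data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
        for stream_id, data in streams.items()
    ]
    frames = []
    # Take one chunk from each stream per round until all are exhausted.
    for round_of_frames in zip_longest(*per_stream):
        frames.extend(frame for frame in round_of_frames if frame is not None)
    return frames


def demultiplex(frames: list) -> dict:
    """Sort frames back into their streams using the tag on each frame."""
    streams = defaultdict(bytes)
    for stream_id, chunk in frames:
        streams[stream_id] += chunk
    return dict(streams)


outgoing = {"dht": b"find-providers", "bitswap": b"want-block"}
frames = multiplex(outgoing)
incoming = demultiplex(frames)
```

The single list `frames` plays the role of the one shared connection; the tag on each frame is what lets the other end sort the chunks where they belong.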
+
+
+## And everything is modular
+
+As you may have noticed from this discussion, the IPFS ecosystem is made up of many modular libraries that support specific parts of any distributed system. You can certainly use any part of the stack independently, or combine them in novel ways.
+
+
+## Summary
+
+The IPFS ecosystem gives CIDs to content, and links that content together by generating IPLD-Merkle-DAGs. You can discover content using a DHT that's provided by libp2p, and open a connection to any provider of that content and download it using a multiplexed connection. All of this is held together by the “middle” of the stack, which is linked, unique identifiers; that's the essential part that IPFS is built on.
+
+<!--Next, we’ll look at how IPFS is an interconnected network of equal peers, each with the same abilities (no client-server relationships), and what that means for system architectures. We’ll also touch on another useful project in the ecosystem -- IPFS Cluster -- that can help make sure your content is always available, even on a network like IPFS that supports peers dropping in and out at will.-->