Mar/ipld1.4 #415

Merged
merged 11 commits into from
Dec 19, 2022
40 changes: 35 additions & 5 deletions config/_default/goals.json
Expand Up @@ -255,16 +255,46 @@
"levels": ["shallow", "deep"]
},
"1.4": {
"description": "Go over the underlying data formats and layouts in IPLD",
"subgoals":[{}],
"description": "Go over the underlying data types and formats in IPLD",
"subgoals":[
{
"id": "1.41",
"description": "Understand the different types of data used with IPLD"
},
{
"id": "1.42",
"description": "Discover the default codec in IPFS and how it differs from other IPLD-native codecs"
},
{
"id": "1.43",
"description": "Describe how IPLD is interoperable with other addressed systems"
},
{
"id": "1.44",
"description": "Understand how Schemas provide encoding and storage convenience for application developers"
}
],
"levels": ["deep"]
},
"1.5": {
"description": "Know how distributed data structures bring improvements to decentralized networks",
"subgoals": [{}],
"description": "Gain a conceptual understanding on what algorithms we use with Merkle DAGs to build distributed data structures",
"subgoals": [
{
"id": "1.51",
"description": "Understand what makes distributed data structures so powerful"
},
{
"id": "1.52",
"description": "Learn the basics of the HAMT algorithm operations"
},
{
"id": "1.53",
"description": "How to traverse a Merkle DAG with IPLD pathing"
}
],
"levels":["deep"]
},
"1.6": {
"1.7": {
"description": "Learn about the CAR format and how it helps data distribution",
"subgoals": [
{
Expand Down
141 changes: 54 additions & 87 deletions content/en/curriculum/ipld/codecs.md
@@ -1,5 +1,5 @@
---
title: "Codecs"
title: "The Data Model & Codecs"
description: "Understand what IPLD codecs are and what they are used for"
draft: false
menu:
Expand All @@ -9,72 +9,62 @@ weight: 230
category: lecture
level:
- deep
objectives:
show: true
goals:
- "1.4"
subgoals:
- 1.41
- 1.42
- 1.43
---
## What is the Data Model?
The Data Model is how we reason about data as it moves through its various states: in memory, during programmatic access and manipulation, and during serialization to and from bytes for storage or transfer.

IPLD is ambitious in its aims to be able to represent many, varied types of content addressed data. To do this, it must be able to decode and encode those data formats. Being able to represent, manipulate and navigate data in memory is only possible if we can turn at-rest binary data into meaningful data structures.
Like the JSON data model, the IPLD Data Model includes data **[Kinds](https://ipld.io/docs/schemas/using/authoring-guide/#schema-kinds)** which include **Booleans**, **Strings**, **Ints**, **Floats**, **Null**, **Lists** and **Maps**, but also adds **[Bytes](https://ipld.io/docs/schemas/using/authoring-guide/#bytesprefix-unions-for-bytes)** and **[Links](https://ipld.io/docs/schemas/using/authoring-guide/#links)** (CIDs).

Every content addressed data system defines at least one data storage format. Some formats are common between systems—JSON is a common format since it is supported across almost every programming language and is easy to read! Binary formats are common for their compactness when storing or transferring large amounts of data. [CBOR is a binary format](https://cbor.io/) that is similar to JSON but more compact and able to represent more data types.
The data model defines a common representation of basic types that **are easily representable by common programming languages** and **found in the most common and successful serialization formats**.

Popular content addressed systems such as Git, Bitcoin and Ethereum have their own unique and custom formats, specifically engineered to their own use-cases.

Similarly, IPFS began life with its own data encoding format, specifically designed around the needs of file storage and addressing. Over time, the native IPFS data format was defined as "DAG-PB" (because it is a Protobuf based format for building DAGs), with an additional layer on top of it called UnixFS for encoding file and directory metadata.

But IPLD gives IPFS superpowers to store, transfer, address and manipulate many other data formats, and the self-describing nature of CIDs give us the tools to link between them.
```js
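import { CID } from 'multiformats/cid' // assumes the `multiformats` package is installed

const data = {
  string: "☺️ we can do strings!",
  ints: 1337,
  floats: 13.37,
  booleans: true,
  lists: [1, 2, 3],
  bytes: new Uint8Array([0x01, 0x03, 0x03, 0x07]),
  // A Link is a CID; parse it from its string form
  links: CID.parse('bafyreidykglsfhoixmivffc5uwhcgshx4j465xwqntbmu43nb2dzqwfvae')
}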
```

**Codecs are how IPLD moves data between the raw byte representation and their equivalent Data Model form.**
Read more about the Data Model at [**ipld.io/docs/data-model**](https://ipld.io/docs/data-model/)

## Codecs for Non-IPFS Systems
### What are Kinds?

IPLD has had codecs written for many different content addressed systems:
We refer to the different kinds of representable data in the Data Model as "kinds": **Booleans**, **Strings**, **Ints**, **Floats**, **Null**, **Bytes**, **Lists**, **Maps** and **Links**. The 'recursive kinds' are **Maps** and **Lists** (since they can contain other kinds). We use the term "kinds" here to disambiguate this from "types", which is a term we use at the Schemas level.

* Git (multicodec code `0x78`)
* Bitcoin (multicodec codes `0xb0`, `0xb1`, `0xb2`)
* Ethereum (multicodec codes `0x90` to `0x9a`)
* Zcash (multicodec codes `0xc0` and `0xc1`)
* Dash (multicodec code `0xf0` and `0xf1`)
* Bittorrent (multicodec code `0x7b` and `0x7c`)
* ... and more
Read more about IPLD Kinds and specifics of what we expect regarding their bounds and representation at [ipld.io/docs/data-model/kinds](https://ipld.io/docs/data-model/kinds/)
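
As a rough illustration of the kinds listed above, the sketch below classifies plain JavaScript values by Data Model kind. The `Link` class and the `kindOf` and `isRecursive` helpers are hypothetical names used only for this example, and the int/float split is an approximation because JavaScript has a single number type.

```js
// Hypothetical stand-in for a CID; real code would use a CID library.
class Link {
  constructor (cid) {
    this.cid = cid
  }
}

// Classify a JavaScript value as one of the nine Data Model kinds.
function kindOf (value) {
  if (value === null) return 'null'
  if (typeof value === 'boolean') return 'boolean'
  if (typeof value === 'string') return 'string'
  if (typeof value === 'number') {
    // JavaScript cannot tell 1.0 from 1, so whole numbers are treated as ints.
    return Number.isInteger(value) ? 'int' : 'float'
  }
  if (value instanceof Uint8Array) return 'bytes'
  if (value instanceof Link) return 'link'
  if (Array.isArray(value)) return 'list'
  if (typeof value === 'object') return 'map'
  throw new TypeError(`not representable in the Data Model: ${typeof value}`)
}

// Lists and Maps are the recursive kinds: they can contain other kinds.
const isRecursive = (value) => ['list', 'map'].includes(kindOf(value))

console.log(kindOf(1337), kindOf(13.37), kindOf([1, 2, 3]), isRecursive({ a: 1 }))
```

Values such as `undefined` or functions fall through to the `TypeError`, since they have no equivalent kind.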

*Caveat emptor: as these are not core to IPFS or IPLD, most of these codecs are usually not as actively maintained as the IPLD-native codecs and may need some love!*
## What are Codecs?
Codecs are how IPLD moves data between its raw byte representation and its equivalent **Data Model** form. IPLD is ambitious in its aim to represent many varied types of content addressed data (not just file data for IPFS). To do this, it must be able to _encode and decode_ those data formats. Representing, manipulating and navigating data in memory is only possible if at-rest binary data can be turned into meaningful data structures, and codecs are what make that possible.
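
To make the encode/decode idea concrete, here is a minimal sketch of what a codec could look like as a plain object. The object shape (`name`, `code`, `encode`, `decode`) is an assumption for illustration, not a definitive interface; the codes are the multicodec codes this document gives for JSON (`0x0200`) and raw (`0x55`).

```js
// Hypothetical codec shape: a name, a multicodec code, and an encode/decode pair.
const jsonCodec = {
  name: 'json',
  code: 0x0200,
  // Data Model form -> bytes
  encode: (value) => new TextEncoder().encode(JSON.stringify(value)),
  // bytes -> Data Model form
  decode: (bytes) => JSON.parse(new TextDecoder().decode(bytes))
}

// The raw codec is a pass-through: stored bytes are already Data Model Bytes.
const rawCodec = {
  name: 'raw',
  code: 0x55,
  encode: (bytes) => bytes,
  decode: (bytes) => bytes
}

const encoded = jsonCodec.encode({ ints: 1337, lists: [1, 2, 3] })
const decoded = jsonCodec.decode(encoded)
console.log(encoded instanceof Uint8Array, decoded.ints)
```

A round trip through `encode` and `decode` should give back an equivalent value; that property is what lets the same in-memory data be stored or transferred as bytes.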

## IPLD-native Codecs
### IPLD-native Codecs

Codecs for both **JSON** (multicodec code `0x0200`) and **CBOR** (multicodec code `0x51`) data are bundled with IPFS and may be used within most IPLD systems. However, both of these formats lack the key **Link** kind, so cannot form coherent, linked DAGs and can therefore only be terminal blocks within IPLD graphs.
Codecs for both **JSON** (multicodec code `0x0200`) and **CBOR** (multicodec code `0x51`) data are bundled with IPFS and may be used within most IPLD systems. However, both of these formats lack the key **Link** kind, so they cannot form coherent, linked DAGs and can therefore only be terminal blocks within IPLD graphs.

The **raw** codec (multicodec code `0x55`) is essentially a pass-through, from stored bytes, to Data Model **Bytes**. It also can only be a terminal within an IPLD graph but is used within IPFS file graphs to represent raw file data—usually as chunks of a complete file, connected by a parent **DAG-PB** block.

The [**DAG-PB**](https://ipld.io/specs/codecs/dag-pb/) codec is the original IPFS file data codec. It uses a fixed Protobuf format to represent just enough data to describe connected graphs of named links pointing to file data. DAG-PB is limited in that it can only represent Data Model **Bytes** and named **Links**. It is difficult to use DAG-PB for more than standard file data but because it is the native IPFS data format, it is common for IPFS users to translate their data structures into file form. Useful data is often stored within JSON files which are then encoded using DAG-PB and addressed by their file name. Unfortunately, this means that IPLD (and IPFS by extension) can't help users navigate at the level of their useful data, it can only present them with the files for them to decode. Which is why IPLD introduces two new flexible formats.
The [**DAG-PB**](https://ipld.io/specs/codecs/dag-pb/) codec is the original IPFS file data codec. It uses a fixed Protobuf format to represent just enough data to describe connected graphs of named links pointing to file data. DAG-PB is limited in that it can only represent Data Model **Bytes** and named **Links**. It is difficult to use DAG-PB for [data other than standard file data](/curriculum/ipld/ipld-and-ipfs/#limitations-of-file-data).

[**DAG-CBOR**](https://ipld.io/specs/codecs/dag-cbor/) is the most flexible (and arguably the most useful) native format of IPLD and IPFS. Built on CBOR, it can represent all Data Model kinds, only needing to add **Links** to what CBOR already supports (this is done via CBOR's tag system). DAG-CBOR is the native format of the Filecoin chain, and is recommended for users building applications on IPLD (and IPFS) that are not focused on files.
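
As a sketch of how a Link travels through CBOR's tag system: DAG-CBOR wraps a byte string in tag 42, and that byte string is a `0x00` prefix followed by the binary CID. The example below splits such a payload into its parts. The hex string is the tag 42 payload from the CBOR diagnostic output of this document's example data; `readVarint` and `parseLinkPayload` are hypothetical helper names, and the parsing assumes a CIDv1.

```js
// Read one unsigned varint starting at `offset`; returns [value, nextOffset].
function readVarint (bytes, offset) {
  let value = 0
  let shift = 0
  while (true) {
    if (offset >= bytes.length) throw new RangeError('truncated varint')
    const byte = bytes[offset++]
    value += (byte & 0x7f) * 2 ** shift
    if ((byte & 0x80) === 0) return [value, offset]
    shift += 7
  }
}

// Split the byte string carried inside tag 42 into its CID components.
function parseLinkPayload (hex) {
  const bytes = Uint8Array.from(Buffer.from(hex, 'hex'))
  if (bytes[0] !== 0x00) throw new Error('expected 0x00 prefix before the CID')
  let offset = 1
  const next = () => {
    const [value, end] = readVarint(bytes, offset)
    offset = end
    return value
  }
  const version = next() // CID version
  const codec = next() // multicodec code of the linked block
  const hashCode = next() // multihash function code
  const digestLength = next() // multihash digest length in bytes
  const digest = bytes.subarray(offset)
  if (digest.length !== digestLength) throw new Error('digest length mismatch')
  return { version, codec, hashCode, digest }
}

const link = parseLinkPayload(
  '0001711220785197229dc8bb1152945da58e2348f7e279eeded06cc2ca736d0e879858b501'
)
console.log(link.version, '0x' + link.codec.toString(16), '0x' + link.hashCode.toString(16), link.digest.length)
```

Here the codec field reads `0x71` (DAG-CBOR) and the hash function `0x12` (SHA2-256) with a 32-byte digest, so the link itself says how to decode and verify the block it points to.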

[**DAG-JSON**](https://ipld.io/specs/codecs/dag-json/) is similar to DAG-CBOR in that it can represent the entire IPLD Data Model. It uses special forms within the data to denote **Links** (including the encoding of CIDs as their string form) and **Bytes** (including the encoding of raw bytes as base64). DAG-JSON is less space efficient but can be human-readable. For this reason it is the default output format of the `ipfs dag get` command which can be used to inspect IPLD blocks and nodes.

```
$ ipfs dag get /ipld/bafybeibxm2nsadl3fnxv2sxcxmxaco2jl53wpeorjdzidjwf5aqdg7wa6u/Links/1
{"Hash":{"/":"QmYCvbfNbCwFR45HiNP45rwJgvatpiW38D961L5qAhUM5Y"},"Name":"contact","Tsize":200}
```

Note in the output above, a Link (CID) is represented using the form `{"/":"<cid>"}`. Similarly, Bytes are represented as `{"/":{"bytes":"<base64 encoded bytes>"}}`.
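
The two special forms can be sketched in a few lines of JavaScript. This is a simplified illustration, not the DAG-JSON codec itself: `Link` and `encodeDagJsonLike` are hypothetical names, key sorting uses JavaScript's default string ordering, and the base64 is emitted without padding to match the example output in this document.

```js
// Hypothetical stand-in for a CID, holding only its string form.
class Link {
  constructor (cid) {
    this.cid = cid
  }
}

// Rewrite Links and Bytes into their DAG-JSON special forms, recursively.
function toDagJsonForm (value) {
  if (value instanceof Link) return { '/': value.cid }
  if (value instanceof Uint8Array) {
    // Base64 without trailing "=" padding
    const base64 = Buffer.from(value).toString('base64').replace(/=+$/, '')
    return { '/': { bytes: base64 } }
  }
  if (Array.isArray(value)) return value.map(toDagJsonForm)
  if (value !== null && typeof value === 'object') {
    const out = {}
    // Simplified key ordering; a real codec defines its own sorting rules.
    for (const key of Object.keys(value).sort()) out[key] = toDagJsonForm(value[key])
    return out
  }
  return value
}

// JSON.stringify without an indent argument emits no extraneous whitespace.
const encodeDagJsonLike = (value) => JSON.stringify(toDagJsonForm(value))

console.log(encodeDagJsonLike({
  links: new Link('bafyreidykglsfhoixmivffc5uwhcgshx4j465xwqntbmu43nb2dzqwfvae'),
  bytes: new Uint8Array([0x01, 0x03, 0x03, 0x07])
}))
```

Because both forms are ordinary JSON maps keyed by `"/"`, any JSON parser can read the output; only a DAG-JSON-aware decoder turns them back into Links and Bytes.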

[**DAG-JOSE**](https://ipld.io/specs/codecs/dag-jose/) is the newest codec able to support the complete Data Model. It combines the [JOSE](https://jose.readthedocs.io/en/latest/) format with CBOR to provide a standards-based signing and encryption format for flexible IPLD data.
### Examples

## Examples
Given some arbitrary data, [as shown above](/curriculum/ipld/codecs/#what-is-the-data-model), that is compatible with the IPLD Data Model, what does it look like in encoded form?

Given some data (represented here as JavaScript) in memory, compatible with the IPLD Data Model, what does it look like in encoded form?

```js
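import { CID } from 'multiformats/cid' // assumes the `multiformats` package is installed

const data = {
  string: "☺️ we can do strings!",
  ints: 1337,
  floats: 13.37,
  booleans: true,
  lists: [1, 2, 3],
  bytes: new Uint8Array([0x01, 0x03, 0x03, 0x07]),
  // A Link is a CID; parse it from its string form
  links: CID.parse('bafyreidykglsfhoixmivffc5uwhcgshx4j465xwqntbmu43nb2dzqwfvae')
}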
```

**DAG-JSON**:
###### DAG-JSON:

```json
{
Expand All @@ -88,13 +78,9 @@ const data = {
}
```

Note that DAG-JSON strips extraneous whitespace, the above example is present pretty-printed for ease of reading:

```json
{"arrays":[1,2,3],"booleans":true,"bytes":{"/":{"bytes":"AQMDBw"}},"floats":13.37,"ints":1337,"links":{"/":"bafyreidykglsfhoixmivffc5uwhcgshx4j465xwqntbmu43nb2dzqwfvae"},"string":"☺️ we can do strings!"}
```
Note that DAG-JSON strips extraneous whitespace; the above example is pretty-printed for ease of reading.

**DAG-CBOR**:
###### DAG-CBOR:

DAG-CBOR is difficult to illustrate as it's a purely binary format. Our example data encodes to the following binary data represented in hexadecimal:

Expand All @@ -105,40 +91,21 @@ a3d70a3d66737472696e67781ae298baefb88f202077652063616e20646f20737472696e67732168
616e73f5
```

CBOR has a standard diagnostic output that is useful for visualizing this data, however:
However, _you can use [cbor.me](https://cbor.me) on the web, or install [github.com/rvagg/cborg](https://github.com/rvagg/cborg) on the command line, to turn raw CBOR like this into human-readable output._

```
a7 # map(7)
64 # string(4)
696e7473 # "ints"
19 0539 # uint(1337)
65 # string(5)
6279746573 # "bytes"
44 # bytes(4)
01030307 # "\x01\x03\x03\x07"
65 # string(5)
6c696e6b73 # "links"
d8 2a # tag(42)
58 25 # bytes(37)
0001711220785197229dc8bb1152945da58e2348f7 # "\x00\x01q\x12 xQ"]¥#H÷"
e279eeded06cc2ca736d0e879858b501 # "âyîÞÐl Êsm\x0e"
66 # string(6)
617272617973 # "arrays"
83 # array(3)
01 # uint(1)
02 # uint(2)
03 # uint(3)
66 # string(6)
666c6f617473 # "floats"
fb 402abd70a3d70a3d # float(13.37)
66 # string(6)
737472696e67 # "string"
78 1ae298baef # string(22)
e298baefb88f202077652063616e20646f2073747269 # "☺️ we can do stri"
6e677321 # "ngs!"
68 # string(8)
626f6f6c65616e73 # "booleans"
f5 # true
```
### Codecs for Non-IPFS Systems
Every content addressed data system (CAS) defines at least one data storage format. Some formats, such as JSON and CBOR, are common between systems. [CBOR is a binary format](https://cbor.io/) that is similar to JSON but more compact and able to represent more data types.

Popular content addressed systems (CAS) such as Git, Bitcoin and Ethereum have their own unique and custom formats, specifically engineered to their own use-cases.

IPLD has had codecs written for many different content addressed systems:

* Git (multicodec code `0x78`)
* Bittorrent (multicodec code `0x7b` and `0x7c`)
* Bitcoin (multicodec codes `0xb0`, `0xb1`, `0xb2`)
* Ethereum (multicodec codes `0x90` to `0x9a`)
* ... and more

*Caveat: as these are not core to IPFS or IPLD, most of these codecs are usually not as actively maintained as the IPLD-native codecs and may need some love!*

_(You can use [cbor.me](https://cbor.me) via the web, or install [github.com/rvagg/cborg](https://github.com/rvagg/cborg) on the command line to replicate this output if you have raw CBOR to inspect)._
Knowing these codecs and how the data is formatted underneath enables IPLD developers to traverse the respective content addressed systems (CAS). Codecs, along with the Data Model, allow developers to make sense of the custom formats of other CASs.
44 changes: 13 additions & 31 deletions content/en/curriculum/ipld/data-model.md
@@ -1,7 +1,7 @@
---
title: "The IPLD Data Model"
description: "Understand the IPLD data model"
draft: false
draft: true
menu:
curriculum:
parent: "curriculum-ipld"
Expand All @@ -10,14 +10,20 @@ category: lecture
level:
- shallow
- deep
objectives:
show: true
goals:
- "1.4"
subgoals:
- 1.41
- 1.42
- 1.43
- 1.44
---

IPLD defines a **Data Model** that details the forms that data can take in memory, and through which a codec transforms that memory to and from encoded bytes.

Like the JSON data model, the IPLD Data Model includes data **[Kinds](https://ipld.io/docs/schemas/using/authoring-guide/#schema-kinds)** which include **Booleans**, **Strings**, **Ints**, **Floats**, **Null**, **Lists** and **Maps**, but also adds **[Bytes](https://ipld.io/docs/schemas/using/authoring-guide/#bytesprefix-unions-for-bytes)** and **[Links](https://ipld.io/docs/schemas/using/authoring-guide/#links)** (CIDs).

The Data Model is how we reason about data moving through the various states—in-memory, programmatic access and manipulation, and serialization to and from bytes for storage or transfers.

Like the JSON data model, the IPLD Data Model includes data **[Kinds](https://ipld.io/docs/schemas/using/authoring-guide/#schema-kinds)** which include **Booleans**, **Strings**, **Ints**, **Floats**, **Null**, **Lists** and **Maps**, but also adds **[Bytes](https://ipld.io/docs/schemas/using/authoring-guide/#bytesprefix-unions-for-bytes)** and **[Links](https://ipld.io/docs/schemas/using/authoring-guide/#links)** (CIDs).

The data model defines a common representation of basic types that **are easily representable by common programming languages** and **found in the most common and successful serialization formats**.

```js
Expand All @@ -34,32 +40,8 @@ const data = {

Read more about the Data Model at [**ipld.io/docs/data-model**](https://ipld.io/docs/data-model/)

## Blocks and Nodes

IPLD data is quantified in terms of **nodes** and **blocks**. A node is a **point in a graph**, while a block is a collective unit of data that is serialized and hashed to generate a content address (CIDs). Blocks typically include many nodes.

If we define an example *block* of data using JSON:

```json
{"a": ["b", "c"]}
```

We can see 5 *nodes*:

1. The enclosing map
2. The key (the string `"a"`)
3. The list
4. The first list value (the string `"b"`)
5. The second list value (the string `"c"`)

Read more about Nodes and their relationship to other IPLD concepts at [ipld.io/docs/data-model/node](https://ipld.io/docs/data-model/node/)

## Kinds

We refer to the different kinds of representable data in the Data Model as "kinds": **Booleans**, **Strings**, **Ints**, **Floats**, **Null**, **Bytes**, **Lists**, **Maps** and **Links**.

We use the term "kinds" here to disambiguate this from "types", which is a term we use at the [Schemas](ipld-schemas.md) level.

The 'recursive kinds' are **Maps** and **Lists** (since they can contain other kinds).
We refer to the different kinds of representable data in the Data Model as "kinds": **Booleans**, **Strings**, **Ints**, **Floats**, **Null**, **Bytes**, **Lists**, **Maps** and **Links**. The 'recursive kinds' are **Maps** and **Lists** (since they can contain other kinds). We use the term "kinds" here to disambiguate this from "types", which is a term we use at the Schemas level.

Read more about IPLD Kinds and specifics of what we expect regarding their bounds and representation at [ipld.io/docs/data-model/kinds](https://ipld.io/docs/data-model/kinds/)