diff --git a/.github/ISSUE_TEMPLATE/BUG-REPORT.yml b/.github/ISSUE_TEMPLATE/BUG-REPORT.yml new file mode 100644 index 0000000..7b3dffa --- /dev/null +++ b/.github/ISSUE_TEMPLATE/BUG-REPORT.yml @@ -0,0 +1,31 @@ +name: "Bug Report - documentation or registry" +description: Report possible bugs in multibase spec, process docs, and/or the multibase registry. +title: "🐛 [DOC/PROCESS BUG] - " +labels: [ + "bug" +] +body: + - type: textarea + id: description + attributes: + label: "Description" + description: Please enter an explicit description of your issue, + placeholder: Short and explicit description of your incident, ideally with commit-specific link to lines + validations: + required: true + - type: input + id: reprod-url + attributes: + label: "Reproduction URL" + description: Please enter your GitHub URL to provide a reproduction of the issue + placeholder: ex. https://github.com/multiformats/multibase/ + validations: + required: false + - type: textarea + id: context + attributes: + label: "Context" + description: Please provide additional context + placeholder: "Context or external links needed to explain the possible mistake" + validations: + required: false \ No newline at end of file diff --git a/.github/ISSUE_TEMPLATE/NEW-REGISTRATION.yml b/.github/ISSUE_TEMPLATE/NEW-REGISTRATION.yml new file mode 100644 index 0000000..a7d1524 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/NEW-REGISTRATION.yml @@ -0,0 +1,75 @@ +name: "New Registration" +description: Express interest in registering a new encoding +title: "📚 [NEW REGISTRATION] - <title>" +labels: [ + "Registration" +] +body: + - type: input + id: encoding-name + attributes: + label: "Name of encoding" + description: Name this library or system + placeholder: acronyms and abbreviations are fine + validations: + required: false + - type: checkboxes + attributes: + label: "Have read contributing" + description: I have read the [contributing](https://github.com/multiformats/multiformats/blob/master/contributing.md) 
document + options: + - label: I read it! + validations: + required: true + - type: checkboxes + attributes: + label: "Have checked table" + description: I have reviewed the [multiformats mega-table](https://github.com/multiformats/multicodec/blob/master/table.csv) to assess viable sub-namespace for a registry if applicable + options: + - label: I read it! + - type: checkboxes + attributes: + label: "Willing to open a PR" + description: Once my questions are answered and my plan is confirmed, I will open a PR myself that adds the registration and be its change controller, or close this issue myself if I cannot + options: + - label: I will own this registration + - type: input + id: codepoint + attributes: + label: "Proposed codepoint" + description: Please put here the prefix in the target encoding. By tradition, the highest binary value in the encoding alphabet works well and has a built-in mnemonic if it doesn't conflict with any other entries + placeholder: x + validations: + required: true + - type: input + id: varint-value + attributes: + label: "Proposed varint value for registration in multiformats" + description: Please put here the UTF-8 value that corresponds to that target encoding, for inclusion in the multiformats table, formatted as an [unsigned varint](https://github.com/multiformats/unsigned-varint) + placeholder: See mf/unsigned-varint + validations: + required: true + - type: textarea + id: use-case + attributes: + label: "use-case" + description: Please describe the possible use-cases where this additional codec would be helpful, where this encoding is used currently in the wild, etc. 
+ placeholder: Feel free to provide links for context and use-case descriptions + validations: + required: true + - type: textarea + id: specification + attributes: + label: "Description of relevant prior art and status quo" + description: Please describe relevant prior art and, if already specified in a static public document, the algorithms and configurations needed to deterministically encode/decode + placeholder: Links welcome + validations: + required: true + - type: textarea + id: solution_and_rationale + attributes: + label: "Proposed solution and rationale" + description: Please describe at a high level what you are exploring building and current open research questions. + placeholder: Detail welcome + validations: + required: true diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000..4943f9b --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,5 @@ +blank_issues_enabled: true +contact_links: + - name: Protocol Labs Vulnerability Disclosure Team + url: mailto:security@ipfs.io + about: Please do NOT open issues related to security of implementations or spec here without contacting the IPFS security team first. \ No newline at end of file diff --git a/README.md b/README.md index 64aeecd..f2d9218 100644 --- a/README.md +++ b/README.md @@ -5,39 +5,37 @@ [![](https://img.shields.io/badge/freenode-%23ipfs-blue.svg?style=flat-square)](https://webchat.freenode.net/?channels=%23ipfs) [![](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/RichardLitt/standard-readme) -> Self identifying base encodings +> Self-identifying base encodings -Multibase is a protocol for disambiguating the encoding of base-encoded (e.g., -base32, base36, base64, base58, etc.) binary appearing in text. +Multibase is a protocol for disambiguating the "base encoding" used to express binary data in text formats (e.g., base32, base36, base64, base58, etc.) 
from the expression alone. -When text is encoded as bytes, we can usually use a one-size-fits-all encoding -(UTF-8) because we're always encoding to the same set of 256 bytes (+/- the NUL -byte). When that doesn't work, usually for historical or performance reasons, we -can usually infer the encoding from the context. +When text is encoded as bytes, we can usually use a one-size-fits-all encoding (UTF-8) because we're always encoding to the same set of 256 bytes (+/- the NUL byte). +When that doesn't work, usually for historical or performance reasons, we can usually infer the encoding from the context. -However, when bytes are encoded as text (using a base encoding), the base choice -of base encoding is often restricted by the context. Worse, these restrictions -can change based on where the data appears in the text. In some cases, we can -only use `[a-z0-9]`. In others, we can use a larger set of characters but need a -compact encoding. This has lead to a large set of "base encodings", one for -every use-case. Unlike when encoding text to bytes, we can't just standardize -around a single base encoding because there is no optimal encoding for all -cases. +However, when bytes are encoded as text (using a base encoding), the choice of base encoding (and alphabet, and other factors) is often restricted by the context. +Worse, these restrictions can change based on where the data appears in the text. +In some cases, we can only use `[a-z0-9]`; in others, we can use a larger set of characters but need a compact encoding. +This has led to a large set of "base encodings", almost one for every use-case. +Unlike the case of encoding text to bytes, it is impractical to standardize widely around a single base encoding because there is no optimal encoding for all cases. -Unfortunately, it's not always clear *what* base encoding is used; that's where -multibase comes in. 
It answers the question: +As data travels beyond its context, it becomes quite hard to ascertain *which* base encoding of the many possible ones was used; that's where multibase comes in. +Where the data has been prefixed before leaving its context behind, it answers the question: -> Given data d encoded into text s, what base is it encoded with? +> Given binary data `d` encoded into text `s`, what base `b` was used to encode it? + +To answer this question, a single code point is prepended to `s` at the time of encoding, which signals in that new context which `b` can be used to reconstruct `d`. ## Table of Contents - [Format](#format) - [Multibase Table](#multibase-table) +- [Specifications](#specifications) +- [Status](#status) + - [Reserved Terms](#reserved-terms) - [Multibase By Example](#multibase-by-example) - [FAQ](#faq) - [Implementations:](#implementations) - [Disclaimers](#disclaimers) -- [Maintainers](#maintainers) - [Contribute](#contribute) - [License](#license) @@ -46,45 +44,48 @@ multibase comes in. It answers the question: The Format is: ``` -<base-encoding-character><base-encoded-data> +<base-encoding-code-point><base-encoded-data> ``` -Where `<base-encoding-character>` is used according to the multibase table. +Where `<base-encoding-code-point>` is a code representing an entry in the multibase table. 
### Multibase Table The current multibase table is [here](multibase.csv): ``` -encoding, code, description, status -identity, 0x00, 8-bit binary (encoder and decoder keeps data unmodified), default -base2, 0, Binary (01010101), candidate -base8, 7, Octal, draft -base10, 9, Decimal, draft -base16, f, Hexadecimal (lowercase), default -base16upper, F, Hexadecimal (uppercase), default -base32hex, v, RFC4648 case-insensitive - no padding - highest char, candidate -base32hexupper, V, RFC4648 case-insensitive - no padding - highest char, candidate -base32hexpad, t, RFC4648 case-insensitive - with padding, candidate -base32hexpadupper, T, RFC4648 case-insensitive - with padding, candidate -base32, b, RFC4648 case-insensitive - no padding, default -base32upper, B, RFC4648 case-insensitive - no padding, default -base32pad, c, RFC4648 case-insensitive - with padding, candidate -base32padupper, C, RFC4648 case-insensitive - with padding, candidate -base32z, h, z-base-32 (used by Tahoe-LAFS), draft -base36, k, Base36 [0-9a-z] case-insensitive - no padding, draft -base36upper, K, Base36 [0-9a-z] case-insensitive - no padding, draft -base58btc, z, Base58 bitcoin, default -base58flickr, Z, Base58 flicker, candidate -base64, m, RFC4648 no padding, default -base64pad, M, RFC4648 with padding - MIME encoding, candidate -base64url, u, RFC4648 no padding, default -base64urlpad, U, RFC4648 with padding, default -proquint, p, Proquint (https://arxiv.org/html/0901.4016), draft -base256emoji, 🚀, Base256 with custom alphabet using variable-sized-codepoints, draft +Unicode, character, encoding, description, status +U+0000, NUL, none, (No base encoding), reserved +U+0030, 0, base2, Binary (01010101), experimental +U+0031, 1, none, (No base encoding), reserved +U+0037, 7, base8, Octal, draft +U+0039, 9, base10, Decimal, draft +U+0066, f, base16, Hexadecimal (lowercase), final +U+0046, F, base16upper, Hexadecimal (uppercase), final +U+0076, v, base32hex, RFC4648 case-insensitive - no padding - 
highest char, experimental +U+0056, V, base32hexupper, RFC4648 case-insensitive - no padding - highest char, experimental +U+0074, t, base32hexpad, RFC4648 case-insensitive - with padding, experimental +U+0054, T, base32hexpadupper, RFC4648 case-insensitive - with padding, experimental +U+0062, b, base32, RFC4648 case-insensitive - no padding, final +U+0042, B, base32upper, RFC4648 case-insensitive - no padding, final +U+0063, c, base32pad, RFC4648 case-insensitive - with padding, draft +U+0043, C, base32padupper, RFC4648 case-insensitive - with padding, draft +U+0068, h, base32z, z-base-32 (used by Tahoe-LAFS), draft +U+006b, k, base36, Base36 [0-9a-z] case-insensitive - no padding, draft +U+004b, K, base36upper, Base36 [0-9a-z] case-insensitive - no padding, draft +U+007a, z, base58btc, Base58 Bitcoin, final +U+005a, Z, base58flickr, Base58 Flickr, experimental +U+006d, m, base64, RFC4648 no padding, final +U+004d, M, base64pad, RFC4648 with padding - MIME encoding, experimental +U+0075, u, base64url, RFC4648 no padding, final +U+0055, U, base64urlpad, RFC4648 with padding, final +U+0070, p, proquint, Proquint (https://arxiv.org/html/0901.4016), experimental +U+0051, Q, none, (no base encoding), reserved +U+002F, /, none, (no base encoding), reserved +U+1F680, 🚀, base256emoji, base256 with custom alphabet using variable-sized-codepoints, experimental ``` -**NOTE:** Multibase-prefixes are encoding agnostic. "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). For example, in UTF-32, "z" would be `[0x7a, 0x00, 0x00, 0x00]`. Also note the difference between `0x00` (codepoint 0 or 0x00) and `0` (codepoint 48 or 0x30). +**NOTE:** Multibase-prefixes are encoding agnostic. "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). 
In UTF-32, for example, that same "z" would be `[0x7a, 0x00, 0x00, 0x00]` not `[0x7a]`, so detecting and dropping an initial byte of `0x7a` would not suffice to confirm the rest was `base58btc`-encoded bytes; `[0x7a, 0x00, 0x00, 0x00]` would instead be the UTF-32 bytes that correspond to the `z` codepoint for that entry, and the entire byte array would need to be detected and dropped. Also note the difference between `0x00` (codepoint 0 or 0x00) and `0` (codepoint 48 or 0x30). ## Specifications @@ -102,24 +103,26 @@ Below is a list of specs for the underlying base encodings: - `base58flickr` https://datatracker.ietf.org/doc/html/draft-msporny-base58-02, but using a different alphabet - `proquint` [Proquint RFC](rfcs/Proquint.md), which is the [original spec](https://arxiv.org/html/0901.4016) with an added prefix for legibility -## Reserved - -The following codes are _reserved_ for (backwards) compatibility with existing systems. -* `/` - Separator used by [multiaddr](https://github.com/multiformats/multiaddr). -* `1` - Base58 encoded identity multihashes used by libp2p peer IDs. -* `Q` - Base58 encoded sha2-256 multihashes used by libp2p/ipfs for peer IDs and CIDv0. - -If you'd like to switch a project over to multibase and would also like to -reserve a prefix for compatibility, please file an issue. ## Status Each multibase encoding has a status: -* draft - these encodings have been proposed but are not widely implemented and may be removed. -* candidate - these encodings are mature and widely implemented but may not be implemented by all implementations. -* default - these encodings should be implemented by all implementations and are widely used. 
+* reserved - for functional reasons or to avoid collisions with other multi-* registries, this registry cannot accept registrations at this code-point and implementing one unregistered is discouraged for interoperability reasons. +* experimental - these encodings have been proposed but are not widely implemented and may be removed. +* draft - these encodings are mature and widely implemented but may not be implemented by all implementations. +* final - these encodings should be implemented by all implementations and are widely used. +* deprecated - this entry will likely be removed and reassigned in the future and it will not likely become a `final` registration. + +### Reserved Terms + +The following codes are _reserved_ and cannot be registered in the `multibase` table. Note that all three of the Unicode entries, expressed as the [unsigned varint] expression of that Unicode code-point in UTF-8, correspond to widely-used entries in the [multiformats registry group] that could create confusion for some legacy systems handling both binary and multibased structures from other multiformats. While technically the multibase registry is not part of the [multiformats registry group], these reservations minimize risk of confusion when composing multiple multiformats in one data system. + +* `NUL` (n/a) - Legacy data may be found with null-byte-prefixed binary structures mixed in among multibase-encoded ones in arrays of data, although support for this is no longer mandated for conformant implementations. +* `/` (U+002F) - Separator used by [multiaddr]. +* `1` (U+0031) - Base58-encoded identity multihashes used by libp2p peer IDs. +* `Q` (U+0051) - Base58-encoded sha2-256 multihashes used by libp2p/ipfs for peer IDs and CIDv0. ## Multibase By Example @@ -157,11 +160,15 @@ Yes. If i give you `"1214314321432165"` is that decimal? or hex? or something el > Why the strange selection of codes / characters? 
-The code values are selected such that they are included in the alphabets of the base they represent. For example, `f` is the base code for `base16 (hex)`, because `f` is in hex's 16 character alphabet. Note that the alphabets can be encoded in UTF8, and most can be encoded in ASCII. We have not found a case needing something else. +The code values are selected such that they are included in the alphabets of the base they represent. +For example, `f` is the base code for `base16 (hex)`, because `f` is in hex's 16 character alphabet. +Note that most of the alphabets used can be encoded in UTF-8, and most but not all can be encoded in ASCII. +We have not yet found a case needing something else. > Don't we have to agree on a table of base encodings? -Yes, but we already have to agree on base encodings, so this is not hard. The table even leaves some room for custom encodings. +Yes, but we already have to agree on base encodings, so this is not hard. +The table even leaves some room for custom encodings and is intended to work both in contexts where the encodings are known or agreed on and in open-world or brownfield contexts where these may vary. ## Implementations: @@ -188,16 +195,26 @@ Yes, but we already have to agree on base encodings, so this is not hard. The ta ## Disclaimers -Warning: **obviously multibase changes the first character depending on the encoding**. Do not expect the value to be exactly the same. Remove the multibase prefix before using the value. +Warning: **obviously multibase changes the first character depending on the encoding**. +Do not expect the value to be exactly the same. +Remove the multibase prefix before using the value. ## Contribute -Contributions welcome. Please check out [the issues](https://github.com/multiformats/multibase/issues). +Contributions welcome. 
+Please check out [the issues](https://github.com/multiformats/multibase/issues) and read the [contributing document](https://github.com/multiformats/multiformats/blob/master/contributing.md) for the greater multiformats project before opening your first issue, as the workflow and the relation of multibase to the greater project both benefit from this context. +That document has more information on how we work, and about contributing in general. -Check out our [contributing document](https://github.com/multiformats/multiformats/blob/master/contributing.md) for more information on how we work, and about contributing in general. Please be aware that all interactions related to multiformats are subject to the IPFS [Code of Conduct](https://github.com/ipfs/community/blob/master/code-of-conduct.md). - -Small note: If editing the README, please conform to the [standard-readme](https://github.com/RichardLitt/standard-readme) specification. +If you'd like to switch a project over to multibase, whether by creating a new multibase implementation or building on one of those listed above, please file an issue in this repository using the "Interested in implementing" issue template. +If you would also like to reserve a prefix for compatibility, please file a separate issue in this repository using the "New Registration" issue template. ## License -This repository is only for documents. All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license © 2016 Protocol Labs Inc. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc. +This repository is only for documents. +All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license © 2016 Protocol Labs Inc. +Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc. 
+ +[multiaddr]: https://github.com/multiformats/multiaddr +[multiformats registry group]: https://github.com/multiformats/multicodec/blob/master/table.csv +[unsigned varint]: https://github.com/multiformats/unsigned-varint +[code point]: https://infra.spec.whatwg.org/#code-points \ No newline at end of file diff --git a/multibase.csv b/multibase.csv index 8ba3b06..2472ed9 100644 --- a/multibase.csv +++ b/multibase.csv @@ -1,26 +1,29 @@ -encoding, code, description, status -identity, 0x00, 8-bit binary (encoder and decoder keeps data unmodified), default -base2, 0, Binary (01010101), candidate -base8, 7, Octal, draft -base10, 9, Decimal, draft -base16, f, Hexadecimal (lowercase), default -base16upper, F, Hexadecimal (uppercase), default -base32hex, v, RFC4648 case-insensitive - no padding - highest char, candidate -base32hexupper, V, RFC4648 case-insensitive - no padding - highest char, candidate -base32hexpad, t, RFC4648 case-insensitive - with padding, candidate -base32hexpadupper, T, RFC4648 case-insensitive - with padding, candidate -base32, b, RFC4648 case-insensitive - no padding, default -base32upper, B, RFC4648 case-insensitive - no padding, default -base32pad, c, RFC4648 case-insensitive - with padding, candidate -base32padupper, C, RFC4648 case-insensitive - with padding, candidate -base32z, h, z-base-32 (used by Tahoe-LAFS), draft -base36, k, Base36 [0-9a-z] case-insensitive - no padding, draft -base36upper, K, Base36 [0-9a-z] case-insensitive - no padding, draft -base58btc, z, Base58 bitcoin, default -base58flickr, Z, Base58 flicker, candidate -base64, m, RFC4648 no padding, default -base64pad, M, RFC4648 with padding - MIME encoding, candidate -base64url, u, RFC4648 no padding, default -base64urlpad, U, RFC4648 with padding, default -proquint, p, Proquint (https://arxiv.org/html/0901.4016), draft -base256emoji, 🚀, Base256 with custom alphabet using variable-sized-codepoints, draft \ No newline at end of file +Unicode, character, encoding, description, 
status +U+0000, NUL, none, (No base encoding), reserved +U+0030, 0, base2, Binary (01010101), experimental +U+0031, 1, none, (No base encoding), reserved +U+0037, 7, base8, Octal, draft +U+0039, 9, base10, Decimal, draft +U+0066, f, base16, Hexadecimal (lowercase), final +U+0046, F, base16upper, Hexadecimal (uppercase), final +U+0076, v, base32hex, RFC4648 case-insensitive - no padding - highest char, experimental +U+0056, V, base32hexupper, RFC4648 case-insensitive - no padding - highest char, experimental +U+0074, t, base32hexpad, RFC4648 case-insensitive - with padding, experimental +U+0054, T, base32hexpadupper, RFC4648 case-insensitive - with padding, experimental +U+0062, b, base32, RFC4648 case-insensitive - no padding, final +U+0042, B, base32upper, RFC4648 case-insensitive - no padding, final +U+0063, c, base32pad, RFC4648 case-insensitive - with padding, draft +U+0043, C, base32padupper, RFC4648 case-insensitive - with padding, draft +U+0068, h, base32z, z-base-32 (used by Tahoe-LAFS), draft +U+006b, k, base36, Base36 [0-9a-z] case-insensitive - no padding, draft +U+004b, K, base36upper, Base36 [0-9a-z] case-insensitive - no padding, draft +U+007a, z, base58btc, Base58 Bitcoin, final +U+005a, Z, base58flickr, Base58 Flickr, experimental +U+006d, m, base64, RFC4648 no padding, final +U+004d, M, base64pad, RFC4648 with padding - MIME encoding, experimental +U+0075, u, base64url, RFC4648 no padding, final +U+0055, U, base64urlpad, RFC4648 with padding, final +U+0070, p, proquint, Proquint (https://arxiv.org/html/0901.4016), experimental +U+0051, Q, none, (no base encoding), reserved +U+002F, /, none, (no base encoding), reserved +U+1F680, 🚀, base256emoji, base256 with custom alphabet using variable-sized-codepoints, experimental
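
The prefix rule that this change describes in the README (a single code point is prepended to the encoded text and looked up in the table) can be sketched in Python. This is only an illustration under stated assumptions, not one of the listed implementations: the names `MULTIBASE_PREFIXES`, `RESERVED`, `split_multibase`, and `decode` are hypothetical, only a subset of the table is included, and only two encodings are actually decoded.

```python
import base64
from typing import Tuple

# Hypothetical subset of the multibase table: prefix code point -> encoding name.
MULTIBASE_PREFIXES = {
    "f": "base16",
    "F": "base16upper",
    "b": "base32",
    "z": "base58btc",
    "m": "base64",
    "u": "base64url",
    "\U0001F680": "base256emoji",  # the rocket emoji is a single code point
}

# Code points the table marks as reserved (NUL, "1", "Q", "/").
RESERVED = {"\x00", "1", "Q", "/"}


def split_multibase(text: str) -> Tuple[str, str]:
    """Split a multibase string into (encoding name, base-encoded payload)."""
    if not text:
        raise ValueError("empty string has no multibase prefix")
    # Python strings index by code point, so this works for non-ASCII prefixes too.
    prefix, payload = text[0], text[1:]
    if prefix in RESERVED:
        raise ValueError(f"reserved prefix U+{ord(prefix):04X}")
    if prefix not in MULTIBASE_PREFIXES:
        raise ValueError(f"unknown prefix U+{ord(prefix):04X}")
    return MULTIBASE_PREFIXES[prefix], payload


def decode(text: str) -> bytes:
    """Decode the two encodings this sketch supports: base16 and base64."""
    encoding, payload = split_multibase(text)
    if encoding == "base16":
        return bytes.fromhex(payload)
    if encoding == "base64":
        # The table's base64 entry is unpadded; restore padding for the stdlib decoder.
        return base64.b64decode(payload + "=" * (-len(payload) % 4))
    raise NotImplementedError(encoding)
```

For instance, `decode("f68656c6c6f")` and `decode("maGVsbG8")` both yield `b"hello"`, while `split_multibase("Qm...")` raises because `Q` is reserved. Operating on a decoded `str` rather than raw bytes keeps the lookup independent of the text encoding, which is the point of the README's note about UTF-32.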