multiformats · sg495 · Oct 20, 2021
diff --git a/README.md b/README.md
@@ -59,7 +59,7 @@ The current multibase table is [here](multibase.csv):
 encoding,          code, description,                                              status
 identity,          0x00, 8-bit binary (encoder and decoder keeps data unmodified), default
 base2,             0,    binary (01010101),                                        candidate
-base8,             7,    octal,                                                    draft
+base8,             7,    octal (see RFC),                                          draft
 base10,            9,    decimal,                                                  draft
 base16,            f,    hexadecimal,                                              default
 base16upper,       F,    hexadecimal,                                              default
@@ -80,10 +80,27 @@ base64,            m,    rfc4648 no padding,
 base64pad,         M,    rfc4648 with padding - MIME encoding,                     candidate
 base64url,         u,    rfc4648 no padding,                                       default
 base64urlpad,      U,    rfc4648 with padding,                                     default
-proquint,          p,    PRO-QUINT https://arxiv.org/html/0901.4016,               draft
+proquint,          p,    pro-quint https://arxiv.org/html/0901.4016 (see RFC),     draft
 ```
 
-**NOTE:** Multibase-prefixes are encoding agnostic. "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). For example, in UTF-32, "z" would be `[0x7a, 0x00, 0x00, 0x00]`.
+**NOTE:** Multibase-prefixes are encoding agnostic: "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). For example, in UTF-32, "z" would be `[0x7a, 0x00, 0x00, 0x00]`. In particular, the multibase code 0x00 listed for the identity encoding is the non-printable ASCII/UTF-8 character with codepoint 0x00, while the multibase code 0 listed for base2 is the ASCII/UTF-8 character "0" (which has codepoint 0x30).
+
+## Specifications
+
+Below is a list of specs for the underlying base encodings:
+
+- `identity` [identity RFC](rfcs/identity.md)
+- `base2` [base2 RFC](rfcs/Base2.md)
+- `base8` [base8 RFC](rfcs/Base8.md), similar to [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base10` [base10 RFC](rfcs/Base10.md)
+- `base36` [base36 RFC](rfcs/Base36.md)
+- `base16*` [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base32*` (except for `base32z`) [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base32z` [human-oriented base32 spec](https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt)
+- `base64*` [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base58btc` [base58 draft](https://datatracker.ietf.org/doc/html/draft-msporny-base58-02)
+- `base58flickr` [base58 draft](https://datatracker.ietf.org/doc/html/draft-msporny-base58-02), but using alphabet `123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ`
+- `proquint` [proquint RFC](rfcs/PRO-QUINT.md), which is the [original spec](https://arxiv.org/html/0901.4016) with an added prefix for legibility
 
 ## Reserved
 
@@ -160,6 +177,7 @@ Yes, but we already have to agree on base encodings, so this is not hard. The ta
 - [scala-multibase](//github.com/fluency03/scala-multibase)
 - [cpp-multibase](//github.com/cpp-ipfs/cpp-multibase)
 - [ruby-multibase](//github.com/sleeplessbyte/ruby-multibase)
+- `multibase` sub-module of Python module [multiformats](//github.com/hashberg-io/multiformats)
 - [Add yours here!](//github.com/multiformats/multibase/edit/master/README.md)
 
 

diff --git a/multibase.csv b/multibase.csv
@@ -1,7 +1,7 @@
 encoding,          code, description,                                              status
 identity,          0x00, 8-bit binary (encoder and decoder keeps data unmodified), default
 base2,             0,    binary (01010101),                                        candidate
-base8,             7,    octal,                                                    draft
+base8,             7,    octal (see RFC),                                          draft
 base10,            9,    decimal,                                                  draft
 base16,            f,    hexadecimal,                                              default
 base16upper,       F,    hexadecimal,                                              default
@@ -22,4 +22,4 @@ base64,            m,    rfc4648 no padding,
 base64pad,         M,    rfc4648 with padding - MIME encoding,                     candidate
 base64url,         u,    rfc4648 no padding,                                       default
 base64urlpad,      U,    rfc4648 with padding,                                     default
-proquint,          p,    PRO-QUINT https://arxiv.org/html/0901.4016,               draft
+proquint,          p,    pro-quint https://arxiv.org/html/0901.4016 (see RFC),     draft
diff --git a/rfcs/Base2.md b/rfcs/Base2.md
@@ -16,7 +16,7 @@ order, where each byte of the array is set to the character `1`, if the
 corresponding bit in the byte is set, and the character `0` if the corresponding
 bit is unset.
 
-For example, `[0x58, 0x59, 0x60]` can be converted to multibase base2 as
+For example, `[0x58, 0x59, 0x5a]` can be converted to multibase base2 as
 follows:
 
 ```

diff --git a/rfcs/PRO-QUINT.md b/rfcs/PRO-QUINT.md
@@ -1,7 +1,16 @@
 # PRO-QUINT
 
-See: https://arxiv.org/html/0901.4016 ([/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa](https://dweb.link/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa)).
+For the original proquint specification, see: https://arxiv.org/html/0901.4016 ([/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa](https://dweb.link/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa)).
 
-While the multibase prefix is `p`, the "full" prefix is actually `pro-`. This way, proquints are always easily pronouncable. For example
+The multibase prefix for proquints is the character `p`. The base-encoded data is the proquint encoding defined by the original specification, preceded by an additional `ro-` prefix:
 
-`127.0.0.1`, as a multibase proquint encoded number, is `pro-lusab-babad`.
+```
+<multibase-prefix-character><additional-prefix-characters><proquint-encoded-data>
+```
+
+The resulting full prefix for the actual proquint-encoded data is `pro-`, making multibase-encoded proquints easily pronounceable.
+For example, the proquint encoding of the bytestring `[127, 0, 0, 1]` (the data for the IPv4 address `127.0.0.1`) is `lusab-babad`, so the corresponding multibase-encoded proquint string is:
+
+```
+pro-lusab-babad
+```
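
The layout described in `rfcs/PRO-QUINT.md` can be sketched in Python as follows. This is only an illustration, not part of the patch: the function names are hypothetical, the consonant and vowel tables are those of the original proquint specification, and rejecting odd-length input is an assumption, since the original specification only covers 16-bit words.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 consonants, 4 bits each
VOWELS = "aiou"                  # 4 vowels, 2 bits each


def proquint_word(word: int) -> str:
    """Encode one 16-bit word as a con-vow-con-vow-con quint (4+2+4+2+4 bits)."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def encode_multibase_proquint(data: bytes) -> str:
    """Hypothetical helper: multibase prefix `p`, then `ro-`, then the quints."""
    if len(data) % 2:
        # Assumption: odd-length input is rejected rather than padded.
        raise ValueError("proquint input must have an even number of bytes")
    quints = [
        proquint_word(int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    ]
    return "p" + "ro-" + "-".join(quints)
```

With the bytes of `127.0.0.1`, `encode_multibase_proquint(bytes([127, 0, 0, 1]))` gives `pro-lusab-babad`, matching the example in the RFC.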
diff --git a/rfcs/identity.md b/rfcs/identity.md
@@ -0,0 +1,41 @@
+# Identity
+
+The multibase identity prefix is the non-printable ASCII/UTF-8 character with codepoint 0x00. Note that this is different from the multibase prefix 0 listed for base2, which is the ASCII/UTF-8 character "0" with codepoint 0x30.
+
+
+## Encoding
+
+A byte array `b` is encoded as the Unicode string `s` whose UTF-8 bytes are the byte array `b` prefixed with a single zero byte.
+
+Below is a minimal implementation in Python, for clarification:
+
+```py
+def encode_identity(b: bytes) -> str:
+    utf8_bytes = b"\x00"+b
+    return utf8_bytes.decode("utf-8")
+```
+
+## Decoding
+
+A Unicode string `s` is decoded by obtaining its UTF-8 bytes and dropping the leading byte. The UTF-8 byte array must be non-empty and the leading byte must be zero.
+
+Below is a minimal implementation in Python, for clarification:
+
+```py
+def decode_identity(s: str) -> bytes:
+    utf8_bytes = s.encode("utf-8")
+    if not utf8_bytes or utf8_bytes[0] != 0:
+        raise ValueError("String not identity-encoded.")
+    return utf8_bytes[1:]
+```
+
+## Examples
+
+```py
+>>> encode_identity(bytes([0x31, 0x63, 0x57]))
+'\x001cW'
+>>> decode_identity("\x001cW")
+b'1cW'
+>>> list(decode_identity("\x001cW"))
+[49, 99, 87] # [0x31, 0x63, 0x57]
+```
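
The two functions from `rfcs/identity.md` can be exercised together as in the sketch below. It restates them so that the block is self-contained and adds a hypothetical helper, `can_identity_encode`, which is not part of the patch. The helper relies only on standard Python behaviour: `bytes.decode("utf-8")` raises `UnicodeDecodeError` when the prefixed byte array is not valid UTF-8, so the minimal implementation accepts only such inputs.

```python
def encode_identity(b: bytes) -> str:
    # Prefix with the 0x00 multibase code, then read the result as UTF-8.
    return (b"\x00" + b).decode("utf-8")


def decode_identity(s: str) -> bytes:
    utf8_bytes = s.encode("utf-8")
    # The leading byte must be the identity prefix 0x00, not "0" (0x30).
    if not utf8_bytes or utf8_bytes[0] != 0:
        raise ValueError("String not identity-encoded.")
    return utf8_bytes[1:]


def can_identity_encode(b: bytes) -> bool:
    """Hypothetical helper: True if encode_identity accepts `b`."""
    try:
        encode_identity(b)
    except UnicodeDecodeError:
        return False
    return True
```

A string starting with the base2 prefix, such as `"0101"`, is rejected by `decode_identity` because its first UTF-8 byte is 0x30 rather than 0x00.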
diff --git a/tests/case_insensitivity.csv b/tests/case_insensitivity.csv
@@ -1,4 +1,4 @@
-non-canonical encoding, "hello world"
+non-canonical encoding, "yes mani !"
 base16, "f68656c6c6f20776F726C64"
 base16upper, "F68656c6c6f20776F726C64"
 base32, "bnbswy3dpeB3W64TMMQ"
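
The mixed-case rows in `tests/case_insensitivity.csv` can be checked with a short Python sketch using only the standard library. The function name is hypothetical and only the base16 and base32 prefixes shown above are handled. `bytes.fromhex` accepts digits in either case, and `base64.b32decode` with `casefold=True` accepts lowercase letters once the stripped padding is restored.

```python
import base64


def decode_case_insensitive(s: str) -> bytes:
    """Hypothetical helper: decode a base16 or base32 multibase string,
    accepting upper-case and lower-case digits alike."""
    prefix, payload = s[0], s[1:]
    if prefix in "fF":
        # bytes.fromhex accepts hexadecimal digits in either case.
        return bytes.fromhex(payload)
    if prefix in "bB":
        # Restore the padding that multibase base32 omits, then fold case.
        padded = payload + "=" * (-len(payload) % 8)
        return base64.b32decode(padded, casefold=True)
    raise ValueError(f"unsupported multibase prefix: {prefix!r}")
```

All three payloads shown above decode to the bytes of `hello world`, whichever case the digits use.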