Skip to content
This repository has been archived by the owner on Jun 2, 2022. It is now read-only.

wip: migrate to latest multiformats #16

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

wip: migrate to latest multiformats #16

wants to merge 7 commits into from

Conversation

mikeal
Copy link
Contributor

@mikeal mikeal commented Oct 1, 2020

Still need to port some codecs and fill in some coverage, but all the tests are passing against dag-cbor.

This gets the Block interface up to date with @Gozala’s latest refactor. I’d like to take a moment to note how good the
work has been on that refactor. You can see it here in the diff, everywhere that we touch multiformats it’s a little bit
cleaner and also a little more explicit and even adds some features. All of this while getting rid of a lot of the harder to work
with parts of this API that came from the prior multiformats dependency injection, and we still get to keep all the same features and codec registry just hoisted to the Block interface. This was a some really great work on @Gozala’s part and it really shows.

@rvagg
Copy link
Member

rvagg commented Oct 1, 2020

It's so hard to review this when most of it is just indentation changes! It seems that the major changes are really just in hoisting the Block.codecs registry up here, with add() and friends, but otherwise it's the same. Correct?

@Gozala
Copy link

Gozala commented Oct 1, 2020

It's so hard to review this when most of it is just indentation changes!

@rvagg There is this really handy thing in case you were not aware

image

Copy link

@Gozala Gozala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mikeal for updating this and other libraries to changes in multiformats. I'm hesitant to express strong opinions here. I personally think that compositional approach illustrated in dag.js from multiformats/js-multiformats#38 is a better alternative to this, but I am obviously biased.

For better clarity in review communications I use following iconography in my reviews

  • 🚨 Problem that needs to be addressed (please change)
  • 💣 Could be a problem (please use your own judgment to decide)
  • 💭 Just a though or on opinion (take it or leave it)
  • 📝 Note (do what you want with it)
  • ❓Question (likely a signal that code comment would be good there)

index.js Outdated
const b = coerce(value)
return coerce(b.buffer.slice(b.byteOffset, b.byteOffset + b.byteLength))
}
const reader = createReader(CID)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓Any reason to pass CID in vs just let createReader import it instead ? If there is a the reason it would be good to have a comment explaining.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, i was just trying to get it working again but there’s no need for the dep injection anymore.

index.js Outdated
get codec () {
if (this.opts.code) {
this.opts.codec = multicodec.get(this.opts.code).name
get hasher () {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I think it would make things a lot cleaner if there were dedicated Block so that when you create it with CID an appropriate hasher would be pulled and set as a property etc... That way if invariants don't hold (nor cid nor codec was provided, or cid is provided and hasher is not available) error is thrown at construction rather than later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, there’s a bigger refactor that I’m going to want to do to the construct and class methods that i haven’t done yet. I wanted to first see the existing implementation ported over without much API changes, but I’m still debating with myself whether or not i’m going to tackle the bigger constructor refactor before i get this merged or just do it now so that I can combine the breaking changes to all the APIs.

index.js Outdated
this.opts.codec = multicodec.get(this.opts.cid.code).name
}
return this.opts.codec
if (!this.opts.hasher) throw new Error('Do not have hash implementation')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💣 I think throwing error from property accessory is a bad pattern, APIs that do that tend to be really painful to work with (e.g. some DOM bindings that do it). I would suggest return null here and and do the throw from the method that needs to use hasher instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is a bit of a hack that I plan on factoring out.

index.js Outdated
if (this.opts.cid) return this.opts.cid.code
if (!this.opts.code) {
this.opts.code = multicodec.get(this.codec).code
get codec () {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓Why not return codec from here just like hasher returns hasher ? User can lookup .name if that's what the y need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t want to take that breaking change yet because it’s going to cause a lot of churn in the tests. I also hate to change the value of existing properties rather than just migrating to new properties because the errors during migration tend to be much more painful, but maybe it’s worth it here since codec is really the only good name for this.

index.js Outdated
const data = this.encodeUnsafe()
const hash = await this.hasher.digest(data)
if (bytes.equals(cid.multihash.bytes, hash.bytes)) return true
throw new Error('Bytes do not match')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I think this another case where dedicated constructors would make more sense. E.g. you could have a Block.createUnsafe(cid, bytes) and Block.createSafe(cid, bytes) where later delegates to former after validation. That would also get rid of this in cases where you know no validation is needed like when you create block e.g from JS value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning on moving to dedicated constructors but hadn’t identified these as separate methods, but I like it. Although I think I’m just going to go with create() and createUnsafe() to match the other unsafe API forms.

index.js Outdated
async equals (block) {
if (block === this) return true
const cid = await this.cid()
if (block.asCID === block) return cid.equals(block)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I find the choice to return true when comparing block to cid questionable, but it was the already so 🤷‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be cleaned up by just renaming the variable. The API is meant to take either a CID or a Block, which you may not like, but that’s the intention and the fact that the variable is named block makes this very confusing.

@mikeal
Copy link
Contributor Author

mikeal commented Oct 1, 2020

Thanks @mikeal for updating this and other libraries to changes in multiformats. I'm hesitant to express strong opinions here. I personally think that compositional approach illustrated in dag.js from multiformats/js-multiformats#38 is a better alternative to this, but I am obviously biased.

I don’t want to diverge too far from the current approach in terms of presenting a class with along with some methods to help you create instances, along with a registry of codecs. It’s not that I’m adamant that this is the right approach for every case, but I do think that this approach works quite well in some cases and I want to make sure the libraries underneath Block (codecs and multiformats) support this approach.

However, I’d prefer to have this library presenting and producing block instances that conform to a lower level block interface defined by multiformats. And that interface should be designed to support this approach as well as the approach you’re exploring in dag.js. That obviously isn’t what is here yet, but I wanted to see what this looks like on top of the latest multiformats types before I tried to bring design considerations back into a muliformats/block interface.

Copy link

@Gozala Gozala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikeal per your request I have provided some feedback and made several suggestions. I do still however feel that this Block class tries to combine following:

  1. Encode JS value
  2. Decode block from bytes
  3. Decode block from CID + bytes

I understand that this allows recipient of the Block not to care which from the above 3 scenarios took place (which is great), but unfortunately it is forced upon, meaning recipient has no way of choosing alternative code paths optimized for specific scenario because there is no way to distinguish between them (without probing internal properties). I think it is possible to both:

  1. Free recipient from having to know which scenario took place (by providing common interface)
  2. Allow scenario specific optimizations when recipient cares.

That is why implementation I have been proposing attempted to separate those use case from each other while putting those under the same Block base class.

Another thing that I find to be questionable is that laziness is weaved into all this. Don't get me wrong lazy data structures are great, but unpredictable performance characteristics aren't (as many haskellers would testify). So I would highly recommend further separating concerns there by having:

  1. Block as in materialized instance that has both data and bytes, so it no longer needs codec (but may still need a hasher).
  2. BlockDecoder that is a lazy Block that performs decode on demand. It could even subclass Block and override toValue() method (or something along those lines).
  3. BlockEncoder that is a lazy Block that performs encode on demand. It could also can subclass Block and override toBytes() method (or something along those lines).

That would allow consumer not to care when they don't need to, but would enable appropriate treatment when that makes sense. Furthermore this creates an opportunity to do non-lazy decode / encode where that is more appropriate.

if (!codec) throw new Error('Missing codec')
return new Block({ data, codec, hasher })
}
Block.createUnsafe = (data, cid, { hasher, codec } = {}) => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think createUnsafeDecoder is more descriptive name, and also makes it clear why codec needs to be provided. I think it should not take hasher because it does not need it.

Suggested change
Block.createUnsafe = (data, cid, { hasher, codec } = {}) => {
Block.createUnsafeDecoder = (data, cid, { codec } = {}) => {

if (!codec) throw new Error(`Missing codec ${cid.code}`)
return new Block({ data, cid, codec, hasher: hasher || null })
}
Block.create = async (data, cid, { hasher, codec } = {}) => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think calling this decode would be more descriptive and explain why it's async.

Suggested change
Block.create = async (data, cid, { hasher, codec } = {}) => {
Block.decode = async (data, cid, { hasher, codec } = {}) => {

BlockWithIs.multiformats = multiformats
BlockWithIs.CID = CID
return BlockWithIs
return Block.createUnsafe(data, cid, { hasher, codec })
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this this is a reason why createUnsafe to takes hasher. Just using constructor here can simplify createUnsafe.

Suggested change
return Block.createUnsafe(data, cid, { hasher, codec })
return new Block({ cid, data, codec, hasher })

index.js Outdated
Block.createUnsafe = (data, cid, { hasher, codec } = {}) => {
codec = codec || Block.codecs.get(cid.code)
if (!codec) throw new Error(`Missing codec ${cid.code}`)
return new Block({ data, cid, codec, hasher: hasher || null })
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires change above.

Suggested change
return new Block({ data, cid, codec, hasher: hasher || null })
return new Block({ data, cid, codec })

@Gozala
Copy link

Gozala commented Oct 5, 2020

As broader question, I’m curios what are the use cases for lazy block encode and lazy block decode. Are there cases where you'd want to create a Block but never materialize it (not talking about CID, but turning bytes to JS value or other way round) ? I am inclined to think that if you do never end up materializing a block you're probably choosing a wrong interface for your library or a program. If you do end up materializing it later on in the program than all you achieve by deferring is making performance characteristics less deterministic, because bunch of computations can (and usually will) accumulate and then will be forced in one large batch that tends to create spikes on load.

That is just to say that it's best to make conscious choice when deferring computations. It also makes reasoning about code (as in when reading it) a lot easier (less context is required to infer it's behavior)

mikeal and others added 4 commits October 5, 2020 14:48
Co-authored-by: Irakli Gozalishvili <[email protected]>
Co-authored-by: Irakli Gozalishvili <[email protected]>
Co-authored-by: Irakli Gozalishvili <[email protected]>
Co-authored-by: Irakli Gozalishvili <[email protected]>
@mikeal
Copy link
Contributor Author

mikeal commented Oct 5, 2020

Are there cases where you'd want to create a Block but never materialize it (not talking about CID, but turning bytes to JS value or other way round)

Creating blocks you never encode is surprisingly common in multi-block data structures during mutation. You often have interfaces that create a bunch of blocks on mutation, but many of those are then orphaned by another mutation operation before they are ever written to the block store. Ideally you could factor this out using bulk operations but we don’t have reasonable bulk operations for some data structures (like our HAMT).

Creating blocks you never decode is very common because that’s what happens during a lot of replication operations. Whenever you’re moving data from one store to another you can avoid a decode.

@Gozala
Copy link

Gozala commented Oct 5, 2020

Creating blocks you never decode is very common because that’s what happens during a lot of replication operations. Whenever you’re moving data from one store to another you can avoid a decode.

I wonder if this indicative of the other gap multiformats/js-multiformats#37

It is also something I tried to represent via BlockView type.

@Gozala
Copy link

Gozala commented Oct 5, 2020

Creating blocks you never encode is surprisingly common in multi-block data structures during mutation. You often have interfaces that create a bunch of blocks on mutation, but many of those are then orphaned by another mutation operation before they are ever written to the block store. Ideally you could factor this out using bulk operations but we don’t have reasonable bulk operations for some data structures (like our HAMT).

This sounds similar to what I tried to represent via BlockDraft interface. I think you could have layer for building up dags and materializing or whiting them in one operation, but that does not necessarily mean that what you build up with should be a block. In fact the reason my BlockDraft just holds value codec code and hasher code is so that you could build things up on one thread and materialize them on the other one (where you do have codecs and hashers available). If you choose to represent that via Blocks as well doing this on different threads (or processes) becomes a lot more trickier.

@Gozala
Copy link

Gozala commented Oct 5, 2020

Please note that current approach complicates things for both cases:

  1. Actor that does replication becomes concerned with codecs and hashers, which is incidental, it could just replicate CID+Block|Code+Block pairs, it's just that does not have a canonical representation, which then Block attempts to fill.

  2. Actor that is building up dag structure, does not really need to be concerned with codecs or hashers. It just needs to create compact recipe so that another actor could materialize the dag.

Note: That I fully understand that when building up dags you may have some pre-existing (already materialized) blocks interleaved with new (not yet materialized) blocks. But argument I'm making is that actor doing the dag building needs to be able to be able to distinguish so it can choose the best strategy to transfer the recipe in the most effective way to another actor (which may be on the separate thread, process or even machine). In fact you'd likely even want some way to represent remote blocks so that they don't need to roundtrip and having to do all this through lazy Block representation isn't necessarily helpful.

@mikeal
Copy link
Contributor Author

mikeal commented Oct 6, 2020

@Gozala latest push splits the BlockEncoder and BlockDecoder classes out.

class BlockEncoder extends Block {
encodeUnsafe () {
if (this._data) return this._data
if (!this._codec) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is obsolete because it's validated at the construction site.

class BlockDecoder extends Block {
decodeUnsafe () {
if (typeof this.opts._source !== 'undefined') return this._source
if (!this._codec) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is obsolete because validation occurs at construction site.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out, there is a case in which you want to create a Block without a codec attached. i actually have to drop the constructor error and move that check to the encoder/decoder methods, but it’ll still be necessary here so that you get a good error if you try to decode a block that doesn’t have a codec attached

if (cid) setImmutable(this, '_cid', cid)
if (data) setImmutable(this, '_data', data)
if (!source && (!data || !cid)) throw new Error('Missing required argument')
if (source && (!codec || !hasher)) throw new Error('Missing required argument')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think things would be a lot cleaner if Block.encoder and Block.decoder would do above invariant checks instead of deferring that to shared constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, i’m going to move all of these in the new patch to js-multiformats.

@mikeal
Copy link
Contributor Author

mikeal commented Oct 6, 2020

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants