This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

feat: implement dag import/export #3728

Merged
merged 20 commits into from
Jul 27, 2021

Conversation

@rvagg (Member) commented Jun 28, 2021

New version of #2953
Fixes: #2745

WIP, just export implemented so far, no tests yet.

It has to lean on the new multiformats and I've pulled in the new IPLD codecs too in anticipation of #3556 eventually sorting out the slight dependency bloat introduced by this (i.e. we have both versions of the codecs in this branch for ipfs-cli).

@rvagg changed the title from "feat: implement ipfs dag export <root>" to "feat: implement dag import/export" on Jun 28, 2021
@rvagg (Member, Author) commented Jun 28, 2021

added ipfs dag import [path...] functionality too

still no tests yet, though it all works according to basic manual testing.

@rvagg (Member, Author) commented Jun 29, 2021

@achingbrain can I get some advice on this please? I tried to mirror the go-ipfs approach, where export and import are "commands" rather than "core" API features, so there's more logic in here that doesn't get deferred to "core". I think go-ipfs uses that command infrastructure for both CLI and server, but it looks like we have two separate systems for CLI and server, and everything gets implemented in three places (cli, http-server and http-client), with a lot of it deferring to "core", so four or five places, with grpc probably adding another?

Maybe I should be implementing this somewhere else and deferring the input and output streams somehow? import accepts either stdin or 1 or more files. export streams to stdout (but I wouldn't mind adding an --output which go-ipfs should have got). We probably need these things wired up through to server and even http client, but that leaves the question of where this functionality needs to live and how many places it needs to be wired up to and I'm not familiar enough with the intended architecture here.

I have 8 test fixtures totalling 2.9M (1.1M if I compress them) for this feature, but I'm noticing that ipfs-cli relies almost exclusively on mocks to do its testing, which is probably not ideal for these. So I'll need suggestions on where all that should go and examples for how to set things up for repetitive runs against reset block stores.

@achingbrain (Member):

@rvagg typically you'd add functionality to ipfs-core and expose it in ipfs-cli and ipfs-http-server. I think we'd want to expose this over the HTTP API same as go-ipfs so we can test the http client against it, so it would need to go into core instead of being in the cli only.

The testing strategy - what sort of tests go where and how different components are tested - is documented here: https://github.com/ipfs/js-ipfs/blob/master/docs/DEVELOPMENT.md#testing-strategy

> I'm not familiar enough with the intended architecture here

There's a small world of documentation here: https://github.com/ipfs/js-ipfs/tree/master/docs

> I think go-ipfs uses that command infrastructure for both CLI and server

Yes, go-ipfs generates its CLI and HTTP API from the same code, whereas, for better or worse, js-ipfs has a more manual process.

> I have 8 test fixtures totalling 2.9M (1.1M if I compress them) for this feature

Ouch, can these be generated on the fly by the tests?

@rvagg (Member, Author) commented Jun 30, 2021

@achingbrain the other dilemma I have here is the multiformats & ipld codec stack. @ipld/car uses multiformats and @ipld/dag-cbor itself already. I also need to store the raw bytes of blocks, plus traverse their decoded forms to gather the full DAG to export. The tools to get { cid, bytes, links } from a block (as this PR does now) are easier with the new stack than with ipld, which wants to turn a cid into a node; that's great for getting links, but the bytes get discarded along the way, so you have to go fishing for them elsewhere (or reimplement bits of ipld to do it).

So the choice is something like this:

  1. Code against the existing ipld stack but suffer only the multiformats and @ipld/dag-cbor bundle expansion costs and have overly complicated code that will be awkward to undo if/when we get a full multiformats upgrade here.
  2. Code against the new ipld stack and pull in the full set of codecs and suffer the bundle expansion costs of all of that (not massive, but not trivial), have slightly more straightforward code and be more future-proof if/when we get a full multiformats upgrade.

That might depend on timing for said upgrade so I'd appreciate your input on this.

Re test fixtures - I was wanting to import the ones that go-ipfs is using as they are, so we ensure both that the feature works properly and we have some level of parity with go-ipfs for it.

@rvagg (Member, Author) commented Jul 2, 2021

Status: dag export and dag import are implemented in core, cli, http-client and http-server. Minimal documentation and no tests. I've tested export manually in all of the places and it works like a charm. I've only tested import on the CLI, it's a bit advanced to set up a test with HTTP and I have run out of time this week.

Still needed: unit tests in cli, http-{client,server}, and some integration tests to match go-ipfs, plus it needs some additional docs. I'd really like to get t0054 from sharness working here via JS, it's 2.9M (1.1M compressed) of fixtures, but it's got really nice CAR coverage for various edge cases.

@achingbrain (Member):

@rvagg where does the BlockCount property come from? It doesn't appear to be in the output of the dag.import http command?

@achingbrain (Member):

Ah, I see - it's in an un-merged PR to go-ipfs: ipfs/kubo#8237

@rvagg (Member, Author) commented Jul 21, 2021

👌 looking really good, thanks @achingbrain
I really would like to see some test fixtures in here, though; it'd be good to figure out the best way to pull down large fixtures if it's going to be a problem for them to live in the repo. The ones @ribasushi put together for go-ipfs are pretty nice and cover the main edge cases for this feature.

@achingbrain (Member) commented Jul 26, 2021

Anecdotally for the fixture data the js implementation is 2-3.5x faster than go (ignoring test setup & teardown times):

$ npm run test:interface:http-go -- -t node

> [email protected] test:interface:http-go /Users/alex/Documents/Workspaces/ipfs/js-ipfs/packages/ipfs
> aegir test -f test/interface-http-go.js "-t" "node"

Test Node.js


  interface-ipfs-core over ipfs-http-client tests against go-ipfs
    .dag.export
      ✔ should export a car file (265ms)
      ✔ export of shuffled devnet export identical to canonical original (4412ms)
      ✔ export of shuffled testnet export identical to canonical original (32176ms)
    .dag.import
      ✔ should import a car file (488ms)
      ✔ should import a car file without pinning the roots (254ms)
      ✔ should import multiple car files (671ms)
      ✔ should import car with roots but no blocks (33776ms)
      ✔ should import lotus devnet genesis shuffled nulroot


  8 passing (1m)

vs

$ npm run test:interface:http-js -- -t node

> [email protected] test:interface:http-js /Users/alex/Documents/Workspaces/ipfs/js-ipfs/packages/ipfs
> aegir test -f test/interface-http-js.js "-t" "node"

Test Node.js


  interface-ipfs-core over ipfs-http-client tests against js-ipfs
    .dag.export
      ✔ should export a car file (136ms)
      ✔ export of shuffled devnet export identical to canonical original (1939ms)
      ✔ export of shuffled testnet export identical to canonical original (12438ms)
    .dag.import
      ✔ should import a car file (134ms)
      ✔ should import a car file without pinning the roots (99ms)
      ✔ should import multiple car files (197ms)
      ✔ should import car with roots but no blocks (14021ms)
      ✔ should import lotus devnet genesis shuffled nulroot


  8 passing (37s)

@rvagg (Member, Author) commented Jul 26, 2021

@achingbrain any idea what the difference might be? Could it be on the RPC side, or is it only explainable by the core impl, and perhaps the blockstore mechanism?

Would be interesting to try this same thing from the go http api, maybe.

@achingbrain (Member):

The original perf of this in js was really bad because we were doing sequential block writes in datastore-fs and also not being very clever about how we walk a DAG when pinning recursively. I changed it to do parallel writes and to skip child nodes we've already seen during the traversal, and it got much faster.

> Would be interesting to try this same thing from the go http api, maybe

If I understand you correctly, this is what the test runs above are doing - using the go http api and the js http api to run the same tests.

@achingbrain achingbrain marked this pull request as ready for review July 26, 2021 10:37
@rvagg (Member, Author) commented Jul 26, 2021

well, 👌 , whatever's going on, this is pretty sweet, thanks again for picking it up!

// blocks that we're OK with not inspecting for links
/** @type {number[]} */
const NO_LINKS_CODECS = [
dagCbor.code, // CBOR
Member Author:

Suggested change: remove the line `dagCbor.code, // CBOR`

This should be 0x51 (cbor), not 0x71 (dag-cbor) as here, but it's probably best to just remove it since we don't have an official cbor encoder (yet; no big demand atm). We need to inspect the links of dag-cbor blocks, since that codec has links, but cbor blocks don't (or maybe they could, 🤷; it's not really clear what "cbor" means and what its scope is, but at least in Go we're going with super-minimal cbor with no tags or extras).

Member:

Ah, that makes sense, I did wonder.

const codec = await codecs.getCodec(cid.code)

if (codec) {
const block = Block.createUnsafe({ bytes, cid, codec })
Member Author:

One difference we might find with Go is that they're going to be obsessively checking CIDs are correct for the bytes whenever they're loaded, at least in cases like this. The Unsafe is giving us a shortcut, basically saying we don't care if it doesn't match, and in fact we probably trust the underlying source of blocks to have already done that check. The new go-ipld-prime integration work is providing some opportunities for "trusted" data sources that skip these checks too so they may get slight perf improvements.

Member:

In this case I don't think we need to, as the blockstore creates the path to the block from the multihash extracted from the CID. I suppose you could mess around with the fs/level backing store to change the contents of the buffer stored at the relevant path, but I think solving that problem is outside the scope of dag export.

Re-hashing/verifying blocks on every export would be quite expensive; definitely something we should put behind a flag if people need those kinds of guarantees.

go-ipfs has an `ipfs repo verify` command which I guess would load every block in the repo and rehash it to ensure we've got the right bytes; there's no equivalent in js land yet.

Member:

related: the old Datastore.HashOnRead option that is false by default, due to the perf hit and "i trust my datastore" assumption https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#datastorehashonread

Member Author:

Go will be getting the same treatment when the ipld-prime work makes its way through: https://github.com/ipfs/go-fetcher/blob/64b1f390e7ae13d96494f15a10e07527369a521d/impl/blockservice/fetcher.go#L45

I'm pretty sure (but far from certain) that any time you interact with an IPLD Node at the moment it does the whole-hog of verification and you can't bypass it. In fact there's a lot of places where you can't even avoid decoding the bytes even if you just want the bytes. Things should get a little more sophisticated as ipld-prime moves through.

@achingbrain achingbrain merged commit 700765b into master Jul 27, 2021
@achingbrain achingbrain deleted the rvagg/import-export branch July 27, 2021 08:01
@rvagg (Member, Author) commented Jul 27, 2021

🎉

Successfully merging this pull request may close these issues.

feat: import/export a DAG from/to a CAR file