Recursive add of large directory fails at 100% (with nocopy and fscache) #5815
This is probably due to provider records (i.e., the process of telling the network that you have the content), unfortunately. We currently need to make one DHT request per block, which means ~1e6 DHT requests. Tracked by: #5774. For now, you should be able to use the … option. Unfortunately, that does mean you won't tell the network that you have the data.
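For reference (an aside, not stated in the thread): in go-ipfs, reproviding is controlled by the `Reprovider` section of the config. Setting `Interval` to `"0"` disables periodic reprovides entirely, which avoids the per-block DHT announcements at the cost of not advertising the data to the network. A sketch of the relevant config fragment:

```json
{
  "Reprovider": {
    "Interval": "0",
    "Strategy": "all"
  }
}
```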
Actually, it may not be that. Can you post a heap and goroutine profile when this gets stuck? That is, run:
wget http://localhost:5001/debug/pprof/heap
wget http://localhost:5001/debug/pprof/goroutine?debug=2
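As an aside (not from the thread): the `goroutine?debug=2` dump is plain text with one `goroutine N [state]:` header per goroutine, so a small script can summarize where goroutines pile up while the add hangs. A sketch, using a hypothetical two-goroutine sample:

```python
# Sketch: count goroutines by state in a `debug/pprof/goroutine?debug=2` dump.
# The `sample` text below is a hypothetical, abbreviated dump for illustration.
import re
from collections import Counter

def goroutine_states(dump: str) -> Counter:
    """Count goroutines by state, e.g. 'running', 'chan receive', 'IO wait'."""
    # Headers look like: "goroutine 42 [chan receive, 3 minutes]:"
    states = re.findall(r"^goroutine \d+ \[([^\],]+)", dump, flags=re.M)
    return Counter(s.strip() for s in states)

sample = """goroutine 1 [running]:
main.main()
\t/go/src/main.go:10 +0x20

goroutine 42 [chan receive, 3 minutes]:
github.com/ipfs/go-ipfs/core.(*IpfsNode).loop()
\t/go/src/node.go:55 +0x11
"""

print(goroutine_states(sample))  # e.g. Counter({'running': 1, 'chan receive': 1})
```

A large cluster of goroutines stuck in the same blocking state (e.g. `chan receive` or `semacquire`) usually points at the component that is wedged.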
@Stebalien Here you go. https://gateway.ipfs.io/ipfs/QmS79kLK2sxYVCtYNAJqwH1pZNePS8AQBNfr9AhRKkrq9a Note that the adding progresses fine until exactly 100% is reached. I will try again with …
@Stebalien Most surprising result with …
So, it does look like provider records are backing up, however...
That's not good. Can you run …? Also, could you try …? It looks like you're missing a block that you should have.
I've cleaned the filestore and all pins and run both verify commands, to no avail. :/ However, very much to my surprise, the resource does seem to be pinned! (After another stall at 100% - note that this one was without …)
Note lastly that, weirdly enough, the file sizes reported by the gateway are only a fraction of the real size of the data (18 GB reported vs. 400 GB original). I have not yet tried to download the resource as, with current IPFS performance, that would take several days. But you're very much invited to try.
(I'm currently giving it another run with …)
:/
Ok, this is definitely a bug in filestore. What's the shape of the data? That is: small directory of large files or a large directory of small files?
It's an Elasticsearch snapshot: a couple of levels of depth (~4), lots of smaller files (bytes) and larger files (megabytes). Example data: Qmc3RxfyZTPf7omWN1XxDkaZhp93ukfLSY14CTC8n1v5Hv (created using ipfs-pack, which somehow does seem to work)
I've just tested a large directory of small files with filestore so I'm pretty sure it's not that. I've also tested filestore on a 200MiB file so it's not that either. @dokterbob have you tried running this without nocopy? I'm wondering if you have a filesystem corruption.
Haven’t tried without nocopy, yet.
I’ve had this problem on two different machines, one of which runs ZFS - so very little chance of filesystem corruption (but running fsck on one of them anyways).
Also check the permissions; make sure the daemon can read all the files. Are you running the daemon as a different user? (But it's probably a bug.)
Just successfully created a tar as the ipfs user, so that's ruled out.
Also testing without --nocopy, so we can focus on that.
Love to hear what I can do to provide you with more info to debug this. Also happy to share the actual dataset (although the given hash should suffice to replicate the problem).
Those incorrect sizes are also pretty worrying.
Yep. Without …
Do you have garbage collection enabled?
Yep. It's kind of necessary as we pull about 1TB a day through our server.
I could test it later this week on my home server with GC disabled (if it is enabled at all).
Thanks!
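For context (an aside, not from the thread): in go-ipfs, automatic garbage collection only runs when the daemon is started with the `--enable-gc` flag, so starting the daemon without it is one way to rule out GC interference. GC behavior is tuned via the `Datastore` section of the config; the values below are illustrative, not from this issue:

```json
{
  "Datastore": {
    "StorageMax": "10GB",
    "StorageGCWatermark": 90,
    "GCPeriod": "1h"
  }
}
```

Here GC is triggered periodically (`GCPeriod`) or when usage crosses `StorageGCWatermark` percent of `StorageMax`.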
Oops, it seems we needed more information for this issue; please comment with more details or this issue will be closed in 7 days.
This issue was closed because it is missing author input. |
Version information:
go-ipfs version: 0.4.18-
Repo version: 7
System version: amd64/linux
Golang version: go1.11.1
Type:
Bug
Description:
Adding a large resource (specifically, the 390 GB ipfs-search.com index) fails at 100%: it simply blocks and never outputs the root hash for the resource. Reaching 100% takes an acceptable amount of time, after which nothing happens for at least 12 hours.
Example output: