Corrupted large badger repo #5213

Closed
ghost opened this issue Jul 10, 2018 · 26 comments
Labels
need/analysis Needs further analysis before proceeding topic/badger Topic badger topic/repo Topic repo

Comments

@ghost

ghost commented Jul 10, 2018

Version information: 0.4.16-rc2

Type: bug

Description:

I have this large badgerds repo (~5TB) which I recently updated to 0.4.16-rc2. After the included 6-to-7 repo migration, my repo is corrupted. I'm not sure whether I hard-killed the daemon before the update. The version it was previously running is 8b383da, which was first included in 0.4.15-rc1.

# IPFS_PATH=/ipfs/ipfs_master/repo ./ipfs daemon
Initializing daemon...
Error: Unable to replay value log: "/ipfs/ipfs_master/repo/badgerds/000017.vlog": Data corruption detected. Value log truncate required to run DB. This would result in data loss.
Received interrupt signal, shutting down...
(Hit ctrl-c again to force-shutdown the daemon.)

I tried doing a badger backup with truncation enabled, but that didn't actually go and truncate stuff:

# badger --vlog-dir badgerds/ --dir badgerds/ backup -t
Listening for /debug HTTP requests at port: 8080
Error: Unable to replay value log: "badgerds//000017.vlog": Value log truncate required to run DB. This might result in data loss.
Usage:
  badger backup [flags]

Flags:
  -f, --backup-file string   File to backup to (default "badger.bak")
  -h, --help                 help for backup
  -t, --truncate             Allow value log truncation if required.

Global Flags:
      --dir string        Directory where the LSM tree files are located. (required)
      --vlog-dir string   Directory where the value log files are located, if different from --dir

Unable to replay value log: "badgerds//000017.vlog": Value log truncate required to run DB. This might result in data loss.
@ghost ghost added topic/repo Topic repo topic/badger Topic badger labels Jul 10, 2018
@ghost ghost assigned schomatis Jul 10, 2018
@schomatis
Contributor

Thanks for reporting this issue, it's my main concern regarding the Badger transition. There are two separate issues:

  1. Possible data loss after a kill. This is expected, and by default Badger requires explicit consent to truncate the corrupted values. But I'm wondering: what tools and information does an IPFS user have to actually take the required measures (enable truncation) and continue using the repo?

  2. The -t flag not working. I'll have to take a closer look at the backup command, and possibly at your repo.
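
To make the "explicit consent to truncate" idea in point 1 concrete, here is a minimal sketch in Go. It uses a toy record format invented for this illustration (length, payload, CRC32), not Badger's actual on-disk layout, and the names `appendRecord`, `validEnd`, `replay` and `errTruncateRequired` are all hypothetical. It only shows the general pattern: find the last intact record, and refuse to cut off the damaged tail unless the caller opts in.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
	"os"
)

// errTruncateRequired is a hypothetical error in the spirit of the message
// quoted in this thread; it is not Badger's own error value.
var errTruncateRequired = errors.New("log truncate required; this may lose data")

// appendRecord encodes one record in a toy format (NOT Badger's format):
// 4-byte big-endian length, payload, 4-byte CRC32 of the payload.
func appendRecord(buf, payload []byte) []byte {
	var word [4]byte
	binary.BigEndian.PutUint32(word[:], uint32(len(payload)))
	buf = append(buf, word[:]...)
	buf = append(buf, payload...)
	binary.BigEndian.PutUint32(word[:], crc32.ChecksumIEEE(payload))
	return append(buf, word[:]...)
}

// validEnd returns the offset just past the last intact record.
func validEnd(data []byte) int {
	off := 0
	for len(data)-off >= 4 {
		n := binary.BigEndian.Uint32(data[off:])
		if uint64(n)+8 > uint64(len(data)-off) {
			break // record runs past the end of the file (torn write)
		}
		body := data[off+4 : off+4+int(n)]
		sum := binary.BigEndian.Uint32(data[off+4+int(n):])
		if sum != crc32.ChecksumIEEE(body) {
			break // checksum mismatch: treat the rest as damaged
		}
		off += 8 + int(n)
	}
	return off
}

// replay validates the file and cuts off a damaged tail only when
// allowTruncate is set; otherwise it reports that consent is needed.
func replay(path string, allowTruncate bool) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	end := validEnd(data)
	if end == len(data) {
		return nil // nothing to repair
	}
	if !allowTruncate {
		return errTruncateRequired
	}
	return os.Truncate(path, int64(end))
}

func main() {
	f, err := os.CreateTemp("", "toy-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	data := appendRecord(nil, []byte("hello"))
	data = appendRecord(data, []byte("world"))
	data = append(data, 0xde, 0xad) // simulated torn write after a hard kill
	if _, err := f.Write(data); err != nil {
		panic(err)
	}
	f.Close()

	fmt.Println("without consent:", replay(f.Name(), false))
	fmt.Println("with consent:   ", replay(f.Name(), true))
}
```

In this sketch only the bytes after the last intact record are dropped, which is why the loss is usually limited to the writes that were in flight when the process died.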

@ghost
Author

ghost commented Jul 10, 2018

Let me know an SSH key :)

@schomatis
Contributor

There's a Windows-related check in the Badger truncation function that is catching my attention: it refuses to truncate if the value log has been loaded with mmap. @magik6k, is there an easy way to pass the ValueLogLoadingMode option through the config file so that it does not use the (default) mmap?

@schomatis
Contributor

Actually, if you're running the badger command from the cloned git repo, @lgierth, could you do a temporary modification of the ValueLogLoadingMode default option, change it to FileIO, rebuild and retry the command?

@ghost
Author

ghost commented Jul 10, 2018

It did change something:

Unable to replay value log: "badgerds//000017.vlog": truncate badgerds//000017.vlog: invalid argument

@schomatis
Contributor

Yes, this is a different problem, I'll raise the corresponding issues at Badger.

@manishrjain

manishrjain commented Jul 11, 2018

Can you try not passing the slash at the end? So, it doesn't have two slashes in the file path: badgerds//000017.vlog. I'm not sure if that's the issue, but I suspect it might be an issue.
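
As a side note on the doubled slash, here is a small sketch of how a path like this can be built so that a trailing slash on the directory does not matter. The helper `vlogPath` is hypothetical and is not taken from Badger; the `%06d.vlog` pattern simply mirrors the file names shown in this thread. On POSIX systems consecutive slashes are resolved as a single slash, so there the difference is cosmetic.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// vlogPath builds the path of value log file number id inside dir.
// filepath.Join cleans the result, so "badgerds/" and "badgerds" behave alike.
func vlogPath(dir string, id int) string {
	return filepath.Join(dir, fmt.Sprintf("%06d.vlog", id))
}

func main() {
	fmt.Println(vlogPath("badgerds/", 17))
	fmt.Println(vlogPath("badgerds", 17))
	// An already doubled separator can be normalized with filepath.Clean.
	fmt.Println(filepath.Clean("badgerds//000017.vlog"))
}
```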

Also, what version of Badger are you on?

P.S. If you have more logs, they would help us better understand what's happening here.

@ghost
Author

ghost commented Jul 11, 2018

That didn't help unfortunately, neither as a relative nor absolute path. Is there anything I could pull out of :8080 while it's still running?

@manishrjain

Can you expand more about the Badger version and the environment? Also, if you have access to the Badger directory, could you tar, gzip and upload it and send me a link? So, I could debug what's going on.

@ghost
Author

ghost commented Jul 12, 2018

> Also, if you have access to the Badger directory, could you tar, gzip and upload it and send me a link?

It's 5 TB unfortunately. Can give you access to the host though.

@manishrjain

Sure. My email id is my first name at dgraph.io. Also, tell me the steps about what to do after logging in.

@manishrjain

It looks like this is the line which is causing the issue. For some reason, it is unable to truncate the file:

https://github.com/dgraph-io/badger/blob/master/value.go#L329

@schomatis
Contributor

> Can you expand more about the Badger version

The Badger version currently used in go-ipfs is v1.3.0,

https://ipfs.io/ipfs/QmeAEa8FDWAmZJTL6YcM1oEndZ4MyhCr5rTsjYZQui1x1L/badger

although @lgierth was using a much more recent version to run the backup command that was failing, probably v1.5.x.

@manishrjain

What filesystem is the environment using? Is it VFAT or EXT4 or something else?

@Stebalien Stebalien added the need/analysis Needs further analysis before proceeding label Jul 13, 2018
@schomatis
Contributor

@lgierth Could you provide @manishrjain more details about the setup?

@leerspace
Contributor

I've also just encountered this issue on Windows with v0.4.16 on an NTFS partition. The update completed successfully (as far as I could tell) and I was able to use the repo for a while afterwards, but now I'm getting this error. The repo I lost is a lot smaller at 88GB, so I can share if it would be helpful.

@schomatis
Contributor

Hey @leerspace, there are different errors mentioned in this issue, are you getting the invalid argument one?

@leerspace
Contributor

@schomatis sorry for not being more clear. I'm getting the error in the first post: Error: Unable to replay value log: "C:\\Users\\user\\.ipfs\\badgerds\\000088.vlog": Data corruption detected. Value log truncate required to run DB. This would result in data loss..

@schomatis
Contributor

Ok, this may be a consequence of many possible factors, but most likely a crash or a hard-kill of an ipfs command. It's an acceptable scenario, but we're not providing a truncate flag at the moment or any tool to bypass this (see #5213 (comment) point 1). I'll open another issue about this.

If you want, you could run badger backup -t to try to remove the corrupted part of the DB (which should be only a small fraction of it).

@leerspace
Contributor

@schomatis I just finished running badger --vlog-dir badgerds/ --dir badgerds/ backup -t from the OP and it completed successfully from what I can tell, but now I get a bunch of disk IO followed by a different error when trying to start the daemon. It's different from what's in this issue, so I can open a separate one for my new error.

@manishrjain

So, Go's truncate function is failing:

2018/08/17 04:18:33 Iterating file id: 16
2018/08/17 04:18:33 Replaying log file: 16. Running count: 2000
2018/08/17 04:18:33 Replaying log file: 16. Running count: 4000
2018/08/17 04:18:33 Iteration took: 214.280716ms
2018/08/17 04:18:33 Iterating file id: 17
panic: offset: 0. Err: truncate ./000017.vlog: invalid argument

goroutine 1 [running]:
github.com/dgraph-io/badger.(*valueLog).iterate(0xc4200f3d48, 0xc4cb72c8a0, 0x0, 0xc5956470e0, 0x1, 0x0)
	/root/go/src/github.com/dgraph-io/badger/value.go:335 +0xa59
github.com/dgraph-io/badger.(*valueLog).Replay(0xc4200f3d48, 0x0, 0xc400000000, 0xc5956470e0, 0x0, 0x0)
	/root/go/src/github.com/dgraph-io/badger/value.go:779 +0x32d

Code:

        if vlog.opt.Truncate && truncate && len(lf.fmap) == 0 {
                // Only truncate if the file isn't mmaped. Otherwise, Windows would puke.
                if err := lf.fd.Truncate(int64(validEndOffset) + 1); err != nil {
                        panic(fmt.Sprintf("offset: %d. Err: %v", validEndOffset, err.Error()))
                        return err
                }

I see that the root folder is on a RAID array. I wonder if that's what's causing the issue -- this looks like a problem with either the standard file.Truncate function in Go, or a problem with the system itself.

├─sdc3    8:35   0   6.7G  0 part  
│ └─md2   9:2    0    20G  0 raid5 /
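
For anyone who wants to tell an OS-level truncate failure apart from log corruption without a panic, the following is a minimal sketch using only the Go standard library. The helpers `truncateFile`, `isInvalidArgument` and `demo` are hypothetical names for this illustration. It relies on the fact that `os.File.Truncate` returns an `*os.PathError` wrapping the underlying errno, which `errors.Is` can inspect. The negative size is just an easy way to provoke an error on a healthy system (on Linux, `ftruncate` rejects it with EINVAL); it says nothing about what caused the failure on the host above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// truncateFile opens path read-write and truncates it to size bytes.
func truncateFile(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	return f.Truncate(size)
}

// isInvalidArgument reports whether err wraps the EINVAL errno, which is
// what the "invalid argument" text in a truncate error corresponds to.
func isInvalidArgument(err error) bool {
	return errors.Is(err, syscall.EINVAL)
}

// demo truncates a scratch file once to a valid size and once to a
// negative size, and returns both results.
func demo() (okErr, badErr error) {
	f, err := os.CreateTemp("", "trunc-*.bin")
	if err != nil {
		return err, err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("0123456789"); err != nil {
		f.Close()
		return err, err
	}
	if err := f.Close(); err != nil {
		return err, err
	}
	okErr = truncateFile(f.Name(), 4)
	badErr = truncateFile(f.Name(), -1)
	return okErr, badErr
}

func main() {
	okErr, badErr := demo()
	fmt.Println("valid size:   ", okErr)
	fmt.Println("negative size:", badErr)
	fmt.Println("is EINVAL:    ", isInvalidArgument(badErr))
}
```

If the same valid truncate call fails on one host and succeeds on a scratch file elsewhere, that points away from the calling code and toward the filesystem or the storage underneath it.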

@bonedaddy
Contributor

Have you tried doing a health check and/or repair of your RAID array?

@ghost
Author

ghost commented Aug 17, 2018

Yeeah spot on, one (of four) disks has died without us noticing. I don't even see log lines of when it died. The filesystem seems to be intact and complete, but whatever, let's call this host dead.

The data in the repo can be reproduced relatively easily. (It's really just the cdn.media.ccc.de mirror that needs reproducing.)
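
Since the failed disk went unnoticed here, a rough sketch of an automated check may be useful. It parses text in the format of Linux's /proc/mdstat and reports arrays whose `[total/active]` counter shows fewer active members than configured. The function `degradedArrays` is a hypothetical helper, and the sample text is illustrative only, not output from the host in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	// Matches the first line of an array entry, e.g. "md2 : active raid5 ...".
	arrayLine = regexp.MustCompile(`^(md\w+)\s*:`)
	// Matches the member counter, e.g. "[4/3]" (configured/active).
	memberCount = regexp.MustCompile(`\[(\d+)/(\d+)\]`)
)

// degradedArrays returns the names of arrays with fewer active members
// than configured, given text in /proc/mdstat format.
func degradedArrays(mdstat string) []string {
	var degraded []string
	current := ""
	sc := bufio.NewScanner(strings.NewReader(mdstat))
	for sc.Scan() {
		line := sc.Text()
		if m := arrayLine.FindStringSubmatch(line); m != nil {
			current = m[1]
			continue
		}
		m := memberCount.FindStringSubmatch(line)
		if m == nil || current == "" {
			continue
		}
		total, _ := strconv.Atoi(m[1])  // digits only, cannot fail to parse
		active, _ := strconv.Atoi(m[2]) // except on absurdly long numbers
		if active < total {
			degraded = append(degraded, current)
		}
		current = ""
	}
	return degraded
}

func main() {
	// Illustrative sample: a four-member RAID5 with one member missing.
	sample := `Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdc3[2] sdb3[1] sda3[0]
      20953088 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

unused devices: <none>
`
	fmt.Println(degradedArrays(sample))
}
```

On a real host the input would come from reading /proc/mdstat, and the result could feed whatever alerting is already in place.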

@manishrjain

You could copy over this data to another host, and verify that Badger is doing the right thing. Not sure there's anything else we need to do from Badger's end, so I'm considering this issue closed.

@schomatis
Contributor

> Not sure there's anything else we need to do from Badger's end, so I'm considering this issue closed.

Agreed, I'm closing the issue on the Badger end, thanks for investigating this issue @manishrjain which wasn't actually related to Badger.

> You could copy over this data to another host, and verify that Badger is doing the right thing.

Could you do this @lgierth to be extra sure? Or is the DB too big to perform a full copy?

@ghost
Author

ghost commented Oct 5, 2018

This has been solved -- the underlying mdadm RAID got into a weird state and might have corrupted/lost data.

This issue was closed.