
Deeper sub-levels option for extremely large repositories #215

Open

midnightmagic opened this issue Feb 21, 2023 · 3 comments

Comments

midnightmagic commented Feb 21, 2023

Output of rest-server --version

rest-server version rest-server 0.10.0-dev compiled with go1.17.3 on linux/ppc64le

What should rest-server do differently?

For extremely large repositories (on the order of dozens to hundreds of TB), the single sub-level of directories in the storage backend turns into directories holding upwards of 16k-18k individual files each. For name lookups alone, let alone full-directory traversal, on some operating systems and/or filesystems this becomes extremely expensive to maintain, mirror, snapshot, or even just move from one machine to another.

Ultimately this is just a performance problem caused by the environment rather than by restic itself, of course.

Ideally, I believe the easiest solution, at least for my particular backups, would be to add another sub-level: not a full additional 256-way split, but likely even a single '0'-'f' sub-level would be enough to reduce the file count per directory to a level where the system is not labouring.
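
To make the idea concrete, here is a rough sketch of the layout change I have in mind. The two-hex-character first level is how restic already lays out data/; the single-hex-character second level is the proposal, and the helper functions and paths below are purely illustrative, not rest-server's actual code:

```go
// Illustrative only: current restic-style layout vs. the proposed extra
// single-hex-character sub-level. Not rest-server's actual code.
package main

import (
	"fmt"
	"path/filepath"
)

// currentPath mirrors the existing layout: data/<first two hex chars>/<name>
func currentPath(repo, name string) string {
	return filepath.Join(repo, "data", name[:2], name)
}

// proposedPath adds one more level keyed on the third hex character,
// fanning each 256-way first-level directory out into 16 smaller ones.
func proposedPath(repo, name string) string {
	return filepath.Join(repo, "data", name[:2], name[2:3], name)
}

func main() {
	name := "3f5c2a7e9b" // shortened hash, for illustration only
	fmt.Println(currentPath("/srv/repo", name))  // /srv/repo/data/3f/3f5c2a7e9b
	fmt.Println(proposedPath("/srv/repo", name)) // /srv/repo/data/3f/5/3f5c2a7e9b
}
```

With ~18k files per first-level directory today, that single extra level would bring each directory down to roughly a thousand entries.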

What are you trying to do? What is your use case?

I wrote a small utility to manually verify the integrity of the files, so one of my use-cases is to be able to read large amounts of backup data quickly—weirdly, in spite of the files being reasonably large, the name lookups reduce overall data export bandwidth to about 5MB/s on the ZFS backend I'm currently using.

So, for me specifically, it would be a remote rsync and a local checksum verification.

Currently I'm looking at getObjectPath (adding an additional 0-f substring in there) and at createRepo() for creating the repository and for storage from here on. I suspect the migration of my pre-existing repository could be made programmatic in some fashion: perhaps an exists check per sub-directory on each second-level directory during a directory iteration, plus an idempotent move operation that can be restarted, for users like myself who may sometimes cause instability on systems running long-lived, important processes.
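
As an example, a minimal sketch of such a restartable migration, assuming the extra level is keyed on the third hex character of the file name (the function names here are hypothetical, not existing rest-server code):

```go
// Hypothetical migration sketch: moves data/<xy>/<name> to data/<xy>/<z>/<name>,
// where z is the third hex character. Safe to re-run: entries already moved into
// a sub-directory are skipped, so an interrupted run can simply be restarted.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func migrateDataDir(dataDir string) error {
	firstLevel, err := os.ReadDir(dataDir)
	if err != nil {
		return err
	}
	for _, d := range firstLevel {
		if !d.IsDir() {
			continue
		}
		dir := filepath.Join(dataDir, d.Name())
		entries, err := os.ReadDir(dir)
		if err != nil {
			return err
		}
		for _, e := range entries {
			if e.IsDir() || len(e.Name()) < 3 {
				continue // already-migrated sub-dirs and odd names are left alone
			}
			sub := filepath.Join(dir, e.Name()[2:3])
			if err := os.MkdirAll(sub, 0o700); err != nil {
				return err
			}
			if err := os.Rename(filepath.Join(dir, e.Name()), filepath.Join(sub, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	if err := migrateDataDir("/srv/repo/data"); err != nil {
		log.Fatal(err)
	}
}
```

Obviously this only makes sense once the server (and restic) actually understand the deeper layout.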

Did rest-server help you today? Did it make you happy in any way?

Yeah. Every time I make a new snapshot this software makes me happy. :-) The design decisions just make sense, and they fit perfectly into my uses. The portability, the consistent user experience, the use of parallelism for performance, it's all good. All so, so good. Thank you.

wojas (Contributor) commented Feb 22, 2023

The current filesystem format is made to be compatible with how restic stores repositories on disk. You can directly run restic commands on a rest-server repo, which would not be the case with the proposed change.

For extremely large repositories (on the order of dozens to hundreds of TB) the single-sublevel of directories for the storage backend become directories with upwards of 16k - 18k individual files. For name lookups alone, let alone full-directory traversal, on some OS and/or FS this becomes extremely expensive to maintain, mirror, snapshot, or even just move from one machine to another.

18k files in a directory is really not that much for a modern filesystem. For example, ext3+ has the dir_index option that prevents the need for linear scans. I believe it is enabled by default nowadays when creating a new filesystem, but see this page for how to enable it on an existing filesystem that does not have it enabled.

You mention that you use ZFS. I'm surprised that this would be an issue with ZFS, but I have no experience actually running with ZFS.

As a side note, it can be helpful to use a small SSD/NVMe device for read-only caching, if possible. This can make directory listings as fast as if the data was stored on the SSD.

MichaelEischer (Member) commented

Too many files in a single folder isn't the only problem here. At that repository size, listing all files in it currently causes the rest-server to create a single gigantic JSON reply, which might cause memory issues.

Other than that, additional directory levels would first have to be supported by restic.

ethereal-engineer commented
I completely don't know what I'm talking about here, but what about running multiple rest-server instances with their roots in different directories, to split up the massive-repository problem?
