Issues with TarFS - slow to open large tar archives, and exception #275
I suspect the size of the file is probably a red herring. The only thing I can see that would cause that is if there is a file encoded with a path of empty string or "/". Could you try something along these lines, and tell me if either of those conditions could be true:

```python
>>> import tarfile
>>> t = tarfile.open("mytar.tar")
>>> [f.path for f in t]
```
As for the slowness, I think you're right about making the directory reading lazy. Although you would still incur that delay the first time you list a directory, and potentially with a few other methods.
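A minimal sketch of what lazy directory loading could look like, assuming an already-open `tarfile.TarFile`; the real TarFS keeps this index on the filesystem object itself, so the wrapper class here is illustrative only:

```python
from collections import OrderedDict

from fs.path import relpath


class LazyTarDirectory:
    """Defer the full archive scan until the index is first needed (sketch)."""

    def __init__(self, tar):
        self._tar = tar  # an open tarfile.TarFile
        self._cache = None

    def get(self):
        if self._cache is None:
            # The expensive full pass over the archive now happens on first
            # use instead of at construction time.
            self._cache = OrderedDict(
                (relpath(info.name).rstrip("/"), info) for info in self._tar
            )
        return self._cache
```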
I don't see any empty string or '/' names from the following code:

I do see a '.' entry, though.

The tarfile was generated with the following command, where `source` is the root directory:
As for the slowness, I think a full file scan is a reasonable expectation for a directory listing operation.
I don't think scanning the full directory is avoidable, unfortunately. TarFS creates the illusion that tar is structured for random access, when it is anything but. As it is essentially a flat list, there is no way to know if a file exists or not without scanning the entire thing. And you can't list just a single directory. If TarFS were to discard the directory after an operation, it would have to re-scan it on the next operation and incur the delay again. So if your big tar takes 45 minutes to scan, it would take 45 minutes to do each such operation.
Sorry, what 2 file scans? There is only one as far as I can tell. The directory is cached on first use.

To be honest, if your use case is satisfied with a linear scan of the tar, then TarFS may just add overhead you don't need, and you might want to just use the builtin `tarfile` module.

BTW I am not the original author of TarFS. It was contributed by @althonos. I also don't claim to be particularly knowledgeable about the tar format. I could well be mistaken about how it works.

I think the IndexError you found is caused by that "." file. I'm guessing that stores information about the directory that was compressed. Odd that we've never seen that. I'll implement a workaround for that when I next have the time. Or would accept a PR if you would like to tackle it yourself.
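For a purely linear pass, the builtin module can stream members as they are encountered. A rough sketch (the archive name is a placeholder):

```python
import tarfile

# Stream members in archive order; headers are parsed as the iteration
# advances, so there is no up-front index-building pass over a huge file.
with tarfile.open("mytar.tar") as tar:
    for member in tar:
        if member.isfile():
            fileobj = tar.extractfile(member)
            data = fileobj.read()  # or read in chunks for very large members
            # ... process data ...
```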
When walking the tar file there are two scans because:

1. the full directory index is built when the filesystem is first used (a complete pass over the archive), and
2. the walk then goes over the archive contents again.

In the case of a tar file where we only open and walk it (arguably a common use case with tar files), #1 is redundant. For my present use case I have switched to using the `tarfile` package directly.

I see your point about abstracting tar files to random access systems, though you might consider a more efficient way of storing the directory.

I don't have the bandwidth to tackle any of this myself, but if it remains a feature open to contribution, and I find bandwidth later, I might opt to take it on. It'd be nice to make TarFS() more large-file friendly. The two scans for a walk are the most limiting factor, though; 5 GB of RAM is probably available to those handling massive tar files.
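One possible reading of "storing the directory more efficiently" is to keep a small record per member instead of the full TarInfo object; this is only a hypothetical illustration, not how TarFS is implemented:

```python
from collections import namedtuple

# Hypothetical compact entry: just what lookups and reads would need.
Entry = namedtuple("Entry", "offset_data size isdir mtime")


def build_index(tar):
    """Single linear pass over an open tarfile.TarFile (sketch)."""
    index = {}
    for info in tar:
        # Later duplicates overwrite earlier ones, matching tarfile's
        # "last occurrence wins" behaviour for repeated names.
        index[info.name.rstrip("/")] = Entry(
            info.offset_data, info.size, info.isdir(), info.mtime
        )
    return index
```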
I'm afraid it may be unavoidable. Files in a tar may be stored in no particular order, and there are confounding factors such as having a "foo/bar/baz" file without a directory called "foo". PyFilesystem also guarantees that you can walk the directory structure depth first or breadth first. If the walk didn't return files in a well defined order it would break much of the moving and copying code. I think ultimately the abstraction means that it is never going to be as efficient as simply unpacking the tar.

With regards to storing the directory more efficiently, TarFS uses much of the data that `tarfile` already provides.

Contributions are most welcome. You may find technical solutions that have escaped me!
The fact that walk defaults to "breadth" is a choice of the library rather than something every caller actually needs. I agree, though, that if a user specified "breadth" or "depth", then the first scan is unavoidable. But if the user does not specify, then I'd argue that simply lazily initializing the directory structure would avoid the redundant file scan, and the filesystem should be free to choose a walk order that is most efficient. In fact, an efficient implementation of walk for TarFS could simply yield files in whatever order they appear in the archive.
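To make the distinction concrete, here is how an ordered walk looks with the current API, plus a comment sketching the unordered "archive order" mode being argued for; the `search="natural"` value is hypothetical and does not exist in PyFilesystem:

```python
from fs.tarfs import TarFS

with TarFS("mytar.tar") as tar_fs:
    # Today a walk always has a well-defined order, which requires the
    # full directory index to exist first.
    for path in tar_fs.walk.files(search="breadth"):
        print(path)

    # The idea above would amount to a third mode that just yields members
    # in the order they appear in the archive (hypothetical):
    # for path in tar_fs.walk.files(search="natural"):
    #     print(path)
```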
In most filesystems directories are a first class concept, and so supporting a "no particular order" directory walk has not been a requirement. But I do see potential in your "linear" walk idea. In addition to archives, there is at least Amazon S3 which doesn't have true directories.

Copying to a true filesystem where the source walk is in an arbitrary order would be a more expensive operation, since PyFilesystem couldn't make the assumption that directories are created prior to copying files to them. For instance, if the source has a file called "foo/bar/baz", PyFilesystem would have to check for the existence of "foo" and "bar" in the destination before creating "baz". For that reason I think breadth is a better default.

Another possible solution to the double scan problem may be to allow an implementation to advertise that it has a more efficient copy available. The fs.copy module could defer to a specialised copy method when a filesystem provides one.
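A sketch of the "advertise a faster copy" idea; the hook name `copy_members_linear` is invented for illustration and is not part of fs.copy:

```python
from fs.copy import copy_fs


def copy_fs_preferring_fast_path(src_fs, dst_fs):
    """Use an implementation-provided bulk copy when one is advertised (sketch)."""
    fast_copy = getattr(src_fs, "copy_members_linear", None)
    if fast_copy is not None:
        # Let the source drive the copy in whatever order is cheapest for it,
        # e.g. archive order for a tar.
        fast_copy(dst_fs)
    else:
        # Fall back to the generic walk-based copy.
        copy_fs(src_fs, dst_fs)
```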
https://docs.python.org/3/library/tarfile.html#tarfile-objects says "It is possible to store a file in a tar archive several times." and https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.getmember says "If a member occurs more than once in the archive, its last occurrence is assumed to be the most up-to-date version."

And WRT the "double scan": I guess if the tarfile is uncompressed and it's being opened read-only, then the first scan only needs to (additionally) store the file offsets, and the file data can then be read later by seeking straight to those offsets.
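A sketch of the offset idea for an uncompressed, read-only archive; `tarfile` exposes `TarInfo.offset_data`, and the file name used for the lookup is just a placeholder:

```python
import tarfile

# Single scan: remember where each regular file's data starts and how big it is.
offsets = {}
with tarfile.open("mytar.tar") as tar:
    for member in tar:
        if member.isfile():
            # Duplicates overwrite earlier entries, so the last occurrence
            # wins, matching getmember()'s documented behaviour.
            offsets[member.name] = (member.offset_data, member.size)

# Later reads become a plain seek into the raw (uncompressed) archive.
with open("mytar.tar", "rb") as raw:
    start, size = offsets["some/file.txt"]
    raw.seek(start)
    data = raw.read(size)
```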
Will have a fix for that exception shortly. And I've made the directory loading lazy. But the larger issue of the inefficient walk may have to be a wontfix for now. @lurch's point about duplicate filenames may mean that the linear walk idea is a non-starter. Happy to revisit this at a later date.
Fixed the exception in v2.4.5. @davidparks21 can you test?
[2.4.11]:
Added
- Added geturl for TarFS and ZipFS for 'fs' purpose. NoURL for 'download' purpose.
- Added helpful root path in CreateFailed exception
- Added Python 3.8 support
Fixed
- Fixed tests leaving tmp files
- Fixed typing issues
- Fixed link namespace returning bytes
- Fixed broken FSURL in windows
- Fixed hidden exception at fs.close() when opening an absent zip/tar file URL
- Fixed abstract class import from collections which would break on Python 3.8
- Fixed incorrect imports of mock on Python 3
- Removed some unused imports and unused requirements.txt file
- Added mypy checks to Travis
- Fixed missing errno.ENOTSUP on PyPy
- Fixed bug in a decorator that would trigger an AttributeError when a class was created that implemented a deprecated method and had no docstring of its own
Changed
- Entire test suite has been migrated to pytest
- Style checking is now enforced using flake8; this involved some code cleanup such as removing unused imports

[2.4.10]:
Fixed
- Fixed broken WrapFS.movedir

[2.4.9]:
Fixed
- Restored fs.path import
- Fixed potential race condition in makedirs
- Added missing methods to WrapFS
Changed
- MemFS now immediately releases all memory it holds when close() is called, rather than when it gets garbage collected
- FTPFS now translates EOFError into RemoteConnectionError
- Added automatic close for filesystems that go out of scope

[2.4.8]:
Changed
- geturl will return URL with user/password if needed @zmej-serow

[2.4.7]:
Added
- Flag to OSFS to disable env var expansion

[2.4.6]:
Added
- Implemented geturl in FTPFS @zmej-serow
Fixed
- Fixed FTP test suite when time is not UTC-0 @mrg0029
- Fixed issues with paths in tarfs PyFilesystem/pyfilesystem2#284
Changed
- Dropped Python 3.3 support

[2.4.5]:
Fixed
- Restored deprecated setfile method with deprecation warning to change to writefile
- Fixed exception when a tarfile contains a path called '.' PyFilesystem/pyfilesystem2#275
- Made TarFS directory loading lazy
Changed
- Detect case insensitivity by writing a temp file

[2.4.4]:
Fixed
- OSFS fail in nfs mounts

[2.4.3]:
Fixed
- Fixed broken "case_insensitive" check
- Fixed Windows test fails

[2.4.2]:
Fixed
- Fixed exception when Python runs with -OO

[2.4.1]:
Fixed
- Fixed hash method missing from WrapFS

[2.4.0]:
Added
- Added exclude and filter_dirs arguments to walk
- Micro-optimizations to walk

[2.3.1]:
Fixed
- Add encoding check in OSFS.validatepath

[2.3.0]:
Fixed
- IllegalBackReference had mangled error message
Added
- FS.hash method
I'm trying to open a 500 GB tar archive in PyFilesystem. It takes about 45 minutes to open (it needs to read through the entire file). The time it takes to open is the first issue I'm having. I'm able to open and stream files via the standard `tarfile` package, using `tarfile.open` and `mytar.next()`, with no delay.

It looks like `tarfs.py:275`:

`self._directory = OrderedDict((relpath(self._decode(info.name)).rstrip("/"), info) for info in self._tar)`

is doing the full read. It seems like this could be lazily initialized, or probably even better the `tarfile` package could be queried directly on demand rather than maintaining a full dictionary of objects in PyFilesystem.

After I get it to open, when I try to walk the files with `mytarfs.walk.files()` I encounter this exception:

It appears the assumption that 2 elements are returned by `parts(child)` is not always correct. I have tried to reproduce the issue with a small tar file, but I was unable to do so; it only occurs on my large 500 GB tar file. Unfortunately the time it takes to open it is hindering efforts to debug the issue, so for now I can only report it.
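A rough sketch of what "querying the tarfile package directly on demand" might look like; the archive and member names are placeholders, a missing name still costs a full pass over the archive, and unlike `getmember()` this returns the first occurrence of a duplicated name:

```python
import tarfile


def find_member(tar, name):
    """Scan only as far as needed; tarfile keeps headers it has already read,
    so repeated look-ups do not re-read earlier parts of the archive."""
    for member in tar:
        if member.name.rstrip("/") == name:
            return member
    return None


tar = tarfile.open("mytar.tar")            # placeholder archive name
info = find_member(tar, "some/file.txt")   # placeholder member name
```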