This repository has been archived by the owner on Mar 16, 2022. It is now read-only.
As noted in PacificBiosciences/FALCON#334, pwatcher uses os.listdir() to check for job-done files ("exit" files). That probably uses readdir(), so it should be fast as long as the number of files is < 100k.
However, there is a low limit in some filesystems on the number of sub-directories in a directory -- e.g. 32k in ext3. So the current pwatcher is near the limit. For genomes > 5GB, we will need to use 2 levels of directory naming, similar to object-file naming in git. E.g.
# ls ../.git/modules/pypeFLOW/objects/
00 09 12 1a 22 28 30 36 41 4b 54 5d 63 6d 73 7e 84 8b 93 9b a4 ab b1 ba c1 c7 d0 d7 e1 e8 f1 f9 info
01 0a 13 1b 23 2a 31 39 43 4c 55 5e 64 6e 76 7f 85 8c 94 9d a5 ac b2 bb c2 c8 d1 d8 e2 ea f2 fa pack
...
# ls ../.git/modules/pypeFLOW/objects/1f
373199fd8940c9338fb698d27a2ec9fa5622c5 f22aa0033b51faac5283ad37acc4c695887308 f37c0e9f149fac5bd6eed9636478c9ae063cf6
See the idea? With just 2 hex-digits, we have 1/256 as many files (or sub-dirs in our case) per directory. We could also use this for the heartbeat/exit files if necessary, but the problem I know we'll have is in the pwatcher/jobs directory, where every uniquely named job has its own directory (based on a checksum of the job-description).
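A minimal sketch of how the two-level naming could look for the jobs directory. This is not pwatcher's code: the function names, the choice of SHA-1 as the checksum, and the use of Python 3 (`exist_ok`) are all assumptions made for illustration.

```python
import hashlib
import os


def sharded_job_dir(root, job_description, prefix_len=2):
    """Return root/<first hex digits>/<remaining hex digits> for a job.

    Hypothetical helper: SHA-1 stands in for whatever checksum of the
    job description is actually used.
    """
    digest = hashlib.sha1(job_description.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:prefix_len], digest[prefix_len:])


def make_job_dir(root, job_description):
    """Create the sharded job directory (and its parent shard) if missing."""
    path = sharded_job_dir(root, job_description)
    os.makedirs(path, exist_ok=True)
    return path


def iter_job_dirs(root):
    """Yield every job directory by listing both levels."""
    for shard in sorted(os.listdir(root)):
        shard_path = os.path.join(root, shard)
        if not os.path.isdir(shard_path):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(shard_path)):
            yield os.path.join(shard_path, name)
```

With a 2-hex-digit prefix the top level holds at most 256 sub-directories, and each of those holds roughly 1/256 of the jobs, so the per-directory count stays far below a 32k limit until the total number of jobs reaches the millions.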
Not urgent, but I don't want to forget about this.
+1. It could also be useful to add some metadata files (a thin DB layer) that track where the files people might be interested in are located, so they can be found without depending on the file system.