Two levels of directory hierarchy #18

pb-cdunn · 2016-05-19T01:14:36Z

As noted in PacificBiosciences/FALCON#334, pwatcher uses os.listdir() to check for the existence of job-done files ("exit" files). That probably uses readdir(), so it should be fast as long as the number of files is < 100k.

http://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory
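To make the point about a single directory listing concrete, here is a minimal sketch of finding finished jobs with one os.listdir() call rather than one stat per job. The function name `finished_jobs` and the `exit-` filename prefix are assumptions for illustration, not necessarily what pwatcher does.

```python
import os


def finished_jobs(exit_dir, prefix="exit-"):
    """Return the set of job names that have an exit file in exit_dir.

    Assumption: each finished job leaves a file named '<prefix><jobname>'.
    The directory is read once, so the cost grows with the number of
    entries rather than with the number of jobs being polled.
    """
    names = os.listdir(exit_dir)
    return {name[len(prefix):] for name in names if name.startswith(prefix)}
```

A caller would intersect the returned set with the jobs it is waiting on; the listing itself stays cheap as long as the directory does not grow far beyond the sizes discussed above.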

However, there is a low limit in some filesystems on the number of sub-directories in a directory -- e.g. 32k in ext3. So the current pwatcher is near the limit. For genomes > 5GB, we will need to use 2 levels of directory naming, similar to object-file naming in git. E.g.

# ls ../.git/modules/pypeFLOW/objects/
00  09  12  1a  22  28  30  36  41  4b  54  5d  63  6d  73  7e  84  8b  93  9b  a4  ab  b1  ba  c1  c7  d0  d7  e1  e8  f1  f9  info
01  0a  13  1b  23  2a  31  39  43  4c  55  5e  64  6e  76  7f  85  8c  94  9d  a5  ac  b2  bb  c2  c8  d1  d8  e2  ea  f2  fa  pack
...
# ls ../.git/modules/pypeFLOW/objects/1f
373199fd8940c9338fb698d27a2ec9fa5622c5  f22aa0033b51faac5283ad37acc4c695887308  f37c0e9f149fac5bd6eed9636478c9ae063cf6

See the idea? With just 2 hex-digits, we have 1/256 as many files (or sub-dirs in our case) per directory. We could also use this for the heartbeat/exit files if necessary, but the problem I know we'll have is in the pwatcher/jobs directory, where every uniquely named job has its own directory (based on a checksum of the job-description).
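A minimal sketch of the git-style layout, assuming each job is identified by a hex checksum of its description. The helper names (`job_checksum`, `sharded_job_dir`, `make_job_dir`) and the choice of SHA-1 are hypothetical; pwatcher's actual checksum and directory names may differ.

```python
import hashlib
import os


def job_checksum(job_description):
    """Hex digest of the job description (SHA-1 is an assumption here)."""
    return hashlib.sha1(job_description.encode("utf-8")).hexdigest()


def sharded_job_dir(jobs_root, checksum, prefix_len=2):
    """Map a hex checksum to <jobs_root>/<first 2 hex digits>/<rest>.

    With prefix_len=2 there are at most 256 first-level directories,
    and each one holds roughly 1/256 of the job directories.
    """
    if len(checksum) <= prefix_len:
        raise ValueError("checksum too short to split")
    return os.path.join(jobs_root, checksum[:prefix_len], checksum[prefix_len:])


def make_job_dir(jobs_root, job_description):
    """Create (if needed) and return the two-level directory for a job."""
    path = sharded_job_dir(jobs_root, job_checksum(job_description))
    os.makedirs(path, exist_ok=True)
    return path
```

For example, `sharded_job_dir("pwatcher/jobs", "1f373199fd")` yields `pwatcher/jobs/1f/373199fd`, mirroring the `objects/1f/3731...` layout in the listing above.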

Not urgent, but I don't want to forget about this.

pb-jchin · 2016-05-19T01:58:47Z

+1. It could also be useful to add some metadata files (a thin DB layer) that track where the files people might be interested in are located, so they can be found without depending on the file system.

pb-cdunn · 2016-05-19T11:21:23Z

Separate issue. See #20.

pb-cdunn added the enhancement label May 19, 2016

pacbbbbot pushed a commit that referenced this issue Jan 16, 2018

Merge pull request #18 in SAT/pypeflow from TAG-1916-quoting-params t…

f974d9f

…o develop * commit '6942aeee628e7afd2ec1f00be14de225c86d4004': We must quote filenames, but not params in task scripts
