This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

Two levels of directory hierarchy #18

Open
pb-cdunn opened this issue May 19, 2016 · 2 comments

Comments

@pb-cdunn

As noted in PacificBiosciences/FALCON#334, pwatcher uses os.listdir() to check for job-done files ("exit" files). That probably uses readdir(), so it should be fast as long as the number of files is < 100k.

However, there is a low limit in some filesystems on the number of sub-directories in a directory -- e.g. 32k in ext3. So the current pwatcher is near the limit. For genomes > 5GB, we will need to use 2 levels of directory naming, similar to object-file naming in git. E.g.

# ls ../.git/modules/pypeFLOW/objects/
00  09  12  1a  22  28  30  36  41  4b  54  5d  63  6d  73  7e  84  8b  93  9b  a4  ab  b1  ba  c1  c7  d0  d7  e1  e8  f1  f9  info
01  0a  13  1b  23  2a  31  39  43  4c  55  5e  64  6e  76  7f  85  8c  94  9d  a5  ac  b2  bb  c2  c8  d1  d8  e2  ea  f2  fa  pack
...
# ls ../.git/modules/pypeFLOW/objects/1f
373199fd8940c9338fb698d27a2ec9fa5622c5  f22aa0033b51faac5283ad37acc4c695887308  f37c0e9f149fac5bd6eed9636478c9ae063cf6

See the idea? With just 2 hex-digits, we have 1/256 as many files (or sub-dirs in our case) per directory. We could also use this for the heartbeat/exit files if necessary, but the problem I know we'll have is in the pwatcher/jobs directory, where every uniquely named job has its own directory (based on a checksum of the job-description).
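A minimal sketch of the two-level layout, in case it helps the discussion. Everything here is an assumption for illustration: the function names `sharded_job_dir` and `list_job_dirs` are hypothetical, and SHA-1 stands in for whatever checksum pwatcher actually computes from the job description.

```python
import hashlib
import os


def sharded_job_dir(root, job_description, prefix_len=2):
    """Return root/<first hex digits>/<remaining hex digits> for a job.

    Assumption: the job id is a SHA-1 hex digest of the job description.
    With prefix_len=2 there are at most 256 first-level directories.
    """
    digest = hashlib.sha1(job_description.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:prefix_len], digest[prefix_len:])


def list_job_dirs(root):
    """Yield every job directory by scanning both levels with os.listdir()."""
    for shard in sorted(os.listdir(root)):
        shard_path = os.path.join(root, shard)
        if not os.path.isdir(shard_path):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(shard_path)):
            yield os.path.join(shard_path, name)
```

A caller would do something like `os.makedirs(sharded_job_dir("pwatcher/jobs", description))` when creating a job. Scanning then costs one os.listdir() on the top level plus one per shard, which is the trade-off for keeping each directory small.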

Not urgent, but I don't want to forget about this.

@pb-jchin
Contributor

+1. It could also be useful to add some metadata files (a thin DB layer) that track where the files people are interested in are located, so they can be found without depending on the file system.

@pb-cdunn
Author

pb-cdunn commented May 19, 2016

Separate issue. See #20.

pacbbbbot pushed a commit that referenced this issue Jan 16, 2018
…o develop

* commit '6942aeee628e7afd2ec1f00be14de225c86d4004':
  We must quote filenames, but not params in task scripts