This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

DB to track files? #20

Open
pb-cdunn opened this issue May 19, 2016 · 1 comment

Comments

@pb-cdunn

@pb-jchin wrote:

If we can add some metadata files (thin layer DB) to track where the files that people might be interested to figure out where they are without depending on file system, it could be useful.

There is pwatcher/state.py, but maybe you really want a forward link from the run-dir into pwatcher.

@pb-cdunn
Author

The convention I'm moving toward is to have only a single task per task-directory. The run-directory is inferred as the directory of the script. This has 2 advantages:

  1. We can easily copy the entire directory to /tmp, which makes running in /tmp a simple configuration.
  2. The directory can have a symlink into pwatcher/jobs/UNIQUE_JOB_ID.

Let's say you have a directory like this:

PATH/myjob/input.a
PATH/myjob/input.b
PATH/myjob/run.sh

For (1), the wrapper creates this:

/tmp/myjob/input.a -> symlink to PATH/myjob/input.a
/tmp/myjob/input.b -> symlink to PATH/myjob/input.b
/tmp/myjob/run.sh -> symlink to PATH/myjob/run.sh

Then it runs our script. And when done, the wrapper copies everything that is not a symlink back to PATH/myjob. Thus, all outputs are written into /tmp/myjob and eventually copied back to NFS. All we need is to use relative paths for output files, again by convention. I have this basically implemented, but not yet testable. When the task cannot use relative paths, the new PypeTask will let us mark the task as non-copyable, so that task will never run in /tmp.
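As a rough illustration of that wrapper (not the actual implementation, which is not shown in this thread), the symlink, run, and copy-back steps might look like the sketch below. The function name `run_in_tmp` and its parameters are hypothetical, and it assumes Python 3.8+ and that the script is run with `/bin/bash`.

```python
import os
import shutil
import subprocess
import tempfile


def run_in_tmp(task_dir, script="run.sh", tmp_root=None):
    """Run task_dir/script from a scratch directory, then copy outputs back.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    task_dir = os.path.abspath(task_dir)
    tmp_root = tmp_root or tempfile.gettempdir()
    work_dir = os.path.join(tmp_root, os.path.basename(task_dir))
    os.makedirs(work_dir)  # raises if the scratch directory already exists

    # Mirror every entry of the task directory as a symlink.
    for name in os.listdir(task_dir):
        os.symlink(os.path.join(task_dir, name), os.path.join(work_dir, name))

    # Run with cwd=work_dir so that relative output paths land in scratch.
    rc = subprocess.call(["/bin/bash", script], cwd=work_dir)

    # Anything that is not a symlink was created by the script: copy it back.
    for name in os.listdir(work_dir):
        src = os.path.join(work_dir, name)
        if os.path.islink(src):
            continue
        dst = os.path.join(task_dir, name)
        if os.path.isdir(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)

    shutil.rmtree(work_dir)  # removes the symlinks, not their targets
    return rc
```

This only works when the script writes its outputs via relative paths, which is exactly the convention described above.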

For (2), we simply add a symlink like this:

PATH/myjob/pwatcher.dir -> PWATCHER/jobs/1f/0a4b67e3/

This part is already implemented, except that the symlink today points to PWATCHER/jobs/1f0a4b67e3 in a flat directory. It's easy to point instead to the new, 2-level nested directory. The reason why we need only 1 per directory is that with N jobs per directory, we would need the symlinks to be named after the jobid, which gets ugly.
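A minimal sketch of the 2-level layout, assuming the first two characters of the jobid become the outer directory (as in the `1f/0a4b67e3` example). The helper names `nested_job_dir` and `link_task_to_job` are made up for illustration and are not pwatcher functions.

```python
import os


def nested_job_dir(pwatcher_dir, jobid, prefix_len=2):
    """Map a jobid such as '1f0a4b67e3' to PWATCHER/jobs/1f/0a4b67e3."""
    return os.path.join(pwatcher_dir, "jobs", jobid[:prefix_len], jobid[prefix_len:])


def link_task_to_job(task_dir, pwatcher_dir, jobid, link_name="pwatcher.dir"):
    """Create (or replace) the single pwatcher.dir symlink in a task directory."""
    job_dir = nested_job_dir(pwatcher_dir, jobid)
    os.makedirs(job_dir, exist_ok=True)
    link = os.path.join(task_dir, link_name)
    if os.path.lexists(link):
        os.unlink(link)  # one task per directory, so one link to replace
    os.symlink(job_dir, link)
    return link
```

Because there is only one task per directory, the link name can stay fixed instead of being derived from the jobid.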

Currently, the contents of PWATCHER/jobs/1f0a4b67e3 are exactly 2 files: stdout and stderr. That's usually all you need, but since the jobid is encoded into the symlink, you can easily find the exit, heartbeat, and bash-wrapper files in parallel directories. It's easier to see than to explain.

$ ls mypwatcher/
exits  heartbeats  jobs  state.py  wrappers
$ ls mypwatcher/jobs
J0aaef15a4a19a1293ffc4111f962fc46e01157888ddeee2863d40b94255b63a5  J7af4926326d614413d9f80e5e6b3f523fa31d8288c63ea845a67ff5aefd52461
...
$ ls mypwatcher/wrappers/
run-J0aaef15a4a19a1293ffc4111f962fc46e01157888ddeee2863d40b94255b63a5.bash  run-J7af4926326d614413d9f80e5e6b3f523fa31d8288c63ea845a67ff5aefd52461.bash
...
$ ls mypwatcher/exits/
exit-J0aaef15a4a19a1293ffc4111f962fc46e01157888ddeee2863d40b94255b63a5  exit-J7af4926326d614413d9f80e5e6b3f523fa31d8288c63ea845a67ff5aefd52461
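To make the "parallel directories" idea concrete, here is a small sketch that starts from a task directory and derives the related paths from the jobid in the symlink target. It assumes the flat `jobs/JOBID` layout and the `exit-JOBID` and `run-JOBID.bash` naming visible in the listing above. Heartbeat file names are not shown in this thread, so they are left out, and the function name `job_files` is hypothetical.

```python
import os


def job_files(task_dir, link_name="pwatcher.dir"):
    """Locate pwatcher files for the one task in task_dir (flat jobs/ layout)."""
    job_dir = os.path.realpath(os.path.join(task_dir, link_name))
    jobid = os.path.basename(job_dir)
    # job_dir is PWATCHER/jobs/JOBID, so PWATCHER is two levels up.
    pwatcher_dir = os.path.dirname(os.path.dirname(job_dir))
    return {
        "jobid": jobid,
        "stdout": os.path.join(job_dir, "stdout"),
        "stderr": os.path.join(job_dir, "stderr"),
        "exit": os.path.join(pwatcher_dir, "exits", "exit-" + jobid),
        "wrapper": os.path.join(pwatcher_dir, "wrappers", "run-" + jobid + ".bash"),
    }
```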

So we have 2 good reasons to have only 1 task per directory. That's not a difficult restriction to follow.

(Note that for (1), we would also sometimes like to copy the input into /tmp, as dgordon has pointed out. I don't have a solution for that yet. It will be inelegant but easy, probably a new field on the PypeTask. This gets complicated because we want the option to copy such inputs into /dev/shm, based on dgordon's testing. In that case we must re-use the copy for multiple jobs on the same machine to avoid running out of RAM, which requires both locking and a complicated reference-counting system for cleaning up later. dgordon provided a simple solution, but it's not really enough.)
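To show why the shared-copy case needs both locking and reference counting, here is a deliberately naive sketch. It is neither dgordon's solution nor anything in pypeflow: `acquire` and `release` are hypothetical names, the lock is an advisory `fcntl.flock` on a file in the cache directory (Unix only), and the count lives in a sidecar `.refs` file.

```python
import fcntl
import os
import shutil
from contextlib import contextmanager


@contextmanager
def _locked(cache_dir):
    """Hold an exclusive advisory lock on cache_dir/.lock."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, ".lock"), "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


def _read_count(refs_path):
    """Return the stored reference count, or 0 if there is none yet."""
    try:
        with open(refs_path) as f:
            return int(f.read() or 0)
    except FileNotFoundError:
        return 0


def acquire(src, cache_dir):
    """Copy src into cache_dir on first use; return the shared copy's path."""
    dst = os.path.join(cache_dir, os.path.basename(src))
    with _locked(cache_dir):
        count = _read_count(dst + ".refs")
        if count == 0:
            shutil.copy2(src, dst)
        with open(dst + ".refs", "w") as f:
            f.write(str(count + 1))
    return dst


def release(src, cache_dir):
    """Drop one reference; delete the shared copy when nobody uses it."""
    dst = os.path.join(cache_dir, os.path.basename(src))
    with _locked(cache_dir):
        count = _read_count(dst + ".refs") - 1
        if count <= 0:
            for path in (dst, dst + ".refs"):
                if os.path.exists(path):
                    os.remove(path)
        else:
            with open(dst + ".refs", "w") as f:
                f.write(str(count))
```

The sketch also shows where the complexity hides: a job that dies before calling `release` leaks the copy, and two inputs with the same basename would collide. A real solution has to handle both.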

The bottom line is that, by following simple conventions, we don't really need a DB. (There is a DB for pwatcher, but that's different.) Observe:

$ ls -trd run-*
run-bam2fasta  run-fasta2referenceset  run-pbalign-00  run-pbalign_gather  run-gc-01  run-gc-gather
run-falcon     run-pbalign-scatter     run-pbalign-01  run-gc_scatter      run-gc-00  run-polished-assembly-report


$ ls -1trgG run-*
run-bam2fasta:
total 120
-rw-rw-r-- 1    241 May 18 17:41 run_bam2fasta.sh
lrwxrwxrwx 1    242 May 18 17:41 pwatcher.dir -> /home/UNIXHOME/cdunn/repo/pb/smrtanalysis-client/smrtanalysis/siv/testkit-jobs/sa3_pipelines/hgap5_fake/synth5k/job_output/tasks/falcon_ns.tasks.task_hgap_run-0/mypwatcher/jobs/J71c0fd091f2bce465a8cdab4debe4e053b0ef6ba8aeb88e808f46daebb761950
-rw-rw-r-- 1   1798 May 18 17:41 filtered.subreadset.xml
-rw-rw-r-- 1 102740 May 18 17:41 input.fasta


run-falcon:
total 316
-rw-rw-r-- 1    187 May 18 17:41 raw_reads.fofn
-rw-rw-r-- 1    893 May 18 17:41 fc.cfg
-rw-rw-r-- 1   1060 May 18 17:41 fc.json
-rw-rw-r-- 1    414 May 18 17:41 run_falcon.sh
lrwxrwxrwx 1    242 May 18 17:41 pwatcher.dir -> /home/UNIXHOME/cdunn/repo/pb/smrtanalysis-client/smrtanalysis/siv/testkit-jobs/sa3_pipelines/hgap5_fake/synth5k/job_output/tasks/falcon_ns.tasks.task_hgap_run-0/mypwatcher/jobs/J18f9e040a90143f08dec1ca0f85ca2c4a763a3c0dc3dae1b7b63342f18adef53
drwxrwxr-x 2   4096 May 18 17:41 scripts
drwxrwxr-x 2   4096 May 18 17:41 sge_log
drwxrwxr-x 6   4096 May 18 17:41 mypwatcher
drwxrwxr-x 5   4096 May 18 17:42 0-rawreads
-rw-rw-r-- 1   1013 May 18 17:42 fc.log
drwxrwxr-x 4   4096 May 18 17:42 1-preads_ovl
drwxrwxr-x 2   4096 May 18 17:42 2-asm-falcon
-rw-rw-r-- 1 267790 May 18 17:42 pypeflow.log
lrwxrwxrwx 1     21 May 18 17:42 asm.fasta -> 2-asm-falcon/p_ctg.fa
lrwxrwxrwx 1     30 May 18 17:42 preads.fofn -> 1-preads_ovl/input_preads.fofn
-rw-rw-r-- 1     26 May 18 17:42 asm.fasta.fai

run-fasta2referenceset:
total 8
...

run-polished-assembly-report:
total 116
-rw-rw-r-- 1   899 May 18 17:43 run_report.sh
lrwxrwxrwx 1   242 May 18 17:43 pwatcher.dir -> /home/UNIXHOME/cdunn/repo/pb/smrtanalysis-client/smrtanalysis/siv/testkit-jobs/sa3_pipelines/hgap5_fake/synth5k/job_output/tasks/falcon_ns.tasks.task_hgap_run-0/mypwatcher/jobs/J393c6e14ce6037183c55e2c3ddac9c52938b0c43e95e73fe5c89e78fa8fb6e8e
-rw-rw-r-- 1 76169 May 18 17:43 alignment.summary.gff
-rw-rw-r-- 1 13300 May 18 17:43 polished_coverage_vs_quality.png
-rw-rw-r-- 1  2974 May 18 17:43 polished_coverage_vs_quality_thumb.png
-rw-rw-r-- 1  1359 May 18 17:43 polished_assembly_report.json
-rw-rw-r-- 1    69 May 18 17:43 polished_coverage_vs_quality.csv
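The same information can be collected programmatically without any DB. The sketch below simply scans the `run-*` directories and reads each `pwatcher.dir` link. The function name `index_tasks` is hypothetical, and the glob pattern assumes the naming shown in the listing above.

```python
import glob
import os


def index_tasks(root, pattern="run-*", link_name="pwatcher.dir"):
    """Map each task directory name to the target of its pwatcher.dir symlink."""
    index = {}
    for task_dir in sorted(glob.glob(os.path.join(root, pattern))):
        link = os.path.join(task_dir, link_name)
        if os.path.islink(link):
            index[os.path.basename(task_dir)] = os.readlink(link)
    return index


if __name__ == "__main__":
    for task, job_dir in index_tasks(".").items():
        print(task, "->", job_dir)
```

Task directories without a `pwatcher.dir` link are skipped rather than reported as errors.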

pb-cdunn pushed a commit that referenced this issue Mar 12, 2018
…-PATH to develop

* commit 'd45f554a5a4d09606881380b05dece0bd0085b77':
  use_tmpdir=False for local Task
  Use Dist for NPROC/MB; allow sge_option per task for fs_based; support local dist
  Stop telling about heartbeats (unused)
  Catch bad job_queue usage early (spaces are for blocking pwatcher)
  Moved gen_task a bit
  Add /bin to PATH in run.sh
  /bin/bash, since /bin might not be in $PATH