Container is unfriendly to htcondor #65

Open

multiduplikator opened this issue Dec 21, 2021 · 0 comments
multiduplikator commented Dec 21, 2021

Is your feature request related to a problem? Please describe.
The singularity container is sensitive to having the user's home environment available/bound when it is instantiated. Since most htcondor environments call singularity with -C --no-home, this leads to complications, e.g. with the python egg cache.
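
To illustrate what "more self contained" could look like (a minimal sketch, not the current image content): if the image's own startup environment exported a fallback egg cache location, the host-side SINGULARITYENV_PYTHON_EGG_CACHE export shown further below would become unnecessary.

# hedged sketch: a fallback baked into the image, e.g. in the startup script or
# an /etc/profile.d snippet, so a missing home directory no longer matters
export PYTHON_EGG_CACHE="${PYTHON_EGG_CACHE:-/tmp}"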

Describe the solution you'd like
I would like the container to be more "self-contained", so that it relies as little as possible on the host environment when called. The reason is that we usually do not know which node in a cluster will execute a given bidsonym job. In addition, the entry point definition complicates matters, since it calls a startup script with command line parameters rather than an executable in a fully prepared environment (see the sketch below). Lastly, we can assume that all nodes in an htcondor cluster see a shared file system, so data transfer within jobs is reduced to a minimum.
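
A hypothetical entry point wrapper (only a sketch; it assumes the environment normally prepared by /neurodocker/startup.sh could be baked into the image or sourced here) would make the container callable without knowing about the startup script at all:

#!/bin/bash
# hypothetical entry point sketch (not part of the current image): assuming the
# environment that /neurodocker/startup.sh normally prepares is already baked
# into the image, simply hand all arguments over to bidsonym
set -euo pipefail
exec bidsonym "$@"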

Describe alternatives you've considered
Let's start with a fully working singularity call from the host, where /srv/home is bind mounted on all nodes. The dataset is in /srv/home/user/test/BIDS. Then we can successfully call

/opt/singularity/bin/singularity run -C \
	--bind /srv/home/user/test/BIDS:/BIDS \
	/opt/singularity/images/bidsonym-0.0.5.sif \
	/BIDS \
	participant \
	--participant_label 3 \
	--deid pydeface \
	--brainextraction bet \
	--bet_frac 0.5 \
	--del_meta 'InstitutionAddress' 'AcquisitionTime' 'ProcedureStepDescription' 'InstitutionalDepartmentName' 'InstitutionName' 'SeriesDescription' \
	--skip_bids_validation

Now let's change the call to simulate an htcondor-like call. We eliminate binding home and bring in the temporary directories available locally on each node. To prevent the container from crashing, we need to explicitly tell singularity where to set up the python egg cache (this is one annoyance that should be fixed).

export SINGULARITYENV_PYTHON_EGG_CACHE=/tmp
/opt/singularity/bin/singularity run --no-home -C \
	-S /tmp -S /var/tmp \
	--bind /srv/home/user/test/other --pwd /srv/home/user/test/other \
	--bind /srv/home/user/test/BIDS:/BIDS \
	/opt/singularity/images/bidsonym-0.0.5.sif \
	/BIDS \
	participant \
	--participant_label 3 \
	--deid pydeface \
	--brainextraction bet \
	--bet_frac 0.5 \
	--del_meta 'InstitutionAddress' 'AcquisitionTime' 'ProcedureStepDescription' 'InstitutionalDepartmentName' 'InstitutionName' 'SeriesDescription' \
	--skip_bids_validation

The above call would work on an htcondor cluster if we were able to have user-defined singularity bind mounts (in addition to user-definable singularity images). In general, htcondor clusters are not set up for that, but it can be done by configuring htcondor with

SINGULARITY = /opt/singularity/bin/singularity
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = strcat("/opt/singularity/images/",TARGET.SingularityImage)
SINGULARITY_BIND_BASE = ""
SINGULARITY_BIND_EXPR = ifThenElse(isUndefined(TARGET.SingularityBind),$(SINGULARITY_BIND_BASE),TARGET.SingularityBind)

and then introducing the following in the job description files

+SingularityBind = "/srv/home/user/test/BIDS:/BIDS"

This gives a job description file that executes well on the cluster.

universe                = vanilla
request_memory          = 1G
request_cpus            = 1
environment             = "SINGULARITYENV_PYTHON_EGG_CACHE=/tmp"
+SingularityImage       = "bidsonym-0.0.5.sif"
+SingularityBind        = "/srv/home/user/test/BIDS:/BIDS"
requirements            = HasSingularity
executable              = /neurodocker/startup.sh
transfer_executable     = FALSE
should_transfer_files   = NO

arguments               = "bidsonym /BIDS participant --participant_label 3 --deid pydeface --brainextraction bet --bet_frac 0.5 --del_meta 'InstitutionAddress' 'AcquisitionTime' 'ProcedureStepDescription' 'InstitutionalDepartmentName' 'InstitutionName' 'SeriesDescription' --skip_bids_validation"

log                     = $(Cluster)-$(Process)-r01.log
output                  = $(Cluster)-$(Process)-r01.out
error                   = $(Cluster)-$(Process)-r01.err

queue

As can be seen in the job description above, we have to specify the entry point as the executable and then pass the parameters in the arguments section. This leads to a problem where relative paths for the data directory no longer work properly, so we have to specify absolute paths. I am not sure why this is; it would be great to get this fixed.
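
To narrow this down, a small diagnostic can help (a sketch only; it assumes /neurodocker/startup.sh simply prepares the environment and then executes whatever arguments it is given, which the job descriptions here rely on anyway): print the working directory the container actually sees under the htcondor-like options.

# hedged diagnostic sketch: reproduce the htcondor-like call, but only print the
# working directory inside the container to see what relative paths resolve against
/opt/singularity/bin/singularity exec --no-home -C \
	-S /tmp -S /var/tmp \
	--bind /srv/home/user/test/BIDS:/BIDS \
	/opt/singularity/images/bidsonym-0.0.5.sif \
	/neurodocker/startup.sh pwd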

To get rid of the +SingularityBind hack and stay much closer to a standard htcondor setup, we need to change a few things around. To simulate the cluster call, we can do

export SINGULARITYENV_PYTHON_EGG_CACHE=/tmp
/opt/singularity/bin/singularity run --no-home -C \
	-S /tmp -S /var/tmp \
	--bind /srv/home/user/test --pwd /srv/home/user/test \
	/opt/singularity/images/bidsonym-0.0.5.sif \
	/srv/home/user/test/BIDS \
	participant \
	--participant_label 3 \
	--deid pydeface \
	--brainextraction bet \
	--bet_frac 0.5 \
	--del_meta 'InstitutionAddress' 'AcquisitionTime' 'ProcedureStepDescription' 'InstitutionalDepartmentName' 'InstitutionName' 'SeriesDescription' \
	--skip_bids_validation

Note how we have to specify an absolute path for the data directory above. This translates into a job description like this

universe                = vanilla
request_memory          = 1G
request_cpus            = 1
environment             = "SINGULARITYENV_PYTHON_EGG_CACHE=/tmp"
+SingularityImage       = "bidsonym-0.0.5.sif"
requirements            = HasSingularity
executable              = /neurodocker/startup.sh
transfer_executable     = FALSE
should_transfer_files   = NO
initial_dir             = /srv/home/user/test

arguments               = "bidsonym /srv/home/user/test/BIDS participant --participant_label 3 --deid pydeface --brainextraction bet --bet_frac 0.5 --del_meta 'InstitutionAddress' 'AcquisitionTime' 'ProcedureStepDescription' 'InstitutionalDepartmentName' 'InstitutionName' 'SeriesDescription' --skip_bids_validation"

log                     = $(Cluster)-$(Process)-r01.log
output                  = $(Cluster)-$(Process)-r01.out
error                   = $(Cluster)-$(Process)-r01.err

queue

Interestingly, when using the above job description (or the simulated call), we end up with some additional directories and files that are not there when calling without the --no-home option: brainextraction_wf, deface_wf, and report_wf.

Apparently, these directories contain an html report that tries to pull some of its data from /tmp, for example like this in graph1.json

{
    "groups": [
        {
            "name": "Group_00001",
            "procs": [
                0
            ],
            "total": 1
        }
    ],
    "links": [],
    "maxN": 1,
    "nodes": [
        {
            "group": 1,
            "name": "0_bet",
            "report": "/tmp/tmpirc1_uju/brainextraction_wf/bet/_report/report.rst",
            "result": "/tmp/tmpirc1_uju/brainextraction_wf/bet/result_bet.pklz"
        }
    ]
}

Since we have to assume that /tmp is mounted locally on each cluster node and its content is deleted regularly, this html report will almost always break and will not last long. I do not see a way to influence the location of these reports; on the other hand, they do not seem to be vital. It would be great if this could be controlled (see the sketch below).
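
One possible handle (a hedged guess only): the tmpirc1_uju pattern looks like it comes from Python's tempfile module, which honours the TMPDIR environment variable, so pointing TMPDIR at a shared location might move these working directories and report targets somewhere durable. The path below is just an example and would have to exist and be bind mounted.

# hedged workaround sketch: redirect the temporary workflow directory to a
# shared location (example path), assuming it is created via Python's tempfile
export SINGULARITYENV_TMPDIR=/srv/home/user/test/work

or, in the job description file,

environment             = "SINGULARITYENV_PYTHON_EGG_CACHE=/tmp SINGULARITYENV_TMPDIR=/srv/home/user/test/work"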

Additional context
Long story short, it would be good if we could make the container more self-contained, give it a more concise entry point, and have a better way to control where output goes, especially when we want to exploit parallel execution on an htcondor cluster.

Happy to discuss and debug.

multiduplikator added the enhancement and help wanted labels on Dec 21, 2021