diff --git a/README.md b/README.md
index 30878ca7c..b66d4064b 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,14 @@ eHive is a system for running computation pipelines on distributed computing res
 The name comes from the way pipelines are processed by a swarm of autonomous agents.

+> [!IMPORTANT]
+> As of eHive version 2.7.0, the preferred/default eHive meadow has changed from *LSF* to *SLURM*.
+>
+> Because our pipelines are transitioning to *SLURM*, all meadows other than *SLURM* and *Local* are deprecated, effective immediately, and are no longer supported.
+> For details about the schedulers, please see [Grid scheduler and Meadows](#grid-scheduler-and-meadows) below in this README.
+>
+> Please do not hesitate to contact us should this be a problem for you.
+
 Available documentation
 -----------------------

@@ -70,6 +78,9 @@ Grid scheduler and Meadows
 --------------------------

 eHive has a generic interface named _Meadow_ that describes how to interact with an underlying grid scheduler (submit jobs, query job's status, etc). eHive is compatible with
+[SLURM](https://slurm.schedmd.com).
+
+The following schedulers are deprecated and no longer supported as of eHive 2.7.0:
 [IBM Platform LSF](http://www-03.ibm.com/systems/spectrum-computing/products/lsf/),
 Sun Grid Engine (now known as Oracle Grid Engine),
 [HTCondor](https://research.cs.wisc.edu/htcondor/),
@@ -98,8 +109,8 @@ docker run -it ensemblorg/ensembl-hive beekeeper.pl -url $URL -loop -sleep 0.2
 docker run -it ensemblorg/ensembl-hive runWorker.pl -url $URL
 ```

-Docker Swarm
-------------
+Docker Swarm (DEPRECATED)
+-------------------------

 Once packaged into Docker images, a pipeline can actually be run under the
 Docker Swarm orchestrator, and thus on any cloud infrastructure that supports
diff --git a/docs/contrib/alternative_meadows.rst b/docs/contrib/alternative_meadows.rst
index 08a31b64a..298b351ef 100644
--- a/docs/contrib/alternative_meadows.rst
+++ b/docs/contrib/alternative_meadows.rst
@@ -19,33 +19,30 @@ LOCAL
     configuration found in the ``hive_config.json`` file and (ii) the
     *analysis_capacity* and *hive_capacity* mechanisms.

-LSF
-    A meadow that supports `IBM Platform LSF `__
+SLURM
+    A meadow that supports `SLURM `__.
     This meadow is extensively used by the Ensembl project and is
     regularly updated. It is fully implemented and supports workloads
     reaching thousands of parallel jobs.

-Other meadows have been contributed to the project, though sometimes not
-all the features are implemented. Being developed outside of the main
-codebase, they may be at times out of sync with the latest version of
-eHive. Nevertheless, several are continuously tested on `Travis CI
-`__ using single-machine Docker
-installations. Refer to the documentation or README in each of those
-repositories to know more about their compatibility and support.
+Other meadows - now deprecated - were contributed to the project in the past,
+though sometimes not all the features were implemented. Being developed
+outside of the main codebase, they could at times be out of sync with the
+latest version of eHive. These meadows are listed below for the record.
+
+LSF (Deprecated)
+    A meadow that supports `IBM Platform LSF `__. This meadow was extensively used by the Ensembl project until 2024. It was fully implemented and supported workloads reaching thousands of parallel jobs.

-SGE
+SGE (Deprecated)
     A meadow that supports Sun Grid Engine (now known as Oracle Grid Engine).
     Available for download on GitHub at `Ensembl/ensembl-hive-sge `__.

-HTCondor
+HTCondor (Deprecated)
     A meadow that supports `HTCondor `__.
     Available for download on GitHub at `Ensembl/ensembl-hive-htcondor `__.

-PBSPro
+PBSPro (Deprecated)
     A meadow that supports `PBS Pro `__.
     Available for download on GitHub at `Ensembl/ensembl-hive-pbspro `__.

-SLURM
-    A meadow that supports `Slurm `__. Available for download on GitHub at `tweep/ensembl-hive-slurm `__.
-
-DockerSwarm
+DockerSwarm (Deprecated)
     A meadow that can control and run on `Docker Swarm `__.
     Available for download on GitHub at `Ensembl/ensembl-hive-docker-swarm `__.

@@ -69,25 +66,25 @@ The table below lists the capabilities of each meadow, and whether they are avai
      - Yes
      - Partially implemented
      - Not available
-   * - LSF
+   * - LSF (Deprecated)
      - Yes
      - Yes
      - Yes
      - Yes
      - Yes
-   * - SGE
+   * - SGE (Deprecated)
      - Yes
      - Yes
      - Yes
      - Yes
      - Not implemented
-   * - HTCondor
+   * - HTCondor (Deprecated)
      - Yes
      - Yes
      - Yes
      - Yes
      - Not implemented
-   * - PBSPro
+   * - PBSPro (Deprecated)
      - Yes
      - Yes
      - Yes
@@ -99,7 +96,7 @@ The table below lists the capabilities of each meadow, and whether they are avai
      - Yes
      - Yes
      - Yes
-   * - DockerSwarm
+   * - DockerSwarm (Deprecated)
      - Yes
      - Yes
      - Not implemented
diff --git a/docs/creating_pipelines/meadows_and_resources.rst b/docs/creating_pipelines/meadows_and_resources.rst
index 2fc351bd5..49a9eda0b 100644
--- a/docs/creating_pipelines/meadows_and_resources.rst
+++ b/docs/creating_pipelines/meadows_and_resources.rst
@@ -23,7 +23,7 @@ order for a Meadow to be available, two conditions must be met:

 - The appropriate Meadow driver must be installed and accessible to Perl (e.g. in your $PERL5LIB).

-  - Meadow drivers for LSF and for the LOCAL Meadow are included with the eHive distribution. Other Meadow drivers are :ref:`available in their own repositories `.
+  - Meadow drivers for SLURM and for the LOCAL Meadow are included with the eHive distribution. Other Meadow drivers are :ref:`available in their own repositories `.

 - The Beekeeper must be running on a head node that can submit jobs managed by the corresponding job management engine.

@@ -57,14 +57,14 @@ The Resource Description is a data structure (in practice written as a
 perl hashref) that links Meadows to a job scheduler submission string
 for that Meadow. For example, the following data structure defines a
 Resource Class with a Resource Class Name '1Gb_job'. This Resource
-Class has a Resource Description for running under the LSF scheduler,
-and another description for running under the SGE scheduler:
+Class has a Resource Description for running under the SLURM scheduler,
+and another description for running under the LSF scheduler:

 .. code-block:: perl

     {
-       '1Gb_job' => { 'LSF' => '-M 1024 -R"select[mem>1024] rusage[mem=1024ma]"',
-                      'SGE' => '-l h_vmem=1G',
+       '1Gb_job' => { 'SLURM' => ' --time=1:00:00 --mem=1000m',
+                      'LSF' => '-M 1024 -R"select[mem>1024] rusage[mem=1024ma]"',
                     },
     }
diff --git a/docs/creating_pipelines/pipeconfigs.rst b/docs/creating_pipelines/pipeconfigs.rst
index 5fafb3858..583d94dfb 100644
--- a/docs/creating_pipelines/pipeconfigs.rst
+++ b/docs/creating_pipelines/pipeconfigs.rst
@@ -222,7 +222,7 @@ Resource classes for a pipeline are defined in a PipeConfig's resource_classes m

         return {
             %{$self->SUPER::resource_classes},
-            'high_memory' => { 'LSF' => '-C0 -M16000 -R"rusage[mem=16000]"' },
+            'high_memory' => { 'SLURM' => ' --time=7-00:00:00 --mem=100000m' },
         };
     }
diff --git a/docs/dev/development_guidelines.rst b/docs/dev/development_guidelines.rst
index 2202382d5..32915366a 100644
--- a/docs/dev/development_guidelines.rst
+++ b/docs/dev/development_guidelines.rst
@@ -232,19 +232,15 @@ Internal versioning
 eHive has a number of interfaces, that are mostly versioned. You can see
 them by running ``beekeeper.pl --versions``::

-    CodeVersion                             2.5
-    CompatibleHiveDatabaseSchemaVersion     92
-    CompatibleGuestLanguageCommunicationProtocolVersion     0.3
+    CodeVersion                             2.7.0
+    CompatibleHiveDatabaseSchemaVersion     96
     MeadowInterfaceVersion                  5
-    Meadow::DockerSwarm                     5.1     unavailable
-    Meadow::HTCondor                        5.0     unavailable
     Meadow::LOCAL                           5.0     available
     Meadow::LSF                             5.2     unavailable
-    Meadow::PBSPro                          5.1     unavailable
-    Meadow::SGE                             4.0     incompatible
-    GuestLanguageInterfaceVersion           3
-    GuestLanguage[python3]                  3.0     available
-    GuestLanguage[ruby]                     N/A     unavailable
+    Meadow::SLURM                           5.5     available
+    GuestLanguageInterfaceVersion           5
+    GuestLanguage[python3]                  5.0     available
+

 * *CodeVersion* is the software version (see how it is handled in the
   section below).
@@ -265,16 +261,16 @@ Releases, code branching and GIT

 There are three kinds of branches in eHive:

-* ``version/X.Y`` represent released versions of eHive. They are considered
+* ``version/X.Y.Z`` represent released versions of eHive. They are considered
   *stable*, i.e. are feature-frozen, and only receive bug-fixes. Schema
   changes are prohibited as it would break the database versioning
-  mechanism. Users on a given ``version/X.Y`` branch must be able to
+  mechanism. Users on a given ``version/X.Y.Z`` branch must be able to
   blindly update their checkout without risking breaking anything. It is
   forbidden to force push these branches (they are in fact marked as
   *protected* on Github).
 * ``main`` is the staging branch for the next stable release of eHive. It
   receives new features (incl. schema changes) until we decide to create a
-  new ``version/X.Y`` branch out of it. Like ``version/X.Y``, ``main`` is
+  new ``version/X.Y.Z`` branch out of it. Like ``version/X.Y.Z``, ``main`` is
   *protected* and cannot be force-pushed.
 * ``experimental/XXX`` are where *experimental* features are being
   developed. These branches can be created, removed or rebased at will. If
@@ -289,7 +285,10 @@ commits, some bugs have to be fixed differently on different branches. If
 that is the case, either fix the merge commit immediately, or do a merge
 for the sake of it (``git merge -s ours``) and then add the correct
 commits. Forcing merges to happen provides a clearer history and
-facilitates tools like ``git bisect``.
+facilitates tools like ``git bisect``.
+
+This is, however, the historical way of working: since eHive version 2.7, all
+older versions are deprecated and no longer maintained.

 Experimental branches should be rebased onto main just before the final
 merge (which then becomes a **fast-forward**). Together with the above
diff --git a/docs/running_pipelines/error-recovery.rst b/docs/running_pipelines/error-recovery.rst
index 090360009..ff18b5918 100644
--- a/docs/running_pipelines/error-recovery.rst
+++ b/docs/running_pipelines/error-recovery.rst
@@ -27,7 +27,7 @@ The recommended way to stop Workers is to set the analysis_capacity for Analyses

 ``tweak_pipeline.pl -url sqlite:///my_hive_db -SET analysis[logic_name].analysis_capacity=0``

-In some situations it may be necessary to take more drastic action to stop the Workers in a pipeline. In order to do this, you may need to find the underlying processes and kill them using the command appropriate for your scheduler (e.g. ``bkill`` for LSF, ``qdel`` for PBS-like systems) - or using the ``kill`` command if the Worker is running in the LOCAL meadow. It may help to look up the Worker's process IDs in the "worker" table:
+In some situations it may be necessary to take more drastic action to stop the Workers in a pipeline. To do this, you may need to find the underlying processes and kill them using the command appropriate for your scheduler (e.g. ``bkill`` for LSF, ``scancel`` for SLURM) - or using the ``kill`` command if the Worker is running in the LOCAL meadow. It may help to look up the Workers' process IDs in the "worker" table:

 ``db_cmd.pl -url sqlite://my_hive_db -sql 'SELECT process_id FROM worker WHERE status in ("JOB_LIFECYCLE", "SUBMITTED")'``
diff --git a/docs/running_pipelines/management.rst b/docs/running_pipelines/management.rst
index 355237d33..93bc1feb1 100644
--- a/docs/running_pipelines/management.rst
+++ b/docs/running_pipelines/management.rst
@@ -187,8 +187,7 @@ analyses as required to provide the appropriate memory usage steps, e.g.

 Relying on MEMLIMIT can be inconvenient at times:

-* The mechanism may not be available on all job schedulers (of the ones
-  eHive support, only LSF has that functionality).
+* The mechanism may not be available on all job schedulers.
 * When LSF kills the jobs, the open file handles and database connections
   are interrupted, potentially leading in corrupted data, and temporary
   files hanging around.
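
To illustrate the pattern used in the `meadows_and_resources.rst` and `pipeconfigs.rst` hunks above, here is a minimal sketch of a PipeConfig `resource_classes` method that declares several memory steps as SLURM Resource Descriptions. The class names, memory limits and run times are illustrative assumptions only, not values taken from the eHive distribution.

```perl
sub resource_classes {
    my ($self) = @_;

    return {
        # Keep whatever the parent PipeConfig already defines (typically 'default').
        %{$self->SUPER::resource_classes},

        # Hypothetical memory steps; adjust --mem and --time to the pipeline's needs.
        '1Gb_job'  => { 'SLURM' => ' --time=1:00:00 --mem=1000m' },
        '4Gb_job'  => { 'SLURM' => ' --time=12:00:00 --mem=4000m' },
        '16Gb_job' => { 'SLURM' => ' --time=1-00:00:00 --mem=16000m' },
    };
}
```

Each key is a Resource Class Name that analyses can refer to; each value maps a Meadow name to the submission options passed to that scheduler.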
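
Relatedly, the `db_cmd.pl` lookup shown in the `error-recovery.rst` hunk can be scripted. Below is a minimal Perl sketch, assuming a local SQLite hive database file named `my_hive_db`, the DBI and DBD::SQLite modules, and a SLURM meadow; it prints one `scancel` command per Worker the scheduler may still be running, so the list can be reviewed before anything is killed.

```perl
#!/usr/bin/env perl
# Sketch only: print scancel commands for Workers that may still be running.
# Assumes a SQLite hive database in the current directory named "my_hive_db".
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=my_hive_db', '', '', { RaiseError => 1 });

# Same query as in the error-recovery documentation above.
my $sth = $dbh->prepare(
    'SELECT process_id FROM worker WHERE status in ("JOB_LIFECYCLE", "SUBMITTED")'
);
$sth->execute();

while (my ($process_id) = $sth->fetchrow_array()) {
    print "scancel $process_id\n";    # pipe to a shell once the list looks right
}

$dbh->disconnect();
```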