diff --git a/notebooks/3-hpc-allocation.ipynb b/notebooks/3-hpc-allocation.ipynb index 4681ecd3..a32f656a 100644 --- a/notebooks/3-hpc-allocation.ipynb +++ b/notebooks/3-hpc-allocation.ipynb @@ -5,7 +5,46 @@ "id": "87c3425d-5abe-4e0b-a948-e371808c322c", "metadata": {}, "source": [ - "# HPC Allocation" + "# HPC Allocation Mode\n", + "In contrast to the [HPC Submission Mode](), which submits individual Python functions to HPC job schedulers, the HPC Allocation Mode takes a given allocation of the HPC job scheduler and executes Python functions with the resources available in this allocation. In this regard it is similar to the [Local Mode](), as it communicates with the individual Python processes using the [zero message queue](https://zeromq.org/); still, it is more advanced, as it can access the computational resources of all compute nodes of the given HPC allocation and also provides the option to assign GPUs as accelerators for parallel execution.\n", + "\n", + "Available Functionality: \n", + "* Submit Python functions with the [submit() function or the map() function]().\n", + "* Support for parallel execution, either using the [message passing interface (MPI)](), [thread based parallelism](), or by [assigning dedicated GPUs]() to selected Python functions. All these resource assignments are handled via the [resource dictionary parameter resource_dict]().\n", + "* Performance optimization features, such as [block allocation](), [dependency resolution]() and [caching]().\n", + "\n", + "The only parameter the user has to change is the `backend` parameter. " ] }, { + "cell_type": "markdown", + "id": "8c788b9f-6b54-4ce0-a864-4526b7f6f170", + "metadata": {}, + "source": [ + "## SLURM\n", + "With the [Simple Linux Utility for Resource Management (SLURM)](https://slurm.schedmd.com/) currently being the most commonly used job scheduler, executorlib provides an interface to submit Python functions to SLURM. Internally, this is based on the [srun](https://slurm.schedmd.com/srun.html) command of the SLURM scheduler, which creates job steps in a given allocation. Given that all resource requests in SLURM are communicated via a central database, a large number of submitted Python functions and resulting job steps can slow down the performance of SLURM. To address this limitation, it is recommended to install the hierarchical job scheduler [flux](https://flux-framework.org/) in addition to SLURM and to use flux for distributing the resources within a given allocation. This configuration is discussed in more detail below in the section [SLURM with flux]()." ] }, { + "cell_type": "code", + "execution_count": 1, + "id": "133b751f-0925-4d11-99f0-3f8dd9360b54", + "metadata": {}, + "outputs": [], + "source": [ + "from executorlib import Executor" ] }, { + "cell_type": "markdown", + "id": "9b74944e-2ccd-4cb0-860a-d876310ea870", + "metadata": {}, + "source": [ + "```python\n", + "with Executor(backend=\"slurm_allocation\") as exe:\n", + " future = exe.submit(sum, [1, 1])\n", + " print(future.result())\n", + "```" ] }, {
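The `slurm_allocation` backend expects the Python program to run inside an existing SLURM allocation, for example one created with [sbatch](https://slurm.schedmd.com/sbatch.html). A minimal submission script could look like the following sketch; the job name, the requested resources and the script name `workflow.py` are illustrative placeholders, not requirements of executorlib:
```
#!/bin/bash
#SBATCH --job-name=executorlib-demo
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:30:00

# workflow.py contains the Executor(backend="slurm_allocation") example shown above
python workflow.py
```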
@@ -13,7 +52,14 @@ "id": "36e2d68a-f093-4082-933a-d95bfe7a60c6", "metadata": {}, "source": [ - "## Flux " + "## SLURM with Flux \n", + "As discussed in the installation section, it is important to select the [flux](https://flux-framework.org/) version compatible with the installation of a given HPC cluster. Which GPUs are available? Who manufactured these GPUs? Does the HPC use [mpich](https://www.mpich.org/) or [OpenMPI](https://www.open-mpi.org/) or one of their commercial counterparts like Cray MPI or Intel MPI? Depending on the configuration, different installation options can be chosen, as explained in the [installation section](). \n", + "\n", + "Afterwards flux can be started in an [sbatch](https://slurm.schedmd.com/sbatch.html) submission script using:\n", + "```\n", + "srun flux start python <script.py>\n", + "```\n", + "In this Python script `<script.py>` the `\"flux_allocation\"` backend can be used." ] }, { @@ -21,338 +67,341 @@ "id": "68be70c3-af18-4165-862d-7022d35bf9e4", "metadata": {}, "source": [ - "### Resource Assignment" + "### Resource Assignment\n", + "Independent of the selected backend ([local mode](), [HPC submission mode]() or HPC allocation mode), the assignment of the computational resources remains the same. The resources can either be specified in the `submit()` function by adding the resource dictionary parameter [resource_dict](), or alternatively during the initialization of the `Executor` class by adding the same [resource_dict]() parameter there. \n", + "\n", + "This functionality of executorlib is commonly used to rewrite individual Python functions to use MPI while the rest of the Python program remains serial." ] }, { "cell_type": "code", - "execution_count": 1, - "id": "4839ef29-48f5-48f3-a3fd-0c337c6683a3", + "execution_count": 2, + "id": "8a2c08df-cfea-4783-ace6-68fcd8ebd330", "metadata": {}, "outputs": [], "source": [ - "from executorlib import Executor" + "def calc_mpi(i):\n", + " from mpi4py import MPI\n", + "\n", + " size = MPI.COMM_WORLD.Get_size()\n", + " rank = MPI.COMM_WORLD.Get_rank()\n", + " return i, size, rank" ] }, { "cell_type": "markdown", "id": "715e0c00-7b17-40bb-bd55-b0e097bfef07", "metadata": {}, "source": [ + "Depending on the choice of MPI version, it is recommended to specify the PMI standard which [flux](https://flux-framework.org/) should use internally for the resource assignment. For example, for OpenMPI >=5 `\"pmix\"` is the recommended PMI standard." ] }, { "cell_type": "code", - "execution_count": 2, - "id": "8a2c08df-cfea-4783-ace6-68fcd8ebd330", + "execution_count": 3, + "id": "5802c7d7-9560-4909-9d30-a915a91ac0a1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[4, 6, 8]\n", - "CPU times: user 44.4 ms, sys: 16.4 ms, total: 60.7 ms\n", - "Wall time: 1.09 s\n" + "[(3, 2, 0), (3, 2, 1)]\n" ] } ], "source": [ - "%%time\n", "with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\") as exe:\n", - " future_lst = [exe.submit(sum, [i, i]) for i in range(2, 5)]\n", - " print([f.result() for f in future_lst])" + " fs = exe.submit(calc_mpi, 3, resource_dict={\"cores\": 2})\n", + " print(fs.result())" ] }, { - "cell_type": "code", - "execution_count": 3, - "id": "bd26d97b-46fd-4786-9ad1-1e534b31bf36", + "cell_type": "markdown", + "id": "da862425-08b6-4ced-999f-89a74e85f410", "metadata": {}, - "outputs": [], "source": [ - "def add_funct(a, b):\n", - " return a + b" + "### Block Allocation\n", + "The block allocation for the HPC allocation mode follows the same implementation as the [block allocation for the local mode](). It starts by defining the initialization function `init_function()`, which returns a dictionary that is internally used to look up input parameters for Python functions submitted to the `Executor` class. 
Commonly this functionality is used to store large data objects inside the Python process created for the block allocation, rather than reloading these Python objects for each submitted function. " ] }, { "cell_type": "code", "execution_count": 4, - "id": "1a2d440f-3cfc-4ff2-b74d-e21823c65f69", + "id": "cdc742c0-35f7-47ff-88c0-1b0dbeabe51b", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "7\n" - ] - } - ], + "outputs": [], "source": [ - "with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\") as exe:\n", - " future = None\n", - " for i in range(1, 4):\n", - " if future is None:\n", - " future = exe.submit(add_funct, i, i)\n", - " else:\n", - " future = exe.submit(add_funct, i, future)\n", - " print(future.result())" + "def init_function():\n", + " return {\"j\": 4, \"k\": 3, \"l\": 2}" ] }, { "cell_type": "code", "execution_count": 5, - "id": "74a21480-2435-4c89-b60e-1b06a719bf54", + "id": "5ddf8343-ab2c-4469-ac9f-ee568823d4ad", "metadata": {}, "outputs": [], "source": [ - "def calc(i):\n", - " from mpi4py import MPI\n", - "\n", - " size = MPI.COMM_WORLD.Get_size()\n", - " rank = MPI.COMM_WORLD.Get_rank()\n", - " return i, size, rank" + "def calc_with_preload(i, j, k):\n", + " return i + j + k" ] }, { "cell_type": "code", "execution_count": 6, - "id": "7ff5d31c-27cb-45cf-aff9-5c0cccfd5b3f", + "id": "0da13efa-1941-416f-b9e6-bba15b5cdfa2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[(3, 2, 0), (3, 2, 1)]\n" + "10\n" ] } ], "source": [ - "with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\") as exe:\n", - " fs = exe.submit(calc, 3, resource_dict={\"cores\": 2})\n", - " print(fs.result())" + "with Executor(\n", + " backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\", max_workers=2, init_function=init_function, block_allocation=True\n", + ") as exe:\n", + " fs = exe.submit(calc_with_preload, 2, j=5)\n", + " print(fs.result())\n" ] }, { "cell_type": "markdown", - "id": "c24ca82d-60bd-4fb9-a082-bf9a81e838bf", + "id": "82f3b947-e662-4a0d-b590-9475e0b4f7dd", + "metadata": {}, + "source": [ + "In this example the parameter `k` is used from the dataset created by the initialization function while the parameters `i` and `j` are specified by the call of the `submit()` function. \n", + "\n", + "When using the block allocation mode, it is recommended to set either the maxium number of workers using the `max_workers` parameter or the maximum number of CPU cores using the `max_cores` parameter to prevent oversubscribing the available resources. " + ] + }, + { + "cell_type": "markdown", + "id": "8ced8359-8ecb-480b-966b-b85d8446d85c", "metadata": {}, "source": [ - "### Nested executors" + "### Dependencies\n", + "Python functions with rather different computational resource requirements should not be merged into a single function. So to able to execute a series of Python functions which each depend on the output of the previous Python function executorlib internally handles the dependencies based on the [concurrent futures future](https://docs.python.org/3/library/concurrent.futures.html#future-objects) objects from the Python standard library. This implementation is independent of the selected backend and works for HPC allocation mode just like explained in the [local mode section](). 
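Before running an expensive chain of functions it can help to inspect the resolved dependencies. The sketch below assumes that the installed executorlib version provides the `plot_dependency_graph` parameter (and the optional graph libraries it needs); if it does not, the example in the next cell can be used as-is without plotting:
```python
from executorlib import Executor

# Assumption: plot_dependency_graph=True renders the task graph instead of executing it
# and may require optional plotting dependencies such as networkx/pygraphviz.
with Executor(
    backend="flux_allocation", flux_executor_pmi_mode="pmix", plot_dependency_graph=True
) as exe:
    future = exe.submit(sum, [1, 1])
    fs = exe.submit(sum, [future, 1])  # depends on the first submission
```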
" ] }, { "cell_type": "code", "execution_count": 7, - "id": "9c2d7dd7-409a-4833-92a5-dff32aa4ecb8", + "id": "bd26d97b-46fd-4786-9ad1-1e534b31bf36", "metadata": {}, "outputs": [], "source": [ - "import flux.resource" + "def add_funct(a, b):\n", + " return a + b" ] }, { "cell_type": "code", "execution_count": 8, - "id": "fba2c9f2-791e-4534-a2d5-03c5f7b626b6", + "id": "1a2d440f-3cfc-4ff2-b74d-e21823c65f69", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "nodes: cmpc06 #cores: 4 #free: 4\n" + "6\n" ] } ], "source": [ - "with flux.Flux() as handle:\n", - " rs = flux.resource.status.ResourceStatusRPC(handle).get()\n", - " rl = flux.resource.list.resource_list(handle).get()\n", - " print(\n", - " \"nodes: \", rs.nodelist, \" #cores: \", rl.all.ncores, \" #free: \", rl.free.ncores\n", - " )" + "with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\") as exe:\n", + " future = 0\n", + " for i in range(1, 4):\n", + " future = exe.submit(add_funct, i, future)\n", + " print(future.result())" ] }, { "cell_type": "markdown", - "id": "34a8c690-ca5a-41d1-b38f-c67eff085750", + "id": "f526c2bf-fdf5-463b-a955-020753138415", "metadata": {}, "source": [ - "### Resource Monitoring" + "### Caching\n", + "Finally, also the caching is available for HPC allocation mode, in analogy to the [local mode](). Again this functionality is not designed to identify function calls with the same parameters, but rather provides the option to reload previously cached results even after the Python processes which contained the executorlib `Executor` class is closed. As the cache is stored on the file system, this option can decrease the performance of executorlib. Consequently the caching option should primarily be used during the prototyping phase. " ] }, { "cell_type": "code", "execution_count": 9, - "id": "7481eb0a-a41b-4d46-bb48-b4db299fcd86", + "id": "dcba63e0-72f5-49d1-ab04-2092fccc1c47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " STATE NNODES NCORES NGPUS NODELIST\n", - " free 1 4 0 cmpc06\n", - " allocated 0 0 0 \n", - " down 0 0 0 \n" + "[2, 4, 6]\n" ] } ], "source": [ - "! 
flux resource list" + "with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\", cache_directory=\"./cache\") as exe:\n", + " future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]\n", + " print([f.result() for f in future_lst])" ] }, { "cell_type": "code", "execution_count": 10, - "id": "1ee6e147-f53a-4526-8ed0-fd036f2ee6bf", + "id": "c3958a14-075b-4c10-9729-d1c559a9231c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " JOBID USER NAME ST NTASKS NNODES TIME INFO\n", - "\u001b[01;32m ƒ2oHWAuoD janssen python CD 2 1 0.864s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ2nv5oHeF janssen python CD 1 1 0.601s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ2ngP9vbR janssen python CD 1 1 0.515s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ2nQ2KrEj janssen python CD 1 1 0.525s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ2n6PNPnp janssen python CD 1 1 0.484s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ2n6PNPno janssen python CD 1 1 0.477s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ2n6PNPnq janssen python CD 1 1 0.472s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ22V35h75 janssen pysqa CD 2 1 3.679s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ21fZzgnj janssen pysqa CD 1 1 2.836s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ21WSj7Bd janssen pysqa CD 1 1 2.916s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ21LrmkWK janssen pysqa CD 1 1 2.870s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒ21BacFD9 janssen pysqa CD 1 1 2.922s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒzPbWXEo janssen pysqa CD 1 1 2.898s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒzFHgYej janssen pysqa CD 1 1 2.965s cmpc06\n", - "\u001b[0;0m\u001b[01;32m ƒz6Lnt23 janssen pysqa CD 1 1 2.940s cmpc06\n", - "\u001b[0;0m" + "['sumd1bf4ee658f1ac42924a2e4690e797f4.h5out', 'sum5171356dfe527405c606081cfbd2dffe.h5out', 'sumb6a5053f96b7031239c2e8d0e7563ce4.h5out']\n" ] } ], "source": [ - "! flux jobs -a" + "import os\n", + "import shutil\n", + "\n", + "cache_dir = \"./cache\"\n", + "if os.path.exists(cache_dir):\n", + " print(os.listdir(cache_dir))\n", + " try:\n", + " shutil.rmtree(cache_dir)\n", + " except OSError:\n", + " pass" ] }, { "cell_type": "markdown", - "id": "845b4c81-672b-4f14-ac14-9d66d6405f11", + "id": "c24ca82d-60bd-4fb9-a082-bf9a81e838bf", "metadata": {}, "source": [ - "## SLURM " + "### Nested executors\n", + "The hierarchical nature of the [flux](https://flux-framework.org/) job scheduler allows the creation of additional executorlib Executors inside the functions submitted to the Executor. This hierarchy can be beneficial to separate the logic to saturate the available computational resources. 
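The next cell shows a minimal nested example. Building on it, the inner `Executor` can also request its own share of the allocation via `resource_dict`, for example to run an MPI-parallel function from within a submitted task. The following sketch illustrates this pattern; the core counts are purely illustrative and should be adjusted to the size of the allocation:
```python
from executorlib import Executor


def calc_nested_mpi(i):
    from executorlib import Executor

    def get_rank(j):
        from mpi4py import MPI

        return j, MPI.COMM_WORLD.Get_rank()

    # the inner Executor asks flux for two cores of the same allocation
    with Executor(backend="flux_allocation", flux_executor_pmi_mode="pmix") as exe:
        return exe.submit(get_rank, i, resource_dict={"cores": 2}).result()


with Executor(backend="flux_allocation", flux_executor_pmi_mode="pmix") as exe:
    print(exe.submit(calc_nested_mpi, 3).result())
```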
" ] }, { - "cell_type": "markdown", - "id": "b75b8dc8-6b8c-40ac-9449-2d9961d5d8b0", + "cell_type": "code", + "execution_count": 11, + "id": "06fb2d1f-65fc-4df6-9402-5e9837835484", "metadata": {}, + "outputs": [], "source": [ - "### Resource Assignment" + "def calc_nested():\n", + " from executorlib import Executor\n", + " \n", + " with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\") as exe:\n", + " fs = exe.submit(sum, [1, 1])\n", + " return fs.result()" ] }, { - "cell_type": "markdown", - "id": "bc468d0f-7eab-48a2-9d60-b7f84353ad38", + "cell_type": "code", + "execution_count": 12, + "id": "89b7d0fd-5978-4913-a79a-f26cc8047445", "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2\n" + ] + } + ], "source": [ - "### Advanced Configuration\n", - "* explain additional arguments" + "with Executor(backend=\"flux_allocation\", flux_executor_pmi_mode=\"pmix\") as exe:\n", + " fs = exe.submit(calc_nested)\n", + " print(fs.result())" ] }, { "cell_type": "markdown", - "id": "a4525e24-a693-4da2-ac59-ea84c538827a", + "id": "34a8c690-ca5a-41d1-b38f-c67eff085750", "metadata": {}, "source": [ - "### Combined with Flux" + "### Resource Monitoring\n", + "For debugging it is commonly helpful to keep track of the computational resources. [flux](https://flux-framework.org/) provides a number of features to analyse the resource utilization, so here only the two most commonly used ones are introduced. Starting with the option to list all the resources available in a given allocation with the `flux resource list` command:" ] }, { - "cell_type": "markdown", - "id": "99919e31-8149-46d4-9231-e1a3e750eace", + "cell_type": "code", + "execution_count": 13, + "id": "7481eb0a-a41b-4d46-bb48-b4db299fcd86", "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " STATE NNODES NCORES NGPUS NODELIST\n", + " free 1 2 0 fedora\n", + " allocated 0 0 0 \n", + " down 0 0 0 \n" + ] + } + ], "source": [ - "## Other Queuing Systems\n", - "While primarily Flux and SLURM are supported it should be possible to adopt the configuration to other queuing systems. " + "! flux resource list" ] }, { - "cell_type": "code", - "execution_count": 11, - "id": "885179f8-7985-496e-8045-36b4e117be68", + "cell_type": "markdown", + "id": "08d98134-a0e0-4841-be82-e09e1af29e7f", "metadata": {}, - "outputs": [], "source": [ - "from executorlib.standalone.interactive.spawner import generate_slurm_command" + "Followed by the list of jobs which were executed in a given flux session. 
This can be retrieved using the `flux jobs -a` command:" ] }, { "cell_type": "code", - "execution_count": 12, - "id": "b09ba577-fa03-4801-988e-904d902f5e0e", + "execution_count": 14, + "id": "1ee6e147-f53a-4526-8ed0-fd036f2ee6bf", "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "\u001b[0;31mSignature:\u001b[0m\n", - "\u001b[0mgenerate_slurm_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcores\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcwd\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mthreads_per_core\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mgpus_per_core\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mopenmpi_oversubscribe\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mslurm_cmd_args\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mSource:\u001b[0m \n", - "\u001b[0;32mdef\u001b[0m \u001b[0mgenerate_slurm_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcores\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcwd\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mthreads_per_core\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mgpus_per_core\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mopenmpi_oversubscribe\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mslurm_cmd_args\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;34m\"\"\"\u001b[0m\n", - "\u001b[0;34m Generate the command list for the SLURM interface.\u001b[0m\n", - "\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m Args:\u001b[0m\n", - "\u001b[0;34m cores (int): The number 
of cores.\u001b[0m\n", - "\u001b[0;34m cwd (str): The current working directory.\u001b[0m\n", - "\u001b[0;34m threads_per_core (int, optional): The number of threads per core. Defaults to 1.\u001b[0m\n", - "\u001b[0;34m gpus_per_core (int, optional): The number of GPUs per core. Defaults to 0.\u001b[0m\n", - "\u001b[0;34m openmpi_oversubscribe (bool, optional): Whether to oversubscribe the cores. Defaults to False.\u001b[0m\n", - "\u001b[0;34m slurm_cmd_args (list[str], optional): Additional command line arguments. Defaults to [].\u001b[0m\n", - "\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m Returns:\u001b[0m\n", - "\u001b[0;34m list[str]: The generated command list.\u001b[0m\n", - "\u001b[0;34m \"\"\"\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mSLURM_COMMAND\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"-n\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcores\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcwd\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"-D\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcwd\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mthreads_per_core\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"--cpus-per-task\"\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mthreads_per_core\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mgpus_per_core\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"--gpus-per-task=\"\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgpus_per_core\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mopenmpi_oversubscribe\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"--oversubscribe\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mslurm_cmd_args\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0mslurm_cmd_args\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mcommand_prepend_lst\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mFile:\u001b[0m ~/projects/executorlib/executorlib/standalone/interactive/spawner.py\n", - "\u001b[0;31mType:\u001b[0m function" - ] - }, - "metadata": {}, - "output_type": 
"display_data" + "name": "stdout", + "output_type": "stream", + "text": [ + " JOBID USER NAME ST NTASKS NNODES TIME INFO\n", + "\u001b[01;32m ƒDqBpVYK jan python CD 1 1 0.695s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒDxdEtYf jan python CD 1 1 0.225s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒDVahzPq jan python CD 1 1 0.254s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒDSsZJXH jan python CD 1 1 0.316s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒDSu3Hod jan python CD 1 1 0.277s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒDFbkmFD jan python CD 1 1 0.247s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒD9eKeas jan python CD 1 1 0.227s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒD3iNXCs jan python CD 1 1 0.224s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒCoZ3P5q jan python CD 1 1 0.261s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒCoXZPoV jan python CD 1 1 0.261s fedora\n", + "\u001b[0;0m\u001b[01;32m ƒCZ1URjd jan python CD 2 1 0.360s fedora\n", + "\u001b[0;0m" + ] } ], "source": [ - "generate_slurm_command??" + "! flux jobs -a" + ] + }, + { + "cell_type": "markdown", + "id": "021f165b-27cc-4676-968b-cbcfd1f0210a", + "metadata": {}, + "source": [ + "## Flux\n", + "While the number of HPC clusters which use [flux](https://flux-framework.org/) as primary job scheduler is currently still limited the setup and functionality provided by executorlib for running [SLURM with flux]() also applies to HPCs which use [flux](https://flux-framework.org/) as primary job scheduler." ] }, { "cell_type": "code", "execution_count": null, - "id": "1de93586-d302-4aa6-878a-51acfb1d3009", + "id": "04f03ebb-3f9e-4738-b9d2-5cb0db9b63c3", "metadata": {}, "outputs": [], "source": [] @@ -374,7 +423,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.10" + "version": "3.12.5" } }, "nbformat": 4,