This is an adapted version of Geocomputing Python Puhti examples for Aalto project course workshop in March 2023.
Three different job styles: interactive, serial, and embarrasingly/delightfully/naturally parallel.
For parallel jobs there are multiple options with different command line / workload manager tools or Python libraries. We'll have a look at using GNUParallel
and dask
.
Interactive: developing your scripts, limited test data. Computationally more demanding analysis: use Puhti's batch system for requesting the resources and running your scripts.
Calculate NDVI (Normalized Difference Vegetation Index) from the Sentinel2 satellite image's red and near infrared bands.
Basic idea behind the script is to:
- Find red and near infrared channels of Sentinel L2A products from its
SAFE
folder and open thejp2
files. ->readImage
- Read the data as
numpy
array withrasterio
, scale the values to reflectance values and calculate NDVI index. ->calculateNDVI
- Save output as GeoTiff with
rasterio
. ->saveImage
Files:
- The input ESA Sentinel-2 L2A images are in JPG2000 format and are already stored in Puhti:
/appl/data/geo/sentinel/s2_example_data/L2A
. Puhti has only these example images, more Sentinel L2A images are available from CSC Allas. - The example scripts are in subfolders by job type and used parallel library. Each subfolder has 2 files:
- A .py file for defining the tasks to be done.
- A batch job .sh file for submitting the job to Puhti batch job scheduler (SLURM).
- Log in to Puhti web interface: https://puhti.csc.fi
- Start a
Login node shell
- Create a folder for yourself:
- Switch to your projects scratch directory:
cd /scratch/project_200xxxx/
(fill in your project number for x) - Create new own folder:
mkdir yyy
(fill in your user name for y) - Switch into your own folder:
cd yyy
(fill in your username for y) - Get the exercise material by cloning this repository:
git clone https://github.com/samumantha/puhti_python_example
- Switch to the directory with example files:
cd puhti_python_example
. - Check that we are in the correct place:
pwd
should show something like/scratch/project_200xxxx/yyy/puhti_python_example
.
- Switch to your projects scratch directory:
Within an interactive job it is possible to edit the script in between test-runs; so this suits best when still writing the script. Usually it is best to use some smaller test dataset for this.
With Visual Studio Code you can also just run parts of the script.
- Start Visual Studio Code in Puhti web interface.
- Open VSCode start page from front page: Apps -> Visual Studio code
- Choose settings for VSCode:
- Project: project_200xxxx
- Partition: interactive
- Number of CPU cores: 1
- Memory: 4 Gb
- Local disk: 0
- Time: 2:00:00
- Modules: geoconda
Launch
- Wait a moment -> Connect to Visual studio code
- VSCode opens up in the browser
- Open folder with exercise files:
- File -> Open folder ->
/scratch/project_200xxxx/yyy/geocomputing/python/puhti
-> OK
- File -> Open folder ->
- Open serial/single_core_example.py. This is basic Python script, which uses a for loop for going through all 3 files.
- Check that needed Python libraries are available in Puhti. If it is not your own script you can see which libraries are used in this script by checking the imports. To check whether those libraries are available: Select all import rows and press
Shift+Enter
. Wait a few seconds. The import commands are run in Terminal (which opens automatically on the bottom of the page). If no error messages are visible, the packages are available. Also other parts of the script can be tested in the same manner (select the code and run withShift+Enter
). - Run the full script:
- Exit Python console in Terminal: type
exit()
in the terminal - Click green arrow above script (Run Python File in Terminal)
- Wait, it takes a few minutes for complete. The printouts will appear during the process.
- Check that there are 3 new GeoTiff files in your outout directory in the Files panel of VSCode.
- Exit Python console in Terminal: type
- Optional, check your results with QGIS
If you prefer prototyping and testing in a Jupyter Notebook, you can also do that in a similar manner than using Visual Studio Code. Choose Jupyter from the Puhti web interface dashboard or the Apps tab in the Puhti webinterface.
If you prefer working in the terminal, you can also start an interactive job there by starting a compute node shell directly from Tools tab in Puhti webinterface. Choose settings for the interactive session:
- Project: project_200xxxx
- Number of CPU cores: 1
- Memory: 4 Gb
- Local disk: 0
- Time: 2:00:00
You can also start an interactive session by starting a login node shell from Tools tab in Puhti webinterface or by connecting to Puhti via ssh connection with sinteractive --account project_200xxxx --cores 1 --time 02:00:00 --mem 4G --tmp 0
. Which gives you a compute node shell (you can see "where" you are from your terminal prompt [@puhti-loginXX] -> login node, [@rXXcXX] (XX being some numbers) -> compute node).
For both of above:
After getting access to the compute node shell, you can load modules and run scripts "like on a normal linux computer", excluding graphical access.
module load geoconda
python interactive_example.py /appl/data/geo/sentinel/s2_example_data/L2A
For a one core batch job, use the same Python-script as for interactive working. Latest now, we have to move to the terminal.
- Open serial_job/single_core_example.sh. Where are output and error messages written? How many cores and for how long time are reserved? How much memory? Which partition is used? Which module is used?
- Submit batch job from login node shell
cd /scratch/project_200xxxx/yyy/puhti_python_example/serial
sbatch single_core_example.sh
sbatch
prints out a job ID, use it to check state and efficiency of the batch job. Did you reserve a good amount of memory?
seff [jobid]
- Once the job is finished, see output in out.txt and err.txt for any possible errors and other outputs.
- Check that you have new GeoTiff files in working folder.
- Check the resources used in another way.
sacct -j [jobid] -o JobName,elapsed,TotalCPU,reqmem,maxrss,AllocCPUS
- elapsed – time used by the job
- TotalCPU – time used by all cores together
- reqmem – amount of requested memory
- maxrss – maximum resident set size of all tasks in job.
- AllocCPUS – how many CPUs were reserved
In this case the Python code takes care of dividing the work to 3 processes, one for each input file. Python has several packages for code parallelization, here we'll take a look at dask
(for others, check Geocomputing multiporcessing example and Geocomputing joblib example ):
dask
is versatile and has several options for parallelization, this example is for single-node (max 40 cores)- usage, but dask
can also be used for multi-node jobs. This example uses delayed functions from Dask to parallelize the workload. Typically, if a workflow contains a for-loop, it can benefit from delayed. Dask delayed tutorial
-
parallel_dask/dask_singlenode.sh batch job file for
dask
.--ntasks=1
+--cpus-per-task=3
reserves 3 cores - one for each file--mem-per-cpu=4G
reserves memory per core
-
Submit the parallel job to Puhti from Puhti login node shell:
cd ../parallel_dask
sbatch dask_singlenode.sh
- Check with
seff
how much time and resources you used?
GNU parallel can help parallelizing a script which otherwise is not parallelized. In this example we want to run the same script on three different inputfiles which we can read into a textfile and use as argument to the parallel tool.
This is similar to array jobs (see Geocomputing array job example), with the advantage that we do not start and need to monitor multiple jobs.
gnu_parallel/gnu_parallel_example.sh. The only difference to serial job is that we do not loop through the directory inside the Python script but let GNU parallel handle that step.
To get to know how many
cpus-per-task
we need you can use for examplels /appl/data/geo/sentinel/s2_example_data/L2A | wc -l
to count everything within the data directory before writing the batch job script.
Submit the gnu_parallel job
cd ../gnu_parallel
sbatch gnu_parallel_example.sh
- Check with
seff
how much time and resources you used?
Before using Allas from Puhti we need to load the tools and configure the connection:
module load allas
allas-conf --mode s3cmd
Using s3cmd
, you can find all commands in bottom of this page: https://s3tools.org/usage
s3cmd <command>
See allas_from_python.py
for some examples.
https://pouta.csc.fi/dashboard -> object store -> containers (might need Pouta access via project?)
You can make single buckets publicly available via the internet, everyone can then access the files within a bucket, such as todays presentation.
"Too many files" issues are also often encountered with workflows consisting of thousands of small runs. As a general guide, keep the number of files in a single directory well below one thousand, and organize your data into multiple directories. Also, use command csc-workspaces
to monitor that the total number of files in your projects stays well below the limits. If most of the files are temporary, or there simply is too many of them, using the fast local SSD disks (NVMe) in the I/O nodes can solve the problem. You can pack small files into a bigger archive file with the tar
command. Most importantly, if there are output files that you do not need, find out how to turn off writing those in the first place.
These are just to demonstrate the difference between single core sequential vs. some kind of parallelism. Depending on the issue, some library might be faster or slower.
Example | Jobs | CPU cores / job | Time (min) | CPU efficiency |
---|---|---|---|---|
single core | 1 | 1 | 02:54 | 86.70% |
multiprocessing | 1 | 3 | 01:05 | 92.31% |
joblib | 1 | 3 | 01:12 | 86.57% |
dask | 1 | 3 | 01:01 | 78.46% |
array job | 3 | 1 | 01:03 | 95.16% |
GNU parallel | 1 | 3 | 00:55 | 15.15% |