diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..385eead
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+*.csv
+*.pdf
+__pycache__/
\ No newline at end of file
diff --git a/README.md b/README.md
index 2c78b43..4c5cf9c 100644
--- a/README.md
+++ b/README.md
@@ -1,31 +1,12 @@
-# Toolset to perform scaling analysis of ICON(-HAM), ECHAM(-HAM) and MPI-ESM(-HAM)
+# Toolset to perform scaling analysis of ICON
 
 It has been tested on Piz Daint (CSCS) to produce the technical part of production projects at CSCS.
-On Euler (ETHZ) only limited functionality is provided for the analysis of Icon.
+
+On Euler (ETHZ), only limited functionality is provided for the analysis of Icon.
 See [Limitations on Euler](#limitations-on-euler) for more information.
 
 Below is a description of each script and a recipe.
 
-- Original devleopment: Colombe Siegenthaler (2020-01)
-- Maintainted by Michael Jähn from 2021-03 on
-
-## Table of contents
-  - [Recipe for scaling analysis with ECHAM/ICON-(HAM)](#recipe-for-scaling-analysis-with-echamicon-ham)
-    - [1. Configure and compile your model as usual.](#1-configure-and-compile-your-model-as-usual)
-    - [2. Prepare your running script](#2-prepare-your-running-script)
-      - [ICON](#icon)
-      - [ECHAM](#echam)
-    - [3. Create and launch different running scripts based on my_exp, but using different numbers of nodes.](#3-create-and-launch-different-running-scripts-based-on-my_exp-but-using-different-numbers-of-nodes)
-      - [ICON](#icon)
-      - [ECHAM](#echam)
-    - [4. When all the runs are finished, read all the slurm/log files to get the Wallclock for each run, and put them into a table:](#4-when-all-the-runs-are-finished-read-all-the-slurmlog-files-to-get-the-wallclock-for-each-run-and-put-them-into-a-table)
-      - [ICON](#icon)
-      - [ECHAM](#echam)
-    - [5. Create a summary plot and table of the variable you wish (Efficiency, NH, Wallclock) for different experiments with respect to the number of nodes.](#5-create-a-summary-plot-and-table-of-the-variable-you-wish-efficiency-nh-wallclock-for-different-experiments-with-respect-to-the-number-of-nodes)
-    - [Limitations on Euler](#limitations-on-euler)
-
-## Recipe for scaling analysis with ECHAM/ICON-(HAM)
-
 ### 1. Configure and compile your model as usual.
 
 ### 2. Prepare your running script
 
@@ -39,81 +20,45 @@ $ conda env create -f environment.yaml
 
 To load your environment, simply type:
 
 ```console
-$ conda env create -f environment.yaml
+$ conda activate scaling_analysis
 ```
-
-#### ICON
-
-Prepare your machine-independent setting file "my_exp" (e.g. exp.atm_amip, without the '.run').
-
-#### ECHAM
-Prepare your setting file as usual with the jobscript toolkit:
-
-```console
-$ prepare_run -r [path_to_your_setting_folder] my_exp
-```
+Prepare your machine-independent setting file `my_exp` (e.g. `exp.atm_amip`, without the `.run`).
 
 ### 3. Create and launch different running scripts based on my_exp, but using different numbers of nodes.
 
-#### ICON
-
-Use `send_several_run_ncpus_perf_ICON.py`.
+Use `send_several_run_ncpus_perf.py`.
 
 For example, for running `my_exp` on 1, 10, 12 and 15 nodes:
 
 ```console
-$ python [path_to_scaling_analysis_tool]/send_several_run_ncpus_perf_ICON.py -e my_exp -n 1 10 12 15
+$ python [path_to_scaling_analysis_tool]/send_several_run_ncpus_perf.py -e my_exp -n 1 10 12 15
 ```
 
 With the command above, 4 running scripts will be created (`exp.my_exp_nnodes1.run`, `exp.my_exp_nnodes10.run`, `exp.my_exp_nnodes12.run` and `exp.my_exp_nnodes15.run`), and each of them will be launched.
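+
+Instead of listing the node counts explicitly, a range can be passed with the `-a` option
+(begin, end, step, interpreted as in `np.arange`; see the `--arrange_nnodes` help in the script),
+e.g. to run on 1, 3, 5, ..., 15 nodes:
+
+```console
+$ python [path_to_scaling_analysis_tool]/send_several_run_ncpus_perf.py -e my_exp -a 1 17 2
+```
+
+Node counts given with `-n` take priority over `-a`.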
-To send several experiments on different node numbers at once, use: `send_analyse_different_exp_at_once_ICON.py`
+To send several experiments on different node numbers at once, use: `send_analyse_different_exp_at_once.py`
 from inside `<path_to_scaling_analysis_tool>`:
 
 ```console
-$ python send_analyse_different_exp_at_once_ICON.py
+$ python send_analyse_different_exp_at_once.py
 ```
 
-The script `send_analyse_different_exp_at_once_ICON.py` (n_step = 1) is a wrapper which calls
-`send_several_run_ncpus_perf_ICON.py` for different experiments (for example different set-ups, or compilers).
+The script `send_analyse_different_exp_at_once.py` (n_step = 1) is a wrapper which calls
+`send_several_run_ncpus_perf.py` for different experiments (for example different set-ups, or compilers).
 
-The script `send_analyse_different_exp_at_once_ICON.py` (n_step = 2) is a wrapper which gets
+The script `send_analyse_different_exp_at_once.py` (n_step = 2) is a wrapper which gets
 the wallclocks from the log files for different experiments (for example different set-ups, or compilers)
 (point 4 of this README).
 
-#### ECHAM
-
-Use `send_several_run_ncpus_perf.py` which creates and sends several running scripts using the option -o of the jobsubm_echam script.
-For example, sending the my_exp run on 1, 10, 12 and 15 nodes:
-
-```console
-$ python [path_to_scaling_analysis_tool]/send_several_run_ncpus_perf.py -b [path_to_echam-ham_folder]/my_experiments/my_exp -n 1 10 12 15
-```
-
-With the command above, 4 running folders will be created based on the running folder `my_exp`
-(`my_exp_cpus12`, `my_exp_cpus120`, `my_exp_cpus144and my_exp_cpus180`), and each of them will be launched.
-
 ### 4. When all the runs are finished, read all the slurm/log files to get the Wallclock for each run, and put them into a table:
 
-#### ICON
-
 Use the option `-m icon`:
 
 ```console
 $ python [path_to_scaling_analysis_tool]/create_scaling_table_per_exp.py -e my_exp -m icon
 ```
 
-or for different experiments at once: `send_analyse_different_exp_at_once_ICON.py` (n_step = 2) (cf point 3)
-
-#### ECHAM
-
-Use the option `-m icon`
-
-```console
-$ python [path_to_scaling_analysis_tool]/create_scaling_table_per_exp.py -e my_exp -m echam-ham
-```
-
-For both model types, this creates a table `my_exp.csv`, which contains the wallclock, efficiency and NH for each run.
+or for different experiments at once: `send_analyse_different_exp_at_once.py` (n_step = 2) (cf. point 3).
+
+This creates a table `my_exp.csv`, which contains the wallclock, efficiency and NH for each run
+(see the loading example at the end of this README).
 
 ### 5. Create a summary plot and table of the variable you wish (Efficiency, NH, Wallclock) for different experiments with respect to the number of nodes.
 
@@ -126,9 +71,7 @@ $ python [path_to_scaling_analysis_tool]/plot_perfs.py
 ```
 
 ## Limitations on Euler
 
 * The scaling analysis tools were tested for Icon only.
-* Because of differing nodes-architectures on Euler, the number of nodes passed via the -n option
-corresponds to the number of Euler-cores.
-* Parsing the logfiles only works using the --no_sys_report option.
+* Because of differing node architectures on Euler, the number of nodes passed via the -n option corresponds to the number of Euler-cores.
 * In order to have nice plots, the number of Euler-cores needs to be divided by 12.
 * Automatic runtime-specification is not as smooth as on Daint -> a minimum of 20 min is requested in any case.
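+
+## Example: inspecting the scaling table
+
+The table written in step 4 is a `;`-separated CSV with the columns `Date`, `Jobnumber`,
+`N_Nodes`, `Wallclock`, `Wallclock_hum`, `Speedup`, `Efficiency`, `Node_hours` and `NH_year`.
+A minimal sketch for loading it with pandas:
+
+```python
+import pandas as pd
+
+# table produced by create_scaling_table_per_exp.py (step 4)
+df = pd.read_csv("my_exp.csv", sep=";")
+print(df[["N_Nodes", "Wallclock", "Speedup", "Efficiency"]])
+```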
diff --git a/create_scaling_table_per_exp.py b/create_scaling_table_per_exp.py
index bc22906..44e8833 100755
--- a/create_scaling_table_per_exp.py
+++ b/create_scaling_table_per_exp.py
@@ -1,17 +1,4 @@
 #!/usr/bin/python
-#
-# Script to parse all the slurm or echam6.log files for one experiment (runned on different number of nodes)
-# to extract the wallclock time. It creates a table containing wallclock time and associated scaling data
-# (Efficiency, Speed-up, NH,...).
-#
-#
-#Example : create_scaling_table_per_exp.py -e my_exp -m icon -y 1
-#
-# C. Siegenthaler (C2SM) , July 2015
-# C. Siegenthaler (C2SM) : adaptation for ICON, November 2017
-# C. Siegenthaler (C2SM) : modifications, December 2019
-#
-############################################################################################
 
 import numpy as np
 import os
@@ -28,8 +15,112 @@
     'date_run': datetime.datetime(1900, 1, 1).strftime("%Y-%m-%d %H:%M:%S")
 }
 
-if __name__ == "__main__":
+def grep(string, filename):
+    # returns lines of file_name where string appears
+    # mimics the "grep" function
+
+    # initialisation
+    # list of lines where string is found
+    list_line = []
+    list_iline = []
+    lo_success = False
+    file = open(filename, 'r')
+    count = 0
+    while True:
+        try:  # Some lines are read in as binary with the pgi compilation
+            line = file.readline()
+            count += 1
+            if string in line:
+                list_line.append(line)
+                list_iline.append(count)
+                lo_success = True
+            if not line:
+                break
+        except Exception:
+            continue
+    file.close()
+    return {"success": lo_success, "iline": list_iline, "line": list_line}
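+
+# Note on grep(): it returns {"success": bool, "iline": [1-based line numbers],
+# "line": [matching lines]}, so the helpers below chain accesses such as
+# grep("total ", filename)["line"][0] to pull single fields out of a log file.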
+
+
+def extract_line(filename, line_number):
+    # Open the file in read mode
+    with open(filename, 'r') as file:
+        # Read all lines into a list
+        lines = file.readlines()
+
+    # Check if the line number is valid
+    if 1 <= line_number <= len(lines):
+        # Extract the content of the specified line
+        content = lines[line_number - 1]
+        return content.strip()  # Strip any leading/trailing whitespace
+    else:
+        print("Error: Line number is out of range.")
+        return None
+
+
+def extract_job_id(filename, prefix="slurm-", suffix=".out"):
+    # Extract the job ID from a filename such as "slurm-123456.out"
+    # (str.find returns -1 when the substring is missing)
+    start_index = filename.find(prefix)
+    end_index = filename.find(suffix)
+
+    # Extract the job ID substring
+    if start_index != -1 and end_index != -1:
+        job_id = filename[start_index + len(prefix):end_index]
+        return job_id
+    else:
+        print("Error: Filename format is incorrect.")
+        return None
+
+
+def get_wallclock_icon(filename, no_x, num_ok=1, success_message=None):
+
+    required_ok_streams = num_ok
+    if success_message:
+        OK_streams = grep(success_message, filename)["line"]
+    else:
+        OK_streams = grep('Script run successfully: OK', filename)["line"]
+
+    if len(OK_streams) >= required_ok_streams:
+        total_grep = grep("total ", filename)["line"]
+        wallclock = float(total_grep[0].split()[-2])
+        line_times = grep(" Elapsed", filename)["iline"][0] + 2
+        date_run = extract_line(filename, line_times).split()[2]
+    else:
+        print("file {} did not finish properly".format(filename))
+        print("Set Wallclock = 0")
+        wallclock = 0.0
+        date_run = default_wallclock['date_run']
+
+    return wallclock, date_run
+
+
+def check_icon_finished(filename,
+                        string_sys_report='Script run successfully: OK'):
+    # return True if icon finished properly
+
+    # initialisation
+    lo_finished_ok = False
+
+    # look for ok_line
+    if grep(string_sys_report, filename)['success']:
+        lo_finished_ok = True
+
+    return (lo_finished_ok)
+
+
+def set_default_error_slurm_file(txt_message="Problem in the slurm file"):
+    # error in the slurm file, set default values
+
+    wallclock = default_wallclock['wallclock']
+    nnodes = default_wallclock['nnodes']
+    date_run = default_wallclock['date_run']
+    print(txt_message)
+    print("Set Wallclock = {} , and nodes = {}".format(wallclock, nnodes))
+
+    return (wallclock, nnodes, date_run)
+
+
+if __name__ == "__main__":
 
     # parsing arguments
     parser = argparse.ArgumentParser()
     parser.add_argument('--exp', '-e', dest = 'basis_name',\
 
@@ -52,22 +143,19 @@
                         help='resolution(with ocean) eg T63L31GR15 ')
 
     parser.add_argument('--mod','-m', dest = 'mod',\
-                        default='echam-ham',\
-                        help='model type (echam-ham, icon, icon-ham)')
+                        default='icon',\
+                        help='model type (icon, icon-ham, icon-clm)')
 
-    parser.add_argument('--cpu_per_node', dest = 'cpu_per_node',\
-                        default = 12,\
+    parser.add_argument('--mpi_procs_per_node', dest = 'mpi_procs_per_node',\
+                        default = 1,\
                         type = int,\
-                        help = 'numper of CPUs per node')
+                        help = 'number of MPI procs per node')
 
     parser.add_argument('--fact_nh_yr', '-y', dest = 'factor_nh_year',\
                         default = 12,\
                         type = int,\
                         help = 'factor to multiply for getting NH per year')
 
-    parser.add_argument('--no_sys_report', action='store_true',\
-                        help = 'no time report provided by the system, per default, the wallclock will be taken from this report. If this option enabled, the wallclock will computed in a different way')
-
     parser.add_argument('--no_x', action='store_false',\
                         help = 'some model logs have a "set -x" in the first line, therefore the "Script run successfully: OK" string is contained twice in the logfile. Passing this argument assumes NO "set -x" set.')
 
@@ -101,8 +189,8 @@
     if l_cpus_def:
         if args.mod.upper().startswith("ICON-CLM"):
             slurm_files_ar = [
-                glob.glob("{}/{}_nnodes{}_*.o*".format(path_exps_dir,
-                                                       args.basis_name, n))
+                glob.glob("{}/{}_nnodes{}/slurm-*.out".format(
+                    path_exps_dir, args.basis_name, n))
                 for n in nodes_to_proceed
             ]
             slurm_files = list(itertools.chain.from_iterable(slurm_files_ar))
@@ -113,26 +201,16 @@
                 for n in nodes_to_proceed
             ]
             slurm_files = list(itertools.chain.from_iterable(slurm_files_ar))
-        elif args.mod.upper() == "ECHAM-HAM":
-            slurm_files_ar = [
-                glob.glob("{}/{}_cpus{}/slurm*".format(path_exps_dir,
-                                                       args.basis_name,
-                                                       n * args.cpu_per_node))
-                for n in nodes_to_proceed
-            ]
-            slurm_files = list(itertools.chain.from_iterable(slurm_files_ar))
 
     # 3rd possibility : use all the slurm files containing the basis name
     if (not l_cpus_def):
         if args.mod.upper().startswith("ICON-CLM"):
-            slurm_files = glob.glob("{}/*{}*.o*".format(
-                path_exps_dir, args.basis_name, args.basis_name))
+            slurm_files = sorted(
+                glob.glob("{}/{}_nnodes*/slurm-*.out".format(
+                    path_exps_dir, args.basis_name)))
         elif args.mod.upper().startswith("ICON"):
             slurm_files = glob.glob("{}/LOG.exp.{}*.run.*".format(
                 path_exps_dir, args.basis_name, args.basis_name))
-        elif args.mod.upper() == "ECHAM-HAM":
-            slurm_files = glob.glob("{}/{}*/slurm_{}*".format(
-                path_exps_dir, args.basis_name, args.basis_name))
 
     # fill up array
     #-----------------------------------------------------------------------------------------------
 
@@ -153,295 +231,73 @@
     # performs the analysis (create a csv file)
     #ilin = 0
 
-
-    def grep(string, filename):
-        # returns lines of file_name where string appears
-        # mimic the "grep" function
-
-        # initialisation
-        # list of lines where string is found
-        list_line = []
-        list_iline = []
-        lo_success = False
-        file = open(filename, 'r')
-        count = -1
-        while True:
-            try:  # Some lines are read in as binary with the pgi compilation
-                line =
file.readline() - count += 1 - if string in line: - list_line.append(line) - list_iline.append(count) - lo_success = True - if not line: - break - except Exception as e: - continue - file.close() - return {"success": lo_success, "iline": list_iline, "line": list_line} - - def get_wallclock_icon(filename, no_x, num_ok=1, success_message=None): - - required_ok_streams = num_ok - if success_message: - OK_streams = grep(success_message, filename)["line"] - else: - OK_streams = grep('Script run successfully: OK', filename)["line"] - - if len(OK_streams) >= required_ok_streams: - timezone = 'CEST' - time_grep = grep(timezone, filename)["line"] - if not time_grep: - timezone = 'CET' - time_grep = grep(timezone, filename)["line"] - - time_arr = [ - datetime.datetime.strptime( - s.strip(), '%a %b %d %H:%M:%S ' + timezone + ' %Y') - for s in time_grep - ] - - wallclock = time_arr[-1] - time_arr[0] - else: - print("file {} did not finish properly".format(filename)) - print("Set Wallclock = 0") - wallclock = datetime.timedelta(0) - - return {"wc": wallclock, "st": time_arr[0]} - - def check_icon_finished(filename, - string_sys_report='Script run successfully: OK'): - # return True if icon finished properly - - # initilisation - lo_finished_ok = False - - # look for ok_line - if grep(string_sys_report, filename)['success']: - lo_finished_ok = True - - return (lo_finished_ok) - - def set_default_error_slurm_file(txt_message="Problem in the slurm file"): - # error in the slurm file, set default values - - wallclock = default_wallclock['wallclock'] - nnodes = default_wallclock['nnodes'] - date_run = default_wallclock['date_run'] - print(txt_message) - print("Set Wallclock = {} , and nodes = {}".format(wallclock, nnodes)) - - return (wallclock, nnodes, date_run) - - def get_date_from_echam_slurm_file(filename): - string_timer_report = 'Submit Eligible' - summary_in_file = grep(string_timer_report, filename) - if summary_in_file['success']: - summary_line = summary_in_file["line"][0] - summary_iline = summary_in_file["iline"][0] - f = open(filename) - lines = f.readlines() - - line_labels = [s.strip() for s in summary_line.split()] - ind_start = line_labels.index('Start') - ind_end = line_labels.index('End') - - line_time = [ - lines[summary_iline + 2].split()[i] - for i in [ind_start, ind_end] - ] - first_row = grep(string_timer_report, filename) - first_row_line = first_row["line"][0] - first_row_iline = first_row["iline"][0] - ind_start = line_labels.index('Start') - ind_end = line_labels.index('End') - line_time = [ - lines[first_row_iline + 2].split()[i] - for i in [ind_start, ind_end] - ] - time_arr = [ - datetime.datetime.strptime(s.strip(), '%Y-%m-%dT%H:%M:%S') - for s in line_time - ] - date_run = time_arr[0] - else: - print("Warning: Cannot get date from slurm file %s" % filename) - date_run = default_wallclock['date_run'] - - return date_run - - def get_wallclock_Nnodes_gen_daint(filename, - string_sys_report="Elapsed", - use_timer_report=False): - # Find report - summary_in_file = grep(string_sys_report, filename) - if summary_in_file['success']: - summary_line = summary_in_file["line"][0] - summary_iline = summary_in_file["iline"][0] - - f = open(filename) - lines = f.readlines() - - line_labels = [s.strip() for s in summary_line.split()] - ind_start = line_labels.index('Start') - ind_end = line_labels.index('End') - - # For summary_iline + x had to be subtracted by one - line_time = [ - lines[summary_iline + 2].split()[i] - for i in [ind_start, ind_end] - ] - time_arr = [ - 
datetime.datetime.strptime(s.strip(), '%Y-%m-%dT%H:%M:%S')
-                for s in line_time
-            ]
-
-            if use_timer_report:
-                string_timer_report = '# calls'
-                timer_in_file = grep(string_timer_report, filename)
-                if timer_in_file['success']:
-                    timer_line = timer_in_file["line"][0]
-                    timer_iline = timer_in_file["iline"][0]
-                    string_timer_firstrow = 'total '
-                    first_row = grep(string_timer_firstrow, filename)
-                    first_row_line = first_row["line"][0]
-                    first_row_iline = first_row["iline"][0]
-                    time_str = lines[first_row_iline].split()[-1]
-                    wallclock = datetime.timedelta(seconds=float(time_str))
-                else:
-                    wallclock, nnodes, time_arr = set_default_error_slurm_file(
-                        "Warning : Timer output report is not present or the word {} is not found"
-                        .format(filename, string_timer_report))
-            else:
-                # find index of "start" and "end" in the report line
-                wallclock = time_arr[-1] - time_arr[0]
-
-            # Nnodes
-            line_labels_n = [
-                s.strip() for s in lines[summary_iline + 7].split()
-            ]
-            ind_nodes = line_labels_n.index('NNodes')
-            nodes = int(lines[summary_iline + 9].split()[ind_nodes])
-
-            f.close()
-        else:
-            wallclock, nnodes, time_arr = set_default_error_slurm_file(
-                "Warning : Batch summary report is not present or the word {} is not found"
-                .format(filename, string_sys_report))
-
-        return {"n": nodes, "wc": wallclock, "st": time_arr[0]}
-
     # security. If not file found, exit
     if len(slurm_files) == 0:
-        print("No slurm file founded with this basis name")
+        print("No slurm file found with this basis name")
         print("Exiting")
         exit()
 
     # loop over number of cpus to be lauched
     for filename in slurm_files:
-        print("Read file : {}".format(os.path.basename(filename)))
+        print(f"Read file: {filename}")
 
         # read nnodes and wallclock from file
         if args.mod.upper() == "ICON":
             if check_icon_finished(filename) or args.ignore_errors:
                 # get # nodes and wallclock
-                if args.no_sys_report:
-                    # infer nnodes from MPI-procs in ICON output
-                    nodes_line = grep(
-                        "mo_mpi::start_mpi ICON: Globally run on",
-                        filename)["line"][0]
-                    nnodes = int(nodes_line.split(' ')[6])
-
-                    nnodes = nnodes // args.cpu_per_node
-
-                    wallclock = get_wallclock_icon(
-                        filename, args.no_x)["wc"].total_seconds()
-                    date_run = get_wallclock_icon(filename, args.no_x)["st"]
-                else:
-                    n_wc_st = get_wallclock_Nnodes_gen_daint(
-                        filename, use_timer_report=True)
-                    nnodes = n_wc_st["n"]
-                    wallclock = n_wc_st["wc"].total_seconds()
-                    date_run = n_wc_st["st"]
+                # infer nnodes from MPI-procs in ICON output
+                nodes_line = grep("mo_mpi::start_mpi ICON: Globally run on",
+                                  filename)["line"][0]
+                nnodes = int(nodes_line.split(' ')[6])
+                nnodes = nnodes // args.mpi_procs_per_node
+
+                # get_wallclock_icon returns (wallclock in seconds, start date)
+                wallclock, date_run = get_wallclock_icon(filename, args.no_x)
             else:
                 wallclock, nnodes, date_run = set_default_error_slurm_file(
                     "Warning : Run did not finish properly")
 
             # get job number
             jobnumber = float(filename.split('.')[-2])
 
-        if args.mod.upper() == "ICON-CLM":
-            success_message = "ICON experiment FINISHED"
+        elif args.mod.upper() == "ICON-CLM":
+            success_message = "----- ICON finished"
             if check_icon_finished(filename,
                                    success_message) or args.ignore_errors:
                 # get # nodes and wallclock
-                if args.no_sys_report:
-                    # infer nnodes from MPI-procs in ICON output
-                    nodes_line = grep(
-                        "mo_mpi::start_mpi ICON: Globally run on",
-                        filename)["line"][0]
-                    nnodes = int(nodes_line.split(' ')[6])
-
-                    nnodes = nnodes // args.cpu_per_node
-
-                    wallclock = get_wallclock_icon(
-                        filename,
-                        args.no_x,
-                        num_ok=1,
-                        success_message=success_message)["wc"].total_seconds()
-                    date_run = get_wallclock_icon(
-                        filename,
-                        args.no_x,
-                        num_ok=1,
-                        success_message=success_message)["st"]
-                else:
-                    n_wc_st = get_wallclock_Nnodes_gen_daint(
-                        filename, use_timer_report=True)
-                    nnodes = n_wc_st["n"]
-                    wallclock = n_wc_st["wc"].total_seconds()
-                    date_run = n_wc_st["st"]
+                # infer nnodes from MPI-procs in ICON output
+                nodes_line = grep("mo_mpi::start_mpi ICON: Globally run on",
+                                  filename)["line"][0]
+                nnodes = int(nodes_line.split(' ')[6])
+                nnodes = nnodes // args.mpi_procs_per_node
+
+                wallclock, date_run = get_wallclock_icon(
+                    filename,
+                    args.no_x,
+                    num_ok=1,
+                    success_message=success_message)
+                print(f"Simulation on {nnodes} nodes launched at: {date_run}")
             else:
                 wallclock, nnodes, date_run = set_default_error_slurm_file(
                     "Warning : Run did not finish properly")
 
             # get job number
-            jobnumber = filename[-8:]
-            print(jobnumber)
-        if args.mod.upper() == "ICON-HAM":
+            jobnumber = extract_job_id(filename)
+        elif args.mod.upper() == "ICON-HAM":
             # get # nodes and wallclock
-            if args.no_sys_report:
-                # infer nnodes from MPI-procs in ICON output
-                nodes_line = grep("mo_mpi::start_mpi ICON: Globally run on",
-                                  filename)["line"][0]
-                nnodes = int(nodes_line.split(' ')[6])
+            # infer nnodes from MPI-procs in ICON output
+            nodes_line = grep("mo_mpi::start_mpi ICON: Globally run on",
+                              filename)["line"][0]
+            nnodes = int(nodes_line.split(' ')[6])
+            nnodes = nnodes // args.mpi_procs_per_node
 
-                nnodes = nnodes // args.cpu_per_node
-
-                wallclock = get_wallclock_icon(filename, args.no_x,
-                                               num_ok=0)["wc"].total_seconds()
-                date_run = get_wallclock_icon(filename, args.no_x,
-                                              num_ok=0)["st"]
-            else:
-                n_wc_st = get_wallclock_Nnodes_gen_daint(filename,
-                                                         use_timer_report=True)
-                nnodes = n_wc_st["n"]
-                wallclock = n_wc_st["wc"].total_seconds()
-                date_run = n_wc_st["st"]
+            # get_wallclock_icon returns (wallclock in seconds, start date)
+            wallclock, date_run = get_wallclock_icon(filename, args.no_x,
+                                                     num_ok=0)
 
             # get job number
             jobnumber = float(filename.split('.')[-2])
 
-        elif args.mod.upper() == "ECHAM-HAM":
-            ncpus_line = grep("Total number of PEs", filename)["line"][0]
-            ncpus = int(ncpus_line.split(':')[1].split()[0].strip())
-            nnodes = ncpus / float(args.cpu_per_node)
-
-            wallclock_line = grep("Wallclock", filename)["line"][0]
-            wallclock = float(wallclock_line.split(':')[1].strip()[:-1])
-
-            date_run = get_date_from_echam_slurm_file(filename)
-
-            jobnumber = float(filename.replace('_', '.').split('.')[-2])
-
         # fill array in
         np_2print.append([nnodes, wallclock, jobnumber, date_run])
 
@@ -477,11 +333,9 @@ def get_wallclock_Nnodes_gen_daint(filename,
     perf_sorted.to_csv(filename_out,
                        columns=[
                            'Date', 'Jobnumber', 'N_Nodes', 'Wallclock',
-                           'Wallclock_hum', 'Speedup', 'Node_hours',
-                           'Efficiency', 'NH_year'
+                           'Wallclock_hum', 'Speedup', 'Efficiency',
+                           'Node_hours', 'NH_year'
                        ],
                        sep=';',
                        index=False,
                        float_format="%.2f")
-
-################################################################################
diff --git a/def_exps_plot.py b/def_exps_plot.py
index 780f1d5..7ea48bf 100644
--- a/def_exps_plot.py
+++ b/def_exps_plot.py
@@ -31,8 +31,8 @@ def __init__(self,
                      color='#253494',
                      linestyle='-')
 
-daint_01 = experiment(name='icon_cordex_12km_era5_gpu_20230222',
-                      label='CORDEX-12km',
+daint_01 = experiment(name='icon-clm_scaling',
+                      label='EUR-12km',
                       bestconf=36,
                       marker='>',
                       color='#253494',
diff --git a/def_exps_plot_2019.py b/def_exps_plot_2019.py
deleted file mode 100644
index 869b00d..0000000
---
a/def_exps_plot_2019.py +++ /dev/null @@ -1,117 +0,0 @@ -# definition of the object "experiment". It contains mostly the potting properties - -import numpy as np - - -class experiment: - - def __init__(self, - name, - label=None, - bestconf=np.nan, - linewidth=1., - **kwargs): - self.name = name - if label is None: - self.label = name - else: - self.label = label - self.bestconf = bestconf - - self.line_appareance = kwargs - self.line_appareance['linewidth'] = linewidth - - -icon_amip_intel = experiment( - name='atm_amip_intel', - label='ICON intel', - bestconf=64, - marker='<', - color='Red') #,linestyle = '-')#, marker = 'd', linestyle = '-') -icon_amip_6h_intel = experiment( - name='atm_amip_intel_6h', - label='ICON intel 6h', - bestconf=64, - marker='.', - color='Red') #,linestyle = '--')#, marker = 'd', linestyle = '-') -icon_amip_1m_intel = experiment(name='atm_amip_intel_1m', - label='ICON intel 1m', - bestconf=40, - marker='*', - color='Red') #, marker = 'd', linestyle = '-') - -icon_amip_6h_cray = experiment( - name='atm_amip_6h', - label='ICON cray 6h', - bestconf=24, - marker='.', - color='Green', - linestyle='--') #, marker = 'c', linestyle = '-') -icon_amip_1m_cray = experiment( - name='atm_amip_1m', - label='ICON cray 1m', - bestconf=24, - marker='*', - color='LightGreen') #, marker = '*', linestyle = '-') -icon_lam_init_cray = experiment(name='ICON_limarea_Bernhard_init', - label='ICON-LAM init cray', - bestconf=24, - color='Green', - marker='.') #, linestyle = '-') -icon_lam_cray = experiment(name='ICON_limarea_Bernhard_7d', - label='ICON-LAM cray', - bestconf=128, - color='Green', - marker='s') #, linestyle = '-') - -icon_amip_6h_final = experiment( - name='atm_amip_intel_6h', - label='ICON 6h', - bestconf=64, - marker='.', - color='Magenta', - linestyle='--') #, marker = 'd', linestyle = '-') -icon_amip_1m_final = experiment( - name='atm_amip_intel_1m', - label='ICON 1m', - bestconf=40, - marker='.', - color='Purple') #, marker = 'd', linestyle = '-') -icon_lam_final = experiment(name='ICON_limarea_Bernhard_7d', - label='ICON-LAM', - bestconf=128, - color='Red', - marker='>', - markersize=4, - linestyle='--') -icon_ham_final = experiment(name='atm_amip_hammoz_marc', - label='ICON-HAM', - bestconf=24, - marker='.', - color='Blue', - linestyle='-') -e63h23_1m_final = experiment(name='e63ham_1m', - label='ECHAM-HAM', - bestconf=36, - color='Green', - marker='*', - linestyle='-') -esmham_1m_final = experiment(name='mpiesm-ham_1m', - label='MPI-ESM-HAM', - bestconf=48, - color='LightGreen', - marker='d', - markersize=4, - linestyle='-') - -icon_amip_6h_pgi = experiment(name='atm_amip_pgi_6h', - label='ICON PGI 6h', - bestconf=24, - marker='.', - color='darkviolet', - linestyle='--') -icon_amip_1m_pgi = experiment(name='atm_amip_pgi_1m', - label='ICON PGI 1m', - bestconf=24, - marker='*', - color='plum') diff --git a/def_exps_plot_2021.py b/def_exps_plot_2021.py deleted file mode 100644 index f7b9cb4..0000000 --- a/def_exps_plot_2021.py +++ /dev/null @@ -1,70 +0,0 @@ -# definition of the object "experiment". 
It contains mostly the potting properties - -import numpy as np - - -class experiment: - - def __init__(self, - name, - label=None, - bestconf=np.nan, - linewidth=1., - **kwargs): - self.name = name - if label is None: - self.label = name - else: - self.label = label - self.bestconf = bestconf - - self.line_appareance = kwargs - self.line_appareance['linewidth'] = linewidth - - -# Color palette -#fdcc8a -#fc8d59 -#d7301f -#bdc9e1 -#67a9cf -#02818a - -# Definition of each experiment properties (colors, labels,ect) - -echam_ham_amip_T63L47 = experiment(name='ECHAM-HAM_amip_T63L47', - label='ECHAM-HAM 1M (cpu, intel)', - bestconf=36, - marker='>', - color='#fdcc8a', - linestyle='-') -icon_ham_amip = experiment(name='ICON-HAM_amip', - label='ICON-HAM 1M (cpu, pgi)', - bestconf=19, - marker='<', - color='#fc8d59', - linestyle='-') -icon_cpu_gcc_amip = experiment(name='ICON_cpu_gcc_amip', - label='ICON 1M (cpu, gcc)', - bestconf=43, - marker='x', - color='#bdc9e1', - linestyle='--') -icon_cpu_pgi_amip = experiment(name='ICON_cpu_pgi_amip', - label='ICON 1M (cpu, pgi)', - bestconf=42, - marker='x', - color='#67a9cf', - linestyle='-') -icon_gpu_pgi_amip_rte = experiment(name='ICON_gpu_pgi_amip_rte', - label='ICON 1M (gpu, pgi)', - bestconf=4, - marker='v', - color='#02818a', - linestyle='-') -icon_r2b9 = experiment(name='ICON_R2B9', - label='ICON@R2B9 1h (gpu, pgi)', - bestconf=1692, - marker='.', - color='#b30000', - linestyle='-') diff --git a/parse_craypat.py b/parse_craypat.py deleted file mode 100644 index d03157e..0000000 --- a/parse_craypat.py +++ /dev/null @@ -1,197 +0,0 @@ -#!/usr/bin/python - -# Parse the craypat analysis files to extract the info CSCS ask and create a unique csv file -# The script will read recursively all the files named "summary*.txt" in the current directory - -# Colombe Siegenthaler C2SM (ETHZ) , 2018-10 - -import numpy as np -import pandas as pd # needs pythn modules to be loaded: module load cray-python/3.6.5.1 PyExtensions/3.6.5.1-CrayGNU-18.08 -import os -import glob -import argparse -import subprocess -import sys - - -def decide_summary_filename(runtime_file, path_dir_for_sumfile, exp_name): - # create filename of teh summary file - # for MPI-ESM, there are two models running in paralell, each of them create a runtime file, - # so each of them needs a summary file - - if len(glob.glob('{}/*+*'.format(path_dir_for_sumfile))) > 1: - mod = os.path.relpath(runtime_file, path_dir_for_sumfile).split('+')[0] - filename_sum = 'summary_{}_{}.txt'.format(exp_name, mod) - else: - filename_sum = 'summary_{}.txt'.format(exp_name) - - out_summary_file = os.path.join(path_dir_for_sumfile, filename_sum) - - return (out_summary_file) - - -def create_summary_file(runtime_file, path_dir_for_sumfile, exp_name): - # Create summary_[label_model_name].txt from RUNTIME (written by craypat tool) file - # input is runtime file - - # final summary file for this exp - out_summary_file = decide_summary_filename(runtime_file, - path_dir_for_sumfile, exp_name) - - if not os.path.isfile(runtime_file): - print('Warning: Runtime file is not a proper file : {}'.format( - runtime_file)) - - # copy part of runtime file into summary file - with open(runtime_file) as fin, open(out_summary_file, 'w') as fout: - for line in fin: - # get starting point - if line.startswith("#"): - continue - - # copy the line into fout - fout.write(line) - - # ending point - if line.startswith('I/O Write Rate'): - break - - return (out_summary_file) - - -def extract_dir_exp(runtime_file): - # get the general path to 
exp dir - - path_dir = os.path.dirname(runtime_file.split('+')[0]) - exp_name = os.path.basename(path_dir) - - return (path_dir, exp_name) - - -def get_slurm_file_dep_mod(path_dir): - # get the path to the slurm file depending on the model family - - gen_mod_family = os.path.join(path_dir).split('/')[-2].upper() - if gen_mod_family.startswith('ICON'): - slurm_file_path = glob.glob('{}/LOG*.o'.format(path_dir)) - elif (gen_mod_family.startswith('ECHAM') - or gen_mod_family.startswith('MPI-ESM')): - slurm_file_path = glob.glob('{}/slurm*.txt'.format(path_dir)) - else: - print("Warning: No rule for finding the slurm filefor the file {}.". - format(filename)) - print( - "Rules for finding slurm files are only defined for module family : ECHAM, MPI-ESM or ICON " - ) - print('The family model found is : {}'.format(gen_mod_family)) - slurm_file_path = [] - - return (slurm_file_path) - - -def get_jobnumber_from_slurmfile(slurm_file_path): - # get the jobnumber from the slurm filename - - if not len(slurm_file_path) == 1: - print("Warning, several or no slurm file.") - print("The following files are found:{}".format( - "\n".join(slurm_file_path))) - print("Set job number to 0") - jobnumber = "0" - else: - jobnumber = os.path.basename(slurm_file_path[0]).split('.')[-2] - - # remove the submission number (especially for echam run) - jobnumber = jobnumber.split('_')[-1] - - return (jobnumber) - - -def get_jobnumber(path_dir): - - # add look for Jobnumber of the craypat run - > need to find slurm file - slurm_file_path = get_slurm_file_dep_mod(path_dir) - - # get the jobnumber from the slurm filename - jobnumber = get_jobnumber_from_slurmfile(slurm_file_path) - - return (jobnumber) - - -if __name__ == "__main__": - # parsing arguments - parser = argparse.ArgumentParser() - parser.add_argument('--exclude', '-e', dest = 'exclude_dir',\ - default = [],\ - nargs = '*',\ - help='folders to exclude.') - parser.add_argument('--out_f', '-o', dest = 'out_f',\ - default = 'Craypat_table',\ - help = 'filename of the output.') - args = parser.parse_args() - - # get current directory - pwd = os.getcwd() - - # find all the summary files - all_files = glob.glob('{}/**/RUNTIME.rpt'.format(pwd), recursive=True) - - # definition of teh directories to exclude - #exclude_dir = ['before_update_Oct2018'] - files_to_exclude = [] - for filename in all_files: - if any([s in filename for s in args.exclude_dir]): - files_to_exclude.append(filename) - - # exclude files - for f in files_to_exclude: - all_files.remove(f) - - # define dataframe for output - data_global = pd.DataFrame(columns=['Variable']) - - # parse each file of the list - for ifile, filename in enumerate(all_files): - print( - '----------------------------------------------------------------------' - ) - print('Parsing file {}'.format(filename)) - - path_dir, exp_name = extract_dir_exp(filename) - - # creation of a summary file from the report file (written by Craypat tool) - summary_file_exp = create_summary_file(filename, path_dir, exp_name) - - # read file - data_single = pd.read_csv(summary_file_exp, sep=':', header=None) - - # rename first column into 'Variable' - data_single.rename(columns={0: "Variable"}, inplace=True) - - # retrieve number of columns - ncol = len(data_single.columns) - - # combine all the columns instead Variable together (the HH:MM:SS were separated by mistake ) - data_single[exp_name] = data_single[1] - del data_single[1] - for icol in np.arange(2, ncol): - data_single[exp_name] = data_single[exp_name] + ':' + data_single[ - 
icol].fillna("")
-            del data_single[icol]
-
-        # delete '::' in case it was added in teh column combination
-        data_single[exp_name] = data_single[exp_name].str.rstrip(':')
-
-        # get the jobnumber from the slurm filename in the directory
-        jobnumber = get_jobnumber(path_dir)
-
-        # add the jobnumber in the dtaaframe as a new line
-        data_single.loc[len(data_single)] = ['Job Number', jobnumber]
-
-        # fill the outter dataframe
-        data_global = pd.merge(data_global, \
-                               data_single.rename(columns={exp_name:exp_name.replace('_',' ')}), \
-                               how='outer', on=['Variable'])
-
-    # write out the global dataframe
-    data_global.to_csv('{}.csv'.format(args.out_f), sep=',', index=False)
diff --git a/plot_perfs.py b/plot_perfs.py
index 3d7f60d..67484a5 100755
--- a/plot_perfs.py
+++ b/plot_perfs.py
@@ -150,14 +150,17 @@
                 best_n = exp.bestconf
                 if best_n in dt.N_Nodes.values:
                     if var_to_plot == 'Efficiency':
-                        perf_chosen = float(
-                            dt[dt.N_Nodes == best_n].Efficiency)
-                    if var_to_plot == 'Wallclock':
-                        perf_chosen = float(dt[dt.N_Nodes == best_n].Wallclock)
-                    if var_to_plot == 'Speedup':
-                        perf_chosen = float(dt[dt.N_Nodes == best_n].Speedup)
-                    if var_to_plot == 'NH_year':
-                        perf_chosen = float(dt[dt.N_Nodes == best_n].NH_year)
+                        perf_chosen = float(dt.loc[dt.N_Nodes == best_n,
+                                                   'Efficiency'].iloc[0])
+                    elif var_to_plot == 'Wallclock':
+                        perf_chosen = float(dt.loc[dt.N_Nodes == best_n,
+                                                   'Wallclock'].iloc[0])
+                    elif var_to_plot == 'Speedup':
+                        perf_chosen = float(dt.loc[dt.N_Nodes == best_n,
+                                                   'Speedup'].iloc[0])
+                    elif var_to_plot == 'NH_year':
+                        perf_chosen = float(dt.loc[dt.N_Nodes == best_n,
+                                                   'NH_year'].iloc[0])
 
                     ax.scatter(best_n,
                                perf_chosen,
                                s=120.,
diff --git a/send_analyse_different_exp_at_once_ICON.py b/send_analyse_different_exp_at_once.py
similarity index 95%
rename from send_analyse_different_exp_at_once_ICON.py
rename to send_analyse_different_exp_at_once.py
index 1cf8093..9b2bdd5 100755
--- a/send_analyse_different_exp_at_once_ICON.py
+++ b/send_analyse_different_exp_at_once.py
@@ -70,9 +70,6 @@ def __init__(self, name, path, mod=None, factor=None, comp=None):
                 '-e', exp.name,'-o', exp.comp, \
                 '-NH','6', '-n', '1', '12','16','36','48'])
     elif lo_send_batch:
-        print(
-            'WARNING : Sending different experiments with different numbers of nodes for ECHAM_HAM has not been implemented yet'
-        )
         print('The experiment {} is not done asssociated is : {}'.format(
             os.path.join(exp.path, 'run', exp.name), exp.mod.upper()))
     else:
diff --git a/send_several_run_ncpus_perf.py b/send_several_run_ncpus_perf.py
index 16a0d32..aaaf707 100755
--- a/send_several_run_ncpus_perf.py
+++ b/send_several_run_ncpus_perf.py
@@ -1,10 +1,11 @@
 #!/usr/bin/python
 
-# Wrapper to send several ECHAM-(HAM) runs using the jobscriptoolkit
+# Wrapper to send several ICON(-HAM) runs
 # for performance anaylsis with different number of cpus
 #
-# Usage: go into your echam run directory, prepare the basis run set-up
-# call the present script
+# This script uses the automatic running script generation in ICON (make_target_runscript).
+#
+# Usage : send_several_run_ncpus_perf.py -b $SCRATCH/icon-eniac/ -e my_exp -n 10 12 15
 #
 # C. Siegenthaler (C2SM) , July 2015
 #
@@ -12,163 +13,190 @@
 
 import numpy as np
 import os
-import datetime
 import argparse
+import datetime
+
+
+def create_runscript(exp_base, output_postfix, nnodes, nproma=None):
+    # name experiment
+    exp_nnodes = "{}{}_nnodes{}".format(exp_base, output_postfix, nnodes)
+
+    # create scripts
+    if nproma is None:
+        os.system(
+            "/bin/bash ./run/make_target_runscript in_folder=run in_script=exp.{} in_script=exec.iconrun out_script=exp.{}.run EXPNAME={} memory_model='large' omp_stacksize=200M grids_folder='/scratch/snx3000/colombsi/ICON_input/grids' no_of_nodes={}"
+            .format(exp_base, exp_nnodes, exp_nnodes, nnodes))
+    else:
+        os.system(
+            "/bin/bash ./run/make_target_runscript in_folder=run in_script=exp.{} in_script=exec.iconrun out_script=exp.{}.run EXPNAME={} memory_model='large' omp_stacksize=200M grids_folder='/scratch/snx3000/colombsi/ICON_input/grids' no_of_nodes={} nproma={}"
+            .format(exp_base, exp_nnodes, exp_nnodes, nnodes, nproma))
+
+    # return name of exp
+    return (exp_nnodes)
+
+
+def define_and_submit_job(hostname, wallclocktime, path_to_newscript, nnodes,
+                          euler_node, account):
+
+    # Daint login nodes
+    if 'daint' in hostname:
+        if account is None:
+            # Use standard account
+            account = os.popen('id -gn').read().split('\n')[0]
+        submit_job = 'sbatch --time=%s --account=%s %s' % (
+            wallclocktime, account, path_to_newscript)
+
+    # Euler login nodes
+    elif 'eu-login' in hostname:
+        if euler_node == 7:
+            submit_job = 'bsub -W %s -n %s -R "select[model==EPYC_7H12]" < %s' % (
+                wallclocktime, nnodes, path_to_newscript)
+        elif euler_node == 6:
+            submit_job = 'bsub -W %s -n %s -R "select[model==EPYC_7742]" < %s' % (
+                wallclocktime, nnodes, path_to_newscript)
+        elif euler_node == 4:
+            submit_job = 'bsub -W %s -n %s -R "select[model==XeonGold_6150]" -R "span[ptile=36]" < %s' % (
+                wallclocktime, nnodes, path_to_newscript)
+        else:
+            print('Error: Please specify a correct Euler node (4, 6 or 7).')
+            exit(-1)
+
+    print(submit_job)
+    os.system(submit_job)
+    print(
+        '--------------------------------------------------------------------------------------------------------'
+    )
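+
+# Illustrative examples (placeholder values) of the submit commands built above:
+#   Daint: sbatch --time=01:30:00 --account=<account> run/exp.my_exp_nnodes16.run
+#   Euler (euler_node=6): bsub -W 01:30 -n 16 -R "select[model==EPYC_7742]" < run/exp.my_exp_nnodes16.run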
"_cray" will give exp.atm_amip_cray_nnodesX.run') + parser.add_argument('--arrange_nnodes', '-a', dest = 'arrange_nnodes',\ + default = [1,11,1],\ type = int,\ - help = 'number of iterations (niter simulations will be performed with the number of cpus for each simulation is [1,2,....,niter-1]*ncpus_incr.') - - parser.add_argument('--nbeg_iter', dest = 'nbeg_iter',\ - default = 1,\ - type = int,\ - help = 'begining of the iteration (the simulations will be performed with the number of cpus for each simulation is [nbeg_iter,nbeg_iter+1....,niter-1]*ncpus_incr.') - - parser.add_argument('--ncpus', dest = 'cpus_to_proceed',\ - default = [],\ - type = int,\ - nargs = '*',\ - help = 'cups number of the simulation to analyse.This have priority over -ncpus_incr, -niter and -nbeg_iter') + nargs = 3,\ + help = 'list of number of nodes in the np.arrange format : [begining iteration, end iteration, step]. Default:[1,11,1] (=[1,2,3,4,5,6,7,8,9,10]') parser.add_argument('--nnodes', '-n', dest = 'nodes_to_proceed',\ default = [],\ type = int,\ nargs = '*',\ - help = 'nodes number of the simulation to analyse. This have priority over -ncpus_incr, -niter and -nbeg_iter') - parser.add_argument('--cpu_per_node', dest = 'cpu_per_node',\ - default = 12,\ + help = 'cups number of the simulation to analyse. This have priority over -arrange_nnodes') + parser.add_argument('--wallclock','-w', dest = 'wallclock' ,\ + default = None,\ + type = str,\ + help = 'wallclock to use when sending the run to the batch system') + parser.add_argument('--nproma','-p', dest = 'nproma' ,\ + default = None,\ + type = int,\ + help = 'value of nproma') + parser.add_argument('--oneNH','-NH', dest = 'oneNH' ,\ + default = 24,\ + type = int,\ + help = 'estimation of one node hour (wallclock time in hour when running on 1 node). This will be used for estimating a wallclock to use. In case -w is set, oneNH is not used.') + parser.add_argument('--euler-node','-m', dest = 'euler_node' ,\ + default = 6,\ type = int,\ - help = 'numper of CPUs per node') - parser.add_argument('-d', action='store_true',\ - help = 'perform dry run, i.e. run the script competely, but do not send the jobs to the batch queue') - parser.add_argument('-dw', action='store_true',\ - help = 'redifine walltime') + help = 'node type for Euler simulations') + parser.add_argument('--account','-A', dest = 'account' ,\ + default = None,\ + type = str,\ + help = 'project account on Piz Daint') args = parser.parse_args() - # define number of cpus for which experiment should be sent - #------------------------------------------------------------- - l_cpus_def = False + hostname = os.uname()[1] - if (len(args.cpus_to_proceed) > 0) and (len(args.nodes_to_proceed) > 0): - print( - 'You can specify either the number of cpus or the number of nodes, not both.' 
- ) - print('Exiting') - exit(1) - - if (len(args.nodes_to_proceed) > 0): - args.cpus_to_proceed = args.cpu_per_node * np.array( - args.nodes_to_proceed) - l_cpus_def = True - - if len(args.cpus_to_proceed) > 0: - l_cpus_def = True - - if not l_cpus_def: - args.cpus_to_proceed = (np.arange(args.nbeg_iter, args.niter) * - args.cpus_incr) - l_cpus_def = True - - # define new experiment name - #-------------------------------------------------------------- - # experiment name basis exp - exp_name_bas_exp = os.path.basename(args.basis_folder) - - # setting filename basis exp - setting_bas_exp = os.path.join(args.basis_folder, - 'settings_{}'.format(exp_name_bas_exp)) - - # check if the basis name is finishing by "cpusXX" and assign kernel name of teh new experiments - if (exp_name_bas_exp.split('_')[-1].startswith('cpus')): - exp_name_nucl = '_'.join( - exp_name_bas_exp.split('_')[:-1]) # name of the new experiments - else: - exp_name_nucl = exp_name_bas_exp - - # get walltime and cpus from basis exp, for computning later the new walltime - #-------------------------------------------------------------- - def grep(string, filename): - # returns lines of file_name where string appears - # mimic the "grep" function - - # list of lines where string is found - list_line = [] - - for line in open(filename): - if string in line: - list_line.append(line) - return list_line - - def value_string_file(string, filename): - # returns the value of a variable defined in a file - # e.g. for walltime, returns 8:00:00 if if filename walltime=8:00:00 - - #initialisation - values = [] - - # list of occurences found by "grep" - occurences = grep(string, setting_bas_exp) + # Daint login nodes + if 'daint' in hostname: + print('Host is Daint') - for occ in occurences: + # Euler login nodes + elif 'eu-login' in hostname: + print('Host is Euler') - #remove comments - line_wo_comment = occ.split('#')[0] - - # get value - def_split = [s.strip() for s in line_wo_comment.split('=')] + # unknown host + else: + print("Unknown host with hostname %s" % (hostname)) + exit(-1) - # do not consider variable if string is in the middle of another word - if string == def_split[0]: - values.append(def_split[1]) - return (values) + # base experiment + exp_base = args.exp_base - walltime_bas = value_string_file("walltime", setting_bas_exp)[0].strip('"') - ncpus_bas = int(value_string_file("ncpus", setting_bas_exp)[0].strip('"')) + # Euler node + euler_node = args.euler_node - #time in datetime format - basis_day = "2000-01-01" - walltime_datetime = datetime.datetime.strptime('{} {}'.format(basis_day,walltime_bas), '%Y-%m-%d %H:%M:%S') - \ - datetime.datetime.strptime(basis_day, '%Y-%m-%d') + # account + account = args.account - # send experiments - #--------------------------------------------------------------- + if len(args.nodes_to_proceed) == 0: + args.nodes_to_proceed = np.arange(args.arrange_nnodes[0], + args.arrange_nnodes[1], + args.arrange_nnodes[2]) # change directory to be in the basis folder - # os.chdir(args.basis_folder) - - # loop over number of cpus to be lauched - for ncpus in args.cpus_to_proceed: - - # define name of the new experiment - new_exp_name = '%s_cpus%i' % (exp_name_nucl, ncpus - ) # new experiment name - - # new walltime - comp_walltime = (walltime_datetime * ncpus_bas / ncpus) - new_walltime = datetime.timedelta( - seconds=round(comp_walltime.total_seconds())) - - # job definition and submission - string_to_overwrite = "ncpus={};exp={}".format(ncpus, new_exp_name) - if args.dw: - string_to_overwrite += 
";walltime={}".format(new_walltime) - - print('jobsubm_echam.sh -o "{}" {}'.format(string_to_overwrite, - setting_bas_exp)) + if os.path.isdir(args.basis_folder): + os.chdir(args.basis_folder) + else: + print("The following basis direcotory does not exist :%s" % + args.basis_folder) print( - '--------------------------------------------------------------------------------------------------------' + "Please give an existing directory with the option -basis_folder_icon" ) - if not args.d: - os.system('jobsubm_echam.sh -o "{}" {}'.format( - string_to_overwrite, setting_bas_exp)) + print("Exiting") + exit(-1) + + # define run dir + path_run_dir = os.path.join(args.basis_folder, "run") + + # estimated time for one node + one_node_hour = args.oneNH + + # nproma + nproma = args.nproma + + # loop over number of nodes to create scripts + for nnodes in args.nodes_to_proceed: + + # need to be in basis folder to have some function defined + os.chdir(args.basis_folder) + + # create the runscripts with the icon script creating tool + print("Create runscript") + new_script = create_runscript(exp_base, args.output_postfix, nnodes, + nproma) + + # path to the newly created script (needed for launching it) + path_to_newscript = os.path.join(path_run_dir, + "exp.%s.run" % new_script) + + # need to be in run folder to have some function defined + os.chdir(path_run_dir) + + wallclocktime = args.wallclock + # roughly estimated time in sbatch format + if args.wallclock is None: + seconds = datetime.timedelta(hours=np.float(one_node_hour) / + nnodes).total_seconds() + hours = seconds // 3600 + minutes = (seconds % 3600) // 60 + if 'eu-login' in hostname: + # ensure no -W 0:00 request + if seconds < 1200: + minutes = 20 + wallclocktime = "%02i:%02i" % (hours, minutes) + else: + wallclocktime = "%02i:%02i:00" % (hours, minutes) + + # submit machine-dependent job + define_and_submit_job(hostname, wallclocktime, path_to_newscript, + nnodes, euler_node, account) diff --git a/send_several_run_ncpus_perf_ICON.py b/send_several_run_ncpus_perf_ICON.py deleted file mode 100755 index aaaf707..0000000 --- a/send_several_run_ncpus_perf_ICON.py +++ /dev/null @@ -1,202 +0,0 @@ -#!/usr/bin/python - -# Wrapper to send several ICON (-HAM) runs -# for performance anaylsis with different number of cpus -# -# This script uses the automatic running script generation in ICON (make_target_runscript). -# -# Usage : send_several_run_ncpus_perf_ICON.py -b $SCRATCH/icon-eniac/ -e my_exp -n 10 12 15 -# -# C. 
Siegenthaler (C2SM) , July 2015 -# -############################################################################################ - -import numpy as np -import os -import argparse -import datetime - - -def create_runscript(exp_base, output_postfix, nnodes, nproma=None): - # name experiment - exp_nnodes = "{}{}_nnodes{}".format(exp_base, output_postfix, nnodes) - - # create scripts - if nproma is None: - os.system( - "/bin/bash ./run/make_target_runscript in_folder=run in_script=exp.{} in_script=exec.iconrun out_script=exp.{}.run EXPNAME={} memory_model='large' omp_stacksize=200M grids_folder='/scratch/snx3000/colombsi/ICON_input/grids' no_of_nodes={}" - .format(exp_base, exp_nnodes, exp_nnodes, nnodes)) - else: - os.system( - "/bin/bash ./run/make_target_runscript in_folder=run in_script=exp.{} in_script=exec.iconrun out_script=exp.{}.run EXPNAME={} memory_model='large' omp_stacksize=200M grids_folder='/scratch/snx3000/colombsi/ICON_input/grids' no_of_nodes={} nproma={}" - .format(exp_base, exp_nnodes, exp_nnodes, nnodes, nproma)) - - #return name of exp - return (exp_nnodes) - - -def define_and_submit_job(hostname, wallclocktime, path_to_newscript, nnodes, - euler_node, account): - - # Daint login nodes - if 'daint' in hostname: - if account == None: - # Use standard account - account = os.popen('id -gn').read().split('\n')[0] - submit_job = 'sbatch --time=%s --account=%s %s' % ( - wallclocktime, account, path_to_newscript) - - # Euler login nodes - elif 'eu-login' in hostname: - if euler_node == 7: - submit_job = 'bsub -W %s -n %s -R "select[model==EPYC_7H12]" < %s' % ( - wallclocktime, nnodes, path_to_newscript) - elif euler_node == 6: - submit_job = 'bsub -W %s -n %s -R "select[model==EPYC_7742]" < %s' % ( - wallclocktime, nnodes, path_to_newscript) - elif euler_node == 4: - submit_job = 'bsub -W %s -n %s -R "select[model==XeonGold_6150]" -R "span[ptile=36]" < %s' % ( - wallclocktime, nnodes, path_to_newscript) - else: - print('Error: Please specify a correct Euler node (4,6 or 7).') - exit(-1) - - print(submit_job) - os.system(submit_job) - print( - '--------------------------------------------------------------------------------------------------------' - ) - - -if __name__ == "__main__": - - # parsing arguments - parser = argparse.ArgumentParser() - parser.add_argument('--basis_folder_icon', '-b', dest = 'basis_folder',\ - default = os.getcwd(),\ - help='basis model folder e.g. /users/colombsi/icon-hammoz') - parser.add_argument('--exp_base', '-e', dest = 'exp_base',\ - default = 'atm_amip_1month',\ - help='basis model folder e.g. atm_amip_1month') - parser.add_argument('--output_postfix', '-o', dest = 'output_postfix',\ - default = '',\ - help='postfix for the output name of the running scripts e.g. "_cray" will give exp.atm_amip_cray_nnodesX.run') - parser.add_argument('--arrange_nnodes', '-a', dest = 'arrange_nnodes',\ - default = [1,11,1],\ - type = int,\ - nargs = 3,\ - help = 'list of number of nodes in the np.arrange format : [begining iteration, end iteration, step]. Default:[1,11,1] (=[1,2,3,4,5,6,7,8,9,10]') - parser.add_argument('--nnodes', '-n', dest = 'nodes_to_proceed',\ - default = [],\ - type = int,\ - nargs = '*',\ - help = 'cups number of the simulation to analyse. 
This have priority over -arrange_nnodes') - parser.add_argument('--wallclock','-w', dest = 'wallclock' ,\ - default = None,\ - type = str,\ - help = 'wallclock to use when sending the run to the batch system') - parser.add_argument('--nproma','-p', dest = 'nproma' ,\ - default = None,\ - type = int,\ - help = 'value of nproma') - parser.add_argument('--oneNH','-NH', dest = 'oneNH' ,\ - default = 24,\ - type = int,\ - help = 'estimation of one node hour (wallclock time in hour when running on 1 node). This will be used for estimating a wallclock to use. In case -w is set, oneNH is not used.') - parser.add_argument('--euler-node','-m', dest = 'euler_node' ,\ - default = 6,\ - type = int,\ - help = 'node type for Euler simulations') - parser.add_argument('--account','-A', dest = 'account' ,\ - default = None,\ - type = str,\ - help = 'project account on Piz Daint') - - args = parser.parse_args() - - hostname = os.uname()[1] - - # Daint login nodes - if 'daint' in hostname: - print('Host is Daint') - - # Euler login nodes - elif 'eu-login' in hostname: - print('Host is Euler') - - # unknown host - else: - print("Unknown host with hostname %s" % (hostname)) - exit(-1) - - # base experiment - exp_base = args.exp_base - - # Euler node - euler_node = args.euler_node - - # account - account = args.account - - if len(args.nodes_to_proceed) == 0: - args.nodes_to_proceed = np.arange(args.arrange_nnodes[0], - args.arrange_nnodes[1], - args.arrange_nnodes[2]) - - # change directory to be in the basis folder - if os.path.isdir(args.basis_folder): - os.chdir(args.basis_folder) - else: - print("The following basis direcotory does not exist :%s" % - args.basis_folder) - print( - "Please give an existing directory with the option -basis_folder_icon" - ) - print("Exiting") - exit(-1) - - # define run dir - path_run_dir = os.path.join(args.basis_folder, "run") - - # estimated time for one node - one_node_hour = args.oneNH - - # nproma - nproma = args.nproma - - # loop over number of nodes to create scripts - for nnodes in args.nodes_to_proceed: - - # need to be in basis folder to have some function defined - os.chdir(args.basis_folder) - - # create the runscripts with the icon script creating tool - print("Create runscript") - new_script = create_runscript(exp_base, args.output_postfix, nnodes, - nproma) - - # path to the newly created script (needed for launching it) - path_to_newscript = os.path.join(path_run_dir, - "exp.%s.run" % new_script) - - # need to be in run folder to have some function defined - os.chdir(path_run_dir) - - wallclocktime = args.wallclock - # roughly estimated time in sbatch format - if args.wallclock is None: - seconds = datetime.timedelta(hours=np.float(one_node_hour) / - nnodes).total_seconds() - hours = seconds // 3600 - minutes = (seconds % 3600) // 60 - if 'eu-login' in hostname: - # ensure no -W 0:00 request - if seconds < 1200: - minutes = 20 - wallclocktime = "%02i:%02i" % (hours, minutes) - else: - wallclocktime = "%02i:%02i:00" % (hours, minutes) - - # submit machine-dependent job - define_and_submit_job(hostname, wallclocktime, path_to_newscript, - nnodes, euler_node, account)
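
A note on the walltime heuristic kept in `send_several_run_ncpus_perf.py`: when `-w` is not set, the requested walltime is estimated as `oneNH / nnodes` hours, i.e. assuming near-perfect strong scaling, with a 20-minute floor on Euler. A minimal standalone sketch of that logic (the helper name `estimate_walltime` is illustrative, not part of the toolset):

```python
import datetime


def estimate_walltime(one_node_hour, nnodes, on_euler=False):
    # assume near-perfect strong scaling: runtime on n nodes ~ one-node runtime / n
    seconds = datetime.timedelta(hours=float(one_node_hour) / nnodes).total_seconds()
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    if on_euler:
        # bsub expects HH:MM; request at least 20 minutes to avoid a "-W 0:00" job
        if seconds < 1200:
            minutes = 20
        return "%02i:%02i" % (hours, minutes)
    # sbatch expects HH:MM:SS
    return "%02i:%02i:00" % (hours, minutes)


print(estimate_walltime(24, 16))  # -> "01:30:00"
```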