Simulation Troubleshooting
Either the simulation failed or the postprocessing failed.
- Check whether all of the simulations failed
cd into the simulation_output_folder/results/simulation_output/ folder and run the following commands to count the failed and successful simulations:
zgrep -o '"completed_status": "Fail"' *.json.gz | wc -l
zgrep -o '"completed_status": "Success"' *.json.gz | wc -l
If all or most of the simulations have failed, move on to the next section to find out why.
- Find the job with failures
Run
zgrep -l '"completed_status": "Fail"' *.json.gz
to get the list of job files that contain failed simulations. Then, for a particular results_jobxx.json.gz file, run the following command to list the buildings whose simulations failed:
zgrep -oP '("completed_status":\s*?"Fail").*?("building_id":\s*?[0-9]+)' results_jobxx.json.gz
You can now proceed to the next section and try to find why the simulation failed.
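The two counting commands above can be combined into a per-file summary, which makes it easier to see which job files hold the failures. This is a minimal sketch, not part of the tooling: summarize_jobs is a hypothetical helper name, and it assumes the files follow the results_jobxx.json.gz naming and carry the "completed_status" key used in the commands above.

```shell
# Hypothetical helper: print "<file> <failed> <succeeded>" for each
# results_job*.json.gz file in a directory (default: current directory).
summarize_jobs() {
    dir="${1:-.}"
    for f in "$dir"/results_job*.json.gz; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        # gzip -dc | grep is equivalent to the zgrep calls above
        fail=$(gzip -dc "$f" | grep -o '"completed_status": "Fail"' | wc -l)
        ok=$(gzip -dc "$f" | grep -o '"completed_status": "Success"' | wc -l)
        # $((...)) strips any padding some wc implementations add
        echo "$(basename "$f") $((fail)) $((ok))"
    done
}
```

Calling summarize_jobs with no argument from inside the simulation_output folder prints one line per job file.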
Note: if postprocessing has completed, the *.json.gz files are no longer available because they are deleted during the cleanup that follows postprocessing. In that case, use the approach below to track down failed simulations.
The results_up##.csv file has a column named job_id. If you want to debug a particular simulation, find its row and note the job id; that tells you which archive file holds the simulation results. The results/simulation_output directory contains a number of files named simulations_job#.tar.gz. Find the one corresponding to the job id you noted. You can also use the following command on Eagle to list all rows from results_upxx.csv.gz that have a failed simulation:
zgrep "Fail" results_up00.csv.gz
The result will look something like this:
12,48,2022-02-14 04:21:53,2022-02-14 04:21:54,Fail,,,,...
427,13,2022-02-14 04:21:38,2022-02-14 04:21:40,Fail,,,,,,,,,,,,,,
716,12,2022-02-14 04:24:55,2022-02-14 04:24:56,Fail,,,,,,,,,,,,,,
For the first line, 12 is the building number and 48 is the job id.
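To pull just the building number and job id out of those rows, something like the following could be used. It is a sketch under an assumption taken from the sample output above: that the first five columns are building number, job id, start time, end time, and completion status. failed_rows is a hypothetical helper name; check the CSV header before relying on the column positions.

```shell
# Hypothetical helper: print the building number and job id of every failed row.
# Assumes column 1 = building number, column 2 = job id, column 5 = status,
# as in the sample output above.
failed_rows() {
    gzip -dc "$1" | awk -F, '$5 == "Fail" { print "building_id=" $1, "job_id=" $2 }'
}
```

For example, failed_rows results_up00.csv.gz would print building_id=12 job_id=48 for the first sample row.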
You can extract the folder for the simulation in question with the following tar command:
tar xvzf simulations_job48.tar.gz ./up00/bldg0000012
This extracts the baseline (upgrade = 0) simulation for building_id = 12, which ran with job_id = 48. Change the numbers to get the simulation results you are looking for.
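Building the member path by hand is error-prone because of the zero padding, so a small wrapper may help. This sketch assumes the padding widths seen in the example path (two digits for the upgrade, seven for the building); extract_sim is a hypothetical helper name, and it expects to run in the directory that holds the archives.

```shell
# Hypothetical helper: extract one simulation folder from its job archive.
# Usage: extract_sim <job_id> <upgrade> <building_id>
# Pass plain integers (12, not 012); printf would read a leading 0 as octal.
extract_sim() {
    job_id="$1"
    upgrade="$2"
    building_id="$3"
    # Padding widths are inferred from the example ./up00/bldg0000012
    member=$(printf './up%02d/bldg%07d' "$upgrade" "$building_id")
    tar xzf "simulations_job${job_id}.tar.gz" "$member"
}
```

With this, extract_sim 48 0 12 is equivalent to the tar command shown above, minus the verbose listing.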
The best first file to look at is the OpenStudio output. It has the following names:
- on Eagle: singularity_output.log
- on AWS: os_stdout.log
- on Local Docker: docker_output.log
(Yes, we should probably change that so they all output a file with the same name.)
Search for the word "error" (case insensitive) in that file to get some guidance about what went wrong.
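Since the log name depends on where the simulation ran, the search can be written so that it tries all three names. This is only a sketch: show_errors is a hypothetical helper name, and it takes the extracted simulation folder as its argument.

```shell
# Hypothetical helper: case-insensitive search for "error" in whichever
# OpenStudio log is present in an extracted simulation folder.
show_errors() {
    sim_dir="${1:-.}"
    for name in singularity_output.log os_stdout.log docker_output.log; do
        if [ -f "$sim_dir/$name" ]; then
            # -i ignores case, -n prefixes each match with its line number
            grep -i -n "error" "$sim_dir/$name" || echo "$name: no matches"
        fi
    done
}
```

For example, show_errors up00/bldg0000012 prints each matching line with its line number.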
Use the following command to print all the errors inside a particular simulations_jobXX.tar.gz file:
tar xzf simulations_jobXX.tar.gz ./*/*/singularity_output.log --to-command 'grep --label=$TAR_FILENAME -oPH "failed with Measure \w+ reported an error with .*?]"; true'