
Simulation Troubleshooting

Rajendra Adhikari edited this page Feb 14, 2022 · 13 revisions

Debugging simulation failures

If no results_csv is available in simulation_output_folder/results/results_csv

Either the simulation failed or the postprocessing failed.

  1. Check whether all of the simulations failed. cd into the simulation_output_folder/results/simulation_output/ folder and run the following two commands to count the failed and successful simulations, respectively:

     zgrep -o '"completed_status": "Fail"' *.json.gz | wc -l
     zgrep -o '"completed_status": "Success"' *.json.gz | wc -l

     If all or most of the simulations have failed, move on to the next section to find out why.
  2. Find the jobs with failures. Run the following command to list the job files that contain failed simulations:

     zgrep -l '"completed_status": "Fail"' *.json.gz

     Then, for a particular results_jobxx.json.gz file, run the following command to list the buildings whose simulation failed:

     zgrep -oP '("completed_status":\s*?"Fail").*?("building_id":\s*?[0-9]+)' results_jobxx.json.gz
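The two counting steps above can also be scripted. The following is a minimal Python sketch, not part of the tooling itself: it searches the compressed text for the same "completed_status" strings that the zgrep commands match, and the function name count_statuses is made up for this illustration.

```python
import glob
import gzip
import os
import re
from collections import Counter

# Same text the zgrep commands look for; \s* tolerates spacing differences.
STATUS_RE = re.compile(r'"completed_status":\s*"(Fail|Success)"')


def count_statuses(folder):
    """Return {file name: Counter of statuses} for every *.json.gz file in folder."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.json.gz"))):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            counts[os.path.basename(path)] = Counter(STATUS_RE.findall(f.read()))
    return counts
```

Calling count_statuses(".") from inside results/simulation_output/ gives per-file counts; files with a nonzero "Fail" count are the ones that zgrep -l would list.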

You can now proceed to the next section and try to find out why the simulation failed. Note: if the postprocessing has completed, the *.json.gz files are no longer available, since they are deleted as part of the cleanup process once postprocessing is complete. In that case, use the results_csv approach described in the next section to track down failed simulations.

If a particular building's simulation has failed in results_csv

In the results_up##.csv file there is a column named job_id. To debug a particular simulation, find its row and note the job id, which tells you which archive file holds that simulation's results. The results/simulation_output directory contains a number of files named simulations_job#.tar.gz; find the one corresponding to the job id you noted. You can also use the following command on Eagle to list all rows from results_upxx.csv.gz that have a failed simulation:

zgrep "Fail" results_up00.csv.gz

The result will look something like this:

12,48,2022-02-14 04:21:53,2022-02-14 04:21:54,Fail,,,,...
427,13,2022-02-14 04:21:38,2022-02-14 04:21:40,Fail,,,,,,,,,,,,,,
716,12,2022-02-14 04:24:55,2022-02-14 04:24:56,Fail,,,,,,,,,,,,,,

For the first line, 12 is the building number and 48 is the job id.
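The same lookup can be sketched in Python when you want the building number and job id as values rather than raw rows. Only the job_id column name is stated on this page. The building_id and completed_status column names are assumptions borrowed from the JSON keys used earlier, so they are exposed as parameters, and failed_rows is a hypothetical helper name.

```python
import csv
import gzip


def failed_rows(csv_gz_path, status_col="completed_status",
                building_col="building_id", job_col="job_id"):
    """Return a list of (building id, job id) for rows whose status is "Fail".

    The default status and building column names are assumptions;
    check them against the header of your own results file.
    """
    failures = []
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[status_col] == "Fail":
                failures.append((int(row[building_col]), int(row[job_col])))
    return failures
```

Unlike zgrep "Fail", this compares only the status column, so a row that merely contains the text "Fail" in some other field is not reported.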

You can extract the folder for the simulation in question with the following tar command:

tar xvzf simulations_job48.tar.gz ./up00/bldg0000012

This extracts the baseline (upgrade = 0) simulation for building_id = 12, which was run with job_id = 48. Change the numbers to get the simulation results you are looking for.
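If you need to pull out many buildings, the tar command above can be approximated with Python's tarfile module. This is a sketch under stated assumptions: the zero-padding widths (two digits for the upgrade, seven for the building) are inferred from the single example path ./up00/bldg0000012, and both function names are hypothetical.

```python
import os
import posixpath
import shutil
import tarfile


def building_prefix(upgrade, building_id):
    """Folder inside the archive, e.g. up00/bldg0000012 (padding widths assumed)."""
    return f"up{upgrade:02d}/bldg{building_id:07d}"


def extract_building(archive_path, upgrade, building_id, dest="."):
    """Copy the regular files of one building's folder into dest; return the paths written."""
    prefix = building_prefix(upgrade, building_id)
    written = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Normalize so that "./up00/..." and "up00/..." compare equal.
            name = posixpath.normpath(member.name)
            if not member.isfile() or not name.startswith(prefix + "/"):
                continue
            target = os.path.join(dest, *name.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, extract_building("simulations_job48.tar.gz", 0, 12) targets the same folder as the tar command above. The sketch copies file contents only and does not restore permissions or timestamps.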

The best first file to look at is the OpenStudio output. It has the following names:

  • on Eagle: singularity_output.log
  • on AWS: os_stdout.log
  • on Local Docker: docker_output.log

(Yes, we should probably change that so they all output a file with the same name.)

Search for the word "error" (case insensitive) in that file to get some guidance about what went wrong.
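A case-insensitive search like this is roughly what grep -in error does. As a minimal Python sketch (error_lines is a hypothetical helper name, and log_path is whichever of the log files listed above applies to your platform):

```python
import re


def error_lines(log_path):
    """Return (line number, text) for every line containing "error", ignoring case."""
    pattern = re.compile("error", re.IGNORECASE)
    hits = []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(log_path, "r", encoding="utf-8", errors="replace") as f:
        for number, line in enumerate(f, start=1):
            if pattern.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits
```

The line numbers make it easier to open the log at the right place and read the context around each hit.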

Printing all the errors across all buildings in a batch run

Use the following command to print all the errors inside a particular simulations_jobXX.tar.gz.

tar xzf simulations_jobXX.tar.gz ./*/*/singularity_output.log --to-command 'grep --label=$TAR_FILENAME -oPH "failed with Measure \w+ reported an error with .*?]"; true'
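For reference, here is a rough Python equivalent of that command, which may be convenient where --to-command is unavailable. It is a sketch, not the project's own tooling: it reuses the regular expression from the grep call, looks only at logs two folders deep to mirror the ./*/*/ pattern, and measure_errors is a hypothetical name. The default log name is the Eagle one, so pass a different log_name for other platforms.

```python
import posixpath
import re
import tarfile

# Same expression as the grep -oP call: lazily match up to the first "]".
MEASURE_ERROR_RE = re.compile(r"failed with Measure \w+ reported an error with .*?]")


def measure_errors(archive_path, log_name="singularity_output.log"):
    """Return (member name, matched text) for each measure error found in the archive."""
    results = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            parts = posixpath.normpath(member.name).split("/")
            # Keep only <upgrade folder>/<building folder>/<log_name>.
            if not member.isfile() or len(parts) != 3 or parts[-1] != log_name:
                continue
            text = tar.extractfile(member).read().decode("utf-8", errors="replace")
            for match in MEASURE_ERROR_RE.finditer(text):
                results.append((member.name, match.group(0)))
    return results
```

Each result pairs the log's path inside the archive with the matched message, similar to the file label that grep prints through --label=$TAR_FILENAME.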