Matplotlib Issue in Error Messages #145
@mgovorcin - when you ran and generated your sample products on I have a feeling this is related to running out of memory on the instance. I am seeing similar issues running the current If we were to set this environment variable, do you set the number of threads to 4 for the entire workflow or just for iono? Thanks for your help.
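For reference, a minimal sketch of the two thread-capping options being asked about, assuming the relevant variable is the standard OpenMP one (OMP_NUM_THREADS); whether this workflow honors it, and the step name in the usage line, are assumptions rather than anything stated in this thread:

```python
import os
import subprocess

# Option 1: cap OpenMP threads for the entire workflow by exporting the
# variable before anything else runs. OMP_NUM_THREADS is the standard
# OpenMP control; whether every step of this workflow honors it is an
# assumption.
os.environ["OMP_NUM_THREADS"] = "4"

# Option 2: cap threads only around a single step (e.g. the ionosphere
# correction) by passing a modified environment to just that process.
def run_step_with_thread_cap(cmd, n_threads=4):
    env = os.environ.copy()
    env["OMP_NUM_THREADS"] = str(n_threads)
    subprocess.run(cmd, env=env, check=True)

# Usage (the module name below is a placeholder, not the real entry point):
# run_step_with_thread_cap(["python", "-m", "iono_step"])
```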
A bug was found locally and resolved in a PR, but the original issue did not go away even after that fix. Running the product suggested in #142 (to create the sample product there), I get the following error:
In leffe, the S1-GUNW has not been generated yet, i.e. the netCDF data is not there, so I believe this error is misleading: the workflow is trying to perform the SET correction and cannot, because this portion of the workflow is trying to open a file that does not exist. @mgovorcin - when you have a chance, please re-try your workflow using this. Another piece that could be affecting this is the
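If the SET (solid earth tide) step really is failing on a missing product file, a guard along the following lines would surface an explicit error instead of the misleading downstream one; the function name and file path are illustrative assumptions, not the workflow's actual API:

```python
from pathlib import Path

def apply_set_correction(gunw_path):
    """Hypothetical guard: fail with an explicit message if the S1-GUNW
    netCDF has not been written yet, instead of letting a downstream
    open/plot call raise a confusing error."""
    nc_file = Path(gunw_path)
    if not nc_file.exists():
        raise FileNotFoundError(
            f"S1-GUNW product not found at {nc_file}; "
            "the SET correction needs the generated netCDF."
        )
    # ... the actual correction would open and update the netCDF here ...

# Example call (the path is illustrative only):
# apply_set_correction("S1-GUNW-sample.nc")
```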
So to be clear: I ran the container locally on our Linux server and found a bug that was fixed by #146. Now, I have reinstalled the environment and rerun the workflow on a test using:
and the process runs to completion. When I run:
I ran out of space on the device (which might be down to how Docker is installed on our Linux server). I am not able to hit the bug above within 2 seconds, as HyP3 is indicating.
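One way to distinguish an out-of-space failure from a genuine code bug is to log free disk space at the start of the run; a minimal standard-library sketch (not part of the actual workflow):

```python
import shutil

def log_free_disk(path="/"):
    """Print and return the free space (in GB) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    print(f"Free disk on {path}: {free_gb:.1f} GB "
          f"({usage.used / usage.total:.0%} used)")
    return free_gb

# Calling this at the start of the workflow (and again before any retry)
# makes an already-full scratch volume obvious in the logs.
log_free_disk("/")
```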
Furthermore, the test that I initially reported as failing can be run locally as:
This, too, does not fail locally after a few seconds. I do set
To answer some of the questions I posed above:
@cmarshak, @AndrewPlayer3 on our team offered to try to reproduce the matplotlib error in one of our copies of HyP3 and investigate further today, FYI.
@cmarshak, @asjohnston-asf After running some test jobs in one of our instances using the topsapp test image, I was not able to reproduce this error.
I've been able to confirm this is due to the job running out of disk space. Jobs in the hyp3-tibet-jpl deployment are limited to 100 GB or 118 GB of disk space. We weren't able to reproduce the issue in an ASF environment because we use a different EC2 instance type that gives jobs 237 GB of disk space.

We attempt every job up to three times. The initial attempt fails after running out of disk in ~2 hours. The subsequent two attempts fail immediately with the Matplotlib error because the disk is already full when they start; the retry attempt starts faster than the Docker engine can release the disk from the first attempt. The HyP3 API only reports the log and processing time from the final attempt, which obscured the underlying failure in the first attempt. I'll forward a log file later today so JPL can review where in the workflow the job is hitting its disk limit.

I recommend reviewing the workflow to reduce its overall disk space requirement. Investigating whether there's an opportunity to delete intermediate files after they've been used would be a good first step. We can also provision more disk space for each job if necessary, but that will make the workflow more expensive. I can provide more detail on the tradeoffs if that's an option you'd like to explore.
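As a rough illustration of the "delete intermediate files after they've been used" suggestion, a sketch is below; the directory names are placeholders and not the actual ISCE2/topsapp output layout, which would need to be confirmed before wiring anything like this in:

```python
import shutil
from pathlib import Path

# Placeholder names for intermediate directories that later stages no
# longer need; these are assumptions, not the real output layout.
INTERMEDIATE_DIRS = ["fine_coreg", "fine_offsets", "burst_slcs"]

def clean_intermediates(work_dir):
    """Delete consumed intermediate directories to keep peak disk usage
    under the per-job limit."""
    for name in INTERMEDIATE_DIRS:
        target = Path(work_dir) / name
        if target.is_dir():
            shutil.rmtree(target)
            print(f"Removed intermediate directory: {target}")

# Example: clean_intermediates("/tmp/topsapp_work")
```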
Thank you so much @asjohnston-asf and @AndrewPlayer3. I don't think it's a good idea to try to reduce the workflow's disk space usage; one of ISCE2's weaknesses is its intermediate I/O, and managing this is really a big ask. If we have been using the expanded disk space for ACCESS, let's do it for Tibet. Questions:
Part of my confusion was my inability to answer these questions. Super appreciate your help!
Two more follow-up questions:
The link you shared only says
If that is indeed the case (and we don't need more storage), I created this PR: ASFHyP3/hyp3#1752. Otherwise, we should find an instance type with more storage, because it's not worth the development effort to manage disk space for this workflow.
Also, thank you for the clarification - I see the "Job Attempts" tab under the Batch job, with different log streams for each attempt. Super helpful.
I can confirm that jobs that failed on Tibet ran to completion when resubmitted using the ACCESS account (with a few transient exceptions). This means the 18 GB that Andrew cited above would likely fix the issue being reported here.
This was resolved using instances with more disk space.
Describe the bug
The recent set of merges to dev produce a matplotlib issue. All the jobs fail rather quickly, e.g.