-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: collect det task logs as artifacts for ci jobs #9459
Conversation
✅ Deploy Preview for determined-ui canceled.
|
9d965c3
to
7a54a84
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9459 +/- ##
=======================================
Coverage 48.97% 48.97%
=======================================
Files 1238 1238
Lines 160512 160512
Branches 2784 2784
=======================================
+ Hits 78609 78611 +2
+ Misses 81728 81726 -2
Partials 175 175
Flags with carried forward coverage won't be shown. Click here to find out more. |
740ef70
to
8c70ba2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with some general observations/questions. I won't request any of these changes, but I want to make a note of them (in no particular order).
-
I'm not sure which Determined library functions return errors or throw exceptions, but as it is, I don't see a lot of error handling. How likely is the happy path, and how informative will the resultant error messages be if they break CI?
-
I see a lot of string interpolation/manipulation for paths. Consider Python's pathlib for a higher-level interface for working with OS paths.
-
What is the expected log volume?
-
Is there a way to use an existing Determined venv in CircleCI instead of installing
fire
anddetermined
each time? The compute cycles add up, and I'd rather install a package once ahead of time instead of on every build.
disable log collection for tests mess with master availability
8c70ba2
to
ee02bc9
Compare
Thanks, all good points
This one I'm not sure I'm understanding correctly.
two optimizations:
I added a bit to the cmd to make a report on the size. They only stay around for 30 days afaik
probably? but also that adds a bit of complexity and reduces the portability of this command. I think it already runs as part of the venv that the parent job sets up and if the pkg is found, it should skip it eg here it doesn't download determined https://app.circleci.com/pipelines/github/determined-ai/determined/56582/workflows/f1305a3a-cd98-4741-a914-6b621ccd0448/jobs/2645863/parallel-runs/0/steps/0-128 |
a quick log size estimate as is: looks like each run the total of all the logs collected today is <10MB edit: also addressed this
|
How likely are those Determined library functions to return errors/raise exceptions? Is it possible for any of them to fail and raise an unhandled exception? I don't see much error handling, so I'm wondering if those Determined library functions are always going to succeed, or if there's a possibility that we're not handling errors. For example: |
thanks for explaining, I get it now : ) outside of unexpected bugs that I might be adding here the usual network failures should be covered by our default retry logic that all the API calls via our Session gets determined/harness/determined/common/api/_session.py Lines 18 to 23 in da9820e
any other class of errors you're thinking of? |
Ticket
https://hpe-aiatscale.atlassian.net/browse/RM-237
Description
collect and save tasks (currently just first trial of an experiment's) logs as ci artifacts for e2e tests.
adds about 4s to each job.
a quick log size estimate as is: looks like each run the total of all the logs collected today is <10MB
we have about 1300 test-e2e workflow runs each 30 days that adds up to about 13GB of storage use that should auto clear as the artifacts age out
Test Plan
samples: https://app.circleci.com/pipelines/github/determined-ai/determined/56392/workflows/04457f70-98f8-407c-835f-e5bca81a481b/jobs/2630701/artifacts
sample with task logs https://app.circleci.com/pipelines/github/determined-ai/determined/56396/workflows/24e85025-e118-4dd6-b095-301aa6e15dee/jobs/2631032/artifacts
https://app.circleci.com/pipelines/github/determined-ai/determined/56392/workflows/04457f70-98f8-407c-835f-e5bca81a481b/jobs/2630703/artifacts
https://app.circleci.com/pipelines/github/determined-ai/determined/56392/workflows/04457f70-98f8-407c-835f-e5bca81a481b/jobs/2630711/artifacts
I recycled some old code I had to do this. there's probably a better way to authenticate now
Checklist
docs/release-notes/
.See Release Note for details.