
feat: download artifacts from current workflow run, improve API usage #9

Closed · wants to merge 6 commits

Conversation

@AlCalzone (Contributor) commented Sep 12, 2022

I've been playing around with this action, which currently seems to be the only sane option for using Turborepo caching on GH Actions. Unfortunately, because my workflow is split into separate phases, I ran into a few limitations, which this PR is supposed to solve:

Artifacts from previous jobs of the same workflow are not found
This is a limitation of GitHub Actions - apparently list and delete only work once the workflow has completed. As a result, caching did not work at all in subsequent jobs. I worked around this by downloading all previously uploaded artifacts from the same workflow before starting the server. When handling a request, these cache files are checked first before falling back to listing artifacts from other workflows.

Reduce API usage
In the middle of testing, I actually hit my GitHub Actions rate limit of 1000 requests/hour. I believe this can be avoided using the following tricks:

  • Check the file system first, then the API.
  • Cache the list request used for serving GET requests. The turbo server is usually short-lived, so it should be relatively unlikely that we miss an artifact this way. And if we do, it's not the end of the world.
  • Re-upload "remote" artifacts as workflow artifacts (sketched below). I'm not 100% sold on this, but in theory we should be able to avoid the list request on subsequent job runs this way, since these artifacts are automatically found in follow-up jobs.
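
Roughly what I have in mind for that third point - a sketch only, not the exact code in this PR (cacheDir and using the turbo hash as the artifact name are assumptions):

import * as path from 'path';
import { create } from '@actions/artifact';

// Sketch: after serving an artifact that came from another workflow run,
// re-upload it as an artifact of the current run, so follow-up jobs find it
// without another list request. `cacheDir` and `hash` are assumed inputs.
export async function reuploadAsWorkflowArtifact(cacheDir: string, hash: string) {
  const client = create();
  const file = path.join(cacheDir, hash);
  // uploadArtifact(name, files, rootDirectory) - @actions/artifact v1 API
  await client.uploadArtifact(hash, [file], cacheDir);
}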

Oh, and I implemented a stub for the statistics endpoint that newer versions of Turborepo are using. Otherwise you get a lot of unnecessary errors in the Turborepo logs.
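
The stub itself is just an acknowledging no-op, roughly like this (a sketch assuming an Express-style server; the actual handler may differ):

import express from 'express';

const app = express();

// Turborepo POSTs cache-usage analytics here; answering with an empty 200
// is enough to keep the turbo -vv logs free of errors.
app.post('/v8/artifacts/events', (_req, res) => {
  res.status(200).json({});
});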

@felixmosh (Owner) commented Sep 12, 2022

Hi @AlCalzone, thank you for this PR.
Can you provide a small example of a GH Actions workflow that represents your setup?
What do you mean by list & delete not being available?

@AlCalzone (Contributor, Author)

Here's a PR that uses the updated workflow - not sure if you can see the logs:
zwave-js/node-zwave-js#5050
Here's the corresponding workflow run: https://github.com/zwave-js/node-zwave-js/actions/runs/3036837290. The runs after this one are failing because I'm cleaning up other stuff.

What do you mean by list & delete not being available?

This comment explains it - essentially, the artifacts generated by the current workflow never show up in the artifact list request this action is using:
actions/upload-artifact#53 (comment)
The @actions/artifact download method uses different paths that are scoped to the current workflow run, so it does find them.

import { create } from '@actions/artifact';

export async function downloadSameWorkflowArtifacts() {
  const client = create();
  // Try to download all artifacts from the current workflow, but do not fail the build if this fails
  // (cacheDir is the module-level cache directory)
  const artifacts = await client.downloadAllArtifacts(cacheDir).catch(() => []);
}
@felixmosh (Owner) commented Sep 12, 2022

Where do you ensure that these artifacts belong to the current workflow?
It looks like you are downloading all the files, which may be hundreds of gigs in my case.

@AlCalzone (Contributor, Author)

That's handled by this method: https://github.com/actions/toolkit/blob/e6257f111756d2f3567917c8e27ab57de8c3e09c/packages/artifact/src/internal/artifact-client.ts#L221
which uses this method to list the artifacts: https://github.com/actions/toolkit/blob/e6257f111756d2f3567917c8e27ab57de8c3e09c/packages/artifact/src/internal/download-http-client.ts#L45
which in turn uses https://github.com/actions/toolkit/blob/e6257f111756d2f3567917c8e27ab57de8c3e09c/packages/artifact/src/internal/utils.ts#L222 as the download URL,
which references the current workflow run via the env variable here:
https://github.com/actions/toolkit/blob/e6257f111756d2f3567917c8e27ab57de8c3e09c/packages/artifact/src/internal/config-variables.ts#L50
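
Paraphrased from the linked files (not verbatim - helpers like getRuntimeUrl() live in the same sources), the chain boils down to the artifact URL being scoped to the current run:

// utils.ts: the list/download URL is built per workflow run
export function getArtifactUrl(): string {
  return `${getRuntimeUrl()}_apis/pipelines/workflows/${getWorkFlowRunId()}/artifacts?api-version=${getApiVersion()}`;
}

// config-variables.ts: the run id comes straight from the job's environment
export function getWorkFlowRunId(): string {
  const workFlowRunId = process.env['GITHUB_RUN_ID'];
  if (!workFlowRunId) {
    throw new Error('Unable to get GITHUB_RUN_ID env variable');
  }
  return workFlowRunId;
}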

It looks like you are downloading all the files, which may be hundreds of gigs in my case.

Good point. An alternative would be using @actions/artifact's internals, as I attempted here:
zwave-js/node-zwave-js@10121ae (#5050)
That way one could filter for files that look like a hash, but that might not be stable across releases of that package.

@AlCalzone (Contributor, Author)

If you don't like this approach: I had first attempted to download the files from the current workflow on demand and fall back to the ones from other workflows if that didn't work. However, that was pretty slow (20 seconds for a handful of small artifacts).

@felixmosh (Owner)

I think that you are misusing Turborepo for sharing assets between jobs.

As I understand it, GH jobs are meant to parallelize your tasks, meaning they must by definition not share anything - that is what allows them to run in parallel.
So how do you expect turbo to reuse artifacts of other jobs from the same workflow? Turborepo should reuse artifacts from previous workflows; that is its whole point.

@AlCalzone (Contributor, Author)

Some of the jobs depend on my TS code being compiled. As the project grows, that takes more and more time, so to avoid compiling the same code six times, I factored the compilation out into a separate job so it only has to be done once. Before migrating to Turborepo over the weekend, I shared the build output with the following jobs by uploading and downloading a single artifact.

Now if I don't share the turbo cache between the build job and the following ones, a cache miss in the build job will automatically be a cache miss in the following jobs that depend on a built project too.

I guess one way to do it would be a mix of the two approaches: Go back to downloading workflow artifacts on demand, but manually share the cache dir with the following jobs in the same workflow.

@felixmosh (Owner) commented Sep 12, 2022

You can use Turborepo for caching the TS build between workflows, and in order to share the TS build result between jobs, you will need to upload & download it like you did.

Hope this makes sense to you.
Anyway, thank you for the PR, and sorry for not approving it, but I don't think it is the responsibility of this action to share artifacts between jobs of the same workflow.

@AlCalzone (Contributor, Author) commented Sep 12, 2022

What about the part where the list request is cached and the local FS is checked for existence of the artifact first? Would you accept this? Oh and the /v8/artifacts/events endpoint?

@felixmosh (Owner)

Can you elaborate on both of the points?
Why would a new workflow contain a Turborepo cache on disk?

@felixmosh (Owner)

I've checked the turborepo code, and it looks like they added /v8/artifacts/events for analytics.
You can make a separate PR with this endpoint, thanks.

Funny that my workflows work without it... :|

@AlCalzone (Contributor, Author)

Regarding the list request:
Let's say your workflow has 20 Turborepo tasks. As it currently is, every GET request to the endpoint /v8/artifacts/:id lists the existing artifacts using the GitHub API. Unless another parallel workflow run finishes and adds an artifact in the meantime, each of these 20 list requests will return the same result - i.e. 19 of the API calls were unnecessary. And even if new artifacts do appear between request 1 and request 20, I think the chance of needing exactly one of those is relatively low.
By caching the result of the GitHub API request on the first GET, we can save some requests.
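
A minimal sketch of what I mean, assuming an authenticated Octokit instance (names are illustrative, not the exact code in this PR):

import * as github from '@actions/github';

const octokit = github.getOctokit(process.env.GITHUB_TOKEN!);

// Cache the promise of the first list call for the lifetime of the server,
// so 20 GET requests cost at most one GitHub API call.
let artifactList: ReturnType<typeof fetchArtifactList> | undefined;

function fetchArtifactList() {
  return octokit.rest.actions.listArtifactsForRepo({
    ...github.context.repo,
    per_page: 100,
  });
}

export function listArtifactsCached() {
  artifactList ??= fetchArtifactList();
  return artifactList;
}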

Regarding the local FS check:
It probably isn't terribly useful if there is only one job in a workflow, but it is useful if:

  1. you repeat a task for some reason and need the same artifact as earlier for it, or
  2. you have multiple jobs and persist the cache between jobs (so I'm going to need this).

In any case, checking whether the file exists isn't expensive and can prevent unnecessary downloads - see the sketch below.
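
A sketch, assuming the turbo hash doubles as the file name in the cache dir (that naming is an assumption):

import { existsSync } from 'fs';
import * as path from 'path';

// Before hitting the GitHub API, check whether an earlier task or a previous
// job already left the artifact for this hash in the local cache directory.
export function findLocalArtifact(cacheDir: string, hash: string): string | undefined {
  const file = path.join(cacheDir, hash);
  return existsSync(file) ? file : undefined;
}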

As for the endpoint:
It doesn't cause failures, but if you run turbo with -vv, every request logs an error with an HTML page in it. I can't find a corresponding log right now, but it's a bit annoying when trying to debug stuff.

@felixmosh (Owner)

Make PRs for both of the points you've mentioned (separately); I will review them, and we will see whether they make the code more complex or not.

@felixmosh closed this Sep 12, 2022
@AlCalzone (Contributor, Author)

Will do 👍️
