Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub #182

Merged

Conversation

bhearsum
Copy link
Contributor

@bhearsum bhearsum commented Mar 2, 2023

@bhearsum bhearsum force-pushed the taskcluster-yml-remote-references branch 2 times, most recently from 79cb314 to cd40641 Compare March 2, 2023 20:56
Copy link
Contributor

@matt-boris matt-boris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for taking the time to write this up 🙂

```
---
version: 1
config-from: https://github.com/taskcluster/taskgraph/blob/main/data/taskcluster-yml-github.yml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a page out of GitHub's book which I like over the full url here would be a simpler format like:
USER_OR_ORG_NAME/REPO_NAME/.github/workflows/REUSABLE_WORKFLOW_FILE.yml@TAG_OR_BRANCH

I could see the full https url here useful if the user wants to use GitLab or Mercurial, for example. Maybe this config-from could be "smart" and visit the url, if provided, otherwise just use the GitHub shorthand method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally fine with using the format you suggest. Someone (Pete?) made the point before that this is a GitHub specific service, so using GitHub-specific references ought to be fine.

I mostly used full https links here to start with the most flexible thing, and have us pare down to something else if desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've previously tried to unify behavior across GH and Mercurial. Maybe it makes sense to put github.com somewhere in the string, even if the result is not a URL that can be fetched? This is what we've done for scope names, for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to pin specific version of the file? If that file is to be updated and we rerun event, then we could have completely different graphs.
This is probably a feature, not a bug, but maybe it would be good to somehow include this information in the decision task which version of the file was used. Something like attaching a hash of it as a metadata. But then it would probably fail on rerun if this hash changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've previously tried to unify behavior across GH and Mercurial. Maybe it makes sense to put github.com somewhere in the string, even if the result is not a URL that can be fetched? This is what we've done for scope names, for example.

Can you clarify what the "unify behavior" part means? (I'm not exactly sure how it relates to including github.com in the string.)

In any case, It seems sensible to me to use the same format as we use elsewhere, eg: for scopes, unless we have a good reason not to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to pin specific version of the file? If that file is to be updated and we rerun event, then we could have completely different graphs. This is probably a feature, not a bug

Whether or not pinning is desirable is probably going to vary by project, and to be honest, I'm not even sure where we might land for Firefox CI projects. Obviously pinning gives you more reproducibility, but you lose some of the benefits of centralization if you have to manually go around and bump pins everywhere.

Perhaps it's best to allow for both pinned and unpinned, at least for now, until we have some better insight into how things work in practice?

but maybe it would be good to somehow include this information in the decision task which version of the file was used. Something like attaching a hash of it as a metadata.

I agree that we ought to include some information about what was used. metadata seems like a reason place for it. It might be nice to include revision as well, to make tracing things back easier. Another idea, probably not viable, would be to somehow pre-create an artifact for all tasks that end up being created, and attach this information there. If this were somehow possible, we could probably include the entire .taskcluster.yml, since we would no longer be limited to something reasonably sized.

But then it would probably fail on rerun if this hash changes?

Hmmm, I'm not sure I fully understand what you're getting at. If I understand correctly, Taskcluster-GitHub would run once, create task definitions with this extra metadata, and then the tasks would eventually run. If they're rerun, the metadata would stay the same, even if the .taskcluster.yml had changed, since Taskcluster-GitHub is not involved with the reruns at all. (I think the only time we would get new metadata would be when we get new events, which would cause new task ids.)

Are you saying that the workers or queue or something else would be verifying the new information in the metadata? (I can certainly see ways reruns could fail in that case.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's best to allow for both pinned and unpinned, at least for now, until we have some better insight into how things work in practice?

With either the URL or GH shorthand, we could reference a branch for unpinned and revision or tag for pinned. So we get both for free.

Copy link
Contributor

@lotas lotas Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously pinning gives you more reproducibility, but you lose some of the benefits of centralization if you have to manually go around and bump pins everywhere.

Agree with this and what Andrew says below.
My thinking was if someone would be using "unofficial" source for the config-from then it would be a security concern if those tasks are not doing something unwanted.

metadata seems like a reason place for it

Yes, this is what I was thinking of, too. We should include few entries in metadatda by default (provided by github service at the time of parsing those configs) which files were used and probably their hashes (but maybe that's too much)

if they're rerun, the metadata would stay the same, even if the .taskcluster.yml had changed, since Taskcluster-GitHub is not involved with the reruns at all.

For the individual tasks yes. I was thinking about the broader case, when you have a long living PR, and each push event would spawn a new decision task, so possibly a different set of tasks even. But this is rather an edgecase, and in this particular case it would be nicer to reference those config-from by revision. Not a problem I guess.

Are you saying that the workers or queue or something else would be verifying the new information in the metadata?

no, probably doesn't make sense unless we would somehow need this for CoT? Just ignore for now :) Metadata is for humans to see what was the origin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you folks feel about the latest version of this? Both with regard to the new proposed format for config-from, as well as the extra context?

3) Render the fetched `.taskcluster.yml` with the combined context
4) Proceed as usual, creating any Tasks resulting from the rendered `.taskcluster.yml`

Note: Only a `.taskcluster.yml` defined directly in a project repository will have its `config-from` processed. There is no need to support recursion into multiple levels of references, and it is more likely to result in confusion than add any value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 Good callout.

@bhearsum bhearsum force-pushed the taskcluster-yml-remote-references branch from cd40641 to 821b4bf Compare March 3, 2023 17:18

# Details

Two new variables will be introduced to `.taskcluster.yml` files. The first, `config-from`, will allow an URL to be provided that points at a full-fledged `.taskcluster.yml` hosted elsewhere. The second, `context`, will allow for variables to be defined that will be included with the JSON-e context when the referenced `.taskcluster.yml` is rendered. Here is an example of a possible concrete `.taskcluster.yml`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the base also has config-from? Will nested references be allowed? If so, is it worth checking for circular references?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a note further down that nested references will explicitly not be allowed. (It doesn't mean we can never support them...but I don't see a reason to add this complexity until we know there's a good reason to support it.)

```
---
version: 1
config-from: https://github.com/taskcluster/taskgraph/blob/main/data/taskcluster-yml-github.yml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's best to allow for both pinned and unpinned, at least for now, until we have some better insight into how things work in practice?

With either the URL or GH shorthand, we could reference a branch for unpinned and revision or tag for pinned. So we get both for free.

- secrets:get:project/mozillavpn/*
```

When `config-from` is present in the `.taskcluster.yml`, existing top level keys such as `policy` and `tasks` are not valid. `context` is optional, and only permitted when `config-from` is present.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that policy or top-level keys other than tasks should be disallowed.. Then we'd need to maintain multiple versions of identical base configs, e.g one for each PR policy. That seems like a step in the wrong direction.

Instead I think we should merge the top-level keys other than tasks. Disallowing tasks makes sense to me for simplicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, thanks for raising it. I'll update the RFC to reflect this.


When Taskcluster-GitHub encounters a `.taskcluster.yml` such as the one above when processing an event for a repository, instead of directly rendering it with the context from the GitHub event, it will instead:
1) Fetch the `.taskcluster.yml` referenced in `config-from`
2) Combine the GitHub event context with the context provided in the repository's `.taskcluster.yml` (with GitHub event context taking priority)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be a deep merge? Or just wholesale replacement of top-level keys. If the latter, GH events having priority makes sense.

But if we're deep merging, project context taking priority could allow for some powerful overriding capabilities of the base config.. Though it would be pretty hacky and hard to read, so I'm not advocating for this. Just pointing out the possibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep merge is what I was intending (...I should make that explicit).

I can certainly see that allow for overriding from the project context would allow for more possibilities...but I find it difficult to imagine a scenario where it wouldn't cause confusion or perhaps even security issues.

For example, if project context takes priority I think we could override things such as tasks_for or event.pusher.email, which I cannot imagine would be desirable.

(Maybe I'm misunderstanding what you're getting at though....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I was rewriting this section for clarity, I couldn't come up with a compelling reason why deep merging would be useful, so my upcoming change declare it as shallow. (I can tweak it still, but it's a starting point for further discussion.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahal - any thoughts on the updates here?


`.taskcluster.yml` files in GitHub repositories are responsible for creating tasks (or not) in response to various GitHub events. In some cases (eg: when a decision task is used to delegate most of this to a separate tool), these files end up looking almost entirely identical, and hundreds of lines of JSON-e ends up duplicated across many different repositories. This duplication is a significant maintenance burden, and easily results in unwanted differences.

Allowing a `.taskcluster.yml` to reference a file stored elsewhere will allow for deduplication of the bulk of the content in these files, resolving the concerns noted above.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we leave a note here saying that those external .taskcluster.yml should be a valid version:1 tc yamls that can be validated with the same schema? Or those embedded files would be free-form yamls that only produce tasks:?

I think we could add a line here about such validity and also when implementation will go, github service would fail if embedded config has invalid schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use the same schema for both embedded and referenced .taskcluster.yml files. (There's ways you can implement mutual exclusiveness and such in jsonschema.)

However, the only way we can deal with refusing recursion is to make that invalid in referenced files outside of the schema. That might be slightly less than ideal, but I think it's preferable to duplicating schemas? I don't feel terribly strongly either way though.

(I have an upcoming change that will spell out the situation for the configs more clearly...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotas - I pushed a change that makes the schema more clear - any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the updates @bhearsum. I feel like we have already something implementable here :)

lotas
lotas previously approved these changes Mar 10, 2023
@bhearsum bhearsum force-pushed the taskcluster-yml-remote-references branch from 821b4bf to e4519fd Compare March 10, 2023 14:54
@bhearsum
Copy link
Contributor Author

I've pushed a few changes that I think address all of the comments above. Here's a quick summary:

  • Adjust config-from format to be GitHub-specific, and explicitly allow tag, branch, and revision references
  • Add additional context that will make .taskcluster.yml files aware of the repository and revision of the config-from file at runtime. (This will allow them to, eg: include this information as additional metadata.)
  • Clarifications around merging and precedence of context from various sources
  • Clarifications around what is supported in project repo .taskcluster.yml files vs. config-from .taskcluster.yml files.

Copy link
Contributor

@lotas lotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Nice!

Copy link
Member

@petemoore petemoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Ben!

@matt-boris matt-boris marked this pull request as ready for review March 28, 2023 16:08
@matt-boris matt-boris requested a review from a team as a code owner March 28, 2023 16:08
@matt-boris matt-boris requested review from matt-boris and removed request for a team March 28, 2023 16:08
@matt-boris matt-boris merged commit 308f866 into taskcluster:main Mar 28, 2023
bhearsum added a commit to bhearsum/taskcluster-rfcs that referenced this pull request Mar 29, 2023
matt-boris pushed a commit that referenced this pull request Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants