-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub #182
Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub #182
Conversation
79cb314
to
cd40641
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for taking the time to write this up 🙂
``` | ||
--- | ||
version: 1 | ||
config-from: https://github.com/taskcluster/taskgraph/blob/main/data/taskcluster-yml-github.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a page out of GitHub's book which I like over the full url here would be a simpler format like:
USER_OR_ORG_NAME/REPO_NAME/.github/workflows/REUSABLE_WORKFLOW_FILE.yml@TAG_OR_BRANCH
I could see the full https url here useful if the user wants to use GitLab or Mercurial, for example. Maybe this config-from
could be "smart" and visit the url, if provided, otherwise just use the GitHub shorthand method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally fine with using the format you suggest. Someone (Pete?) made the point before that this is a GitHub specific service, so using GitHub-specific references ought to be fine.
I mostly used full https
links here to start with the most flexible thing, and have us pare down to something else if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've previously tried to unify behavior across GH and Mercurial. Maybe it makes sense to put github.com
somewhere in the string, even if the result is not a URL that can be fetched? This is what we've done for scope names, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to pin specific version of the file? If that file is to be updated and we rerun event, then we could have completely different graphs.
This is probably a feature, not a bug, but maybe it would be good to somehow include this information in the decision task which version of the file was used. Something like attaching a hash of it as a metadata. But then it would probably fail on rerun if this hash changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've previously tried to unify behavior across GH and Mercurial. Maybe it makes sense to put
github.com
somewhere in the string, even if the result is not a URL that can be fetched? This is what we've done for scope names, for example.
Can you clarify what the "unify behavior" part means? (I'm not exactly sure how it relates to including github.com
in the string.)
In any case, It seems sensible to me to use the same format as we use elsewhere, eg: for scopes, unless we have a good reason not to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to pin specific version of the file? If that file is to be updated and we rerun event, then we could have completely different graphs. This is probably a feature, not a bug
Whether or not pinning is desirable is probably going to vary by project, and to be honest, I'm not even sure where we might land for Firefox CI projects. Obviously pinning gives you more reproducibility, but you lose some of the benefits of centralization if you have to manually go around and bump pins everywhere.
Perhaps it's best to allow for both pinned and unpinned, at least for now, until we have some better insight into how things work in practice?
but maybe it would be good to somehow include this information in the decision task which version of the file was used. Something like attaching a hash of it as a metadata.
I agree that we ought to include some information about what was used. metadata
seems like a reason place for it. It might be nice to include revision as well, to make tracing things back easier. Another idea, probably not viable, would be to somehow pre-create an artifact for all tasks that end up being created, and attach this information there. If this were somehow possible, we could probably include the entire .taskcluster.yml
, since we would no longer be limited to something reasonably sized.
But then it would probably fail on rerun if this hash changes?
Hmmm, I'm not sure I fully understand what you're getting at. If I understand correctly, Taskcluster-GitHub would run once, create task definitions with this extra metadata, and then the tasks would eventually run. If they're rerun, the metadata would stay the same, even if the .taskcluster.yml
had changed, since Taskcluster-GitHub is not involved with the reruns at all. (I think the only time we would get new metadata would be when we get new events, which would cause new task ids.)
Are you saying that the workers or queue or something else would be verifying the new information in the metadata
? (I can certainly see ways reruns could fail in that case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's best to allow for both pinned and unpinned, at least for now, until we have some better insight into how things work in practice?
With either the URL or GH shorthand, we could reference a branch for unpinned and revision or tag for pinned. So we get both for free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously pinning gives you more reproducibility, but you lose some of the benefits of centralization if you have to manually go around and bump pins everywhere.
Agree with this and what Andrew says below.
My thinking was if someone would be using "unofficial" source for the config-from
then it would be a security concern if those tasks are not doing something unwanted.
metadata seems like a reason place for it
Yes, this is what I was thinking of, too. We should include few entries in metadatda by default (provided by github service at the time of parsing those configs) which files were used and probably their hashes (but maybe that's too much)
if they're rerun, the metadata would stay the same, even if the .taskcluster.yml had changed, since Taskcluster-GitHub is not involved with the reruns at all.
For the individual tasks yes. I was thinking about the broader case, when you have a long living PR, and each push event would spawn a new decision task, so possibly a different set of tasks even. But this is rather an edgecase, and in this particular case it would be nicer to reference those config-from
by revision. Not a problem I guess.
Are you saying that the workers or queue or something else would be verifying the new information in the metadata?
no, probably doesn't make sense unless we would somehow need this for CoT? Just ignore for now :) Metadata is for humans to see what was the origin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you folks feel about the latest version of this? Both with regard to the new proposed format for config-from
, as well as the extra context
?
3) Render the fetched `.taskcluster.yml` with the combined context | ||
4) Proceed as usual, creating any Tasks resulting from the rendered `.taskcluster.yml` | ||
|
||
Note: Only a `.taskcluster.yml` defined directly in a project repository will have its `config-from` processed. There is no need to support recursion into multiple levels of references, and it is more likely to result in confusion than add any value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 Good callout.
cd40641
to
821b4bf
Compare
|
||
# Details | ||
|
||
Two new variables will be introduced to `.taskcluster.yml` files. The first, `config-from`, will allow an URL to be provided that points at a full-fledged `.taskcluster.yml` hosted elsewhere. The second, `context`, will allow for variables to be defined that will be included with the JSON-e context when the referenced `.taskcluster.yml` is rendered. Here is an example of a possible concrete `.taskcluster.yml`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the base also has config-from
? Will nested references be allowed? If so, is it worth checking for circular references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a note further down that nested references will explicitly not be allowed. (It doesn't mean we can never support them...but I don't see a reason to add this complexity until we know there's a good reason to support it.)
``` | ||
--- | ||
version: 1 | ||
config-from: https://github.com/taskcluster/taskgraph/blob/main/data/taskcluster-yml-github.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's best to allow for both pinned and unpinned, at least for now, until we have some better insight into how things work in practice?
With either the URL or GH shorthand, we could reference a branch for unpinned and revision or tag for pinned. So we get both for free.
- secrets:get:project/mozillavpn/* | ||
``` | ||
|
||
When `config-from` is present in the `.taskcluster.yml`, existing top level keys such as `policy` and `tasks` are not valid. `context` is optional, and only permitted when `config-from` is present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that policy
or top-level keys other than tasks
should be disallowed.. Then we'd need to maintain multiple versions of identical base configs, e.g one for each PR policy. That seems like a step in the wrong direction.
Instead I think we should merge the top-level keys other than tasks
. Disallowing tasks
makes sense to me for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, thanks for raising it. I'll update the RFC to reflect this.
|
||
When Taskcluster-GitHub encounters a `.taskcluster.yml` such as the one above when processing an event for a repository, instead of directly rendering it with the context from the GitHub event, it will instead: | ||
1) Fetch the `.taskcluster.yml` referenced in `config-from` | ||
2) Combine the GitHub event context with the context provided in the repository's `.taskcluster.yml` (with GitHub event context taking priority) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be a deep merge? Or just wholesale replacement of top-level keys. If the latter, GH events having priority makes sense.
But if we're deep merging, project context taking priority could allow for some powerful overriding capabilities of the base config.. Though it would be pretty hacky and hard to read, so I'm not advocating for this. Just pointing out the possibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deep merge is what I was intending (...I should make that explicit).
I can certainly see that allow for overriding from the project context would allow for more possibilities...but I find it difficult to imagine a scenario where it wouldn't cause confusion or perhaps even security issues.
For example, if project context takes priority I think we could override things such as tasks_for
or event.pusher.email
, which I cannot imagine would be desirable.
(Maybe I'm misunderstanding what you're getting at though....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I was rewriting this section for clarity, I couldn't come up with a compelling reason why deep merging would be useful, so my upcoming change declare it as shallow. (I can tweak it still, but it's a starting point for further discussion.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahal - any thoughts on the updates here?
|
||
`.taskcluster.yml` files in GitHub repositories are responsible for creating tasks (or not) in response to various GitHub events. In some cases (eg: when a decision task is used to delegate most of this to a separate tool), these files end up looking almost entirely identical, and hundreds of lines of JSON-e ends up duplicated across many different repositories. This duplication is a significant maintenance burden, and easily results in unwanted differences. | ||
|
||
Allowing a `.taskcluster.yml` to reference a file stored elsewhere will allow for deduplication of the bulk of the content in these files, resolving the concerns noted above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we leave a note here saying that those external .taskcluster.yml
should be a valid version:1
tc yamls that can be validated with the same schema? Or those embedded files would be free-form yamls that only produce tasks:
?
I think we could add a line here about such validity and also when implementation will go, github service would fail if embedded config has invalid schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use the same schema for both embedded and referenced .taskcluster.yml
files. (There's ways you can implement mutual exclusiveness and such in jsonschema.)
However, the only way we can deal with refusing recursion is to make that invalid in referenced files outside of the schema. That might be slightly less than ideal, but I think it's preferable to duplicating schemas? I don't feel terribly strongly either way though.
(I have an upcoming change that will spell out the situation for the configs more clearly...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lotas - I pushed a change that makes the schema more clear - any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the updates @bhearsum. I feel like we have already something implementable here :)
…ing to revisions, tags, or branches
…on information when rendering.
821b4bf
to
e4519fd
Compare
I've pushed a few changes that I think address all of the comments above. Here's a quick summary:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Ben!
rendered