Task has too many dependencies #653

Closed
Tracked by #490 ...
eu9ene opened this issue May 30, 2024 · 5 comments · Fixed by #858

Comments

@eu9ene
Collaborator

eu9ene commented May 30, 2024

https://firefox-ci-tc.services.mozilla.com/tasks/ZqlokLMTQG-pZPJtK9UnOw/runs/0/logs/public/logs/live.log

Exception: task merge-corpus/merge-corpus-da-en has too many dependencies (105 > 99)

@eu9ene eu9ene added bug Something is broken or not correct taskcluster Issues related to the Taskcluster implementation of the training pipeline labels May 30, 2024
@bhearsum
Collaborator

The only workaround we have for this is adding tasks between merge-corpus and its upstreams to avoid hitting this limit. We've done this before for the all tasks, but this will be a little bit different because we need to pull artifacts from the upstream. It's tractable though.
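As a rough illustration of the workaround described above, the following sketch splits a task's upstreams into groups that each fit under the limit and hangs each group off an intermediate task. The function names and the `-dummy-N` labels are hypothetical, and the graph is a plain dict rather than anything taskgraph produces; only the 99-dependency limit comes from the error message.

```python
from typing import Dict, List

# Limit taken from the exception quoted in this issue (105 > 99).
MAX_DEPENDENCIES = 99


def chunk_upstreams(upstreams: List[str], limit: int = MAX_DEPENDENCIES) -> List[List[str]]:
    """Split upstream labels into consecutive groups of at most `limit`."""
    return [upstreams[i : i + limit] for i in range(0, len(upstreams), limit)]


def add_intermediate_tasks(
    task_label: str, upstreams: List[str], limit: int = MAX_DEPENDENCIES
) -> Dict[str, List[str]]:
    """Return a mapping of label -> dependencies, inserting intermediate
    tasks only when the task would otherwise exceed the limit."""
    if len(upstreams) <= limit:
        return {task_label: list(upstreams)}

    graph: Dict[str, List[str]] = {}
    intermediates: List[str] = []
    for n, group in enumerate(chunk_upstreams(upstreams, limit)):
        label = f"{task_label}-dummy-{n}"  # hypothetical naming scheme
        graph[label] = group
        intermediates.append(label)

    # The original task now depends only on the intermediates.
    graph[task_label] = intermediates
    return graph


if __name__ == "__main__":
    upstreams = [f"upstream-{i}" for i in range(105)]
    graph = add_intermediate_tasks("merge-corpus-da-en", upstreams)
    for label, deps in graph.items():
        print(label, len(deps))
```

This single level of intermediates only works while the number of groups itself stays under the limit; beyond that, the same step would have to be applied again to the intermediates.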

@bhearsum
Collaborator

> We've done this before for the all tasks, but this will be a little bit different because we need to pull artifacts from the upstream. It's tractable though.

To this point, I'm just going to republish artifacts in the dummy tasks. I looked at the bicleaner tasks on one of the large recent training runs, and the artifacts totaled ~25GB at rest. That costs ~$0.40/month to store in GCP, so even if we had 100 runs of that size in a year we're looking at ~$500/year to store them. We can revisit this decision at some point, but I don't think it's worth fussing with an alternate solution for finding artifacts of an indirect upstream at this time.
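The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 25GB, $0.40/month and 100-run figures come from the comment, while the assumption that every run's artifacts are kept for a full 12 months is mine and makes the result an upper bound.

```python
# Figures quoted in the comment above.
ARTIFACT_SIZE_GB = 25
COST_PER_RUN_PER_MONTH = 0.40  # USD, for ~25GB at rest
RUNS_PER_YEAR = 100

# Assumption (not from the thread): every run is retained for 12 months.
MONTHS_RETAINED = 12


def yearly_storage_cost(runs: int, cost_per_run_per_month: float, months: int) -> float:
    """Worst-case yearly cost if all runs are stored for `months` months."""
    return runs * cost_per_run_per_month * months


total = yearly_storage_cost(RUNS_PER_YEAR, COST_PER_RUN_PER_MONTH, MONTHS_RETAINED)
# Per-GB monthly rate implied by the quoted numbers.
implied_rate = COST_PER_RUN_PER_MONTH / ARTIFACT_SIZE_GB

print(f"~${total:.0f}/year at an implied ${implied_rate:.3f}/GB/month")
```

Under these assumptions the total lands slightly below the ~$500/year mentioned in the comment, which is consistent with it being a rounded figure.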

bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue Jun 18, 2024
The fix for mozilla#653 is going to involve dummy tasks that copy artifacts from their upstreams. To support this, we need our little helper transform to do a few things:
* Allow for one fewer upstream dependency, to make room for a `docker-image` task upstream
* Be able to add `fetches` entries for the tasks upstream of it
* Store any fetched artifact names in `attributes`, to allow tasks _downstream_ of a dummy to easily find all artifacts that the dummy will republish

I consider the last part fairly hacky, but couldn't come up with anything better.
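A minimal sketch of the three requirements listed in this commit message might look like the following. It is not the actual helper transform: the dict layout, the `republished-artifacts` attribute name, the `docker-image` label and the artifact names are all hypothetical, chosen only to show one slot being reserved, `fetches` being recorded per upstream, and the artifact names being stored in `attributes`.

```python
from typing import Dict, List

MAX_DEPENDENCIES = 99
# One slot is reserved for the docker-image upstream.
MAX_UPSTREAMS = MAX_DEPENDENCIES - 1


def make_dummy_task(label: str, upstream_artifacts: Dict[str, List[str]]) -> dict:
    """Build an illustrative dummy task that fetches and republishes the
    artifacts of its upstreams. The structure is a plain dict, not the
    schema used by taskgraph."""
    if len(upstream_artifacts) > MAX_UPSTREAMS:
        raise ValueError(
            f"{label} has too many upstreams ({len(upstream_artifacts)} > {MAX_UPSTREAMS})"
        )

    dependencies = {upstream: upstream for upstream in upstream_artifacts}
    dependencies["docker-image"] = "docker-image"  # hypothetical label

    # One fetches entry per upstream, listing the artifacts to pull.
    fetches = {up: list(arts) for up, arts in upstream_artifacts.items()}

    # Flattened list so tasks downstream of the dummy can find everything
    # it republishes without inspecting the indirect upstreams.
    republished = sorted(a for arts in upstream_artifacts.values() for a in arts)

    return {
        "label": label,
        "dependencies": dependencies,
        "fetches": fetches,
        "attributes": {"republished-artifacts": republished},
    }


if __name__ == "__main__":
    task = make_dummy_task(
        "merge-corpus-dummy-0",
        {"bicleaner-a": ["a.da.zst", "a.en.zst"], "bicleaner-b": ["b.da.zst"]},
    )
    print(task["attributes"]["republished-artifacts"])
```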
bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue Jun 19, 2024
bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue Jun 19, 2024
@bhearsum bhearsum self-assigned this Jun 19, 2024
@bhearsum
Collaborator

bhearsum commented Jul 2, 2024

Another solution for this could be to chunk dataset tasks together. If we managed to chunk them by size we could avoid increasing the end-to-end runtime as well. This might not be great for caching purposes, but I wanted to mention it here for completeness.
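One way to chunk by size, sketched here under the assumption that dataset size is a reasonable proxy for task runtime, is a greedy "largest first" assignment: sort datasets from largest to smallest and always add the next one to the chunk with the smallest running total. The dataset names and sizes below are made up for illustration.

```python
import heapq
from typing import Dict, List


def chunk_by_size(sizes: Dict[str, int], num_chunks: int) -> List[List[str]]:
    """Greedily distribute datasets across `num_chunks` chunks so that the
    total size of each chunk is roughly balanced."""
    # Heap of (current total size, chunk index); the smallest total pops first.
    heap = [(0, i) for i in range(num_chunks)]
    heapq.heapify(heap)
    chunks: List[List[str]] = [[] for _ in range(num_chunks)]

    # Place the largest datasets first so the small ones can even things out.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (total + size, idx))

    return chunks


if __name__ == "__main__":
    # Hypothetical dataset sizes in arbitrary units.
    sizes = {"dataset-a": 10, "dataset-b": 7, "dataset-c": 3, "dataset-d": 2, "dataset-e": 2}
    for chunk in chunk_by_size(sizes, 2):
        print(chunk, sum(sizes[name] for name in chunk))
```

Because the chunk membership changes whenever the dataset list changes, a scheme like this would also invalidate cached results more often than one task per dataset, which matches the caching concern raised above.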

bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue Jul 8, 2024
@bhearsum
Collaborator

It turns out that the current limit is likely not a hard limit these days. We're working on removing or greatly increasing this limit in Taskcluster in taskcluster/taskcluster#7151.

@bhearsum
Collaborator

The Firefox CI cluster now supports up to 10,000 dependencies 🥳. We'll still need a taskgraph change to allow for it.

bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue Sep 23, 2024
Most notably, this includes a change that allows us to have up to 10,000 dependencies per task, which will fix mozilla#653.