
[torch-mlir][sparse] re-enable most sparse tests #3929

Closed
wants to merge 2 commits
Conversation

@aartbik (Contributor) commented Dec 19, 2024

Only the one that needs decomposition is still broken, but that one never ran correctly anyway.

@stellaraccident (Collaborator)

The flakiness of sparse tests has been a persistent drag on my infra team, who are the ones running the current CI. I'm willing to try one more time, but if this needs more intervention in future upgrades, we probably need dedicated bots for it.

@aartbik (Contributor, Author) commented Dec 20, 2024

> The flakiness of sparse tests has been a persistent drag on my infra team, who are the ones running the current CI. I'm willing to try one more time, but if this needs more intervention in future upgrades, we probably need dedicated bots for it.

Like the rest of torch-mlir, the sparsity support needs to adjust to upstream changes from time to time. In that regard, it is no different from the other dependencies on PyTorch. I can only see one instance of a "drag", but perhaps there were others that never made it to the GitHub discussions.

At this point, it is probably a matter of whether you consider sparsity support a first-class feature of torch-mlir or not. Happy to discuss this 1:1 with you if that helps.

@stellaraccident (Collaborator)

> The flakiness of sparse tests has been a persistent drag on my infra team, who are the ones running the current CI. I'm willing to try one more time, but if this needs more intervention in future upgrades, we probably need dedicated bots for it.
>
> Like the rest of torch-mlir, the sparsity support needs to adjust to upstream changes from time to time. In that regard, it is no different from the other dependencies on PyTorch. I can only see one instance of a "drag", but perhaps there were others that never made it to the GitHub discussions.
>
> At this point, it is probably a matter of whether you consider sparsity support a first-class feature of torch-mlir or not. Happy to discuss this 1:1 with you if that helps.

I'll need to write better guidelines for this in the new year, but suffice to say there are many niche things in torch-mlir and the habit of the project is to only enable those by default that have been proven to be highly resilient to change and have staffing that can manage them through upgrades (and the primary features all have an additional, large downstream testing apparatus). I'm open to a great many things having a space to grow, but the default CI only tests the most stable core of the project. We've disabled things from that CI which have many more users and contributors than sparsity for reasons of maintainability. It isn't personal -- the ocean is vast and if I didn't bias towards pruning things that need more maturity (or their own CI), there wouldn't be a project.

I get pinged privately every time my team hits a roadblock. Sometimes they go the extra mile and silently do the work to clear it and sometimes I tell them to just disable and move on, but both instances are work that I keep a tally on.

We can give this a shot one more time, but this style of sparsity is still a relatively young, niche feature. Having moved with pytorch through many phases, my guess is that it will be a bit more time before it truly settles. Again, not personal. Just been here many times and have learned not to take a position on things that need time to mature and find their spot. It can't be rushed, in my experience.

@aartbik (Contributor, Author) commented Dec 20, 2024

Noted. I will not burden you any further.

@aartbik closed this Dec 20, 2024
@aartbik deleted the bik branch December 20, 2024 00:25
@stellaraccident (Collaborator) commented Dec 20, 2024

I think you might be taking this differently than I mean, so let me attempt to restate and include an analogy:

It took a concentrated effort from many of us across a few companies, and almost a year, to bring up the FX infra in torch-mlir and make it the default path (and disable the old one). As part of that, we reworked the CI, upgraded all of the backends, and ensured that the path worked for everyone. The result is that many people know how to maintain it. Also, the PyTorch APIs that it rides on have stabilized, allowing it to be supported. During that time, the priority remained being able to keep moving forward with the tried and true infra. Things worth doing take time, and that is fine -- but there is more to this kind of project than the code: there is the shared understanding of how to keep it running.

Afaict, the sparse tensor work is a single-engineer effort (you), and it is built on PyTorch APIs that, if they are stable, are newly so... newer than what we usually commit to support. That is a precarious position for the project and for you. That is fine. Many things start from humble beginnings, but the timeframe and maturity need to be considered.

Now, my instincts were telling me when you showed up that this was still too young to be supported properly, but I overrode those instincts and decided to give it a go. But what we found is that indeed, the underlying infra requires maintenance, and no one except you both uses and understands it. We need a broader base to build on than that.

With that in mind, I'm trying to give you an option to proceed and cross the official support barrier when the time is right. Perhaps a separate bot is too far. Another acceptable option would be a lit flag to enable tests for the experimental sparse feature. Since lit tests are cheap, we could even run them in a separate step for advisory use until they are ready. What I'd rather not do is have a bunch of tests that are always in a state of being half commented out that we need to coordinate on. Just carve out an area with a flag and iterate there until ready. Then in the future let's have a look and see if we can support enabling by default. There would be other things we would want in order to cross that bridge too, such as end-to-end tests, etc.
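
For concreteness, a minimal sketch of the kind of lit gating suggested above. The `enable_sparse` parameter and `sparse` feature names are hypothetical, not existing torch-mlir flags; `lit_config.params.get`, `config.available_features.add`, and the `REQUIRES:` test directive are standard LLVM lit mechanisms.

```python
# Sketch for the test suite's lit.cfg.py. `config` and `lit_config` are
# globals that lit injects when it executes this file; the parameter and
# feature names below are hypothetical.

# Off by default; an advisory CI step (or a developer) opts in by passing
# `-D enable_sparse=1` to lit, e.g. via LIT_OPTS with a check target.
if lit_config.params.get("enable_sparse", "0") != "0":
    config.available_features.add("sparse")
```

Each experimental sparse test would then carry a `# REQUIRES: sparse` line, so it is skipped in the default run and only executes in the opt-in advisory step.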

We're in the process of working out governance for the tensor compiler infra overall upstream, and my hope is that we can define it in a way that also gives better written guidelines for this part of the project -- and because tensor compilers are like 70% testing and CI, some significant guidelines need to be laid out for maturity and stability.

In the meantime, you've got me and my team trying to keep the infra and the version trains moving forward. It is a struggle, and apologies if the judgements are a bit capricious. But we have more infra than we have investment to support, and I'm afraid this puts me in an uncomfortable position of needing to moderate the ambitions of things that stress our ability to keep the trains running.
