merge plot ids across dvc.yaml files #9898
Co-authored-by: David de la Iglesia Castro <[email protected]>
This goes against how everything works in Also, a breaking change as you said, and makes it more confusing for users if they try to share some configs across dvc.yaml files and expect them to be unique. I don't have a strong opinion on disallowing the use of the same IDs/names across multiple dvc.yaml files, whether by enforcing it or by convention. This would make it easier to migrate plots with the same IDs across different But here, it feels like the correct fix should be to always use the definitions from workspace/HEAD instead of merging them (could not find the issue to link to).
This PR doesn't touch how we merge plot definitions or how we handle changes across revisions. It is only about cases where the same plot ID is found in different @skshetry I agree with your overall point that merging two plots with the same ID in different Examples (need to make sure we test all of these):
In the last example, should |
Need to test how it looks in VS Code and Studio. Edit: VS Code looks good.
tests/unit/render/test_match.py
Not a blocker: I prefer higher-level tests, like setting up dvc.yaml definitions and asserting on plots.show output. At least for me, these ones feel harder to follow/read and are kind of brittle because they are tied to the internal structures.
Added an integration test for the dvclive 2.x->3.x transition. I don't want to get into testing every combination of functionality in the high-level tests.
I manually tested the scenario of a repo with commits defining plots in dvclive/dvc.yaml and then using the root dvc.yaml (DVCLive 2.0 -> DVCLive 3.0).
Works as expected when running dvc plots show from both revisions and also dvc plots diff dvclive2 dvclive3. VS Code also works as expected.
I have not tested other scenarios.
After some debugging with @mvshmakov, I'm so far unable to get the local Studio dev UI working, so I might need someone who already has this setup to test these changes to unblock this.
I will test it
Unfortunately, the change doesn't have an effect on Studio. It appears that Studio doesn't use the code modified here. I am checking the Studio backend parsing code to see how this change could be applied
Here is the code: @dberenbaum Should I try to patch it there? In the longer term, it would be good to understand why we kind of have this logic duplicated there
Ok, this is not a 5-minute change 😅 Studio parses and stores plots separately for each commit. cc @shcheklein
AFAIR Studio does this in part so that it doesn't change depending on the revisions selected. I think we should update the plot IDs on the backend. The logic in this PR is independent for each revision, so it shouldn't be hard to incorporate.
💯 We are duplicating a lot of plots logic. As a small starting point, I could extract the logic in this PR to a utils function so it can be reused in the Studio code linked above.
Done. Now it should be possible to replace https://github.com/iterative/studio/blob/5b60ed17ecea77aec9c83cc58b5edf6616dfc4e1/backend/repos/parsing/dvcmeat.py#L192-L195 with: for name, (plot_inner_id, plot_properties) in group_definitions_by_id(definitions).items(): We would still need to iterate over |
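For readers following along, here is a minimal sketch of a helper that returns the shape consumed by the loop above, `{name: (plot_inner_id, plot_properties)}`. It is not the util added in this PR: the function name `group_by_plot_id_sketch`, the `::` separator, and the rule of prefixing the config path only when the same ID appears in more than one dvc.yaml are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Tuple

# Assumed input shape: {config_file: {"data": {plot_id: plot_properties}}}
Definitions = Dict[str, Dict[str, Any]]


def group_by_plot_id_sketch(
    definitions: Definitions,
) -> Dict[str, Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: map a display name to (inner plot ID, properties)."""
    # First pass: record which config files define each plot ID.
    by_id: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(dict)
    for config_file, content in definitions.items():
        for plot_id, props in content.get("data", {}).items():
            by_id[plot_id][config_file] = props

    # Second pass: keep the bare ID when it is unique; otherwise prefix it
    # with the config file path (assumed disambiguation rule).
    grouped: Dict[str, Tuple[str, Dict[str, Any]]] = {}
    for plot_id, configs in by_id.items():
        if len(configs) == 1:
            (props,) = configs.values()
            grouped[plot_id] = (plot_id, props)
        else:
            for config_file, props in configs.items():
                grouped[f"{config_file}::{plot_id}"] = (plot_id, props)
    return grouped


definitions = {
    "dvc.yaml": {"data": {"Accuracy": {"x": "step"}, "Loss": {}}},
    "dvclive/dvc.yaml": {"data": {"Accuracy": {"x": "epoch"}}},
}
grouped = group_by_plot_id_sketch(definitions)
for name, (plot_inner_id, plot_properties) in grouped.items():
    print(name, plot_inner_id, plot_properties)
```

Returning the inner ID alongside the properties lets a caller keep the original ID for lookups while displaying the disambiguated name.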
Could we also collect and return errors in the utils function here?: for plot_id, plot_definition in config_file_content.get("data", {}).items(): I am also curious as I can't find a test in Studio where the code passes through this |
From looking at the test, it seems something like this is expected: dvc/tests/unit/render/test_match.py Lines 61 to 64 in 33b6c1d
I can't reproduce errors at this level, but I don't know how to prove or test that it's no longer needed. If the file is not found, it will be raised under
{
    'definitions': {
        'data': {
            'dvclive/dvc.yaml': {
                'data': {
                    'dvclive/training/plots/metrics/train/acc.tsv': {}
                }
            }
        }
    },
    'sources': {
        'data': {
            'dvclive/training/plots/metrics/train/acc.tsv': {
                'props': {},
                'error': FileNotFoundError(2, 'No such file or directory')
            }
        }
    }
}
If there's an error with the plots definition, it will not be caught during collection and will be logged here: Lines 114 to 120 in 33b6c1d
TL;DR: I think we should assume these errors no longer exist and revisit if needed.
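As a rough illustration of where the file-level errors live in the structure shown above, the sketch below walks `sources` and pulls out any entry carrying an `error` key. The helper name `collect_source_errors` is hypothetical, and the dictionary layout is assumed to match the example in this comment rather than any guaranteed DVC output format.

```python
from typing import Any, Dict


def collect_source_errors(result: Dict[str, Any]) -> Dict[str, BaseException]:
    """Hypothetical helper: return {path: error} for sources with an 'error' key."""
    sources = result.get("sources", {}).get("data", {})
    return {
        path: entry["error"]
        for path, entry in sources.items()
        if isinstance(entry, dict) and "error" in entry
    }


# Same shape as the example above (assumed, not a documented schema).
acc_path = "dvclive/training/plots/metrics/train/acc.tsv"
result = {
    "definitions": {"data": {"dvclive/dvc.yaml": {"data": {acc_path: {}}}}},
    "sources": {
        "data": {
            acc_path: {
                "props": {},
                "error": FileNotFoundError(2, "No such file or directory"),
            }
        }
    },
}

errors = collect_source_errors(result)
for path, error in errors.items():
    print(f"{path}: {type(error).__name__}")
```

Note that nothing under `definitions` carries an error in this example, which is consistent with the observation that missing files surface on the source side.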
Related to #9898 (comment). From going up and down the stack between If a user removed a file from a plot definition and ran an experiment then there will be no way to show that data even if the file was captured as an out and the file is brought back as part of the definition. I can recreate this in all three products. I think the only way to get around this would be to collect/merge definitions first and then use those merged definitions to attempt to collect all of the relevant data. I think that approach would bring back the errors shown in the aforementioned comment. The issue can be recreated using a plots:
  - Accuracy:
      x: step
      y:
        training/plots/metrics/train/acc.tsv: acc
        training/plots/metrics/test/acc.tsv: acc
      y_label: accuracy
and outs
To restate the problem with a concrete example: If Edit: bringing this up because I keep running into issues like this with #9940. I'm leaning towards fixing issues like this before moving on. |
@mattseddon AFAIU this is the same as the discussion from iterative/vscode-dvc#3676, or am I missing something? I made an example in https://github.com/iterative/vscode-dvc-demo/tree/drop-test-acc (it still doesn't raise the types of errors mentioned in #9898 (comment)). No matter when we do the merging, we will need to choose between two conflicting In DVC, we merge based on the order the revisions are passed. For example,
For VS Code, AFAIK the revisions are ordered by latest revision (
Screen.Recording.2023-09-18.at.11.43.17.AM.mov
In Studio, it seems like it depends on the order in which you click:
Screen.Recording.2023-09-18.at.10.58.58.AM.mov
I think Studio is supposed to split the plots if any of the field definitions change between revisions, so this looks like a bug to me.
I updated Studio to merge definitions preferring the order of selection to mimic DVC behavior when it receives revs
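To make "merging by revision order" concrete, here is a small sketch of one way conflicting plot properties could be resolved. It assumes the first revision passed takes precedence for conflicting fields; the thread does not spell out the direction, and `merge_by_rev_order` is a hypothetical name, not a DVC or Studio function.

```python
from typing import Any, Dict, List


def merge_by_rev_order(
    revs: List[str], props_by_rev: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical merge: earlier revisions in `revs` win on conflicts."""
    merged: Dict[str, Any] = {}
    # Walk from last to first so earlier revisions overwrite later ones.
    for rev in reversed(revs):
        merged.update(props_by_rev.get(rev, {}))
    return merged


props_by_rev = {
    "workspace": {"title": "Accuracy", "y_label": "accuracy"},
    "main": {"title": "Acc", "x": "step"},
}

workspace_first = merge_by_rev_order(["workspace", "main"], props_by_rev)
main_first = merge_by_rev_order(["main", "workspace"], props_by_rev)
print(workspace_first["title"], main_first["title"])
```

Under this assumption, non-conflicting fields from every revision survive, and only the order of the revisions decides which value of a conflicting field (here `title`) is shown.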
@daavoo That was in https://github.com/iterative/studio/pull/6773? I guess I missed that we were changing more than just live plots there. I guess we might need to decide what the behavior should be in each product and how to make it more consistent, although I'm not sure it needs to be a high priority.
Yes, the 3rd point in the description.
The behavior of DVC made more sense to me for simple scenarios (i.e. updating title or axis name) so I went with updating Studio to match DVC.
We have done several iterations on how to merge in these scenarios and I'm not sure I've seen it become an issue for users, so I don't think we need to go too deep on this.
@dberenbaum I know that we already had a lengthy discussion, sorry for dragging this up again.
My point is that in this example when
This was a caching bug that I have fixed in iterative/vscode-dvc#4678
Is this the behaviour that we want to standardise on?
Thanks, I get it now. I think it's related to #7913 (both are about loading plots data in revs where the data isn't defined in any plot). In that issue, it's called expected behavior rather than a bug. I think it would be useful in some scenarios, but I'm not sure it's worth spending time on now. We already have too many marginal things to do on plots, haven't seen anyone ask for this, and it will likely add its own complexity and break other behavior. WDYT?
👍
I don't have a strong opinion on what we do or that we need to standardize at the moment. My understanding from iterative/vscode-dvc#3676 was that we intentionally chose not to follow either the existing DVC or Studio convention, and I don't know that we have a strong reason to prioritize changes.
Addresses the issue in iterative/dvclive#687 (comment).
It merges plot IDs across all dvc.yaml files in the repo. Technically this is a breaking change, but most likely it's an improvement for almost all use cases. We already merge plot IDs across revisions, so this PR does the same across dvc.yaml files. Not sure if there are valid cases to have separate plots with the same plot ID in different dvc.yaml files. Is it worth warning when there is an overlap?
Note: artifacts should follow the same behavior, which would help with issues like https://github.com/iterative/studio/issues/6939.
Before this PR:
After this PR: