exp queue: live metrics #8478

Closed
dberenbaum opened this issue Oct 26, 2022 · 34 comments · Fixed by #9170
Assignees
Labels
A: experiments (Related to dvc exp), p1-important (Important, aka current backlog of things to do), product: VSCode (Integration with VSCode extension)

Comments

@dberenbaum
Collaborator

For long-running queued/temp experiments, I'd like to see metrics that are being written in the temp dir while it's running, even if checkpoints aren't enabled. DVC collects these for the workspace but not for experiments running in temp dirs.

@dberenbaum added the p2-medium (Medium priority, should be done, but less important) and A: experiments (Related to dvc exp) labels Oct 26, 2022
@dberenbaum
Collaborator Author

These metrics should be collectible from the metrics files even if they haven't yet been written to dvc.lock.

@dberenbaum added the p1-important (Important, aka current backlog of things to do) label and removed the p2-medium (Medium priority, should be done, but less important) label Dec 20, 2022
@dberenbaum added this to DVC Dec 20, 2022
@mattseddon
Member

On top of the above information, the VS Code extension needs some way to map the temp directory that an experiment runs in back to that experiment. We require this information so that we can set up file watchers to:

  1. call exp show to update the table if there are live metrics
  2. call plots diff to get live plots updates.

[Q] Will plots diff be able to collect this information for us without some updates?

@dberenbaum moved this to Backlog in DVC Dec 20, 2022
@mattseddon added the product: VSCode (Integration with VSCode extension) label Dec 20, 2022
@dberenbaum
Collaborator Author

[Q] Will plots diff be able to collect this information for us without some updates?

No. VS Code could run dvc plots show from within those temp workspaces to collect their plots. Would that work?

@karajan1001 @daavoo Any ideas here?

@shcheklein
Member

Would that work?

I think the main issue is that we don't know those temp dirs, and that's what @mattseddon was asking about. If we know them, we can set up watchers (we need to make sure that we have all the project information from dvc exp show and other commands, i.e. which files to watch).

@dberenbaum
Collaborator Author

Yup, makes sense, just checking if that would be enough once he is able to determine those temp dirs, since AFAIK the extension relies entirely on plots diff right now. It may require some significant refactoring to start down this direction of watching each temp dir, although it could be more flexible and performant.

@karajan1001
Contributor

I think the main issue is that we don't know those temp dirs, and that's what @mattseddon was asking about. If we know them, we can set up watchers (we need to make sure that we have all the project information from dvc exp show and other commands, i.e. which files to watch).

To achieve this, we need to make it possible for DVC to gather data (metrics, parameters) remotely and add them to the local experiment table, whereas the current implementation uses git to fetch these data into the local workspace.

@dberenbaum
Collaborator Author

@karajan1001 mentioned in #8787:

In the VSCode extension I can see the live metrics, as in https://user-images.githubusercontent.com/6745454/212892100-ae2d032d-e23b-4b62-9207-28752a3de4c8.mp4

To clarify, this issue is about non-checkpoint updates that happen in the tmp workspace. AFAIK, the live metrics in the video are only possible because that demo repo uses checkpoints.

@shcheklein
Member

To achieve this, we need to make it possible for DVC to gather data (metrics, parameters) remotely and add them to the local experiment table, whereas the current implementation uses git to fetch these data into the local workspace.

@karajan1001 could you clarify please? I'm not sure I understand the point

@karajan1001
Contributor

@karajan1001 mentioned in #8787:

In the VSCode extension I can see the live metrics, as in https://user-images.githubusercontent.com/6745454/212892100-ae2d032d-e23b-4b62-9207-28752a3de4c8.mp4

To clarify, this issue is about non-checkpoint updates that happen in the tmp workspace. AFAIK, the live metrics in the video are only possible because that demo repo uses checkpoints.

Hi @dberenbaum, so what you mean here is that users continuously update metrics.json during training, and we need to monitor this file?

@karajan1001 could you clarify please? I'm not sure I understand the point

Hi @shcheklein, the current method of gathering remote execution results is to use git fetch to pull in the new commits generated by checkpoints. For this issue, it looks like we need to read metrics file updates even when they haven't been committed to Git; in that case, the only way is to monitor the file directly in the remote workspace.

@dberenbaum
Collaborator Author

Hi @dberenbaum, so what you mean here is that users continuously update metrics.json during training, and we need to monitor this file?

Correct.

@karajan1001
Contributor

Hi @dberenbaum, so what you mean here is that users continuously update metrics.json during training, and we need to monitor this file?

Correct.

For the local temp dir executor, we can just monitor the temp workspace, but for experiments running on a remote server, it would be much harder to implement this.

@dberenbaum
Collaborator Author

Yup, I think it's fine to focus on local execution when we get to working on this issue.

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

The directory being used can be determined based on the hash for the experiment. Files related to the temporary execution process/dir are stored in .dvc/tmp/exps/run/<hash>.

So for something like:

$ dvc exp run --queue --force
Queued experiment 'f684e2a' for future execution.
$ dvc exp run --run-all

The full exp hash will be visible in dvc exp show --json output - in this case it's f684e2a8963ee2c47590cb8fa2823fee4129753b

Inside the runs directory (once the queued experiment has actually been started) you will get:

$ tree .dvc/tmp/exps/run/f684e2a8963ee2c47590cb8fa2823fee4129753b
.dvc/tmp/exps/run/f684e2a8963ee2c47590cb8fa2823fee4129753b
├── f684e2a8963ee2c47590cb8fa2823fee4129753b.json
├── f684e2a8963ee2c47590cb8fa2823fee4129753b.out
├── f684e2a8963ee2c47590cb8fa2823fee4129753b.pid
└── f684e2a8963ee2c47590cb8fa2823fee4129753b.run

The file that vscode should be looking at is .dvc/tmp/exps/run/<hash>/<hash>.run:

cat .dvc/tmp/exps/run/f684e2a8963ee2c47590cb8fa2823fee4129753b/f684e2a8963ee2c47590cb8fa2823fee4129753b.run
{"git_url": "file:///Users/pmrowla/git/example-get-started/.dvc/tmp/exps/tmpb1z5g_1z", "baseline_rev": "352e2967ff6ca466fa313d9c1fc09a350bcee1a4", "location": "dvc-task", "root_dir": "/Users/pmrowla/git/example-get-started/.dvc/tmp/exps/tmpb1z5g_1z", "dvc_dir": ".dvc", "name": "", "wdir": ".", "result_hash": null, "result_ref": null, "result_force": false, "status": 4}

It's just a JSON file with a single-level dictionary. The root_dir key contains the temporary directory where the experiment is being run; in this case:

"root_dir": "/Users/pmrowla/git/example-get-started/.dvc/tmp/exps/tmpb1z5g_1z"

To get file-watcher based live metrics updates without relying on dvc in the parent workspace, the vscode extension can monitor that dir in the same way it monitors the regular workspace for regular runs.
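For illustration, a minimal Python sketch of that lookup (the helper name temp_workspace_for is hypothetical, not part of any DVC API):

import json
from pathlib import Path

def temp_workspace_for(repo_root: str, exp_hash: str) -> str:
    # Read the <hash>.run file for a started queued experiment and return
    # the temp workspace ("root_dir") to watch for live metrics updates.
    run_file = (Path(repo_root) / ".dvc" / "tmp" / "exps" / "run"
                / exp_hash / f"{exp_hash}.run")
    info = json.loads(run_file.read_text())
    return info["root_dir"]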

The CLI equivalent for getting real-time/live metrics for the running exp would just be:

$ cat /Users/pmrowla/git/example-get-started/.dvc/tmp/exps/tmpb1z5g_1z/evaluation.json

or

$ cd /Users/pmrowla/git/example-get-started/.dvc/tmp/exps/tmpb1z5g_1z
$ cat evaluation.json

or even

$ cd /Users/pmrowla/git/example-get-started/.dvc/tmp/exps/tmpb1z5g_1z
$ dvc exp show --json  # "workspace" entry for this command would now give the live values (since we are cd'd into the temp execution dir)
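Wrapping that in code, a tool could shell out to DVC from inside the temp dir; a rough sketch (assuming the "workspace" key in the --json output, as described above):

import json
import subprocess

def live_workspace_entry(tmp_dir: str) -> dict:
    # Run `dvc exp show --json` with the temp execution dir as cwd;
    # the "workspace" entry then reflects the live (uncommitted) values.
    proc = subprocess.run(["dvc", "exp", "show", "--json"],
                          cwd=tmp_dir, capture_output=True, text=True,
                          check=True)
    return json.loads(proc.stdout)["workspace"]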

@iterative/vs-code

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

For reference, all of the files in a given queued exp's runs/<hash>/ directory are as follows:

  • <hash>.json - json file w/executor process state (pid, paths for redirected output, returncode). (currently stdout will always point to the <hash>.out file in the same directory)
{"pid": 81538, "stdin": null, "stdout": "/Users/pmrowla/git/example-get-started/.dvc/tmp/exps/run/f684e2a8963ee2c47590cb8fa2823fee4129753b/f684e2a8963ee2c47590cb8fa2823fee4129753b.out", "stderr": null, "returncode": 255}
  • <hash>.out - redirected output for the running experiment (currently it always contains combined stdout + stderr; we do not redirect stderr separately for experiments). This can be followed while the task is running with something like tail --follow (or dvc queue logs --follow) to get the live output for queued jobs
Verifying outputs in frozen stage: 'data/data.xml.dvc'

Running stage 'prepare':
> python src/prepare.py data/data.xml

Running stage 'featurize':
> python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20017, 3)
The output matrix data/features/train.pkl size is (20017, 252) and data type is float64
The input data frame data/prepared/test.tsv size is (4983, 3)
The output matrix data/features/test.pkl size is (4983, 252) and data type is float64

Running stage 'train':
...
  • <hash>.pid - standard pidfile containing only the integer PID for the queue task (will be the same as the pid entry in the <hash>.json file)
81538
  • <hash>.run - json file as described in previous comment
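As a hedged example of how a consumer might use the <hash>.json process state (the helper name executor_alive is hypothetical):

import json
import os

def executor_alive(state_file: str) -> bool:
    # <hash>.json holds the executor's pid and, once it exits, a returncode.
    with open(state_file) as f:
        state = json.load(f)
    if state.get("returncode") is not None:
        return False  # the executor process has already exited
    try:
        os.kill(state["pid"], 0)  # signal 0 only probes for existence
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user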

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

We can consider revisiting how exp show collects running queued experiments, but it may be easier on the vscode side to just get the directory path from the .run file(s), and then treat each one like a workspace (and just cd + re-use whatever dvc exp show/dvc plots/etc workflows vscode is already doing)

@dberenbaum
Copy link
Collaborator Author

@pmrowla Why do you think it's easier for VS Code to do this than DVC? Top priority should be having this in VS Code, but I think a user could rightfully ask why they can't get the same info collected easily in the CLI exp show. I wouldn't say it's a hard requirement to have this in CLI exp show, but I think there has to be a strong reason why it shouldn't be there.

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

Because they can reuse their existing code, and because using file watchers to wait for a metrics file to change is the correct way to handle this in a gui application (as opposed to repeatedly calling exp show and having DVC re-read the file every time)

We can support this in exp show, but this is another case of "there is a better way for vscode to be doing certain things than repeatedly calling exp show for everything"

@dberenbaum
Collaborator Author

using file watchers to wait for a metrics file to change is the correct way to handle this in a gui application

Agreed, VS Code should be responsible for deciding when it's needed to get updates.

Once they determine an update is needed, why not have DVC collect the experiments in one place? When exp show is called, couldn't DVC also determine which files have changed and collect that info while relying on cached info for everything else? That seems like it would benefit both products. If I run dvc exp show while queued experiments are running, this seems like it would be helpful regardless of VS Code.

@shcheklein
Member

as opposed to repeatedly calling exp show and having DVC re-read the file every time)

@mattseddon can correct me, but we definitely don't plan to do it this way. We want to run this with watchers. We need to know temporary locations. I think that should be enough for VS Code.

but it may be easier on the vscode side to just get the directory path from the .run file(s)

If that works and it's stable enough, I think VS Code can read this information. @mattseddon wdyt?


All of this doesn't change the discussion on the DVC side though. I think @dberenbaum's point is that we'd also like to be able to see live metric updates as an HTML report in DVC, etc.

@dberenbaum
Collaborator Author

Let's separate plots and the exp table.

I don't really see a use case to have dvc plots show/diff generate live plots from queued experiments. I think this would require a major overhaul of the plots UI since there's nothing like dvc plots diff --queued or something to automatically collect plots from a bunch of experiments. That's out of scope, so providing the location for VS Code to watch for plots changes is enough.

For the exp table, it makes sense to me to have live updates from queued experiments collected as part of dvc exp show. I would also think this would be useful for VS Code, but would like to hear from @mattseddon on whether that's useful or if it would be just as easy to collect those themselves.

@mattseddon
Member

Tl;dr - yes, we can do it on the VS Code side; no, it would not be trivial. Minimum 2 weeks of effort to get either plots or experiments updates, then a little more to get the other. As plots are not on the roadmap, we will probably have to do this anyway.

exp show updates

In order to implement on the VS Code side I will need to:

  1. Read the contents of .dvc/tmp/exps/run on startup and check if there are any running experiments (search for and read all .run files).
  2. Start watching .dvc/tmp/exps/run.
  3. For any newly created directories read the contents of .dvc/tmp/exps/run/<sha>/<sha>.run*
  4. Use the root_dir entry to add a mapping between the tmp dir and the sha.
  5. When an event comes through for a file in the tmp dir, call the CLI with the tmp dir as the cwd (sketched below).
  6. Use the mapping created in 4 to pipe the output's workspace record back into the main experiments data under the sha.
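A rough Python sketch of steps 2-5 (the real extension would use Node/TypeScript file watchers; the third-party watchdog library stands in here):

import json
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

RUN_ROOT = Path(".dvc/tmp/exps/run")
tmp_dir_to_sha = {}  # step 4: maps each tmp dir back to its exp sha

class RunFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # step 3: each started exp writes <sha>/<sha>.run under RUN_ROOT
        if event.is_directory or not event.src_path.endswith(".run"):
            return
        info = json.loads(Path(event.src_path).read_text())
        tmp_dir_to_sha[info["root_dir"]] = Path(event.src_path).stem
        # step 5: a second watcher on info["root_dir"] would then call
        # the CLI (e.g. dvc exp show --json) with that dir as the cwd

observer = Observer()
observer.schedule(RunFileHandler(), str(RUN_ROOT), recursive=True)  # step 2
observer.start()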

[Q]s

  1. *There is a status field in these run files. Do any of the statuses indicate that the experiment is finished? Maybe 1 and/or 6?
  2. Does .dvc/tmp/exps/run ever get garbage collected? Will this directory grow indefinitely?
  3. Am I going to run into lock issues when I try to call exp show from two different temp directories at the same time? (Probably not, given that it is lockless now.)

There are plenty of points of failure in the above, and we seem to be relying more and more on the internals of DVC. Steps 5 and 6 will require some heavy lifting to fit in with the code that is already there. An optimistic estimate would be 2 weeks of effort to get this ironed out. It will more than likely turn into 3 weeks and could blow out even further.

plots diff updates

If DVC isn't going to provide plots updates then I will still have to build out most of the steps above anyway. The main difference is that step 6 becomes easier, because the data for each experiment is more or less held separately.

@mattseddon
Member

To add some further context to the above comment, the first task in iterative/vscode-dvc#3178 is:

Watch and keep updating plots

@pmrowla
Contributor

pmrowla commented Feb 24, 2023

*There is a status field in these run files. Do any of the statuses indicate that the experiment is finished? Maybe 1 and/or 6?

Yes, the status is mapped to celery task states, so the individual values probably aren't meaningful to vscode, but anything >= 3 indicates that the experiment is finished

class TaskStatus(IntEnum):
    PENDING = 0
    PREPARING = 1
    RUNNING = 2
    SUCCESS = 3
    FAILED = 4
    CANCELED = 5
    FINISHED = 6
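Given that mapping, a consumer could check for completion with something like (assuming the enum above is importable):

def is_finished(status: int) -> bool:
    # SUCCESS, FAILED, CANCELED, and FINISHED all mean the run is over
    return status >= TaskStatus.SUCCESS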

Does .dvc/tmp/exps/run ever get garbage collected? Will this directory grow indefinitely?

It's not garbage collected right now, but will be eventually. This is related to the performance issues with exp show and queue status

Am I going to run into lock issues when I try to call exp show from two different temp directories at the same time? (probably not being it is lockless now)

No, in this case from DVC's perspective you will be running exp show in two completely different git/dvc repositories

@dberenbaum
Collaborator Author

Watch and keep updating plots

Makes sense to me. @mattseddon @pmrowla Do you see any way around VS Code watching files to decide when to update? I assumed this has to be handled by VS Code since DVC is not running any kind of daemon and can only check for updates when called.

If DVC isn't going to provide plots updates then I will still have to build out most of the steps above anyway.

I'm open to discussing what DVC can do. I'm not sure it's worth extending the plots syntax, but for dvc plots diff running-exp1 running-exp2, it's possible DVC could try to get the live plots for each running experiment. Maybe we should start with dvc exp show and then we will have a better idea whether we can build on that for dvc plots diff or whether it will be better for VS Code to collect the plots.


For dvc exp show, VS Code will already know which directories need to be updated but still has to call dvc exp show to get a full table update. Theoretically, I don't think that should be much of an issue if DVC caches the revisions and itself checks to see which directories need to be updated, but I don't know how easy that is to implement.

@pmrowla Are there other concerns you have?

@mattseddon
Member

@pmrowla how much effort would it take to add the root_dir from the .run files and the <queued_hash> into the exp show JSON for applicable experiments? Would this be out of the question? Having that information available would mean that I could avoid reading the entire contents of .dvc/tmp/exps/run/ on my end.

@pmrowla
Contributor

pmrowla commented Mar 1, 2023

I don't think it belongs in exp show, especially since those fields are only valid as long as .dvc/tmp/exps/run actually exists (and exp show is for dumping experiment data that exists in git, which is completely separate from anything in the run directory). That separation is more important now that we are caching exp show on the DVC side as well (and the run-directory contents should definitely not be cached alongside the git data).

What we could do is add some other plumbing command that walks runs and combines+dumps everything to json for the vscode extension
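A hedged sketch of what such a plumbing command might emit (the name dump_exec_info and the output shape are hypothetical, not a committed design):

import json
from pathlib import Path

def dump_exec_info(repo_root: str) -> str:
    # Walk the run directory and merge each <hash>.run entry with its
    # <hash>.json process state, keyed by the exp hash.
    result = {}
    run_root = Path(repo_root) / ".dvc" / "tmp" / "exps" / "run"
    for run_file in run_root.glob("*/*.run"):
        entry = json.loads(run_file.read_text())
        state_file = run_file.with_suffix(".json")
        if state_file.exists():
            entry["process"] = json.loads(state_file.read_text())
        result[run_file.stem] = entry
    return json.dumps(result)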

@mattseddon
Member

Would it be more appropriate to extend dvc queue status?

@pmrowla
Contributor

pmrowla commented Mar 1, 2023

Thinking about this some more, I think maybe we can keep everything vscode wants/needs in a single dvc command (whether or not that's dvc exp show --json), but what I would like to do is change the exp json output in a backwards incompatible way.

The issue right now is that we started from the standpoint of "just dump the dvc structures dicts in json, in a way that also sort of looks like what gets displayed in the CLI table". This was fine from a "we need something that works for vscode right now" standpoint, but what we have 2 years later is a giant mess because what we really do is:

  1. collect a lot of structured data internally in dvc from a lot of different places
  2. convert this structured data into a bunch of dicts/lists in non-obvious ways, in order to render a bunch of semi-related data that is not flat rows into a flat CLI table
  3. take those ugly dicts meant for CLI rendering only and add some more fields that vscode needs but have nothing to do with the CLI table (making the data structure even worse)
  4. dump all of that to JSON so vscode can consume it

This problem isn't even specific to vscode and the json output. The CLI table has changed over time as well. exp show started from "just display single-commit experiments and flat git branches when a user manually runs a command from the CLI" but over time we've had to add more and more stuff like

  • display checkpoint runs (and checkpoint runs that branch off of existing checkpoint runs)
  • add more granular error handling in params/metrics
  • display actively running workspace exps
  • display queued [tempdir, dvc machine/ssh, celery] exps
  • display actively running [tempdir, dvc machine/ssh, celery] exps
  • display dvc deps/outs

and the exp show collection code in dvc is a bunch of tacked on hacks to do all of these things, which is reflected in the current --json output also looking like it was done as a bunch of tacked on hacks.


Collecting the live metrics stuff from the tempdirs is straightforward for DVC, and the dvc-task/celery stuff was all done in a way to make that kind of collection easier than it was before. But exp show is still the poorly organized mess from before we knew what kind of use cases (and alternative output formats) we would actually need to support in the long run.

Instead of continuing to try and work around the current setup, we should come up with a data schema that is actually sane.

This will require changes in both dvc and vscode, but it will allow future additions to be added easily and will make the data make sense for both dvc and consumers (whether that's vscode/studio or something else entirely).

@pmrowla
Contributor

pmrowla commented Mar 1, 2023

Would it be more appropriate to extend dvc queue status?

If I was going to put the executor/process stuff in a separate subcommand it probably belongs in dvc queue, but again, I'm not sure it belongs somewhere like dvc queue --json. The --json flag is intended to be "the CLI output for a given command in JSON form". What vscode wants here is stuff like PIDs/hidden directories/etc that typical users should not care about at all, and won't be displayed in dvc queue status CLI output

What we really want is something separate that is intended specifically for vscode

@mattseddon
Member

Thinking about this some more, I think maybe we can keep everything vscode wants/needs in a single dvc command (whether or not that's dvc exp show --json), but what I would like to do is change the exp json output in a backwards incompatible way.

I am on board with this 100% and I am sorry that we have contributed to the code/output being so hacked together. If it helps I can easily get together a list of things that we need/use from the output and how we use it. I will happily rewrite whatever I need to on the VS Code side to accommodate these changes and I'd like to contribute on the DVC side too.

@mattseddon
Member

I think it would be a good idea to break the original problem into two separate parts:

  1. The mapping between a queued experiment and its temp directory.
  2. Collecting information about that experiment.

VS Code can fairly easily handle 1 and call DVC for updates at the correct time.
I think DVC should be able to handle 2 via plots diff/exp show as it will provide value to users.

@dberenbaum
Collaborator Author

Talked to @pmrowla today, and he is planning to research (up until early next week) the direction he mentioned above, which could potentially collect all info (params, metrics, plots, etc.) for any set of experiments.

@pmrowla
Contributor

pmrowla commented Mar 3, 2023

If it helps I can easily get together a list of things that we need/use from the output and how we use it.

This would be great to get from the vscode side @mattseddon

@mattseddon
Member

So the tl;dr is that we currently use everything that isn't outs.

From the exp show data we extract the following information:

  • Deps, params, metrics are all displayed in the experiments table. For Deps we only use the hash but there is a plan to put the extra information at least into a tooltip.
  • Deps, params, and metrics file paths are extracted from the respective dicts' keys. We use these to watch for updates / call exp show. We also show these in a tree structure in the UI.
  • Metrics files: we currently watch these for plots updates as well / call plots diff.
  • Params are used separately to create new experiments (with -S).
  • We display file level and overall errors in various places in the UI. Behaviour is different based on an error/data key in the JSON. We use the error msg but disregard the type.
  • We make an id from the name or SHA (key from the dict).
  • We recreate logic that is in DVC to add [ ] or ( ) to the displayed name of experiments, e.g. f596aa8 [mixed-sacs].
  • Status is used to determine the state an experiment is in. We actually aggregate this data to work out whether there is a single experiment running and then stop the user from performing the majority of other experiment actions whilst that is happening.
  • We use the executor field to determine whether experiments are running in the queue and if we can stop that experiment using dvc queue kill.
  • Timestamp is also used in the UI.

Outside of exp show we collect the following for the experiments parts of the extension:

  • Whether or not there are checkpoint experiments in the workspace by reading all available dvc.yamls.
  • We collect and append git information (author, date, hash, message, tags) against commit records. We get this data from `git log ^..HEAD --pretty=format:%H%n%an%n%ar%nrefNames:%D%nmessage:%B -z`.
  • Whether or not an experiment is running in DVCLive-only mode (i.e. the user just onboarded) using a signal file, so that we can mark it as running.
  • We get the PID from .dvc/tmp/exps/rwlock.lock so that we can kill an experiment running in the workspace.
  • Whether or not there are any stages (dvc stage list) so that we can display a message to the user and help them to add one.

Having a <task_id> & <temp_dir> against records initially created as a queued experiment would be handy.

Please let me know if you need more information. Happy to go into greater details on any of the points above.
🙏🏻

@pmrowla self-assigned this Mar 7, 2023
@pmrowla moved this from Backlog to In Progress in DVC Mar 7, 2023
@github-project-automation bot moved this from In Progress to Done in DVC Apr 20, 2023