Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queue: wrap up e2e workflow #3178

Closed
3 of 5 tasks
shcheklein opened this issue Jan 30, 2023 · 15 comments · Fixed by #3965
Closed
3 of 5 tasks

Queue: wrap up e2e workflow #3178

shcheklein opened this issue Jan 30, 2023 · 15 comments · Fixed by #3965
Assignees
Labels
priority-p1 Regular product backlog story Product feature aka epic. Discussion, progress, checkboxes for implementation, etc

Comments

@shcheklein
Copy link
Member

shcheklein commented Jan 30, 2023

Followup after #3091

Tasks

Preview Give feedback

See #3091 (comment)

@shcheklein shcheklein added 🔍 review A placeholder issue to review certain part of the product or story priority-p1 Regular product backlog labels Jan 30, 2023
@shcheklein shcheklein self-assigned this Jan 30, 2023
@shcheklein shcheklein added the 📦 product Needs product input or is being actively worked on label Jan 30, 2023
@shcheklein shcheklein changed the title Review queue functionality Queue: wrap up e2e workflow Feb 4, 2023
@shcheklein shcheklein added story Product feature aka epic. Discussion, progress, checkboxes for implementation, etc and removed 🔍 review A placeholder issue to review certain part of the product or story labels Feb 4, 2023
@shcheklein
Copy link
Member Author

I've updated the ticket.

@mattseddon
Copy link
Member

mattseddon commented Feb 5, 2023

Watch and keep updating plots

Depends on iterative/dvc#8478. Unactionable until that is closed.

@mattseddon
Copy link
Member

@shcheklein can you please reorder the checkboxes in order of priority? I will add some comments.

@shcheklein
Copy link
Member Author

@mattseddon done (also, trying a new beta feature - tasklists).

@mattseddon
Copy link
Member

When something runs in background - can we allow running an experiment in workspace?

We need to update the extension's package.json and the experiments table to make this happen. It should be achievable to enable. However, I have manually tested this and the experience feels very unstable.

Attach and follow standard output / error

It can be done with dvc queue logs <exp-id> -f but will need a sizeable amount of wiring built around it. We would need to use the PseudoTerminal and build a wrapper around it that looks a lot like the DvcRunner. Should be enabled from the right-click context menu in the table which will bring its own set of challenges (can it be run twice for the same experiment, should it be disabled, under what circumstances, etc).

Failed / stoped - add resume and / or run again action

I don't think this is currently possible. Maybe we could do something with a combination of apply, re-queue and start the queue but I feel like it is unlikely that it will work as expected. Feels like this needs a FR against DVC.

Add to a queue action is very slow - can we add a progress bar and notification?

What would the progress notification show and when would it update? From the data we get back, it would go from being empty to being full and that would be it (we have no information about intermediate steps). My suggestion for this would be to raise an issue in DVC to make the command more performant.

@mattseddon
Copy link
Member

I don't think we can mark this as done until iterative/dvc#8763 is resolved.

@shcheklein
Copy link
Member Author

Thanks @mattseddon, for giving more color / details to the list.

However, I have manually tested this and the experience feels very unstable.

Could you please add more color/details to this?

It can be done with dvc queue logs -f but will need a sizeable amount of wiring built around it. We would need to use the PseudoTerminal and build a wrapper around it that looks a lot like the DvcRunner. Should be enabled from the right-click context menu in the table which will bring its own set of challenges (can it be run twice for the same experiment, should it be disabled, under what circumstances, etc).

can we cut the scope? Show some tooltip with static output when you show it , dump logs as a file and open it, etc? Clearly ppl will have to run it again to get the most recent result. But at least it something. It's way better compared to nothing.

Failed / stoped - add resume and / or run again action
I don't think this is currently possible. Maybe we could do something with a combination of apply, re-queue and start the > queue but I feel like it is unlikely that it will work as expected. Feels like this needs a FR against DVC.

@dberenbaum could you please chime in? What's your take on being able to resume / queue again a failed / killed experiment(s)?

My suggestion for this would be to raise an issue in DVC to make the command more performant.

It's not possible I think. We can get it only to a certain level. I see that it's problematic w/o DVC support. We need to start thinking about improving this though - for data commands, for queue, for all commands that take long time to run.

@shcheklein shcheklein removed their assignment Feb 11, 2023
@dberenbaum
Copy link
Contributor

@dberenbaum could you please chime in? What's your take on being able to resume / queue again a failed / killed experiment(s)?

Resuming a killed experiment and re-queuing a failed experiment seem like two different scenarios to me:

  1. Resuming a killed experiment: this is strictly a bug/regression that's tracked in exp run: --temp --rev does not properly resume from the target revision dvc#7813. We haven't prioritized yet because we need to solidify the entire checkpoints/resume workflow and this is merely a bandaid, but I don't expect it would be too hard to address if needed.
  2. Re-queuing a failed experiment: I thought there was more discussion in DVC, but for now I only found Repeat an experiment dvc#8867. I think it's important but complicated, and for now let's start with exposing the logs so at least people know what happened. If the experiment failed, how likely is it that someone wants to re-run it again as is? Agree with @mattseddon that the current suggestion would be to apply it, review what was wrong, and then queue it again. I think it's too complicated for VS Code to do in one action. Let's discuss in Repeat an experiment dvc#8867, and if queuing becomes useful enough, we should start getting more user feedback here to decide on next steps.

@shcheklein
Copy link
Member Author

how likely is it that someone wants to re-run it again as is?

Let's say I made a silly mistake and 100 experiments failed? Let's say I forgot to pull data? Let's say it happens in the last stage of the pipeline and it took a lot of time to run them?

I think I would want a way to fix something and re-rerun / resume things.

@dberenbaum
Copy link
Contributor

Yep, agree that it's important. I guess the p3 in iterative/dvc#8867 doesn't reflect that, but I did note there that it's important and is only p3 because we already have too many competing priorities.

Let's say I made a silly mistake and 100 experiments failed? Let's say I forgot to pull data? Let's say it happens in the last stage of the pipeline and it took a lot of time to run them?

I don't mean to discount this, but I wasn't sure this was painful enough to make the queue feel broken. In that example, do you think it's likely that the 100 experiments got queued one at a time or by a grid search like dvc exp run -S lr=range(...)? Is there a common scenario where it would be really painful to re-queue them?

What if I find that I needed to change something in my code or dvc.yaml?

I think it's hard to imagine all the ways that experiments will fail and what the most useful ways will be to "fix" them without much user feedback. I wasn't sure this made sense to prioritize just yet, but of course we can change that if you think it's essential right now and feel confident that a relatively simple fix like exp run --failed would be useful enough.

@shcheklein
Copy link
Member Author

@dberenbaum I put it also almost at the bottom of the list above. So, it's not critical, probably not p3 either :)

@mattseddon
Copy link
Member

mattseddon commented Feb 24, 2023

Watch and keep updating plots

Spent some time thinking about and replying on iterative/dvc#8478 (iterative/dvc#8478 (comment)). Looks likely that we'll need to build out the mechanics in this repo (at least to begin with).

I'll start working on the rest of the list while I wait for a response.

@mattseddon mattseddon added the blocked Issue or pull request blocked due to other dependencies or issues label Mar 2, 2023
@mattseddon
Copy link
Member

Marking this as blocked pending the discussion in iterative/dvc#8478

cc @pmrowla @dberenbaum

@mattseddon
Copy link
Member

2 of the remaining 4 points are currently being discussed elsewhere.

iterative/dvc#8867

Discussion happening in #3676.

Failed / stoped - add resume and / or run again action

The discussion will happen in iterative/dvc#8867. Do we need this for this issue or can we cut the scope?

The other 2 I should get to shortly.

@mattseddon
Copy link
Member

unblocked by iterative/dvc#9432 / 2.58.1 of DVC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority-p1 Regular product backlog story Product feature aka epic. Discussion, progress, checkboxes for implementation, etc
Projects
None yet
3 participants