data files disappearing from file tree after editing .dvc file(s) by hand and attempting dvc repro #1800

Closed
brbarkley opened this issue Mar 28, 2019 · 10 comments

Comments

@brbarkley
Contributor

Please provide information about your setup

DVC version (i.e. dvc --version)

$ dvc --version
0.32.1

Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))

$ conda list dvc
# packages in environment at C:\DevTools\miniconda3\envs\cg-analytics:
#
# Name                    Version                   Build  Channel
dvc                       0.32.1                   pypi_0    pypi

Issue:
Data files tracked by DVC are disappearing from my local file tree before I have pushed them to my remote. I'm not sure how it happened and I did not want them to disappear.

However, DVC still detects the files when running dvc status -c

$ dvc status -c
Preparing to collect status from J:/dvc-remote
[##############################] 100% Collecting information
        new:                output/figures/dict_pca_plots.pkl
        new:                output/figures/pca_plots
        new:                output/figures/pca_plots\pca_var_contrib_bsheet2_0.svg
        new:                output/figures/pca_plots\scree_firm_2017.svg
        new:                output/figures/dict_silhouette_plots.pkl
        new:                output/figures/dict_kmeans_plots.pkl
        new:                output/figures/kmeans_plots
        new:                output/figures/kmeans_plots\bsheet2_model_compare_yr.svg
        new:                output/figures/kmeans_plots\heatmap_firms.svg
        new:                output/figures/kmeans_plots\heatmap_firms_bsheet2.svg
        new:                output/figures/kmeans_plots\heatmap_firms_comp8.svg
        new:                output/tables/dict_tables.pkl
        new:                output/tables/summary_stats.pkl
        new:                output/tables/summary_stats.csv
        new:                output/models/model_spec.pkl
        new:                output/models/z_var_dict.json
        new:                output/models/var_dict_labels.json
        new:                output/models/risk_dict.json
        new:                output/tables/model_var_names_definitions.pkl
        new:                output/tables/model_var_names_definitions.csv
        new:                output/models/diagnostics/dict_silhouette_scores.pkl
        new:                output/models/model_output.pkl
        new:                output/models/z_risk_dict.json

But notice the output/figures/pca_plots and output/figures/kmeans_plots folders and their files are not in my file tree:

[Image: screenshot of the local file tree, without the pca_plots and kmeans_plots folders]

Is there a way to get the files back into my local file tree? Would dvc push followed by dvc pull do the trick since the "missing" files are still detected by dvc status -c?

Also, is it possible to pinpoint how/when the files disappeared?

Thanks,
@brbarkley

@efiop
Contributor

efiop commented Mar 28, 2019

Hi @brbarkley !

Looks like you simply deleted them from your workspace. Which dvc file lists output/figures/pca_plots and output/figures/kmeans_plots as outputs? Let's say it is called pca_plots.dvc; then you could simply check it out with dvc checkout pca_plots.dvc and that should bring them back. If you've used dvc pull -R dir/ recently, then it might be our bug #1788 that deleted your outputs from the workspace (it didn't touch the cache, which is why we are still able to recover them with dvc checkout). The fix for that is merged and we will release it today or tomorrow.

Thanks,
Ruslan

@brbarkley
Contributor Author

Hi @efiop

Yes, I had attempted dvc checkout previously but it warned that it was going to remove existing files

Checking out '{'scheme': 'local', 'path': 'output\\figures\\dict_pca_plots.pkl'}' with cache 'b32a81a29e0ab21cc0e248ed79041224'.
file '{'scheme': 'local', 'path': 'output\\figures\\dict_pca_plots.pkl'}' is going to be removed. Are you sure you want to proceed? [y/n]

I surmised this meant that it would replace the existing file with the one in cache, but the warning language is not clear so I aborted because I did not want to lose the file.

Perhaps DVC should revise their warning language for dvc checkout and dvc pull?

"file is going to be removed" communicates something different to me than "local file is going to be updated/replaced with cache file".

After working up the courage to proceed with dvc checkout, my desired files are now back in my local file tree as you suggested they would be. But I still do not know how they disappeared. I am pretty certain--but cannot provide proof--that I did not delete them manually. Preceding the disappearance, I had been attempting to update my dvc pipeline (i.e., updating my dvc run commands and consolidating multiple stages into one). When I initiated dvc run to update the stage(s) in question, I received an error during the Python execution that the files in question did not exist. This is now sounding like a mystery novel; perhaps I will need to hire a private investigator :-)

Thanks for your help!

@brbarkley
Contributor Author

UPDATE:
I believe I have retraced my steps that led to the above issue

Issue:
Step 1
I attempt to edit the file paths of outputs and/or dependencies in a specific dvc file (i.e., output/make_plots_tables.dvc) with dvc move

$ dvc move output/models/variables/risk_dict.json output/models/risk_dict.json
Error: failed to move 'output/models/variables/risk_dict.json' -> 'output/models/risk_dict.json' - move is not permitted for stages that are not data sources. You need to either move 'output\model_01_pre_estimation.dvc' to a new location and edit it by hand, or remove 'output\model_01_pre_estimation.dvc' and create a new one at the desired location.

Step 2
I edit my output/make_plots_tables.dvc by hand as suggested by the above dvc move error message. Note: File folders output/figures/pca_plots and output/figures/kmeans_plots and their files have NOT disappeared yet.

Step 3
I attempt to dvc repro

$ dvc repro output/make_plots_tables.dvc
Warning: Dvc file 'output\make_plots_tables.dvc' changed.
Stage 'output\make_plots_tables.dvc' changed.
Reproducing 'output\make_plots_tables.dvc'
Running command:
        python -m src.visualization --make_all
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'output\\figures\\pca_plots\\scree_firm_2017.svg'
Error: failed to reproduce 'output\make_plots_tables.dvc': stage 'output\make_plots_tables.dvc' cmd python -m src.visualization --make_all failed

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Note: File folders output/figures/pca_plots and output/figures/kmeans_plots and their files have disappeared.

Step 4
I dvc checkout and the disappeared files are back.

Solution
Instead of editing my output/make_plots_tables.dvc by hand as suggested by the dvc move error message, the correct solution seems to be to redefine the dvc run command for the affected dvc file and re-execute it, overwriting the previous dvc file.

dvc run command before:

dvc run -d src/visualization/__main__.py \
            -d src/visualization/visualize.py \
            -d src/tools/utils.py \
            -d output/models/model_spec.pkl \
            -d output/models/z_var_dict.json \
            -d output/models/var_dict_labels.json \
            -d output/models/diagnostics/dict_silhouette_scores.pkl \
            -d output/models/model_output.pkl \
            -d output/models/risk_dict.json \
            -d output/models/z_risk_dict.json \
            -d data/processed/df_processed.pkl \
            \
            -f output/make_plots_tables.dvc $4 \
            -o output/figures/dict_pca_plots.pkl \
            -o output/figures/pca_plots/ \
            -o output/figures/dict_silhouette_plots.pkl \
            -o output/figures/dict_kmeans_plots.pkl \
            -o output/figures/kmeans_plots/ \
            -o output/tables/dict_tables.pkl \
            python -m src.visualization --make_all

dvc run command after (note changed file paths in dependencies):

dvc run -d src/visualization/__main__.py \
            -d src/visualization/visualize.py \
            -d src/tools/utils.py \
            -d output/models/model_spec.pkl \
            -d output/models/variables/z_var_dict.json \
            -d output/models/variables/var_dict_labels.json \
            -d output/models/dict_silhouette_scores.pkl \
            -d output/models/model_output.pkl \
            -d output/models/variables/risk_dict.json \
            -d output/models/variables/z_risk_dict.json \
            -d data/processed/df_processed.pkl \
            \
            -f output/make_plots_tables.dvc $4 \
            -o output/figures/dict_pca_plots.pkl \
            -o output/figures/pca_plots/ \
            -o output/figures/dict_silhouette_plots.pkl \
            -o output/figures/dict_kmeans_plots.pkl \
            -o output/figures/kmeans_plots/ \
            -o output/tables/dict_tables.pkl \
            python -m src.visualization --make_all

All is updated correctly. I then git commit output/make_plots_tables.dvc

Suggestion:

  • DVC should update error message for dvc move to instruct users to redefine the stage in question using the dvc run command instead of instructing users to edit the .dvc file by hand
  • If the error message for dvc move is thought to reflect how DVC should actually behave (i.e., update your .dvc file by hand and all should be ok), perhaps there is a bug that is preventing DVC from behaving as expected
  • A related issue that arose during the above workflow is that if the updated dvc run command fails--due, for example, to a typo in a dependency file path--the dvc file associated with the before specification is not restored. (Note, I have my dvc run commands stored in dvc_pipeline.sh and call the updated command with options defined in the bash script.)
$ source ./scripts/dvc_pipeline.sh -m src.visualization --make_all
Warning: Output 'output\figures\dict_pca_plots.pkl' of 'output\make_plots_tables.dvc' changed.
'output\make_plots_tables.dvc' already exists. Do you wish to run the command and overwrite it? [y/n] y
Running command:
        python -m src.visualization --make_all
Error: failed to run command - missing dependency: output\models\diagnostics\dict_silhouette_scores.pkl

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

I can easily restore the output/make_plots_tables.dvc with git revert but it seems DVC could just automatically restore it if dvc run fails.
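Until something like that exists, a user-side wrapper can approximate the behaviour. The following is only a minimal sketch, not anything DVC provides: run_with_backup is a hypothetical helper name, and it assumes that copying the stage file aside before the run and copying it back on a non-zero exit code is acceptable for the project.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List


def run_with_backup(stage_file: Path, command: List[str]) -> int:
    """Run `command`; if it fails, put the previous copy of `stage_file` back.

    Hypothetical helper for illustration only; it is not part of DVC.
    """
    backup = stage_file.with_name(stage_file.name + ".bak")
    had_original = stage_file.exists()
    if had_original:
        # Keep a copy of the stage file as it was before the run.
        shutil.copy2(stage_file, backup)
    returncode = subprocess.run(command).returncode
    if returncode != 0 and had_original:
        # The command failed: restore the stage file from the copy.
        shutil.copy2(backup, stage_file)
    if backup.exists():
        backup.unlink()
    return returncode
```

It would be called with the stage file and the full command, for example run_with_backup(Path("output/make_plots_tables.dvc"), ["dvc", "run", ...]). If the stage file is already committed, git checkout -- output/make_plots_tables.dvc restores it as well, without any wrapper.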

@brbarkley changed the title from "data files disappearing from file tree but are still detected as new by dvc status -c" to "data files disappearing from file tree after editing .dvc file(s) by hand and attempting dvc repro" on Mar 29, 2019
@efiop
Contributor

efiop commented Mar 30, 2019

Hi @brbarkley !

Thanks for the investigation and all the suggestions! We will revisit those error messages to be more informative. 🙂

A related issue that arose during the above workflow is that if the updated dvc run command fails--due, for example, to a typo in a dependency file path--the dvc file associated with the before specification is not restored. (Note, I have my dvc run commands stored in dvc_pipeline.sh and call the updated command with options defined in the bash script.)

I can easily restore the output/make_plots_tables.dvc with git revert but it seems DVC could just automatically restore it if dvc run fails.

I don't think dvc should restore those automatically, because checking out files for the command that clearly failed in your script is even more dangerous and might break your pipeline. It is better to have it error-out like that, so at least you are aware of your previous command failing and not creating needed dependencies for a new stage.

Btw, could you talk a little bit more about your ./scripts/dvc_pipeline.sh script? Is it a wrapper around dvc run command? If so, what is it doing?

@brbarkley
Contributor Author

Thanks @efiop!

[replying from mobile]
Is there a short explanation of why dvc repro is deleting the files in question?

My dvc_pipeline.sh file is kind of a wrapper. As I integrated DVC into my workflow, I found it necessary to keep a record of my dvc run commands for each stage file created. The bash script accommodates this need by storing the commands, gives me an easy way to edit them in the case a stage needs to be modified, and depending on bash parameters specified I can re-run specific stages.

The name of the file is a bit misleading I suppose since it mostly contains dvc run commands. However, I do have an option at the end of the file which calls dvc pipeline to create and export my DAG to svg.
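To illustrate the idea of keeping stage definitions on record, here is a rough Python equivalent of such a wrapper. It is a sketch under assumptions: the STAGES layout and the build_dvc_run name are made up for this example, the paths are a subset of those in the dvc run command quoted earlier in this thread, and only the -d, -o and -f flags that appear in that command are used.

```python
from typing import Dict, List

# Hypothetical registry: stage file -> definition. The dict layout is an
# assumption for illustration; the paths come from the command in this thread.
STAGES: Dict[str, dict] = {
    "output/make_plots_tables.dvc": {
        "deps": [
            "src/visualization/__main__.py",
            "output/models/model_output.pkl",
        ],
        "outs": [
            "output/figures/pca_plots/",
            "output/figures/kmeans_plots/",
        ],
        "cmd": ["python", "-m", "src.visualization", "--make_all"],
    },
}


def build_dvc_run(stage_file: str, stage: dict) -> List[str]:
    """Return the argv list for `dvc run`, using only -d, -f and -o."""
    argv = ["dvc", "run"]
    for dep in stage["deps"]:
        argv += ["-d", dep]
    argv += ["-f", stage_file]
    for out in stage["outs"]:
        argv += ["-o", out]
    return argv + stage["cmd"]


if __name__ == "__main__":
    for name, stage in STAGES.items():
        # Print instead of executing, so the sketch is safe to run anywhere.
        print(" ".join(build_dvc_run(name, stage)))
```

Moving or renaming a dependency then means editing one entry in STAGES and re-running the stage, and the registry itself can be version controlled alongside the .dvc files.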

@efiop
Contributor

efiop commented Apr 1, 2019

@brbarkley By default, dvc repro removes the output files of a stage before running it, to ensure that they are actually produced by the stage, so that we know they are indeed reproducible.
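The effect of that behaviour on a stage whose command expects its output folder to exist can be simulated in a few lines. This is a toy model, not DVC's implementation: reproduce is a hypothetical function that deletes the declared outputs, runs a callable in place of the stage command, and reports any outputs that are still missing.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def reproduce(outputs: List[Path], command: Callable[[], None]) -> List[Path]:
    """Delete declared outputs, run the command, return outputs still missing.

    Toy model for illustration only; it does not reflect DVC internals.
    """
    for out in outputs:
        if out.is_dir():
            shutil.rmtree(out)
        elif out.exists():
            out.unlink()
    command()
    return [out for out in outputs if not out.exists()]


if __name__ == "__main__":
    workspace = Path(tempfile.mkdtemp())
    plots = workspace / "pca_plots"
    plots.mkdir()
    (plots / "scree.svg").write_text("<svg/>")

    def write_plot() -> None:
        # Assumes the folder exists, like the stage command in this thread.
        (plots / "scree.svg").write_text("<svg/>")

    try:
        reproduce([plots], write_plot)
    except FileNotFoundError as err:
        print("stage failed:", err)
```

Because the folder is removed before the command runs, the write fails with FileNotFoundError and the folder stays absent from the workspace, which matches the symptom reported in Step 3 above.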

My dvc_pipeline.sh file is kind of a wrapper. As I integrated DVC into my workflow, I found it necessary to keep a record of my dvc run commands for each stage file created. The bash script accommodates this need by storing the commands, gives me an easy way to edit them in the case a stage needs to be modified, and depending on bash parameters specified I can re-run specific stages.

And why don't you just use the dvc repro feature? Your stages are already recorded in the dvc files; you could open them with an editor and modify them whenever you'd like. Just trying to understand your use case.

Thanks,
Ruslan

@brbarkley
Contributor Author

[reply by mobile]

@efiop I do use dvc repro but in some cases I find it more efficient to redefine a stage by editing the original dvc run command and re-executing dvc run.

I had in fact attempted to edit one of my dvc files by hand (because I had changed the file paths of some of my dependencies and outputs) and subsequently call dvc repro to update the pipeline. However, dvc repro would not run successfully, which is the reason I opened this issue. So it’s not clear to me how editing dvc files by hand and running dvc repro is a robust solution (note, dvc move did not work for moving/editing the file paths in question because it said the dvc stage was associated with an image/chart output as opposed to a data file...see the error message above). In addition, the DVC documentation and usage guide do not clearly show how to go about editing dvc files by hand, whereas dvc run has better documentation and seems to be a more programmatic way to update and version control my pipeline.

Also, as DVC is still in development with command syntax still in flux (e.g., see recent changes to dvc run which I’m not complaining about...I liked the changes), I find it much easier to keep my project up-to-date if I have record of the dvc run commands that can simply be edited instead of completely rewritten.

So, generally speaking, I use dvc repro if I have edited the contents of one of my code dependencies. But if I want/need to move/rename files or add/delete dependencies or outputs, I edit my dvc run commands and re-execute dvc run.

If there’s a better or more approved way of doing things, I’m open to suggestions.

Thanks,
Brett

@efiop
Contributor

efiop commented Apr 2, 2019

Hi @brbarkley !

Thanks a lot for explaining! That feature for dvc move is on our TODO list (#1489). Also, I created iterative/dvc.org#230 for a "how to edit dvcfiles" guide. I think your current approach is fine, as long as you are comfortable with it, but I could not help noticing that in your bash script you are kinda duplicating the role of dvcfiles, which already have your pipeline recorded. I see that this is caused by temporary inconveniences with dvc move and editing dvcfiles, so we will look into that, to see whether fixing those issues might remove the need for you to use your own bash script. 🙂

Thanks,
Ruslan

@brbarkley
Contributor Author

Hi @efiop

Thanks for the insight! Yes, I think #1489 would improve dvc move functionality and help with my workflow. More documentation on editing dvcfiles would be good, too.

As an additional note--one that should have been mentioned in Step 3 of iterative/dvc#1800 (comment)--another factor contributing to my problems with dvc repro has to do with particularities around listing file folders as outputs in a .dvc file.

This may seem obvious, but if file folders are listed as outputs in a particular stage, users need to ensure that the command for that stage includes a means of checking for the folder's existence and creating it if it doesn't exist. Otherwise dvc repro will fail due to a FileNotFoundError when that stage's command is executed (because dvc repro will have deleted those folders before executing the stage's command). Perhaps I overlooked it, but this could be stated in the documentation for dvc repro.
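In a Python stage like the one in this thread, the check can live in the function that writes the file. A minimal sketch, assuming the figures are written as SVG text; save_figure is a hypothetical helper name, not something from the project or from DVC:

```python
from pathlib import Path


def save_figure(svg_text: str, destination: Path) -> Path:
    """Write svg_text to destination, creating parent folders if needed."""
    # exist_ok=True makes this a no-op when the folder is already there,
    # and parents=True recreates the whole chain if it was deleted.
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(svg_text)
    return destination
```

For example, save_figure("<svg/>", Path("output/figures/pca_plots/scree_firm_2017.svg")) succeeds whether or not output/figures/pca_plots was removed beforehand.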

Closing this issue.

Brett

@efiop
Contributor

efiop commented Apr 4, 2019

@brbarkley That is a really good point! Created iterative/dvc.org#233. Please take a look and see if you could clarify it a bit. Thank you for all the feedback! 🙂
