This issue was moved to a discussion.
Dataset storage improvements #1487
Comments
Why a shareable directory (NFS, S3 bucket, etc.) is not enough? (this would also cover access control)
Indeed, maybe we can store our files with the following naming:
Just to be on the same page, we are talking about |
@MrOutis "Why a shareable directory (NFS, S3 bucket, etc.) is not enough?" - great question! I'd separate the question into two:
".dvc/cache/version/path relative to root dvc + checksum" - we might have problems with duplication. The same file might be presented in a few different locations of a repo. Just a fun fact - the initial version of dvc (0.8) worked just like you described but we decided to use a single cache dir. "Datasets synchronization" - yes, @ternaus you are using a single repo for many datasets. Could you please share your motivation? @villasv and @sotte, do you guys have any thought regarding this topic? |
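The duplication concern above can be illustrated with a minimal sketch of a single content-addressed cache, where the same file added from two locations is stored only once. The SHA-256 hash, the two-character fan-out directory, and the function names are assumptions made for illustration; this is not a description of DVC's actual cache layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path, cache_dir):
    """Copy `path` into `cache_dir` under a name derived from its content."""
    checksum = file_checksum(path)
    # Hypothetical layout: two-character fan-out directory, then the rest.
    target = Path(cache_dir) / checksum[:2] / checksum[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return target


def demo():
    """Add the same content from two locations; return both cache paths and the object count."""
    root = Path(tempfile.mkdtemp())
    cache = root / "cache"
    first = root / "project_a" / "train.csv"
    second = root / "project_b" / "copy_of_train.csv"
    for p in (first, second):
        p.parent.mkdir(parents=True)
        p.write_text("id,label\n1,cat\n")
    cached_first = add_to_cache(first, cache)
    cached_second = add_to_cache(second, cache)
    objects = [p for p in cache.rglob("*") if p.is_file()]
    return cached_first, cached_second, len(objects)
```

Because the cache key depends only on content, the path-based scheme's duplication problem does not arise; the trade-off is that the cache itself is no longer human readable, which is point 3.1 of this issue.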
Considering that DVC is a command line tool, I think S3/GCP/NFS/etc should still be the backing storage, which doesn't mean that their issues can't be addressed. I think Git is a role model in terms of "don't worry about what's inside 1.1 Git has the concept of submodules. Admittedly, I find it a huge PITA and avoid it like the plague, but it also shows that it's feasible to reuse projects by composition/linkage. I've never seen how it works internally though, so I'm not sure if it's really something possible to replicate. In fact, git submodules suck so much, IMO, that I'd slap anyone who suggested that DVC build on them. Assuming that importing data from another project's cache would be read-only, this might be simpler than it looks. 1.2 Listing all datasets requires us to disambiguate which files in the cache represent the same dataset, and probably be selective about versions as well. But once versioning is figured out, I think this is a given.
3.2 Yes I would like to have UI for that, just after saying that I want the cache directory flat. Nothing fancy, though. Just list all refs, an aggregate count of versioned files without refs, total size. A stakeholder-friendly version of this might be a web version with nice-to-know things, like GitHub's useless graph of my most active days on the week. 4 Diff is tricky. I've scripted my own, but it's something very opinionated and dataset comparison is specially complex because sometimes the base object of comparison is not the row, but rows identified by some ID (assuming the dataset is ordered, otherwise it may be just implausible). Also, diffing might change by file format. 5 Security is a hairy topic. It's achievable with S3 access policy bending, but that's not stakeholder-friendly. Maybe combining data reuse from point 1 and a UI for versions... though I can't see how one could achieve meaningful access control without deeply integrating with the remote storage chosen. My concerns:
|
Great feedback @villasv ! Thank you. You are a DVC expert and it looks like you have an opinion on how to organize projects with DVC. Can I ask you a general question? How would you organize projects for a team which has 20 datasets and 30 projects (ml, analytics, data processing)? One project can use many datasets. One dataset can be reused in many projects. Options that I see:
Which features in DVC are missing to support this scenario? Details about your feedback Totally agree with the cache directory structure and sync mechanism. Yes, we should think about projects composition/linkage. The question is - should each dataset be presented as a single module/repo (so, 20+ repos/datasets for a single team) or we should come up with a structure where a single repo can naturally fit all the datasets (1 repo). Re tags - this might require a separate tagging mechanism in addition to Git-tags. This is what I don't like but it might be the only solution. "Human readable cache". Right, shareholders visibility is a major motivation. The minor one - it might lower the entrance bar to dvc (good for newbies). Yeah, some UI would be great if we keep cache structure. "Diff" - agree. It might be enough just to have file counts (new, deleted, modified) and sizes. Regarding your concerns:
|
Ah, thanks for clarifying. I had the impression that versioning would also be targeting data derivatives. Indeed, that makes all my three concerns vanish, because input files are the root of the dependency trees so checking out a specific version of them can be done in such a way to keep everything consistent. I didn't mean to use git tags though, what I mean is that DVC could somewhat copy how git tags are done, which is basically a refs directory with alias files. E.g. a file named Q: How would you organize projects for a team which has 20 datasets and 30 of projects (ml, analytics, data processing)? If a group of projects code in scripts or data derivatives, I'd go for a mono-repo. I think that even if these shared only the dataset/sources I might go with a mono-repo, if those datasets/sources aren't very stable, because it's the safest way to ensure everyone is interpreting that data the same way. Separate repos would be fine if those projects are mostly code and data, or they derive from a "golden dataset" that changes never or very slowly. At my work, we have two repos using DVC but they share nothing directly (one is for curation and the other is the result of that curated data being used in production, so one affects the other, but the files are totally independent because the second has a database dump as source). In fact, the first of these projects - the one for curation - even uses the same repository for the NodeJS API that serves that content and other stuff like Elasticsearch mappings to index the derived data. Those are very coupled and I see no reason to separate them. Q: What could be different if I had a way to version datasets and share data between repositories? The first scenario (shared scripts and data derivatives) wouldn't change IMO, if I can't version and distribute derived data as well. I think the scenario of single-source-of-truth is the one that really benefits here. Why |
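The "refs directory with alias files" idea mentioned above might look roughly like the following sketch. Git stores a lightweight tag as a small file under `.git/refs/tags/` whose content is an object hash; here the same pattern is applied per dataset. The directory layout and the names `set_tag`, `resolve_tag`, and `list_tags` are hypothetical, not an existing DVC API.

```python
import tempfile
from pathlib import Path


def set_tag(refs_dir, dataset, tag, checksum):
    """Write an alias file refs_dir/<dataset>/<tag> containing the checksum."""
    ref = Path(refs_dir) / dataset / tag
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_text(checksum + "\n")
    return ref


def resolve_tag(refs_dir, dataset, tag):
    """Return the checksum that the alias file points to."""
    return (Path(refs_dir) / dataset / tag).read_text().strip()


def list_tags(refs_dir, dataset):
    """Return the sorted tag names of one dataset (empty if it has none)."""
    dataset_dir = Path(refs_dir) / dataset
    if not dataset_dir.is_dir():
        return []
    return sorted(p.name for p in dataset_dir.iterdir())
```

Since each dataset gets its own subdirectory, a tag such as `v1` can exist for several datasets without colliding, which is the property a global Git tag lacks.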
I think having a monorepo or 30 repos depends on other factors apart from dvc. Dvc could offer full non-duplicated support if we store all dvc files somewhere else such as ~/.dvc (user) or /var (global). Docker already does this by storing all images, and cache layers in /var, and i find that it works quite well |
I will give my impressions on your questions:
True, this is valuable. I suggest using a per-user or per-machine dvc directory (~/.dvc or /var/dvc). See comment above.
Simple to implement if everything is in one place
I dont like this very much. The version is the git commit hash + comment on the related *.dvc file. Use git for all your versioning/identification problems. Could be helpful to name the dvc cache files with something human readable (something like
Just git log related *.dvc file. If you want to store metadata such as size, or number of lines. do it in .dvc file. Its just a simple text file, all git tooling works beautifully.
git checkout branch/tag; dvc pull;
This would be very nice indeed. More metadata would be stored in .dvc files, but it is worth it! Maybe we could show a preview of the data such as the shape, columns and the first ten lines.
use git? i dont understand this. I usually have one dataset per .dvc file so it is very simple.
Yes. For big files S3 loses connection sometimes. Maybe there is a more robust s3 download/upload recoverable experience. Also i think most files are some form of tabular format. We could give special support to some formats (such as csv) and upload only deltas. This would be VERY NICE. Think
Just use S3/filesystem access control. Leave filesystem permissions to filesystems and we avoid a difficult subject.
If i think of anything new I will edit! |
@villasv thank you for the clarification! Please correct me if I misunderstand your point. You are saying that a global dataset repo\place is not a perfect solution because it does not provide code and data lineage (snapshots) and the environment becomes fragile. In this case, a mono-repo does not have such kind of problems as you said. If the above is correct, how about using "data repos" with versions or checksums like
|
@polvoazul great feedback! Awesome idea with per-user or per-machine dvc directory. And great analogy with Docker images. I kind of agree with you about Regarding the S3 connection loss - right, there is a corresponding issue #829. |
I wasn't favoring a mono-repo each with its own cache against having a global cache like polvoazul suggested, only against many-repos each with its own cache. And caching is not the argument here, but sharing and lineage. I didn't think about a central cache. At first glance, I have nothing against it... but it doesn't change much in the scenarios I've encountered so far. I'd still use mono-repos for projects that share data lineage because the transformations scripts are part of that lineage. I'm associating the word "repository" here with actual git repositories, I don't see a "dvc repository" dissociated from it. I haven't looked at dvc as a data registry solution so far, only as an integrated LFS/Annex that properly handles reproducibility. Data registry is a nice thing though. It seems that much of this thread is about making the cache into a proper registry. I think this can be done without actually changing anything about the cache inner structure. |
@villasv I agree, we will probably add support for a few cache locations like in #1450 . So that you could have your local cache at |
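As a rough illustration of what "a few cache locations" could mean in practice, the sketch below looks up an object by checksum across an ordered list of cache directories (for example project-local first, then per-user, then per-machine). The flat layout and the function name `find_in_caches` are assumptions for illustration only; the design in #1450 may differ.

```python
import tempfile
from pathlib import Path


def find_in_caches(checksum, cache_dirs):
    """Return the first existing cache object for `checksum`, or None.

    `cache_dirs` is searched in order, so earlier entries take priority.
    """
    for cache_dir in cache_dirs:
        candidate = Path(cache_dir) / checksum
        if candidate.is_file():
            return candidate
    return None
```

A lookup order like this would let a shared, read-only cache serve large datasets while new outputs still land in the project-local cache.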
@villasv yeah, data registry is a great term. Let's start using it. I've created a proposal based on the discussion. The proposal assumes that we keep dataset metainformation, such as dataset name and tags, in dvc-files. It also introduces a set of commands for manipulating datasets as a separate type of object, but in fact this is just syntactic sugar on top of data files. The proposed dataset API looks like a pretty reasonable solution that can bring DVC to the next level without breaking the distributed philosophy of DVC and Git, as many other dataset versioning solutions do. The only problem that the proposal does not solve is storage visibility. The cache is not human readable, which is probably not a surprise, and we should probably go with some additional UI (3.2.), possibly as an addition to the DVC project. The proposal: https://gist.github.com/dmpetrov/136dd5df9bcf6de90980cec22355437a Looking forward to your feedback, guys. Any comments are welcome. PS: @drorata you had some issues with file versioning and updating versions. You might be interested in participating in this discussion. |
It's a little hard to join such an in-depth discussion. I will try to share some of the thoughts that I had while reading the comments. For me, one of the most important concerns which DVC addresses is reproducibility. When it comes to datasets, reproducibility means IMHO the combination of what raw data was used and what transformations were applied. If some (processed) dataset is used by several projects, it becomes in some sense a project of its own, and thus it would make sense to keep it separately. The depending projects, which use this set as an input, would then tap to the same source and enjoy a fixed reference. I found The problem of diffs of datasets is somewhat enigmatic. If the dataset is binary one is rather lost and metadata associated to the set might be the only remedy. Personally, I like a lot the close coupling between DVC and git. If at some point, I realize that part of the project/data can be used by other projects, it means it has to be extracted into its own world. It is a bit like the flow when you code something in Jupyter, and then you realize it is used across different parts of the project, so you extract it as a module. Then you figure out it is actually useful in different projects, so you turn it into a package on its own. Along with this line, I don't understand how linking DVC to the user/machine is going to work; would not it make the link between the data and the project much less clear? @dmpetrov Thanks for pinging me 👍 |
Sorry I'm a bit late to the party. The topic is super interesting! Here are my 2 cents. What is a dataset? What is a version of a dataset? Often you get some data from some business unit and it's dumped in some bucket (if you're lucky).
This does not mean that one bucket corresponds to one "dataset". A "dataset" can consist of many buckets with different "versions" (the date part in the prev. example). A "dataset" can also expand (maybe next time I'll use some weather data to improve my ML model). And for a different project I'll use a different combination of parts and versions. Because it's not easy to define what a "dataset" is and because it changes depending on your problem
So far we've only talked about the "raw" data you get from the business side. But now the data transformation begins :) Q: The "raw dataset" has to be transformed and cleaned. Is this also tracked in the data registry? Should this be tracked in a sep. dvc/git project?
Storing Data
Regarding storage: I think it makes sense to store the data in a content addressable way (like git or IPFS).
git-submodules as data-projects
It feels right to import data as "modules". But I don't like git submodules. I just want to say that there is also git subtree as an alternative to git-submodules.
Misc and personal pet peeve
One thing I don't like about DVC is that newer datasets are supposed to replace old ones in order for |
@drorata Thank you for the feedback! I have a question for you about reproducibility. Does it make sense to keep the code as a part of a dataset (if it was generated by this code) or we can decouple these two parts (dataset and dvc project with code which consumes some dataset)? git-submodules seems like a good idea. But it has some limitations. We need to re-think this part. The diffs that I was talking about are not binary diffs - just size and number of files difference. I agree that diff, in general, is an enigmatic problem and I don't see any easy solution.
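A size-and-count diff of the kind described above could be sketched as follows. Each dataset version is assumed to be available as a manifest mapping a relative path to a `(checksum, size_in_bytes)` pair; that manifest format and the function name `summarize_diff` are hypothetical.

```python
def summarize_diff(old, new):
    """Summarize the difference between two dataset manifests.

    Each manifest maps a relative path to a (checksum, size_in_bytes) pair.
    """
    old_paths, new_paths = set(old), set(new)
    added = new_paths - old_paths
    deleted = old_paths - new_paths
    # A file counts as modified when it exists in both versions
    # but its checksum differs.
    modified = {p for p in old_paths & new_paths if old[p][0] != new[p][0]}
    old_size = sum(size for _, size in old.values())
    new_size = sum(size for _, size in new.values())
    return {
        "added": len(added),
        "deleted": len(deleted),
        "modified": len(modified),
        "size_delta": new_size - old_size,
    }
```

This deliberately ignores file contents and formats, so it sidesteps the row-level and format-specific questions raised earlier in the thread.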
The analogy with packaging systems is great. We just need to figure out how to connect the pieces. Important questions are:
|
Hey @scotte, it is great to see you in this discussion!
Ha... you have raised one of the most important questions. Would you mind if I ask you to answer your own question? :) What is your take on this?
I agree with “content addressable way” for storage format (this is how DVC works right now) as well as the common disbelief to git-submodules.
That is awesome feedback! This is what I meant in 2.3. How to checkout a specific version of a dataset in a convenient way? Your description/problem definition is much better. What would an ideal DVC API look like to solve the problem? At least at the idea level. PS: @tdeboissiere and @ophiry - guys, this discussion is getting more and more interesting and we need your opinions. |
I suspect that the notion of "data registry" is the entry sign to a slippery slope ending in some sort of yet-another-data(base)-storage-solution. To me, it is crystal clear that decoupling the data from a project would mean that when trying to reuse the same dataset for a different project, it would turn out it has to go through a major (2nd order) preprocessing before it could be used. And what would you do then? Extract this "new" dataset into its own entry in the registry? I think that coupling the code and the data is crucial, and failing to do that undermines the ability to reproduce results. Each project has its own needs, and the very same data source (e.g. some database) would be used in different ways in different projects. I would say that the starting point of each project is one or more data sources which are external from the data scientist/ML /etc. standpoint. Transforming this raw data into something you can work with is part of the project. If at some later point in time, one realizes that the same raw source can be used for two different projects using the same pipeline of transformations, then, it makes sense to extract this pipeline along the relation to raw data source into its own data solution. In other words, create a flow which takes the tables, for instance, apply the needed transformations/joins/etc. and save it into a new table. By the way, in many cases, I would say that the entry point of the project is not a static dataset, but a code which extracts the data from some source/database. DVC, in my mind, helps to persist this point in time when the data is taken from the source and used by the project, for future reference. It becomes harder for me to distinguish between some data-storage solution and the direction implied by this thread. What is the problem which is being solved here? 
Regarding git-submodules, I am using them in a production solution, but I don't have the mileage to judge whether they are a viable solution (assuming I know the problem). |
FWIW anaconda is building its cataloging package and it would be great to have 2.1, 2.2 and 4 indeed. |
@Casyfill thanks for joining this discussion! :) Quick question, just to get a better sense of your thought process. 2.1, 2.2, 4 (assigning/listing tags and diffs for datasets). In what scenario would you use them? To manage sources, external data that is used to start the project? Or some intermediate artifacts/end results - models, preprocessed data, etc.? Do you think about doing this within a project, or to share/provide visibility into different versions for other people in your team? @drorata thanks! That is a really great and deep answer! First of all, I'd like to clarify that by datasets management we actually mean "any data artifact" management. From the DVC perspective any artifact is just a file or a directory. So, to be precise, dataset == input data (in some cases it's a snapshot indeed), intermediate/processed data, end results - models, reports. It feels like there are at least two broad segments we are trying to touch here:
Guys, what do you think about this split and what kind of problems described have you seen in your workflow? Any feedback please :) |
Exciting discussion ! I do not feel qualified to answer all of the use cases given that my team has a very specific workflow. Instead I will briefly give some context on how we work, what solutions we came up with and what we think about @dmpetrov 's points. Team description and requirements:
How we use DVC
Questions 1.1 Reusage:
1.2 List all datasets
So it is easy to list all datasets.
2.1 Assign a label to a specific dataset
This isn't something we really need at the moment (recall our datasets usually change very little over time). OTOH, I would implement it along the lines of:
dvc run-and-tag v1
git commit
dvc run-and-tag v2
git commit
# Need to revert to v1 of the pipeline !
dvc show-commit v1
git checkout $(dvc show-commit v1)
dvc checkout
2.2 See list of versions/tags/labels for a dataset
Should be straightforward.
2.3 How to checkout a specific version of a dataset in a convenient way?
Again, not something my team really needs at the moment as our datasets are very stable. On topic: I recall from earlier discussions that if I do something like:
Then the output of the first dvc command is removed from the remote/cache. This is not ideal if in the end I decide to use the first command and would like to use its output right away without relaunching it 2.4 Getting data without Git Not an issue we face. 3.1. Human readable cache would be great.
Bonus: As mentioned by others, I think accessibility should not be handled by DVC itself apart from a bit of syntactic sugar (for instance, assuming ownership groups can be created on the remote storage, allow the |
@dmpetrov , @efiop I think we can support Human readable cache in object storage that supports some kind of "linking" or "referencing" mechanism. For example, with SSH, we could have There's also a hacky implementation for S3: https://stackoverflow.com/questions/35042316/amazon-s3-multiple-keys-to-one-object It looks like we already have some information that we can work with: https://github.com/iterative/dvc/blob/master/dvc/remote/local.py#L619-L623 NOTE: I'm actually not 100% sure on this one, but just leaving it here for reference |
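On a local or network filesystem, the "linking" idea above can be sketched as a directory of symbolic links with readable names that point at checksum-named cache objects. The flat cache layout, the `names` mapping, and the function name `build_readable_view` are assumptions for illustration; object stores such as S3 have no direct symlink equivalent, and `os.symlink` may need extra privileges on Windows.

```python
import os
import tempfile
from pathlib import Path


def build_readable_view(cache_dir, view_dir, names):
    """Create view_dir/<readable path> symlinks pointing at cache objects.

    `names` maps a readable relative path to the checksum of its content.
    """
    for readable, checksum in names.items():
        link = Path(view_dir) / readable
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink() or link.exists():
            link.unlink()  # refresh a stale link
        os.symlink(Path(cache_dir) / checksum, link)
```

The cache keeps its content-addressed structure, and the view can be rebuilt or deleted at any time without touching the data.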
The most difficult part of introducing this new dataset experience is to align it with DVC philosophy. A couple of quotes from @drorata:
After collecting more feedback offline and discussing implementation details, we came up with a solution that improves the dataset experience without breaking DVC fundamentals. ✅ New
This idea was initially mentioned in the issue description and @villasv emphasized on the idea:
Also, some random guy from the Internet in a recent Hacker News discussion mentioned this:
✅ Introduce This command covers (4).
✅ Introduce
❌ I'd suggest keeping semantic versioning (like 2.1.8) outside the scope of this issue because the concept of versioning does not align well with the Git philosophy. We should remember that Git is actually not a versioning system; it is just "the stupid content tracker" (please check
❌ It looks like a human readable cache (3.1) is not something that we can solve. There are some potential solutions for specific file storages, as @MrOutis mentioned, but they do not sound like a generalizable approach. On the other hand, the proposed solution with the new tag and pkg experience improves human readability through dvc commands, which might be an even better approach for developers.
Next steps
So this is the plan to improve the dataset storage experience and close the current issue:
I'll be marking the items when they are done. What do you guys think about this solution? Do you see any concerns or potential pitfalls? |
I have one question. Let's assume for a second that |
@drorata Good question! Git tag is global - it marks all files in a repository. DVC tag is local (per data file\dir) - it is specific to a data artifact. Thus, using Git-tags you quickly pollute the namespace of tags and can easily get lost on which tag belongs to which dataset\model. DVC tag localizes this tagging experience in a data file level. It can easily answer questions like:
DVC tag simplifies Another important reason is optimization. Ideally, we should not use Git-history to get a data file (checksum from dvc cache) with a specific tag. It is important for the model deployment scenario (2.4) when there is no access to Git, only to files (from HEAD). With custom tags we can aggregate this info in dvc-files or keep it separately somewhere like |
Assuming still that git is available, one can easily use the provided tags mechanism without polluting anything by merely adhering to some conventions/best-practices. For example, something like a prefix I can imagine the setting when git is not available is realistic, but in that case and to that end, tagging is a lighter issue. Isn't it? |
@drorata I agree with you, in many cases, git-tag conventions\best-practices are enough. Usually, it means that people use a repository per problem\model which includes 1-2 input datasets and a reasonable number of experiments (within a couple of dozen). I'd even say this is best practice for DVC. Git-tags are not enough in mono-repo scenarios when "People tend to use a single DVC repo for all their datasets" (from the issue description). A close-to-real example - a single git\DVC repo with ~10 datasets and ~5 separate projects inside the repo. The projects might reuse datasets (image-net is reused by 3 projects). Datasets and models are evolving. Some datasets change on a bi-weekly basis. In this kind of setting, you can quickly end up with a few hundred git-tags. |
I am not familiar with dvc's vision/roadmap, but mono-repo for data is indeed something else and I was not aware that it is something dvc is aiming at. |
@drorata we see that a significant number of users use DVC this way. And some companies prefer a mono-repo over a set of repositories. |
If I understand correctly, this is a rather different use case. Won't some artifact managing solution be a natural choice for this case? I have maven in mind, but others might be better candidates. Or am I missing something? |
Maven is a good analogy but not because of the new With Can we utilize existing tools for these use cases? Probably not, because systems from the industry are mostly focused on code files while DVC has a different, data-file-centric model. Systems from academia like WDL are too abstract and haven't had enough traction so far (we would spend more time on integrating and supporting their API than on actual work). Is it a good idea to implement this kind of module\library\package scenario in a single tool? On the one hand, Java (language), Ant (build system), and Maven (dependencies) are separate projects. On the other hand, these projects were created in different epochs. When Java and Ant were created there was no urgent need for Maven. I think it is a good idea to develop all the pieces of the ecosystem together as a single tool\language\project. Modern languages (like Go) are pursuing this approach. |
@drorata just an example: https://discordapp.com/channels/485586884165107732/485596304961962003/550094574916337669
This is a regular question in our Discord channel. |
@dmpetrov just to clarify, which one of the commands solves "2.4. Ability to get a dataset (with specified version) without Git." and what will the interface for that look like? A global version of |
@shcheklein I mentioned briefly that git submodules won't work partially because of (2.4). I expect (2.4) to be a part of the |
How about using git-lfs as a remote storage? Why is that not possible? |
It seems that I had managed to stay silent, and DataLad wasn't mentioned. FWIW I think all desired (and many others) use cases for data access, management, etc can be instrumented via git submodules + tags, git-annex and/or straight datalad API add interface to those. FWIW LFS is too tight to git, too centralized, doesn't support removal from it. FWIW git annex (and thus DataLad) supports LFS as a special remote. See http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html?highlight=LFS#use-github-for-sharing-content |
There have been many requests related to dataset storage, which might require a redesign of DVC internals and the CLI API. I'll list the requirements here in the issue description. It would be great to discuss possible solutions in the comments.
1.1. Reusage. How to reuse these datasets from different projects and even repos?
1.2. List all datasets.
2.1. Assign a version/tag/label like 1.3 to a specific dataset. Git tag won't work since we don't need a global tag for all files.
2.2. See list of versions/tags/labels for a dataset.
2.3. How to checkout a specific version of a dataset in a convenient way?
2.4. Ability to get a dataset (with specified version) without Git. ML model deployment scenario when Git is not available in production servers.
3.1. Human readable cache would be great, so that a manager can see datasets and models through the S3 web interface.
3.2. If 3.1. is not possible - some UI is needed.
Bonus question:
The list can be extended.
UPDATE 1/15/19: Added 2.4.