DVC for Data-Centric AI #6455

iesahin · 2021-08-17T13:00:13Z

Artificial intelligence and machine learning techniques can be characterized along several axes. The distinction between model-centric and data-centric approaches is one of them.

Model-centric AI is about finding better hyperparameters, models, and training methodologies, whereas data-centric AI works on the data, labels, and related processes to arrive at a better dataset.

According to Andrew Ng, model-centric AI receives around 99% of the research, whereas most improvements in AI systems are expected on the data side. Data is like the ingredients and models are like the cooking utensils, yet we focus extensively on the utensils and treat the ingredients as a side topic. Cooking the best meals requires the best ingredients more than the best kitchenware.

DVC can help achieve the goals of data-centric AI with its current feature set, and with some improvements it can become the tool for it.

We may need to emphasize/streamline the user experience for the following points:

Data collection: All ML production systems rely on data collected from various sources: sensors, user activity, user data, etc. From time to time these fail. Sensors may break down and send no data at all. Systems may fail and return data in an unexpected format. Software systems are complex enough that a change in one part (e.g. the UI) may affect data collection in unforeseen ways. These are all problems that DVC may help solve.
Data validation: The data used to train a model carries implicit assumptions: a particular data distribution, a lighting condition for images, a level of background noise for audio, a set of required fields for structured data. Production data must be validated against these assumptions. DVC pipelines set up in the training phase can be retrofitted to validate production data, by checking whether it conforms to the metrics extracted during training.
Data drift: Some data has seasonality (e.g. fashion-related data, health-related data), user preferences change and data from user activity follows, data collection systems change and data quality follows (e.g. image resolution, audio quality), new attributes emerge (e.g. security-related data), etc. All of these affect model performance and should be addressed in AI systems.
Concept drift: Concepts may change over time. The concept of a book used to be tied to dead-tree documents; now we have books that are only digital. The concept of a wedding involved one bride and one groom; now we have weddings with two brides or two grooms. This kind of change doesn't happen as fast as data drift but requires a similar level of attention.
Fairness: AI systems shouldn't behave differently for different demographics. New demographics may emerge, or previously neglected groups may become more salient over time.
Model decay: The factors above, along with technological changes, cause model decay: models no longer perform as well as their freshly trained versions. Production models require periodic retraining with fresh data.
Labeling: This is probably the major feature DVC lacks as of now. The other aspects mostly relate to pipelines, versioning, and experimentation, but data collection activities (e.g. direct labeling) and advanced labeling techniques (e.g. active learning) require a labeling facility. Studio seems to me the most suitable project for this feature.
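
To make the data validation point above more concrete, here is a minimal sketch: summarize the training data once, store the summary alongside the model, and compare each production batch against it. This is not an existing DVC feature. `build_profile`, `validate_batch`, the profile layout, and the 3-standard-deviation tolerance are all hypothetical choices made for illustration.

```python
import json
import statistics


def build_profile(rows, numeric_fields):
    """Summarize training rows: required fields plus mean/stdev per numeric field."""
    profile = {"required": sorted(rows[0].keys()), "stats": {}}
    for field in numeric_fields:
        values = [row[field] for row in rows]
        profile["stats"][field] = {
            "mean": statistics.fmean(values),
            "stdev": statistics.stdev(values),
        }
    return profile


def validate_batch(rows, profile, tolerance=3.0):
    """Return a list of human-readable problems found in a production batch."""
    problems = []
    for index, row in enumerate(rows):
        missing = [f for f in profile["required"] if f not in row]
        if missing:
            problems.append(f"row {index}: missing fields {missing}")
    for field, stats in profile["stats"].items():
        values = [row[field] for row in rows if field in row]
        if not values:
            continue
        batch_mean = statistics.fmean(values)
        # Assumed rule: flag the batch when its mean is more than `tolerance`
        # training standard deviations away from the training mean.
        if abs(batch_mean - stats["mean"]) > tolerance * stats["stdev"]:
            problems.append(f"{field}: batch mean {batch_mean:.2f} is outside the training range")
    return problems


# Hypothetical training data: image brightness values and labels.
training = [{"brightness": b, "label": "cat"} for b in (0.48, 0.50, 0.52, 0.49, 0.51)]
profile = build_profile(training, numeric_fields=["brightness"])
profile_json = json.dumps(profile)  # could be written to a file that a pipeline tracks

good_batch = [{"brightness": 0.50, "label": "cat"}, {"brightness": 0.49, "label": "dog"}]
dark_batch = [{"brightness": 0.05, "label": "cat"}, {"brightness": 0.07}]

good_problems = validate_batch(good_batch, json.loads(profile_json))
dark_problems = validate_batch(dark_batch, json.loads(profile_json))
```

Here `good_problems` is empty, while `dark_problems` reports both the missing `label` field and the brightness shift. A real profile would likely track more than the mean, but the training-time/production-time split would stay the same.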

I'll add some points of discussion / open questions below.

@dberenbaum @shcheklein

iesahin · 2021-08-19T11:18:13Z

Author

I think the gist of the data-centric view is that data is as fluid/transient as models, maybe more so. In the model-centric view, we consider data static and models dynamic. DVC's experimentation features make this assumption.

If data is more dynamic than it appears, then what do we need?

Data Collection

Data versioning: We have this for unstructured data, but versioning structured data is another problem. We may forgo database versioning, as it probably has mature tools, but we can keep track of versions of pandas datasets, numpy npz files, pickled data, etc. We also need more support for treating zipped files as directories.
Data input plugins: e.g. downloading the results of a YouTube query, a web search, a Twitter feed, etc. directly into a folder. (There may be legal restrictions, but as a feature we can support this.) These are extensions to the current dvc import and dvc get facilities. We can have more of these, and I think some of them can be monetized.
On the data input/output front, we can aim to be an IFTTT for data. This may not always coincide with our "Git for data" goal, but it will probably become more relevant as cloud-based features are emphasized.
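
One way to think about versioning structured data, as raised in the first point above, is to fingerprint the table content rather than the file bytes, so that the same rows always map to the same version id regardless of how they were serialized. The sketch below uses only the Python standard library. `table_fingerprint` is a hypothetical helper, not a DVC API, and treating row order as irrelevant is an assumption that will not fit every dataset.

```python
import hashlib
import json


def table_fingerprint(rows):
    """Hash a list of dict rows so that equal content gives an equal digest."""
    # Serialize each row canonically (sorted keys), then sort the rows so the
    # digest ignores row order. Both choices are assumptions of this sketch.
    canonical_rows = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical_rows:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


v1 = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]
v1_reordered = [{"price": 12.0, "id": 2}, {"id": 1, "price": 9.5}]
v2 = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.5}]

same_version = table_fingerprint(v1) == table_fingerprint(v1_reordered)
new_version = table_fingerprint(v1) != table_fingerprint(v2)
```

The reordered table gets the same fingerprint as `v1`, while the changed price in `v2` produces a new one. Rows loaded from a pandas dataset or an npz file could be fed through the same function after conversion to plain dicts.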

1 reply

dberenbaum Aug 19, 2021
Collaborator

A lot of these seem like good ideas for other potential products on top of DVC. It might be worth discussing some of these with @volkfox, particularly with regards to data input/output.

iesahin · 2021-08-19T11:43:39Z

Author

Data Verification

We can add predefined filters for data to be used in pipelines. These can be defined in dvc.yaml and test the data for validity. Since stages have an outs section, we can check the output data in certain respects. The pipeline can raise alarms if something goes wrong.
Tests/filters defined during experimentation can also work in production. Data development can mean determining the filters and tests for the data: during development we learn the properties of the data, integrate them into the pipeline, and use them in production. We can emphasize this.
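
A minimal sketch of the idea above: named checks are declared once during development and then reused unchanged in production. Everything here is hypothetical (the `FILTERS` table, `run_filters`, and the field names). It only assumes that a pipeline stage running such a script treats a non-zero exit code as a failure.

```python
import sys

# Each filter is a name plus a predicate over one record (hypothetical rules).
FILTERS = {
    "has_user_id": lambda record: bool(record.get("user_id")),
    "age_in_range": lambda record: 0 <= record.get("age", -1) <= 120,
    "known_country": lambda record: record.get("country") in {"TR", "US", "DE"},
}


def run_filters(records, filters=FILTERS):
    """Return {filter_name: [indexes of failing records]} for failed filters only."""
    failures = {}
    for name, predicate in filters.items():
        bad = [i for i, record in enumerate(records) if not predicate(record)]
        if bad:
            failures[name] = bad
    return failures


def main(records):
    """Print failures to stderr and return an exit code a pipeline stage could act on."""
    failures = run_filters(records)
    for name, indexes in failures.items():
        print(f"ALARM {name}: failing records {indexes}", file=sys.stderr)
    return 1 if failures else 0


records = [
    {"user_id": "u1", "age": 34, "country": "TR"},
    {"user_id": "", "age": 250, "country": "US"},
]
failures = run_filters(records)
exit_code = main(records)  # a script would pass this to sys.exit()
```

Keeping the filters in one table means the same definitions can be imported by both the experimentation code and the production check, which is the reuse the comment argues for.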

0 replies

iesahin · 2021-08-19T11:57:06Z

Author

Data / Concept Drift

Statistically, drift means that the data distribution at training time no longer matches the data in production. A new category of films emerges and our recommendation system doesn't know about it. A new kind of apparel is produced and named after an existing kind, but it has new features and only certain groups use it. A pandemic happens, we all wear masks, and face detection stops working, etc.
We can use statistical filters to detect these. The input distribution (dependencies in dvc.yaml) and the output distribution (outs in dvc.yaml) should have a certain relationship. For example, the number of bodies detected within a time period should be roughly equal to the number of faces detected. If for some reason (masks?) our face detection system fails, we need to be aware of it in production.
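
Both ideas above can be sketched with the standard library alone. The first function compares the category distribution seen at training time with the one seen in production using total variation distance (one of several possible statistics, chosen here for simplicity). The second checks the bodies-to-faces relationship from the example. The function names and thresholds are hypothetical.

```python
from collections import Counter


def total_variation(train_labels, prod_labels):
    """Total variation distance between two categorical samples, in [0, 1]."""
    train_counts, prod_counts = Counter(train_labels), Counter(prod_labels)
    categories = set(train_counts) | set(prod_counts)
    return 0.5 * sum(
        abs(train_counts[c] / len(train_labels) - prod_counts[c] / len(prod_labels))
        for c in categories
    )


def faces_match_bodies(n_bodies, n_faces, tolerance=0.2):
    """True when the face count stays within `tolerance` (relative) of the body count."""
    if n_bodies == 0:
        return n_faces == 0
    return abs(n_faces - n_bodies) / n_bodies <= tolerance


# A genre that did not exist at training time shows up in production.
train_genres = ["drama"] * 50 + ["comedy"] * 50
prod_genres = ["drama"] * 40 + ["comedy"] * 40 + ["new_genre"] * 20

drift = total_variation(train_genres, prod_genres)
drift_detected = drift > 0.1  # threshold is an assumption of this sketch
masks_suspected = not faces_match_bodies(n_bodies=100, n_faces=35)
```

With these inputs both checks fire: the unseen genre pushes the distance above the threshold, and the face count falls far below the body count. In practice the thresholds would have to be tuned per dataset.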

0 replies
