DVC for Data-Centric AI #6455
-
I think the gist of the data-centric view is that data is as fluid and transient as models, maybe more so. In the model-centric view, we treat data as static and models as dynamic. DVC's experimentation features make this assumption. If data is more dynamic than it appears, what do we need?
Data Collection
-
Data Verification
-
Data / Concept Drift
-
Artificial intelligence and machine learning techniques can be described along several axes. Model-centric versus data-centric is one of them.
Model-centric AI is about finding better hyperparameters, models, and training methodologies, whereas data-centric AI works on the data, labels, and related processes to achieve a better dataset.
According to Andrew Ng, model-centric AI receives around 99% of the research attention, whereas most of the improvement in AI systems is expected to come from the data side. Data is like the ingredients and models are like the utensils for cooking, but we focus extensively on utensils and treat the work on ingredients as a side topic. Cooking the best meals requires the best ingredients more than the best kitchenware.
DVC can help achieve the goals of data-centric AI with its current feature set, and with some improvements it can become the tool for it.
We may need to emphasize or streamline the user experience for the following points:
Data collection: All ML production systems rely on data collected from various sources: sensors, user activity, user data, etc. From time to time these may fail. Sensors may fail and send no data at all. Systems may fail and return data in an unexpected format. Software systems are so complex that a change in one part (e.g. the UI) may affect data collection in unforeseen ways. These are all problems that DVC may solve.
Data validation: The data used to train the model may have certain attributes. A particular data distribution, a lighting condition for images, a level of background noise for audio, or a set of required fields for structured data may be implicit in the training data. Data in production systems must be validated against these assumptions. DVC pipelines set up in the training phase can also be retrofitted to validate production data, by checking whether the production data conforms to various metrics extracted during training.
Data drift: Some data has seasonality (e.g. fashion-related or health-related data); user preferences change and the data from user activity follows; data collection systems change and data quality follows (e.g. image resolution, audio quality); new attributes emerge (e.g. in security-related data); etc. All of these affect model performance and should be addressed in AI systems.
Concept drift: Concepts may change over time. The concept of a book used to be tied to dead-tree documents; now we have books that are only digital. The concept of a wedding involved one bride and one groom; now we have weddings with two brides or two grooms, etc. This change doesn't happen as fast as data drift but requires a similar level of attention.
Fairness: AI systems shouldn't behave differently for different demographics. New demographics may emerge, or previously neglected groups may become more salient over time.
Model decay: The above factors and technological changes cause model decay. Models no longer perform as well as their freshly trained versions. Production models require periodic retraining with fresh data.
Labeling: This is probably the major feature DVC lacks as of now. The other aspects are mostly related to pipelines, versioning, and experimentation, but data collection activities (e.g. direct labeling) or advanced labeling techniques (e.g. active learning) require a labeling facility. Studio seems to me the most suitable project for this feature.
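To make the data validation and data drift points a bit more concrete, here is a minimal sketch of the kind of check a pipeline stage could run: extract reference statistics from the training data, then test a production batch against them. Everything in it is an assumption for illustration, not an existing DVC feature: the field names (`user_id`, `age`), the function names, and the thresholds are hypothetical, and the "shift" score is just the distance between means measured in training standard deviations, a crude signal rather than a proper drift test.

```python
import statistics

# Hypothetical required fields for a structured dataset (illustration only).
REQUIRED_FIELDS = ("user_id", "age")


def extract_reference(rows, field):
    """Summarize a numeric field of the training data (needs at least 2 rows)."""
    values = [row[field] for row in rows]
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


def validate_batch(rows, reference, field, max_missing_rate=0.05, max_shift=3.0):
    """Return a list of human-readable problems found in a production batch."""
    if not rows:
        return ["batch is empty (data collection may have failed)"]

    problems = []

    # Validation: every row should carry all required fields.
    missing = sum(
        1 for row in rows if any(row.get(name) is None for name in REQUIRED_FIELDS)
    )
    missing_rate = missing / len(rows)
    if missing_rate > max_missing_rate:
        problems.append(f"{missing_rate:.0%} of rows miss a required field")

    # Crude drift signal: how far the batch mean moved from the training mean,
    # expressed in training standard deviations.
    values = [row[field] for row in rows if row.get(field) is not None]
    if values and reference["stdev"] > 0:
        shift = abs(statistics.fmean(values) - reference["mean"]) / reference["stdev"]
        if shift > max_shift:
            problems.append(f"mean of '{field}' shifted by {shift:.1f} stdevs")

    return problems


if __name__ == "__main__":
    training = [{"user_id": i, "age": a} for i, a in enumerate([20, 30, 40, 50])]
    reference = extract_reference(training, "age")
    production = [
        {"user_id": 1, "age": 90},
        {"user_id": 2, "age": None},
        {"user_id": 3, "age": 95},
    ]
    for problem in validate_batch(production, reference, "age"):
        print(problem)
```

In a setup like this, the reference dictionary could be written to a JSON file by the training pipeline and tracked alongside the model, so the same numbers are available when production data arrives. How much of that workflow DVC should streamline is one of the open questions.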
I'll add some points of discussion / open questions below.
@dberenbaum @shcheklein