Data Summary and Quality description spec #909
Replies: 19 comments
-
See https://github.com/frictionlessdata/data-quality-spec. This was extracted out of the first version of goodtables, into a form that can be used in any code or just used as a reference. This extracted form of the data quality spec is now used as-is in the updated goodtables codebase, and I'd love to get some more feedback on the spec itself and see more potential use cases.
-
Carrying over suggestion from #324: Add a quantitative metric for "data points" counts. The idea would be to allow for a better measure of portal-wide data quantity than the current favourite of "total number of datasets or resources". This would be most useful per-resource and perhaps per-dataset. The definition of a datapoint would presumably be non-empty values in columns indicated to contain data.
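To make the suggestion concrete, here's a minimal sketch of that definition (the function name, sample data, and column names are illustrative only, not part of any spec): count non-empty cells in the columns indicated to contain data.

```python
import csv
import io

def count_data_points(csv_text, data_columns):
    """Count non-empty cells in the columns designated as carrying data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    for row in reader:
        for col in data_columns:
            value = (row.get(col) or "").strip()
            if value:
                count += 1
    return count

sample = "city,population,notes\nBerlin,3645000,\nParis,,capital\n"
print(count_data_points(sample, ["population", "notes"]))  # 2: one population, one note
```

Summing this per-resource would give the portal-wide quantity measure described above.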
-
Some other potential values:
Really like this idea. Would the values be published in a Readme.md, or a separate data-quality.csv file in the data package?
-
@rufuspollock @pwalsh I'd like to make a start on this. I'm thinking:
Would a file like Am I on the right track? One problem is that the
-
cc @roll @Stephen-Gates. We've been introducing such quality checks in goodtables in the last weeks, so I'm cc'ing @roll in case he sees a crossover and how this could tie in.
-
@rufuspollock @Stephen-Gates @pwalsh
-
@MAliNaqvi I can see benefits in both approaches. If csv:
If json:
The spec at present has options to support data inline, in a path, or at a URL, so I'm sure we could cater for a few options here.
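Since both formats have benefits, a tool could produce both from the same result. A hypothetical sketch (the measure names and dict shape are made up for illustration, not taken from any spec): emit JSON for machine consumption and CSV for spreadsheet users.

```python
import csv
import io
import json

# Hypothetical quality measures for one resource (keys are illustrative only)
measures = {"rows": 1200, "data-points": 5400, "validation-errors": 3}

# JSON form, e.g. for embedding in a descriptor or a sidecar file
json_text = json.dumps(measures, indent=2)

# CSV form, friendlier for non-technical users opening it in a spreadsheet
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["measure", "value"])
for name, value in measures.items():
    writer.writerow([name, value])
csv_text = buf.getvalue()

print(json_text)
print(csv_text)
```

Generating both from one source keeps the two views consistent rather than forcing a choice.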
-
great point. that's reason enough. imho json means "you are looking where you're not supposed to" for non-technical people.
-
I'm happy to draft something if I can get a "steer" from @rufuspollock, @pwalsh, or @roll on the general direction to take. I think you could capture at least 6 types of measures:
I was speaking with others yesterday about whether Data Curator should allow you to publish a data package with data that doesn't validate against the schema. We ended up letting the publisher decide to publish, but adding the residual validation errors to the

Looking forward to your thoughts 🤔
-
OK, I've made a start on a Data Quality Pattern. It is still a work in progress but probably good enough to get some feedback and work out if I'm going in the right direction.
-
@Stephen-Gates great to get this started. Comments:
User stories:

- As a Consumer of a Dataset, I want to know how many rows there are without having to load and parse the whole dataset myself, so that I can display this information to others or ...
- As a Publisher or Consumer of a Dataset, I want to know what validation errors there are with this dataset without having to validate it myself, so that [as a Consumer] I know what issues to expect, or [as a Publisher team member] I know what I need to fix.

Note: we already have a data validation reporting format in the form of the goodtables stuff. I've also just opened this epic about unifying validation reporting in a consistent way: https://github.com/frictionlessdata/implementations/issues/30.
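The "row count without parsing" story implies the publisher computes the count once and records it in the quality metadata. A minimal sketch of the producer side (the function name and interface are illustrative, not from any spec), streaming the CSV so memory use stays constant:

```python
import csv
import io

def count_rows(csv_text, has_header=True):
    """Stream through the CSV once to count data rows, without holding it in memory."""
    n = sum(1 for _ in csv.reader(io.StringIO(csv_text)))
    return n - 1 if (has_header and n > 0) else n

# A publisher runs this once at publish time and records the result in the
# quality metadata, so consumers never need to parse the file themselves.
print(count_rows("a,b\n1,2\n3,4\n"))  # 2 data rows
```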
-
@rufuspollock happy for a quality assessment tool to produce JSON, and then produce a CSV. What I wrote is focussed on sharing the data quality measures with others. Happy to have a consistent approach between data validation and quality assessment, so thanks for the Epic. Requirements are in the document, just not written in user story format.
I'm not sure about your statement,
That's what I've proposed. Unless you mean you want to assess everything in the data package at once and place all the results in one measurement file? I'm not sure this works, as different types of data in the same package could be measured by different metrics (e.g. spatial vs tabular). I started in HackMD but created a repo to help me think things through, split up a document that was becoming too big, and provide examples.
-
HackMD version of pattern from GitHub - https://hackmd.io/s/BJeKJgW8G ✏️ Comments and edits are welcome
-
Just an FYI: hackmdio/codimd#579
-
Thanks @patcon @rufuspollock. The Data Quality Pattern was posted to HackMD on your advice. Are you considering alternate platforms going forward given hackmdio/codimd#579? Happy to collaborate on https://github.com/Stephen-Gates/data-quality-pattern if people aren't happy with HackMD.
-
I wonder if it's worth folding in the ideas in #281
-
No pressure to bikeshed the tool for this specific doc :) Just recalled introducing it to someone here, and wanted to ensure folks at okfn had full context going forward
-
Added user stories https://github.com/Stephen-Gates/data-quality-pattern/blob/master/user-stories.md
Stories include reporting validation results
-
See also this recent pandas discussion: pandas-dev/pandas#22819
-
This would be a spec for describing summary information about data -- often on a per field (column) or per resource (table) basis. Things like:
Quality side:
Note: I think we probably have a proto version of this spec, at least for reporting errors against a schema, in the goodtables library.
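As a rough illustration of the per-field summary side (the statistic names and function here are made up for illustration, not part of any spec), a sketch that computes a few column-level measures in one pass:

```python
import csv
import io

def summarize_fields(csv_text):
    """Per-column summary: non-null count, null count, and distinct value count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    stats = {}
    for row in reader:
        for col, raw in row.items():
            s = stats.setdefault(col, {"non_null": 0, "null": 0, "distinct": set()})
            value = (raw or "").strip()
            if value:
                s["non_null"] += 1
                s["distinct"].add(value)
            else:
                s["null"] += 1
    # Report distinct counts rather than the raw value sets
    return {
        col: {"non_null": s["non_null"], "null": s["null"], "distinct": len(s["distinct"])}
        for col, s in stats.items()
    }

sample = "city,country\nBerlin,DE\nMunich,DE\nParis,\n"
print(summarize_fields(sample))
```

A spec would standardize the names and placement of such measures; the computation itself is the easy part, as data-warehouse profiling tools have long demonstrated.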
Background
We have talked about this for years - in fact all the way back to early versions of CKAN.
It is common practice to acquire and present this data in data analytics and data warehouse applications.