This repository has been archived by the owner on Dec 25, 2021. It is now read-only.

[epic] Exemplar and test data files and data packages #28

Closed
12 of 13 tasks
rufuspollock opened this issue Dec 31, 2017 · 22 comments

Comments

@rufuspollock
Contributor

rufuspollock commented Dec 31, 2017

Creating a single unified source of test and exemplar data for the frictionless data community.

UPDATE: we need two repos, one for test data (packages) and one for exemplar data (packages):

Hackmd for drafting READMEs etc:

https://hackmd.io/CYBgnBBGCMCmC0BjYwBm8AsIQA56TFQVTAFZTUcBmCWAQwDYg===

User Stories

Thinking about requirements for example (exemplar and test) data packages:

  1. As a Tutorial writer I want a set of data files and data packages I can use in my tutorials so that I can embed them and point users to them to play with themselves
  2. As a Developer writing a library I want to have a set of standard test data files and data packages as a reference for my implementation tests
  3. As a new Publisher of data packages I want to see examples that I can copy and use so that I can move quickly and understand what is involved
  4. As a Consumer of data packages I want to see some examples for use

My sense is that the "exemplar" and "test" use cases are somewhat different. 1+3+4 are exemplar and want "nice" data packages. 2 (+1) are more about testing: they need to cover the real range of situations and be super simple to use in tests.

I think what we should focus on to start with is the test (library developer) case.

TODO: create a separate issue for exemplar data files and packages.

Comment: we probably want versioning and the ability to use the repo as a git submodule, so that users of the test data can pin the data they are developing against (e.g. if the data package spec gets upgraded they can still keep old spec versions if they need them).

Acceptance criteria

  • A repo exists at fd/test-data (or similar) with common data used by tests
  • It contains common test data
    • Data files/resources for reading, e.g. CSV, XLS etc. (stuff for e.g. tabulator)
    • Data files/resources for validating
    • Data packages (esp metadata structure)
  • It has a README explaining versioning policy either via branches or by directories
  • At least one "client" repo is submoduling this data for its testing
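
A minimal sketch of how a "client" repo might read pinned test data, assuming the test-data repo is checked out as a git submodule and versioned by directory. The directory layout (`<root>/<spec version>/<package>/datapackage.json`) and the function names are hypothetical, since the versioning policy is not decided here.

```python
import json
from pathlib import Path


def fixture_path(root: Path, spec_version: str, name: str) -> Path:
    """Return the path of a fixture inside one spec-version directory.

    `root` is where the (hypothetical) test-data submodule is checked out.
    """
    path = root / spec_version / name
    if not path.exists():
        raise FileNotFoundError(f"no fixture {name!r} for spec {spec_version!r}")
    return path


def load_descriptor(root: Path, spec_version: str, package: str) -> dict:
    """Load the datapackage.json of a named fixture package."""
    path = fixture_path(root, spec_version, f"{package}/datapackage.json")
    return json.loads(path.read_text(encoding="utf-8"))
```

A client test suite would call `load_descriptor` with a fixed version string, so upgrading the submodule does not silently change which spec version the tests run against.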

Tasks

  • Boot the repo @rufuspollock (or @roll)
  • Identify existing potential sources (we can probably copy paste for most of this)
  • Identify a "client" repo that would use this data
  • Copy over "data files" (resources)
  • Copy over data packages
  • Add README documenting structure so others can contribute in future (maybe add this earlier in process!)

Research

@Stephen-Gates

Test data files and data packages could be organised in many different ways. In my data-package-examples repo I created a data package for each data type, format and constraint combination (e.g. Integer), as I was focused on validating data. Would that be useful here?

I created a goodtables.yml file that I could change to validate one, some or all the data packages. Unfortunately GoodTables can only provide a badge for the whole repo rather than a badge for each data package in the repo (see forum question).

Other data packages could test for

  • missingValues
  • primaryKeys
  • foreignKeys
  • missing required metadata properties
  • and much more...
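
As an illustration of what such a test package could look like, here is a sketch of a descriptor using the Table Schema properties `missingValues`, `primaryKey` and `foreignKeys`, together with two tiny checks. The package name, resources, rows and helper functions are all made up for this example, and the checks are simplified stand-ins, not how goodtables or any Frictionless library validates.

```python
import json

# Hypothetical descriptor exercising missingValues, primaryKey and foreignKeys.
descriptor = {
    "name": "constraints-example",
    "resources": [
        {
            "name": "countries",
            "path": "countries.csv",
            "schema": {
                "fields": [
                    {"name": "code", "type": "string"},
                    {"name": "name", "type": "string"},
                ],
                "primaryKey": "code",
            },
        },
        {
            "name": "cities",
            "path": "cities.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "country", "type": "string"},
                ],
                "missingValues": ["", "n/a"],
                "primaryKey": "id",
                "foreignKeys": [
                    {
                        "fields": "country",
                        "reference": {"resource": "countries", "fields": "code"},
                    }
                ],
            },
        },
    ],
}


def duplicate_keys(rows, key):
    """Return primary key values that occur more than once."""
    seen, duplicates = set(), []
    for row in rows:
        value = row[key]
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def dangling_references(rows, field, parent_rows, parent_field, missing_values=("",)):
    """Return foreign key values with no match in the parent resource.

    Assumption: cells listed in missing_values are skipped, not reported.
    """
    known = {row[parent_field] for row in parent_rows}
    return [
        row[field]
        for row in rows
        if row[field] not in missing_values and row[field] not in known
    ]


# Deliberately invalid rows, as a test package might contain.
countries = [{"code": "AU", "name": "Australia"}]
cities = [
    {"id": "1", "country": "AU"},
    {"id": "2", "country": "FR"},   # no such country
    {"id": "2", "country": "n/a"},  # duplicate id, missing country
]
missing = descriptor["resources"][1]["schema"]["missingValues"]
print(json.dumps(descriptor, indent=2))
print(duplicate_keys(cities, "id"))
print(dangling_references(cities, "country", countries, "code", missing))
```

Keeping deliberately invalid rows next to the descriptor gives library authors both a passing and a failing case for each property.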

Feedback on what's useful is welcome

@pwalsh
Member

pwalsh commented Dec 31, 2017

How is this different from the test suite repositories we have here?

@rufuspollock
Contributor Author

@pwalsh which test suite repositories? Getting a list of existing ones would be useful 😄 (and was your comment directed to general issue or @Stephen-Gates ?)

For me the motivation of this issue is having one clear reference repo with test data -- I know there are various different sources and I'm not sure which is the best one, and I've personally ended up creating my own test data in e.g. https://github.com/datahq/data.js

@Stephen-Gates

@pwalsh I had ignored testsuite-basic and testsuite-extended as they were labelled [Experimental] and from my perspective not well documented.

Specifically for Data Curator, as the focus was on local processing, I needed datapackage.zip files as that's the only way it can create or open a data package at present.

@rufuspollock
Contributor Author

@Stephen-Gates those are useful and I've added them to the research section. If you have any sample data in your repos please link that.

@rufuspollock
Contributor Author

@roll can I get some kind of comment on where we could boot this and start work on it - this is something I think community members (including myself) could contribute to if it was clear where we can start.

Also could you please document all the sources you know of 😄

@roll
Member

roll commented Jan 10, 2018

Existing resources:

I think the best repo for this work will be - https://github.com/frictionlessdata/example-data-packages - we could just continue Dan's work.

Regarding Paul's point about duplication with the testsuite data: for now the list of datasets for the test suite is very limited. I think the test suite could reuse community-driven example datasets instead of maintaining its own separately.

@Stephen-Gates

Would you like me to contribute the data packages that go with the Point location data in CSV files guide into https://github.com/frictionlessdata/example-data-packages?

@roll
Member

roll commented Jan 10, 2018

If everyone (esp. @rufuspollock as the facilitator of this work) agrees on the repo selection, that would be great! Please let me (or OKI) know if we need to grant some GitHub rights to simplify the process.

@Stephen-Gates

I've started a PR #2 that currently contains 6 of the 7 examples in the Point location data in CSV files guide. I've changed the goodtables.yml file to only test these new data packages.

Data with the geopoint type and object format seems to throw off the validation and I can't understand why. Advice appreciated.
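
For reference, the Table Schema spec describes the geopoint `object` format as a JSON object with exactly two keys, `lat` and `lon`. The sketch below shows what that means for a CSV cell: the cell has to hold valid JSON, and because that JSON contains commas and double quotes it has to be quoted and escaped in the CSV. Whether quoting is the cause of the failure above is only a guess; the function names are hypothetical and this is not the goodtables implementation.

```python
import csv
import io
import json


def parse_geopoint_object(cell: str):
    """Parse a CSV cell holding a geopoint in 'object' format into (lon, lat)."""
    value = json.loads(cell)  # raises ValueError if the cell is not valid JSON
    if not isinstance(value, dict) or set(value) != {"lon", "lat"}:
        raise ValueError("expected a JSON object with exactly the keys lon and lat")
    return float(value["lon"]), float(value["lat"])


def to_csv_line(name: str, lon: float, lat: float) -> str:
    """Write one CSV row, letting the csv module quote the JSON cell."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([name, json.dumps({"lon": lon, "lat": lat})])
    return buffer.getvalue()


line = to_csv_line("Brisbane", 153.02, -27.47)
name, cell = next(csv.reader(io.StringIO(line)))
print(line.strip())
print(parse_geopoint_object(cell))
```

A hand-written CSV that leaves the JSON cell unquoted would be split on the comma inside the object, which is one way such data can fail before type checking even starts.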

@rufuspollock would you like .zip files for each data package? Perhaps a directory called zip containing all the data package zip files?
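
A sketch of how the zip directory could be generated rather than maintained by hand, assuming each data package lives in its own top-level directory containing a `datapackage.json`. The function name and the archive layout (descriptor at the root of each zip) are assumptions for illustration.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_packages(repo_root: Path, out_dir_name: str = "zip") -> List[Path]:
    """Create <out_dir_name>/<package>.zip for every directory with a datapackage.json."""
    out_dir = repo_root / out_dir_name
    out_dir.mkdir(exist_ok=True)
    archives = []
    for descriptor in sorted(repo_root.glob("*/datapackage.json")):
        package_dir = descriptor.parent
        archive = out_dir / f"{package_dir.name}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for file in sorted(package_dir.rglob("*")):
                if file.is_file():
                    # Store paths relative to the package so that
                    # datapackage.json sits at the archive root.
                    zf.write(file, file.relative_to(package_dir).as_posix())
        archives.append(archive)
    return archives
```

Regenerating the archives from the directories keeps the two copies from drifting apart when a package is edited.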

Lastly, these packages have been hand crafted as I thought it would be good for the property order to mimic the specification and for the JSON to be "beautified". (Data Curator doesn't do that yet.)
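
The ordering and "beautifying" could also be scripted. A minimal sketch follows; the `PROPERTY_ORDER` list is an illustrative assumption, not the authoritative order from the specification, and only top-level properties are reordered.

```python
import json

# Assumed ordering for illustration; adjust to match the specification.
PROPERTY_ORDER = ["name", "title", "description", "licenses", "resources"]


def beautify(descriptor: dict, order=PROPERTY_ORDER) -> str:
    """Return the descriptor as indented JSON with known keys in a fixed order.

    Keys not listed in `order` keep their original relative order and come last.
    """
    rank = {key: index for index, key in enumerate(order)}
    keys = sorted(descriptor, key=lambda key: rank.get(key, len(order)))
    ordered = {key: descriptor[key] for key in keys}
    return json.dumps(ordered, indent=2, ensure_ascii=False) + "\n"


print(beautify({"resources": [], "custom": 1, "name": "example"}))
```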

If you have thoughts on how you'd like contributions, perhaps you could update the readme.md?

@rufuspollock
Contributor Author

@roll can you clarify the role and purpose of https://github.com/frictionlessdata/testsuite-extended - the README is not super informative to me.

@Stephen-Gates as per the original issue thread I think we probably want two distinct sets of stuff:

  • test-data: this is a repo purely for test data for people doing frictionless data work
  • exemplars / examples: this is a repo for examples for people to see

I think the best thing right now would be to draft the README for these repos (even though they don't exist yet) from the point of view of someone using them 😄

I've booted a hackmd here:

https://hackmd.io/CYBgnBBGCMCmC0BjYwBm8AsIQA56TFQVTAFZTUcBmCWAQwDYg===

Please dive in and contribute.

Once we've got a repo set up we'll move there.

@Stephen-Gates

@rufuspollock just to confirm - https://github.com/frictionlessdata/example-data-packages is for example/exemplar data packages, and not test data?

Suggest there should be data packages supporting

  • guides
  • patterns

Should the repo contain "experimental" data packages, for example the concepts being proposed in the Spatial Data Package Research?

Should the repo contain .zip versions of each package?

Started making notes in readme.

@roll
Member

roll commented Jan 17, 2018

@rufuspollock
We have two levels of implementations (https://github.com/frictionlessdata/implementations#implementation):

  • basic
  • extended

The structure of test suites reflects it:

  • there is testsuite-basic for spec-related tests only, based on https://frictionlessdata.io/specs/implementation/ (Python/JavaScript/Ruby/PHP)
  • there is testsuite-extended for the Python extended implementation - tabulator, tableschema-sql/bigquery/etc, goodtables etc. These could also be described as integration tests.

@rufuspollock
Contributor Author

@Stephen-Gates somehow I missed your comment when you wrote it last week!

> @rufuspollock just to confirm - https://github.com/frictionlessdata/example-data-packages is for example/exemplar data packages, and not test data?

Yes, that is right.

> Should the repo contain "experimental" data packages, for example the concepts being proposed in the Spatial Data Package Research?

I think that makes sense.

> Should the repo contain .zip versions of each package?

If needed for guidance, sure, but I note we have not yet settled on our bundle spec (we really should settle on a pattern for that: frictionlessdata/datapackage#132).

@rufuspollock
Contributor Author

@Stephen-Gates @roll I've now booted a test data repo here: https://github.com/frictionlessdata/test-data

We can start opening issues and consolidating material there for test data.

rufuspollock changed the title from "[epic] Examplar and test data files and data packages" to "[epic] Exemplar and test data files and data packages" on Jan 25, 2018

@Stephen-Gates

Thanks @rufuspollock, I will focus on example data packages for now. There are 2 PRs for review, although one has .zips I made before your answer above. I personally will find having .zips helpful and I believe most implementations support them.

I also need some help with instructions for local validation as discussed in https://github.com/frictionlessdata/example-data-packages/issues/6

I've fixed the data packages referenced by the Guides but there are others that need to be fixed / archived / deleted frictionlessdata/examples#4 - thoughts?

@Stephen-Gates

@roll @rufuspollock as suggested in #28 (comment), would I be able to get rights to https://github.com/frictionlessdata/example-data-packages so it's easier to contribute?

If not, please look at the PR frictionlessdata/examples#8 and the license proposed in frictionlessdata/examples#5

@rufuspollock
Contributor Author

FIXED.

We've started work on test data repo: https://github.com/frictionlessdata/test-data

@Stephen-Gates @roll please take a look and open issues for any suggested improvements.

We've got an in progress exemplar repo: https://github.com/frictionlessdata/example-data-packages

@Stephen-Gates

I would still like the request above resolved: #28 (comment)

@roll
Member

roll commented Feb 5, 2018

@Stephen-Gates
I've sent you an invite

@rufuspollock
Contributor Author

@Stephen-Gates just wanted to flag https://github.com/datapackage-examples - these are more focused on view examples but may be useful too.


4 participants