Aside: may want to split this into individual ideas for each import source
User Stories
Persona:
Data User - less sophisticated (uses Excel but may not know what an API is)
Data Wrangler - more sophisticated (knows what an API is)
Import File and get Data API
As a Data Wrangler I want to provide my file and have it imported into CKAN so that I get a Data API
What kind of file?
CSV file
Excel file
GeoJSON file
...
How do I provide
web interface
API (POST/GET url string or POST file content)
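For the API route, one option is to parse the file client-side and hand the rows to CKAN's `datastore_create` action. A minimal sketch of building that request body, assuming a CSV with a header row and treating every column as `text` for now (the function name and the `my-dataset` id are made up for illustration):

```python
import csv
import io
import json


def build_datastore_create_payload(csv_text, package_id):
    """Turn CSV text into a request body for CKAN's datastore_create action.

    Every column is declared as "text" here; tweaking types is a separate step.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [dict(row) for row in reader]
    fields = [{"id": name, "type": "text"} for name in (reader.fieldnames or [])]
    return {
        # Passing "resource" (instead of "resource_id") asks CKAN to create
        # a new resource inside the given dataset.
        "resource": {"package_id": package_id},
        "fields": fields,
        "records": records,
    }


# "my-dataset" is a hypothetical dataset id.
payload = build_datastore_create_payload(
    "city,population\nBerlin,3500000\n", "my-dataset"
)
print(json.dumps(payload, indent=2))
```

The resulting JSON would be POSTed to `/api/3/action/datastore_create` with the user's API key in the `Authorization` header. Large files would need batching, which this sketch ignores.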
Questions:
Do we validate the file?
Do we have some process for e.g. tweaking the field types?
What is the mapping between file and Dataset / Resource?
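On the field types question, one possible shape for a "tweak" step is: guess a type per column, then let the user override individual guesses before import. A rough sketch, assuming only `int`, `float` and `text` are guessed (the helper names and the `overrides` argument are hypothetical, not anything DataPusher exposes):

```python
import csv
import io


def guess_type(values):
    """Return a DataStore column type for a list of string cell values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "text"
    # Try the strictest type first and fall back to text.
    for type_name, cast in (("int", int), ("float", float)):
        try:
            for value in non_empty:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "text"


def guess_fields(csv_text, overrides=None):
    """Guess a type for each column; user-supplied overrides win."""
    overrides = overrides or {}
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in body if i < len(row)]
        fields.append({"id": name, "type": overrides.get(name, guess_type(column))})
    return fields


sample = "id,price,name\n1,2.5,a\n2,3,b\n"
print(guess_fields(sample))
print(guess_fields(sample, overrides={"id": "text"}))
```

The `overrides` dict is where a UI (or an API parameter) could feed in the user's corrections.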
Implementation
DataPusher already does most of this
What's missing is any kind of edit metadata step
No user interface
As a XXX I want to push my data file to GitHub and have it automatically create/update the CKAN DataStore so that my Data API is up to date
This is very similar to import file - the only difference is that we get push notifications (GitHub webhooks), so merge this with that example.
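To make the webhook part concrete: GitHub signs each delivery with an HMAC-SHA256 of the request body (sent in the `X-Hub-Signature-256` header), and a push event lists the added and modified paths per commit. A minimal sketch of the receiving side, assuming we only care about a fixed set of file extensions (the function names, the secret and the extension list are illustrative):

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, signature_header):
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def changed_data_files(push_event, extensions=(".csv", ".xlsx", ".geojson")):
    """Collect added or modified paths in a push event that look like data files."""
    paths = set()
    for commit in push_event.get("commits", []):
        paths.update(commit.get("added", []))
        paths.update(commit.get("modified", []))
    return sorted(p for p in paths if p.lower().endswith(extensions))


# Simulated delivery; the secret and paths are made up.
secret = b"hypothetical-webhook-secret"
body = json.dumps(
    {"commits": [{"added": ["data/gdp.csv"], "modified": ["README.md"]}]}
).encode("utf-8")
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

if signature_is_valid(secret, body, header):
    print(changed_data_files(json.loads(body)))
```

Each path returned would then go through the same import step as a manually provided file.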
Github import
As a XXX I want to push my tabular data package to GitHub and have it automatically create/update the CKAN DataStore so that my Data API is up to date
As it is already a data package, importing should be very simple
If the file is large we may need to worry about queues etc, but probably keep it simple for the present
How do we determine dataset to associate this with in CKAN?
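Part of why the data package case should be simple: `datapackage.json` already carries a schema per resource, so the DataStore field list can be derived rather than guessed. A sketch of that derivation, where the type mapping is my assumption rather than anything agreed, and unknown types fall back to `text`:

```python
import json

# Assumed mapping from Table Schema types to DataStore column types.
TYPE_MAP = {
    "string": "text",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "date": "date",
    "time": "time",
    "datetime": "timestamp",
    "object": "json",
    "array": "json",
}


def datastore_fields(descriptor):
    """Map each resource in a data package descriptor to DataStore fields."""
    result = {}
    for resource in descriptor.get("resources", []):
        key = resource.get("name") or str(resource.get("path"))
        schema_fields = resource.get("schema", {}).get("fields", [])
        result[key] = [
            {"id": f["name"], "type": TYPE_MAP.get(f.get("type", "string"), "text")}
            for f in schema_fields
        ]
    return result


# A made-up descriptor for illustration.
descriptor = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "name": "gdp",
      "path": "data/gdp.csv",
      "schema": {"fields": [
        {"name": "country", "type": "string"},
        {"name": "year", "type": "integer"},
        {"name": "value", "type": "number"}
      ]}
    }
  ]
}
""")
print(datastore_fields(descriptor))
```

This does not answer which CKAN dataset the package maps to; the package `name` is one candidate key, but that is still an open question.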
One-Click Create a Dataset
As a XXX I want to provide my file and have it imported into CKAN so that I get a nice Dataset
What distinguishes this from the existing system? Answer: its one-click nature
Automated regular import
As a Data Wrangler I want to have my data file automatically re-imported at regular intervals so that the DataStore (and Data API) stays up to date with my data.
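For regular re-import we probably do not want to reload the DataStore when nothing has changed. One simple approach (an assumption on my part, not a description of what any existing tool does) is to keep a hash of the last imported file and compare on each run:

```python
import hashlib


def content_hash(data):
    """Return a hex digest identifying the file content."""
    return hashlib.sha256(data).hexdigest()


def needs_reimport(data, last_hash):
    """Return (changed, new_hash) for freshly fetched file bytes."""
    new_hash = content_hash(data)
    return new_hash != last_hash, new_hash


# First run: nothing stored yet, so the file is imported.
changed, stored = needs_reimport(b"a,b\n1,2\n", None)
print(changed)

# Later run with identical content: skip the import.
changed, stored = needs_reimport(b"a,b\n1,2\n", stored)
print(changed)
```

Where the hash is stored (a resource extra, a small table, etc.) and what triggers the periodic run are left open here.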
Discussion
DataStore is a great feature of CKAN and it would be great to support getting data into it. In fact, one could go as far as to say that the DataStore is the "killer" feature, since having data in the DataStore gives you several major value-adds such as:
A Data API
Improved quality
In addition, data has to be of reasonable quality to get into the DataStore, so data from datasets in the DataStore is likely to be of higher quality. Whilst obviously not a feature of the DataStore itself, this is an indirect benefit, as the DataStore can help both "label" and drive data quality.
How Should It Work?
Import: Automatic or User Initiated
In getting data into the DataStore there are various choices about how it works. One key choice is whether:
Import happens automatically, i.e. as soon as a dataset resource of an appropriate type (e.g. Excel or CSV) is added to the "Catalog"
Import is user initiated in one way or another (though the actual process once initiated may be fairly automatic)
I would argue that the second option is best, i.e. import should be user initiated.
Why? There is huge variability in data quality. Without data of reasonable quality (e.g. no blank lines at the top of a CSV) import is likely to produce a poor outcome and/or be very hard to automate.
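To illustrate the kind of quality problem meant here, a pre-import check could flag blank lines before the header and rows whose width differs from the header. This is only a sketch of two such checks under my own naming; a real validator would cover far more:

```python
import csv
import io


def check_csv(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Indexes of rows that contain at least one non-blank cell.
    non_blank = [i for i, row in enumerate(rows) if any(c.strip() for c in row)]
    if not non_blank:
        return ["file has no data"]
    problems = []
    first = non_blank[0]
    if first > 0:
        problems.append(f"{first} blank line(s) before the header")
    width = len(rows[first])
    for i in non_blank[1:]:
        if len(rows[i]) != width:
            problems.append(f"row {i + 1} has {len(rows[i])} cells, expected {width}")
    return problems


print(check_csv("a,b\n1,2\n"))
print(check_csv("\n\na,b\n1,2,3\n"))
```

With user-initiated import, a report like this could be shown before anything is written to the DataStore.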
Assuming user initiated there are still several options (not mutually exclusive):
Integrated into DataHub UI (e.g. "import to DataStore")
import.datahub.io: Create a bespoke UI outside of primary datahub site for doing imports
Leave it to users to push data into the DataStore via their own tools (tools we could help create)
Implementation
Any implementation with a UI (i.e. ignoring API usage) likely has 3 parts:
UI for import initiation, reporting and management
Summary: super-simple one click (automated) import of data into CKAN (and its DataStore)
And/or integration with specific services e.g.
Import worker - DataPusher
We probably want to use DataPusher (https://github.com/ckan/datapusher), though it would not be hard to roll our own.
Basic steps on how to do this are in the docs: http://docs.ckan.org/en/latest/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore and http://docs.ckan.org/projects/datapusher/en/latest/
I got DataPusher deployed on Heroku about a month ago - see ckan/datapusher#23
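For the user-initiated case, an "import to DataStore" button could boil down to one call to the `datapusher_submit` action that the DataPusher extension adds to CKAN. A sketch of building that request with the standard library, without sending it (the CKAN URL, API key and resource id below are placeholders):

```python
import json
import urllib.request


def build_datapusher_submit_request(ckan_url, api_key, resource_id):
    """Build (but do not send) a POST asking CKAN to push a resource to the DataStore."""
    body = json.dumps({"resource_id": resource_id}).encode("utf-8")
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/datapusher_submit",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


# All three values are hypothetical.
request = build_datapusher_submit_request(
    "https://ckan.example.org/", "my-api-key", "hypothetical-resource-id"
)
print(request.get_method(), request.full_url)
# To actually submit: urllib.request.urlopen(request)
```

This assumes the DataPusher plugin is enabled on the CKAN instance and that the resource already exists there.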