Multi-pass datapusher #150
@jqnatividad first, I note this is really about creating a proper ETL platform for CKAN - see e.g. #18. Some thoughts follow:
Connect / Reuse Frictionless Data and Data Package
Connect this with the Frictionless Data / Data Package work. That already has:
What Workflow We Want
Overall the workflow is something like this (see also the "perfect workflow" section in this OpenSpending user story):
Aside: I note we have a working spike of this approach (command line only) related to OpenSpending: https://github.com/openspending/oscli-poc Aside: I also assume folks saw things like http://okfnlabs.org/blog/2014/09/11/data-api-for-data-packages-with-dpm-and-ckan.html - this is automated loading of data into the CKAN DataStore, with type guessing ...
Focus Right Now
Right now my suggestion would be to:
Descriptive Statistics
Definitely useful, but probably a separate activity. Again I would connect this with the Tabular Data Package stuff - see e.g. http://data.okfn.org/roadmap#tabular-stats-tool
General Enrichment Services
Again, fantastic - let's just create a model for these. We probably want an integrated UX but separate services, i.e. each runs "standalone" from an implementation perspective while the UX is integrated. This has been discussed for a number of services already, e.g. the link checker (dead or alive), and is key for the datapusher generally, I think. |
I'm thinking of a simple extensible version of this:
Building a nice interface to allow users to do fancy things like confirming column types or ETL to clean up data can happen after we have these low-level actions in place. |
Thanks @rgrp, @wardi for your feedback. This seems to be a recurring requirement, as evidenced by the multiple references in this repo.
Combined Workflow
Maybe we can combine approaches, using pgloader.io instead of native COPY, while keeping the new workflow as close as possible to the current CKAN 2.3 upload workflow?
No change to the current CKAN workflow so far; on the Manage/DataStore tab of the Resource:
only modifying the DataStore workflow.
In this way, we leverage not only Frictionless Data/Data Package, but also pgloader.io, which seems to be an excellent way to async-load data into Postgres - even better than the native COPY command. I consciously aligned my user story with the current CKAN upload workflow, as I think it needs to be refactored insofar as the Datapusher/DataStore is concerned.
Descriptive Stats
And as @rgrp suggested, the descriptive stats stuff can come later, but once a dataset is loaded into Postgres, computing these stats is pretty straightforward with SQL (i.e. min, max, distinct, etc.). The JSON Table Schema field constraints can even be repurposed to show the stats, though they are not really "constraints" per se, but a description of the loaded table. I still see this as a related project to the work above, in a future iteration.
General Enrichment Services
An extensible framework would be great! Once we have a more robust way of onboarding data into the DataStore, having an enrichment framework would really go a long way towards enhancing CKAN as open infrastructure on top of which enterprise-class solutions can be built. #dogfooding 😉 And I can envision a whole class of data enrichment services that the community can develop. This deserves its own ticket, but it only makes sense once we have a more robust datapusher, especially since the enrichment services will actually require the JSON Table Schema. So the JSON Table Schema is not only there to support the Data Dictionary view and the schema definition interface at load time - it will also support these enrichment services. |
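To make the "straightforward with SQL" point concrete, here is a minimal sketch of computing such stats through the DataStore's SQL endpoint (datastore_search_sql) via ckanapi. The CKAN URL, API key, resource id, and column names are placeholder assumptions, not anything agreed in this thread:

```python
import ckanapi

# Placeholder values - swap in a real CKAN URL, API key, and resource id.
ckan = ckanapi.RemoteCKAN('https://ckan.example.org', apikey='my-api-key')
resource_id = 'my-resource-id'

# In the DataStore, each resource is a Postgres table named after its id,
# so ordinary aggregate SQL yields the descriptive stats.
sql = (
    'SELECT MIN("amount") AS min_amount, '
    '       MAX("amount") AS max_amount, '
    '       COUNT(DISTINCT "department") AS distinct_departments '
    'FROM "{}"'.format(resource_id)
)

result = ckan.action.datastore_search_sql(sql=sql)
print(result['records'])
```

The same pattern would extend to whatever stats the Data Dictionary view eventually wants to surface.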
@jqnatividad I like it. Can we integrate with pgloader safely via the command line, or is it going to take lots of fragile script generation and output parsing? |
@wardi, only one way to find out 😄 I only found out about pgloader.io this week, but it seems to be widely used and the author is a PostgreSQL major contributor. As the current datapusher is a stand-alone CKAN service and runs asynchronously, I think it's a natural fit. The pgloader CLI options seem to be quite extensive, and maybe we can have a template DSL we can parameterize and create a next-gen datapusher using https://github.com/ckan/ckan-service-provider. BTW, found this paper by the pgloader.io author that's quite interesting - http://rmod.lille.inria.fr/archives/dyla13/dyla13_3_Implementing_pgloader.pdf |
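As a rough illustration of the "template DSL" idea, a parameterized pgloader command file could be rendered and run from a small Python wrapper along these lines. This is only a sketch: the function name and connection details are assumptions, and the exact pgloader command syntax should be checked against the pgloader documentation.

```python
import subprocess
import tempfile

# Illustrative pgloader command template - verify syntax against pgloader docs.
PGLOADER_TEMPLATE = """\
LOAD CSV
     FROM '{csv_path}'
     INTO postgresql://{user}@{host}/{dbname}?{table}
     WITH skip header = 1,
          fields terminated by ','
;
"""

def load_csv(csv_path, table, user='ckan', host='localhost', dbname='datastore'):
    # Render the command file from the template.
    command = PGLOADER_TEMPLATE.format(
        csv_path=csv_path, table=table, user=user, host=host, dbname=dbname)
    with tempfile.NamedTemporaryFile('w', suffix='.load', delete=False) as f:
        f.write(command)
        command_file = f.name
    # pgloader runs as a separate process, which fits the existing
    # asynchronous, stand-alone datapusher service model.
    return subprocess.run(['pgloader', command_file],
                          capture_output=True, text=True)
```

A ckan-service-provider job could wrap load_csv and report pgloader's exit code and output back as the job log.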
As a side benefit of using pgloader, we get to support additional file formats like fixed-width, Postgres COPY, dBase, IXF, and SQLite. It's also interesting that pgloader can directly connect to MySQL and MS SQL Server using a connection string. For data publishers, this is a great way to direct-load data from transaction systems, skipping an intermediate ETL step. Perhaps pgloader support in ckanapi and/or ckan-import as well? It could be a thin wrapper that directly uses the pgloader DSL and just uses the DataStore API to register the resource with CKAN and associate it with a package. |
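A hedged sketch of that thin-wrapper idea, using only documented CKAN API actions via ckanapi: create the resource first (its id becomes the DataStore table name pgloader would load into), then declare the columns with datastore_create. The dataset name, URL, and field list are placeholders, and whether datastore_create can simply adopt a table that was populated outside the API is an open question that would need verifying.

```python
import ckanapi

ckan = ckanapi.RemoteCKAN('https://ckan.example.org', apikey='my-api-key')

# 1. Create the resource so we get an id to use as the DataStore table name.
resource = ckan.action.resource_create(
    package_id='my-dataset',                    # hypothetical dataset
    name='Transactions (direct load)',
    format='CSV',
    url='http://example.org/transactions.csv')  # placeholder

# 2. Declare the table's columns in the DataStore. No records are passed;
#    the actual rows would be loaded by pgloader into the table named
#    after resource['id'].
ckan.action.datastore_create(
    resource_id=resource['id'],
    fields=[{'id': 'date', 'type': 'timestamp'},
            {'id': 'amount', 'type': 'numeric'}],
    force=True)
```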
Some interesting projects from GitHub: |
And the pieces are starting to fall into place. json-editor looks really cool! @rgrp mentioned that schema editing is coming soon. Should we wait for that, or should we use/leverage json-editor instead? |
@jqnatividad I think we should still get a proper architecture diagram here and really agree on how the different pieces fit together, so people can then go off and work on specific bits. As I said, I'd suggest using JSON Table Schema as a core piece which other bits can then work off. Also cc'ing @pwalsh as he is doing a lot of work on Data Package and Frictionless Data stuff with Open Knowledge atm. |
Interesting discussion. About schema editing - yes, we are working on one right now as a fairly generic interface to edit Data Package schemas, and we do use json-editor for that: https://github.com/okfn/datapackagist and http://datapackagist.okfnlabs.org/ - we have some UX work to do there, but functionality-wise, there is a quite complete application there already. It is still WIP so YMMV. As it might not be immediately apparent from the DataPackagist UI, there are additional things here beyond schema editing as such:
For JSON Table Schema, https://github.com/okfn/json-table-schema-js and https://github.com/okfn/json-table-schema-py (jtskit on PyPI, as we only just last week took over the json-table-schema package) are libs for handling JTS, including inferring a schema from data. These are more in line with the current JTS spec than, say, the JTS features in messytables. Also, it might be interesting to note here the differences between GoodTables and MessyTables. MessyTables tries to work with anything you throw at it, which is a great approach when working with known messy data that you want to fix in an automated fashion, or when you don't care about the quality of the actual source file as long as you can get data out of it. GoodTables takes a different approach: we do want to ensure that the source data is well formed, and if it is not, we want specific notifications about where and how it is not, so we can take action to fix the source. GoodTables is implemented as a processing pipeline and has two processors at present: one for structural validation, and one for validation against a JSON Table Schema. It is possible to hook in custom processors, although admittedly the docs are lacking in that department. |
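For readers unfamiliar with that processing-pipeline pattern, here is a toy Python illustration of the two-processor shape (structure check, then schema check). This is not the GoodTables API; the function names and report format are invented purely to show the idea.

```python
import csv

def structure_processor(rows, report):
    """Flag rows whose column count differs from the header's."""
    header = rows[0]
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            report.append(f'row {i}: expected {len(header)} columns, got {len(row)}')
    return rows

def schema_processor(schema, rows, report):
    """Flag header names not declared in the (JSON Table Schema-like) schema."""
    declared = {f['name'] for f in schema['fields']}
    for name in rows[0]:
        if name not in declared:
            report.append(f'column "{name}" not in schema')
    return rows

def run_pipeline(path, schema):
    """Run the processors in sequence; an empty report means a well-formed source."""
    report = []
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    if not rows:
        return ['empty file']
    rows = structure_processor(rows, report)
    rows = schema_processor(schema, rows, report)
    return report
```

Custom processors would slot into run_pipeline the same way, which is the extensibility point the comment above describes.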
Sorry, life happened (vacation, catch-up, work) and I disconnected. Would love to pick this up again and, as @rgrp suggested, go about speccing it out so folks can start work. Will take a stab at a rough spec on GDocs and link it here so we can collaborate on it. |
On 16 July 2015, the CSV on the Web Working Group published Candidate Recommendations for Tabular Data on the Web. How should this inform the proposed work? |
I don't think this changes things - it was close to (and based on) the tabular data package and the key thing we probably want here is the json table schema part which we have factored out. |
Touching this issue to highlight https://github.com/timwis/csv-schema from @timwis. |
Here's a variant that @wardi and I are looking at, to tackle problems getting the column types right during import with DataPusher:
This would be done in a new extension, a replacement for ckanext-datapusher. The steps would be run on a queue (ckanext-rq). I believe the last step will need a new action function in DataStore, a replacement for datastore_create but without actually storing the data. Admins can look at the resource's DataStore tab and now, along with seeing the status & log of the import, they can also fix things like the encoding and column types (e.g. phone numbers need to be strings to keep the leading zero) and trigger a reimport. |
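A hypothetical sketch of that new action function: a datastore_create-style call that records field definitions without storing any rows, exposed from the new extension via IActions. The names datastore_define and MultipassDatapusherPlugin are invented for illustration; only the CKAN plugin and toolkit calls themselves are real.

```python
import ckan.plugins as p
import ckan.plugins.toolkit as toolkit

def datastore_define(context, data_dict):
    """Like datastore_create, but guaranteed never to insert rows - only the
    field definitions (names, types) are created or updated."""
    toolkit.check_access('datastore_create', context, data_dict)
    data_dict.pop('records', None)  # drop any rows that were passed in
    # Delegate to datastore_create so the (empty) table and its column
    # definitions are handled in the usual way.
    return toolkit.get_action('datastore_create')(context, data_dict)

class MultipassDatapusherPlugin(p.SingletonPlugin):
    p.implements(p.IActions)

    def get_actions(self):
        return {'datastore_define': datastore_define}
```

With that in place, the "fix column types and reimport" flow becomes: call datastore_define with the corrected fields, then re-run the load job from the queue.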
Closing this now that @davidread wrote Express Loader (https://github.com/davidread/ckanext-xloader) :) |
As discussed with @wardi and @davidread at IODC/CKANcon:
Pass 1:
Pass 2:
Pass 3 (optional):
Pass 4, etc.
BENEFITS: