
Multi-pass datapusher #150

Closed
jqnatividad opened this issue May 30, 2015 · 16 comments

Comments

@jqnatividad
Contributor

As discussed with @wardi and @davidread at IODC/CKANcon:

Pass 1:

  • scan CSV using messytables
  • guessed column datatypes are stored as a data dictionary (see the sketch after this list)
  • user has the ability to override guessed datatypes (e.g. a column is not a number but a date in ISO-8601 format; a string date should be converted to a native date type)
  • user has the ability to add column descriptions to the computed data dictionary
  • user has the ability to assign an alias (SQL view name)
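
For illustration, a rough sketch of what the Pass 1 type guessing could look like with messytables; the file name and the shape of the resulting data dictionary are just placeholders:

```python
# Sketch only: guess headers and column types with messytables, then build an
# editable data dictionary. The dictionary structure here is illustrative.
from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess)

with open('resource.csv', 'rb') as f:
    row_set = CSVTableSet(f).tables[0]
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))
    guessed_types = type_guess(row_set.sample, strict=True)

# users could then override the guessed types and add descriptions/aliases
data_dictionary = [{'id': h, 'type': str(t), 'description': ''}
                   for h, t in zip(headers, guessed_types)]
```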

Pass 2:

  • push CSV to datastore (consider using the Postgres COPY command? a sketch follows this list)
  • data should be pushed more reliably as a result of Pass 1
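
A minimal sketch of the COPY idea using psycopg2; the table name, file name, and connection string are placeholders, and the target table is assumed to already exist with the types chosen in Pass 1:

```python
# Sketch only: bulk-load a CSV into an existing datastore table via COPY.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur, open('resource.csv') as f:
    cur.copy_expert(
        'COPY "resource-table" FROM STDIN WITH (FORMAT csv, HEADER true)', f)
```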

Pass 3 (optional):

  • run the ANALYZE command on the resulting datastore table (helps with compiling descriptive stats; see the sketch after this list)
  • compile descriptive statistics, e.g.
    • top N values
    • range of values (e.g. range of values is 0-1000; 1 thru 5; dates are from 1990-01-01 to 1991-12-31, etc.)
    • sparseness of column
    • histogram of values in column (à la OpenRefine)
    • if ANALYZE command was run, maybe even bring in some stats computed by Postgres also (e.g. n_distinct, most_common_vals, most_common_freqs, etc.)
  • amend computed data dictionary with descriptive statistics
  • populate some Project Open Data fields like the temporal field (e.g. start and end date of applicability for the data)
  • optionally create indices
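
A hedged sketch of the optional ANALYZE step and of pulling the Postgres planner statistics mentioned above; the table name is a placeholder:

```python
# Sketch only: run ANALYZE, then read n_distinct / most_common_vals /
# most_common_freqs from pg_stats to fold into the data dictionary.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute('ANALYZE "resource-table"')
    cur.execute("""
        SELECT attname, n_distinct, most_common_vals::text, most_common_freqs
        FROM pg_stats
        WHERE tablename = %s
    """, ('resource-table',))
    for column, n_distinct, common_vals, common_freqs in cur.fetchall():
        print(column, n_distinct, common_vals, common_freqs)
```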

Pass 4, etc.

  • data pusher extensions 😄 installed on the instance can do more instance-specific stuff (e.g. entity resolution, reconciliation, geocoding, enrichment, etc.)

BENEFITS:

  • a computed data dictionary that data consumers can use
  • computed metadata to help populate POD/DCAT fields
  • data dictionary can be used for automated mapping between tables - taking advantage of Postgres/PostGIS capabilities (real joins, storing queries as views, materialized views, computed columns, storing extras as JSONB, querying JSONB, etc.) and well-known DB management techniques (e.g. having read-only replicas to support high-volume API use)
  • takes CKAN beyond just cataloging to become fundamental open data infrastructure on which enterprise-class, open-data-driven applications can be built
@rufuspollock
Member

@jqnatividad first I note this is really about creating a proper ETL platform for CKAN - see e.g. #18. Some thoughts follow:

Connect / Reuse Frictionless Data and Data Package

Connect this with the Frictionless Data / Data Package work - that already has much of what is needed here (notably JSON Table Schema).

What Workflow We Want

Overall the workflow is something like this (see also the "perfect workflow" section in this OpenSpending user story):

  • Give me a CSV / Excel file
  • I guess the schema (JSON Table Schema) and present it to you with some of your data in a nice way - e.g. I show you the table in a table structure with drop-downs for column types, the ability to add descriptions, etc. (an example schema follows this list)
  • you edit and confirm
  • we check that schema against the data and report issues
  • import happens
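
For reference, an illustrative JSON Table Schema of the kind that would be guessed and then confirmed by the user; the field names, descriptions, and constraints here are made up:

```python
# Illustrative only: a guessed schema the user would edit and confirm.
schema = {
    'fields': [
        {'name': 'station_id', 'type': 'string',
         'description': 'Weather station identifier'},
        {'name': 'observed', 'type': 'date'},
        {'name': 'temperature_c', 'type': 'number',
         'constraints': {'minimum': -60, 'maximum': 60}},
    ]
}
```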

Aside: I note we have a working spike of this approach for command line only related to OpenSpending: https://github.com/openspending/oscli-poc

Aside: I also assume folks saw things like http://okfnlabs.org/blog/2014/09/11/data-api-for-data-packages-with-dpm-and-ckan.html - this is an automated load of data into the CKAN DataStore, with type guessing ...

Focus Right Now

Right now my suggestion would be to:

  • Agree on using JSON Table Schema for the data dictionary
  • Agree on a convention for where we store the JSON Table Schema in the resource attributes - i.e. simply agree on a given "extra" field name where we store this (a sketch follows this list)
  • Describe the user stories and properties of the "ETL" service you want, and leave open the question of whether this is datapusher or something else
    • personally I think you really want something other than datapusher - or multiple routes (e.g. command line, OpenRefine, etc.)
    • e.g. on the frictionless data front we've focused on node / js, as your creator app can then be embedded anywhere
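
As a sketch of the second bullet, assuming the convention is simply a "schema" extra field on the resource - the field name, site URL, API key, and resource id below are all placeholders, and resource_patch assumes a reasonably recent CKAN:

```python
# Sketch only: store the agreed JSON Table Schema in a resource extra field
# via the CKAN API. "schema" is just the proposed field name, not a standard.
import json
import ckanapi

ckan = ckanapi.RemoteCKAN('https://demo.ckan.org', apikey='API-KEY')
ckan.action.resource_patch(
    id='RESOURCE-ID',
    schema=json.dumps({'fields': [{'name': 'station_id', 'type': 'string'}]}))
```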

Descriptive Statistics

Definitely useful but probably a separate activity. Again I would connect with the Tabular Data Package stuff - see e.g. http://data.okfn.org/roadmap#tabular-stats-tool

General Enrichment Services

Again, fantastic - let's just create a model for these. We probably want integrated UX but separate services - i.e. this runs "standalone" from an implementation perspective but the UX is integrated. This has been discussed for a number of services already, e.g. the link checker (dead or alive), and is key for datapusher generally, I think.

@wardi
Contributor

wardi commented May 30, 2015

I'm thinking of a simple extensible version of this:

  1. Raw/schemaless datastore import by adding support for postgres arrays to datastore field types
    • import complete XLS or CSV contents into a datastore table as a single column containing a postgres array of strings (see the sketch after this list)
    • guaranteed to work for any XLS or CSV
    • fast: use COPY in large batches of rows
    • collect and update minimal information while copying: number of columns, min/max float/int value seen for each column (or None if a column contains entries invalid for the type); this could happen entirely client-side (datapusher)
  2. datastore_alter: Alter table in postgres
    • copy data from raw array into real columns with types, apply indexes etc.
    • fast because it's all in postgres operating on data it already has (no looping in python)
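
A rough sketch of the two steps in SQL (driven from Python here); datastore_alter is a proposed action, not an existing one, and the table names, column names, and types are placeholders:

```python
# Sketch only: step 1 creates a schemaless table of text arrays (rows would
# be COPYed in as array literals); step 2 is the proposed "datastore_alter",
# casting into real typed columns entirely inside Postgres.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute('CREATE TABLE raw_resource (_row text[])')
    # ... COPY raw rows into raw_resource here ...
    cur.execute("""
        CREATE TABLE typed_resource AS
        SELECT _row[1]::text    AS name,
               _row[2]::integer AS amount,
               _row[3]::date    AS observed
        FROM raw_resource
    """)
```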

Building a nice interface to allow users to do fancy things like confirming column types or ETL to clean up data can happen after we have these low-level actions in place.

related tickets: #123 #124

@jqnatividad
Contributor Author

Thanks @rgrp, @wardi for your feedback.

This seems to be a recurring requirement as evidenced by the multiple references in this repo.

Combined Workflow

Maybe we can combine approaches, using pgloader.io instead of native COPY, while keeping the new workflow as close as possible to the current CKAN 2.3 upload workflow?

  1. Give me a CSV / Excel file
  2. Upload raw file to Filestore as is

no change to current CKAN workflow, so far

On the Manage/DataStore tab of the Resource:

  1. I guess the schema (JSON Table Schema) and present it to you with some of your data in a nice way - e.g. I show you the table in a table structure with drop-downs for column types, the ability to add descriptions, etc.
  2. you edit and confirm the schema using a simple interface. The schema definition is stored in a "JSON Table Schema" resource attribute extra field. (Aside: perhaps using PostgreSQL 9.4's new JSONB data type?)
  3. The schema definition is used to generate pgloader.io commands (see the sketch below)
  4. the import is done using pgloader.io
  5. based on pgloader.io results, report issues, and optionally go back to step 2 if the load failed
  6. if pgloader.io is successful, create a "Data Dictionary" resource view that renders the JSON Table Schema in a nice human-readable format.

only modifying the DataStore workflow
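
To make step 3 concrete, a hedged sketch of generating a pgloader LOAD CSV command from the stored schema. The JTS-to-Postgres type mapping, file paths, table name, and connection string are all placeholders, and the DSL shown is only an approximation - the pgloader docs are the authority on its syntax:

```python
# Sketch only: render a pgloader command file from the stored JSON Table
# Schema. Consult the pgloader docs for the authoritative DSL syntax.
JTS_TO_PG = {'string': 'text', 'integer': 'bigint',
             'number': 'numeric', 'date': 'date'}

def make_load_command(schema, csv_path, table):
    columns = ',\n           '.join(
        '{0} {1}'.format(f['name'], JTS_TO_PG.get(f['type'], 'text'))
        for f in schema['fields'])
    return """
    LOAD CSV
         FROM '{csv}'
         INTO postgresql:///datastore_default?{table}
         WITH skip header = 1,
              fields terminated by ','
    BEFORE LOAD DO
    $$ create table if not exists {table} (
           {columns}
       );
    $$;
    """.format(csv=csv_path, table=table, columns=columns)
```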

In this way, we leverage not only Frictionless Data/Data Package, but also pgloader.io, which seems to be an excellent way to async-load data into postgres, even better than the native COPY command. I consciously aligned my user story with the current CKAN upload workflow, as I think it needs to be refactored insofar as the Datapusher/Datastore is concerned.

Descriptive Stats

And as @rgrp suggested, the descriptive stats stuff can come later, but once a dataset is loaded into Postgres, computing these stats is pretty straightforward with SQL (e.g. min, max, distinct, etc.). The JSON Table Schema field constraints can even be repurposed to show the stats, though they are not really "constraints" per se, but a description of the loaded table. I still see this as a project related to the work above, for a future iteration.
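
A minimal sketch of the kind of SQL meant here; the table and column names are placeholders, and the FILTER clause needs Postgres 9.4+:

```python
# Sketch only: basic descriptive stats straight from SQL aggregates.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT min(amount),
               max(amount),
               count(DISTINCT amount),
               count(*) FILTER (WHERE amount IS NULL) AS missing
        FROM "resource-table"
    """)
    print(cur.fetchone())
```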

General Enrichment Services

An extensible framework would be great! Once we have a more robust way of onboarding data into the Datastore, having an enrichment framework would really go a long way towards enhancing CKAN as open infrastructure on top of which enterprise-class solutions can be built. #dogfooding 😉

And I can envision a whole class of data enrichment services that the community can develop. This deserves its own ticket, but only makes sense once we have a more robust datapusher, especially since the enrichment services will actually require the JSON Table Schema.

So the JSON Table Schema is not only there to support the Data Dictionary view and the schema definition interface at load time; it will also support these enrichment services.

@wardi
Contributor

wardi commented Jun 2, 2015

@jqnatividad I like it.

Can we integrate with pgloader safely via the command line, or is it going to take lots of fragile script generation and output parsing?

@jqnatividad
Contributor Author

@wardi, only one way to find out 😄 I only found out about pgloader.io this week, but it seems to be widely used and the author is a major PostgreSQL contributor.

As the current datapusher is a stand-alone CKAN service and runs asynchronously, I think it's a natural fit. The pgloader CLI options seem to be quite extensive, and maybe we can have a template DSL we can parameterize, and create a next-gen datapusher using https://github.com/ckan/ckan-service-provider.
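
Assuming the parameterized command-file approach, the service job itself could be little more than shelling out to the pgloader CLI and surfacing its output; a sketch with deliberately minimal error handling, and the load-file name is a placeholder:

```python
# Sketch only: run pgloader on a generated load file and report failures
# back to CKAN so an admin can fix the schema and retry.
import subprocess

result = subprocess.run(['pgloader', 'resource.load'],
                        capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError(result.stderr or result.stdout)
```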

BTW, found this paper by the pgloader.io author that's quite interesting - http://rmod.lille.inria.fr/archives/dyla13/dyla13_3_Implementing_pgloader.pdf

@jqnatividad
Contributor Author

As a side benefit of using pgloader, we get to support additional file formats like fixed-width, Postgres COPY, dBase, IXF, and SQLite.

It's also interesting that pgloader can directly connect to MySQL and MS SQL Server using a connection string. For data publishers, this is a great way to direct-load data from transactional systems, skipping an intermediate ETL step.

Perhaps pgloader support in ckanapi and/or ckan-import as well? It could be a thin wrapper that directly uses the pgloader DSL and just uses the Datastore API to register the resource with CKAN and associate it with a package.

@maxious
Member

maxious commented Jun 11, 2015

Some interesting projects from GitHub:
https://github.com/opencivicdata/pupa is a Python library/DSL for scraping HTML pages in an ETL way
https://github.com/jdorn/json-editor is a Javascript library for generating UIs to edit JSON based on JSON Schema definitions

@jqnatividad
Contributor Author

And the pieces are starting to fall into place. json-editor looks really cool! @rgrp mentioned that schema editing is coming soon. Should we wait for that, or should we use/leverage json-editor instead?

@rufuspollock
Member

@jqnatividad I think we should still get a proper architecture diagram here and really agree on how the different pieces fit together, so people can then go off and work on specific bits. As I said, I'd suggest using JSON Table Schema as a core piece which other bits can then work off. Also cc'ing @pwalsh as he is doing a lot of work on Data Package and Frictionless Data stuff with Open Knowledge atm.

@pwalsh
Member

pwalsh commented Jun 14, 2015

Interesting discussion.

About schema editing - yes, we are working on one right now as a fairly generic interface to edit Data Package schemas, and we do use json-editor for that: https://github.com/okfn/datapackagist and http://datapackagist.okfnlabs.org/ - we have some UX work to do there, but functionality-wise there is quite a complete application there already. It is still WIP so YMMV.

As it might not be immediately apparent from the DataPackagist UI, there are additional things here beyond schema editing as such.

For JSON Table Schema, https://github.com/okfn/json-table-schema-js and https://github.com/okfn/json-table-schema-py (jtskit on PyPI, as we only just last week took over the json-table-schema package) are libs for handling JTS, including inferring a schema from data. These are more in line with the current JTS spec than, say, the JTS features in messytables.

Also, it might be interesting to note here the differences between GoodTables and MessyTables.

MessyTables tries to work with anything you throw at it, which is a great approach when you are working with known messy data and want to fix it in an automated fashion, or when you don't care about the quality of the actual source file as long as you can get data out of it.

GoodTables takes a different approach, where we do want to ensure that the source data is well formed, and if it is not, we want specific notifications about where and how it is not, so we can take action to fix the source. GoodTables is implemented as a processing pipeline and has two processors at present: one for structural validation, and one for validation against a JSON Table Schema. It is possible to hook in custom processors, although admittedly the docs are lacking in that department.

@jqnatividad
Contributor Author

Sorry, life happened (vacation, catch-up, work) and I got disconnected.

Would love to pick this up again, and as @rgrp suggested, go about speccing it out so folks can start work.

Will take a stab at a rough spec on GDocs and link it here so we can collaborate on it.

@jqnatividad
Contributor Author

On 16 July 2015, the CSV on the Web Working Group published Candidate Recommendations for Tabular Data on the Web.

How should this inform the proposed work?

@rufuspollock
Member

I don't think this changes things - it was close to (and based on) the Tabular Data Package, and the key thing we probably want here is the JSON Table Schema part, which we have factored out.

@jqnatividad
Contributor Author

Touching this issue to highlight https://github.com/timwis/csv-schema from @timwis.

timwis/csv-schema#4

@davidread

Here's a variant that @wardi and I are looking at, to tackle problems getting the column types right during import with DataPusher:

  1. Detect CSV properties (encoding, CSV dialect, number of header lines, columns renamed in datastore, column types) using messytables. Store most of these as resource extras (the column types will be stored in postgres in the next step).
  2. Use Postgres' COPY command to quickly import the CSV into DataStore. Avoid value type errors by importing all columns as VARCHAR at this point.
  3. Try to convert each column to its preferred type (see the sketch after this list). If a column fails, it fails, but at least the data is in DataStore.
  4. DataStore is told about the new table in its DB, and so it is made available to users.
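
A hedged sketch of step 3: converting columns in place and falling back to text when a cast fails. The table name and the type mapping are placeholders:

```python
# Sketch only: per-column type conversion with a safe fallback. Identifier
# interpolation via format() is fine for a sketch but not injection-safe.
import psycopg2

preferred_types = {'observed': 'timestamp', 'amount': 'numeric'}  # placeholder

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
for column, pg_type in preferred_types.items():
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                'ALTER TABLE "resource-table" '
                'ALTER COLUMN "{col}" TYPE {typ} USING "{col}"::{typ}'
                .format(col=column, typ=pg_type))
    except psycopg2.DataError:
        # the cast failed: leave this column as text so the data stays usable
        pass
```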

This would be done in a new extension, a replacement for ckanext-datapusher. The jobs would be run on a queue (ckanext-rq). I believe the last step will need a new action function in DataStore, as a replacement for datastore_create but without actually storing the data.

Admins can look at the resource's DataStore tab and, along with seeing the status & log of the import, fix things like the encoding, column types (e.g. phone numbers need to be strings to keep the leading zero), etc., and trigger a reimport.

@jqnatividad
Contributor Author

Closing this now that @davidread wrote Express Loader (https://github.com/davidread/ckanext-xloader) :)
