
Multi-pass datapusher #150

Closed
jqnatividad opened this issue May 30, 2015 · 16 comments

Comments

@jqnatividad
Contributor

As discussed with @wardi and @davidread at IODC/CKANcon:

Pass 1:

  • scan CSV using messytables
  • guessed column datatypes are stored as a data dictionary (see the sketch after this list)
  • user has the ability to override guessed datatypes (e.g. a column is not a number but a date in ISO-8601 format; a string date should be converted to a native date type)
  • user has the ability to add column descriptions to the computed data dictionary
  • user has the ability to assign an alias (SQL view name)
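
For illustration, a rough sketch of what the Pass 1 type guessing could look like with messytables; the file name and the shape of the resulting data dictionary are just placeholders:

```python
# Sketch only: guess headers and column types with messytables, then build an
# editable data dictionary. The dictionary structure here is illustrative.
from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess)

with open('resource.csv', 'rb') as f:
    row_set = CSVTableSet(f).tables[0]
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))
    guessed_types = type_guess(row_set.sample, strict=True)

# users could then override the guessed types and add descriptions/aliases
data_dictionary = [{'id': h, 'type': str(t), 'description': ''}
                   for h, t in zip(headers, guessed_types)]
```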

Pass 2:

  • push CSV to datastore (consider using the Postgres COPY command? a sketch follows this list)
  • data should be pushed more reliably as a result of Pass 1
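
A minimal sketch of the COPY idea using psycopg2; the table name, file name, and connection string are placeholders, and the target table is assumed to already exist with the types chosen in Pass 1:

```python
# Sketch only: bulk-load a CSV into an existing datastore table via COPY.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur, open('resource.csv') as f:
    cur.copy_expert(
        'COPY "resource-table" FROM STDIN WITH (FORMAT csv, HEADER true)', f)
```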

Pass 3 (optional):

  • run the ANALYZE command on the resulting datastore table (helps with compiling descriptive stats; see the sketch after this list)
  • compile descriptive statistics, e.g.
    • top N values
    • range of values (e.g. range of values is 0-1000; 1 thru 5; dates are from 1990-01-01 to 1991-12-31, etc.)
    • sparseness of column
    • histogram of values in column (à la OpenRefine)
    • if ANALYZE command was run, maybe even bring in some stats computed by Postgres also (e.g. n_distinct, most_common_vals, most_common_freqs, etc.)
  • amend computed data dictionary with descriptive statistics
  • populate some Project Open Data fields like the temporal field (e.g. start and end date of applicability for the data)
  • optionally create indices
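
A hedged sketch of the optional ANALYZE step and of pulling the Postgres planner statistics mentioned above; the table name is a placeholder:

```python
# Sketch only: run ANALYZE, then read n_distinct / most_common_vals /
# most_common_freqs from pg_stats to fold into the data dictionary.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute('ANALYZE "resource-table"')
    cur.execute("""
        SELECT attname, n_distinct, most_common_vals::text, most_common_freqs
        FROM pg_stats
        WHERE tablename = %s
    """, ('resource-table',))
    for column, n_distinct, common_vals, common_freqs in cur.fetchall():
        print(column, n_distinct, common_vals, common_freqs)
```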

Pass 4, etc.

  • data pusher extensions 😄 installed on the instance can do more instance-specific stuff (e.g. entity resolution, reconciliation, geocoding, enrichment, etc.)

BENEFITS:

  • a computed data dictionary that data consumers can use
  • computed metadata to help populate POD/DCAT fields
  • data dictionary can be used for automated mapping between tables - taking advantage of Postgres/PostGIS capabilities (real joins, storing queries as views, materialized views, computed columns, storing extras as JSONB, querying JSONB, etc.) and well-known DB management techniques (e.g. having read-only replicas to support high-volume API use)
  • takes CKAN beyond just cataloging to become fundamental open data infrastructure on which enterprise-class, open-data-driven applications can be built
@rufuspollock
Member

@jqnatividad first I note this is really about creating a proper ETL platform for CKAN - see e.g. #18. Some thoughts follow:

Connect / Reuse Frictionless Data and Data Package

Connect this with the Frictionless Data / Data Package work - that already has much of what is needed here (notably JSON Table Schema).

What Workflow We Want

Overall the workflow is something like this (see also the "perfect workflow" section in this OpenSpending user story):

  • Give me a CSV / Excel file
  • I guess the schema (JSON Table Schema) and present it to you with some of your data in a nice way - e.g. I show you the table in a table structure with drop-downs for column types, the ability to add descriptions, etc. (an example schema follows this list)
  • you edit and confirm
  • we check that schema against the data and report issues
  • import happens
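
For reference, an illustrative JSON Table Schema of the kind that would be guessed and then confirmed by the user; the field names, descriptions, and constraints here are made up:

```python
# Illustrative only: a guessed schema the user would edit and confirm.
schema = {
    'fields': [
        {'name': 'station_id', 'type': 'string',
         'description': 'Weather station identifier'},
        {'name': 'observed', 'type': 'date'},
        {'name': 'temperature_c', 'type': 'number',
         'constraints': {'minimum': -60, 'maximum': 60}},
    ]
}
```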

Aside: I note we have a working spike of this approach for command line only related to OpenSpending: https://github.com/openspending/oscli-poc

Aside: I also assume folks saw things like http://okfnlabs.org/blog/2014/09/11/data-api-for-data-packages-with-dpm-and-ckan.html - this is an automated load of data into the CKAN DataStore, with type guessing ...

Focus Right Now

Right now my suggestion would be to:

  • Agree on using JSON Table Schema for the data dictionary
  • Agree on a convention for where we store the JSON Table Schema in the resource attributes - i.e. simply agree on a given "extra" field name where we store this (a sketch follows this list)
  • Describe the user stories and properties of the "ETL" service you want, and leave open the question of whether this is datapusher or something else
    • personally I think you really want something other than datapusher - or multiple routes (e.g. command line, OpenRefine, etc.)
    • e.g. on the frictionless data front we've focused on node / js, as your creator app can then be embedded anywhere
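
As a sketch of the second bullet, assuming the convention is simply a "schema" extra field on the resource - the field name, site URL, API key, and resource id below are all placeholders, and resource_patch assumes a reasonably recent CKAN:

```python
# Sketch only: store the agreed JSON Table Schema in a resource extra field
# via the CKAN API. "schema" is just the proposed field name, not a standard.
import json
import ckanapi

ckan = ckanapi.RemoteCKAN('https://demo.ckan.org', apikey='API-KEY')
ckan.action.resource_patch(
    id='RESOURCE-ID',
    schema=json.dumps({'fields': [{'name': 'station_id', 'type': 'string'}]}))
```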

Descriptive Statistics

Definitely useful but probably a separate activity. Again I would connect with the Tabular Data Package stuff - see e.g. http://data.okfn.org/roadmap#tabular-stats-tool

General Enrichment Services

Again, fantastic - let's just create a model for these. We probably want integrated UX but separate services - i.e. this runs "standalone" from an implementation perspective but the UX is integrated. This has been discussed for a number of services already, e.g. the link checker (dead or alive), and is key for datapusher generally, I think.

@wardi
Contributor

wardi commented May 30, 2015

I'm thinking of a simple extensible version of this:

  1. Raw/schemaless datastore import by adding support for postgres arrays to datastore field types
    • import complete XLS or CSV contents into a datastore table as a single column containing a postgres array of strings (see the sketch after this list)
    • guaranteed to work for any XLS or CSV
    • fast: use COPY in large batches of rows
    • collect and update minimal information while copying: number of columns, min/max float/int value seen for each column (or None if a column contains entries invalid for the type); this could happen entirely client-side (datapusher)
  2. datastore_alter: Alter table in postgres
    • copy data from raw array into real columns with types, apply indexes etc.
    • fast because it's all in postgres operating on data it already has (no looping in python)
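
A rough sketch of the two steps in SQL (driven from Python here); datastore_alter is a proposed action, not an existing one, and the table names, column names, and types are placeholders:

```python
# Sketch only: step 1 creates a schemaless table of text arrays (rows would
# be COPYed in as array literals); step 2 is the proposed "datastore_alter",
# casting into real typed columns entirely inside Postgres.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute('CREATE TABLE raw_resource (_row text[])')
    # ... COPY raw rows into raw_resource here ...
    cur.execute("""
        CREATE TABLE typed_resource AS
        SELECT _row[1]::text    AS name,
               _row[2]::integer AS amount,
               _row[3]::date    AS observed
        FROM raw_resource
    """)
```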

Building a nice interface to allow users to do fancy things like confirming column types or ETL to clean up data can happen after we have these low-level actions in place.

related tickets: #123 #124

@jqnatividad
Contributor Author

Thanks @rgrp, @wardi for your feedback.

This seems to be a recurring requirement as evidenced by the multiple references in this repo.

Combined Workflow

Maybe we can combine approaches, using pgloader.io instead of native COPY, while keeping the new workflow as close as possible to the current CKAN 2.3 upload workflow?

  1. Give me a CSV / Excel file
  2. Upload raw file to Filestore as is

no change to current CKAN workflow, so far

On the Manage/DataStore tab of the Resource:

  1. I guess the schema (JSON Table Schema) and present it to you with some of your data in a nice way - e.g. I show you the table in a table structure with drop-downs for column types, the ability to add descriptions, etc.
  2. you edit and confirm the schema using a simple interface. The schema definition is stored in a "JSON Table Schema" resource attribute extra field. (Aside: perhaps using PostgreSQL 9.4's new JSONB data type?)
  3. The schema definition is used to generate pgloader.io commands (see the sketch below)
  4. the import is done using pgloader.io
  5. based on pgloader.io results, report issues, and optionally go back to step 2 if the load failed
  6. if pgloader.io is successful, create a "Data Dictionary" resource view that renders the JSON Table Schema in a nice human-readable format.

only modifying the DataStore workflow
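
To make step 3 concrete, a hedged sketch of generating a pgloader LOAD CSV command from the stored schema. The JTS-to-Postgres type mapping, file paths, table name, and connection string are all placeholders, and the DSL shown is only an approximation - the pgloader docs are the authority on its syntax:

```python
# Sketch only: render a pgloader command file from the stored JSON Table
# Schema. Consult the pgloader docs for the authoritative DSL syntax.
JTS_TO_PG = {'string': 'text', 'integer': 'bigint',
             'number': 'numeric', 'date': 'date'}

def make_load_command(schema, csv_path, table):
    columns = ',\n           '.join(
        '{0} {1}'.format(f['name'], JTS_TO_PG.get(f['type'], 'text'))
        for f in schema['fields'])
    return """
    LOAD CSV
         FROM '{csv}'
         INTO postgresql:///datastore_default?{table}
         WITH skip header = 1,
              fields terminated by ','
    BEFORE LOAD DO
    $$ create table if not exists {table} (
           {columns}
       );
    $$;
    """.format(csv=csv_path, table=table, columns=columns)
```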

In this way, we leverage not only Frictionless Data/Data Package, but also pgloader.io, which seems to be an excellent way to async-load data into postgres, even better than the native COPY command. I consciously aligned my user story with the current CKAN upload workflow, as I think it needs to be refactored insofar as the Datapusher/Datastore is concerned.

Descriptive Stats

And as @rgrp suggested, the descriptive stats stuff can come later, but once a dataset is loaded into Postgres, computing these stats is pretty straightforward with SQL (e.g. min, max, distinct, etc.). The JSON Table Schema field constraints can even be repurposed to show the stats, though they are not really "constraints" per se, but a description of the loaded table. I still see this as a project related to the work above, for a future iteration.
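
A minimal sketch of the kind of SQL meant here; the table and column names are placeholders, and the FILTER clause needs Postgres 9.4+:

```python
# Sketch only: basic descriptive stats straight from SQL aggregates.
import psycopg2

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT min(amount),
               max(amount),
               count(DISTINCT amount),
               count(*) FILTER (WHERE amount IS NULL) AS missing
        FROM "resource-table"
    """)
    print(cur.fetchone())
```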

General Enrichment Services

An extensible framework would be great! Once we have a more robust way of onboarding data into the Datastore, having an enrichment framework would really go a long way towards enhancing CKAN as open infrastructure on top of which enterprise-class solutions can be built. #dogfooding 😉

And I can envision a whole class of data enrichment services that the community can develop. This deserves its own ticket, but only makes sense once we have a more robust datapusher, especially since the enrichment services will actually require the JSON Table Schema.

So the JSON Table Schema is not only there to support the Data Dictionary view and the schema definition interface at load time; it will also support these enrichment services.

@wardi
Contributor

wardi commented Jun 2, 2015

@jqnatividad I like it.

Can we integrate with pgloader safely via the command line, or is it going to take lots of fragile script generation and output parsing?

@jqnatividad
Contributor Author

@wardi, only one way to find out 😄 I only found out about pgloader.io this week, but it seems to be widely used and the author is a major PostgreSQL contributor.

As the current datapusher is a stand-alone CKAN service and runs asynchronously, I think it's a natural fit. The pgloader CLI options seem to be quite extensive, and maybe we can have a template DSL we can parameterize, and create a next-gen datapusher using https://github.com/ckan/ckan-service-provider.
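
Assuming the parameterized command-file approach, the service job itself could be little more than shelling out to the pgloader CLI and surfacing its output; a sketch with deliberately minimal error handling, and the load-file name is a placeholder:

```python
# Sketch only: run pgloader on a generated load file and report failures
# back to CKAN so an admin can fix the schema and retry.
import subprocess

result = subprocess.run(['pgloader', 'resource.load'],
                        capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError(result.stderr or result.stdout)
```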

BTW, found this paper by the pgloader.io author that's quite interesting - http://rmod.lille.inria.fr/archives/dyla13/dyla13_3_Implementing_pgloader.pdf

@jqnatividad
Contributor Author

As a side benefit of using pgloader, we get to support additional file formats like fixed-width, Postgres COPY, dBase, IXF, and SQLite.

It's also interesting that pgloader can directly connect to MySQL and MS SQL Server using a connection string. For data publishers, this is a great way to direct-load data from transactional systems, skipping an intermediate ETL step.

Perhaps pgloader support in ckanapi and/or ckan-import as well? It could be a thin wrapper that directly uses the pgloader DSL and just uses the Datastore API to register the resource with CKAN and associate it with a package.

@maxious
Member

maxious commented Jun 11, 2015

Some interesting projects from GitHub:
https://github.com/opencivicdata/pupa is a Python library/DSL for scraping HTML pages in an ETL way
https://github.com/jdorn/json-editor is a Javascript library for generating UIs to edit JSON based on JSON Schema definitions

@jqnatividad
Contributor Author

And the pieces are starting to fall into place. json-editor looks really cool! @rgrp mentioned that schema editing is coming soon. Should we wait for that, or should we use/leverage json-editor instead?

@rufuspollock
Member

@jqnatividad I think we should still get a proper architecture diagram here and really agree on how the different pieces fit together, so people can then go off and work on specific bits. As I said, I'd suggest using JSON Table Schema as a core piece which other bits can then work off. Also cc'ing @pwalsh as he is doing a lot of work on Data Package and Frictionless Data stuff with Open Knowledge atm.

@pwalsh
Member

pwalsh commented Jun 14, 2015

Interesting discussion.

About schema editing - yes, we are working on one right now as a fairly generic interface to edit Data Package schemas, and we do use json-editor for that: https://github.com/okfn/datapackagist and http://datapackagist.okfnlabs.org/ - we have some UX work to do there, but functionality-wise there is quite a complete application there already. It is still WIP so YMMV.

As it might not be immediately apparent from the DataPackagist UI, there are additional things here beyond schema editing as such.

For JSON Table Schema, https://github.com/okfn/json-table-schema-js and https://github.com/okfn/json-table-schema-py (jtskit on PyPI, as we only just last week took over the json-table-schema package) are libs for handling JTS, including inferring a schema from data. These are more in line with the current JTS spec than, say, the JTS features in messytables.

Also, it might be interesting to note here the differences between GoodTables and MessyTables.

MessyTables tries to work with anything you throw at it, which is a great approach when you are working with known messy data and want to fix it in an automated fashion, or when you don't care about the quality of the actual source file as long as you can get data out of it.

GoodTables takes a different approach, where we do want to ensure that the source data is well formed, and if it is not, we want specific notifications about where and how it is not, so we can take action to fix the source. GoodTables is implemented as a processing pipeline and has two processors at present: one for structural validation, and one for validation against a JSON Table Schema. It is possible to hook in custom processors, although admittedly the docs are lacking in that department.

@jqnatividad
Contributor Author

Sorry, life happened (vacation, catch-up, work) and I got disconnected.

Would love to pick this up again, and as @rgrp suggested, go about speccing it out so folks can start work.

Will take a stab at a rough spec on GDocs and link it here so we can collaborate on it.

@jqnatividad
Contributor Author

On 16 July 2015, the CSV on the Web Working Group published Candidate Recommendations for Tabular Data on the Web.

How should this inform the proposed work?

@rufuspollock
Member

I don't think this changes things - it was close to (and based on) the Tabular Data Package, and the key thing we probably want here is the JSON Table Schema part, which we have factored out.

@jqnatividad
Contributor Author

Touching this issue to highlight https://github.com/timwis/csv-schema from @timwis.

timwis/csv-schema#4

@davidread

Here's a variant that @wardi and I are looking at, to tackle problems getting the column types right during import with DataPusher:

  1. Detect CSV properties (encoding, CSV dialect, number of header lines, columns renamed in datastore, column types) using messytables. Store most of these as resource extras (the column types will be stored in postgres in the next step).
  2. Use Postgres' COPY command to quickly import the CSV into DataStore. Avoid value type errors by importing all columns as VARCHAR at this point.
  3. Try to convert each column to its preferred type (see the sketch after this list). If a column fails, it fails, but at least the data is in DataStore.
  4. DataStore is told about the new table in its DB, and so it is made available to users.
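
A hedged sketch of step 3: converting columns in place and falling back to text when a cast fails. The table name and the type mapping are placeholders:

```python
# Sketch only: per-column type conversion with a safe fallback. Identifier
# interpolation via format() is fine for a sketch but not injection-safe.
import psycopg2

preferred_types = {'observed': 'timestamp', 'amount': 'numeric'}  # placeholder

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
for column, pg_type in preferred_types.items():
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                'ALTER TABLE "resource-table" '
                'ALTER COLUMN "{col}" TYPE {typ} USING "{col}"::{typ}'
                .format(col=column, typ=pg_type))
    except psycopg2.DataError:
        # the cast failed: leave this column as text so the data stays usable
        pass
```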

This would be done in a new extension, a replacement for ckanext-datapusher. The jobs would be run on a queue (ckanext-rq). I believe the last step will need a new action function in DataStore, as a replacement for datastore_create but without actually storing the data.

Admins can look at the resource's DataStore tab and, along with seeing the status & log of the import, fix things like the encoding, column types (e.g. phone numbers need to be strings to keep the leading zero), etc., and trigger a reimport.

@jqnatividad
Contributor Author

Closing this now that @davidread wrote Express Loader (https://github.com/davidread/ckanext-xloader) :)
