Update Datastore Datapusher documentation to mention Express Loader #4415
Comments
@jqnatividad This seems sensible, given that I'd not heard of express-loader until just now. Of course, I'm apparently not as informed as I should be, but I spend most of my time in the docs. |
I'm supportive of this. Since xloader was launched it's been used in a dozen or so CKANs in OpenGov. And we had some good validation of it being used and improved (PR1 PR2 by @metaodi, who switched the Open Data Zurich site from datapusher to xloader). I have two concerns about xloader being the recommended choice:
It would be good to get comments from @amercader @wardi and the rest of the tech team on this. If we make xloader the official choice, we should also move it from github.com/davidread/ckanext-xloader to github.com/ckan/ckanext-xloader (I am supporting it). |
As all our sites have the Datastore installed, before Express Loader, the single largest source of ongoing support issues was the Datapusher. One day, it will work fine without a hitch (albeit very slowly), and then the following day, it won't because of some data quality issue in the last row of a large CSV, which aborts the whole job. That is even more frustrating for large CSVs, as the failure only shows up hours into the datapusher job! @davidread, your first concern is a feature, not a bug. Uploading everything as text and casting the column after it's loaded is a much better workflow, with more graceful error handling (thanks to the more verbose logging messages of xloader too) in my book. As for your second concern, we have some insight that we add to ckan/ckanext-xloader#41 that should help. |
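A minimal sketch of the "load as text, cast afterwards" workflow described above, to show why a bad value no longer aborts the whole load. It is plain Python over an in-memory CSV, not xloader's actual implementation (which works against Postgres); `load_as_text` and `cast_column` are hypothetical names used only for illustration.

```python
import csv
import io


def load_as_text(csv_text):
    """Load every cell as a string, so no row can fail on type grounds."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def cast_column(rows, column, caster):
    """Try to cast one column after loading.

    Returns a list of (line_number, bad_value) pairs. If the list is
    non-empty the column is left untouched as text, so the data loaded
    earlier stays usable and the problems can be reported in one go.
    """
    converted, errors = [], []
    # Line 1 is the header, so data rows start at line 2
    # (assumes no multi-line quoted fields).
    for line_no, row in enumerate(rows, start=2):
        try:
            converted.append(caster(row[column]))
        except ValueError:
            errors.append((line_no, row[column]))
    if not errors:
        for row, value in zip(rows, converted):
            row[column] = value
    return errors


rows = load_as_text("id,amount\n1,10\n2,20\n3,n/a\n")
problems = cast_column(rows, "amount", int)   # reports the bad last row
```

The design point is that the expensive step (getting the rows in) always succeeds, and the cheap step (casting) can be retried after the data is fixed or a different type is chosen.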
I've had likewise experiences, and not just with CKAN. Guessing column types always causes more problems than it solves, especially when it starts doing things like truncating 0s because it thinks that UPC is a number. IMO the "better" solution is to continue expanding the data dictionary feature. We should only load the data after the data dictionary has been defined. Load the first couple of rows using something like papaparser to get column names and type hints. We could also allow selecting previously defined data dictionaries (for example we have one site that uploads many datasets of energy usage samples, one per province. The data dictionary would be identical for every one). The pros as I see it:
|
@jqnatividad switching away from datapusher should be trivial, xloader is mostly a drop-in replacement. I would vote for immediate deprecation and removal on the next release unless there's a strong reason to keep it. We would backport security fixes only. Replace the documentation with xloader along with a footnote to see earlier versions of the documentation for help with legacy datapusher installs. |
@TkTech, xloader's genesis was this discussion 3 years ago :) And I'm all for expanding the data dictionary too to help not just with xloading, but as a way to get descriptive statistics metadata as well. But instead of using csvkit to compile the descriptive statistics, leverage the pg_statistic_ext statistics compiled by Postgres after an ANALYZE. |
Regarding the statistics, I'm all for it but rather than using csvkit and an asynchronous job it should be just a part of the loading process. Once the data is in Postgres we can easily get all of that information without having to use something like csvkit.
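A rough sketch of what "getting that information from Postgres" could look like. It reads the standard `pg_stats` system view, which Postgres fills in when `ANALYZE` runs on a table. The function name `column_statistics` is hypothetical, and the `%s` placeholders assume a psycopg2-style DB-API cursor; nothing here is part of xloader or CKAN.

```python
# Per-column statistics that Postgres maintains after ANALYZE.
# most_common_vals has type anyarray, so it is cast to text for transport.
PG_STATS_QUERY = """
    SELECT attname, null_frac, n_distinct, most_common_vals::text
    FROM pg_stats
    WHERE schemaname = %s AND tablename = %s
"""


def column_statistics(cursor, table, schema="public"):
    """Return {column_name: stats} for one table.

    `cursor` is any DB-API cursor connected to the datastore database
    (assumed, not shown). ANALYZE must have run on the table first,
    otherwise pg_stats has no rows for it and the result is empty.
    """
    cursor.execute(PG_STATS_QUERY, (schema, table))
    return {
        name: {
            "null_frac": null_frac,          # fraction of NULL entries
            "n_distinct": n_distinct,        # negative means a ratio of row count
            "most_common_vals": common_vals,
        }
        for name, null_frac, n_distinct, common_vals in cursor.fetchall()
    }
```

Because these numbers are a by-product of `ANALYZE`, they are estimates based on a sample, which is usually fine for descriptive metadata but not for exact counts.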
|
@TkTech LOL! Our comments re using Postgres statistics crossed paths in the internets :) |
I like the idea of simply including the data dictionary workflow as an obvious next step in the resource addition process. (Create dataset -> add resource -> define dictionary) Rather than loading just a few rows, or maybe in addition to that, we could try and surface data validation exceptions on the client before upload. For example, a user uploads a 100MB+ csv, and we gradually validate the columns to surface any data that wouldn't work given the user's selected datatype. |
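To make the gradual validation idea concrete, here is a small sketch of the logic: stream the rows, check each value against the type the user picked, and surface problems as they are found. It is written in Python purely to illustrate the logic; a real pre-upload check would run in the browser. `CASTERS` and `validate_stream` are hypothetical names, and the type names are assumptions rather than the Data Dictionary's actual vocabulary.

```python
import csv
import io
from datetime import date

# Assumed mapping from a user-selected type name to a parsing function.
CASTERS = {
    "text": str,
    "integer": int,
    "numeric": float,
    "date": date.fromisoformat,   # expects YYYY-MM-DD
}


def validate_stream(lines, schema, max_errors=10):
    """Yield (line_number, column, value, expected_type) for each bad cell.

    `lines` is any iterable of CSV lines, so the file can be checked
    incrementally instead of being read into memory. Validation stops
    after `max_errors` problems so the user gets fast feedback.
    """
    reader = csv.DictReader(lines)
    reported = 0
    for row in reader:
        for column, type_name in schema.items():
            value = row.get(column)
            if value in (None, ""):
                continue  # treat empty cells as missing, not as errors
            try:
                CASTERS[type_name](value)
            except ValueError:
                yield (reader.line_num, column, value, type_name)
                reported += 1
                if reported >= max_errors:
                    return


sample = io.StringIO("day,price\n2018-01-25,1300.5\n25/01/2018,1310\n")
for problem in validate_stream(sample, {"day": "date", "price": "numeric"}):
    print(problem)
```

Being a generator, it can feed a progress indicator and be abandoned early without scanning the rest of the file.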
I'm fine with this as long as it's 100% client side and running transparently in the background, but I'd give it a pretty low priority given restricted time. Reading the first few rows is just to populate the initial column names and type guesses on the form. |
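As an illustration of reading just the first few rows to pre-fill the form, the sketch below returns column names with a type hint each. The hinting rules are assumptions made for this example (not CKAN or xloader behaviour), including one rule for the UPC problem mentioned earlier in the thread: digit strings with a leading zero are kept as text. `guess_hint` and `peek_columns` are hypothetical names.

```python
import csv
import io
import itertools


def guess_hint(values):
    """Suggest 'integer', 'numeric' or 'text' for a list of string cells."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    # Identifiers such as UPCs look numeric but lose their leading
    # zeros if stored as numbers, so keep them as text.
    if any(len(v) > 1 and v[0] == "0" and v.isdigit() for v in non_empty):
        return "text"
    for hint, caster in (("integer", int), ("numeric", float)):
        try:
            for v in non_empty:
                caster(v)
            return hint
        except ValueError:
            continue
    return "text"


def peek_columns(csv_text, sample_size=5):
    """Read only the header and the first `sample_size` rows.

    Returns {column_name: hint}. The hints are suggestions for the
    user to confirm or override, not a final decision.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = list(itertools.islice(reader, sample_size))
    return {
        name: guess_hint([row[i] for row in sample if i < len(row)])
        for i, name in enumerate(header)
    }


hints = peek_columns(
    "upc,qty,price,name\n"
    "012345678905,3,1.50,Widget\n"
    "036000291452,10,2,Gadget\n"
)
# hints == {"upc": "text", "qty": "integer", "price": "numeric", "name": "text"}
```

Since only a handful of rows are sampled, a later row can still contradict the hint, which is why the result only populates the form rather than deciding the column type.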
Thanks all for this very useful discussion. There's lots to unpack here so I'll summarize my views separately.
|
Hi @amercader, I've been tracking the Frictionless Data project and we've been planning on installing ckanext-datapackager as a standard extension especially since we've announced support for Fiscal Data Package and Data Packages in general. Can ckanext-validation and ckanext-datapackager coexist? |
+1M to data schemas before uploading data. open.canada.ca has been doing something like this internally with ckanext-recombinant for a couple of years now, and we have departments lined up to use it because the users love it and the data quality is awesome. Recombinant benefits:
Recombinant drawbacks:
Data schemas might benefit from some of the lessons we've learned from recombinant, e.g.
|
This got lost in my long-winded comment: ckanext-validation and the approach @amercader describes look awesome and super useful without any changes! Wait, is this an issue about xloader? Yes, I think we should recommend xloader over datapusher :-) |
Covering off the caveats that @amercader raises:
i. No type guessing potentially affecting visualizations. There doesn't seem to be a problem with it interpreting the text as numbers. However, there is an issue with dates (e.g. '2018-01-25'), which end up as just a year, distorting the gold prices example. So this seems a concern for datasets which are time series like this one.
ii. Private dataset support. I've now tested this and it works fine with private datasets. 👍
iii. Performance issues.
Indeed. I'll look at this more now though. |
I've finished the work on CKAN performance that was the key concern on this ticket. Yes there is the concern about recline not graphing dates until the Data Dictionary is edited to say they are a date. But I reckon we can live with that for the improved loading. (And btw I'm fine with renaming Data Dictionary to Data Schema - I think the latter is more commonly used.) I'll do a PR to change the recommendation in the docs from DataPusher to Express Loader. I'll also look at moving Express Loader into the Datastore extension (i.e. in the core ckan repo), as @amercader suggests. I'll ensure there is an option to disable it, so that datapusher or some other loader can still be used. |
This PR fixes the issue: ckan#4415
I have updated and mentioned |
The current Datastore documentation makes no mention of Express Loader (https://github.com/davidread/ckanext-xloader), which is a marked improvement over Datapusher.
https://ckan.org/2017/11/20/express-loader-for-ckan-datastore/
For legacy purposes, I guess Datapusher needs to be maintained. But for new installations, shouldn't Express Loader be the recommended mechanism?
cc @davidread