DataStore to CSV service, for download of large resources. #34
Comments
+1
This sounds most useful.
+1 - I think this seems sensible. Shouldn't be hard to link this with the FileStore so one auto-pushes there ...
This sounds like it could use a Celery-like service, which we're talking about in #66.
Bumping this up now that we have background jobs (http://docs.ckan.org/en/latest/maintaining/background-tasks.html). Instead of the current approach, we could run a native Postgres COPY command asynchronously, which is much faster, doesn't need to load everything into memory, and skips the 100,000-row limit. We have several clients who want to use the DataStore transactionally for large resources, and the current dump mechanism is presenting a problem. If we do implement it as a background task, I also suggest putting a caching mechanism in place so that this relatively expensive process is not started unnecessarily when the table has not changed. I'd also add a way to export to formats other than CSV (e.g. JSON, XLSX, XML), since other data portal solutions allow files to be uploaded as CSV and downloaded in different formats, perhaps using tools like https://github.com/lukasmartinelli/pgclimb. cc @wardi @amercader
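A rough sketch of the COPY idea above, written as a function that could run inside a CKAN background job. The connection URL, resource id, and output path are placeholders, and the assumption that the DataStore table is named after the resource id follows CKAN's DataStore convention; this is not an existing CKAN function.

```python
# Sketch only: stream a whole DataStore table to CSV with Postgres COPY,
# so nothing is buffered in Python and there is no row limit.
import psycopg2


def dump_resource_to_csv(datastore_url, resource_id, output_path):
    """Write the DataStore table for `resource_id` to `output_path` as CSV."""
    conn = psycopg2.connect(datastore_url)
    try:
        with conn.cursor() as cur, open(output_path, 'w') as f:
            # COPY streams rows straight into the file object, with Postgres
            # doing the CSV formatting.
            cur.copy_expert(
                'COPY "{0}" TO STDOUT WITH CSV HEADER'.format(resource_id),
                f,
            )
    finally:
        conn.close()
```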
Streaming data on request is a nice approach too. That gives you live data and doesn't multiply your storage requirements. Edit: I've found openpyxl dumps XLSX data quite efficiently and has constant memory overhead with its write-only mode.
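For reference, a minimal sketch of openpyxl's write-only mode, assuming `rows` is any iterator over DataStore records (e.g. rows already streamed out of Postgres):

```python
# Sketch: write XLSX with openpyxl's write-only workbook, which appends
# rows to disk instead of keeping the whole sheet in memory.
from openpyxl import Workbook


def rows_to_xlsx(header, rows, output_path):
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title='data')
    ws.append(header)        # in write-only mode rows must be appended in order
    for row in rows:
        ws.append(row)
    wb.save(output_path)
```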
https://github.com/frictionlessdata/jsontableschema-py and https://github.com/frictionlessdata/tabulator-py can be used for various aspects of this.
Here's a simple fix that reduces memory usage and allows large CSV dumps from the DataStore: ckan/ckan#3344
Explanation:
I would like to use the DataStore via the API as the primary data source. This already works without a problem.
However, if people want to download the entire resource as a CSV via /dump/, it only downloads 100K records (this limit is hardcoded into CKAN).
It also takes quite a long time to generate the CSV file.
I have resources with over 10 million rows and would like to offer a complete download via CSV, but changing the hardcoded 100K row limit puts a lot of pressure on the system.
It would be very nice to have a feature where writes to the DataStore via the API would also update a corresponding CSV file for download, so the download wouldn't need to generate the file on the fly.
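One hypothetical way to wire up the requested behaviour, assuming a recent CKAN with chained actions and background jobs: intercept DataStore writes and enqueue a job that refreshes a pre-generated CSV. The plugin class, the job function, and the choice to hook `datastore_upsert` are illustrative only, not an existing extension.

```python
# Hypothetical sketch: keep a cached CSV in sync with DataStore writes.
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit


def regenerate_csv(resource_id):
    # Placeholder job body: call something like dump_resource_to_csv() from
    # the COPY sketch above to rewrite the cached CSV for this resource.
    pass


@toolkit.chained_action
def datastore_upsert(original_action, context, data_dict):
    result = original_action(context, data_dict)
    # Re-export after the write succeeds; the job runs outside the request.
    toolkit.enqueue_job(regenerate_csv, [data_dict['resource_id']])
    return result


class CsvExportPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IActions)

    def get_actions(self):
        return {'datastore_upsert': datastore_upsert}
```

A variation on the same idea would be to skip the job when the table hasn't changed since the last export, matching the caching suggestion earlier in this thread.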