Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve header row handling in multipart tabular resource chunks #256

Closed
paulgirard opened this issue Feb 5, 2020 · 0 comments · Fixed by #257
Closed

Improve header row handling in multipart tabular resource chunks #256

paulgirard opened this issue Feb 5, 2020 · 0 comments · Fixed by #257

Comments

@paulgirard
Copy link
Contributor

Overview

A multipart resource concatenates chunks as-is which is what we want for generic binary files.
But for tabular resource this default behaviour implies that if the first chunk does have one header row, the other chunks must not.
We discussed this issue here: https://github.com/frictionlessdata/forum/issues/1
The proposition is to change this behaviour for tabular-data-package: tabular chunks should all have headers or none depending on dialect.header.

Implementation

Multipart chunks are handled by the _MultipartSource class which build an iterator of chunks' rows iterator. My approach is simply to discard first row of chunks (but the first) iterator when the resource is tabular with header.
@roll pointed possible issues with:

  • resource.raw_read: I've checked this method uses _MultipartSource iterator so should be safe
  • datapackage.save: I am not sure to understand the risk in there since data are only read from resources. Saving only writes the datapackage not the data itself right ?

@pwalsh and @akariv your points of view and advices are more than welcome.


Please preserve this line to notify @roll (lead of this repository)

paulgirard added a commit to paulgirard/datapackage-py that referenced this issue Feb 5, 2020
@roll roll changed the title header row in multipart tabular resource chunks Improve header row handling in multipart tabular resource chunks Mar 25, 2020
@roll roll closed this as completed Apr 6, 2020
@roll roll reopened this Apr 6, 2020
@roll roll added review and removed review labels Apr 20, 2020
@roll roll closed this as completed in #257 Aug 14, 2020
roll pushed a commit that referenced this issue Aug 14, 2020
* discarding header row of tabular multipart chunks
relates #256

* using streams correctly in python 3.x

* Remove first row in chunk only if != header\n see https://github.com/frictionlessdata/datapackage-py/pull/257#pullrequestreview-375804017\n and https://github.com/frictionlessdata/forum/issues/1

* bug in row iteration

* better warning message

* use iterator for streams in multipart fixes #246

* linter compatible
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants