Reading of a data source #11

Closed
6 tasks done
loleg opened this issue Nov 28, 2017 · 7 comments
Collaborator

loleg commented Nov 28, 2017

In addition to the basic streamed loading of data sources into a Table as per #6, provide the ability to read through a Table with cast on iteration so that the individual cells of the table are formatted according to the Schema provided.

  • A read API* should be accessible for a Table class.
  • Standard library CSV loading and external loading are both supported.
  • Headers are loaded and validated.
  • Errors from loading are handled in a standard way.
  • Tests cover the basic design and several reading scenarios.
  • Tests cover cast on iteration, including exception handling.
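
To make the checklist concrete, here is a minimal Python sketch of "cast on iteration" with header validation and a single error type. The names `iter_cast`, `CastError` and `CASTERS`, and the reduced schema shape, are assumptions made for illustration; they are not the API of tableschema-py or TableSchema.jl.

```python
import csv
import io

# Hypothetical casters keyed by schema field type (illustrative only).
CASTERS = {"integer": int, "number": float, "string": str}


class CastError(ValueError):
    """Raised when headers or cells do not conform to the schema."""


def iter_cast(source, schema):
    """Yield rows from CSV text, casting each cell according to the schema."""
    reader = csv.reader(io.StringIO(source))
    headers = next(reader, None)
    expected = [field["name"] for field in schema["fields"]]
    if headers != expected:
        raise CastError(f"headers {headers} do not match schema {expected}")
    # Data rows start on line 2 because line 1 holds the headers.
    for row_number, row in enumerate(reader, start=2):
        cast_row = []
        for field, cell in zip(schema["fields"], row):
            try:
                cast_row.append(CASTERS[field["type"]](cell))
            except ValueError as exc:
                raise CastError(
                    f"row {row_number}, field {field['name']!r}: {exc}"
                ) from exc
        yield cast_row
```

Because the function is a generator, a bad cell only raises once iteration reaches that row, which is the behaviour the last checklist item asks the tests to cover.
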
Member

roll commented Nov 29, 2017

@loleg
Take a look at the reading capabilities of the Table class: https://github.com/frictionlessdata/tableschema-py#table

It reads a data source and casts the data according to the schema (if provided). It also doesn't load the whole data file into memory; it streams one row at a time.
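
As a rough illustration of the row-at-a-time behaviour described above, the following Python sketch uses only the standard `csv` module. The helpers `stream_rows` and `counting_lines` are hypothetical names introduced here to make the laziness observable; they are not part of tableschema-py.

```python
import csv


def stream_rows(lines):
    """Yield parsed CSV rows lazily from any iterable of text lines."""
    for row in csv.reader(lines):
        yield row


def counting_lines(text, counter):
    """Yield lines of text, recording how many have been consumed so far."""
    for line in text.splitlines(keepends=True):
        counter["read"] += 1
        yield line


counter = {"read": 0}
rows = stream_rows(counting_lines("a,b\n1,2\n3,4\n", counter))
first = next(rows)  # only the first line has been pulled from the source here
```

The counter shows that nothing is read until a row is requested, so memory use stays bounded by the size of one row rather than the size of the file.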

loleg added a commit to loleg/TableSchema.jl that referenced this issue Dec 5, 2017
Collaborator Author

loleg commented Dec 7, 2017

As per #4 I think we should continue discussion of streaming library choice here. In reviewing the implementation notes, I wondered:

iter(keyed/extended=False, cast=True, relations=False) -> (generator) (keyed/extended)row[]

Is this really a generator, or an iterator? In Julia, iterators are made by applying protocol interfaces to the type.

Since in our case generators are being used for streaming data, I am inclined to think that Channels would be more appropriate and performant.

But the most sensible course of action right now is probably to follow whatever the data-loading approach we adopt recommends.

Member

roll commented Dec 8, 2017

@loleg
I think technically it's a generator: a function we need to call to get an iterator. That's because we need the ability to read the data source multiple times. But TBH I don't think it's a good idea to rely too much on this terminology, because it can differ from language to language.

rows = table.iter(keyed=True)
for row in rows:
  print(row)
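
The distinction can be sketched in Python as follows. This `Table` class is a hypothetical stand-in that keeps the source as CSV text, not the tableschema-py implementation; it only shows why a generator function allows repeated reads while a single iterator does not.

```python
import csv
import io


class Table:
    """Hypothetical minimal table that stores the source text, not an open stream."""

    def __init__(self, source):
        self.source = source

    def iter(self, keyed=False):
        """Generator function: every call starts again from the top of the source."""
        reader = csv.reader(io.StringIO(self.source))
        headers = next(reader, None)
        if headers is None:
            return  # empty source, nothing to yield
        for row in reader:
            yield dict(zip(headers, row)) if keyed else row


table = Table("id,name\n1,a\n2,b\n")
first_pass = list(table.iter(keyed=True))
second_pass = list(table.iter(keyed=True))  # a fresh iterator, same rows

rows = table.iter()
list(rows)                   # consume the iterator once
leftover = list(rows)        # the same iterator is now exhausted
```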

It would probably be a good idea to implement the simplest function, table.read (read all rows into memory), first. At least there will be a working implementation in case streaming problems block the rest of the work.
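
One way to keep that fallback cheap is to treat reading as "collect whatever the row iterator yields". The sketch below assumes a free function named `read` with an optional `limit` argument, both chosen here for illustration; it works with any row iterator and is not tied to a particular library.

```python
from itertools import islice


def read(row_iter, limit=None):
    """Collect rows from any row iterator into a list (all rows when limit is None)."""
    return list(islice(row_iter, limit))


all_rows = read(iter([["1", "a"], ["2", "b"], ["3", "c"]]))
first_two = read(iter([["1", "a"], ["2", "b"], ["3", "c"]]), limit=2)
```

With this shape, an in-memory read and a streaming read share one code path, so the simple version does not have to be thrown away once streaming works.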

loleg changed the title from "Streaming and reading of a data source through a table schema with cast on iteration" to "Streaming and reading of a data source" on Dec 8, 2017
Collaborator Author

loleg commented Dec 8, 2017

@roll I have updated this issue description with my current understanding, please confirm.

loleg added a commit to loleg/TableSchema.jl that referenced this issue Dec 8, 2017
Member

roll commented Dec 8, 2017

@loleg
It sounds good. One remaining question: do we need to support both standard library CSV loading and external loading?

Collaborator Author

loleg commented Dec 8, 2017

I've prototyped some approaches to including an external library, but I don't want heavy dependencies or slow-to-run unit tests. Therefore I have made the Table source typeless, and implemented and tested (see a1eca25) initial support for reading the table from an IOBuffer or String, which can be used by any framework. For example, the headers and top rows of a DataFrame could be used for loading the Table, and validation would happen through standard iteration.
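
A Python analogue of the "typeless source" idea might look like the sketch below. Here `io.StringIO` stands in for Julia's IOBuffer. The helper names `open_source` and `load`, and the choice to treat a plain string as CSV text rather than a path, are assumptions for illustration and do not describe commit a1eca25.

```python
import csv
import io


def open_source(source):
    """Return a text stream for a CSV string, or pass a file-like object through."""
    if isinstance(source, str):
        return io.StringIO(source)  # assumption: a str holds CSV text, not a path
    if hasattr(source, "read"):
        return source
    raise TypeError(f"unsupported source type: {type(source).__name__}")


def load(source):
    """Return (headers, rows) from either kind of source using the stdlib reader."""
    reader = csv.reader(open_source(source))
    headers = next(reader, [])
    return headers, list(reader)


from_text = load("id,name\n1,a\n")
from_stream = load(io.StringIO("id,name\n1,a\n"))
```

Any framework that can produce text or a stream can then feed the table without the table depending on that framework.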

loleg added a commit to loleg/TableSchema.jl that referenced this issue Dec 11, 2017
Collaborator Author

loleg commented Mar 7, 2018

Still a bit stuck on the way forward here. I've been considering implementing a DataStreams.jl interface or supporting IterableTables.jl.

loleg changed the title from "Streaming and reading of a data source" to "Reading of a data source" on Mar 24, 2018
loleg closed this as completed on Apr 17, 2018
2 participants