This repository has been archived by the owner on Feb 25, 2020. It is now read-only.

Query directly from raw files #528

Open
jdegoes opened this issue Oct 5, 2013 · 1 comment

jdegoes commented Oct 5, 2013

To make Precog much more accessible and user-friendly for local installs, and to prepare for work on a distributed version of Precog, we should allow querying directly on files stored in any format for which we have an input adapter, similar to how Hive and Pig handle data analysis.

This ticket is to refactor the query engine so that we are able to allow querying directly over JSON data files, CSV files and, of course, NIHDB 'files', in a file system containing a variety of file formats.

To do this, we need to define a suitable input adapter which exposes a Table-oriented view of a file format, and propagate the information necessary to use a particular adapter (e.g. for CSV files, or possibly even JSON files, the input may be ambiguous and require information such as delimiters before it can be interpreted unambiguously as a Table).
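
As a rough illustration of the idea (not Precog's actual API), an adapter could take format-specific options alongside the raw input and return a table. Every name here (`Table`, `CsvOptions`, `InputAdapter`, `CsvAdapter`) is hypothetical, and the CSV parsing is deliberately naive:

```scala
// Hypothetical sketch: none of these names come from the Precog codebase.

// A minimal table: column names plus rows of string cells.
final case class Table(columns: Vector[String], rows: Vector[Vector[String]])

// Format-specific hints needed to interpret an otherwise ambiguous input.
final case class CsvOptions(delimiter: Char = ',', hasHeader: Boolean = true)

// An adapter turns raw input lines plus its options into a Table.
trait InputAdapter[Opts] {
  def read(lines: Iterator[String], opts: Opts): Table
}

object CsvAdapter extends InputAdapter[CsvOptions] {
  def read(lines: Iterator[String], opts: CsvOptions): Table = {
    // Naive split on the delimiter: no quoting or escaping support.
    val parsed = lines.filter(_.nonEmpty).map(_.split(opts.delimiter).toVector).toVector
    if (opts.hasHeader && parsed.nonEmpty) Table(parsed.head, parsed.tail)
    else {
      // Without a header row, synthesize column names c0, c1, ...
      val width = parsed.headOption.map(_.length).getOrElse(0)
      Table(Vector.tabulate(width)(i => s"c$i"), parsed)
    }
  }
}
```

Reading the lines `id;name`, `1;ann`, `2;bob` with `CsvOptions(delimiter = ';')` yields two columns, while the default comma delimiter would treat each line as a single cell, which is why the options have to travel with the load request.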

Some file "formats" may in fact be directories containing many files; we should think about how to handle these.

Note that as per @nuttycom's comment, we already have JSON-backed and even JDBC-backed table adapters. The exact functionality we lack is the ability to discriminate between alternate representations at runtime based on the actual string paths passed to the table load function, as well as an architecture that makes it easy to add new input adapters and rules for selecting them during runtime loads.

This ticket will be considered complete when it is possible to create a Quirrel script that loads data from a JSON file, a CSV file, and a NIHDB file, and joins them all together; and when the associated architecture allows cleanly adding support and selection criteria for new input adapters (by defining the input adapter and describing the rules that dictate when the input adapter is used for dynamically loaded data -- e.g. when the file extension or mime type is such and such).
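
One possible shape for the selection rules, sketched under assumptions: adapters are identified by a plain name, rules are predicates over the path string passed to the load function, and the first matching rule wins. `Rule` and `AdapterRegistry` are hypothetical names, and treating extension-less paths as NIHDB data is purely an assumption made for the example:

```scala
// Hypothetical sketch: not part of Precog.

// A rule claims a path for a named adapter when its predicate matches.
final case class Rule(adapter: String, matches: String => Boolean)

final class AdapterRegistry(rules: Vector[Rule]) {
  // Adding a format means appending a rule; existing rules are untouched.
  def register(rule: Rule): AdapterRegistry = new AdapterRegistry(rules :+ rule)

  // The first matching rule wins; None means no adapter claims the path.
  def select(path: String): Option[String] = rules.find(_.matches(path)).map(_.adapter)
}

object AdapterRegistry {
  // Case-insensitive file extension check.
  private def hasExt(ext: String): String => Boolean =
    _.toLowerCase.endsWith("." + ext)

  val default: AdapterRegistry = new AdapterRegistry(Vector(
    Rule("json", hasExt("json")),
    Rule("csv", hasExt("csv")),
    // Assumption for illustration: a final path segment with no dot is NIHDB data.
    Rule("nihdb", p => !p.substring(p.lastIndexOf('/') + 1).contains("."))
  ))
}
```

With this sketch, `AdapterRegistry.default.select("/data/users.json")` picks the JSON adapter, and a new format is supported by calling `register(Rule("tsv", _.endsWith(".tsv")))`. Rules keyed on mime type rather than extension would need more than the path string as input.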


nuttycom commented Oct 6, 2013

This already exists, and is very thoroughly supported by the ColumnarTableModule abstraction. It's used extensively in the tests, and the same technique is used for MongoDB and JDBC back-ends.



@ghost ghost assigned jdegoes Dec 5, 2013