
Integrate with loaders.gl 2.2 #1156

Merged (84 commits into master, Jul 1, 2020)

Conversation

@manassra (Collaborator) commented on Jun 17, 2020:

Most changes in this PR come from #1135 by @heshan0131. On top of those, the changes this PR brings are mostly in src/processors/file-handler.js, where we went through a few iterations of integration with loaders.gl.

Key New Changes:

  • Uses the metadata flag to assist with parsing objects
  • Uses the jsonpaths property to guide the parseInBatches function on what to stream (see the sketch after this list)
  • This bug fix in loaders.gl allows Kepler to ingest CSVs that have no header row (the header is auto-generated as column1, column2, ...)
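
For reference, here is a minimal sketch (not the exact file-handler.js code; the helper name and option values are illustrative) of how the metadata flag and jsonpaths can be passed to loaders.gl's parseInBatches:

import {parseInBatches} from '@loaders.gl/core';
import {JSONLoader} from '@loaders.gl/json';
import {CSVLoader} from '@loaders.gl/csv';

async function readFileInBatches(file) {
  const batchIterator = await parseInBatches(file, [JSONLoader, CSVLoader], {
    // emit an initial metadata batch describing the selected loader
    metadata: true,
    // tell the streaming JSON parser which arrays to stream
    json: {jsonpaths: ['$', '$.features', '$.datasets']}
  });

  const rows = [];
  for await (const batch of batchIterator) {
    if (batch.batchType === 'metadata') {
      continue; // loader/format info only, no data rows
    }
    rows.push(...batch.data);
  }
  return rows;
}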

Next Steps:

  • Tests for the various supported dataset types (follow-up PR)
  • Fix for the overflowing upload panel when uploading multiple files simultaneously (follow-up PR)

jsonpaths: [
  '$',          // JSON Row array
  '$.features', // GeoJSON
  '$.datasets'  // KeplerGL JSON
]
Contributor:

datasets is an array of datasets, which could have fewer than 10 entries. We can make it $.datasets.0.data so it always tries to stream the first dataset. And when we save in kepler, we can save the dataset with the most rows first.
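
For illustration, the suggested change to the snippet above might look like this (assuming loaders.gl accepts numeric segments in jsonpaths, which is discussed below):

jsonpaths: [
  '$',                 // JSON Row array
  '$.features',        // GeoJSON
  '$.datasets.0.data'  // first dataset of a saved kepler.gl map
]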

Collaborator Author:

Agreed. The reason I left it as $.datasets for now is that the array indexing in jsonpaths doesn't seem to be handled properly in loaders.gl.

Collaborator:

Yes, I think you can just add a test case in jsonpaths.spec and streaming-json-parser.spec in loaders.gl/json, and it should be easy to pin down this little bug (at a guess, perhaps some string/number conversion issue, or openobject / openarray being treated differently).
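
A rough sketch of the kind of regression test being suggested, in the tape style used by loaders.gl (the fixture shape and test title here are assumptions, not existing spec code):

import test from 'tape-promise/tape';
import {parseInBatches} from '@loaders.gl/core';
import {JSONLoader} from '@loaders.gl/json';

test('JSONLoader#jsonpaths with an array index', async t => {
  const json = JSON.stringify({datasets: [{data: [[1, 'a'], [2, 'b']]}]});
  const batches = await parseInBatches(json, JSONLoader, {
    json: {jsonpaths: ['$.datasets.0.data']}
  });

  const rows = [];
  for await (const batch of batches) {
    if (batch.batchType !== 'metadata') {
      rows.push(...batch.data);
    }
  }

  t.equal(rows.length, 2, 'streamed the rows addressed by the indexed jsonpath');
  t.end();
});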

@@ -200,9 +178,9 @@ export function processFileData({content, fileCache}) {
}

if (format && processor) {
console.time('process file content');
Console.time('process file content');
Contributor:

This was me testing; it can be removed.

@ibgreen changed the title from "[WIP] Integrate with Loaders 2.2" to "Integrate with loaders.gl 2.2" on Jun 18, 2020
@ibgreen (Collaborator) left a comment:

Nice work!

Apart from some nits, I think the post-processing could merit another look, especially the CSV functions. It feels like there is overlap with loaders, and it would be nice to push generic post-processing code upstream into loaders.gl so we can keep the logic here in the app as clean and minimal as possible.

Edit: That would be for another PR.

src/components/common/file-uploader/file-upload.js (outdated, resolved)
src/components/modals/load-data-modal.js (outdated, resolved)
if (!Array.isArray(result) || result.length < 2) {
  // looks like an empty file, throw an error to be caught
  throw new Error('Read File Failed: CSV is empty');
}

export function processCsvData(rawData, header) {
Collaborator:

This seems to duplicate the CSVLoader with header detection etc. Can we unify?

Is this unique to CSV data, not other tables? Should we rename the functions?

Collaborator Author:

Agreed. I think there might be an opportunity to remove this functionality altogether, but it seems to be used for loading sample datasets and in other places (if I understand this correctly). @heshan0131 what do you think?

@@ -157,6 +69,10 @@ export function isRowObject(json) {
return Array.isArray(json) && isPlainObject(json[0]);
}

export function isCsvRows(content) {
Collaborator:

This case does not happen any more with the changes in @loaders.gl/csv?

Collaborator Author:

You are right! This isn't needed anymore because a CSV gets transformed to a JSON row array by loaders.gl. Removed.


let format;
let processor;
if (isCsvRows(data)) {
Collaborator:

If we need to know whether this was originally CSV, we can use the metadata batch instead of sniffing like this. The idea with loaders.gl is that we can add additional streaming parsers; do we need to keep track of what type things were originally?
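
As a rough sketch of this idea, and assuming the metadata batch shape in loaders.gl 2.2 (a batchType of 'metadata' plus a metadata object carrying the auto-selected loader; the _loader field is an assumption worth verifying):

async function detectOriginalFormat(batchIterator) {
  for await (const batch of batchIterator) {
    if (batch.batchType === 'metadata') {
      // assumption: the metadata batch carries the loader that was auto-selected
      const loader = batch.metadata && batch.metadata._loader;
      return loader && loader.id; // e.g. 'csv' or 'json'
    }
  }
  return null;
}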

Collaborator Author:

This check isn't actually needed since, as you said, the CSV gets parsed as an array of records (it will get captured by isRowObject(data) instead).

src/reducers/vis-state-updaters.js (outdated, resolved)
src/reducers/vis-state-updaters.js (resolved)
gnandretta and others added 13 commits June 17, 2020 22:35
Parse JSON using loaders.gl

Parse CSV using loaders.gl

Let loaders.gl handle file detection, not working though

manassra and others added 9 commits June 18, 2020 00:10

heshan0131 and others added 8 commits June 18, 2020 00:26

let format;
let processor;
if (isCsvRows(data)) {
  format = DATASET_FORMATS.csv;
  processor = processCsvData;
Contributor:

Let's not delete this. kepler's processCsvData does 2 things:

  1. Type detection, specifically to find time and geo columns, as well as the exact time format, e.g. yyyy-hh-dd mm:ss.
  2. Basic data cleaning. Lots of times people have string values like N/A, Null, N/ mixed in with numbers; kepler finds them and fills them in with null so type detection won't fail (see the sketch below).

For case 1, there is a fallback in the reducer: if the time format is not detected, kepler will call type detection again. Not sure loaders.gl will handle case 2.
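
As a rough illustration of point 2 (not kepler.gl's actual implementation; the helper name and the exact set of null-like strings are assumptions), the cleaning step amounts to something like:

// strings that should be treated as missing values before type detection
const NULL_LIKE = new Set(['', 'n/a', 'null', 'n/']);

function cleanRows(rows) {
  return rows.map(row =>
    row.map(value =>
      typeof value === 'string' && NULL_LIKE.has(value.trim().toLowerCase())
        ? null
        : value
    )
  );
}

// cleanRows([['1.5', 'N/A'], ['2.0', '3.5']]) -> [['1.5', null], ['2.0', '3.5']]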

Collaborator:

Right now loaders doesn't do this, but it will have to when we switch the loaders to return columnar data batches (typed arrays) and arrow batches.

I think we should look at moving the data cleaning out as a second step.

Collaborator Author:

To add to Ib's point, once the dataset is processed via loaders.gl, the output format for CSV files and JSON Row objects becomes the same (an array of records), so the isCsvRows function doesn't get evaluated. Any ideas on how to tackle this?

@heshan0131 merged commit 01484c6 into master on Jul 1, 2020
The delete-merged-branch bot deleted the loaders2.2 branch on July 1, 2020 08:00
@heshan0131 mentioned this pull request on Mar 20, 2021
@heshan0131 (Contributor) commented:

#76

Labels: none yet
Projects: none yet
8 participants