Possible to avoid the LoadCSV step? #46

Open
soneymathew opened this issue Apr 3, 2019 · 8 comments

@soneymathew

Hi @jexp / maintainers,
I am trying to do an online import of a database that contains JSON values in its columns.
I notice that when the data gets exported to CSV, the JSON breaks the CSV format.

Can we modify this to use https://neo4j-contrib.github.io/neo4j-apoc-procedures/#load-jdbc to overcome this?

I was able to apply this strategy successfully with hand-rolled Cypher. It would be great if the ETL tool could adopt it.

Happy to help raise a PR with some guidance as well.

@soneymathew
Author

I tested the following strategy on a table with ~1 million rows; the load completed in ~46 seconds.

Set the DB_URL so that it can be referred to later:

CALL apoc.static.set('DB_URL', '<connection url>')
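
To double-check what was stored, the value can be read back with the matching getter (just a sketch, using the same procedure the batch job below relies on):

// Read back the stored connection string; note the output will include credentials
CALL apoc.static.get('DB_URL') YIELD value
RETURN apoc.convert.toString(value) AS DB_URL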

Not sure if this matters, but I set the constraints required for the node first:

CREATE CONSTRAINT ON (t:<TableName>) ASSERT t.id IS UNIQUE;
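
(To confirm the constraint took effect on Neo4j 3.x, a quick check, sketch only:)

// List existing constraints; the uniqueness constraint on <TableName>.id should appear
CALL db.constraints()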

Use APOC procedures to load the data in parallel with a batch size of 10k:

CALL apoc.periodic.iterate('
  CALL apoc.static.get("DB_URL") YIELD value
  WITH apoc.convert.toString(value) AS DB_URL
  CALL apoc.load.jdbc(DB_URL, "<TableName>") YIELD row
  RETURN row',
  'CREATE (c:<TableName>) SET c += row',
  {batchSize: 10000, parallel: true})
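
A possible variant, in case only some columns are needed: apoc.load.jdbc also accepts a SQL statement instead of a table name. This is only a sketch; the column names id and payload are hypothetical:

// Sketch: same pattern, but selecting specific columns via a SQL statement
// ("id" and "payload" are hypothetical column names)
CALL apoc.periodic.iterate('
  CALL apoc.static.get("DB_URL") YIELD value
  WITH apoc.convert.toString(value) AS DB_URL
  CALL apoc.load.jdbc(DB_URL, "SELECT id, payload FROM <TableName>") YIELD row
  RETURN row',
  'CREATE (c:<TableName>) SET c += row',
  {batchSize: 10000, parallel: true})

With the uniqueness constraint above, using MERGE on the id instead of CREATE would also make re-runs idempotent.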

@jexp
Member

jexp commented Apr 7, 2019

That's odd, the CSV export/import should handle JSON correctly. I'll have a deeper look at the issue.

What is the data type in your relational database?

Yep, apoc.periodic.iterate rocks :)

@jexp
Member

jexp commented Apr 7, 2019

It's actually an interesting proposal. Let's discuss this.

So instead of doing the transformation + batching (optionally parallel) in Java, we would do it in APOC.

jexp assigned jexp and conker84 and unassigned jexp Apr 7, 2019
@soneymathew
Author

soneymathew commented Apr 7, 2019

My initial exploration was to do manually what the ETL tool currently facilitates, but as I went through multiple stages I learned that the APOC approach can be amazingly performant and skips the translation overhead of CSV creation and loading.

My sample dataset is a Postgres schema containing a mix of the following:

  • normal tables/columns whose values are database-supported base data types
  • some columns of datatype text whose content is multiline JSON (this is the data that tripped up the ETL tool; see the sketch below)
  • some columns of datatype text whose content is XML

On a side note, I noticed that the ETL tool silently ignores unsupported datatypes (examples are oid and tsvector). It would be cool to log these as errors, or as choices that can be pondered, perhaps?
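
For the JSON text columns, once a row comes in via apoc.load.jdbc the string arrives intact, so it can even be parsed on the Neo4j side if needed. A minimal sketch, assuming a hypothetical column named json_column:

// Sketch only: "json_column" is an assumed column name, not from my schema
CALL apoc.static.get("DB_URL") YIELD value
WITH apoc.convert.toString(value) AS DB_URL
CALL apoc.load.jdbc(DB_URL, "<TableName>") YIELD row
WITH row, apoc.convert.fromJsonMap(row.json_column) AS parsed
RETURN row.id, keys(parsed) AS jsonKeys
LIMIT 5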

@mroiter-larus
Contributor

@soneymathew Can you help me reproduce this issue? For example, can you tell me which sample database you used?

Thank you

@soneymathew
Author

@mroiter-larus Sorry, I won't be able to share my database.
You could probably reproduce this by storing the content of a big pretty-printed JSON file in a text column.
Example: the mapping JSON generated by the tool is a good candidate for test data.

@mroiter-larus
Contributor

@soneymathew I had a test case, but it didn't reproduce the error. Anyway, I'll try your suggestion.

@soneymathew
Author

soneymathew commented Apr 10, 2019

@mroiter-larus Is it important that you reproduce it?

Users could have bad data in their database, with non-printable characters or multiline content that can break your CSV. There is a risk of data loss along the translation steps: DB -> CSV -> LOAD CSV.

I raised this issue to explore whether neo4j-etl can be updated to bypass the CSV generation step altogether.

I believe it will help your users from the following perspectives:

  • Space savings (CSVs generated from a large DB can be pretty large, as they are uncompressed)
  • Avoid the risk of data being lost in translation by cutting out the CSV step
  • Possibly better performance, as you avoid the intermediate process and import into Neo4j straight from the RDBMS
  • Leverage/experiment with better strategies like concurrent access, batching, etc. to increase the overall import speed
