
Fixes #489 - Add support for loading CSVs formatted according to the import tool's specification #581

Merged
merged 2 commits into neo4j-contrib:3.2 on Jul 21, 2018

Conversation

szarnyasg
Contributor

Fixes #489. Submitting this for review. Not ready to merge.

@jexp
Member

jexp commented Sep 6, 2017

@szarnyasg your branch seems to be messed up
can you do a clean rebase of your changes?
and perhaps squash your commits?

Thanks a lot

@szarnyasg
Contributor Author

Ok - fixed

@szarnyasg szarnyasg changed the title 3.2 load csv Fixes #489 - Add support for loading CSVs formatted according to the import tool's specification Sep 6, 2017
@szarnyasg
Contributor Author

szarnyasg commented Nov 15, 2017

It seems that some Heisenbugs occur during the APOC tests - I committed a single whitespace change and the build went from red to green. However, the LOAD CSV code is not yet ready for review.

@szarnyasg
Contributor Author

szarnyasg commented Jan 9, 2018

Hey @jexp, please review this if you get a chance.

@jexp
Member

jexp commented Jan 27, 2018

Hi @szarnyasg, I will have another look.
Do you think we can reduce the sheer number of files this PR adds?
Perhaps generate the CSVs, e.g. from a large one that's split on a specific line-marker?
Or zip them and extract them on demand?

@jexp
Member

jexp commented Jan 27, 2018

I took a quick look.

Some thoughts.

  1. did we discuss back then to forego cypher altogether and create the data with the Java API possibly in parallel?
  2. ID type can also be string depending on config option
  3. if we use cypher, all statements should use parameters, not string interpolation (except where needed for rel-types) - see the sketch after this list
  4. as we're in apoc we could also use the apoc calls for creating dynamic nodes and rels but then see 1.
  5. add potential cache for node-lookups (see maxdemarzi)
  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in
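
A minimal sketch of what point 3 means in practice, assuming a plain Cypher-based loader; the label, relationship type, parameter names, and lookup property are illustrative only:

```cypher
// Node creation: the whole property map is passed as a parameter,
// never concatenated into the query text.
CREATE (n:Person) SET n += $props;

// Relationship creation: ids and properties are parameters as well;
// only the relationship type has to be interpolated into the query string.
MATCH (a:Person {csvId: $startId}), (b:Person {csvId: $endId})
CREATE (a)-[r:KNOWS]->(b) SET r += $props;
```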

@jexp
Member

jexp commented Jan 27, 2018

I'm happy to take your work and refactor it quickly next week.

@szarnyasg
Contributor Author

szarnyasg commented Jan 27, 2018

Hey @jexp, thanks for the feedback.

  1. did we discuss back then to forego cypher altogether and create the data with the Java API possibly in parallel?

We discussed this, but I thought it'd be a lot easier to implement it on top of Cypher. It turns out that this solution has dismal performance: I once tested it with a data set of ~10,000 records and it took ~5 minutes.

  2. ID type can also be string depending on config option

Correct, this is missing.

  3. if we use cypher, all statements should use parameters not string interpolation (except where needed for rel-types)

Good point, CSV injection is actually a thing.

  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in

Agreed.

I'm happy to take your work and refactor it quickly next week.

It would be great if you could refactor the code, esp. with the performance-related aspects.

@szarnyasg
Contributor Author

szarnyasg commented Jun 3, 2018

I took another look at this problem.

Historically, my approach was to generate LOAD CSV commands. I got that working by the end of last year, but it turned out to be extremely slow, taking ~10 minutes for a small LDBC dataset with ~50k nodes/relationships.
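
Roughly, that approach amounts to emitting one statement of the following shape per input file; the file name, label, and header-to-property mapping are illustrative only, not taken from the actual generator:

```cypher
// The neo4j-import style header fields (e.g. 'id:ID', 'name:string')
// have to be mapped onto plain property names; type conversion is omitted here.
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CREATE (n:Person {id: row.`id:ID`, name: row.`name:string`})
```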

So the next step was to ditch LOAD CSV and instead use a custom CSV loader, creating the entities in the graph manually with createNode and createRelationship. While doing this, I realized that it no longer makes sense to use separate commands for each node/relationship file. Instead, the signature of the procedure is now apoc.import.csv(nodes, relationships, config) - see the example call after the list below.

Both nodes and relationships are lists:

  • nodes contains {fileName: ..., labels: [...]} maps,
  • relationships contains {fileName: ..., type: ...} maps.
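
For illustration, a call following this signature might look as below; the file names, label, relationship type, and header layouts are made up, only the shape of the arguments comes from the description above:

```cypher
// persons.csv uses a neo4j-import style header such as:  id:ID,name:string
// knows.csv uses a header such as:                       :START_ID,:END_ID,since:int
CALL apoc.import.csv(
  [{fileName: 'file:/persons.csv', labels: ['Person']}],
  [{fileName: 'file:/knows.csv', type: 'KNOWS'}],
  {}
)
```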

While this seems a bit cumbersome at first sight, it makes things a lot easier in the long run: during the first phase of the loading process - loading nodes - we can easily cache the mapping between CSV ids and internal node ids. We can then use this cache for lookups based on the start id and end id of relationships. The resulting code is a lot faster, taking only a few seconds for the LDBC dataset that previously took minutes.

As the code is now using APOC's LoadCsv class and its mapping, I made a couple of its inner classes public: 4818de8#diff-f37b66be7dcb534530f538cc87e2ea15

So overall, things are starting to look better.

There are a few minor things still to do, e.g. I haven't yet addressed your earlier comment:

  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in

Also, the whole testing needs to be more thorough, e.g. we should assert on property values of the loaded entities. (Instead of just counting nodes/relationships which only catches the most basic errors.)

@szarnyasg
Contributor Author

szarnyasg commented Jun 20, 2018

  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in

CSVs are now generated by the tests on the fly in the csv-inputs/ directory, and the contents of the directory are gitignored.

@szarnyasg
Contributor Author

I addressed one of my earlier comments:

Also, the whole testing needs to be more thorough, e.g. we should assert on property values of the loaded entities. (Instead of just counting nodes/relationships which only catches the most basic errors.)

I have now added such assertions to each unit test. I believe the code is ready for review.

@jexp
Member

jexp commented Jul 5, 2018

Cool, thanks so much!! Sorry for the delay.

@szarnyasg
Contributor Author

In the meantime, I updated the docs in #844.

I also ran into an important design decision that's worth discussing. Currently, the whole loading process happens in a single transaction. This implies that the data set size is limited to what a single transaction can handle, i.e. ~20k nodes. However, if we were to split the load process into multiple transactions, handling failures would become very difficult: if a transaction fails, we would need to roll back the previous transactions; otherwise, we end up with inconsistent data.

I think an acceptable workaround would be to introduce a "force transactions" flag that says "I have a lot of data to load but I know what I am doing and I understand that incorrect CSVs might corrupt my database".

(Of course, this whole problem does not exist for the offline import tool, which can simply crash if it encounters too many errors.)
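
To make the proposal concrete, such a flag would presumably just be another entry in the config map; the option name below is purely hypothetical and is not defined anywhere in this PR:

```cypher
CALL apoc.import.csv(
  [{fileName: 'file:/persons.csv', labels: ['Person']}],
  [{fileName: 'file:/knows.csv', type: 'KNOWS'}],
  // hypothetical flag: allow multiple transactions, accepting that a failed
  // batch may leave partially imported data behind
  {forceTransactions: true}
)
```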

@jexp
Member

jexp commented Jul 14, 2018

I think we should do the same as we do in other tools: provide config options for batching and parallel execution, so the user can choose to make their batch size large enough to fit all the data.

@szarnyasg
Contributor Author

I introduced batch transactions in a separate commit. I based it on the ExportGraphML class, but I am not very experienced with this part of the APOC API - please let me know if my code needs any fixes.

For parallel execution, we'll need to parallelize the loading processes for the nodes and the relationships separately. Of course, within both phases, individual CSV files can be split into chunks and parallelized.
In any case, it doesn't seem trivial to implement this correctly and I could not find a piece of APOC code to base this on. Any suggestions? Or maybe leave this for a future release along with other improvements?

@jexp
Member

jexp commented Jul 21, 2018

We should use the same batching as we use in periodic.iterate.
Export functions have a different kind of batching.
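
For reference, apoc.periodic.iterate drives its batching entirely from the config map (e.g. batchSize and parallel); the two statements below are just placeholders:

```cypher
CALL apoc.periodic.iterate(
  'MATCH (p:Person) RETURN p',   // placeholder: statement that produces the items
  'SET p.processed = true',      // placeholder: statement applied to each item
  {batchSize: 10000, parallel: false}
)
```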

Let's leave it for a future PR. Thanks so much for your work, I'll merge this and your docs update now.

@jexp jexp merged commit 4690ff9 into neo4j-contrib:3.2 Jul 21, 2018
jexp pushed a commit that referenced this pull request Jul 23, 2018
…ort tool's specification (#581)

* Add 'apoc.import.csv' procedure

This allows users to load graphs to an online database from CSVs
with headers that follow the conventions of the neo4j-import tool

* Introduce batch transactions to 'apoc.import.csv'