
Fixes #489 - Add support for loading CSVs formatted according to the import tool's specification #581

Merged
merged 2 commits into neo4j-contrib:3.2 on Jul 21, 2018

Conversation

szarnyasg
Contributor

Fixes #489. Submitting this for review. Not ready to merge.

@jexp
Member

jexp commented Sep 6, 2017

@szarnyasg your branch seems to be messed up
can you do a clean rebase of your changes?
and perhaps squash your commits?

Thanks a lot

@szarnyasg
Contributor Author

Ok - fixed

@szarnyasg szarnyasg changed the title 3.2 load csv Fixes #489 - Add support for loading CSVs formatted according to the import tool's specification Sep 6, 2017
@szarnyasg
Contributor Author

szarnyasg commented Nov 15, 2017

It seems that some Heisenbugs occur during the APOC tests - I committed a single whitespace change and the build went from red to green. However, the LOAD CSV code is not yet ready for review.

@szarnyasg
Contributor Author

szarnyasg commented Jan 9, 2018

Hey @jexp, please review this if you get a chance.

@jexp
Member

jexp commented Jan 27, 2018

Hi @szarnyasg, I will have another look.
Do you think we can reduce the sheer number of files this PR adds?
Perhaps generate the CSVs, e.g. from a large one that's split on a specific line-marker?
Or zip them and extract them on demand?

@jexp
Member

jexp commented Jan 27, 2018

I took a quick look.

Some thoughts.

  1. did we discuss back then to forego cypher altogether and create the data with the Java API possibly in parallel?
  2. ID type can also be string depending on config option
  3. if we use cypher, all statements should use parameters, not string interpolation (except where needed for rel-types) - see the sketch after this list
  4. as we're in apoc we could also use the apoc calls for creating dynamic nodes and rels but then see 1.
  5. add potential cache for node-lookups (see maxdemarzi)
  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in
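
A minimal sketch of what point 3 means in practice, assuming a plain Cypher-based loader; the label, relationship type, parameter names, and lookup property are illustrative only:

```cypher
// Node creation: the whole property map is passed as a parameter,
// never concatenated into the query text.
CREATE (n:Person) SET n += $props;

// Relationship creation: ids and properties are parameters as well;
// only the relationship type has to be interpolated into the query string.
MATCH (a:Person {csvId: $startId}), (b:Person {csvId: $endId})
CREATE (a)-[r:KNOWS]->(b) SET r += $props;
```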

@jexp
Member

jexp commented Jan 27, 2018

I'm happy to take your work and refactor it quickly next week.

@szarnyasg
Contributor Author

szarnyasg commented Jan 27, 2018

Hey @jexp, thanks for the feedback.

  1. did we discuss back then to forego cypher altogether and create the data with the Java API possibly in parallel?

We discussed this, but I thought it'd be a lot easier to implement it on top of Cypher. It turns out that this solution has dismal performance: I once tested it with a data set of ~10,000 records and it took ~5 minutes.

  2. ID type can also be string depending on config option

Correct, this is missing.

  3. if we use cypher, all statements should use parameters not string interpolation (except where needed for rel-types)

Good point, CSV injection is actually a thing.

  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in

Agreed.

I'm happy to take your work and refactor it quickly next week.

It would be great if you could refactor the code, esp. with the performance-related aspects.

@szarnyasg
Contributor Author

szarnyasg commented Jun 3, 2018

I took another look at this problem.

Historically, my approach was to generate LOAD CSV commands. I got that working by the end of last year, but it turned out to be extremely slow, taking ~10 minutes for a small LDBC dataset with ~50k nodes/relationships.
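
Roughly, that approach amounts to emitting one statement of the following shape per input file; the file name, label, and header-to-property mapping are illustrative only, not taken from the actual generator:

```cypher
// The neo4j-import style header fields (e.g. 'id:ID', 'name:string')
// have to be mapped onto plain property names; type conversion is omitted here.
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CREATE (n:Person {id: row.`id:ID`, name: row.`name:string`})
```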

So the next step was to ditch LOAD CSV and instead use a custom CSV loader, creating the entities in the graph manually with createNode and createRelationship. While doing this, I realized that it no longer makes sense to use separate commands for each node/relationship file. Instead, the signature of the procedure is now apoc.import.csv(nodes, relationships, config) - see the example call after the list below.

Both nodes and relationships are lists:

  • nodes contains {fileName: ..., labels: [...]} maps,
  • relationships contains {fileName: ..., type: ...} maps.
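
For illustration, a call following this signature might look as below; the file names, label, relationship type, and header layouts are made up, only the shape of the arguments comes from the description above:

```cypher
// persons.csv uses a neo4j-import style header such as:  id:ID,name:string
// knows.csv uses a header such as:                       :START_ID,:END_ID,since:int
CALL apoc.import.csv(
  [{fileName: 'file:/persons.csv', labels: ['Person']}],
  [{fileName: 'file:/knows.csv', type: 'KNOWS'}],
  {}
)
```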

While this seems a bit cumbersome at first sight, it makes things a lot easier in the long run: during the first phase of the loading process - loading nodes - we can easily cache the mapping between CSV ids and internal node ids. We can then use this cache for lookups based on the start id and end id of relationships. The resulting code is a lot faster, taking only a few seconds for the LDBC dataset that previously took minutes.

As the code is now using APOC's LoadCsv class and its mapping, I made a couple of its inner classes public: 4818de8#diff-f37b66be7dcb534530f538cc87e2ea15

So overall, things are starting to look better.

There are a few minor things still to do, e.g. I haven't yet addressed your earlier comment:

  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in

Also, the whole testing needs to be more thorough, e.g. we should assert on property values of the loaded entities. (Instead of just counting nodes/relationships which only catches the most basic errors.)

@szarnyasg
Contributor Author

szarnyasg commented Jun 20, 2018

  6. the csv files should be generated by the tests, then they are also close together and don't have to be checked in

CSVs are now generated by the tests on the fly in the csv-inputs/ directory, and the contents of the directory are gitignored.

@szarnyasg
Contributor Author

I addressed one of my earlier comments:

Also, the whole testing needs to be more thorough, e.g. we should assert on property values of the loaded entities. (Instead of just counting nodes/relationships which only catches the most basic errors.)

I have now added such assertions to each unit test. I believe the code is ready for review.

@jexp
Member

jexp commented Jul 5, 2018

Cool, thanks so much!! Sorry for the delay.

@szarnyasg
Contributor Author

In the meantime, I updated the docs in #844.

I also ran into an important design decision that's worth discussing. Currently, the whole loading process happens in a single transaction. This implies that the data set size is limited to what a single transaction can handle, i.e. ~20k nodes. However, if we were to split the load process into multiple transactions, handling failures would become very difficult: if a transaction fails, we would need to roll back the previous transactions; otherwise, we end up with inconsistent data.

I think an acceptable workaround would be to introduce a "force transactions" flag that says "I have a lot of data to load but I know what I am doing and I understand that incorrect CSVs might corrupt my database".

(Of course, this whole problem does not exist for the offline import tool, which can simply crash if it encounters too many errors.)
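
To make the proposal concrete, such a flag would presumably just be another entry in the config map; the option name below is purely hypothetical and is not defined anywhere in this PR:

```cypher
CALL apoc.import.csv(
  [{fileName: 'file:/persons.csv', labels: ['Person']}],
  [{fileName: 'file:/knows.csv', type: 'KNOWS'}],
  // hypothetical flag: allow multiple transactions, accepting that a failed
  // batch may leave partially imported data behind
  {forceTransactions: true}
)
```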

@jexp
Member

jexp commented Jul 14, 2018

I think we should do the same as we do in other tools: provide config options for batching and parallel execution, so the user can choose to make their batch size large enough to fit all the data.

@szarnyasg
Contributor Author

I introduced batch transactions in a separate commit. I based it on the ExportGraphML class, but I am not very experienced with this part of the APOC API - please let me know if my code needs any fixes.

For parallel execution, we'll need to parallelize the loading processes for the nodes and the relationships separately. Of course, within both phases, individual CSV files can be split into chunks and parallelized.
In any case, it doesn't seem trivial to implement this correctly and I could not find a piece of APOC code to base this on. Any suggestions? Or maybe leave this for a future release along with other improvements?

@jexp
Member

jexp commented Jul 21, 2018

We should use the same batching as we use in periodic.iterate.
Export functions have a different kind of batching.
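
For reference, apoc.periodic.iterate drives its batching entirely from the config map (e.g. batchSize and parallel); the two statements below are just placeholders:

```cypher
CALL apoc.periodic.iterate(
  'MATCH (p:Person) RETURN p',   // placeholder: statement that produces the items
  'SET p.processed = true',      // placeholder: statement applied to each item
  {batchSize: 10000, parallel: false}
)
```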

Let's leave it for a future PR. Thanks so much for your work, I'll merge this and your docs update now.

@jexp jexp merged commit 4690ff9 into neo4j-contrib:3.2 Jul 21, 2018
jexp pushed a commit that referenced this pull request Jul 23, 2018
…ort tool's specification (#581)

* Add 'apoc.import.csv' procedure

This allows users to load graphs to an online database from CSVs
with headers that follow the conventions of the neo4j-import tool

* Introduce batch transactions to 'apoc.import.csv'