Fixes #489 - Add support for loading CSVs formatted according to the import tool's specification #581
Conversation
@szarnyasg your branch seems to be messed up. Thanks a lot!
OK, fixed.
It seems that some Heisenbugs occur during the APOC tests: I committed a single whitespace change and the build went from red to green. However, the LOAD CSV code is not yet ready for review.
Hey @jexp, please review this if you get a chance. |
Hi @szarnyasg, will have another look.
I took a quick look. Some thoughts.
I'm happy to take your work and refactor it quickly next week.
Hey @jexp, thanks for the feedback.
We discussed this, but I thought it'd be a lot easier to implement the loader on top of Cypher. It turns out that this solution has dismal performance: I once tested it with a data set of ~10,000 records and it took ~5 minutes.
Correct, this is missing.
Good point, CSV injection is actually a thing.
Agreed.
It would be great if you could refactor the code, especially the performance-related aspects.
I took another look at this problem. Historically, my approach was to generate Cypher queries from the CSVs. So the next step was to ditch the Cypher-based approach and use the database's Java API directly. Both nodes and relationships are now loaded this way, in two separate phases.
While this seems a bit cumbersome at first sight, it makes things a lot easier in the long run: during the first phase of the loading process (loading nodes), we can easily cache the mapping between CSV ids and internal node ids. We can then use this cache to look up the start id and end id of each relationship. The resulting code is a lot faster: the LDBC dataset that previously took minutes now loads in a few seconds. The code also reuses APOC's existing utilities, so overall, things are starting to look better.
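(A rough sketch of the two-phase load with an id cache described above, assuming the Neo4j 3.x embedded Java API; `CsvRow` and its accessors are hypothetical stand-ins for the real CSV handling, not the PR's actual code.)

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class TwoPhaseCsvLoadSketch {

    // Hypothetical row abstraction; the real code parses import-tool-style headers.
    interface CsvRow {
        String csvId();
        String startId();
        String endId();
        String label();
        String type();
        Map<String, Object> properties();
    }

    void load(GraphDatabaseService db, Iterable<CsvRow> nodeRows, Iterable<CsvRow> relRows) {
        // Phase 1: create nodes, remembering the mapping CSV id -> internal node id.
        Map<String, Long> idCache = new HashMap<>();
        try (Transaction tx = db.beginTx()) {
            for (CsvRow row : nodeRows) {
                Node node = db.createNode(Label.label(row.label()));
                row.properties().forEach(node::setProperty);
                idCache.put(row.csvId(), node.getId());
            }
            // Phase 2: create relationships, resolving START_ID/END_ID through the
            // cache instead of running a lookup query for every row.
            for (CsvRow row : relRows) {
                Node start = db.getNodeById(idCache.get(row.startId()));
                Node end = db.getNodeById(idCache.get(row.endId()));
                start.createRelationshipTo(end, RelationshipType.withName(row.type()));
            }
            tx.success();
        }
    }
}
```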
There are a few minor things still to do; e.g., I haven't yet addressed your earlier comment. Also, the testing as a whole needs to be more thorough: we should assert on the property values of the loaded entities, instead of just counting nodes/relationships, which only catches the most basic errors.
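(A minimal sketch of the kind of property-value assertion meant here, against a Neo4j 3.x embedded test database; the fixture data and names are illustrative.)

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;

import org.junit.Test;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;

public class ImportCsvAssertionSketch {

    GraphDatabaseService db; // assume an impermanent test database, set up elsewhere

    @Test
    public void loadedNodesCarryExpectedPropertyValues() {
        // ... run the CSV import on a small fixture first (omitted) ...
        try (Transaction tx = db.beginTx()) {
            Result result = db.execute("MATCH (p:Person) RETURN p.name AS name ORDER BY name");
            assertEquals("Alice", result.next().get("name")); // assert actual values,
            assertEquals("Bob", result.next().get("name"));   // not just row counts
            assertFalse(result.hasNext());
            tx.success();
        }
    }
}
```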
CSVs are now generated on the fly by the tests.
I addressed one of my earlier comments:
I have now added thorough assertions to each unit test. I believe the code is good to go for review.
Cool, thanks so much! Sorry for the delay.
In the meantime, I updated the docs in #844. I also found an important design decision that's worth discussing.

Currently, the whole loading process happens in a single transaction. This implies that the data set size is limited to what a single transaction can handle, i.e. ~20k nodes. However, if we were to split the load process into multiple transactions, handling failures would become very difficult: if a transaction fails, we would need to roll back the previous transactions, otherwise we are left with inconsistent data.

I think an acceptable workaround would be to introduce a "force transactions" flag that says "I have a lot of data to load, but I know what I am doing and I understand that incorrect CSVs might corrupt my database". (Of course, this whole problem does not exist for the offline import tool, which can simply crash if it encounters too many errors.)
This allows users to load graphs into an online database from CSVs whose headers follow the conventions of the neo4j-import tool.
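(For context, a hedged sketch of invoking the procedure on import-tool-style CSVs, along the lines of the docs added in #844; the exact config keys and file contents here are assumptions, not quoted from the PR.)

```cypher
// persons.csv (import-tool-style header):
//   :ID|name:STRING
//   1|John
//   2|Jane
// knows.csv:
//   :START_ID|:END_ID|since:INT
//   1|2|1993
CALL apoc.import.csv(
  [{fileName: 'file:/persons.csv', labels: ['Person']}],
  [{fileName: 'file:/knows.csv', type: 'KNOWS'}],
  {delimiter: '|', arrayDelimiter: ',', stringIds: false}
)
```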
I think we should do the same as we do in other tools: provide config options for batching and parallel execution, so the user can choose to make their batch size large enough to fit all the data.
I introduced batch transactions in a separate commit. For parallel execution, we'll need to parallelize the loading processes for the nodes and the relationships separately. Of course, within both phases, individual CSV files can be split into chunks and processed in parallel.
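(A minimal sketch of the batch-transaction pattern described above, assuming the Neo4j 3.x embedded API; the helper name and plumbing are illustrative, not the PR's actual code.)

```java
import java.util.Iterator;
import java.util.function.Consumer;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class BatchedTxSketch {

    // Commit every 'batchSize' rows so a large import is not forced through one
    // huge transaction. If a batch fails mid-import, earlier batches stay
    // committed; this is exactly the consistency trade-off discussed above.
    static <T> void inBatches(GraphDatabaseService db, Iterator<T> rows,
                              long batchSize, Consumer<T> write) {
        while (rows.hasNext()) {
            try (Transaction tx = db.beginTx()) {
                for (long i = 0; i < batchSize && rows.hasNext(); i++) {
                    write.accept(rows.next());
                }
                tx.success();
            }
        }
    }
}
```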
We should use the same batching as we use in periodic.iterate. Let's leave that for a future PR. Thanks so much for your work, I'll merge this and your docs update now.
Fixes #489 - Add support for loading CSVs formatted according to the import tool's specification (#581)

* Add 'apoc.import.csv' procedure. This allows users to load graphs into an online database from CSVs with headers that follow the conventions of the neo4j-import tool.
* Introduce batch transactions to 'apoc.import.csv'
Fixes #489. Submitting this for review. Not ready to merge.