# Error with ETL Parallel execution #7252
Hi, sorry for the delay. My suggestion is to avoid parallel execution in this case, where you have lookups and a merge, and to define an index on the property on which the ETL does the lookup. |
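As an editorial aside: for an existing schema, the index can also be created directly in the OrientDB console rather than through the ETL loader section. A minimal sketch, assuming the `Entity.identifier` lookup property used in the configurations in this thread:

```sql
-- Sketch: unique index on the property the merge/edge lookup uses
-- (class and property names assume the "Entity.identifier" lookup from this thread)
CREATE INDEX Entity.identifier ON Entity (identifier) UNIQUE_HASH_INDEX
```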
May I close this issue? Did you manage to solve it? |
Can you add it as a bug/enhancement? This is a very powerful feature to have working correctly. |
Are you talking about parallel execution? On 3.0.x it is improved a lot, but let me show some results on my laptop:

```
Parallel vs. remote db WITH index:
+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,148 vertices (629 vertices/sec) Total time: 14885ms [0 warnings, 6 errors]

Single thread vs. remote db WITH index:
+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,154 vertices (618 vertices/sec) Total time: 14447ms [0 warnings, 0 errors]

Single thread vs. remote db WITHOUT index:
+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,154 vertices (9 vertices/sec) Total time: 934899ms [0 warnings, 0 errors]

Parallel vs. remote db WITHOUT index:
+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,148 vertices (13 vertices/sec) Total time: 936480ms [0 warnings, 6 errors]
```

Conclusions: the index is what makes the difference. This is the index definition used:

```json
"indexes": [
  {
    "class": "Entity",
    "fields": [
      "id:string"
    ],
    "type": "UNIQUE"
  }
]
```
|
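The single-thread timings above can be turned into a rough speedup figure with a quick back-of-the-envelope calculation (an editorial sketch, not part of the original thread; the millisecond totals are taken verbatim from the runs quoted above):

```python
# Single-thread total times quoted above, in milliseconds
with_index_ms = 14_447      # WITH index
without_index_ms = 934_899  # WITHOUT index

# The indexed lookup avoids a full scan per record, hence the large ratio
speedup = without_index_ms / with_index_ms
print(f"index speedup: ~{speedup:.0f}x")  # roughly 65x
```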
I have an ETL defined like this:

```json
{
  "config": {
    "haltOnError": false
  },
  "extractor": {
    "jdbc": {
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "url": "jdbc:hive2://???:10000/????hive.resultset.use.unique.column.names=false",
      "query": "SELECT * FROM edges where relation_type = 'board member'"
    }
  },
  "transformers": [
    { "merge": { "joinFieldName": "from_identifier", "lookup": "Entity.identifier" } },
    { "vertex": { "class": "Entity", "skipDuplicates": true } },
    { "edge": {
        "class": "board_member",
        "joinFieldName": "to_identifier",
        "lookup": "Entity.identifier",
        "direction": "out",
        "targetVertexFields": {
          "identifier": "${input.to_identifier}"
        }
      }
    },
    { "field": {
        "fieldNames": [ "from_identifier", "to_identifier", "relation_type", "dt" ],
        "operation": "remove"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:???/???",
      "dbUser": "???",
      "dbPassword": "???",
      "dbType": "graph",
      "useLightweightEdges": false,
      "serverUser": "???",
      "serverPassword": "???",
      "wal": false
    }
  }
}
```

I already use an existing index here:

```json
{ "merge": { "joinFieldName": "from_identifier", "lookup": "Entity.identifier" } }
```

Where am I supposed to add that section?

```json
"indexes": [
  {
    "class": "Entity",
    "fields": [
      "id:string"
    ],
    "type": "UNIQUE"
  }
]
```
|
This is a new configuration, I guess. You need to add the index section in the loader part, if you are creating the schema through the ETL itself:

```json
"loader": {
  "orientdb": {
    "dbURL": "remote:",
    "serverUser": "",
    "serverPassword": "",
    "dbAutoDropIfExists": true,
    "dbAutoCreate": true,
    "standardElementConstraints": false,
    "tx": false,
    "wal": false,
    "batchCommit": 1000,
    "dbType": "graph",
    "classes": [
      { "name": "Entity", "extends": "V" },
      { "name": "Containment", "extends": "E" },
      { "name": "Connection", "extends": "E" },
      { "name": "Inheritance", "extends": "E" }
    ],
    "indexes": [
      {
        "class": "Entity",
        "fields": [
          "id:string"
        ],
        "type": "UNIQUE"
      }
    ]
  }
}
```

Take a look at the documentation and at the example: http://orientdb.com/docs/last/Loader.html#examples Hope this helps. |
@robfrank Thanks for the answer. The ETL configuration is for an existing schema and indexes. The vertices and their indexes are already created and filled; with this ETL configuration I need to connect some of them. Here is my configuration after applying your suggestions:

```json
{
  "config": {
    "haltOnError": false
  },
  "extractor": {
    "jdbc": {
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "url": "jdbc:hive2://???:10000/????hive.resultset.use.unique.column.names=false",
      "query": "SELECT * FROM edges where relation_type = 'liquidator'"
    }
  },
  "transformers": [
    { "merge": { "joinFieldName": "from_identifier", "lookup": "Entity.identifier" } },
    { "vertex": { "class": "Entity", "skipDuplicates": true } },
    { "edge": {
        "class": "liquidator",
        "joinFieldName": "to_identifier",
        "lookup": "Entity.identifier",
        "direction": "out",
        "targetVertexFields": {
          "identifier": "${input.to_identifier}"
        }
      }
    },
    { "field": {
        "fieldNames": [ "from_identifier", "to_identifier", "relation_type", "dt" ],
        "operation": "remove"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "???",
      "dbUser": "???",
      "dbPassword": "???",
      "dbType": "graph",
      "useLightweightEdges": false,
      "serverUser": "???",
      "serverPassword": "???",
      "wal": false,
      "tx": false,
      "batchCommit": 1000,
      "classes": [
        { "name": "Entity", "extends": "V" },
        { "name": "liquidator", "extends": "E" }
      ],
      "indexes": [
        {
          "class": "Entity",
          "fields": [
            "identifier:string"
          ],
          "type": "UNIQUE_HASH_INDEX"
        }
      ]
    }
  }
}
```
Here is the performance for a small portion after the changes:

```
BEGIN ETL PROCESSOR
+ extracted 0 records (0 records/sec) - 0 records -> loaded 0 vertices (0 vertices/sec) Total time: 1000ms [0 warnings, 0 errors]
+ extracted 164 records (164 records/sec) - 164 records -> loaded 81 vertices (81 vertices/sec) Total time: 2s [0 warnings, 0 errors]
+ extracted 716 records (544 records/sec) - 716 records -> loaded 357 vertices (272 vertices/sec) Total time: 3014ms [0 warnings, 0 errors]
+ extracted 1,388 records (672 records/sec) - 1,388 records -> loaded 693 vertices (336 vertices/sec) Total time: 4014ms [0 warnings, 0 errors]
+ extracted 2,070 records (682 records/sec) - 2,070 records -> loaded 1,034 vertices (341 vertices/sec) Total time: 5014ms [0 warnings, 0 errors]
+ extracted 2,718 records (648 records/sec) - 2,718 records -> loaded 1,358 vertices (324 vertices/sec) Total time: 6014ms [0 warnings, 0 errors]
...
+ extracted 23,972 records (524 records/sec) - 23,972 records -> loaded 11,985 vertices (262 vertices/sec) Total time: 53025ms [0 warnings, 0 errors]
+ extracted 24,310 records (338 records/sec) - 24,310 records -> loaded 12,154 vertices (169 vertices/sec) Total time: 54025ms [0 warnings, 0 errors]
+ extracted 24,594 records (284 records/sec) - 24,594 records -> loaded 12,296 vertices (142 vertices/sec) Total time: 55025ms [0 warnings, 0 errors]
END ETL PROCESSOR
+ extracted 24,599 records (25 records/sec) - 24,599 records -> loaded 12,299 vertices (15 vertices/sec) Total time: 55225ms [0 warnings, 0 errors]
```

Unfortunately, there is no difference. I have to load millions of edges, and for now it would take hours, which is not acceptable in my use case. |
@robfrank Hi, do you have any suggestions for my configuration from the previous post to speed up loading the edges? |
Can you perform the load on a plocal setup? If you need a "one shot" import, you can switch off the server, import in plocal, and then restart the server. |
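For such a "one shot" plocal import, only the loader's `dbURL` would need to change. A minimal sketch (the filesystem path is a placeholder, and the server must be stopped first so it does not hold the same database directory):

```json
"loader": {
  "orientdb": {
    "dbURL": "plocal:/path/to/orientdb/databases/mydb",
    "dbType": "graph",
    "tx": false,
    "wal": false,
    "batchCommit": 1000
  }
}
```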
did you manage to load your datasets? May I close this issue? |
Hi, I have been on holidays for several weeks, so I can't try loading in plocal mode. The soonest I can check it is 15th of May, so please, wait at least one week from now. Thank you. |
Hi, did you check? May I close? In the meantime we released 2.2.21. |
Hi, thank you for your patience. Unfortunately, switching off the server and importing in plocal mode cannot be done in my use case; I have to keep the server alive all the time. I think we have to close this issue and wait for 3.0 to be released. Has 2.2.21 improved parallel execution? It's not mentioned in the release notes (or maybe I'm blind). |
### OrientDB Version: orientdb-community-2.2.16
### Java Version: Java 8 Update 91
### OS: macOS Sierra 10.12.3
### Expected behavior
The error does not happen.
### Actual behavior
Error in Pipeline execution: com.orientechnologies.orient.core.exception.ODatabaseException: Cannot execute the request because an asynchronous operation is in progress. Please use a different connection
oetl.sh actually completes, but I'm not sure how well.
### Steps to reproduce
Archive.zip