
Error with ETL Parallel execution #7252

Closed
danciui opened this issue Mar 20, 2017 · 13 comments
@danciui

danciui commented Mar 20, 2017

### OrientDB Version: orientdb-community-2.2.16
### Java Version: Java 8 Update 91
### OS: macOS Sierra 10.12.3

### Expected behavior
The error does not happen.

### Actual behavior
Error in Pipeline execution: com.orientechnologies.orient.core.exception.ODatabaseException: Cannot execute the request because an asynchronous operation is in progress. Please use a different connection

oetl.sh actually completes, but I'm not sure how well.

### Steps to reproduce
Archive.zip

@robfrank robfrank self-assigned this Apr 4, 2017
@robfrank
Contributor

robfrank commented Apr 4, 2017

Hi, sorry for the delay. My suggestion is to avoid parallel execution in this case, where you have lookups and merges, and to define an index on the property the ETL uses for lookups.
This will speed up the operations a lot.
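For reference, a minimal sketch of how disabling parallel execution might look in the top-level `config` section of the ETL JSON (the `haltOnError` value here is just illustrative):

```json
{
  "config": {
    "parallel": false,
    "haltOnError": false
  }
}
```

The lookup index itself can be declared in the loader's `indexes` section or created beforehand directly on the class.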

@robfrank
Contributor

robfrank commented Apr 7, 2017

May I close this issue? Did you manage to solve it?

@danciui
Author

danciui commented Apr 7, 2017

Can you add it as a bug/enhancement? This is a very powerful feature to have working correctly.

@robfrank
Contributor

robfrank commented Apr 8, 2017

Are you talking about parallel execution? On 3.0.x it has improved a lot, but let me show some results from my laptop.

Parallel vs remote db WITH index:

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,148 vertices (629 vertices/sec) Total time: 14885ms [0 warnings, 6 errors]

Single thread vs remote db WITH index

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,154 vertices (618 vertices/sec) Total time: 14447ms [0 warnings, 0 errors]

Single thread vs remote db WITHOUT index

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,154 vertices (9 vertices/sec) Total time: 934899ms [0 warnings, 0 errors]

Parallel vs remote db WITHOUT index

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,148 vertices (13 vertices/sec) Total time: 936480ms [0 warnings, 6 errors]

Conclusions
We have a problem at startup with parallel execution, to be solved.
But there's no performance gain going parallel vs single-threaded.
There IS a LOT of gain if the index is defined.
Without the index, each lookup implies a full scan, and a full scan is proportional to the database size: the more records are inserted, the more time the db needs to scan.
I simply added these lines to the configuration:

      "indexes": [
        {
          "class": "Entity",
          "fields": [
            "id:string"
          ],
          "type": "UNIQUE"
        }
      ]

@eltherion

I have an ETL configuration defined like this:

{
  "config": {
    "haltOnError": false
  },
  "extractor": {
    "jdbc": {
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "url": "jdbc:hive2://???:10000/????hive.resultset.use.unique.column.names=false",
      "query": "SELECT * FROM edges where relation_type = 'board member'"
    }
  },
  "transformers": [
    { "merge": { "joinFieldName":"from_identifier", "lookup":"Entity.identifier" } },
    { "vertex": {"class": "Entity", "skipDuplicates": true} },
    { "edge": {
        "class": "board_member",
        "joinFieldName": "to_identifier",
        "lookup": "Entity.identifier",
        "direction": "out",
        "targetVertexFields":{
          "identifier": "${input.to_identifier}"
        }
      }
    },
    { "field": 
      { "fieldNames": 
        [ "from_identifier", "to_identifier", "relation_type", "dt"], 
        "operation": "remove"
      } 
    }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:???/???",
       "dbUser": "???",
       "dbPassword": "???",
       "dbType": "graph",
       "useLightweightEdges": false,
       "serverUser": "???",
       "serverPassword": "???",
       "wal": false
    }
  }
}

I already use the existing index here:

 { "merge": { "joinFieldName":"from_identifier", "lookup":"Entity.identifier" } },

Where am I supposed to add that section?

"indexes": [
        {
          "class": "Entity",
          "fields": [
            "id:string"
          ],
          "type": "UNIQUE"
        }
      ]

@robfrank
Contributor

This is for a new configuration, I guess. You need to add the `indexes` section in the loader part, if you are creating the schema through the ETL itself:

  "loader": {
    "orientdb": {
      "dbURL": "remote:",
      "serverUser": "",
      "serverPassword": "",
      "dbAutoDropIfExists": true,
      "dbAutoCreate": true,
      "standardElementConstraints": false,
      "tx": false,
      "wal": false,
      "batchCommit": 1000,
      "dbType": "graph",
      "classes": [
        {
          "name": "Entity",
          "extends": "V"
        },
        {
          "name": "Containment",
          "extends": "E"
        },
        {
          "name": "Connection",
          "extends": "E"
        },
        {
          "name": "Inheritance",
          "extends": "E"
        }
      ],
      "indexes": [
        {
          "class": "Entity",
          "fields": [
            "id:string"
          ],
          "type": "UNIQUE"
        }
      ]
    }
  }

Take a look at the documentation and at the examples:

http://orientdb.com/docs/last/Loader.html#examples

Hope this helps.

@eltherion

@robfrank Thanks for the answer. The ETL configuration is for an existing schema and indexes: the vertices with their indexes are already created and filled, and with this ETL configuration I need to connect some of them. Here is my configuration after applying your suggestions:

{
  "config": {
    "haltOnError": false
  },
  "extractor": { 
    "jdbc": {
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "url": "jdbc:hive2://???:10000/????hive.resultset.use.unique.column.names=false",
      "query": "SELECT * FROM edges where relation_type = 'liquidator'"
    }
  },
  "transformers": [
    { "merge": { "joinFieldName":"from_identifier", "lookup":"Entity.identifier" } },
    { "vertex": {"class": "Entity", "skipDuplicates": true} },
    { "edge": {
        "class": "liquidator",
        "joinFieldName": "to_identifier",
        "lookup": "Entity.identifier",
        "direction": "out",
        "targetVertexFields":{
          "identifier": "${input.to_identifier}"
        }
      }
    },
    { "field": 
      { "fieldNames": 
        [ "from_identifier", "to_identifier", "relation_type", "dt"], 
        "operation": "remove"
      } 
    }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "???",
       "dbUser": "???",
       "dbPassword": "???",
       "dbType": "graph",
       "useLightweightEdges": false,
       "serverUser": "???",
       "serverPassword": "???",
       "wal": false,
       "tx": false,
       "batchCommit": 1000,
       "classes": [
         {
           "name": "Entity",
           "extends": "V"
         },
         {
           "name": "liquidator",
           "extends": "E"
         }
       ],
       "indexes": [
         {
           "class": "Entity",
           "fields": [
             "identifier:string"
           ],
           "type": "UNIQUE_HASH_INDEX"
         }
       ]
    }
  }
}

Here is the performance for a small portion after the changes:

BEGIN ETL PROCESSOR
+ extracted 0 records (0 records/sec) - 0 records -> loaded 0 vertices (0 vertices/sec) Total time: 1000ms [0 warnings, 0 errors]
+ extracted 164 records (164 records/sec) - 164 records -> loaded 81 vertices (81 vertices/sec) Total time: 2s [0 warnings, 0 errors]
+ extracted 716 records (544 records/sec) - 716 records -> loaded 357 vertices (272 vertices/sec) Total time: 3014ms [0 warnings, 0 errors]
+ extracted 1,388 records (672 records/sec) - 1,388 records -> loaded 693 vertices (336 vertices/sec) Total time: 4014ms [0 warnings, 0 errors]
+ extracted 2,070 records (682 records/sec) - 2,070 records -> loaded 1,034 vertices (341 vertices/sec) Total time: 5014ms [0 warnings, 0 errors]
+ extracted 2,718 records (648 records/sec) - 2,718 records -> loaded 1,358 vertices (324 vertices/sec) Total time: 6014ms [0 warnings, 0 errors]
...
+ extracted 23,972 records (524 records/sec) - 23,972 records -> loaded 11,985 vertices (262 vertices/sec) Total time: 53025ms [0 warnings, 0 errors]
+ extracted 24,310 records (338 records/sec) - 24,310 records -> loaded 12,154 vertices (169 vertices/sec) Total time: 54025ms [0 warnings, 0 errors]
+ extracted 24,594 records (284 records/sec) - 24,594 records -> loaded 12,296 vertices (142 vertices/sec) Total time: 55025ms [0 warnings, 0 errors]
END ETL PROCESSOR
+ extracted 24,599 records (25 records/sec) - 24,599 records -> loaded 12,299 vertices (15 vertices/sec) Total time: 55225ms [0 warnings, 0 errors]

Unfortunately, there is no difference. I have to load millions of edges, and for now that would take hours, which is not acceptable in my use case.

@eltherion

@robfrank Hi, do you have any suggestions for my configuration from the previous post to speed up loading edges?

@robfrank
Contributor

Can you perform the load with a plocal setup? If you need a "one-shot" import, you can switch off the server, import in plocal mode, and then restart the server.
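For illustration, a sketch of what the loader section might look like for a plocal import; the filesystem path is hypothetical and should point at the server's databases directory while the server is stopped:

```json
"loader": {
  "orientdb": {
    "dbURL": "plocal:/path/to/orientdb/databases/mydb",
    "dbType": "graph",
    "tx": false,
    "wal": false,
    "batchCommit": 1000
  }
}
```

The rest of the configuration (extractor, transformers, classes, indexes) stays the same; only the `dbURL` changes from `remote:` to `plocal:`, and the server credentials are no longer needed.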

@robfrank
Contributor

robfrank commented May 9, 2017

Did you manage to load your datasets? May I close this issue?

@eltherion

Hi, I have been on holiday for several weeks, so I couldn't try loading in plocal mode. The soonest I can check is the 15th of May, so please wait at least one week from now. Thank you.

@robfrank
Contributor

robfrank commented Jun 6, 2017

Hi, did you check? May I close? In the meantime we released 2.2.21.

@eltherion

Hi, thank you for your patience. Unfortunately, switching off the server and importing in plocal mode cannot be done in my use case; I have to keep the server alive all the time. I think we have to close this issue and wait for 3.0 to be released. Has 2.2.21 improved parallel execution? It's not mentioned in the release notes (or maybe I'm blind).
