
Error with ETL Parallel execution #7252

Closed
danciui opened this issue Mar 20, 2017 · 13 comments
@danciui

danciui commented Mar 20, 2017

### OrientDB Version: orientdb-community-2.2.16
### Java Version: Java 8 Update 91
### OS: macOS Sierra 10.12.3

### Expected behavior
The error does not happen.

### Actual behavior
Error in Pipeline execution: com.orientechnologies.orient.core.exception.ODatabaseException: Cannot execute the request because an asynchronous operation is in progress. Please use a different connection

oetl.sh actually completes, but I'm not sure how well.

### Steps to reproduce
Archive.zip

@robfrank robfrank self-assigned this Apr 4, 2017
@robfrank
Contributor

robfrank commented Apr 4, 2017

Hi, sorry for the delay. My suggestion is to avoid parallel execution in this case, where you have lookups and merges, and to define an index on the property the ETL uses for lookups.
This will speed up the operations a lot.
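For reference, a minimal sketch of how disabling parallel execution might look in the top-level `config` section of the ETL JSON (the `haltOnError` value here is just illustrative):

```json
{
  "config": {
    "parallel": false,
    "haltOnError": false
  }
}
```

The lookup index itself can be declared in the loader's `indexes` section or created beforehand directly on the class.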

@robfrank
Contributor

robfrank commented Apr 7, 2017

May I close this issue? Did you manage to solve it?

@danciui
Author

danciui commented Apr 7, 2017

Can you add it as a bug/enhancement? This is a very powerful feature to have working correctly.

@robfrank
Contributor

robfrank commented Apr 8, 2017

Are you talking about parallel execution? On 3.0.x it has improved a lot, but let me show some results from my laptop.

Parallel vs remote db WITH index:

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,148 vertices (629 vertices/sec) Total time: 14885ms [0 warnings, 6 errors]

Single thread vs remote db WITH index

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,154 vertices (618 vertices/sec) Total time: 14447ms [0 warnings, 0 errors]

Single thread vs remote db WITHOUT index

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,154 vertices (9 vertices/sec) Total time: 934899ms [0 warnings, 0 errors]

Parallel vs remote db WITHOUT index

+ extracted 8,154 entries (0 entries/sec) - 8,154 entries -> loaded 8,148 vertices (13 vertices/sec) Total time: 936480ms [0 warnings, 6 errors]

Conclusions
We have a problem at startup with parallel execution, to be solved.
But there's no performance gain going parallel vs single-threaded.
There IS a LOT of gain if the index is defined.
Without the index, each lookup implies a full scan, and a full scan is proportional to the database size: the more records are inserted, the more time the db needs to scan.
I simply added these lines to the configuration:

      "indexes": [
        {
          "class": "Entity",
          "fields": [
            "id:string"
          ],
          "type": "UNIQUE"
        }
      ]

@eltherion

I have an ETL configuration defined like this:

{
  "config": {
    "haltOnError": false
  },
  "extractor": {
    "jdbc": {
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "url": "jdbc:hive2://???:10000/????hive.resultset.use.unique.column.names=false",
      "query": "SELECT * FROM edges where relation_type = 'board member'"
    }
  },
  "transformers": [
    { "merge": { "joinFieldName":"from_identifier", "lookup":"Entity.identifier" } },
    { "vertex": {"class": "Entity", "skipDuplicates": true} },
    { "edge": {
        "class": "board_member",
        "joinFieldName": "to_identifier",
        "lookup": "Entity.identifier",
        "direction": "out",
        "targetVertexFields":{
          "identifier": "${input.to_identifier}"
        }
      }
    },
    { "field": 
      { "fieldNames": 
        [ "from_identifier", "to_identifier", "relation_type", "dt"], 
        "operation": "remove"
      } 
    }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:???/???",
       "dbUser": "???",
       "dbPassword": "???",
       "dbType": "graph",
       "useLightweightEdges": false,
       "serverUser": "???",
       "serverPassword": "???",
       "wal": false
    }
  }
}

I already use the existing index here:

 { "merge": { "joinFieldName":"from_identifier", "lookup":"Entity.identifier" } },

Where am I supposed to add that section?

"indexes": [
        {
          "class": "Entity",
          "fields": [
            "id:string"
          ],
          "type": "UNIQUE"
        }
      ]

@robfrank
Contributor

This is for a new configuration, I guess. You need to add the `indexes` section in the loader part, if you are creating the schema through the ETL itself:

  "loader": {
    "orientdb": {
      "dbURL": "remote:",
      "serverUser": "",
      "serverPassword": "",
      "dbAutoDropIfExists": true,
      "dbAutoCreate": true,
      "standardElementConstraints": false,
      "tx": false,
      "wal": false,
      "batchCommit": 1000,
      "dbType": "graph",
      "classes": [
        {
          "name": "Entity",
          "extends": "V"
        },
        {
          "name": "Containment",
          "extends": "E"
        },
        {
          "name": "Connection",
          "extends": "E"
        },
        {
          "name": "Inheritance",
          "extends": "E"
        }
      ],
      "indexes": [
        {
          "class": "Entity",
          "fields": [
            "id:string"
          ],
          "type": "UNIQUE"
        }
      ]
    }
  }

Take a look at the documentation and at the examples:

http://orientdb.com/docs/last/Loader.html#examples

Hope this helps.

@eltherion

@robfrank Thanks for the answer. The ETL configuration is for an existing schema and indexes: the vertices with their indexes are already created and filled, and with this ETL configuration I need to connect some of them. Here is my configuration after applying your suggestions:

{
  "config": {
    "haltOnError": false
  },
  "extractor": { 
    "jdbc": {
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "url": "jdbc:hive2://???:10000/????hive.resultset.use.unique.column.names=false",
      "query": "SELECT * FROM edges where relation_type = 'liquidator'"
    }
  },
  "transformers": [
    { "merge": { "joinFieldName":"from_identifier", "lookup":"Entity.identifier" } },
    { "vertex": {"class": "Entity", "skipDuplicates": true} },
    { "edge": {
        "class": "liquidator",
        "joinFieldName": "to_identifier",
        "lookup": "Entity.identifier",
        "direction": "out",
        "targetVertexFields":{
          "identifier": "${input.to_identifier}"
        }
      }
    },
    { "field": 
      { "fieldNames": 
        [ "from_identifier", "to_identifier", "relation_type", "dt"], 
        "operation": "remove"
      } 
    }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "???",
       "dbUser": "???",
       "dbPassword": "???",
       "dbType": "graph",
       "useLightweightEdges": false,
       "serverUser": "???",
       "serverPassword": "???",
       "wal": false,
       "tx": false,
       "batchCommit": 1000,
       "classes": [
         {
           "name": "Entity",
           "extends": "V"
         },
         {
           "name": "liquidator",
           "extends": "E"
         }
       ],
       "indexes": [
         {
           "class": "Entity",
           "fields": [
             "identifier:string"
           ],
           "type": "UNIQUE_HASH_INDEX"
         }
       ]
    }
  }
}

Here is the performance for a small portion after the changes:

BEGIN ETL PROCESSOR
+ extracted 0 records (0 records/sec) - 0 records -> loaded 0 vertices (0 vertices/sec) Total time: 1000ms [0 warnings, 0 errors]
+ extracted 164 records (164 records/sec) - 164 records -> loaded 81 vertices (81 vertices/sec) Total time: 2s [0 warnings, 0 errors]
+ extracted 716 records (544 records/sec) - 716 records -> loaded 357 vertices (272 vertices/sec) Total time: 3014ms [0 warnings, 0 errors]
+ extracted 1,388 records (672 records/sec) - 1,388 records -> loaded 693 vertices (336 vertices/sec) Total time: 4014ms [0 warnings, 0 errors]
+ extracted 2,070 records (682 records/sec) - 2,070 records -> loaded 1,034 vertices (341 vertices/sec) Total time: 5014ms [0 warnings, 0 errors]
+ extracted 2,718 records (648 records/sec) - 2,718 records -> loaded 1,358 vertices (324 vertices/sec) Total time: 6014ms [0 warnings, 0 errors]
...
+ extracted 23,972 records (524 records/sec) - 23,972 records -> loaded 11,985 vertices (262 vertices/sec) Total time: 53025ms [0 warnings, 0 errors]
+ extracted 24,310 records (338 records/sec) - 24,310 records -> loaded 12,154 vertices (169 vertices/sec) Total time: 54025ms [0 warnings, 0 errors]
+ extracted 24,594 records (284 records/sec) - 24,594 records -> loaded 12,296 vertices (142 vertices/sec) Total time: 55025ms [0 warnings, 0 errors]
END ETL PROCESSOR
+ extracted 24,599 records (25 records/sec) - 24,599 records -> loaded 12,299 vertices (15 vertices/sec) Total time: 55225ms [0 warnings, 0 errors]

Unfortunately, there is no difference. I have to load millions of edges, and for now that would take hours, which is not acceptable in my use case.

@eltherion

@robfrank Hi, do you have any suggestions for my configuration from the previous post to speed up loading edges?

@robfrank
Contributor

Can you perform the load with a plocal setup? If you need a "one-shot" import, you can switch off the server, import in plocal mode, and then restart the server.
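For illustration, a sketch of what the loader section might look like for a plocal import; the filesystem path is hypothetical and should point at the server's databases directory while the server is stopped:

```json
"loader": {
  "orientdb": {
    "dbURL": "plocal:/path/to/orientdb/databases/mydb",
    "dbType": "graph",
    "tx": false,
    "wal": false,
    "batchCommit": 1000
  }
}
```

The rest of the configuration (extractor, transformers, classes, indexes) stays the same; only the `dbURL` changes from `remote:` to `plocal:`, and the server credentials are no longer needed.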

@robfrank
Contributor

robfrank commented May 9, 2017

Did you manage to load your datasets? May I close this issue?

@eltherion

Hi, I have been on holiday for several weeks, so I couldn't try loading in plocal mode. The soonest I can check is the 15th of May, so please wait at least one week from now. Thank you.

@robfrank
Contributor

robfrank commented Jun 6, 2017

Hi, did you check? May I close? In the meantime we released 2.2.21.

@eltherion

Hi, thank you for your patience. Unfortunately, switching off the server and importing in plocal mode cannot be done in my use case; I have to keep the server alive all the time. I think we have to close this issue and wait for 3.0 to be released. Has 2.2.21 improved parallel execution? It's not mentioned in the release notes (or maybe I'm blind).
