
Why replace all records of a resource?

When updating a resource, the Harvester will dump and replace all of its records. The main reason for this decision is that our community does not produce globally unique identifiers for records. Relying on identifiers local to a resource is risky; there may be duplicates, or identifiers may have changed. Updating rather than dumping and replacing would eventually lead to erroneous data. Another reason is that discovering which data have changed from one harvest to the next is more computationally expensive than simply replacing everything.
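
As a rough illustration, the dump-and-replace step boils down to deleting every record tied to the resource and re-inserting the freshly harvested set. The sketch below assumes a JDBC connection and an 'occurrence' table with a 'resource_id' column; these names are illustrative, not the Harvester's actual schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Hypothetical sketch of dump-and-replace for one resource.
// Table and column names are assumptions for illustration only.
public class DumpAndReplace {

    public static void replaceResource(Connection conn, int resourceId,
                                       List<String[]> records) throws Exception {
        // 1) Dump: remove every record previously harvested for this resource.
        try (PreparedStatement delete =
                 conn.prepareStatement("DELETE FROM occurrence WHERE resource_id = ?")) {
            delete.setInt(1, resourceId);
            delete.executeUpdate();
        }

        // 2) Replace: re-insert the freshly harvested records in one batch.
        try (PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO occurrence (resource_id, record_id, scientific_name) VALUES (?, ?, ?)")) {
            for (String[] r : records) {
                insert.setInt(1, resourceId);
                insert.setString(2, r[0]);
                insert.setString(3, r[1]);
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }
}
```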

Why use a 'buffer' schema?

When updating a resource, the Harvester will first insert all new records into a 'buffer' database schema. Once all records are successfully inserted, the Harvester transfers them into the 'public' schema. While it is true that transactions could be used, the distributed nature of the Harvester adds complexity to the management of 'rollback' and 'commit': a node could be processing data from several resources, any of which could go offline at any moment. We felt that adding a 'buffer' schema to the workflow keeps things clean and simple.
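
A minimal sketch of the final "move" step, assuming a JDBC connection and that both schemas hold an 'occurrence' table keyed by 'resource_id' (these names are assumptions, not the Harvester's actual schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Hypothetical sketch of the buffer-to-public move, run only after every record
// of the resource has landed in the buffer schema.
public class BufferToPublic {

    public static void moveResource(Connection conn, int resourceId) throws Exception {
        conn.setAutoCommit(false);
        try {
            // Replace the resource in 'public' with the buffered copy, then clear the buffer.
            execute(conn, "DELETE FROM public.occurrence WHERE resource_id = ?", resourceId);
            execute(conn, "INSERT INTO public.occurrence "
                        + "SELECT * FROM buffer.occurrence WHERE resource_id = ?", resourceId);
            execute(conn, "DELETE FROM buffer.occurrence WHERE resource_id = ?", resourceId);
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    private static void execute(Connection conn, String sql, int resourceId) throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, resourceId);
            stmt.executeUpdate();
        }
    }
}
```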

Why use one transaction per message (on nodes)?

The main reason is explained in the previous section, "Why use a 'buffer' schema?". Note that one message can contain more than one record of data. Also, keeping a transaction open for thousands of rows can consume a fair amount of resources.
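
To illustrate, a node could wrap each incoming message in a single transaction along these lines. The Message class and the buffer table name below are assumptions made for the sake of the example, not the Harvester's actual code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Hypothetical sketch: one transaction per message on a node.
public class MessageProcessor {

    // A message carries a batch of records, not just one.
    public static class Message {
        public int resourceId;
        public List<String[]> records; // [recordId, scientificName]
    }

    public static void handle(Connection conn, Message msg) throws Exception {
        conn.setAutoCommit(false); // one transaction per message
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO buffer.occurrence (resource_id, record_id, scientific_name) VALUES (?, ?, ?)")) {
            for (String[] r : msg.records) {
                insert.setInt(1, msg.resourceId);
                insert.setString(2, r[0]);
                insert.setString(3, r[1]);
                insert.addBatch();
            }
            insert.executeBatch();
            conn.commit(); // either the whole message reaches the buffer, or none of it does
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}
```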

What if I have referential integrity constraints?

There is no magic here: you can't have referential integrity constraints and distributed processing at the same time. The reason is simple: you could insert the referrer before the referee. There are two possible solutions to this problem. You could process all the referee records first and then process all the referrer records, but this only works when the referee and the referrer are not in the same table. The preferred solution is to have no referential integrity constraints on the buffer schema. Of course, an additional task is then needed to ensure referential integrity before the move to the 'public' schema, but this solution maximizes data insertion speed (inserting without constraints is faster).
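
One way such a check could look, assuming a hypothetical 'buffer.occurrence' table that refers to 'buffer.taxon' through a 'taxon_id' column (these names are illustrative only):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical sketch of an integrity check run on the buffer schema
// before the move to the 'public' schema.
public class IntegrityCheck {

    public static boolean hasOrphans(Connection conn, int resourceId) throws Exception {
        String sql =
            "SELECT COUNT(*) FROM buffer.occurrence o "
          + "LEFT JOIN buffer.taxon t ON o.taxon_id = t.id "
          + "WHERE o.resource_id = ? AND o.taxon_id IS NOT NULL AND t.id IS NULL";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, resourceId);
            try (ResultSet rs = stmt.executeQuery()) {
                rs.next();
                return rs.getLong(1) > 0; // orphaned referrers found: do not move to 'public'
            }
        }
    }
}
```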

Why not Hadoop?

Even though data processing is the main component of the Harvester, it is not the only one. We need a lightweight and flexible tool that processes fewer than a billion data records and executes various tasks. That said, Hadoop could accommodate these needs, but for now we will stick with the KISS principle. If you think this could easily be done, please let us know. If your needs exceed 1 billion records, you should investigate the work GBIF has done using Hadoop.