This repository has been archived by the owner on Feb 21, 2025. It is now read-only.

Scaling

JohnDaws edited this page Aug 13, 2018 · 2 revisions

Baleen has two mechanisms for scaling:

  • Horizontal scaling: sharing the workload between different machines and scaling by adding more machines.
  • Vertical scaling: scaling by using more resources on existing machines.

Both types of scaling are delivered through a common approach, using queues to link different pipelines, potentially running on different Baleen servers. The first part of the solution is a mechanism to serialise the work done so far so that it can be transferred to the next stage.

Baleen uses JSON as the serialisation format, implemented by the JsonJCasConverter class, which provides serialise and deserialise methods.
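The round-trip idea can be sketched as follows. This is a minimal conceptual illustration in Python, not Baleen's actual JsonJCasConverter API: a JCas-like document (text plus annotations) is flattened to a JSON string so it can cross a process boundary, then rebuilt on the receiving side.

```python
import json

def serialise(document):
    """Convert a document (text plus annotations) to a JSON string."""
    return json.dumps(document)

def deserialise(payload):
    """Rebuild the document from its JSON representation."""
    return json.loads(payload)

# Round trip: what the sender puts on the queue is exactly
# what the receiver gets back.
doc = {
    "text": "John Smith visited London.",
    "annotations": [{"type": "Person", "begin": 0, "end": 10}],
}
assert deserialise(serialise(doc)) == doc
```

Because the payload is plain JSON, any transport that can carry a string (Redis, a message broker, a file) can carry a serialised document.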

The serialised JCas can then be transported to another pipeline for further processing using a message queue. Implementations are provided for several message queues, including Redis, which is used in the examples below.

The following example shows how to use transports in pipelines. The first pipeline uses a standard collection reader to input the data from the external source and then places the unprocessed ‘documents’ on a Redis message queue:

reader.yml

redis:
  host: localhost
  port: 6379

collectionreader:
  class: FolderReader
  folders:
  - ./files

consumers:
  - uk.gov.dstl.baleen.transports.redis.RedisTransportSender

Then the following pipeline reads documents from the message queue and processes them.

processor.yml

redis:
  host: localhost
  port: 6379

collectionreader:
  class: uk.gov.dstl.baleen.transports.redis.RedisTransportReceiver

annotators:
  - myAnnotators

consumers:
  - print.Entities

This pipeline can now be multiplied either on the same Baleen server for vertical scaling or on different Baleen servers with access to the same message queue for horizontal scaling.
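This is the competing-consumers pattern: identical pipelines pull from one shared queue, so each document is processed exactly once and adding pipelines raises throughput. The sketch below illustrates the pattern in Python with an in-process queue and threads standing in for pipelines; the names are illustrative, not Baleen classes.

```python
import json
import queue
import threading

message_queue = queue.Queue()
processed = []
lock = threading.Lock()

def pipeline(worker_id):
    """A stand-in for a receiver pipeline: take documents until the queue is empty."""
    while True:
        try:
            payload = message_queue.get(timeout=0.5)
        except queue.Empty:
            return
        document = json.loads(payload)
        with lock:
            processed.append((worker_id, document["id"]))
        message_queue.task_done()

# Sender: place ten serialised documents on the shared queue.
for i in range(10):
    message_queue.put(json.dumps({"id": i}))

# Three receiver pipelines, mirroring multiplicity: 3 below.
workers = [threading.Thread(target=pipeline, args=(w,)) for w in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()

# Every document was handled exactly once, regardless of which worker took it.
assert sorted(doc_id for _, doc_id in processed) == list(range(10))
```

With a real message queue such as Redis, the same property holds across processes and machines, which is what makes both vertical and horizontal scaling possible.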

Each pipeline removes a document from the message queue to process, so running multiple pipelines increases the throughput of the system. The multiplicity property in the Baleen configuration for pipelines makes this easier to set up. For example, the following configuration file creates one pipeline from reader.yml and three pipelines from the processor.yml configuration. Note that the processors are declared first so they are created first and ready to receive messages from the sender, although this ordering is not required.

config.yml

Pipelines:
- name: receiver
  multiplicity: 3
  file: ./processor.yml
- name: sender
  file: ./reader.yml

Further examples of how to use the scaling transports can be found here.