Scaling
Baleen has two mechanisms for scaling:
- Horizontal scaling : sharing the workload between different machines and scaling by adding more machines.
- Vertical scaling : scaling by using more resources on existing machines.
Both types of scaling are delivered through a common approach, using queues to link different pipelines, potentially running on different Baleen servers. The first part of the solution provides a mechanism to serialise the work done so it can be transferred to the next part.
This uses JSON as the serialisation format, implemented with the JsonJCasConverter class, which provides the respective serialise and deserialise methods.
The serialised JCas can then be transported to another pipeline for further processing using a message queue. Implementations are provided for the following queues:
- In memory (for vertical scaling only)
- ActiveMQ ( http://activemq.apache.org/ )
- Kafka ( https://kafka.apache.org/ )
- RabbitMQ ( https://www.rabbitmq.com/ )
- Redis ( https://redis.io/ )
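The serialise, transport, deserialise flow described above can be sketched in a few lines. This is only an illustration of the idea in Python, not Baleen's Java API: the `serialise`, `deserialise`, `send` and `receive` functions and the document layout are hypothetical, and a standard-library in-memory queue stands in for the message queue.

```python
import json
import queue

# Hypothetical stand-ins for the serialise/deserialise steps; these are NOT
# the JsonJCasConverter methods, only an illustration of the same idea.
def serialise(document: dict) -> str:
    """Turn a document and its annotations into a JSON string."""
    return json.dumps(document, sort_keys=True)

def deserialise(payload: str) -> dict:
    """Rebuild the document from its JSON string."""
    return json.loads(payload)

# An in-memory queue plays the role of the transport between two pipelines.
transport = queue.Queue()

def send(document: dict) -> None:
    """First pipeline: serialise the work done so far and enqueue it."""
    transport.put(serialise(document))

def receive() -> dict:
    """Second pipeline: dequeue a message and restore the document."""
    return deserialise(transport.get())

# Assumed document layout, chosen purely for this example.
original = {
    "text": "Alice visited London.",
    "annotations": [{"type": "Person", "begin": 0, "end": 5}],
}
send(original)
received = receive()
```

Because the document travels as a plain string, the in-memory queue could be swapped for an external broker without changing either side, which is what allows the same approach to serve both vertical and horizontal scaling.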
The following example shows how to use transports in pipelines. The first pipeline uses a standard collection reader to read the data from the external source and then places the unprocessed ‘documents’ on a Redis message queue:
    redis:
      host: localhost
      port: 6379
    collectionreader:
      class: FolderReader
      folders:
        - ./files
    consumers:
      - uk.gov.dstl.baleen.transports.redis.RedisTransportSender
Then the following pipeline reads from the message queue and processes the document.
    redis:
      host: localhost
      port: 6379
    collectionreader:
      class: uk.gov.dstl.baleen.transports.redis.RedisTransportReceiver
    annotators:
      - myAnnotators
    consumers:
      - print.Entities
This pipeline can now be multiplied either on the same Baleen server for vertical scaling or on different Baleen servers with access to the same message queue for horizontal scaling.
Each pipeline removes a document from the message queue to process, so running more pipelines increases the throughput of the system. The multiplicity property in the Baleen configuration for pipelines makes this easier to set up. For example, the following configuration file creates one pipeline from reader.yml and three pipelines from the processor.yml pipeline configuration. Note that the processors are declared first so that they are created first and are ready to receive messages from the sender, but this is not required.
    Pipelines:
      - name: receiver
        multiplicity: 3
        file: ./processor.yml
      - name: sender
        file: ./reader.yml
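The effect of the configuration above, one sender feeding three identical receivers from a shared queue, can be sketched as follows. This is a conceptual Python illustration using threads and an in-memory queue, not how Baleen itself is implemented; `run_sender`, `run_receiver`, the sentinel shutdown and the upper-casing "processing" step are all assumptions made for the example.

```python
import queue
import threading

MULTIPLICITY = 3   # mirrors "multiplicity: 3" in the configuration
SENTINEL = None    # hypothetical end-of-input marker, one per receiver

def run_sender(transport, documents):
    """Place every unprocessed document on the queue, then signal shutdown."""
    for doc in documents:
        transport.put(doc)
    for _ in range(MULTIPLICITY):
        transport.put(SENTINEL)

def run_receiver(name, transport, results, lock):
    """Take documents off the queue one at a time and process them."""
    while True:
        doc = transport.get()
        if doc is SENTINEL:
            break
        processed = doc.upper()  # placeholder for the annotators and consumers
        with lock:
            results.append((name, processed))

def run(documents):
    transport = queue.Queue()
    results = []
    lock = threading.Lock()
    # Receivers are started first so they are ready when the sender begins,
    # matching the declaration order in the configuration (not required).
    receivers = [
        threading.Thread(
            target=run_receiver,
            args=(f"receiver-{i}", transport, results, lock),
        )
        for i in range(MULTIPLICITY)
    ]
    for receiver in receivers:
        receiver.start()
    run_sender(transport, documents)
    for receiver in receivers:
        receiver.join()
    return results

results = run([f"document {i}" for i in range(10)])
```

Each document is removed from the queue by exactly one receiver, so every document is processed once regardless of how many receivers are running. Which receiver handles which document is not deterministic.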
Further examples of how to use the scaling transports can be found here.