From a functional perspective, all data processing applications are quite similar and can be represented as a pipeline of sequential operations (stages) performed on an input dataset or stream. Generally speaking, we need to read data, apply some transformations, and save the results to storage. Data sources, formats, and transformations can differ, but we can still identify a sequence of operations applied to the inbound dataset.
The library provides the major components needed to construct, configure, and run simple data processing pipelines.
Run the following command to build and install the simple-pipeline library into the local Maven repository:
mvn -f simple-pipeline-api/pom.xml clean install
After that, you can include the library in your project as a regular Maven dependency:
<dependency>
    <groupId>os.toolset</groupId>
    <artifactId>simple-pipeline-api</artifactId>
    <version>1.0.0</version>
</dependency>
As mentioned before, any pipeline is nothing but a sequence of stages executed in a predefined order.
Stages in a pipeline represent the logical steps needed to process data. Each stage represents a single high-level processing concept, such as reading files, loading data from external storage, computing a product from the data, or writing data to a database.
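Judging from the example stage shown later in this section, the stage contract can be sketched roughly as follows; this is an approximation, and the actual interface shipped with simple-pipeline-api may declare additional members:

// Approximate shape of the Stage contract, inferred from the GreetingStage
// example below; the real interface in simple-pipeline-api may differ.
public interface Stage<C> {

    // Unique name used to reference the stage from the pipeline configuration
    String name();

    // Executes the stage against the shared execution context
    void run(C context) throws ExecutionError;
}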
Instead of embedding a predefined set of stages into the pipeline framework, a plug-in architecture can be used. The main idea behind this architecture is to allow additional features (stages) to be added as plugins to the core application (the pipeline execution framework). java.util.ServiceLoader is used to discover and load new stages at runtime. The pipeline configuration defines which stages are included in the pipeline and sets their execution order.
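For reference, discovering the available stages with the standard ServiceLoader mechanism might look roughly like the sketch below; this is a simplified illustration, not the framework's actual bootstrap code:

import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Simplified sketch of stage discovery: every Stage implementation found on the
// classpath is indexed by its name, so the pipeline configuration can refer to
// stages by name and define their execution order.
public final class StageRegistry {

    public static Map<String, Stage<?>> discover() {
        Map<String, Stage<?>> stages = new HashMap<>();
        for (Stage<?> stage : ServiceLoader.load(Stage.class)) {
            stages.put(stage.name(), stage);
        }
        return stages;
    }
}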
/**
 * java.util.ServiceLoader requires all implementations to be registered as a service
 * using META-INF metadata. However, it is easy to forget to update or correctly
 * specify the service descriptors. The AutoService annotation processor generates
 * this metadata for any class annotated with @AutoService.
 */
@AutoService(Stage.class)
public class GreetingStage implements Stage<Context<String>> {

    @Override
    public String name() {
        return "greet";
    }

    @Override
    public void run(Context<String> context) throws ExecutionError {
        // Loading the stage configuration
        StageConfig stageConfig = stageConfig(context);
        String message = stageConfig.getString("message").required();
        // The execution context can be used to share data between stages
        String username = context.getSnapshot("username");
        System.out.println(format(message, username));
    }
}
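As a usage sketch, the greet stage could be wired into a pipeline with a configuration entry like the one below. The values are hypothetical: the placeholder syntax assumes a MessageFormat-style template, and the username value is expected to be placed into the execution context by an earlier stage.

pipeline.name: "Greeting Pipeline"
steps: [
  {
    name: greet
    message: "Hello, {0}!"
  }
]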
The following configuration file defines a pipeline that reads a CSV file and saves its records in JSON format:
pipeline.name: "CSV to JSON Converter"
steps: [
  {
    name: read-csv
    sep: ","
    headers: true
    path: "file.csv"
    output.id: csv
  }
  {
    name: save-as-json
    input.id: csv
    output.file: "file.json"
  }
]
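The configuration above follows a HOCON-like syntax, so for illustration it could be parsed with a standard library such as Typesafe Config. The snippet below only shows reading the stage list; it is an assumption about tooling, not part of simple-pipeline-api, and the file name pipeline.conf is hypothetical.

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import java.io.File;
import java.util.List;

// Illustration only: parsing the pipeline definition with Typesafe Config.
// The actual configuration mechanism used by simple-pipeline-api may differ.
public final class PipelineConfigExample {

    public static void main(String[] args) {
        Config config = ConfigFactory.parseFile(new File("pipeline.conf"));
        System.out.println(config.getString("pipeline.name"));

        // Each element of "steps" describes one stage and its settings
        List<? extends Config> steps = config.getConfigList("steps");
        for (Config step : steps) {
            System.out.println("stage: " + step.getString("name"));
        }
    }
}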