Skip to content

Latest commit

 

History

History
123 lines (99 loc) · 3.64 KB

README.org

File metadata and controls

123 lines (99 loc) · 3.64 KB

Fastfea: C++11 based feature engineering framework

Build

mkdir build && cd build && cmake .. && make

Concepts

Building blocks:

  • Transformer: Input -> Output.
  • Pipeline: Two transformers in sequence, pipeline itself is a transformer.
  • Combiner: Two transformers in parallel, combiner itself is a transformer.

Transformer

Transformer is a mathematical function. (From something to another thing). Yet the transformer (may) has to learn from data to decide its parameters. The process of learning is done in step by continuously observing samples.

For example, to scale a numerical feature (linearly to 0-1), the transformer has to know the lower bound (min) and upper bound (max) of the given feature. In the process of step, this transformer will update its parameter according to each sample.

After observing all samples in step, finalize should be called to do some (possible) finishing works.

Pipeline

Two transformers in sequence, for example:

  • Transformer A categorizes a feature
  • Transformer B binarize (1-of-K binary encoding) of a categorical variable.

To make a pipeline out of them, we overloaded +. So one just needs to:

A + B

The type of first transformer’s output should match that of the second’s input.

Combiner

Two transformers in parallel, the two’s output will be combined with std::tuple. They have to have same input types, but output types could be different.

**Special rule: When both transformers’s output are of the same std::vector type, they will be combined as a concatenated std::vector **

Combiner works well with pipeline. For example, let’s say you want to binarize (transformer C) according to the combination of two-features (transformer A and B) (e.g. 2-gram).

We overloaded |. So one just needs to:

(A | B) + C

Lazy Transformer

LazyTransformer exists to simplify a particular kind of transformer. They don’t need to learn parameters, so step and finalize are useless for them. Examples include literally taking some features, taking Log for a given feature.

Example

#include <iostream>

#include "transformer.hpp"

using transformer::make_transformer;
using transformer::make_lazy_transformer;
using transformer::Binarizer;

struct Data {
    std::string firstname;
    std::string lastname;
};

int main(int argc, char *argv[])
{
    std::function<std::string(const Data& sample)> get_firstname_lambda =
        [](const Data& sample) -> std::string {
        return sample.firstname;
    };
    std::function<std::string(const Data& sample)> get_lastname_lambda =
        [](const Data& sample) -> std::string {
        return sample.lastname;
    };

    auto get_firstname = make_lazy_transformer(get_firstname_lambda);
    auto get_lastname = make_lazy_transformer(get_lastname_lambda);
    auto binarizer = make_transformer<Binarizer<std::tuple<std::string, std::string>>>();

    Data data1{"Mike", "Jordan"};
    Data data2{"Mike", "James"};
    Data data3{"Bill", "Jordan"};
    Data data4{"Bill", "James"};

    auto pipe = (get_firstname | get_lastname) + binarizer;
    auto dataset = {data1, data2, data3, data4};
    for (const auto& data: dataset) {
        pipe->step(data);
    }
    pipe->finalize();
    for (const auto& data: dataset) {
        std::vector<double> out = pipe->transform(data);
        for (const auto& item: out) {
            std::cout<<item<<" ";
        }
        std::cout<<std::endl;
    }
    // Output will be:
    // 1 0 0 0
    // 0 1 0 0
    // 0 0 1 0
    // 0 0 0 1
    return 0;
}