Spark machine learning inventory

A curated inventory of machine learning methods available on the Apache Spark platform, in both official and third-party libraries.

Table of Contents

Project inventory
Task inventory
Practical info

Project inventory

Machine learning & related libraries

Bundled with Spark

GraphX - Apache Spark's API for graphs and graph-parallel computation
MLlib - Apache Spark's built-in machine learning library

Third-party libraries

Aerosolve - A machine learning package built for humans
AMIDST - probabilistic machine learning
BigDL - Distributed Deep Learning Library for Apache Spark
CoCoA - communication-efficient distributed coordinate ascent
Deeplearning4j - Deeplearning4j on Spark
DissolveStruct - Distributed Solver for Structured Prediction
DistML - DistML provides a supplement to MLlib to support model parallelism on Spark
Elephas - Distributed Deep learning with Keras & Spark
Generalized K-means clustering - generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way
KeystoneML - KeystoneML is a software framework, written in Scala, from the UC Berkeley AMPLab designed to simplify the construction of large scale, end-to-end, machine learning pipelines with Apache Spark
MLbase - a platform for implementing and consuming machine learning at scale
ml-matrix - distributed matrix library
revrand - A library of scalable Bayesian generalised linear models with fancy features
spark-ts - Time series for Spark
Sparkling Water - H2O + Apache Spark
Splash - a general framework for parallelizing stochastic learning algorithms on multi-node clusters
Spectral LDA on Spark - implements a spectral (third-order tensor decomposition) method for learning LDA topic models on Spark
StreamDM - Data Mining for Spark Streaming
Thunder - scalable image and time series analysis
Zen - aims to provide the largest scale and the most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN

Interfaces

CaffeOnSpark - CaffeOnSpark brings deep learning to Hadoop and Spark clusters
Elephas - Distributed Deep learning with Keras & Spark
Spark CoreNLP - CoreNLP wrapper for Spark
Spark Highcharts - Support Highcharts in Apache Zeppelin
Sparkling Water - H2O + Apache Spark
sparklyr - sparklyr provides R bindings to Spark’s distributed machine learning library
Sparkit-learn - PySpark + Scikit-learn = Sparkit-learn
Spark-TFOCS - port of TFOCS: Templates for First-Order Conic Solvers (cvxr.com/tfocs)
Hivemall-Spark - A Hivemall wrapper for Spark
Spark PMML exporter validator - Using JPMML Evaluator to validate the PMML models exported from Spark
TensorFrames - TensorFlow wrapper for DataFrames on Apache Spark

Notebooks

Apache Zeppelin - A web-based notebook that enables interactive data analytics
Beaker - The data scientist's laboratory
Spark Notebook - Interactive and Reactive Data Science using Scala and Spark
sparknotebook - running Apache Spark using Scala in an IPython notebook

Visualization

Plotly - Spark Dataframes with Plotly
Spark Highcharts - Support Highcharts in Apache Zeppelin
Spark ML streaming - Visualize streaming machine learning in Spark
Vegas - The missing Matplotlib for Scala + Spark

Others

Apache Toree - Gateway to Apache Spark
Distributed DataFrame - Simplify Analytics on Disparate Data Sources via a Uniform API Across Engines
Apache Metron - real-time Big Data security
PipelineIO - Extend ML Pipelines to Serve Production Users
Spark Jobserver - REST job server for Apache Spark
Spark PMML exporter validator - Using JPMML Evaluator to validate the PMML models exported from Spark
Spark-Ucores - Spark for Unconventional Cores
Twitter stream ML - Machine learning over Twitter's stream, using Apache Spark, a web server and a Lightning graph server
Velox - a system for serving machine learning predictions

Task inventory

MLlib - Apache Spark's built-in machine learning library

Ensemble learning & parallel modelling

Libraries

DistML - DistML provides a supplement to MLlib to support model parallelism on Spark
Elephas - Distributed Deep learning with Keras & Spark
spark-FM-parallelISGD - Implementation of Factorization Machines on Spark using parallel stochastic gradient descent
SparkBoost - A distributed implementation of AdaBoost.MH and MP-Boost using Apache Spark
StreamDM - Data Mining for Spark Streaming

Algorithms

AdaBoost: SparkBoost
Bagging: StreamDM

Classification

Libraries

MLlib - Apache Spark's built-in machine learning library
DissolveStruct - Distributed Solver for Structured Prediction
Spark kNN graphs - Spark algorithms for building k-nn graphs
Spark-libFM - implementation of Factorization Machines
Sparkling Ferns - Implementation of Random Ferns for Apache Spark
StreamDM - Data Mining for Spark Streaming

Algorithms

Decision Tree: MLlib
Factorization Machines: spark-FM-parallelISGD, Spark-libFM
Hoeffding Decision Trees: StreamDM
Gradient-boosted trees: MLlib
Linear Discriminant Analysis (LDA):
Logistic Regression: MLlib, StreamDM
Multilayer Perceptron: MLlib
Naive Bayes: MLlib, StreamDM
Perceptron: StreamDM
Random Forest: MLlib
Support Vector Machine (SVM): MLlib, StreamDM

Clustering

Libraries

MLlib - Apache Spark's built-in machine learning library
Bisecting K-means - implementation of bisecting k-means clustering, a kind of hierarchical clustering algorithm
Generalized K-means clustering - generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way
Patchwork - Highly Scalable Grid-Density Clustering Algorithm for Spark MLlib
spark-tsne - Distributed t-SNE via Apache Spark
StreamDM - Data Mining for Spark Streaming

Algorithms

CluStream: StreamDM
Bisecting K-means: MLlib, Bisecting K-means
Gaussian Mixture Model (GMM): MLlib
Hierarchical clustering: MLlib, Bisecting K-means
K-Means: MLlib, Bisecting K-means, Generalized K-means clustering
Latent Dirichlet Allocation (LDA): MLlib
Power Iteration Clustering (PIC): MLlib
StreamKM++: StreamDM
t-SNE: spark-tsne

Data Transformation, Feature Selection & Dimensionality Reduction

Libraries

MLlib - Apache Spark's built-in machine learning library
Modelmatrix - Sparse feature extraction with Spark
Spark Infotheoretic Feature Selection - generic implementation of greedy Information Theoretic Feature Selection (FS) methods
Spark MDLP discretization - implementation of Fayyad's discretizer based on the Minimum Description Length Principle (MDLP)
spark-tsne - Distributed t-SNE via Apache Spark

Algorithms

Chi-Squared feature selection: MLlib
Information theoretic: Spark Infotheoretic Feature Selection
PCA: MLlib
MDLP discretization: Spark MDLP discretization
TF-IDF: MLlib
t-SNE: spark-tsne
Word2Vec: MLlib

Deep Learning

Libraries

BigDL - Distributed Deep Learning Library for Apache Spark
CaffeOnSpark - CaffeOnSpark brings deep learning to Hadoop and Spark clusters
Deeplearning4j - Deeplearning4j on Spark
DeepSpark - A neural network library which uses Spark RDD instances
Elephas - Distributed Deep learning with Keras & Spark
Sparkling Water - H2O + Apache Spark
TensorFrames - TensorFlow wrapper for DataFrames on Apache Spark

Graph computations

Libraries

GraphX - Apache Spark's API for graphs and graph-parallel computation
Spark kNN graphs - Spark algorithms for building k-nn graphs
SparklingGraph - large scale, distributed graph processing made easy

Itemset mining, frequent pattern mining & association rules

FP-Growth: MLlib
PrefixSpan: MLlib

Linear algebra

Libraries

lazy-linalg - A package full of linear algebra operators for Apache Spark MLlib's linalg package
ml-matrix - distributed matrix library

Algorithms

Singular Value Decomposition (SVD): MLlib
Principal Component Analysis (PCA): MLlib

Matrix factorization & recommender systems

Libraries

MLlib - Apache Spark's built-in machine learning library
spark-FM-parallelISGD - Implementation of Factorization Machines on Spark using parallel stochastic gradient descent
Spark-libFM - implementation of Factorization Machines
Streaming Matrix Factorization - Distributed Streaming Matrix Factorization implemented on Spark for Recommendation Systems

Algorithms

Collaborative filtering: MLlib
Factorization Machines: spark-FM-parallelISGD, Spark-libFM
Matrix factorization: Streaming Matrix Factorization

Natural language processing

Libraries

Spark CoreNLP - CoreNLP wrapper for Spark
Spectral LDA on Spark - implements a spectral (third-order tensor decomposition) method for learning LDA topic models on Spark
TopicModelling - Topic Modeling on Apache Spark

Algorithms

Coreference resolution: Spark CoreNLP
Latent Dirichlet Allocation (LDA): MLlib
Named Entity Recognition (NER): Spark CoreNLP
Open information extraction: Spark CoreNLP
Part-of-speech (POS) tagging: Spark CoreNLP
Sentiment analysis: Spark CoreNLP
Topic Modelling: Spectral LDA on Spark, TopicModelling

Optimization & hyperparameter search

Libraries

MLlib - Apache Spark's built-in machine learning library
Elephas - Distributed Deep learning with Keras & Spark
Spark-TFOCS - port of TFOCS: Templates for First-Order Conic Solvers (cvxr.com/tfocs)

Algorithms

Alternating Least Squares (ALS): MLlib
First-Order Conic solvers: Spark-TFOCS
Gradient descent: MLlib
Grid Search: MLlib
Iteratively Reweighted Least Squares (IRLS): MLlib
Limited-memory BFGS (L-BFGS): MLlib
Normal equation solver: MLlib
Stochastic gradient descent (SGD): MLlib
Tree of Parzen estimators (TPE, via hyperopt): Elephas

Regression

Libraries

MLlib - Apache Spark's built-in machine learning library
revrand - A library of scalable Bayesian generalised linear models with fancy features
StreamDM - Data Mining for Spark Streaming

Algorithms

Bayesian generalised linear models: revrand
Decision tree regression: MLlib
Generalized linear regression: MLlib
Gradient-boosted tree regression: MLlib
Isotonic regression: MLlib
Linear regression: MLlib, StreamDM
Linear least squares: MLlib
Random forest regression: MLlib
Ridge regression: MLlib
Survival regression: MLlib
Support Vector Machine (SVM): MLlib

Statistics

Hypothesis testing: MLlib
Kernel density estimation: MLlib

Tensor decompositions

Libraries

Spectral LDA on Spark - implements a spectral (third-order tensor decomposition) method for learning LDA topic models on Spark

Algorithms

Spectral LDA: Spectral LDA on Spark

Time series

Libraries

spark-ts - Time series for Spark
Thunder - scalable image and time series analysis

Algorithms

Practical info

License

Contributing

Please read the Contribution Guidelines before submitting your suggestion.

To add content, feel free to open an issue or create a pull request.

Acknowledgments

This inventory is inspired by mfornos’ inventory of awesome microservices.

Table of contents generated with DocToc.
