A curated inventory of machine learning methods available on the Apache Spark platform, in both official and third-party libraries.
Table of Contents
- Project inventory
- Task inventory
- Ensemble learning & parallel modelling
- Classification
- Clustering
- Deep learning
- Feature selection & dimensionality reduction
- Graph computations
- Linear algebra
- Matrix factorization & recommender systems
- Natural language processing
- Optimization & hyperparameter search
- Regression
- Statistics
- Tensor decompositions
- Time series
- Practical info
- GraphX - Apache Spark's API for graphs and graph-parallel computation
- MLlib - Apache Spark's built-in machine learning library
- Aerosolve - A machine learning package built for humans
- AMIDST - probabilistic machine learning
- BigDL - BigDL: Distributed Deep Learning Library for Apache Spark
- CoCoA - communication-efficient distributed coordinate ascent
- Deeplearning4j - Deeplearning4j on Spark
- DissolveStruct - Distributed Solver for Structured Prediction
- DistML - DistML provides a supplement to MLlib to support model parallelism on Spark
- Elephas - Distributed Deep learning with Keras & Spark
- Generalized K-means clustering - generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way
- KeystoneML - KeystoneML is a software framework, written in Scala, from the UC Berkeley AMPLab designed to simplify the construction of large scale, end-to-end, machine learning pipelines with Apache Spark
- MLbase - MLbase is a platform addressing implementing and consuming Machine Learning at scale
- ml-matrix - distributed matrix library
- revrand - A library of scalable Bayesian generalised linear models with fancy features
- spark-ts - Time series for Spark
- Sparkling Water - H2O + Apache Spark
- Splash - a general framework for parallelizing stochastic learning algorithms on multi-node clusters
- Spectral LDA on Spark - implements a spectral (third order tensor decomposition) learning method for learning LDA topic model on Spark
- StreamDM - Data Mining for Spark Streaming
- Thunder - scalable image and time series analysis
- Zen - aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN
- CaffeOnSpark - CaffeOnSpark brings deep learning to Hadoop and Spark clusters
- Elephas - Distributed Deep learning with Keras & Spark
- Spark CoreNLP - CoreNLP wrapper for Spark
- Spark Highcharts - Support Highcharts in Apache Zeppelin
- Sparkling Water - H2O + Apache Spark
- sparklyr - sparklyr provides R bindings to Spark’s distributed machine learning library
- Sparkit-learn - PySpark + Scikit-learn = Sparkit-learn
- Spark-TFOCS - port of TFOCS: Templates for First-Order Conic Solvers (cvxr.com/tfocs)
- Hivemall-Spark - A Hivemall wrapper for Spark
- Spark PMML exporter validator - Using JPMML Evaluator to validate the PMML models exported from Spark
- TensorFrames - TensorFlow wrapper for DataFrames on Apache Spark
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Beaker - The data scientist's laboratory
- Spark Notebook - Interactive and Reactive Data Science using Scala and Spark
- sparknotebook - running Apache Spark using Scala in an IPython notebook
- Plotly - Spark Dataframes with Plotly
- Spark Highcharts - Support Highcharts in Apache Zeppelin
- Spark ML streaming - Visualize streaming machine learning in Spark
- Vegas - The missing MatPlotLib for Scala + Spark
- Apache Toree - Gateway to Apache Spark
- Distributed DataFrame - Simplify Analytics on Disparate Data Sources via a Uniform API Across Engines
- Apache Metron - real-time Big Data security
- PipelineIO - Extend ML Pipelines to Serve Production Users
- Spark Jobserver - REST job server for Apache Spark
- Spark PMML exporter validator - Using JPMML Evaluator to validate the PMML models exported from Spark
- Spark-Ucores - Spark for Unconventional Cores
- Twitter stream ML - Machine Learning over Twitter's stream. Using Apache Spark, Web Server and Lightning Graph server.
- Velox - a system for serving machine learning predictions
- MLlib - Apache Spark's built-in machine learning library
- DistML - DistML provides a supplement to MLlib to support model parallelism on Spark
- Elephas - Distributed Deep learning with Keras & Spark
- spark-FM-parallelISGD - Implementation of Factorization Machines on Spark using parallel stochastic gradient descent
- SparkBoost - A distributed implementation of AdaBoost.MH and MP-Boost using Apache Spark
- StreamDM - Data Mining for Spark Streaming
- AdaBoost: SparkBoost
- Bagging: StreamDM
- MLlib - Apache Spark's built-in machine learning library
- DissolveStruct - Distributed Solver for Structured Prediction
- Spark kNN graphs - Spark algorithms for building k-nn graphs
- Spark-libFM - implementation of Factorization Machines
- Sparkling Ferns - Implementation of Random Ferns for Apache Spark
- StreamDM - Data Mining for Spark Streaming
- Decision Tree: MLlib
- Factorization Machines: spark-FM-parallelISGD, Spark-libFM
- Hoeffding Decision Trees: StreamDM
- Gradient-boosted trees: MLlib
- Linear Discriminant Analysis (LDA):
- Logistic Regression: MLlib, StreamDM
- Multilayer Perceptron: MLlib
- Naive Bayes: MLlib, StreamDM
- Perceptron: StreamDM
- Random Forest: MLlib
- Support Vector Machine (SVM): MLlib, StreamDM
- MLlib - Apache Spark's built-in machine learning library
- Bisecting K-means - implementation of Bisecting KMeans Clustering which is a kind of Hierarchical Clustering algorithm
- Generalized K-means clustering - generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way
- Patchwork - Highly Scalable Grid-Density Clustering Algorithm for Spark MLlib
- spark-tsne - Distributed t-SNE via Apache Spark
- StreamDM - Data Mining for Spark Streaming
- CluStream: StreamDM
- Bisecting K-means: MLlib, Bisecting K-means
- Gaussian Mixture Model (GMM): MLlib
- Hierarchical clustering: MLlib, Bisecting K-means
- K-Means: MLlib, Bisecting K-means, Generalized K-means clustering
- Latent Dirichlet Allocation (LDA): MLlib
- Power Iteration Clustering (PIC): MLlib
- StreamKM++: StreamDM
- t-SNE: spark-tsne
- MLlib - Apache Spark's built-in machine learning library
- Modelmatrix - Sparse feature extraction with Spark
- Spark Infotheoretic Feature Selection - generic implementation of greedy Information Theoretic Feature Selection (FS) methods
- Spark MDLP discretization - implementation of Fayyad's discretizer based on the Minimum Description Length Principle (MDLP)
- spark-tsne - Distributed t-SNE via Apache Spark
- Chi-Squared feature selection: MLlib
- Information theoretic: Spark Infotheoretic Feature Selection
- PCA: MLlib
- MDLP discretization: Spark MDLP discretization
- TF-IDF: MLlib
- t-SNE: spark-tsne
- Word2Vec: MLlib
- BigDL - BigDL: Distributed Deep Learning Library for Apache Spark
- CaffeOnSpark - CaffeOnSpark brings deep learning to Hadoop and Spark clusters
- Deeplearning4j - Deeplearning4j on Spark
- DeepSpark - A neural network library which uses Spark RDD instances
- Elephas - Distributed Deep learning with Keras & Spark
- Sparkling Water - H2O + Apache Spark
- TensorFrames - TensorFlow wrapper for DataFrames on Apache Spark
- GraphX - Apache Spark's API for graphs and graph-parallel computation
- Spark kNN graphs - Spark algorithms for building k-nn graphs
- SparklingGraph - large scale, distributed graph processing made easy
- lazy-linalg - A package full of linear algebra operators for Apache Spark MLlib's linalg package
- ml-matrix - distributed matrix library
- MLlib - Apache Spark's built-in machine learning library
- spark-FM-parallelISGD - Implementation of Factorization Machines on Spark using parallel stochastic gradient descent
- Spark-libFM - implementation of Factorization Machines
- Streaming Matrix Factorization - Distributed Streaming Matrix Factorization implemented on Spark for Recommendation Systems
- Collaborative filtering: MLlib
- Factorization Machines: spark-FM-parallelISGD, Spark-libFM
- Matrix factorization: Streaming Matrix Factorization
- Spark CoreNLP - CoreNLP wrapper for Spark
- Spectral LDA on Spark - implements a spectral (third order tensor decomposition) learning method for learning LDA topic model on Spark
- TopicModelling - Topic Modeling on Apache Spark
- Coreference resolution: Spark CoreNLP
- Latent Dirichlet Allocation (LDA): MLlib
- Named Entity Recognition (NER): Spark CoreNLP
- Open information extraction: Spark CoreNLP
- Part-of-speech (POS) tagging: Spark CoreNLP
- Sentiment analysis: Spark CoreNLP
- Topic Modelling: Spectral LDA on Spark, TopicModelling
- MLlib - Apache Spark's built-in machine learning library
- Elephas - Distributed Deep learning with Keras & Spark
- Spark-TFOCS - port of TFOCS: Templates for First-Order Conic Solvers (cvxr.com/tfocs)
- Alternating Least Squares (ALS): MLlib
- First-Order Conic solvers: Spark-TFOCS
- Gradient descent: MLlib
- Grid Search: MLlib
- Iteratively Reweighted Least Squares (IRLS): MLlib
- Limited-memory BFGS (L-BFGS): MLlib
- Normal equation solver: MLlib
- Stochastic gradient descent (SGD): MLlib
- Tree of Parzen estimators (TPE, via hyperopt): Elephas
- MLlib - Apache Spark's built-in machine learning library
- revrand - A library of scalable Bayesian generalised linear models with fancy features
- StreamDM - Data Mining for Spark Streaming
- Bayesian generalised linear models: revrand
- Decision tree regression: MLlib
- Generalized linear regression: MLlib
- Gradient-boosted tree regression: MLlib
- Isotonic regression: MLlib
- Linear regression: MLlib, StreamDM
- Linear least squares: MLlib
- Random forest regression: MLlib
- Ridge regression: MLlib
- Survival regression: MLlib
- Support Vector Machine (SVM): MLlib
- Spectral LDA on Spark - implements a spectral (third order tensor decomposition) learning method for learning LDA topic model on Spark
- Spectral LDA: Spectral LDA on Spark
Please read the Contribution Guidelines before submitting a suggestion.
To add content, feel free to open an issue or create a pull request.
This inventory is inspired by mfornos’ inventory of awesome microservices.
Table of contents generated with DocToc.