This is an implementation of the TF-IDF algorithm in Scala and Spark.
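In the standard formulation (assumed here; variants exist), a word t in document d scores tfidf(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is the frequency of t in d, N is the total number of documents, and df(t) is the number of documents containing t.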
Input - a folder with text files. Each row of each file contains a document id and the document's text, separated by "\t".
Output - a folder with text files. Each row of each file contains a word and a list of the 20 document id / TF-IDF score pairs that are most relevant for that word.
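For illustration only (hypothetical data; the exact pair delimiters depend on the implementation), an input row could look like

42\tthe quick brown fox jumps over the lazy dog

and an output row like

fox\t42:0.693 17:0.231 ...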
The project is focused on efficiency and scalability, so text preprocessing is kept very simple: punctuation and digits are removed, and all text is coerced to lower case. At the same time, the design leaves room for further development.
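A minimal sketch of this normalization (an assumption about the approach, not necessarily the project's exact code):

def tokenize(text: String): Array[String] =
  text.toLowerCase                  // coerce to lower case
    .replaceAll("[^a-z\\s]", " ")   // strip punctuation and digits
    .split("\\s+")
    .filter(_.nonEmpty)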
The project already has several optimizations:
- The number of shuffles is reduced to one (see the sketch after this list).
- A BufferTopHolder is introduced for efficient aggregation of the inverted index.
- All distributed variables are created in a memory-saving way.
- Kryo serialization is used instead of Java serialization.
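A condensed sketch of how the single shuffle, the top-k aggregation, and Kryo fit together. TfIdfSketch and TopK are hypothetical names; TopK only stands in for the idea behind BufferTopHolder, not its actual API, and input/output folders are taken from args(0) and args(1):

import org.apache.spark.{SparkConf, SparkContext}

object TfIdfSketch {

  // Stand-in for BufferTopHolder (hypothetical API): a bounded buffer keeping
  // at most k postings per word plus the word's document frequency. Within a
  // word, ranking by tf equals ranking by tf-idf, since idf is a per-word
  // constant, so low-tf postings can be dropped already during aggregation.
  final class TopK(k: Int) extends Serializable {
    var df: Long = 0L                         // documents containing the word
    var items: List[(String, Double)] = Nil   // (docId, tf), at most k entries
    def add(p: (String, Double)): TopK = { df += 1; items = (p :: items).sortBy(-_._2).take(k); this }
    def merge(o: TopK): TopK = { df += o.df; items = (items ++ o.items).sortBy(-_._2).take(k); this }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tf-idf-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo, not Java
      .registerKryoClasses(Array(classOf[TopK]))
    val sc = new SparkContext(conf)

    val docs = sc.textFile(args(0)).map { line =>
      val Array(id, text) = line.split("\t", 2)
      (id, text.toLowerCase.replaceAll("[^a-z\\s]", " ").split("\\s+").filter(_.nonEmpty))
    }.cache()
    val n = docs.count().toDouble // corpus size for idf

    // Term frequencies are computed map-side, so only compact
    // (word, (docId, tf)) pairs reach the shuffle.
    val postings = docs.flatMap { case (id, words) =>
      words.groupBy(identity).map { case (w, ws) => (w, (id, ws.length.toDouble / words.length)) }
    }

    // The single shuffle: one aggregateByKey builds the per-word top-20 index.
    val index = postings.aggregateByKey(new TopK(20))(_.add(_), _.merge(_))

    index.map { case (word, top) =>
      val idf = math.log(n / top.df)
      word + "\t" + top.items.map { case (id, tf) => s"$id:${tf * idf}" }.mkString(" ")
    }.saveAsTextFile(args(1))

    sc.stop()
  }
}

Computing n with docs.count() costs one extra pass over the cached input but no shuffle, so the aggregateByKey remains the only shuffle in the job.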
What else can be done:
- Additional tuning after data exploration (cluster resources, partitioning, data skew).
- Changing the input/output format (Avro, Parquet, SequenceFile, MessagePack, gzip) for faster reads from disk.
- Native optimizations of Spark SQL (the Catalyst and Tungsten engines).
- TF-IDF in Spark ML or Spark MLlib (sketched below).
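Regarding the last two items, a minimal sketch of the DataFrame-based spark.ml route, which benefits from Catalyst and Tungsten automatically (spark-shell style; paths and column names are illustrative):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("tf-idf-ml").getOrCreate()

// Read the same "id \t text" rows as a two-column DataFrame.
val docs = spark.read.option("sep", "\t").csv("/example/textCorpus").toDF("id", "text")

val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val tf    = new HashingTF().setInputCol("words").setOutputCol("tf").transform(words)
val tfidf = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf).transform(tf)

Note that Tokenizer only lower-cases and splits on whitespace; RegexTokenizer can reproduce the punctuation/digit stripping, and CountVectorizer (instead of HashingTF) keeps a real vocabulary if the per-word inverted index must be recovered.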
The project should be compiled with Gradle:
./gradlew assemble
This will produce a fat jar (uber jar) with all required dependencies except Spark and Scala, which must be provided by the cluster.
The job should be launched with spark-submit:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
spark-submit \
--class TfIdfJob \
--deploy-mode cluster \
--master yarn \
tf-idf-job/build/libs/tf-idf-job-0.0.1-SNAPSHOT.jar \
--inputFolder:/example/textCorpus \
--outputFolder:/example/invertedIndex