This is a reproducer for a probable Spark bug.
To run: `sbt clean -mem 1500 run`
The program, in a loop:
- Trains on the data files found in the `data/` folder. The data set is a subset of the Amazon Electronics dataset, which can be found at: http://jmcauley.ucsd.edu/data/amazon/
- It simply calculates the TF-IDF of the terms via the pipeline (Tokenizer -> StopWordsRemover -> stemming -> TF -> IDF); a rough sketch of this loop is shown after this list.
- Once training is complete, it unpersists the cached data and starts the operation again.
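For context, the training loop looks roughly like the following. This is a minimal sketch, not the exact code in this repo; the column names, the iteration count, and the omission of the stemming step (Spark ML ships no built-in stemmer) are assumptions:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession

object Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tfidf-oome-repro")
      .getOrCreate()

    for (i <- 1 to 20) {
      // The leak only reproduces when each document sits in its own
      // file under data/ (see the note on merged files below).
      val docs = spark.read.textFile("data/*").toDF("text")

      val tokens = new Tokenizer()
        .setInputCol("text").setOutputCol("words").transform(docs)
      val filtered = new StopWordsRemover()
        .setInputCol("words").setOutputCol("filtered").transform(tokens)
      // (The actual program also applies a stemming step here, omitted in this sketch.)
      val tf = new HashingTF()
        .setInputCol("filtered").setOutputCol("rawFeatures").transform(filtered)
      tf.cache()

      val idfModel = new IDF()
        .setInputCol("rawFeatures").setOutputCol("features").fit(tf)
      idfModel.transform(tf).count() // force evaluation

      tf.unpersist() // unpersist and start the operation again
      println(s"iteration $i done")
    }
    spark.stop()
  }
}
```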
After about 10 iterations (~20 minutes), the program ends with an OutOfMemoryError (OOME).
The issue does not occur if all the input files are merged into a single file; it can only be reproduced when each document is in a separate file.
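For anyone re-running the reproduction, a heap dump on OOME can be captured by forking the run and adding the standard HotSpot heap-dump flags. This is a sketch only; these settings are assumptions, not the repo's actual build.sbt:

```scala
// build.sbt (sketch): fork the run so the app JVM, not sbt's, gets the flags.
// With fork enabled, `-mem 1500` no longer sizes the app heap, so -Xmx is set here.
fork in run := true
javaOptions in run ++= Seq(
  "-Xmx1500m",
  "-XX:+HeapDumpOnOutOfMemoryError", // write an .hprof file when the OOME hits
  "-XX:HeapDumpPath=/tmp"            // hypothetical dump location
)
```

If the run is not forked (sbt's default), the same `-XX` flags would instead have to go on the sbt JVM itself, e.g. via the `SBT_OPTS` environment variable.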
The biggest objects in the heap dump:
Environment:
- Spark 2.3.1
- JDK 8
- SBT 1.1.6
- Ubuntu 18.04
- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz