- When dataset freshness is critical, annotating high-speed unlabelled data streams becomes essential, yet it remains an open problem.
- We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, like Twitter tweets or online product reviews.
The required Python packages are listed in requirements.txt.
- Flink v1.13
- Python 3.7
- Java 8
- Datasets can be downloaded from https://course.fast.ai/datasets#nlp
- 1.6 million labelled tweets
  - Source: Sentiment140
- 280,000 training and 19,000 test samples in each polarity
  - Source: Yelp Review Polarity
- 1,800,000 training and 200,000 test samples in each polarity
  - Source: Amazon product review polarity
Quick start: try PLStream on the Yelp review dataset
cd PLStream
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv
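
Before starting the stream, it can help to confirm that the download unpacked correctly. The sketch below is only an illustration: it assumes the CSV has two columns, a class index ("1" for negative, "2" for positive) followed by the review text, which is worth checking against the readme shipped in the archive. The helper name `read_polarity_rows` is hypothetical and not part of PLStream.

```python
import csv
import io
from itertools import islice

# Assumed layout (not stated in this README): column 0 is the class index
# ("1" = negative, "2" = positive), column 1 is the review text.
LABEL_NAMES = {"1": "negative", "2": "positive"}


def read_polarity_rows(lines, limit=5):
    """Return (polarity, text) pairs from the first `limit` CSV rows."""
    rows = []
    for row in islice(csv.reader(lines), limit):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        class_index, text = row[0], row[1]
        rows.append((LABEL_NAMES.get(class_index, "unknown"), text))
    return rows


# Tiny in-memory sample standing in for train.csv.
sample = io.StringIO('"1","Slow service, cold food."\n"2","Great tacos!"\n')
for polarity, text in read_polarity_rows(sample):
    print(polarity, "|", text)
```

To inspect the real file, pass an open handle instead of the sample, for example `open("train.csv", newline="", encoding="utf-8")`.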
- Please make sure the environment requirements listed above are installed.
pip install -r requirements.txt
redis-server
python PLStream.py
- Each output record has the form "original text" + "label" + "@@@@".
- The "@@@@" marker ends each record, so calling split("@@@@") on the output recovers the individual labelled records for further processing.
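
A minimal sketch of that post-processing step follows. It assumes the label is the final character of each record (for example 0 or 1); the README does not specify how the label is attached to the text, so adjust the slicing if the real output differs. The function name `parse_labelled_output` and the demo string are made up for illustration.

```python
def parse_labelled_output(raw, separator="@@@@"):
    """Split PLStream-style output into (text, label) pairs."""
    pairs = []
    for record in raw.split(separator):
        record = record.strip()
        if not record:
            continue  # the trailing separator leaves an empty last piece
        # Assumption: the label is the final character of the record.
        text, label = record[:-1].rstrip(), record[-1]
        pairs.append((text, label))
    return pairs


# Made-up sample output, not real PLStream results.
demo = "great food and service 1@@@@never coming back 0@@@@"
for text, label in parse_labelled_output(demo):
    print(label, text)
```

The resulting pairs can then be written back to a CSV file or loaded into a data frame.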
To see the labelling accuracy, run:
python PLStream_acc.py