Based on sentdex's tutorial at https://www.youtube.com/watch?v=6rDWwL6irG0
Contains code for DNN, LSTM (RNN) and CNN.
NOTES:
- LSTM (RNN) code is buggy.
- CNN only tested with MNIST dataset.
Required dependencies:
- Anaconda Python distribution, which provides:
  - numpy
  - scipy
  - pandas
  - matplotlib
- NLTK
- tensorflow (ideally with GPU support)
- tqdm (`pip install tqdm`)
Make sure you have the following folder structure (create folders if needed):
<project_dir>
|
|--- preprocessing
|
|--- large_data
| |
| |--- data
| |--- saved
| |--- temp
|
|--- small_data
|
|--- data
|--- saved
Only the small dataset is included in the repo.
Download the large dataset (see the bottom of this README for the link) and place it at
`preprocessing/large_data/data/training.1600000.processed.noemoticon.csv`
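The large CSV follows the Sentiment140 layout: six unnamed columns (polarity, id, date, query, user, text) with no header row. A hedged loading sketch with pandas (the column names below are the conventional Sentiment140 ones, not something defined by this repo; the fake in-memory rows stand in for the real file, which additionally usually needs `encoding="latin-1"`):

```python
import pandas as pd
from io import StringIO

# Two fake rows in the Sentiment140 layout (the real file has no header row).
fake_csv = StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","user1","sad tweet"\n'
    '"4","2","Mon Apr 06","NO_QUERY","user2","happy tweet"\n'
)

# Illustrative column names; pass encoding="latin-1" for the real CSV.
cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv(fake_csv, names=cols)
```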
`python3 preprocessing/create_sentiment_feature_sets_small.py`
This takes the data in `preprocessing/small_data/data` as input and saves its output to `preprocessing/small_data/saved`.
`python3 preprocessing/create_sentiment_feature_sets_large.py`
This takes the data in `preprocessing/large_data/data` as input and saves its output to `preprocessing/large_data/saved`. Temporary files are stored in `preprocessing/large_data/temp`.
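The exact contents of these scripts aren't reproduced here, but feature-set creation in this tutorial follows the usual bag-of-words recipe: build a lexicon from the corpus, then map each sample to a vector of word counts. A minimal sketch of that idea (function names and parameters are illustrative, not the scripts' actual API; the real scripts use NLTK's tokenizer and `WordNetLemmatizer`, while this sketch uses a plain lower-cased split to stay dependency-free):

```python
from collections import Counter

def build_lexicon(samples, min_count=1, max_count=1000):
    # Count every token in the corpus and keep the words whose frequency
    # falls in [min_count, max_count], dropping ultra-rare/ultra-common words.
    counts = Counter(w for s in samples for w in s.lower().split())
    return sorted(w for w, c in counts.items() if min_count <= c <= max_count)

def to_feature_vector(sample, lexicon):
    # Bag of words: one count per lexicon entry, in lexicon order.
    words = sample.lower().split()
    return [words.count(w) for w in lexicon]

samples = ["good movie", "bad movie", "good good plot"]
lexicon = build_lexicon(samples)
vec = to_feature_vector("good good movie", lexicon)
```

Each row of the resulting feature matrix is mostly zeros, which is why the sparse-matrix storage described below pays off.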
`python3 tf_nn.py`
Use the following options in the `__main__` function to specify behaviour:
SMALL_DATA = False
USE_MNIST = False
# TODO: LSTM (RNN) model is buggy (chunking part during training)
network_model = MODELS[0] # (0 - DNN, 1 - LSTM (RNN), 2 - CNN)
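`MODELS` is presumably a list of model-building callables, so indexing it selects which architecture gets trained. A hypothetical sketch of that dispatch pattern (the stub builders below are stand-ins, not the real constructors in `tf_nn.py`):

```python
# Stand-in builders; in tf_nn.py these construct the actual TensorFlow graphs.
def build_dnn():
    return "dnn-graph"

def build_lstm():
    return "lstm-graph"

def build_cnn():
    return "cnn-graph"

MODELS = [build_dnn, build_lstm, build_cnn]  # 0 - DNN, 1 - LSTM (RNN), 2 - CNN

SMALL_DATA = False
USE_MNIST = False
network_model = MODELS[0]  # pick the architecture by index

graph = network_model()
```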
The original tutorial stores large sparse matrices in regular Python lists and serializes them either with the `pickle` module or as CSV files, which produces very large files: in one case, a generated CSV file is 19.6 GB. By converting these lists to scipy sparse matrices and serializing them as zipped numpy arrays, the size is reduced dramatically; in that same case, from 19.6 GB to 26.7 MB. Using sparse matrices also means all of the data fits in RAM!
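The saving comes from storing only the nonzero entries and compressing the result. A small round-trip sketch of the technique using `scipy.sparse.save_npz` (the array contents and file name are made up for illustration):

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# A mostly-zero bag-of-words matrix stored as a dense list of lists.
dense_rows = [[0, 0, 3, 0], [1, 0, 0, 0], [0, 0, 0, 0]]

# CSR format keeps only the nonzero entries plus index arrays.
mat = sparse.csr_matrix(np.array(dense_rows))

# save_npz writes a compressed .npz archive (zipped numpy arrays).
path = os.path.join(tempfile.mkdtemp(), "features.npz")
sparse.save_npz(path, mat)
loaded = sparse.load_npz(path)
```

For real bag-of-words matrices, where well over 99% of entries are zero, the on-disk saving is in the same spirit as the 19.6 GB to 26.7 MB reduction mentioned above.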
Added progress bars (using `tqdm`) for:
- generating the bag of words (word vectors) for the large dataset
- training epochs
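Adding such a bar is a one-line change: `tqdm` wraps any iterable and renders progress on stderr as the loop advances. A minimal usage sketch (the loop body is a placeholder for real per-sample or per-epoch work):

```python
from tqdm import tqdm

total = 0
# Wrapping the iterable is all that is needed; `desc` labels the bar.
for i in tqdm(range(1000), desc="epochs"):
    total += i  # placeholder for the actual training work
```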
Download links:
- small dataset (`pos.txt` and `neg.txt`): https://pythonprogramming.net/static/downloads/machine-learning-data/
- large dataset: http://help.sentiment140.com/for-students