The papers below are implemented using Korean corpora.
- Preliminary
pyenv virtualenv 3.7.7 nlp
pyenv activate nlp
pip install -r requirements.txt
- Usage
python build_dataset.py
python build_vocab.py
python train.py # default training parameters
python evaluate.py # default evaluation parameters
- Using the Naver sentiment movie corpus v1.0 (a.k.a. nsmc)
- Configuration
  - conf/model/{type}.json (e.g. type = ["sencnn", "charcnn", ...])
  - conf/dataset/nsmc.json
- Structure
# example: Convolutional_Neural_Networks_for_Sentence_Classification
├── build_dataset.py
├── build_vocab.py
├── conf
│   ├── dataset
│   │   └── nsmc.json
│   └── model
│       └── sencnn.json
├── evaluate.py
├── experiments
│   └── sencnn
│       └── epochs_5_batch_size_256_learning_rate_0.001
├── model
│   ├── data.py
│   ├── __init__.py
│   ├── metric.py
│   ├── net.py
│   ├── ops.py
│   ├── split.py
│   └── utils.py
├── nsmc
│   ├── ratings_test.txt
│   ├── ratings_train.txt
│   ├── test.txt
│   ├── train.txt
│   ├── validation.txt
│   └── vocab.pkl
├── train.py
└── utils.py
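
The layout above implies a naming convention for config files and experiment directories. The following is a minimal sketch of that convention, not the repository's own code: the helper names `config_paths` and `experiment_dir` are hypothetical, and the directory-name pattern is inferred only from the example `epochs_5_batch_size_256_learning_rate_0.001`.

```python
from pathlib import Path


def config_paths(model_type, dataset, root="conf"):
    """Return the model and dataset config paths under the conf/ layout.

    Hypothetical helper: the scripts in this repository may resolve
    their configs differently.
    """
    root = Path(root)
    return root / "model" / f"{model_type}.json", root / "dataset" / f"{dataset}.json"


def experiment_dir(model_type, epochs, batch_size, learning_rate, root="experiments"):
    """Build a run directory named after its hyperparameters (assumed pattern)."""
    run = f"epochs_{epochs}_batch_size_{batch_size}_learning_rate_{learning_rate}"
    return Path(root) / model_type / run


model_conf, dataset_conf = config_paths("sencnn", "nsmc")
print(model_conf.as_posix())    # conf/model/sencnn.json
print(dataset_conf.as_posix())  # conf/dataset/nsmc.json
print(experiment_dir("sencnn", 5, 256, 0.001).as_posix())
# experiments/sencnn/epochs_5_batch_size_256_learning_rate_0.001
```

Encoding the hyperparameters in the directory name keeps runs with different settings from overwriting each other.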
Model \ Accuracy | Train (120,000) | Validation (30,000) | Test (50,000) | Date |
---|---|---|---|---|
SenCNN | 91.95% | 86.54% | 85.84% | 20/05/30 |
CharCNN | 86.29% | 81.69% | 81.38% | 20/05/30 |
ConvRec | 86.23% | 82.93% | 82.43% | 20/05/30 |
VDCNN | 86.59% | 84.29% | 84.10% | 20/05/30 |
SAN | 90.71% | 86.70% | 86.37% | 20/05/30 |
ETRIBERT | 91.12% | 89.24% | 88.98% | 20/05/30 |
SKTBERT | 92.20% | 89.08% | 88.96% | 20/05/30 |
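
The train and validation sizes in the table add up to the 150,000 rows of ratings_train.txt, which suggests a 20% hold-out. Below is a minimal sketch of such a split, assuming a plain shuffled hold-out; `holdout_split` is a hypothetical helper, and the actual build_dataset.py may split differently (for example, stratified by label).

```python
import random


def holdout_split(rows, valid_ratio, seed=0):
    """Shuffle rows and split off a validation set of the given ratio.

    Assumption: a simple random hold-out, not necessarily what
    build_dataset.py does.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_valid = round(len(rows) * valid_ratio)
    return rows[n_valid:], rows[:n_valid]


# Stand-in for the 150,000 rows of ratings_train.txt.
train, validation = holdout_split(range(150_000), valid_ratio=0.2)
print(len(train), len(validation))  # 120000 30000
```

Fixing the seed makes the split reproducible, so validation accuracy stays comparable across models.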
- Convolutional Neural Networks for Sentence Classification (as SenCNN)
- Character-level Convolutional Networks for Text Classification (as CharCNN)
- Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers (as ConvRec)
- Very Deep Convolutional Networks for Text Classification (as VDCNN)
- A Structured Self-attentive Sentence Embedding (as SAN)
- BERT_single_sentence_classification (as ETRIBERT, SKTBERT)
- Using a dataset created from https://github.com/songys/Question_pair
- Configuration
  - conf/model/{type}.json (e.g. type = ["siam", "san", ...])
  - conf/dataset/qpair.json
- Structure
# example: Siamese_recurrent_architectures_for_learning_sentence_similarity
├── build_dataset.py
├── build_vocab.py
├── conf
│   ├── dataset
│   │   └── qpair.json
│   └── model
│       └── siam.json
├── evaluate.py
├── experiments
│   └── siam
│       └── epochs_5_batch_size_64_learning_rate_0.001
├── model
│   ├── data.py
│   ├── __init__.py
│   ├── metric.py
│   ├── net.py
│   ├── ops.py
│   ├── split.py
│   └── utils.py
├── qpair
│   ├── kor_pair_test.csv
│   ├── kor_pair_train.csv
│   ├── test.txt
│   ├── train.txt
│   ├── validation.txt
│   └── vocab.pkl
├── train.py
└── utils.py
Model \ Accuracy | Train (6,136) | Validation (682) | Test (758) | Date |
---|---|---|---|---|
Siam | 93.00% | 83.13% | 83.64% | 20/05/30 |
SAN | 89.47% | 82.11% | 81.53% | 20/05/30 |
Stochastic | 89.26% | 82.69% | 80.07% | 20/05/30 |
ETRIBERT | 95.07% | 94.42% | 94.06% | 20/05/30 |
SKTBERT | 95.43% | 92.52% | 93.93% | 20/05/30 |
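
The accuracies above are the fraction of examples whose predicted label matches the gold label. A minimal sketch of that metric follows; the `accuracy` function is hypothetical (the repository's model/metric.py may differ), and the toy data is invented so that 634 of 758 predictions are correct. That count is not reported in the source; it is only one count consistent with a rounded 83.64%.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: 758 binary labels, 634 predicted correctly.
labels = [1] * 758
predictions = [1] * 634 + [0] * 124
print(f"{accuracy(predictions, labels):.2%}")  # 83.64%
```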
- A Structured Self-attentive Sentence Embedding (as SAN)
- Siamese recurrent architectures for learning sentence similarity (as Siam)
- Stochastic Answer Networks for Natural Language Inference (as Stochastic)
- BERT_pairwise_text_classification (as ETRIBERT, SKTBERT)