Text classification using PyTorch

This repository contains easy-to-understand PyTorch code for text classification using Yoon Kim's CNN model.

[Figure: CNN model]
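As a rough illustration of the architecture, here is a minimal sketch of a Kim-style CNN classifier. It is not the code from this repository: the class name `KimCNNSketch`, the vocabulary size, the embedding dimension, and the number of classes are all hypothetical. The filter widths (3, 4, 5), 100 filters per width, and dropout of 0.5 follow the defaults reported in Kim's paper. It uses a single trainable embedding table and does not model the fixed-embedding option.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KimCNNSketch(nn.Module):
    """Illustrative Kim-style CNN; names and sizes are assumptions."""

    def __init__(self, vocab_size, embed_dim, num_classes,
                 kernel_sizes=(3, 4, 5), num_filters=100, drop_prob=0.5):
        super().__init__()
        # Index 0 is reserved for padding (an assumption of this sketch).
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One 1-D convolution per filter width, sliding over the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> x: (batch, embed_dim, seq_len)
        x = self.embed(token_ids).transpose(1, 2)
        # Max-over-time pooling keeps the strongest response of each filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))


# Hypothetical sizes, chosen only to exercise the forward pass.
model = KimCNNSketch(vocab_size=1000, embed_dim=32, num_classes=16)
batch = torch.randint(1, 1000, (8, 20))  # 8 sentences of 20 token ids each
logits = model(batch)
print(logits.shape)
```

Each sentence must be at least as long as the widest filter (5 tokens here), so shorter inputs would need padding.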

Data preparation

Input data must be in three files:

  • topicclass_train.txt
  • topicclass_valid.txt
  • topicclass_test.txt

Each file must contain one example per line, in the following format:

<label> ||| <sentence>

For instance:

Social sciences and society ||| Several of these rights regulate pre @-@ trial procedure : access to a non @-@ excessive bail , the right to indictment by a grand jury , the right to an information ( charging document ) , the right to a speedy trial , and the right to be tried in a specific venue .

We assume that the data is already tokenized, so each sentence is split into tokens with Python's split function. This repository was tested on this dataset.
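The format above can be parsed with a few lines of standard-library Python. The following is a minimal sketch, not the repository's own loader; the helper names `parse_line` and `read_examples` are hypothetical.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one '<label> ||| <sentence>' line into (label, tokens)."""
    label, sep, sentence = line.partition("|||")
    if not sep:
        raise ValueError(f"missing '|||' separator: {line!r}")
    # The data is assumed to be pre-tokenized, so whitespace splitting suffices.
    return label.strip(), sentence.split()


def read_examples(path: str) -> List[Tuple[str, List[str]]]:
    """Read every non-empty line of a file such as topicclass_train.txt."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]


example = ("Social sciences and society ||| Several of these rights "
           "regulate pre @-@ trial procedure .")
label, tokens = parse_line(example)
print(label)
print(tokens[:4])
```

Splitting on the first `|||` only means a sentence that happens to contain the separator is still kept whole.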

EDA

Some basic exploratory data analysis (EDA) is provided in this notebook.

[Figures: class distribution; sentence length distribution]
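The two statistics plotted here, class distribution and sentence length distribution, can be computed directly from lines in the `<label> ||| <sentence>` format. This is a minimal sketch rather than the notebook's code; the function name `summarize` and the sample lines are made up for illustration.

```python
from collections import Counter
from statistics import mean


def summarize(lines):
    """Return (label counts, token counts per sentence) for formatted lines."""
    label_counts = Counter()
    lengths = []
    for line in lines:
        label, sep, sentence = line.partition("|||")
        if not sep:
            continue  # skip lines without the separator
        label_counts[label.strip()] += 1
        lengths.append(len(sentence.split()))
    return label_counts, lengths


# Made-up sample lines, not taken from the dataset.
sample = [
    "Sports ||| the team won the final match",
    "Sports ||| a record crowd attended",
    "History ||| the treaty was signed in the spring",
]
label_counts, lengths = summarize(sample)
print(label_counts.most_common())
print(min(lengths), max(lengths), mean(lengths))
```

On the real data, `lines` would be the contents of a file such as topicclass_train.txt, and the counts could be passed to any plotting library.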

Config file

Once the data is in the correct format, fill in the entries in the config file. A template is provided in the repo.

Train model

python run.py -model kim_cnn -lr 0.001 -drop_prob 0.5 -batch_size 4096 -cuda -use_trainable_embed -use_fixed_embed -gpu 0 -epochs 10

Expected accuracy is 85.5% on this dataset.

Requirements

Install the dependencies listed in the requirements file provided in the repo.