Arms deals classification

This is Fredrik Hernqvist's thesis project.

Annotations

I'm annotating about 800 texts. The unprocessed text files are stored in raw_data. As each text is annotated, it gets a matching JSON file in annotated_data. These JSON files are all compiled into data.json, which is the file used for training.

To start the editor, run python3 editor.py. Interface elements may be scaled incorrectly depending on your screen resolution. data.json is updated automatically when you save.
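The editor already rebuilds data.json on save, so the following is only a rough sketch of what such a compilation step could look like if it had to be done by hand. It assumes that data.json is a plain JSON list holding the contents of each file in annotated_data; the actual layout is not described here, and compile_annotations is a hypothetical helper, not a function from this repository.

```python
import json
import tempfile
from pathlib import Path


def compile_annotations(annotated_dir, output_path):
    """Merge every *.json file in annotated_dir into one JSON list.

    Hypothetical helper: assumes the compiled file is simply a list
    of the per-text annotation objects, in filename order.
    """
    records = []
    for path in sorted(Path(annotated_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)


# Demo on a throwaway directory with two made-up annotation files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "annotated_data"
    src.mkdir()
    (src / "a.json").write_text(json.dumps({"text": "example one"}), encoding="utf-8")
    (src / "b.json").write_text(json.dumps({"text": "example two"}), encoding="utf-8")
    out = Path(tmp) / "data.json"
    count = compile_annotations(src, out)
    compiled = json.loads(out.read_text(encoding="utf-8"))
```

The demo writes to a temporary directory so that it never touches the real annotated_data or data.json.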

Training

Training is done with training_lightning.py. Its arguments are mostly self-explanatory and can be listed by running python3 training_lightning.py --help.

To run a basic example, run

python3 training_lightning.py data.json --max_epochs 5 --small_data --print

This will train the BERT model for 5 epochs and print the trained model's classifications. The --small_data argument crops the entire dataset to 20 texts. To view the metrics, which can also be followed live during training, run tensorboard --logdir lightning/lightning_logs/ and open http://localhost:6006/ in the browser.

To use ALBERT instead of BERT, set the --classifier argument, for example --classifier albert.

Sequence classification

With --task sequence, the model predicts whether the entire text contains an arms deal. It outputs either 0 or 1 for each text.

Token classification

With --task token, the model predicts whether each token is a certain attribute of the arms deal. For each text, it outputs an n × 8 matrix, where n is the number of tokens. The value at index (i, j) in this matrix is 1 if token i is an instance of attribute j, and 0 otherwise. The attributes are Weapon, Buyer, Buyer Country, Seller, Seller Country, Quantity, Price, Date, in that order.
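To make the matrix layout concrete, here is a small sketch that builds such an n × 8 matrix for a made-up sentence. The sentence, its labels, the label_matrix helper and the zero-based indexing are all assumptions for illustration. The real model works on its tokenizer's tokens rather than whole words, and the annotation format used by this repository may differ.

```python
# Attribute order as listed in the README.
ATTRIBUTES = [
    "Weapon", "Buyer", "Buyer Country", "Seller",
    "Seller Country", "Quantity", "Price", "Date",
]


def label_matrix(num_tokens, labels):
    """Build an n x 8 matrix of 0s and 1s.

    labels maps a token index to the attribute names that apply to it
    (a hypothetical input format chosen for this example).
    """
    matrix = [[0] * len(ATTRIBUTES) for _ in range(num_tokens)]
    for i, names in labels.items():
        for name in names:
            matrix[i][ATTRIBUTES.index(name)] = 1
    return matrix


# Made-up example: one row per token, one column per attribute.
tokens = ["Sweden", "sold", "12", "jets", "to", "Brazil"]
labels = {
    0: ["Seller Country"],
    2: ["Quantity"],
    3: ["Weapon"],
    5: ["Buyer Country"],
}
matrix = label_matrix(len(tokens), labels)

for token, row in zip(tokens, matrix):
    print(f"{token:>8} {row}")
```

Since each entry is set independently, a single token can carry more than one attribute, and most rows are all zeros.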

Text Extractor

text_extractor.py was used to get the raw text from the PDF files. It reads the PDF file, finds the URL at the bottom, and uses the newspaper module to download the text from its source. It is no longer needed.

Tests

The commands used to run the tests in the report are listed below, together with their results.

Sequence classification with exactly 400 character chunks of text

Command: python3 training_lightning.py --gpu --classifier albert --task sequence --max_epochs 100 --train_portion 0.8 --lr 0.00001 --batch_size 4 --max_tokens 128 --split fixed --test data.json

Result:

{'TestAccuracy': 0.7857142686843872,
 'TestF1': 0.8167938590049744,
 'TestLoss': 0.5804754495620728,
 'TestPrecision': 0.8045112490653992,
 'TestRecall': 0.8294573426246643}

Token classification with exactly 400 character chunks of text

Command: python3 training_lightning.py --gpu --classifier albert --task token --max_epochs 200 --train_portion 0.8 --lr 0.00001 --batch_size 8 --max_tokens 128 --split fixed --test data.json

Result:

{'TestAccuracy': 0.9911141395568848,
 'TestF1': 0.6636696457862854,
 'TestLoss': 0.06833155453205109,
 'TestPrecision': 0.7458832263946533,
 'TestRecall': 0.5977804660797119}

Sequence classification with variable chunks of text

Command: python3 training_lightning.py --gpu --classifier albert --task sequence --max_epochs 100 --train_portion 0.8 --lr 0.00001 --batch_size 4 --max_tokens 128 --split chunks --test data.json

Result:

{'TestAccuracy': 0.8706896305084229,
 'TestF1': 0.8110831379890442,
 'TestLoss': 0.44118180871009827,
 'TestPrecision': 0.8429319262504578,
 'TestRecall': 0.7815533876419067}

Token classification with variable chunks of text

Command: python3 training_lightning.py --gpu --classifier albert --task token --max_epochs 200 --train_portion 0.8 --lr 0.00001 --batch_size 8 --max_tokens 128 --split chunks --test data.json

Result:

{'TestAccuracy': 0.9966796636581421,
 'TestF1': 0.7691947221755981,
 'TestLoss': 0.018630821257829666,
 'TestPrecision': 0.8495346307754517,
 'TestRecall': 0.7027373909950256}
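As a quick sanity check on the four results, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported TestPrecision and TestRecall values and prints it next to the reported TestF1. The f1_score helper is written for this example, and the check assumes the training script derives F1 from the same precision and recall it reports, which is not confirmed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (TestPrecision, TestRecall, TestF1) copied from the four results above.
reported = {
    "sequence, 400-character chunks": (
        0.8045112490653992, 0.8294573426246643, 0.8167938590049744),
    "token, 400-character chunks": (
        0.7458832263946533, 0.5977804660797119, 0.6636696457862854),
    "sequence, variable chunks": (
        0.8429319262504578, 0.7815533876419067, 0.8110831379890442),
    "token, variable chunks": (
        0.8495346307754517, 0.7027373909950256, 0.7691947221755981),
}

for name, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.4f}, reported {f1:.4f}")
```

For the token task, F1 is a more informative number than accuracy: accuracy counts every matrix entry, including the zeros, while F1 only rewards correctly predicted attribute entries.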
