A partial replication of 'A Fast and Accurate Dependency Parser Using Neural Networks' by Danqi Chen and Christopher Manning, along with a few additional experiments.
Converts CoNLL data (train and dev) into features of the parser configuration paired with parser decisions: it takes a dependency tree and, using shift-reduce parsing, determines the oracle parser actions; each action alters the parser configuration, from which the feature set is extracted (a minimal oracle sketch follows the options below).
- `-f` — data files (default: `train.orig.conll dev.orig.conll`)
- `-trans` — transition system (default: `std` for arc-standard; other option: `eager`)
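
A minimal sketch of one unlabeled arc-standard oracle step, assuming hypothetical `stack`/`buffer`/`gold_heads` structures (illustrative only, not the repo's actual code):

```python
def oracle_step(stack, buffer, gold_heads):
    """Choose the next arc-standard transition for the current configuration.

    stack and buffer hold token indices; gold_heads[i] is the gold head of token i.
    Assumes a projective gold tree, as the standard oracle does.
    """
    if len(stack) >= 2:
        s1, s2 = stack[-1], stack[-2]
        # LEFT-ARC: second-top of the stack depends on the top; pops s2.
        if gold_heads[s2] == s1:
            return "LEFT-ARC"
        # RIGHT-ARC: top depends on second-top, and the top has already
        # collected all of its own dependents from the buffer; pops s1.
        if gold_heads[s1] == s2 and all(gold_heads[b] != s1 for b in buffer):
            return "RIGHT-ARC"
    # Otherwise move the next buffer token onto the stack.
    return "SHIFT"
```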
prepare_data.py writes the data into a CSV file named `<WORD_BEFORE_DOT>.converted` (the part of the input filename before the first dot, e.g. `train.orig.conll` becomes `train.converted`), with 49 columns of information based on the following tokens:
[ 's_1', 's_2', 's_3', 'b_1', 'b_2', 'b_3', 'lc_1(s_1)', 'rc_1(s_1)', 'lc_2(s_1)', 'rc_2(s_1)', 'lc_1(s_2)', 'rc_1(s_2)', 'lc_2(s_2)', 'rc_2(s_2)', 'lc_1(lc_1(s_1))', 'rc_1(rc_1(s_1))', 'lc_1(lc_1(s_2))', 'rc_1(rc_1(s_2))' ]
where, given a sentence:
- `s_i` corresponds to the i-th token on its stack,
- `b_i` corresponds to the i-th token on its buffer,
- `lc_i(x)` corresponds to the i-th left child of token `x`,
- `rc_i(x)` corresponds to the i-th right child of token `x`,
- if any such token is empty, a `NULL` token is placed instead.
The 49 columns accordingly consist of: 18 columns titled just like the notation above, containing the selected tokens' words themselves; another 18 titled similarly but prefixed with `pos`, containing the POS tags of those tokens; 12 containing the arc labels of the selected tokens, excluding the first 6 parent tokens (those on the top of the stack and the buffer); and finally 1 column containing the label of the configuration, formatted as `TRANSITION_TYPE(ARC_DEPENDENCY)`.
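
As a rough illustration of how such a row could be assembled (the `config.resolve` accessor and the `.word`/`.pos`/`.label` attributes are hypothetical, not the repo's API):

```python
# Hypothetical sketch: mapping the 18 token slots above to the
# 18 word + 18 POS + 12 arc-label feature columns of one CSV row.
TOKEN_SLOTS = ['s_1', 's_2', 's_3', 'b_1', 'b_2', 'b_3',
               'lc_1(s_1)', 'rc_1(s_1)', 'lc_2(s_1)', 'rc_2(s_1)',
               'lc_1(s_2)', 'rc_1(s_2)', 'lc_2(s_2)', 'rc_2(s_2)',
               'lc_1(lc_1(s_1))', 'rc_1(rc_1(s_1))',
               'lc_1(lc_1(s_2))', 'rc_1(rc_1(s_2))']

def feature_row(config):
    tokens = [config.resolve(slot) for slot in TOKEN_SLOTS]  # None if empty
    words  = [t.word if t else 'NULL' for t in tokens]       # 18 word features
    pos    = [t.pos if t else 'NULL' for t in tokens]        # 18 POS features
    # Arc labels exist only for the 12 child slots; the first 6 slots
    # (the stack/buffer parent tokens) are skipped.
    labels = [t.label if t else 'NULL' for t in tokens[6:]]  # 12 label features
    return words + pos + labels  # 48 features; the transition label is column 49
```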
train.py trains a model on data preprocessed by prepare_data.py and writes a model file `train.model`, including vocabulary data.
- `-t` — training file (default: `train.converted`)
- `-d` — validation (dev) file (default: `dev.converted`)
- `-E` — word embedding dimension (default: `50`)
- `-e` — number of epochs (default: `10`)
- `-u` — number of hidden units (default: `200`)
- `-lr` — learning rate (default: `0.01`)
- `-reg` — regularization amount (default: `1e-5`)
- `-batch` — mini-batch size (default: `256`)
- `-o` — model filepath to be written (default: `train.model`)
- `-emb_w_init` — embedding weights random normal scaling (default: `0.01`)
- `-gpu` — use GPU (default: `True`)
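
For reference, a minimal PyTorch sketch of the architecture from the paper that these flags parameterize (embedding lookup, one hidden layer with the cube activation, and a softmax over transitions); this is an illustration under stated assumptions, not the repo's `train.py`:

```python
import torch
import torch.nn as nn

class ParserMLP(nn.Module):
    """Chen & Manning-style feed-forward parser (illustrative sketch)."""
    def __init__(self, vocab_size, n_features=48, emb_dim=50,
                 hidden_units=200, n_transitions=3):
        super().__init__()
        # emb_dim corresponds to -E, hidden_units to -u.
        # n_transitions is 3 for unlabeled SHIFT/LEFT-ARC/RIGHT-ARC;
        # with labeled arcs it grows to 2 * n_labels + 1.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(n_features * emb_dim, hidden_units)
        self.out = nn.Linear(hidden_units, n_transitions)

    def forward(self, feature_ids):             # (batch, 48) feature ids
        x = self.embed(feature_ids).flatten(1)  # (batch, 48 * emb_dim)
        h = self.hidden(x) ** 3                 # cube activation, as in the paper
        return self.out(h)                      # transition scores (pre-softmax)
```

Training would then pair this with a cross-entropy loss and an optimizer whose learning rate and weight decay follow `-lr` and `-reg` (the paper itself uses AdaGrad with L2 regularization).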
Given a trained model file (and possibly a vocabulary file), reads in CoNLL data and writes CoNLL data where fields 7 and 8 contain the dependency tree info (head index and dependency relation); a sketch of the greedy decoding loop follows the options below.
- `-m` — model filepath (default: `train.model`)
- `-i` — input CoNLL filepath (default: `parse.in`)
- `-o` — output CoNLL filepath (default: `parse.out`)
- `-verbose` — show progress bar (default: `False`)
- `-dropb` — whether to drop blocking elements while transiting (default: `True`)
- `-trans` — transition system (default: `std` for arc-standard; other option: `eager`)
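
At parse time the model is applied greedily; a sketch of that loop, with hypothetical `initial_config` and `best_legal` helpers (not the repo's actual functions):

```python
def parse_sentence(model, sentence):
    """Greedy transition-based decoding (illustrative sketch)."""
    config = initial_config(sentence)    # empty arc set, full buffer
    while not config.is_terminal():
        feats = feature_row(config)      # the 48 features described above
        scores = model.score(feats)      # one score per transition
        action = best_legal(scores, config)  # mask transitions illegal here
        config.apply(action)             # SHIFT / LEFT-ARC / RIGHT-ARC
    return config.arcs                   # (head, dependent, label) triples
```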
Set `EXEC_FILE = train.py` or `EXEC_FILE = train-torch.py`, then:

```
python $EXEC_FILE -t $TRAIN_FILE -d $DEV_FILE -E $EMBEDDING_DIM -e $NUM_EPOCHS -u $HIDDEN_UNITS -lr $LEARNING_RATE -batch $MINI_BATCH_SIZE -o $OUT_MODEL_FILE -emb_w_init $WEIGHTS_INIT
```
- `-m` — model filename (either starting with `pytorch` or without)
- `-i` — test data-set relative filepath
- `-o` — output (inference) desired relative filepath
Set `EXEC_FILE = train.py` or `EXEC_FILE = train-torch.py`, then:

```
python $EXEC_FILE -m train.model -i parse.in -o parse.out
```