harrybraviner/text_norm_kaggle

Purpose

The purpose of this project is to train a model that converts 'unnormalized' English text to 'normalized' English text, as described in the Text Normalization Challenge. I didn't get this finished before the competition deadline, but I'm still working on the model since it's the most involved net I've written.

Usage

Download the training data for the Text Normalization Challenge and place the file en_train.csv into the data subdirectory.

Run the command

./norm_net.py

to train a model. The flag --mini_dataset can be added to use only the first 1000 entries in the training file. This is useful when making changes to the code and checking that it still runs (without having to wait while the very large training dataset is processed).
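
For reference, a flag like this would typically be wired up with argparse. The snippet below is only an illustrative sketch and need not match how norm_net.py actually parses its arguments.

import argparse

# Hypothetical sketch of the --mini_dataset flag; norm_net.py may do this differently.
parser = argparse.ArgumentParser(description='Train the text-normalization net.')
parser.add_argument('--mini_dataset', action='store_true',
                    help='Use only the first 1000 entries of en_train.csv.')
args = parser.parse_args()

n_entries = 1000 if args.mini_dataset else None  # None means "use the whole file"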

My (far from comprehensive) unit tests can be run by the command

python3 -m unittest ./norm_net.py

These mainly check that nothing crashes, plus a few checks that the geometry of the net (tensor shapes) is as expected.

Data cleaning

The training set contains a large number of distinct characters (3080). Many of these are CJK characters, which are output verbatim. Many more are symbols that are translated into Latin characters (e.g. a Greek letter becoming 'alpha').

I don't want to have a large number of outputs in the final layer, but I also don't want to prevent the system from repeating a character from the input. Therefore I used the following scheme.

We define a set of 'vanilla characters': a-z, A-Z, 0-9, the space, and the quotation mark. (These are chosen since they are able to appear in the output without also appearing in the input. The same is true of e-acute, but only when the input is 'Pate'.) There are no more than 8 distinct non-vanilla characters in any one input in the training set. Any character that occurs fewer than 10 times in the training set is a 'rare' character; these are all treated as a single <RARE> token. The remaining non-vanilla characters (those with 10 or more occurrences) each get their own embedding, but are not available directly for output.
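
This bookkeeping might look roughly like the following. It is only an illustrative sketch: the 'before' column name for the unnormalized text and the function name are assumptions, not the actual code in this repo.

from collections import Counter
import csv
import string

# Sketch of the character bookkeeping described above.
# Assumes en_train.csv has a 'before' column holding the unnormalized text.
VANILLA = set(string.ascii_letters + string.digits + ' "')  # a-z, A-Z, 0-9, space, quotation mark
RARE_THRESHOLD = 10

def build_char_inventory(path='data/en_train.csv'):
    counts = Counter()
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            counts.update(row['before'])
    # Characters seen fewer than 10 times collapse onto a single <RARE> token.
    rare = {c for c, n in counts.items() if n < RARE_THRESHOLD}
    # The remaining non-vanilla characters each get their own embedding,
    # but are never emitted directly by the output layer.
    common_non_vanilla = {c for c in counts if c not in VANILLA and c not in rare}
    return rare, common_non_vanilla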

Processing an input will take place as follows:

  • Form the set of distinct non-vanilla characters for the input.
  • Assign distinct, randomized numbers from 0-9 to these characters.
  • Replace any rare characters with the <RARE> token.
  • Convert each character to its embedded representation.
  • Pass the embeddings, and the one-hot encodings of the 0-9 indices, to the network.
  • For the output, one-hot decode the characters appropriately.
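
Put together, the per-input processing might look something like this. Again this is just a sketch: the helper names and the char_to_embedding_ix lookup are hypothetical, not taken from norm_net.py.

import random

RARE_TOKEN = '<RARE>'
MAX_NV_CHARS = 10  # indices 0-9; the training set needs at most 8 per input

def encode_input(text, vanilla, rare, char_to_embedding_ix):
    """Sketch of the per-input processing steps listed above."""
    # Distinct non-vanilla characters appearing in this input, in order of appearance.
    non_vanilla = [c for c in dict.fromkeys(text) if c not in vanilla]
    # Assign distinct, randomized indices 0-9 to those characters.
    nv_index = dict(zip(non_vanilla, random.sample(range(MAX_NV_CHARS), len(non_vanilla))))
    encoded = []
    for c in text:
        # Rare characters are replaced by the single <RARE> token before embedding.
        token = RARE_TOKEN if c in rare else c
        # Dense-embedding index of the (possibly replaced) character, plus the
        # slot of its non-vanilla one-hot index (None for vanilla characters).
        encoded.append((char_to_embedding_ix[token], nv_index.get(c)))
    return encoded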

Network architecture

The network consists of two LSTM nets. The encoder takes the (suitably encoded) input string; its per-step outputs are ignored, since we only care about its final state. That final state is fed to the decoder, a second LSTM net. The input at each step of the decoder is the output character of the previous step (with a special <START> token as the first input), and the decoder is trained to produce the normalized string as output. The parameters are:

  • embedding_size - dimension of the dense embedding of characters
  • Layer sizes - number and sizes of the layers of the recurrent net
  • max_input_size, max_output_size - maximum input and output length that the net can handle
  • cell_type - only tested this as LSTM so far
  • max_nv_chars - the maximum number of distinct non-vanilla characters we can handle

The input will be the dense embedded encoding of the character (or the embedding of <RARE> if the character is not common in the input dataset) concatenated with the one-hot encoding of the non-vanilla index.
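
For concreteness, the wiring described above might be sketched in TensorFlow 1.x (the repo uses dynamic_rnn) roughly as follows. All names and sizes here are illustrative assumptions; they are not the actual members of TranslationNet.

import tensorflow as tf  # assumes TensorFlow 1.x, which provides tf.nn.dynamic_rnn

# Illustrative sizes only.
embedding_size, layer_size, in_vocab, out_vocab, max_nv_chars = 16, 128, 200, 70, 10

unnorm_ix = tf.placeholder(tf.int32, [None, None])                  # x: input character indices
unnorm_nv = tf.placeholder(tf.float32, [None, None, max_nv_chars])  # w: one-hot non-vanilla slots
norm_hint_ix = tf.placeholder(tf.int32, [None, None])               # y: targets, delayed one step

char_emb = tf.get_variable('char_emb', [in_vocab, embedding_size])
enc_in = tf.concat([tf.nn.embedding_lookup(char_emb, unnorm_ix), unnorm_nv], axis=-1)

with tf.variable_scope('encoder'):
    enc_cell = tf.nn.rnn_cell.LSTMCell(layer_size)
    # Per-step outputs are discarded; only the final state (E) is kept.
    _, encoder_state = tf.nn.dynamic_rnn(enc_cell, enc_in, dtype=tf.float32)

with tf.variable_scope('decoder'):
    out_emb = tf.get_variable('out_emb', [out_vocab, embedding_size])
    dec_in = tf.nn.embedding_lookup(out_emb, norm_hint_ix)  # previous output char (teacher forcing)
    dec_cell = tf.nn.rnn_cell.LSTMCell(layer_size)
    dec_out, _ = tf.nn.dynamic_rnn(dec_cell, dec_in, initial_state=encoder_state)
    decoder_logits = tf.layers.dense(dec_out, out_vocab)    # hatted y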

[Diagram: encoder-decoder architecture]

In the diagram above, x are the indices of the input characters (or of the <RARE> or <STOP> tokens), and w are the one-hot encodings of the non-vanilla characters. y are the indices of the output we are training towards (delayed by one timestep, otherwise the decoder would just learn to ignore the state and copy the input!). The hatted ys are the actual outputs of the net, which we are training to reproduce the ys. The variables in this diagram correspond to members of TranslationNet in norm_net.py as follows:

  • x is _unnorm_ix
  • w is _unnorm_nv
  • y is _norm_hint_ix
  • Hatted y is _decoder_logits_out
  • E is _encoder_state_out
  • S isn't explicitly present as a variable; it's created automatically by the call to dynamic_rnn in Encoder.connect

Todo

  • Set up decoder to produce 'sequential output' for use in test and validation.
  • Tensorboard logging of cross entropy.
  • Log random input / output pairs at intervals.
  • Loading the test set. This should probably be part of the TrainingDataset object (since it needs access to the indices of the non-vanilla characters).
