This is a port of char-rnn for running a character-level RNN with distributed TensorFlow, based on the original code from https://github.com/sherjilozair/char-rnn-tensorflow
Step 1: If you want data-parallel training, first make sure your data is sharded. To do that, run the following command:
# Call with -h option for more help
python data_splitter.py --data_dir data/tinyshakespeare --num_parts 2 --out_dir sharded_data
This will create data-<num>.npy files in the out_dir, where num is the partition number. Note that the vocabulary is not partitioned, since it must be shared across all nodes and be the same globally. If vocabulary creation were left to runtime, it would differ for each partition.
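To illustrate why the vocabulary is built once before the data is split, here is a minimal sketch. It is not the code in data_splitter.py: the function names build_vocab and shard_text are hypothetical, and the sketch assumes the whole corpus fits in memory as a single string.

```python
import numpy as np

def build_vocab(text):
    """Map each distinct character to an integer id shared by all shards."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

def shard_text(text, num_parts):
    """Encode the full text with one vocabulary, then split it into parts."""
    vocab = build_vocab(text)
    encoded = np.array([vocab[ch] for ch in text], dtype=np.int32)
    # np.array_split accepts lengths that are not divisible by num_parts.
    return vocab, np.array_split(encoded, num_parts)

vocab, parts = shard_text("hello world", 2)
```

Because every shard is encoded with the same mapping, an id means the same character on every worker. Each part could then be written out with np.save under a data-<num>.npy name.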
Step 2: Launch each node as a separate process. The command for launching any node is:
python train.py --distributed --ps_hosts 127.0.0.1:8000 --worker_hosts 127.0.0.1:9000,127.0.0.1:9001 --job_name $job_name --task_index $task_index --save_dir distrib-train
Alternatively, run launch.bat or launch.sh to quickly launch a distributed experiment with default settings.
The option --job_name takes the value ps or worker, depending on the node's role. Refer to the TF tutorial for more info on these roles.
Similarly, --task_index takes an integer indicating which node it is: the ith worker node takes the value i.
For more options, run python train.py --help. Note that any options you set must be the same across all nodes, except for node-dependent settings such as job_name and task_index.
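The rule that only job_name and task_index vary between nodes can be sketched as a small command generator. This is an illustrative helper, not part of the repository: node_commands is a hypothetical name, and the sketch assumes task indices start at 0 for both ps and worker nodes.

```python
def node_commands(ps_hosts, worker_hosts, save_dir="distrib-train"):
    """Return one train.py command (as an argument list) per node."""
    # Options that must be identical on every node.
    shared = [
        "python", "train.py", "--distributed",
        "--ps_hosts", ",".join(ps_hosts),
        "--worker_hosts", ",".join(worker_hosts),
        "--save_dir", save_dir,
    ]
    commands = []
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index in range(len(hosts)):
            # Only these two options differ from node to node.
            commands.append(shared + ["--job_name", job_name,
                                      "--task_index", str(task_index)])
    return commands

cmds = node_commands(["127.0.0.1:8000"], ["127.0.0.1:9000", "127.0.0.1:9001"])
for cmd in cmds:
    print(" ".join(cmd))
```

With one ps host and two worker hosts this produces three commands, each of which would be run in its own process.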
To train with default parameters on the tinyshakespeare corpus, run python train.py. To access all the parameters, use python train.py --help.
To sample from a checkpointed model, run python sample.py.
Sampling while training is still in progress (to check the last checkpoint) works only on the CPU or on another GPU.
To force CPU mode, use export CUDA_VISIBLE_DEVICES="" and unset CUDA_VISIBLE_DEVICES afterward (resp. set CUDA_VISIBLE_DEVICES="" and set CUDA_VISIBLE_DEVICES= on Windows).
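As an alternative to exporting and unsetting the variable by hand, the same effect can be scoped to a single child process. The sketch below is an assumption-laden illustration: cpu_only_env and sample_on_cpu are hypothetical helpers, and it assumes sample.py is in the current working directory.

```python
import os
import subprocess
import sys

def cpu_only_env(base_env=None):
    """Return a copy of the environment with all GPUs hidden from CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env

def sample_on_cpu():
    """Run sample.py in a child process that cannot see any GPU."""
    return subprocess.run([sys.executable, "sample.py"],
                          env=cpu_only_env(), check=True)
```

Because only the child's environment is modified, the shell that is running training keeps its GPU settings and nothing needs to be unset afterward.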
To continue training after an interruption or to run for more epochs, use python train.py --init_from=save
You can use any plain text file as input. For example, you could download the complete Sherlock Holmes as follows:
cd data
mkdir sherlock
cd sherlock
wget https://sherlock-holm.es/stories/plain-text/cnus.txt
mv cnus.txt input.txt
Then start training from the top-level directory using python train.py --data_dir=./data/sherlock/
A quick tip to concatenate many small disparate .txt files into one large training file: ls *.txt | xargs -L 1 cat >> input.txt.
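For platforms without xargs, the concatenation tip above can be approximated in Python. This is a sketch rather than an exact equivalent: concatenate_txt is a hypothetical helper, it assumes the files are UTF-8, it overwrites input.txt instead of appending to it, and it skips input.txt itself so the output is never read back in.

```python
from pathlib import Path

def concatenate_txt(src_dir, out_name="input.txt"):
    """Join every .txt file in src_dir (in sorted order) into out_name."""
    src = Path(src_dir)
    out_path = src / out_name
    # Collect sources first and exclude the output file itself.
    sources = sorted(p for p in src.glob("*.txt") if p.name != out_name)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sources:
            out.write(path.read_text(encoding="utf-8"))
    return out_path
```

Calling concatenate_txt("data/sherlock") would leave a single input.txt in that directory, ready to pass to train.py via --data_dir.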
To visualize training progress, model graphs, and internal state histograms, fire up TensorBoard and point it at your log_dir. E.g.:
$ tensorboard --logdir=./logs/
Then open a browser to http://localhost:6006, or to the IP and port you specified.
Feel free to send pull requests, especially ones that simplify the setup as much as possible.