Distributed Character RNN

This is a port for running a character rnn with distributed tensorflow. Based on the original code from https://github.com/sherjilozair/char-rnn-tensorflow

Requirements

~~Tensorflow 1.4~~ Tensorflow 1.8

Basic Usage

1. For launching a distributed training environment

Step 1: First make sure your data is sharded if you want data parallel training. For doing that run the following command:

# Call with -h option for more help
python data_splitter.py --data_dir data/tinyshakespeare --num_parts 2 --out_dir sharded_data

This will create data-<num>.npy files in the out_dir. num is the number of partition. Note that vocabulary is not partitioned since it should be shared across all nodes and must be same globally. If vocab creation is left to runtime, it will differ for each partition.

Step 2: You need to launch each node as a different process. The command for launching any node is

 python train.py --distributed --ps_hosts 127.0.0.1:8000 --worker_hosts 127.0.0.1:9000,127.0.0.1:9001 --job_name $job_name --task_index $task_index --save_dir distrib-train

OR, execute the file launch.bat or launch.sh to quickly launch a distributed experiment with default settings.

The options --job_name takes value either ps or worker based on the node's role. Refer to this TF tutorial for more info on these roles. Similarly --task_index takes an integer indicating which node it is. ith worker node takes value i.

For more options run python train.py --help. Note any options you set must be same across all nodes except for node dependent settings like job_name, task_index, etc.

2. For running on a single process; without the distributed mode

To train with default parameters on the tinyshakespeare corpus, run python train.py. To access all the parameters use python train.py --help.

To sample from a checkpointed model, python sample.py. Sampling while the learning is still in progress (to check last checkpoint) works only in CPU or using another GPU. To force CPU mode, use export CUDA_VISIBLE_DEVICES="" and unset CUDA_VISIBLE_DEVICES afterward (resp. set CUDA_VISIBLE_DEVICES="" and set CUDA_VISIBLE_DEVICES= on Windows).

To continue training after interruption or to run on more epochs, python train.py --init_from=save

Datasets

You can use any plain text file as input. For example you could download The complete Sherlock Holmes as such:

cd data
mkdir sherlock
cd sherlock
wget https://sherlock-holm.es/stories/plain-text/cnus.txt
mv cnus.txt input.txt

Then start train from the top level directory using python train.py --data_dir=./data/sherlock/

A quick tip to concatenate many small disparate .txt files into one large training file: ls *.txt | xargs -L 1 cat >> input.txt.

Tensorboard

To visualize training progress, model graphs, and internal state histograms: fire up Tensorboard and point it at your log_dir. E.g.:

$ tensorboard --logdir=./logs/

Then open a browser to http://localhost:6006 or the correct IP/Port specified.

Contributing

Feel free to send pull requests. Especially related to simplifying the setup as much as possible.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data/tinyshakespeare		data/tinyshakespeare
logs		logs
save		save
sharded_data		sharded_data
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE.md		LICENSE.md
README.md		README.md
data_splitter.py		data_splitter.py
launch.bat		launch.bat
launch.sh		launch.sh
model.py		model.py
requirements.txt		requirements.txt
sample.py		sample.py
train.py		train.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed Character RNN

Requirements

Basic Usage

1. For launching a distributed training environment

2. For running on a single process; without the distributed mode

Datasets

Tensorboard

Contributing

About

Releases

Packages

Languages

License

Abhishek8394/distributed_char_rnn

Folders and files

Latest commit

History

Repository files navigation

Distributed Character RNN

Requirements

Basic Usage

1. For launching a distributed training environment

2. For running on a single process; without the distributed mode

Datasets

Tensorboard

Contributing

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages