
Cifar10, with BSP vs EASGD #25

Open
dlunga opened this issue Aug 4, 2017 · 5 comments

dlunga commented Aug 4, 2017

launch_session.py

# launch_session.py
from theanompi import EASGD
#from theanompi import BSP

rule=EASGD()
#rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1', 'cuda2'],
          modelfile = 'theanompi.models.cifar10',
          modelclass = 'Cifar10_model')
rule.wait()

Which is based on:

class Cifar10_model(object): # c01b input
    
    def __init__(self, config): 

        self.verbose = config['verbose']
        self.rank = config['rank'] # will be used in sharding and distinguish rng
        self.size = config['size']
        
        import theano
        self.name = 'Cifar10_model'
        
        # data
        from theanompi.models.data import Cifar10_data
        self.data = Cifar10_data(verbose=False)
        self.channels = self.data.channels # 'c' mean(R,G,B) = (103.939, 116.779, 123.68)
        self.input_width = input_width # '0' single scale training 224
        self.input_height = input_height # '1' single scale training 224
        # if self.size>1: # only use avg
        #     self.batch_size = batch_size/self.size
        # else:
        self.batch_size = batch_size # 'b'
        self.file_batch_size = file_batch_size
        self.n_softmax_out = self.data.n_class
        
        # mini batching and other data parallel common routine
        self.data.batch_data(file_batch_size)
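
(In the snippet above, names such as input_width, input_height, batch_size, and file_batch_size are presumably module-level hyperparameters defined near the top of theanompi.models.cifar10; the Wide_ResNet snippet further down similarly references nb_epoch and batch_size. For a self-contained reading they would look roughly like the sketch below; the values are placeholders for illustration, not the repository's actual settings.)

# Assumed module-level hyperparameters referenced by Cifar10_model.__init__
# (placeholder values for illustration only, not the repository's settings)
input_width = 32       # CIFAR-10 images are 32x32
input_height = 32
batch_size = 128       # per-iteration mini-batch size
file_batch_size = 128  # number of images loaded per file batch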
And the second launch session, for Wide_ResNet with BSP:

from theanompi import BSP
#rule=EASGD()
rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1', 'cuda2'],
          modelfile = 'models.keras_model_zoo.wresnet',
          modelclass = 'Wide_ResNet')
rule.wait()

Which uses:

class Wide_ResNet(object):
    
    '''
    Modified from:
    https://gist.github.com/kashif/0ba0270279a0f38280423754cea2ee1e
    '''
    
    def __init__(self, config):
        
        self.verbose = config['verbose']
        self.size = config['size']
        self.rank = config['rank']
        
        self.name = 'Wide_ResNet'
        
        
        # data
        from theanompi.models.keras_model_zoo.data.cifar10 import Cifar10_data  # dalton code
        self.data = Cifar10_data(verbose=False)

        self.build_model()

        # iter related
        
        self.current_t = 0
        self.last_one_t=False
        self.current_v = 0
        self.last_one_v=False
        
        self.n_subb = 1
        self.n_epochs = nb_epoch
        self.epoch=0
    ...

    def compile_iter_fns(self, sync_type='avg'):
        self.data.batch_data(self.model, batch_size)
        self.data.extend_data(rank=self.rank, size=self.size)
        self.data.shuffle_data(mode='train', common_seed=1234)
        self.data.shuffle_data(mode='val')
        self.data.shard_data(mode='train', rank=self.rank, size=self.size) # to update data.n_batch_train
        self.data.shard_data(mode='val', rank=self.rank, size=self.size) # to update data.n_batch_val
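
The shard_data calls presumably split the training and validation batches across workers by rank so that each GPU sees a disjoint subset. Conceptually it would behave like the sketch below (an assumption about the behaviour, not the actual Cifar10_data implementation; shard_batches is a hypothetical helper name):

# Hypothetical illustration of rank-based sharding; shard_batches is not part
# of theanompi.
def shard_batches(batches, rank, size):
    """Give each of the `size` workers an equal, disjoint slice of `batches`."""
    n_per_worker = len(batches) // size        # drop any remainder so shards match
    start = rank * n_per_worker
    return batches[start:start + n_per_worker]

# e.g. with 390 training batches and 3 workers, each rank gets 130 batches
batches = list(range(390))
assert len(shard_batches(batches, rank=0, size=3)) == 130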

Could you please share your test results for both models based on this dataset:

from theanompi.models.keras_model_zoo.data.cifar10 import Cifar10_data
self.data = Cifar10_data(verbose=False)
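
For context on what is being compared: BSP synchronises the workers every iteration (typically by exchanging and averaging parameters or gradients), while EASGD lets each worker drift locally and pulls it elastically toward a shared center variable (Zhang et al., 2015). A rough numpy sketch of the two update rules, purely illustrative and not Theano-MPI's implementation (lr and rho are generic names here):

import numpy as np

# Illustrative only; not Theano-MPI's code.
def bsp_step(worker_params, worker_grads, lr):
    # each worker takes its own SGD step ...
    stepped = [p - lr * g for p, g in zip(worker_params, worker_grads)]
    # ... then all workers synchronise by averaging their parameters
    avg = np.mean(stepped, axis=0)
    return [avg.copy() for _ in stepped]

def easgd_step(worker_params, worker_grads, center, lr, rho):
    # each worker's update includes an elastic pull toward the center variable
    new_params = [p - lr * (g + rho * (p - center))
                  for p, g in zip(worker_params, worker_grads)]
    # the center variable moves toward the workers' (previous) parameters
    center = center + lr * rho * sum(p - center for p in worker_params)
    return new_params, center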

hma02 commented Aug 8, 2017

@dlunga

Ok. Let me run those examples and get back to you.


hma02 commented Aug 8, 2017

@dlunga

Here are the log files when training Cifar10_model (the toy model):
cifar10-3worker-log.txt
cifar10-1worker-log.txt
and Wide_ResNet:
wresnet-3worker-log.txt

The single-worker Wide_ResNet run has not finished yet.


hma02 commented Aug 9, 2017

@dlunga

The single worker Wide_ResNet log:
wresnet-1worker-log.txt


hma02 commented Aug 9, 2017

@dlunga

Note that although both dataset classes are named Cifar10_data, there are some differences between the two in the class methods that feed input data. This is because the Keras version relies on Keras internals that are not meant to be used directly by users, so it involves a bit of hacking around Keras.

On the other hand, the Cifar10_data for Cifar10_model is implemented in the normal Theano-MPI way.
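
For contrast, an ordinary (non-distributed) Keras feeding loop for CIFAR-10 would look roughly like the sketch below; the Theano-MPI Keras wrapper instead relies on the internals mentioned above (the sketch is illustrative, and model is assumed to be a compiled Keras model built elsewhere):

# Sketch of ordinary Keras batch feeding for CIFAR-10, shown only for contrast;
# `model` is assumed to be a compiled Keras model built elsewhere.
from keras.datasets import cifar10
from keras.utils import to_categorical

(x_train, y_train), _ = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

batch_size = 128
for start in range(0, len(x_train), batch_size):
    x_batch = x_train[start:start + batch_size]
    y_batch = y_train[start:start + batch_size]
    # model.train_on_batch(x_batch, y_batch)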


Nqabz commented Aug 10, 2017

Thanks so much. As indicated on the other channel, I will test on a different cluster to see if the issues go away. Will update once done.
