Resource exhausted: OOM when allocating tensor #10

Open · skeletonli opened this issue Jul 16, 2019 · 4 comments
@skeletonli

Dr. Wang, thank you so much for your wonderful work.
When I run the last step, python main.py, an error occurred:

2019-07-15 17:38:14.500279: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

My GPU information is:
2019-07-15 17:37:54.354919: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:05:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019-07-15 17:37:54.538331: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB

Other information printed before the error occurs:
2019-07-15 17:38:14.475725: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 2265139200 totalling 2.11GiB
2019-07-15 17:38:14.480071: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:678] Sum Total of in-use chunks: 5.72GiB
2019-07-15 17:38:14.484854: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:680] Stats:
Limit: 9244818801
InUse: 6142976256
MaxInUse: 6369439488
NumAllocs: 16017
MaxAllocSize: 2265139200

I wonder how I can solve this problem. Thank you!

@hwwang55 (Owner)

Hi! This is kind of weird because the default batch size is not that large. Reducing the batch size might help.

@skeletonli (Author)

Thank you for your reply.
I tried setting batch_size to 64 and even 32, but it still gets the error.
I found that the problem appears in this line of the train() function in train.py:

        # evaluation
        train_auc = model.eval(sess, get_feed_dict(model, train_data, 0, train_data.size))

It loads all the train_data into the feed_dict.
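
(For reference, a minimal sketch of evaluating in mini-batches instead of feeding the full training set at once. `eval_in_batches` is a hypothetical helper, not part of the repo; it assumes `model.eval(sess, feed_dict)` returns an AUC for the given batch, so the mean of per-batch AUCs is only an approximation of the full-dataset AUC.)

```python
import numpy as np

# Hypothetical helper (not the repo's code): evaluate in mini-batches so the
# whole dataset is never fed to the GPU at once. Assumes model.eval returns a
# per-batch AUC; averaging per-batch AUCs only approximates the global AUC.
def eval_in_batches(sess, model, data, batch_size):
    aucs = []
    for start in range(0, data.size, batch_size):
        end = min(start + batch_size, data.size)
        aucs.append(model.eval(sess, get_feed_dict(model, data, start, end)))
    return float(np.mean(aucs))

# usage inside train():
# train_auc = eval_in_batches(sess, model, train_data, args.batch_size)
# test_auc = eval_in_batches(sess, model, test_data, args.batch_size)
```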

In addition, when I used nvidia-smi to check how the GPU memory gets exhausted while running this code:

    def train(args, train_data, test_data):
        model = DKN(args)
        with tf.Session() as sess:
            ...

my GPUs used almost all of their memory, as shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17 Driver Version: 419.17 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:05:00.0 On | N/A |
| 0% 51C P8 16W / 275W | 9863MiB / 11264MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... WDDM | 00000000:09:00.0 On | N/A |
| 0% 51C P2 64W / 275W | 9429MiB / 11264MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

How can I solve this problem, please?

train_data size: 14747
test_data size: 408
word_embs size: 401650
entity_embs size: 91000
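
(A back-of-the-envelope check, assuming the 442410 in the OOM tensor shape comes from 14747 training samples times a clicked-history length of 30, and float32 elements:)

```python
# Assumption: 442410 = 14747 training samples x 30 clicked-news history entries.
rows = 14747 * 30                 # = 442410, matches the OOM tensor's first dim
elements = rows * 128 * 9 * 1     # tensor shape [442410, 128, 9, 1]
gib = elements * 4 / 2**30        # float32 = 4 bytes per element
print(round(gib, 2))              # ~1.9 GiB for a single intermediate tensor
```

So a single intermediate activation of full-dataset evaluation is already around 2 GiB, which is why feeding all of train_data at once exhausts the GPU even when training with small batches works.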

@skeletonli (Author)

I tried using:

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.9
    config.gpu_options.allow_growth = True
    config.log_device_placement = True

Although GPU memory usage is lower, it still crashes with OOM when running eval.
Thu Jul 18 10:41:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17 Driver Version: 419.17 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:05:00.0 On | N/A |
| 0% 56C P2 64W / 275W | 8936MiB / 11264MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... WDDM | 00000000:09:00.0 On | N/A |
| 0% 49C P8 17W / 275W | 602MiB / 11264MiB | 4% Default |
+-------------------------------+----------------------+----------------------+

So I tried to use one GPU to train and another GPU to eval, using the code below:

    with tf.device('/gpu:0'):
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            sess.run(tf.local_variables_initializer())

            for step in range(args.n_epochs):
                # training
                start_list = list(range(0, train_data.size, args.batch_size))
                np.random.shuffle(start_list)
                for start in start_list:
                    end = start + args.batch_size
                    model.train(sess, get_feed_dict(model, train_data, start, end))

                config2 = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=True)
                config2.gpu_options.allow_growth = True
                with tf.Session(config=config2) as sess2:
                    sess2.run(tf.global_variables_initializer())
                    sess2.run(tf.local_variables_initializer())
                    # evaluation
                    train_auc = model.eval(sess2, get_feed_dict(model, train_data, 0, int(train_data.size)))
                    test_auc = model.eval(sess2, get_feed_dict(model, test_data, 0, test_data.size))
                    print('epoch %d    train_auc: %.4f    test_auc: %.4f' % (step, train_auc, test_auc))

But it does not work; GPU 0 is still used for eval, showing "W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
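
(Two notes on this attempt, for reference: `device_count={'GPU': 1}` only caps how many GPUs the session may use, and TensorFlow will still pick GPU 0; and re-running the variable initializers in a fresh session resets the trained weights anyway. If one still wants the two-session approach, a hedged sketch of restricting the eval session to the second GPU via `gpu_options.visible_device_list`:)

```python
import tensorflow as tf

# Sketch only: make the eval session see GPU 1 exclusively. Inside this
# session the device is re-numbered as /gpu:0. Note that variables
# initialized here do NOT share the weights trained in the other session,
# so the trained parameters would have to be transferred (e.g. via a
# checkpoint) for the evaluation AUC to be meaningful.
config2 = tf.ConfigProto(log_device_placement=True)
config2.gpu_options.visible_device_list = '1'
config2.gpu_options.allow_growth = True

with tf.Session(config=config2) as sess2:
    ...
```

In practice, evaluating in mini-batches within the original training session (as in the earlier sketch) is probably the simpler fix.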

@zhhhzhang
> [quotes @skeletonli's previous comment in full]

How did you solve this problem in the end? Thanks.
