Resource exhausted: OOM when allocating tensor #10

Open · skeletonli opened this issue Jul 16, 2019 · 4 comments
@skeletonli

Dr. Wang, thank you so much for your wonderful work.
When I run the last step, python main.py, an error occurred:

2019-07-15 17:38:14.500279: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

My GPU information is:
2019-07-15 17:37:54.354919: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:05:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019-07-15 17:37:54.538331: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB

Other information printed before the error occurs:
2019-07-15 17:38:14.475725: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 2265139200 totalling 2.11GiB
2019-07-15 17:38:14.480071: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:678] Sum Total of in-use chunks: 5.72GiB
2019-07-15 17:38:14.484854: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:680] Stats:
Limit: 9244818801
InUse: 6142976256
MaxInUse: 6369439488
NumAllocs: 16017
MaxAllocSize: 2265139200

I wonder how I can solve this problem. Thank you!

@hwwang55 (Owner)

Hi! This is kind of weird because the default batch size is not that large. Reducing the batch size might help.

@skeletonli (Author)

Thank you for your reply.
I tried setting batch_size to 64 and even 32, but it still gets the error.
I found that the problem appears in this line of the train() function in train.py:

        # evaluation
        train_auc = model.eval(sess, get_feed_dict(model, train_data, 0, train_data.size))

It loads all the train_data into the feed_dict.
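
(For reference, a minimal sketch of evaluating in mini-batches instead of feeding the full training set at once. `eval_in_batches` is a hypothetical helper, not part of the repo; it assumes `model.eval(sess, feed_dict)` returns an AUC for the given batch, so the mean of per-batch AUCs is only an approximation of the full-dataset AUC.)

```python
import numpy as np

# Hypothetical helper (not the repo's code): evaluate in mini-batches so the
# whole dataset is never fed to the GPU at once. Assumes model.eval returns a
# per-batch AUC; averaging per-batch AUCs only approximates the global AUC.
def eval_in_batches(sess, model, data, batch_size):
    aucs = []
    for start in range(0, data.size, batch_size):
        end = min(start + batch_size, data.size)
        aucs.append(model.eval(sess, get_feed_dict(model, data, start, end)))
    return float(np.mean(aucs))

# usage inside train():
# train_auc = eval_in_batches(sess, model, train_data, args.batch_size)
# test_auc = eval_in_batches(sess, model, test_data, args.batch_size)
```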

In addition, when I used nvidia-smi to check how the GPU memory gets exhausted while running this code:

    def train(args, train_data, test_data):
        model = DKN(args)
        with tf.Session() as sess:
            ...

my GPUs used almost all of their memory, as shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17 Driver Version: 419.17 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:05:00.0 On | N/A |
| 0% 51C P8 16W / 275W | 9863MiB / 11264MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... WDDM | 00000000:09:00.0 On | N/A |
| 0% 51C P2 64W / 275W | 9429MiB / 11264MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

How can I solve this problem, please?

train_data size: 14747
test_data size: 408
word_embs size: 401650
entity_embs size: 91000
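
(A back-of-the-envelope check, assuming the 442410 in the OOM tensor shape comes from 14747 training samples times a clicked-history length of 30, and float32 elements:)

```python
# Assumption: 442410 = 14747 training samples x 30 clicked-news history entries.
rows = 14747 * 30                 # = 442410, matches the OOM tensor's first dim
elements = rows * 128 * 9 * 1     # tensor shape [442410, 128, 9, 1]
gib = elements * 4 / 2**30        # float32 = 4 bytes per element
print(round(gib, 2))              # ~1.9 GiB for a single intermediate tensor
```

So a single intermediate activation of full-dataset evaluation is already around 2 GiB, which is why feeding all of train_data at once exhausts the GPU even when training with small batches works.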

@skeletonli (Author)

I tried using:

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.9
    config.gpu_options.allow_growth = True
    config.log_device_placement = True

Although GPU memory usage is lower, it still crashes with OOM when running eval.
Thu Jul 18 10:41:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17 Driver Version: 419.17 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:05:00.0 On | N/A |
| 0% 56C P2 64W / 275W | 8936MiB / 11264MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... WDDM | 00000000:09:00.0 On | N/A |
| 0% 49C P8 17W / 275W | 602MiB / 11264MiB | 4% Default |
+-------------------------------+----------------------+----------------------+

So I tried to use one GPU to train and another GPU to eval, using the code below:

    with tf.device('/gpu:0'):
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            sess.run(tf.local_variables_initializer())

            for step in range(args.n_epochs):
                # training
                start_list = list(range(0, train_data.size, args.batch_size))
                np.random.shuffle(start_list)
                for start in start_list:
                    end = start + args.batch_size
                    model.train(sess, get_feed_dict(model, train_data, start, end))

                config2 = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=True)
                config2.gpu_options.allow_growth = True
                with tf.Session(config=config2) as sess2:
                    sess2.run(tf.global_variables_initializer())
                    sess2.run(tf.local_variables_initializer())
                    # evaluation
                    train_auc = model.eval(sess2, get_feed_dict(model, train_data, 0, int(train_data.size)))
                    test_auc = model.eval(sess2, get_feed_dict(model, test_data, 0, test_data.size))
                    print('epoch %d    train_auc: %.4f    test_auc: %.4f' % (step, train_auc, test_auc))

But it does not work; GPU 0 is still used for eval, showing "W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
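
(Two notes on this attempt, for reference: `device_count={'GPU': 1}` only caps how many GPUs the session may use, and TensorFlow will still pick GPU 0; and re-running the variable initializers in a fresh session resets the trained weights anyway. If one still wants the two-session approach, a hedged sketch of restricting the eval session to the second GPU via `gpu_options.visible_device_list`:)

```python
import tensorflow as tf

# Sketch only: make the eval session see GPU 1 exclusively. Inside this
# session the device is re-numbered as /gpu:0. Note that variables
# initialized here do NOT share the weights trained in the other session,
# so the trained parameters would have to be transferred (e.g. via a
# checkpoint) for the evaluation AUC to be meaningful.
config2 = tf.ConfigProto(log_device_placement=True)
config2.gpu_options.visible_device_list = '1'
config2.gpu_options.allow_growth = True

with tf.Session(config=config2) as sess2:
    ...
```

In practice, evaluating in mini-batches within the original training session (as in the earlier sketch) is probably the simpler fix.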

@zhhhzhang
> [quotes @skeletonli's previous comment in full]

How did you solve this problem in the end? Thanks.
