
Memory Leak on tf.GraphModel.predict #6937

Closed
LucasMarianoVieira opened this issue Oct 13, 2022 · 9 comments · Fixed by #7490

@LucasMarianoVieira

System information

OS Platform and Distribution: Ubuntu 18.04 LTS
TensorFlow.js installed from: npm
TensorFlow.js version: @tensorflow/tfjs^3.3.0 and @tensorflow/tfjs-node^3.3.0
(but it seems to affect every other version I tried)

Describe the current behavior

I'm doing a simple detection task with a TensorFlow saved model (I didn't use the tensorflowjs_converter tool) that was trained for a given task we have here at the company. I'm not using a GPU. I load the model normally and then feed the images into the model to detect some specific elements in the image. I have tried literally everything I could find. It seems that every time I run the model.predict method (the model being loaded as a tf.GraphModel), the program leaks the memory corresponding to the tensor fed into the method.

The application we have here makes tens of thousands of detections per day, which causes the program's memory footprint to grow by several GB until it takes all the memory in the computer and crashes. I have already tried the usual methods, like disposing the tensors with tf.dispose or using tf.tidy to contain the code handling tensors, but as far as I can tell it's really the model.predict call that is leaking the memory. Below I give the simple test code I was using, which loads a basic model from the TensorFlow Model Zoo.

Describe the expected behavior

The program shouldn't leak memory. So after tens of thousands of detections, the memory footprint should be roughly the same size.

Standalone code to reproduce the issue

Just run the following code with this model (SSD ResNet50 V1 FPN 640x640, here) and the image available here, and the memory footprint of the program will start to increase as it makes more and more detections.

const fs = require("fs");
const path = require('path');
const tf = require('@tensorflow/tfjs-node');

const modelPath = "model/saved_model";

let model = null;
let dtype = null;
let map;

let counterDetection = 0;

const init = async () => {
    await tf.setBackend('tensorflow'); 
    await tf.enableProdMode();
    tf.ENV.set('DEBUG', false);
    await tf.ready();
    let def = (await tf.node.getMetaGraphsFromSavedModel(modelPath))[0];
    let tags = def['tags'];
    let signature = Object.keys(def.signatureDefs)[0]
    let outputs = def.signatureDefs[signature].outputs;
    dtype = Object.values(def.signatureDefs[signature].inputs)[0]['dtype']
    map = {};
    for (let i in Object.keys(outputs)) map[Object.keys(outputs)[i]] = i;
    let t1 = process.hrtime.bigint();
    model = await tf.node.loadSavedModel(modelPath, tags, signature);
    console.log('Tensorflow ready!');
}

async function main(){
    // Load Model
    await init();
    
    let image_name = "image1.jpg";
    var bitmap = fs.readFileSync(image_name);
    let base64 = bitmap.toString('base64');
    let buffer = Buffer.from(base64, 'base64')
	
    let tfimage = tf.node.decodeImage(buffer);
    let expanded = tf.expandDims(tfimage, 0);
    let casted = tf.cast(expanded, dtype);
    tf.dispose(expanded);

    let imageT;
    if (dtype === 'float32') {
        imageT = tf.mul(casted, [1.0 / 255.0]);
    } else {
        imageT = tf.clone(casted);
    }
    tf.dispose(casted);
	tf.dispose(tfimage);

	while(true){
		let res = tf.tidy(() => {
			return model.predict(imageT);
		});
		tf.dispose(res);
		
		//show memory state
		console.log("Memory: ", process.memoryUsage().rss / 1024 / 1024, " MB");
		console.log("Number of tensors in memory: ", (tf.memory().numTensors));
		console.log(" Bytes in Memory: ", (tf.memory().numBytes));
		counterDetection = counterDetection + 1;
		console.log("Detection number: ", counterDetection);

	}
}
setTimeout(main, 100);

Observation

I tried a workaround where I converted the model using tensorflowjs_converter, and it seems it doesn't leak memory when I use the model.executeAsync method. However, I can't use that with the retrained model we have here (it's the same ResNet, but trained over a dataset following the instructions here), because loading it causes an Uncaught TypeError: Cannot read properties of undefined (reading 'children'), and I really have no idea why that happens.
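
For reference, the workaround looks roughly like this (just a sketch: the converter flags, the paths, and the output handling are illustrative, not the exact commands I ran):

// Conversion step (run once, outside Node):
//   tensorflowjs_converter --input_format=tf_saved_model \
//       --output_format=tfjs_graph_model model/saved_model model/web_model
const tf = require('@tensorflow/tfjs-node');

async function runConverted(imageT) {
    // Load the converted graph model from disk and use executeAsync instead of predict
    const graphModel = await tf.loadGraphModel(tf.io.fileSystem('model/web_model/model.json'));
    const res = await graphModel.executeAsync(imageT); // a tensor or an array of tensors
    // ... read boxes / scores / classes from res ...
    tf.dispose(res);
    graphModel.dispose();
}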

Thanks for any help; this has been making me lose my mind for weeks already. Is there something I'm missing?

@rthadur self-assigned this Oct 14, 2022
@rthadur (Contributor) commented Oct 14, 2022

@LucasMarianoVieira have you tried with the latest tfjs version?

@LucasMarianoVieira (Author)

@rthadur, yes.
I tested both with version 3.21.0 last week and version 4.0.0 yesterday; same problem, I keep seeing the memory leak happening.

@bartbutenaers

Hi @rthadur,

We have been building an object detection module for the open source Node-RED system, so we can run the detections on a Coral TPU USB stick. That works amazingly fast, but we now see that it is leaking memory at the predict function:

var tf = require('@tensorflow/tfjs-node');
var tflite = require('tfjs-tflite-node');
var {CoralDelegate} = require('coral-tflite-delegate');
                    
var modelUrl = "https://coral.ai/models/object-detection/";
                    
tflite.loadTFLiteModel(modelUrl, {delegates: [new CoralDelegate()]}).then(model => {
   ...
}).catch(err => {
   ...
});
                
...
// Here the memory starts growing
var detectionResult = node.model.predict(resizedImageTensor);

The tf.tidy indeed doesn't help. Since memory is filling up at a high rate, our module is currently not usable. We would appreciate it a lot if somebody could have a look at it.

Our setup is not quite the same as the one from @LucasMarianoVieira, since we use TFLite in tfjs, which is still in an alpha phase. We did not want to start duplicate issues, but don't hesitate to let us know if you want us to create a new one!
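
For completeness, this is roughly the pattern we tried (a sketch only; resizedImageTensor is prepared elsewhere in our node, and the TFLite model's output may be a single tensor, an array, or a map of named tensors):

// Keep predict inside tf.tidy and dispose whatever comes out of it.
var outputs = tf.tidy(() => node.model.predict(resizedImageTensor));

// tf.dispose also accepts arrays/objects of tensors, so dispose the whole structure.
tf.dispose(outputs);
tf.dispose(resizedImageTensor);

console.log('tensors still tracked by tfjs:', tf.memory().numTensors);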

Kind regards,
Bart

@bartbutenaers

Hi,
Is there anybody who could help us with this? We have spent quite some time on getting object detection running on a Coral TPU stick via Node.js, and it works amazingly fast for a single image. Really cool! But due to this memory leak it is unusable for running object detection on live IP cam streams.

When I compare two heap dumps via the Chrome developer tools (with recording of allocation stack traces enabled), it shows no usable information. It only tells me that the memory was allocated before the profiler was started. Since my profiler was already running before I started processing images, I assume that means the array buffer allocations are happening outside of our Node.js process? But that is unfortunately above my pay grade...
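
In case it helps, the check I'm doing now is simply comparing the V8 heap against the total process memory (a sketch; my assumption is that if rss and external keep growing while heapUsed stays flat, the leak is in native allocations rather than in our JavaScript, which would also explain the empty heap-snapshot diff):

// Log the different memory classes every so often.
// heapUsed flat + rss/external/arrayBuffers growing => the leaked memory lives
// outside the V8 heap, so a Chrome heap snapshot can't see it.
function logMemory(label) {
    const m = process.memoryUsage();
    const mb = (x) => (x / 1024 / 1024).toFixed(1) + ' MB';
    console.log(label,
        'rss:', mb(m.rss),
        'heapUsed:', mb(m.heapUsed),
        'external:', mb(m.external),
        'arrayBuffers:', mb(m.arrayBuffers));
}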

Thanks!!

@ahmedsabie (Contributor)

I tried it locally and got constant memory usage; I haven't been able to reproduce it so far.

@LucasMarianoVieira (Author)

I tried again... I'm using Conda to make a virtual environment (but on several of the machines we use I install Node directly, with no virtual environment involved).

I also just updated Node to version 18.12.1 and npm to version 8.19.2, all running under Ubuntu 18.04. Same code as I presented above, now with TensorFlow.js version 4.1.0, and the memory leak is still there. It's pretty dramatic: in a matter of an hour or two it ends up using all the computer's memory.

@bartbutenaers

@ahmedsabie,
Sorry for the delay...
Very kind of you to try it!!

Do you perhaps know something we could try to find the root cause of the leak on our platform (a Raspberry Pi 4)? As mentioned above, the delta between two successive heap dumps doesn't contain useful information.

@LucasMarianoVieira: I assume you have already tried it, but could an explicit call to tf.dispose(imageT) make any difference for you?
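
What I had in mind is something like this variation of your loop (just a sketch: it recreates the input on each iteration so that both the input and the prediction can be disposed explicitly every time; buildInput is a hypothetical helper standing in for your decodeImage/expandDims/cast/normalize steps):

// Sketch: recreate and dispose the input on every iteration, so nothing from a
// previous iteration is still alive when predict runs.
// buildInput() is a hypothetical helper wrapping the preprocessing steps above.
while (true) {
    const input = tf.tidy(() => buildInput(buffer, dtype));
    const res = model.predict(input);

    tf.dispose(res);    // dispose the prediction outputs
    tf.dispose(input);  // explicit dispose of the input tensor

    console.log('tensors:', tf.memory().numTensors,
                'rss MB:', (process.memoryUsage().rss / 1024 / 1024).toFixed(1));
}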

@bartbutenaers

@LucasMarianoVieira,
I think my memory leak is quite different from yours, I'm afraid. It wasn't my intention to hijack your thread here...
In my case the memory leaks a few MB for every image, which suggests a complete image is leaking. I will investigate further whether the problem is perhaps somewhere in my own code, since you don't have it.

But I can confirm that I also have a slow memory leak like you have.
I added a return statement after the decode last evening:

var imageTensor = tf.node.decodeImage(inputImage);   
tf.dispose(imageTensor);
return;

Yesterday evening the memory usage on my Raspberry Pi 4 was between 377 MB and 399 MB: [screenshot of memory usage]

After it has been running overnight, the memory usage has now increased: [screenshot of memory usage]

If you only execute the decoding, is that enough to start leaking??

@LucasMarianoVieira (Author)

@bartbutenaers, oh indeed, I tried tf.dispose before.

With the same result. The memory seems to be leaking from within model.predict, at something on the order of a MB for every run, so it adds up quickly after a few hundred runs. If I don't use tf.tidy or tf.dispose on the tensors involved, then I get that memory leak, and more! XD

When I run just tf.node.decodeImage followed by a tf.dispose, I do see a very tiny increase in memory use of a few MB after a few tens of thousands of runs, so I think that's not significant or related.
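
For reference, this is roughly how I'm attributing the growth to model.predict (a sketch of the check, dropped inside the loop of my repro code above): tf.memory().numBytes stays flat while the rss delta around the predict call keeps coming out positive, which points at a native-side allocation rather than an undisposed tensor.

// Sketch: sample process memory right around the predict call only.
// If tf.memory().numBytes stays constant while rss keeps climbing here,
// the growth is not in tensors tracked by tfjs but in native allocations.
const before = process.memoryUsage().rss;

const res = model.predict(imageT);
tf.dispose(res);

const after = process.memoryUsage().rss;
console.log('rss delta for this predict call:',
            ((after - before) / 1024 / 1024).toFixed(2), 'MB',
            '| tfjs numBytes:', tf.memory().numBytes);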
