
Memory Leak on tf.GraphModel.predict #6937

Closed
LucasMarianoVieira opened this issue Oct 13, 2022 · 9 comments · Fixed by #7490

@LucasMarianoVieira

System information

OS Platform and Distribution: Ubuntu 18.04 LTS
TensorFlow.js installed from: npm
TensorFlow.js version: @tensorflow/tfjs^3.3.0 and @tensorflow/tfjs-node^3.3.0
(but it seems to affect every other version I tried)

Describe the current behavior

I'm doing a simple detection task with a TensorFlow saved model (I didn't use the tensorflowjs_converter tool) that was trained for a given task we have here at the company. I'm not using a GPU. I load the model normally and then feed the images into the model to detect some specific elements in the image. I have tried literally everything I could find. It seems that every time I run the model.predict method (the model being loaded as a tf.GraphModel), the program leaks the memory corresponding to the tensor fed into the method.

The application we have here makes tens of thousands of detections per day, which causes the program's memory footprint to grow by several GB until it takes all the memory in the computer and crashes. I have already tried the usual methods, like disposing the tensors with tf.dispose or using tf.tidy to contain the code handling tensors, but as far as I can tell it's really the model.predict call that is leaking the memory. Below I give the simple test code I was using, which loads a basic model from the TensorFlow Model Zoo.

Describe the expected behavior

The program shouldn't leak memory. So after tens of thousands of detections, the memory footprint should be roughly the same size.

Standalone code to reproduce the issue

Just run the following code with this model (SSD ResNet50 V1 FPN 640x640, here) and the image available here, and the memory footprint of the program will start to increase as it makes more and more detections.

const fs = require("fs");
const path = require('path');
const tf = require('@tensorflow/tfjs-node');

const modelPath = "model/saved_model";

let model = null;
let dtype = null;
let map;

let counterDetection = 0;

const init = async () => {
    await tf.setBackend('tensorflow'); 
    await tf.enableProdMode();
    tf.ENV.set('DEBUG', false);
    await tf.ready();
    let def = (await tf.node.getMetaGraphsFromSavedModel(modelPath))[0];
    let tags = def['tags'];
    let signature = Object.keys(def.signatureDefs)[0]
    let outputs = def.signatureDefs[signature].outputs;
    dtype = Object.values(def.signatureDefs[signature].inputs)[0]['dtype']
    map = {};
    for (let i in Object.keys(outputs)) map[Object.keys(outputs)[i]] = i;
    let t1 = process.hrtime.bigint();
    model = await tf.node.loadSavedModel(modelPath, tags, signature);
    console.log('Tensorflow ready!');
}

async function main(){
    // Load Model
    await init();
    
    let image_name = "image1.jpg";
    var bitmap = fs.readFileSync(image_name);
    let base64 = bitmap.toString('base64');
    let buffer = Buffer.from(base64, 'base64')
	
    let tfimage = tf.node.decodeImage(buffer);
    let expanded = tf.expandDims(tfimage, 0);
    let casted = tf.cast(expanded, dtype);
    tf.dispose(expanded);

    let imageT;
    if (dtype === 'float32') {
        imageT = tf.mul(casted, [1.0 / 255.0]);
    } else {
        imageT = tf.clone(casted);
    }
    tf.dispose(casted);
	tf.dispose(tfimage);

	while(true){
		let res = tf.tidy(() => {
			return model.predict(imageT);
		});
		tf.dispose(res);
		
		//show memory state
		console.log("Memory: ", process.memoryUsage().rss / 1024 / 1024, " MB");
		console.log("Number of tensors in memory: ", (tf.memory().numTensors));
		console.log(" Bytes in Memory: ", (tf.memory().numBytes));
		counterDetection = counterDetection + 1;
		console.log("Detection number: ", counterDetection);

	}
}
setTimeout(main, 100);

Observation

I tried a workaround where I converted the model using tensorflowjs_converter, and it seems it doesn't leak memory when I use the model.executeAsync method. However, I can't use that with the retrained model we have here (it's the same ResNet, but trained over a dataset following the instructions here), because loading it causes an Uncaught TypeError: Cannot read properties of undefined (reading 'children'), and I really have no idea why that happens.
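
For reference, the workaround looks roughly like this (just a sketch: the converter flags, the paths, and the output handling are illustrative, not the exact commands I ran):

// Conversion step (run once, outside Node):
//   tensorflowjs_converter --input_format=tf_saved_model \
//       --output_format=tfjs_graph_model model/saved_model model/web_model
const tf = require('@tensorflow/tfjs-node');

async function runConverted(imageT) {
    // Load the converted graph model from disk and use executeAsync instead of predict
    const graphModel = await tf.loadGraphModel(tf.io.fileSystem('model/web_model/model.json'));
    const res = await graphModel.executeAsync(imageT); // a tensor or an array of tensors
    // ... read boxes / scores / classes from res ...
    tf.dispose(res);
    graphModel.dispose();
}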

Thanks for any help; this has been making me lose my mind for weeks already. Is there something I'm missing?

@rthadur self-assigned this Oct 14, 2022
@rthadur (Contributor) commented Oct 14, 2022

@LucasMarianoVieira have you tried with the latest tfjs version?

@LucasMarianoVieira (Author)

@rthadur, yes.
I tested both with version 3.21.0 last week and version 4.0.0 yesterday; same problem, I keep seeing the memory leak happening.

@bartbutenaers

Hi @rthadur,

We have been building an object detection module for the open source Node-RED system, so we can run the detections on a Coral TPU USB stick. That works amazingly fast, but we now see that it is leaking memory at the predict function:

var tf = require('@tensorflow/tfjs-node');
var tflite = require('tfjs-tflite-node');
var {CoralDelegate} = require('coral-tflite-delegate');
                    
var modelUrl = "https://coral.ai/models/object-detection/";
                    
tflite.loadTFLiteModel(modelUrl, {delegates: [new CoralDelegate()]}).then(model => {
   ...
}).catch(err => {
   ...
});
                
...
// Here the memory starts growing
var detectionResult = node.model.predict(resizedImageTensor);

The tf.tidy indeed doesn't help. Since memory is filling up at a high rate, our module is currently not usable. We would appreciate it a lot if somebody could have a look at it.

Our setup is not quite the same as the one from @LucasMarianoVieira, since we use TFLite in tfjs, which is still in an alpha phase. We did not want to start duplicate issues, but don't hesitate to let us know if you want us to create a new one!
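
For completeness, this is roughly the pattern we tried (a sketch only; resizedImageTensor is prepared elsewhere in our node, and the TFLite model's output may be a single tensor, an array, or a map of named tensors):

// Keep predict inside tf.tidy and dispose whatever comes out of it.
var outputs = tf.tidy(() => node.model.predict(resizedImageTensor));

// tf.dispose also accepts arrays/objects of tensors, so dispose the whole structure.
tf.dispose(outputs);
tf.dispose(resizedImageTensor);

console.log('tensors still tracked by tfjs:', tf.memory().numTensors);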

Kind regards,
Bart

@bartbutenaers

Hi,
Is there anybody who could help us with this? We have spent quite some time on getting object detection running on a Coral TPU stick via Node.js, and it works amazingly fast for a single image. Really cool! But due to this memory leak it is unusable for running object detection on live IP cam streams.

When I compare two heap dumps via the Chrome developer tools (with recording of allocation stack traces enabled), it shows no usable information. It only tells me that the memory was allocated before the profiler was started. Since my profiler was already running before I started processing images, I assume that means the array buffer allocations are happening outside of our Node.js process? But that is unfortunately above my pay grade...
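
In case it helps, the check I'm doing now is simply comparing the V8 heap against the total process memory (a sketch; my assumption is that if rss and external keep growing while heapUsed stays flat, the leak is in native allocations rather than in our JavaScript, which would also explain the empty heap-snapshot diff):

// Log the different memory classes every so often.
// heapUsed flat + rss/external/arrayBuffers growing => the leaked memory lives
// outside the V8 heap, so a Chrome heap snapshot can't see it.
function logMemory(label) {
    const m = process.memoryUsage();
    const mb = (x) => (x / 1024 / 1024).toFixed(1) + ' MB';
    console.log(label,
        'rss:', mb(m.rss),
        'heapUsed:', mb(m.heapUsed),
        'external:', mb(m.external),
        'arrayBuffers:', mb(m.arrayBuffers));
}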

Thanks!!

@ahmedsabie (Contributor)

I tried it locally and got constant memory usage; I haven't been able to reproduce it so far.

@LucasMarianoVieira (Author)

I tried again... I'm using Conda to make a virtual environment (but on several of the machines we use I install Node directly, with no virtual environment involved).

I also just updated Node to version 18.12.1 and npm to version 8.19.2, all running under Ubuntu 18.04. Same code as I presented above, now with TensorFlow.js version 4.1.0, and the memory leak is still there. It's pretty dramatic: in a matter of an hour or two it ends up using all the computer's memory.

@bartbutenaers

@ahmedsabie,
Sorry for the delay...
Very kind of you to try it!!

Do you perhaps know something we could try to find the root cause of the leak on our platform (a Raspberry Pi 4)? As mentioned above, the delta between two successive heap dumps doesn't contain useful information.

@LucasMarianoVieira: I assume you have already tried it, but could an explicit call to tf.dispose(imageT) make any difference for you?
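
What I had in mind is something like this variation of your loop (just a sketch: it recreates the input on each iteration so that both the input and the prediction can be disposed explicitly every time; buildInput is a hypothetical helper standing in for your decodeImage/expandDims/cast/normalize steps):

// Sketch: recreate and dispose the input on every iteration, so nothing from a
// previous iteration is still alive when predict runs.
// buildInput() is a hypothetical helper wrapping the preprocessing steps above.
while (true) {
    const input = tf.tidy(() => buildInput(buffer, dtype));
    const res = model.predict(input);

    tf.dispose(res);    // dispose the prediction outputs
    tf.dispose(input);  // explicit dispose of the input tensor

    console.log('tensors:', tf.memory().numTensors,
                'rss MB:', (process.memoryUsage().rss / 1024 / 1024).toFixed(1));
}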

@bartbutenaers

@LucasMarianoVieira,
I think my memory leak is quite different from yours, I'm afraid. It wasn't my intention to hijack your thread here...
In my case the memory leaks a few MB for every image, which suggests a complete image is leaking. I will investigate further whether the problem is perhaps somewhere in my own code, since you don't have it.

But I can confirm that I also have a slow memory leak like you have.
I added a return statement after the decode last evening:

var imageTensor = tf.node.decodeImage(inputImage);   
tf.dispose(imageTensor);
return;

Yesterday evening the memory usage on my Raspberry Pi 4 was between 377 MB and 399 MB: [screenshot of memory usage]

After it has been running overnight, the memory usage has now increased: [screenshot of memory usage]

If you only execute the decoding, is that enough to start leaking??

@LucasMarianoVieira (Author)

@bartbutenaers, oh indeed, I tried tf.dispose before.

With the same result. The memory seems to be leaking from within model.predict, at something on the order of a MB for every run, so it adds up quickly after a few hundred runs. If I don't use tf.tidy or tf.dispose on the tensors involved, then I get that memory leak, and more! XD

When I run just tf.node.decodeImage followed by a tf.dispose, I do see a very tiny increase in memory use of a few MB after a few tens of thousands of runs, so I think that's not significant or related.
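
For reference, this is roughly how I'm attributing the growth to model.predict (a sketch of the check, dropped inside the loop of my repro code above): tf.memory().numBytes stays flat while the rss delta around the predict call keeps coming out positive, which points at a native-side allocation rather than an undisposed tensor.

// Sketch: sample process memory right around the predict call only.
// If tf.memory().numBytes stays constant while rss keeps climbing here,
// the growth is not in tensors tracked by tfjs but in native allocations.
const before = process.memoryUsage().rss;

const res = model.predict(imageT);
tf.dispose(res);

const after = process.memoryUsage().rss;
console.log('rss delta for this predict call:',
            ((after - before) / 1024 / 1024).toFixed(2), 'MB',
            '| tfjs numBytes:', tf.memory().numBytes);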
