Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking change: #7951 (tfjs-v4.11.0) #7974

Closed
AmitMY opened this issue Sep 24, 2023 · 20 comments
Closed

Breaking change: #7951 (tfjs-v4.11.0) #7974

AmitMY opened this issue Sep 24, 2023 · 20 comments
Assignees
Labels
comp:webgpu type:bug Something isn't working

Comments

@AmitMY
Copy link

AmitMY commented Sep 24, 2023

  • TensorFlow.js version (use command below): tfjs-v4.11.0

Describe the current behavior
Old models fail now that #7951 is merged. Error:

ERROR Error: Uncaught (in promise): Error: {this.name} outputs 1 tensors but only one mask
Error: {this.name} outputs 1 tensors but only one mask

(This is not the case with v4.10.0)

Describe the expected behavior
Minor updates should not break functionality

Standalone code to reproduce the issue
Same as #6699

@AmitMY AmitMY added the type:bug Something isn't working label Sep 24, 2023
@gaikwadrahul8 gaikwadrahul8 self-assigned this Sep 25, 2023
@gaikwadrahul8
Copy link
Contributor

gaikwadrahul8 commented Sep 25, 2023

Hi, @AmitMY

Thank you for bringing this issue to our attention and I attempted to replicate the same issue from my end and I'm getting below error message so to confirm, are you also getting the same error message or Am I doing something wrong here, if so please help me with steps to replicate the same issue from my end which will help us to investigate this issue further.

I'm following this official documentation to replicate the same issue from my end. Thank you!

I received the below error message :

image

@AmitMY
Copy link
Author

AmitMY commented Sep 26, 2023

I'm getting the same issue now.

Here, I wrote a minimal reproduction:
What I learned, is that the problem goes away if I don't resetDropout (I don't want dropout in inference)

(Here's a directory with a new model, and an HTML file. to reproduce, I ran http-server .)
https://drive.google.com/file/d/1Y8OIdTetbDn0mcppE_sRL8bRP-vh_1tP/view?usp=drive_link

<!DOCTYPE html>
<html lang="en-US">
<head>
    <meta charset="utf-8" />
    <title>TensorFlow.js browser example</title>

    <!-- Load TensorFlow.js from a script tag -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@latest/dist/tf.min.js"></script>
</head>
<body>
<h1>TensorFlow.js example</h1>
<h2>Open the console to see the results.</h2>
<script>
    function resetDropout(layers) {
        for (const layer of layers) {
            // For Sequential models, the layers are in the layers property
            if (layer.layers) {
                resetDropout(layer.layers);
            }

            // For TimeDistributed models, the layer is in the layer property
            if (layer.layer) {
                resetDropout([layer.layer]);
            }

            // If the layer is a dropout layer, reset the rate
            if (layer.rate) {
                layer.rate = 0;
            }
        }
    }

    async function main() {
        const model = await tf.loadLayersModel('model_new.json');
        resetDropout(model.layers);

        const tensor = tf.zeros([1, 1, 256, 256, 3]).toFloat();
        tensor.print();

        // Must apply model in training=True mode to avoid using aggregated norm statistics
        model.apply(tensor, {training: true}).print();
    }

    main();
</script>
</body>
</html>

@gaikwadrahul8
Copy link
Contributor

gaikwadrahul8 commented Sep 27, 2023

Hi, @AmitMY

Thank you for providing the minimal code example to replicate the same issue from my end and I'm getting below error message so to confirm, Are you also getting the same error message or different error message if possible could you please share your error log or screenshot which will help us to investigate this issue further. Thank you!

For reference I have added screenshot below :

image

When I tried without resetDropout option that error is not occurring and it's working as expected if I'm not wrong :

Tensor
    [[[[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       ...
       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]]]]
print.js:34 Tensor
    [[[[[-0.5485492, 0.1209364 , -0.5575122],
        [-0.4198668, 0.3109047 , -0.3952827],
        [-0.4475051, 0.3676944 , -0.4088269],
        ...,
        [-0.3347363, -0.1958253, -0.5153912],
        [-0.3527599, -0.2083139, -0.5241337],
        [-0.4063931, -0.0599973, -0.5099336]],

       [[-0.4308985, 0.2829142 , -0.4243529],
        [-0.3667891, 0.3545043 , -0.3479016],
        [-0.4654999, 0.3381675 , -0.4241154],
        ...,
        [-0.1721301, -0.3778005, -0.4985313],
        [-0.1661717, -0.3855677, -0.4854655],
        [-0.2940664, -0.1737114, -0.4872297]],

       [[-0.485551 , 0.2725164 , -0.4622128],
        [-0.4397161, 0.3448741 , -0.4254563],
        [-0.5056862, 0.3295551 , -0.469498 ],
        ...,
        [-0.185382 , -0.3900234, -0.5167618],
        [-0.192379 , -0.3988956, -0.5107889],
        [-0.2973868, -0.1739733, -0.5081293]],

       ...
       [[-0.4989672, 0.3559542 , -0.4790567],
        [-0.4982536, 0.3546511 , -0.4794993],
        [-0.4979915, 0.3549074 , -0.478335 ],
        ...,
        [-0.5017183, 0.3484358 , -0.4759241],
        [-0.4635063, 0.3229676 , -0.4686852],
        [-0.4900923, 0.3222119 , -0.4737524]],

       [[-0.4948567, 0.3582715 , -0.4819818],
        [-0.4974912, 0.3562227 , -0.4809256],
        [-0.4989477, 0.3557112 , -0.4791932],
        ...,
        [-0.4984515, 0.3406397 , -0.4796974],
        [-0.4526571, 0.2930876 , -0.4691183],
        [-0.4757496, 0.2983395 , -0.4714819]],

       [[-0.494108 , 0.3538587 , -0.4799017],
        [-0.493968 , 0.3533177 , -0.4801645],
        [-0.4948561, 0.3538442 , -0.4791983],
        ...,
        [-0.499143 , 0.3447358 , -0.4813157],
        [-0.4782982, 0.3105071 , -0.4720287],
        [-0.4577081, 0.2640458 , -0.4753132]]]]]

@AmitMY
Copy link
Author

AmitMY commented Sep 27, 2023

Yes, the error I am getting is the same:

ERROR Error: Uncaught (in promise): Error: {this.name} outputs 1 tensors but only one mask
Error: {this.name} outputs 1 tensors but only one mask

This is happening with 4.11.0 but not with 4.10.0

Just so it is clear:
When running a model in predict mode there is no dropout, but I have to use training for this model because of gathered training statistics. This is another known issue.
But in any case, this is a breaking change.

@gaikwadrahul8
Copy link
Contributor

Hi, @AmitMY

Thank you for the confirmation, I really appreciate your findings and observations which will help us to investigate this issue further. We will have to dig more into this issue and will update you soon. Thank you!

@Linchenn
Copy link
Collaborator

Hi @haoyunfeix , I saw you fixed #6699 before, could you help check this again?

@gyagp
Copy link

gyagp commented Sep 30, 2023

Yunfei already left Intel, so he may not double check this. We will take a look once we get back to office after holidays (Oct 7).

@Linchenn
Copy link
Collaborator

Thank you Yang and take your time.

@mattsoulanille mattsoulanille self-assigned this Oct 3, 2023
@mattsoulanille
Copy link
Member

mattsoulanille commented Oct 3, 2023

This is likely a bug in #7951. I'm trying to reproduce it, but I'm running into a different error than expected:

Uncaught (in promise) Error: Identity matrix initializer can only be used for 2D square matrices.
    at initializers.js:241:15
    at engine.js:469:20
    at e.value (engine.js:480:19)
    at e.value (engine.js:467:17)
    at e8 (globals.js:192:17)
    at n.value (initializers.js:239:12)
    at n.value (topology.js:1288:35)
    at n.value (recurrent.js:1703:24)
    at n.value (recurrent.js:516:17)
    at topology.js:981:14

Did the model file shared at that drive link change? The sha256sum I'm seeing for the zip file is ab24c0af3c44d6268b8914a346bd8917b17984245e9ffd6b0ee5d751cefd7167. Additionally, I'd be interested to know if this only fails with WebGPU, since if it's a bu in #7951, it's probably going to affect all backends.

@AmitMY
Copy link
Author

AmitMY commented Oct 4, 2023

I redownloaded the zip file and can confirm the same hash:

shasum -a 256 web_model.zip
ab24c0af3c44d6268b8914a346bd8917b17984245e9ffd6b0ee5d751cefd7167

However, the error I am seeing is still

Uncaught (in promise) Error: {this.name} outputs 1 tensors but only one mask

Furthermore, this repro is not using WebGPU, but WebGL:
image

@axinging
Copy link
Contributor

axinging commented Oct 7, 2023

I tried webgpu, webgl, cpu, all complain same errors:

topology.js:1409 Uncaught (in promise) Error: {this.name} outputs 1 tensors but only one mask
    at n.value (topology.js:1409:13)
    at topology.js:1013:16
    at nue (common.js:48:20)
    at n.value (topology.js:967:12)
    at xce (executor.js:341:25)
    at container.js:746:14
    at engine.js:469:20
    at e.value (engine.js:480:19)
    at e.value (engine.js:467:17)
    at e8 (globals.js:192:17)

Revert this change (#7951) works well on webgpu, webgl, cpu.

Below code is required to change backend before predict:

        await tf.setBackend('webgl'); //webgpu, cpu
        console.log(tf.getBackend());
        await tf.ready();

I drafted a fix here: #7993

@gaikwadrahul8
Copy link
Contributor

Hi, @AmitMY

I apologize for the delayed response. I see that PR #7993 has been merged and I have tested the same code snippet, which is now working as expected. could you please confirm this from your end once ? if so, please feel free to close this issue ? Thank you!

@AmitMY
Copy link
Author

AmitMY commented Oct 18, 2023

i can confirm that i am using 4.12.0
image

but that the bug remains
image

My reproduction still reproduces the issue in 4.12.0
#7974 (comment)

@axinging
Copy link
Contributor

axinging commented Oct 19, 2023

@AmitMY, it seems 4.12.0 doesn't include change #7993.
You can see 4.12.0 commits history here:
https://github.com/tensorflow/tfjs/commits/cb5cd829909a07cdbd14975f2169a78e9213bf4a
Also, you can view it here: https://github.com/tensorflow/tfjs/releases/tag/tfjs-v4.12.0

So @gaikwadrahul8 @Linchenn, Could you help why #7993 is not included in v4.12.0?

BTW, @AmitMY, if you want to try this feature on the latest code, You can download TFJS code and build, then put below file under fold tfjs\e2e\benchmarks\local-benchmark:


<head>
  <title>TensorFlow.js Op demo</title>
</head>

<body>
  <h2>TensorflowJS Model demo</h2>
  <script src="loader.js"></script>
  <script>
    'use strict';

    function resetDropout(layers) {
      for (const layer of layers) {
        // For Sequential models, the layers are in the layers property
        if (layer.layers) {
          resetDropout(layer.layers);
        }

        // For TimeDistributed models, the layer is in the layer property
        if (layer.layer) {
          resetDropout([layer.layer]);
        }

        // If the layer is a dropout layer, reset the rate
        if (layer.rate) {
          layer.rate = 0;
        }
      }
    }

    async function main() {
      await tf.setBackend('webgpu');
      console.log(tf.getBackend());
      await tf.ready();
      tf.env().set('WEBGPU_CPU_FORWARD', false);
      const model = await tf.loadLayersModel('model.json');
      resetDropout(model.layers);

      const tensor = tf.zeros([1, 1, 256, 256, 3]);//.toFloat();
      tensor.print();

      // Must apply model in training=True mode to avoid using aggregated norm statistics
      model.apply(tensor, { training: true }).print();
      console.log("end");
    }

    (async function () {
      let localBuild = ['core', 'cpu', 'webgl', 'webgpu', 'tfjs-converter', 'layers'];
      await loadTFJS(localBuild);
      main();
    })();
  </script>
</body>

</html>

@AmitMY
Copy link
Author

AmitMY commented Oct 23, 2023

since @gaikwadrahul8 says it works, i'll just wait for it to be released. Do you know why it was not released, and if it would be in 4.13.0?

@mattsoulanille
Copy link
Member

Sorry, we're working on a new RC based release structure, and our branch cut for 4.12.0 was before the fix was merged. This will be included in 4.13.0 next week.

@peacefulotter
Copy link

Any news on this? 4.13.0 seems to not be released yet and it's been more than a week. Thanks!

@tylersalminen
Copy link

Any news on this? 4.13.0 seems to not be released yet and it's been more than a week. Thanks!

ditto, waiting for new release with this fix so can enable dropout in our training models

@AmitMY
Copy link
Author

AmitMY commented Nov 8, 2023

No bug in 4.13.0 🥳

@AmitMY AmitMY closed this as completed Nov 8, 2023
Copy link

Are you satisfied with the resolution of your issue?
Yes
No

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:webgpu type:bug Something isn't working
Projects
None yet
Development

No branches or pull requests

8 participants