
Calling .destroy() when uploading multiple files does not work correctly #152

Closed
juona opened this issue Jun 10, 2019 · 17 comments

@juona

juona commented Jun 10, 2019

This escalated from a question I asked last week (#151). It's since been closed and my comment possibly went unnoticed, hence a new issue.

Using:

graphql-upload 8.0.6
apollo-server-express 2.0.5 or 2.6.2 - same behaviour with both, either way most likely not an Apollo issue

Mutation schema:

testUploads(uploads: [Upload]): Boolean

Resolver:

testUploads: (_, { uploads }) => {
  console.log("Mutation called");

  return Promise.all([
    uploads[0].then(file => {
      console.log("Upload 0 resolved");
      const stream = file.createReadStream();
      stream.resume(); // LINE 1
    }),
    uploads[1].then(file => {
      console.log("Upload 1 resolved");
      const stream = file.createReadStream();
      stream.resume(); // LINE 2
    })
  ]).then(() => true);
}

I am calling this code via the Insomnia client, using a manually formed multipart request:

{
  "operationName": null,
  "variables": { "uploads": [null, null] },
  "query": "mutation ($uploads: [Upload]) {  testUploads(uploads: $uploads)  }"
}

{ "0": ["variables.uploads.0"], "1": ["variables.uploads.1"] }

I am uploading two different files, their sizes do not seem to have any influence over this.
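For reference, the multipart body that clients like Insomnia build for this request can be reproduced with plain string concatenation, following the GraphQL multipart request spec. This is only an illustration: the boundary string, file names, and file contents below are made up, and a real client generates a random boundary and streams the actual file bytes.

```javascript
// Sketch of the multipart/form-data body for the two-file request above.
// Boundary, filenames, and contents are placeholders for illustration.
const boundary = "----test-boundary";

const operations = JSON.stringify({
  operationName: null,
  variables: { uploads: [null, null] },
  query: "mutation ($uploads: [Upload]) { testUploads(uploads: $uploads) }"
});

const map = JSON.stringify({ "0": ["variables.uploads.0"], "1": ["variables.uploads.1"] });

// One form-data part; the filename parameter marks the file parts.
const part = (name, value, filename) =>
  `--${boundary}\r\n` +
  `Content-Disposition: form-data; name="${name}"` +
  (filename ? `; filename="${filename}"` : "") +
  `\r\n\r\n${value}\r\n`;

// Per the spec, "operations" must come first, then "map", then the files.
const body =
  part("operations", operations) +
  part("map", map) +
  part("0", "file zero contents", "a.txt") +
  part("1", "file one contents", "b.txt") +
  `--${boundary}--\r\n`;
```

The request is then sent with Content-Type: multipart/form-data; boundary=----test-boundary.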

Here's the odd behaviour:

It does not matter what I have in LINE 1: it could be stream.resume(), stream.destroy(), a stream.pipe(process.stdout), or it could be commented out entirely (i.e. a stream is created via createReadStream() but nothing is ever done with it!) - none of this has any influence on how the code works.

Now, LINE 2 can also be just about anything: stream.resume(), stream.pipe(process.stdout), or commented out. Whatever the combination of LINE 1 and LINE 2, the code appears to work fine, i.e. the mutation gets called each time, the uploads resolve, and I get a response with the value true, as expected. I don't see any increase in memory usage. Perhaps the unused uploads are piling up somewhere in the filesystem, but right now that's beyond my concerns.

There is one important exception, however: if LINE 2 is stream.destroy(), then this happens:

  1. The first time I call the mutation everything seems to be working fine - both uploads resolve and I get my response.
  2. However, the second time I call it, the request just hangs - it does not even reach my HTTP loggers. I suspect it doesn't even hit Express itself, so not sure what is actually going on.
  3. Here's the fun part. If I cancel the request and then fire another one - it works okay again.
  4. If I then fire a fourth request, you guessed it - it hangs.

From here on, every other request succeeds, and every other one hangs.

The behaviour seems to at least partially defy statements previously made by @mike-marcacci:

  1. The bug only manifests itself when I use stream.destroy() on the second upload, so in some way the order of upload processing (or whatever this is) does matter.
  2. If I do nothing with both streams, there really is no problem, so it doesn't seem like I have to do anything with the streams at all, even after they have been created via createReadStream().

Turns out a properly working stream.destroy() is more important than I first thought. The new stream.pipeline() method, as well as the original pipe library, rely on destroy(). Also, while I could fully work around this problem by dropping .pipeline() and using stream.resume() to drain faulty streams, that's quite a bit of work and thinking in a complex stream-processing system, and I believe it may not be very efficient, nor is it appropriate semantically.

Thank you for your continued work and support!

@mike-marcacci
Collaborator

This is indeed very strange...

The library essentially works like this:

  1. When an HTTP request comes in, node buffers, reads and processes the headers, then stops. The remaining request bytes (i.e. the body) become part of the node IncomingMessage stream.
  2. The graphql-upload package then uses busboy to parse the multipart content of the stream, which creates a new sub-stream for each distinct upload that is part of the request.
  3. Each of these "upload streams" is piped to fs-capacitor, which buffers its contents to the filesystem.
  4. Your resolver code then asks fs-capacitor to create a new read stream from the buffer file whenever you need to use an upload.
  5. Once all resolvers are finished, graphql-upload tells all the capacitor instances that no more read streams will be opened, allowing them to garbage collect the buffer files once all open read streams are closed (either by ending or being destroyed).
  6. The graphql-upload middleware waits until the request is finished, and then tells express or koa to respond.

While no particular problem has jumped out at me, I do have a few observations regarding your test:

  • The fact that one request can cause a subsequent test to hang suggests that your testing library is NOT actually using HTTP requests, but is rather "mocking" them out and sending the mocked IncomingMessage directly to your express app. If this is the case, I would highly encourage having your tests listen on a local port and use fetch or node's http library to submit requests to your server.
  • The apollo-server-express server comes bundled with graphql-upload, so if you are adding your own instance, there's a chance that both are attempting to consume the request stream and delay responding accordingly, which might cause some odd behavior (running duplicate instances definitely isn't a goal of this project, so I don't think we've really thought through what would happen in this case). If you are using your own instance, consider removing it to see if the bundled one works.
  • Do you have any other express middleware that may be consuming the request stream, or telling it to respond? There may be an incompatibility with some other library.

@juona
Author

juona commented Jun 10, 2019

Hey,

Thanks for taking a look. A few things:

  1. These are not tests - I literally have a server running and listening on a local port, and Insomnia is an external HTTP client, much like Postman; it just allows me to skip all the UI and work directly with my back-end code.
  2. I disabled the built-in instance while using the library stand-alone. I have just tried switching back to the built-in instance and the behaviour is the same.
  3. I only have basic stuff like CORS, body-parser, some http logging, etc. I have tried to switch all of it off and the issue is still there.

Matter of fact, I have made this: https://github.com/juona/test-graphql-upload.

Perhaps you could clone my repository and see if you are able to reproduce the issue? The instructions and various test scenarios are in the readme.

BTW, I have noticed something peculiar: using cURL does not display this behaviour, but using Chrome does. Calling the same API from a separate NodeJS process works fine, but my original REST client does not. This is particularly odd, yet still seems at least somehow related to the uploads functionality. I hope you can help me!

@juona
Author

juona commented Jun 11, 2019

Some new important details.

First

The issue only presents itself when I make server calls from exactly the same client. E.g. every other call fails (and every other succeeds) if I keep submitting multipart requests from Chrome only, or from Insomnia only. However, if I submit a request from Chrome and then submit the next one from Insomnia, both work correctly (since these two requests would be "firsts" for both clients). If I then submit a second request from Chrome, it will hang; the same happens if I then try a request via Insomnia. The order of Insomnia vs Chrome does not matter. If I open an incognito window in Chrome, its behaviour is likewise separate from that of the regular Chrome window (and, of course, Insomnia). Basically, here's an example of how the server calls might go:

  1. Call from Insomnia - success
  2. Call from Chrome incognito - success
  3. Call from Insomnia - hang
  4. Call from Chrome regular - success
  5. Call from Insomnia - success
  6. Call from Chrome regular - hang
  7. Call from Insomnia - hang
  8. Call from Chrome incognito - hang

While it may seem like this is drifting away from graphql-upload, I must stress that not using stream.destroy() fixes this problem entirely, which at the least is a clue to what's happening... :/

Second

Turns out, after the first (successful) request, the next request will hang no matter what this request actually is. It does not have to be a multipart upload or a graphql mutation. Even the simplest GET "/test" will hang. Actually, while the request leaves the client successfully, it never really reaches the Express server - it is totally unclear what happens.

Third

These behaviours are not 100% reliable. I'd say the issue manifests 95% of the time, yet occasionally a request that's supposed to hang will not hang, not sure what influences this.

@juona
Author

juona commented Jun 11, 2019

I did some CPU profiling and I can clearly see that the requests which hang never even reach the server. My guess is that if some stream is destroyed, the request is not properly processed, which leads to some confusion on certain clients. Any ideas?

@mike-marcacci
Collaborator

@juona I quickly checked out that demo repo and couldn't see anything that you are doing wrong. From my experience, different HTTP clients can have surprisingly different behavior when it comes to handling connection reuse and early returns. In fact, graphql-upload goes out of its way to wait for a request to finish before responding because of odd (and possibly incorrect) HTTP behavior in all major browsers and load balancers when not using HTTP persistent connections (i.e. when using Connection: close). This is "step 6" above.

I'll try to look at your demo repo in more depth when I get a chance.

@juona
Author

juona commented Jun 12, 2019

Actually Connection: keep-alive was the only thing to raise a suspicion, and lo and behold, setting Connection: close in the response actually seems to fix the problem.
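For anyone wanting to try the same workaround, a minimal Express-style middleware that forces Connection: close might look like this. This is only a sketch of the workaround described above, not something graphql-upload ships, and it trades away keep-alive for every response:

```javascript
// Sketch: force the server to close the connection after each response,
// sidestepping the keep-alive interaction with half-consumed requests
// described in this thread.
function connectionClose(req, res, next) {
  res.setHeader("Connection", "close");
  next();
}
```

Registered early, e.g. app.use(connectionClose), before the GraphQL middleware.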

@juona
Author

juona commented Jun 12, 2019

A further implication of this is that if the upload stream errors out, then I observe this behaviour as well. However, it must be said that I don't know a good way to fail an upload, so I do it programmatically, like this: stream.emit("error", new Error("Test")). It is enough to replace LINE 2 with this. Perhaps it is not entirely appropriate?

@juona
Author

juona commented Jun 12, 2019

It seems that a single upload is enough to reproduce the problem after all. Also, some extra tests allowed me to observe even more odd behaviours.

The issue is definitely NOT reproducible when using stream.resume(), stream.pipe(process.stdout), doing nothing with the stream, or emitting an error via stream.emit("error", new Error("Message"))*. The issue is only reproducible when using stream.destroy().

Setting Connection: close does not seem to have any effect on the problem**.

Insomnia and Chrome

Super interesting - the upload behaviour is quite clearly influenced by the upload size when using the Insomnia REST client or Chrome.

  • When the upload is small (< ~200kB), the issue is basically irreproducible using whatever client.
  • When the upload gets larger (approaching ~300kB), the issue starts quickly creeping in. I will get a number of successful requests (could be any number from 1 to 10 and possibly even more, not sure) before getting one that does not respond.
  • At ~400kB I sometimes get the original behaviour (every other request hangs), and sometimes I get two successful responses before getting a hanged one.
  • Same behaviour with a 5MB file, so the issue appears very quickly when the file size increases from ~250kB to ~400kB and then stays the same for a while.
  • When the file size reaches ~10MB something new happens! The server code is called just fine, stream.destroy() is called, my resolver resolves and yet I do not get a response! This behaviour is not super consistent, every once in a while I do get a successful response.

cURL

The issue remains impossible to reproduce when calling the server via cURL, except when the file size is large (~ 10+MB) - then, as expected, no response is returned to cURL.

Another NodeJS server

The issue is not reproducible at all when calling the server via another basic local express app on NodeJS, but occasionally I will receive this error in the response: { Error: write EPIPE at WriteWrap.afterWrite [as oncomplete] (net.js:788:14) errno: 'EPIPE', code: 'EPIPE', syscall: 'write' }. Judging by the logs, the resolver executes correctly but the response is an error. Since I am not consuming the stream, I can't comment any further than this.


My code for a single upload (also available in the same repository, under v2.0.0):

testUpload(upload: Upload): Boolean
testUpload: (_, { upload }) => {
  console.log("Mutation called");

  return upload
    .then(file => {
      console.log("Upload resolved");
      const stream = file.createReadStream();
      stream.destroy(); // LINE 1
    })
    .then(() => true);
}
{
  operationName: null,
  variables: { upload: null },
  query: "mutation ($upload: Upload) {  testUpload(upload: $upload)  }"
};

{ "0": ["variables.upload"] };

(*) This is not the case when there are multiple uploads. As mentioned in my previous comment, in that case emitting an error produces the same behaviour!
(**) Not the case with multiple uploads. Using Connection: close with multiple uploads seemed to have a tremendous healing effect.

@mike-marcacci
Collaborator

mike-marcacci commented Jun 12, 2019 via email

@mike-marcacci
Collaborator

I ran out of time today to get to this, but it's high on my list for tomorrow. What version of node is this running on? There are many very subtle differences between implementations of the stream library in the supported versions.

@juona
Author

juona commented Jun 13, 2019

No worries, I managed to successfully implement work-arounds in my original code, so the only remaining failure scenario is when the upload stream itself emits an error (since this destroys the stream). And even then the severity of the problem is rather low, since it only really affects the client, so I don't see any security implications either.

The node version is LTS v10.14.1.

@mike-marcacci mike-marcacci self-assigned this Jun 14, 2019
@mike-marcacci
Collaborator

@juona thanks again so much for the intensely diligent sleuthing and fantastic demo repo. I was able to find the exact problem, which happens to be this line in apollo-server-express.

Here's what's happening:

  1. The request begins; node processes the headers; graphql-upload processes the map and creates upload promises.
  2. Apollo is called and in turn calls the resolvers.
  3. The resolvers finish before the request stream has finished (an error might have occurred, an upload might be intentionally ignored, etc).
  4. Apollo bypasses express: instead of calling res.send(graphqlResponse), it uses the raw ServerResponse stream methods res.write(graphqlResponse); res.end().

This causes node to try to send the response over the connection while the request is still pending, and puts the HTTP transaction in a state that is handled very poorly by node, Chrome, and most other browsers & load balancers.

This is precisely why we've forced express to wait before actually sending the response.

Apollo should be following the express patterns, rather than using the escape hatch when it isn't necessary; if they did need to use the ServerResponse methods directly, they should at least do it the right way, and wait for the request to finish.

A Mini Rant About HTTP w/ Node

All the tutorials you've ever read are wrong. They depend on a race condition that is never met in a simple "hello world" example but inevitably manifests as impossibly elusive behavior like the one in this issue. Whenever you write a response, you should really be doing the following, while consuming the request stream:

function send(data) {
  if (request.complete) {
    // The request stream has already been fully consumed; safe to respond.
    response.end(data);
  } else {
    // Otherwise, hold the response until the request finishes, and make
    // sure its remaining bytes are actually consumed.
    request.on("end", () => response.end(data));
    request.resume();
  }
}

This is still necessary when using any of the major HTTP frameworks, including koa and express, although they make it possible to encapsulate this behavior in middleware (which is what we do here in this repo).

I'll bet that fewer than 0.1% of node devs realize that this is the case, but it is. 🤷🏻‍♂️

mike-marcacci added a commit to mike-marcacci/graphql-server that referenced this issue Jun 14, 2019
We should not bypass the express conventions here. Doing so prevents other middleware from working as intended.

This includes `graphql-upload`, which is currently unable to force the response to wait for the request to be fully consumed, and causes severe bugs, including locking a browser out of subsequent requests to the server.

See:
jaydenseric/graphql-upload#152 (comment)
@mike-marcacci
Collaborator

I've opened a PR to fix this w/ Apollo.

@jaydenseric jaydenseric removed the bug label Jun 14, 2019
@jaydenseric
Owner

As this is an issue with Apollo Server, the bug label can be removed and issue can be closed. For visibility we can pin it for a while.

@mike-marcacci
Collaborator

For anyone else who stumbles across this after Apollo merges my PR, note that the fix only applies when bringing your own instance of graphql-upload; the middleware bundled in all apollo-server- packages lacks any sort of waiting.

You'll want to follow this issue I just created over there.

@juona
Author

juona commented Jun 14, 2019

Hello there,

Thank you @mike-marcacci so much for the investigation, the detailed explanation, and the actions taken. It's not the first time you've had to deal with problems in apollo-server - sorry about that. I was kinda hoping the defect would be somewhere in the upload lib, as seeing how you two work, I could've probably expected a quick solution : ]

One more question. You mentioned that with fs-capacitor you buffer the uploads into the file system. You did mean to use the word "buffer", right? I.e. you are not really saving the whole files to the file system and then reading from them again, you are merely buffering the stream?

I never really gave any thought to this, but now I was wondering why fs-capacitor is not helping mitigate this problem, and that is exactly why, yes? Hope this is not a tremendously stupid question.

P. S. I have already learned to always use my own instance of graphql-upload.

P. P. S. Now I am one of the 0.1% too. Thanks!

@mike-marcacci
Collaborator

@juona RE fs-capacitor, both are true. The reason that library exists at all is that it's possible that your resolvers need to use uploads in a different order than they're sent in the HTTP payload, and it's also possible for you to use the same upload variable multiple times in your GraphQL query.

Because of this, we have to potentially buffer whole uploads. Of course, we can't rely on these fitting into memory, and so temporary files are used instead. If your resolvers are able to consume the uploads as they arrive, they receive the content without waiting for the whole thing to be buffered; a new read stream can also be attached at any point in the buffer's lifecycle (and will start by consuming data already buffered to the file). The library is very good about garbage collecting these, and makes sure to use the temporary files directory, so any files that were not removed by the library (due to, say, a power failure) are removed by the OS.
