This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[Error 500] "Socket Hang Up" Randomly Occurring on any Routes in Production Mode #51605
Comments
This issue will be easier to assess if you provide a simple project that reproduces it. Nevertheless, based on your stack trace, it looks like you are connecting to a TLS/SSL socket (which I doubt Next.js handles itself; it is probably handled by one of your libraries). Based on your dependencies, my guess is that you are connecting to a database to authenticate a user. This is a wild guess, but I think the connection between your web server and your database is being closed (or is unstable, or anything in between). This is already out of scope for Next.js. As a quick fix, you can check the connection to your database, or simply restart your Next.js server if possible, because that re-instantiates the database variable and the database connection. |
Another thing I overlooked: shouldn't a database call trigger Perhaps the socket hang up comes from the Vercel side? They handle HTTPS on their side, and if the client suddenly closes the connection while the request isn't complete, maybe it throws an error? But I don't think that's the case either. If it were, many Vercel users would have reported it already. Maybe you could provide a simple reproduction, and I can see whether I can reproduce it myself on Vercel. |
We cannot recreate the issue with the provided information. Please add a reproduction in order for us to be able to investigate. Why was this issue marked with the
|
We're facing the same issue, also shortly after upgrading to Next 13. Logs are looking like this: and then thousands of errors: This is random but never happens on fresh instances; so far every occurrence (we have had this problem 3 times) came days after a deploy. It looks like once the socket breaks it can't be recreated? I saw @SebastienSusini is using Vercel; I'm using AWS ECS tasks. |
By the way, I can't share the whole project, and since it can happen after a week and a million requests, I'm not sure how easy it would be to recreate. Maybe an easier route would be to enable some debug logs? @SebastienSusini do you have heavy traffic on that project? Do you see these errors only some time after a deploy, or can they happen randomly a few minutes or hours after a deploy? When did you upgrade to 13, and how many of these incidents have you had? |
I'm also seeing this same error, and it occurs 1/10 requests reliably on production, but cannot reproduce it locally with
Notably, my app renders the page, but occasionally throws this error in |
Seems possibly related to #49587 |
@SebastienSusini I also isolated this issue in my app to have started in next |
13.4.12 with the same problem man |
We are also running into this issue, with the same circumstances as described before:
Error: socket hang up
at connResetException (node:internal/errors:705:14)
at Socket.socketOnEnd (node:_http_client:518:23)
at Socket.emit (node:events:525:35)
at Socket.emit (node:domain:489:12)
at endReadableNT (node:internal/streams/readable:1358:12)
at processTicksAndRejections (node:internal/process/task_queues:83:21) {
code: 'ECONNRESET'
}
We will now downgrade to For info, we are running on AWS EC2 instances. @0xadada do I understand correctly that in your case following requests do get handled? For us it seems to completely stop the server from being able to handle any requests after that point. |
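The trace above carries two stable markers: the message `socket hang up` and the code `ECONNRESET`. For anyone who wants to separate these errors from other failures in their own logging, here is a minimal sketch. The helper name `isSocketHangUp` and the `NodeLikeError` shape are made up for illustration and are not part of Next.js or Node.js.

```typescript
// Fields we read from a caught value. Both are optional because the value
// reaching a `catch` block can be anything, not only an Error.
interface NodeLikeError {
  code?: unknown;
  message?: unknown;
}

// Returns true when the value looks like the error reported in this thread:
// code 'ECONNRESET' together with a "socket hang up" message.
function isSocketHangUp(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as NodeLikeError;
  return (
    code === "ECONNRESET" &&
    typeof message === "string" &&
    message.includes("socket hang up")
  );
}

// Hypothetical sample that mimics the shape seen in the trace.
const sample = Object.assign(new Error("socket hang up"), { code: "ECONNRESET" });
console.log(isSocketHangUp(sample)); // true
```

This only classifies an error that has already been caught or logged; it does not explain or prevent the hang up.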
Socket hang ups do occur from time to time when the client aborts the connection, and it seems that after the abort Next.js is still actively waiting for incoming TCP packets. There are a few candidates for where this error could occur, but since it happens in production mode, where incoming traffic may be heavy and hard to reproduce at small scale, pinpointing the exact part is hard. Nonetheless, I have some rough ideas about where the problem(s) could be, based on the effects some people mentioned.
In both cases, Next.js uses http-proxy to forward the requests between processes. I might write a proposal to rewrite the IPC communication between Next.js processes to handle requests better (support for IPC callbacks, passing the req/res pair to another process, etc.). @dbrxnds Regarding the first point, does the subsequent request after the error fail immediately, or is there a timeout before it fails? @0xadada I need to confirm: are you deploying this on your own machine or on shared hosting (Vercel, etc.)? Do you use appDir, pageDir, or a combination of both? |
Appreciate the well-written response, @NadhifRadityo. I am fairly certain subsequent requests just hang, at least for a good while. We end up getting an error response saying "the upstream server returned an invalid response", but I assume that is just the load balancer or some other part doing its thing. Requests just remain pending in the network tab until that point. |
This seems unlikely, but is there a chance that your Next.js project does I/O synchronously or runs heavy synchronous tasks? Also, I need to confirm: the request hangs for any route, right (dynamic page, static page, static resources)? And to be sure, can you list the processes with their arguments, before and after the error? Search something like And for the record, do you use appDir only, pageDir, or a combination of both? I will try to eliminate IPC communication first, as it makes the most sense in my opinion. I'll try manually killing the worker process and see if I can reproduce the problem. |
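One rough way to check the question about heavy synchronous work is to measure how late a timer fires: if something blocks the event loop, the timer callback is delayed by about that long. The sketch below is only an illustration. `measureEventLoopLag` and `blockFor` are hypothetical helpers, and the busy loop stands in for whatever synchronous task an app might run.

```typescript
// Resolves with the number of milliseconds by which a timer fired late.
function measureEventLoopLag(intervalMs: number): Promise<number> {
  const start = Date.now();
  return new Promise((resolve) => {
    setTimeout(() => {
      const lag = Date.now() - start - intervalMs;
      // Clamp to zero because timers can appear to fire a hair early.
      resolve(Math.max(0, lag));
    }, intervalMs);
  });
}

// Simulates a heavy synchronous task by spinning for `ms` milliseconds.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy wait: nothing else can run on this thread meanwhile
  }
}

const lagPromise = measureEventLoopLag(10);
blockFor(50); // the 10 ms timer cannot fire until this returns
lagPromise.then((lag) => console.log(`timer fired about ${lag} ms late`));
```

In a real server this measurement would run periodically, and a consistently large lag would point toward synchronous work rather than the network.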
@NadhifRadityo yes, I've got a Next.js process running in a shared Docker container with a Ruby web server. The Ruby web client makes HTTP requests to our Next process on localhost running |
@0xadada I'd like to cover everything because you get a different stack trace. Or perhaps you also get socket hangups too? |
I do not use appDir, we use
Subsequent requests are always handled. Starting in next
any time either error A or B occurred (~10% of all requests), this would appear in the production logs:
|
@0xadada Interesting... It seems like your issue is a bit different but I think it correlates. I think this could be just the client aborting the connection and next.js still actively waiting for incoming TCP packets. I'll try to reproduce it by making a long request and abort it. See if I can get a reproduction. |
This error sometimes happens in development mode in version 13.4.12: after a lot of refreshes it stops working and has to be restarted from the terminal. |
It does not.
Correct, any route, API routes. Everything.
Pages directory with API routes
I will attempt to do this once it occurs again! |
@0xadada I managed to reproduce your problem and have created a new issue for it. It seems your project has a memory issue that restarts your render worker repeatedly. This is also consistent with only 10% of your requests getting these errors, since every request grows memory a fair bit until it reaches a threshold. See #53353 for my detailed explanation. (You can confirm this by checking whether the render worker PID changes.) I should also mention that Next.js currently has a memory problem (#46756, #49929, #48748, #49929). Perhaps you could try reverting to an earlier version and see if it helps. Edit: It might also be something else in your project that kills your render worker. High memory usage is just my assumption, because other people are having these issues as well. |
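The PID check suggested above can be reduced to counting how often the sampled PID differs from the previous sample. The sketch below assumes the PIDs have already been collected at regular intervals (for example from a process list). The function name and the sample values are hypothetical.

```typescript
// Counts how often consecutive PID samples differ, i.e. how many times the
// observed process was replaced between two samples.
function countPidChanges(pids: number[]): number {
  let changes = 0;
  for (let i = 1; i < pids.length; i++) {
    if (pids[i] !== pids[i - 1]) changes++;
  }
  return changes;
}

// Hypothetical samples of the render worker PID, taken at regular intervals.
const samples = [4121, 4121, 4987, 4987, 5330];
console.log(countPidChanges(samples)); // 2
```

A count that keeps climbing while the server is under load would be consistent with the worker being restarted, though it does not by itself show that memory is the cause.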
@mthmcalixto I think this is caused by the high memory usage problem in current Next.js versions. #46756 describes that after editing files and refreshing, memory usage grows rapidly because of recompilation. Because of the high memory usage, the worker restarts and ongoing requests are aborted. |
Our team has reverted to |
Glad to see this thread since I am also experiencing frequent (seemingly random) "socket hang up errors" and it was quite hard to debug the root cause. I think my use case is different so I will add my error scenario just in case it makes the problem exploration easier.
Setup
I am developing my next app in a Docker compose composition. The next app runs in the node 18 alpine image as per the Docker examples provided. Other containers in the composition are a postgres db, Prisma (studio), cerbos and strapi. I am developing on MacOS (Mac with M2). I have not yet deployed my app to production and only use development mode currently.
Answers to your questions
Things I have tried
Hope this helps! Following this issue and reverting to next 13.4.5 for now... |
Indeed, I noticed Also, this issue is consistent with people using prisma in their project. I will try to reproduce it when I have the time (I have been busy with college). |
I would like to add my experience with this issue
So, I don't think the issue is related to Prisma or NextAuth |
After updating to Prisma 5.3.0, the issue seems to be resolved (in development) so it might be that the issues are unrelated, but I will continue to monitor. |
We are using Next.js 13.4.9 and have been facing this issue consistently on our Vercel deployment. We don't use any of the separate packages mentioned above. It happens quite often for us. For certain APIs on the Pages Router, it happens on almost every third API call. It's as if the request doesn't even reach the handler, since no internal logs are printed alongside it. Error log from Vercel -
|
We could "solve" or at least work around the socket hang up errors by replacing node with bun. |
Issue happens for us as well.
|
It does reproduce for us as well. Sometimes in the production with "next": "^13.4.1".
|
I have an application on Next.js 13.4.12 which is deployed on AWS EC2. After I tried to upgrade Next.js to version 14.0.0, the AWS health check task failed to execute. The AWS CloudWatch log shows I then went into the ECS container and executed the following command:
/app $ wget -qO- localhost:3000
wget: can't connect to remote host (127.0.0.1): Connection refused
I found that localhost:3000 is unreachable. Then I changed my health check command parameters to:
diff --git a/.aws/ecs/dev/task-definition.json b/.aws/ecs/dev/task-definition.json
index abf723d..cf5103d 100644
--- a/.aws/ecs/dev/task-definition.json
+++ b/.aws/ecs/dev/task-definition.json
@@ -45,7 +45,10 @@
"interactive": null,
"healthCheck": {
"retries": 3,
- "command": ["CMD-SHELL", "wget --spider -q localhost:3000/ || exit 1"],
+ "command": [
+ "CMD-SHELL",
+ "wget --spider -q ${HOSTNAME}:3000/ || exit 1"
+ ],
"timeout": 10,
"interval": 5,
"startPeriod": 180
By now, the health check runs successfully and no longer gives the |
In our case it is not random. There is a feed.xml route in our app that sends a request to S3 to get the actual feed. The file is about 100 MB; the request fails with "socket hang up" and then breaks the entire app. There is no way to handle the error in a try-catch. I've tried almost everything to get rid of this error, including using node https or other third-party HTTP clients instead of the fetch API. It seems the problem is not related to Next; it's a Node.js or maybe undici-related problem. I also tried some other node images, but I think a stable node image could fix this issue. |
I faced the same issue when I tried to upgrade Node.js version to 20 from 16, and it only occurs in production environment
However, after upgrading Next.js to the latest version (14.0.3), the issue seems to be gone. |
"next": "^14.0.3" the same issue when running custom server |
I have just tried v14.0.4-canary.47 and the issue persists. I also tried Node.js v18 and v20. We are only using the App Router. We do not use Prisma or NextAuth. This is affecting builds hosted on Vercel.com (including production). It takes a little while for the issue to pop up after deploying, but after a few RSC renders, it happens quite often (~15% of the time). |
Same issue here, happening very often, impossible to find where it comes from. I am using "next": "^14.0.4", with nextAuth, and nextJS middleware (my app uses also Wundergraph/sdk) Any update on this issue? Thx a lot |
In case anyone is using Sentry, our issue turned out to be related to a bug with the https://github.com/orgs/vercel/discussions/3248#discussioncomment-7851868 |
I am not using Sentry, but I would be very interested to understand which type of error from Sentry was solved. Indeed I have "Socket hang up error" quite often but have found no ways to track the issue for now... Thx I am using nextjs 14.0.4, nextauth with middleware, and wundergraph as backend. |
Also have this exact problem, using NextJS 14.0.3. The error is not caught by our NextJS error boundary, and the user sees the default Vercel 500 error page (black background, white text). Site works after simply refreshing. My initial thought (before finding this thread) was; could this have something to do with cookies from Vercel preview deployments? |
Also experiencing this issue. Node 18/20, Next 14.0.4, next-auth 4.24.5. Specifically, my
Not using Vercel, this is on a Windows Server 2022 VM and locally on my M1 Mac. Works fine running
I was seeing this intermittently, but over the last week something has changed and I am seeing it 100% of the time. Again, this all works perfectly when running Other issues I've found along the way that may be related?
|
Exact same issue on Node 18, Next 14.0.4, next-auth 4.24.5. In
Not sure how it's related, but the random freezes we experienced in production every few days are now completely gone (2 weeks in a row without this issue)! I'd guess a user from time to time gets an unexpected error (from authenticating in our case), which triggered this 500 error page and since |
Hello guys, if you're using Sentry and keep getting 500 errors, refer to this thread: Error 500. This fixed my random error 500. |
Same thing here with
for any request that takes more than 30s to complete. Seems related to the previous issue: I can see that the remote server continues to process and eventually completes and returns a correct response - not the 500 that nextjs is claiming. Is there any way to eliminate the timeout or customize it to a longer period? |
Hi everyone, I will be moving this issue to our We encourage folks to file a new issue with a consistently reproducible Happy 2024! |
Verify canary release
Provide environment information
Operating System:
  Platform: darwin
  Arch: x64
  Version: Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
Binaries:
  Node: 16.14.2
  npm: 8.5.0
  Yarn: 1.22.15
  pnpm: 6.11.0
Relevant packages:
  next: 13.4.6
  eslint-config-next: 13.4.2
  react: 18.2.0
  react-dom: 18.2.0
  typescript: 4.9.5
Which area(s) of Next.js are affected? (leave empty if unsure)
No response
Link to the code that reproduces this issue or a replay of the bug
Not possible; the code is confidential.
To Reproduce
This is our package.json:
our next.config.js :
our middleware.ts
Describe the Bug
We are experiencing a bug that occurs randomly for some of our users, only in production, on any route of the site, and it has never been reported on Sentry. We can only see it in the Vercel logs.
The full error message is as follows:
Uncaught Exception {"errorType":"Error","errorMessage":"socket hang up","code":"ECONNRESET","stack":["Error: socket hang up"," at connResetException (node:internal/errors:717:14)"," at TLSSocket.socketOnEnd (node:_http_client:526:23)"," at TLSSocket.emit (node:events:525:35)"," at TLSSocket.emit (node:domain:489:12)"," at endReadableNT (node:internal/streams/readable:1359:12)"," at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"]} Unknown application error occurred Runtime.Unknown.
We think (but can't verify) that this bug appeared when we updated to Next.js 13. However, none of our pages use the App Router; we're still using the Pages Router for the time being. We've seen that rewrites can cause socket hang ups, but as you can see in our next.config.js, we don't use rewrites.
This can happen on SSG (Static Site Generation), SSR (Server-Side Rendering), or Client-side rendered pages.
It can also happen on any browser or device.
Honestly, we have no clue or way of reproducing this problem because even in our development environment, we don't encounter any problems.
Expected Behavior
I expect the application to work seamlessly without any errors or disruptions. Specifically, I anticipate that the mentioned "Socket Hang Up" error will not occur randomly in production mode on any route of the site. Additionally, I hope that better error handling mechanisms will be implemented to address any potential issues that may arise.
Which browser are you using? (if relevant)
No response
How are you deploying your application? (if relevant)
Vercel