This issue was moved to a discussion.

[Error 500] "Socket Hang Up" Randomly Occurring on any Routes in Production Mode #51605

Closed
1 task done
SebastienSusini opened this issue Jun 21, 2023 · 71 comments
Labels: bug, please add a complete reproduction

Comments

@SebastienSusini

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
      Platform: darwin
      Arch: x64
      Version: Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
    Binaries:
      Node: 16.14.2
      npm: 8.5.0
      Yarn: 1.22.15
      pnpm: 6.11.0
    Relevant packages:
      next: 13.4.6
      eslint-config-next: 13.4.2
      react: 18.2.0
      react-dom: 18.2.0
      typescript: 4.9.5

Which area(s) of Next.js are affected? (leave empty if unsure)

No response

Link to the code that reproduces this issue or a replay of the bug

Not possible; the code is confidential.

To Reproduce

This is our package.json:

`{
  "name": "********",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "dev-https": "NODE_TLS_REJECT_UNAUTHORIZED='0' node server.js",
    "ngrok": "ngrok http https://localhost:3000",
    "build": "next build",
    "postbuild": "next-sitemap",
    "start": "next start",
    "clean": "rimraf .next out",
    "lint": "next lint",
    "lint.fix": "next lint --fix",
    "test": "jest --watch",
    "prepare": "husky install",
    "analyze": "ANALYZE=true next build"
  },
  "dependencies": {
    "@everipedia/wagmi-magic-connector": "^0.12.1",
    "@headlessui/react": "^1.7.15",
    "@headlessui/tailwindcss": "^0.1.3",
    "@heroicons/react": "^1.0.6",
    "@next/bundle-analyzer": "^12.2.0",
    "@next/env": "^13.1.5",
    "@radix-ui/react-dropdown-menu": "^2.0.5",
    "@rainbow-me/rainbowkit": "^0.12.15",
    "@ramp-network/ramp-instant-sdk": "^4.0.2",
    "@react-spring/web": "^9.6.1",
    "@react-three/cannon": "^6.4.0",
    "@react-three/drei": "^9.34.3",
    "@react-three/fiber": "^8.8.10",
    "@segment/analytics-next": "^1.52.0",
    "@sentry/nextjs": "^7.54.0",
    "@stripe/react-stripe-js": "^1.16.3",
    "@stripe/stripe-js": "^1.46.0",
    "@tanstack/react-table": "^8.5.13",
    "@use-gesture/react": "^10.2.19",
    "axios": "^1.4.0",
    "clsx": "^1.2.1",
    "cookies-next": "^2.1.1",
    "date-fns": "^2.29.3",
    "ethers": "^5.7.1",
    "i18next": "^22.4.9",
    "next": "^13.4.6",
    "next-auth": "^4.21.1",
    "next-axiom": "^0.17.0",
    "next-i18next": "^11.3.0",
    "next-password-protect": "^1.8.0",
    "next-share": "^0.18.2",
    "next-sitemap": "^3.1.47",
    "nextjs-progressbar": "^0.0.14",
    "react": "^18.2.0",
    "react-canvas-confetti": "^1.3.0",
    "react-countup": "^6.4.0",
    "react-csv": "^2.2.2",
    "react-currency-input-field": "^3.6.10",
    "react-device-detect": "^2.2.3",
    "react-div-100vh": "^0.7.0",
    "react-dom": "^18.2.0",
    "react-fast-marquee": "^1.3.5",
    "react-hook-form": "^7.41.5",
    "react-hot-toast": "^2.4.0",
    "react-i18next": "^12.1.4",
    "react-icons": "^4.8.0",
    "react-infinite-scroll-component": "^6.1.0",
    "react-intersection-observer": "^9.4.1",
    "react-spring-bottom-sheet": "^3.5.0-alpha.0",
    "react-type-animation": "^2.1.1",
    "react-use-intercom": "^3.0.2",
    "recharts": "2.5.0",
    "sharp": "^0.30.7",
    "swiper": "^9.1.1",
    "swr": "1.3.0",
    "tailwind-merge": "^1.13.1",
    "tailwind-scrollbar": "^3.0.0",
    "tailwind-scrollbar-hide": "^1.1.7",
    "tailwindcss": "^3.1.4",
    "three": "^0.144.0",
    "uuid": "^9.0.0",
    "wagmi": "^0.12.12"
  },
  "devDependencies": {
    "@commitlint/cli": "^17.0.3",
    "@commitlint/config-conventional": "^17.3.0",
    "@testing-library/jest-dom": "^5.16.4",
    "@testing-library/react": "^13.3.0",
    "@types/jest": "^28.1.4",
    "@types/node": "18.0.0",
    "@types/react": "18.0.14",
    "@types/react-csv": "^1.1.3",
    "@types/react-dom": "18.0.5",
    "@types/react-stripe-elements": "^6.0.6",
    "@types/three": "^0.143.0",
    "@types/uuid": "^8.3.4",
    "@typescript-eslint/eslint-plugin": "^5.30.0",
    "@typescript-eslint/parser": "^5.30.0",
    "autoprefixer": "^10.4.7",
    "commitizen": "^4.2.6",
    "commitlint": "^11.0.0",
    "commitlint-config-gitmoji": "2.2.5",
    "cssnano": "^5.1.12",
    "cz-conventional-changelog": "^3.3.0",
    "eslint": "8.18.0",
    "eslint-config-airbnb-base": "^15.0.0",
    "eslint-config-airbnb-typescript": "^17.0.0",
    "eslint-config-next": "^13.3.0",
    "eslint-config-prettier": "^8.5.0",
    "eslint-plugin-import": "^2.26.0",
    "eslint-plugin-jsx-a11y": "^6.6.0",
    "eslint-plugin-prettier": "^4.1.0",
    "eslint-plugin-react": "^7.30.1",
    "eslint-plugin-react-hooks": "^4.6.0",
    "eslint-plugin-simple-import-sort": "^7.0.0",
    "eslint-plugin-tailwindcss": "^3.6.0",
    "eslint-plugin-unused-imports": "^2.0.0",
    "husky": "^8.0.0",
    "jest": "^28.1.2",
    "jest-environment-jsdom": "^28.1.2",
    "lint-staged": "^13.0.3",
    "postcss": "^8.4.14",
    "prettier": "^2.7.1",
    "rimraf": "^3.0.2",
    "typescript": "^4.9.5"
  },
  "config": {
    "commitizen": {
      "path": "./node_modules/cz-conventional-changelog"
    }
  }
}`

Our next.config.js:

/** @type {import('next').NextConfig} */

const { withSentryConfig } = require('@sentry/nextjs');
const { withAxiom } = require('next-axiom');
const withBundleAnalyzer = require('@next/bundle-analyzer')({
  enabled: process.env.ANALYZE === 'true',
});

const { i18n } = require('./next-i18next.config');

const IS_PROTECTED = process.env.NEXT_PUBLIC_NODE_ENV === 'staging';

const securityHeaders = [
  {
    key: 'X-XSS-Protection',
    value: '1; mode=block',
  },
  {
    key: 'X-Content-Type-Options',
    value: 'nosniff',
  },
  {
    key: 'Referrer-Policy',
    value: 'origin-when-cross-origin',
  },
  {
    key: 'X-DNS-Prefetch-Control',
    value: 'on',
  },
  {
    key: 'Strict-Transport-Security',
    value: 'max-age=63072000; includeSubDomains; preload',
  },
];

const nextConfig = withAxiom(
  withBundleAnalyzer({
    reactStrictMode: true,
    swcMinify: false,
    i18n,
    env: {
      PASSWORD_PROTECT: IS_PROTECTED,
    },
    images: {
      domains: ['lh3.googleusercontent.com', 'i.scdn.co'],
    },
    sentry: {
      widenClientFileUpload: true,
      hideSourceMaps: true,
      automaticVercelMonitors: false,
    },
    // transpilePackages: ['react-native'],
    async redirects() {
      return [
        {
          source: '/login',
          destination: '/auth/login',
          permanent: true,
        },
        {
          source: '/signup',
          destination: '/auth/signup',
          permanent: true,
        },
        {
          source: '/dashboard',
          destination: '/users/dashboard',
          permanent: true,
        },
        {
          source: '/backstage',
          destination: '/artists/backstage',
          permanent: true,
        },
        {
          source: '/explore',
          destination: '/search',
          permanent: true,
        },
        {
          source: '/faqs',
          destination: '/faq',
          permanent: true,
        },
        {
          source: '/users/reward-tasks',
          destination: '/users/game/explain',
          permanent: true,
        },
      ];
    },
    async headers() {
      return [
        {
          source: '/:path*',
          headers: securityHeaders,
        },
        {
          source: '/.well-known/apple-developer-merchantid-domain-association',
          headers: [{ key: 'Content-Type', value: 'application/json' }],
        },
      ];
    },
    webpack: (config) => {
      config.module.rules.push({
        test: /\.pdf$/,
        use: {
          loader: 'file-loader',
          options: {
            name: '[path][name].[ext]',
          },
        },
      });
      // config.externals.push('react-native');
      return config;
    },
  })
);

const sentryWebpackPluginOptions = {
  org: '*****-*****',
  project: '*****-nextjs',
  silent: true, // Suppresses all logs
  // For all available options, see:
  // https://github.com/getsentry/sentry-webpack-plugin#options.
};

module.exports = withSentryConfig(nextConfig, sentryWebpackPluginOptions);

Our middleware.ts:

/* eslint-disable consistent-return */
import type { NextRequest } from 'next/server';
import { NextResponse } from 'next/server';
import { withAuth } from 'next-auth/middleware';

const ROLES_ALLOWED_TO_AUTH = new Set<any>(['artist', 'user']);

export default withAuth(
  function middleware(req: NextRequest & { nextauth: { token: any } }) {
    // Redirect if they don't have the appropriate role
    if (
      req.nextUrl.pathname.startsWith('/artists/backstage') ||
      req.nextUrl.pathname.startsWith('/artists/onboarding') ||
      req.nextUrl.pathname.startsWith('/artists/new')
    ) {
      if (!ROLES_ALLOWED_TO_AUTH.has(req.nextauth.token?.userRole)) {
        return NextResponse.redirect(new URL('/auth/login', req.url));
      }
      if (req.nextauth.token?.userRole === 'user' && req.nextauth.token?.userRole !== 'artist') {
        return NextResponse.redirect(new URL('/users/dashboard', req.url));
      }
      if (req.nextauth.token?.userRole === 'artist') {
        return NextResponse.next();
      }
    }
  },
  {
    callbacks: {
      authorized: ({ token }) =>
        token?.userRole !== undefined && ROLES_ALLOWED_TO_AUTH.has(token.userRole),
    },
  }
);

export const config = {
  matcher: [
    '/feed',
    '/artists/new/:path*',
    '/artists/backstage/:path*',
    '/artists/onboarding/:path*',
    '/users/dashboard/:path*',
    '/users/game/:path*',
    '/users/settings',
  ],
};

Describe the Bug

We are experiencing a bug that occurs randomly for some of our users, only in production, on any route of the site, and it has never been reported on Sentry. We can only see it in the Vercel logs.

The full error message is as follows:
Uncaught Exception {"errorType":"Error","errorMessage":"socket hang up","code":"ECONNRESET","stack":["Error: socket hang up"," at connResetException (node:internal/errors:717:14)"," at TLSSocket.socketOnEnd (node:_http_client:526:23)"," at TLSSocket.emit (node:events:525:35)"," at TLSSocket.emit (node:domain:489:12)"," at endReadableNT (node:internal/streams/readable:1359:12)"," at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"]} Unknown application error occurred Runtime.Unknown.

We think (but can't verify) that this bug appeared when we upgraded to Next.js 13. However, none of our pages use the App Router; we're still on the Pages Router for the time being. We've read that rewrites can cause socket hang-ups, but as you can see in our next.config.js, we don't use rewrites.

This can happen on SSG (Static Site Generation), SSR (Server-Side Rendering), or Client-side rendered pages.
It can also happen on any browser or device.

Honestly, we have no clue or way of reproducing this problem because even in our development environment, we don't encounter any problems.

Expected Behavior

I expect the application to work seamlessly without any errors or disruptions. Specifically, I anticipate that the mentioned "Socket Hang Up" error will not occur randomly in production mode on any route of the site. Additionally, I hope that better error handling mechanisms will be implemented to address any potential issues that may arise.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

Vercel

SebastienSusini added the bug label Jun 21, 2023
SebastienSusini changed the title from [Error 500] "Socket Hang Up" Randomly Occurring in Next.js Routes in Production Mode to [Error 500] "Socket Hang Up" Randomly Occurring on any Routes in Production Mode, Jun 21, 2023
@NadhifRadityo

This issue will be easier to assess if you provide a simple project that reproduces it. Nevertheless, based on your stack trace, it looks like you are connecting to a TLS/SSL socket (which I doubt Next.js handles itself; it is probably handled by one of your libraries). Based on your dependencies, I'm going to take a wild guess that you are connecting to a database to authenticate a user, and that the connection between your web server and your database is somehow being closed (or is unstable, or anything in between really).

This is already out of scope, but as a quick fix you could check the connection to your database, or simply restart your Next.js server if possible, since that re-instantiates the database variable and connection.

@SebastienSusini
Author

Thanks for your response.

We don't have a direct connection to a database from Next.js. We use a Ruby on Rails API, and on the Rails side we don't get any errors (we don't even see an API call in the Rails log when this error occurs in Next.js).

We haven't provided any code because I don't think it would be very informative: the bug is really random and only happens in production mode. I can, however, make a minimal reproducible example with all the dependencies that simulates our login page deployed on Vercel, but I have no guarantee that the bug will happen again.

I'm attaching an image below with more information from Vercel, which shows that neither the memory limit nor the execution time limit is exceeded.

[Image: screenshot, 2023-06-22 at 09:56:13]

@NadhifRadityo

Another thing I overlooked: shouldn't a database call trigger the DYNAMIC cache level? I have never deployed to Vercel, but that seems a bit weird. Can errors be cached? Even so, this would not explain why the error happens in the first place.

Perhaps the socket hang-up comes from the Vercel side? They handle HTTPS on their side, and if the client suddenly closes the connection while the request isn't complete, maybe it throws an error? But I don't think that's the case either; if it were, many Vercel users would have reported it already.

Maybe you could provide simple reproduction code, and I can see whether I can reproduce it myself on Vercel.

balazsorban44 added the please add a complete reproduction label Jun 30, 2023
@github-actions
Contributor

We cannot recreate the issue with the provided information. Please add a reproduction in order for us to be able to investigate.

Why was this issue marked with the please add a complete reproduction label?

To be able to investigate, we need access to a reproduction to identify what triggered the issue. We prefer a link to a public GitHub repository (template for pages, template for App Router), but you can also use these templates: CodeSandbox: pages or CodeSandbox: App Router.

To make sure the issue is resolved as quickly as possible, please make sure that the reproduction is as minimal as possible. This means that you should remove unnecessary code, files, and dependencies that do not contribute to the issue.

Please test your reproduction against the latest version of Next.js (next@canary) to make sure your issue has not already been fixed.

I added a link, why was it still marked?

Ensure the link is pointing to a codebase that is accessible (e.g. not a private repository). "example.com", "n/a", "will add later", etc. are not acceptable links -- we need to see a public codebase. See the above section for accepted links.

What happens if I don't provide a sufficient minimal reproduction?

Issues with the please add a complete reproduction label that receives no meaningful activity (e.g. new comments with a reproduction link) are automatically closed and locked after 30 days.

If your issue has not been resolved in that time and it has been closed/locked, please open a new issue with the required reproduction.

I did not open this issue, but it is relevant to me, what can I do to help?

Anyone experiencing the same issue is welcome to provide a minimal reproduction following the above steps. Furthermore, you can upvote the issue using the 👍 reaction on the topmost comment (please do not comment "I have the same issue" without reproduction steps). Then, we can sort issues by votes to prioritize.

I think my reproduction is good enough, why aren't you looking into it quicker?

We look into every Next.js issue and constantly monitor open issues for new comments.

However, sometimes we might miss one or two due to the popularity/high traffic of the repository. We apologize, and kindly ask you to refrain from tagging core maintainers, as that will usually not result in increased priority.

Upvoting issues to show your interest will help us prioritize and address them as quickly as possible. That said, every issue is important to us, and if an issue gets closed by accident, we encourage you to open a new one linking to the old issue and we will look into it.

@piotrcichosz

We're facing the same issue, also shortly after upgrading to Next 13.

The logs look like this:
Error: socket hang up at connResetException (node:internal/errors:705:14) at Socket.socketOnEnd (node:_http_client:518:23) at Socket.emit (node:events:525:35) at runInContextCb (/app/node_modules/newrelic/lib/shim/shim.js:1322:22) at LegacyContextManager.runInContext (/app/node_modules/newrelic/lib/context-manager/legacy-context-manager.js:59:23) at Shim.applySegment (/app/node_modules/newrelic/lib/shim/shim.js:1312:25) at Socket.wrapper [as emit] (/app/node_modules/newrelic/lib/shim/shim.js:1898:17) at endReadableNT (node:internal/streams/readable:1358:12) at runInContextCb (/app/node_modules/newrelic/lib/shim/shim.js:1322:22) at LegacyContextManager.runInContext (/app/node_modules/newrelic/lib/context-manager/legacy-context-manager.js:59:23) { code: 'ECONNRESET'}
Around 100 errors like this, then:
Error: read ECONNRESET at TCP.onStreamRead (node:internal/stream_base_commons:217:20) at runInContextCb (/app/node_modules/newrelic/lib/shim/shim.js:1322:22) at LegacyContextManager.runInContext (/app/node_modules/newrelic/lib/context-manager/legacy-context-manager.js:59:23) at Shim.applySegment (/app/node_modules/newrelic/lib/shim/shim.js:1312:25) at TCP.wrapper (/app/node_modules/newrelic/lib/shim/shim.js:1898:17) at TCP.callbackTrampoline (node:internal/async_hooks:130:17) { errno: -104, code: 'ECONNRESET', syscall: 'read'}
9 of these errors (with 3 ECS task instances, each running 2 processes),

and then thousands of errors:
Error: connect ECONNREFUSED 192.168.214.16:33105 at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) { errno: -111, code: 'ECONNREFUSED', syscall: 'connect', address: '192.168.214.16', port: 33105}

This is random but never happens on fresh instances; so far each time (we've had this problem 3 times) it occurred days after a deploy. It looks like once the socket breaks it can't be recreated?

I saw @SebastienSusini is using Vercel; I'm using AWS ECS tasks.
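
The three log shapes above differ in their `code` and `syscall` fields, which hints at two phases: established connections being reset, then nothing accepting connections at all. Below is a minimal TypeScript sketch of tallying log entries by phase. The `NetError` shape, the `Phase` labels, and the function names are hypothetical and chosen only for illustration; the two-phase reading is an assumption, not a confirmed diagnosis.

```typescript
// Only the fields that appear in the logged error objects above.
interface NetError {
  message: string;
  code?: string;
  syscall?: string;
}

type Phase = "reset-mid-request" | "refused-on-connect" | "other";

// ECONNRESET: an established connection was dropped by the peer.
// ECONNREFUSED: nothing accepted the connection at the target address and port.
export function classify(err: NetError): Phase {
  if (err.code === "ECONNRESET") return "reset-mid-request";
  if (err.code === "ECONNREFUSED" && err.syscall === "connect") {
    return "refused-on-connect";
  }
  return "other";
}

// Count how many errors fall into each phase.
export function tally(errors: NetError[]): Record<Phase, number> {
  const counts: Record<Phase, number> = {
    "reset-mid-request": 0,
    "refused-on-connect": 0,
    other: 0,
  };
  for (const err of errors) {
    counts[classify(err)] += 1;
  }
  return counts;
}
```

A sudden switch from resets to refusals would be consistent with a process that used to listen on that port no longer running, which is one of the hypotheses discussed later in the thread.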

@piotrcichosz

By the way, I can't share the whole project, and since it can happen after a week and a million requests, I'm not sure how easy it would be to recreate. Maybe an easier route would be to enable some debug logs?

@SebastienSusini do you have heavy traffic on that project? Do you also experience these errors only some time after a deploy, or can they happen randomly a few minutes or hours after a deploy? When did you upgrade to 13, and how many of these incidents have you had?

@0xadada
Contributor

0xadada commented Jul 27, 2023

I'm also seeing this same error. It occurs reliably on about 1 in 10 requests in production, but I cannot reproduce it locally with curl or browser requests.

  • Next.js 13.4.10
  • Node v18
Error: aborted
  at connResetException (node:internal/errors:720:14)
  at Socket.socketCloseListener (node:_http_client:461:19)
  at Socket.emit (node:events:526:35)
  at TCP.<anonymous> (node:net:323:12) {
    code: 'ECONNRESET'

Notably, my app renders the page, but occasionally throws this error in 13.4.6. As of 13.4.10, the rendering fails when this error is thrown.

@0xadada
Contributor

0xadada commented Jul 27, 2023

Seems possibly related to #49587

@0xadada
Contributor

0xadada commented Jul 27, 2023

@SebastienSusini I also isolated this issue in my app to have started in next 13.4.6 (same version in your bug report), and reverting to 13.4.5 resolved the issue 🎉

@mthmcalixto

@SebastienSusini I also isolated this issue in my app to have started in next 13.4.6 (same version in your bug report), and reverting to 13.4.5 resolved the issue 🎉

I have the same problem on 13.4.12.

@dbrxnds

dbrxnds commented Jul 28, 2023

We are also running into this issue, with the same circumstances as described before:

  • Only seems to occur in production. (Possibly due to load?)
  • Occurs only after a couple of days or a week+, after millions of requests have already happened
  • Unable to pin-point one specific action/request that causes this issue
Error: socket hang up
    at connResetException (node:internal/errors:705:14)
    at Socket.socketOnEnd (node:_http_client:518:23)
    at Socket.emit (node:events:525:35)
    at Socket.emit (node:domain:489:12)
    at endReadableNT (node:internal/streams/readable:1358:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  code: 'ECONNRESET'

We will now downgrade to 13.4.5 and watch if it happens again, unfortunately we have no way to test it consistently. Time will tell..?

For info, we are running on AWS EC2 instances. @0xadada do I understand correctly that in your case following requests do get handled? For us it seems to completely stop the server from being able to handle any requests after that point.

@NadhifRadityo

NadhifRadityo commented Jul 28, 2023

Socket hang-ups do occur from time to time if the client aborts the connection, and it seems that after the abort Next.js is still actively waiting for incoming TCP packets.

There are a few candidates for where this error could occur, but since it happens in production, where incoming traffic might be huge and hard to reproduce on a small scale, pinpointing the exact part is hard. Nonetheless, I have some rough ideas about where the problem(s) could be, based on the effects some people have mentioned.

  1. Requests after the error don't get handled (completely bricks the server)
  • Either the IPC worker process or the render worker process may have died. Either process could be killed on purpose if its memory usage is above the limit. After it is killed or dies, the jest worker should respawn it, and that respawn might have failed. (I should mention that I noticed the jest worker is a bit sloppy at handling respawns on my machine.)
  • Either the IPC worker process or the render worker process may still be handling another request that is doing something synchronously, which makes subsequent requests time out. https://stackoverflow.com/questions/16995184/nodejs-what-does-socket-hang-up-actually-mean
  2. Requests still get handled correctly after the error. This could be just the client aborting the connection while Next.js is still actively waiting for incoming TCP packets.

In both cases, Next.js uses http-proxy to forward requests between processes. I might write a proposal to rewrite the IPC communication between Next.js processes to handle requests better (support for IPC callbacks, passing the req/res pair to another process, etc.).
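
For readers unfamiliar with the error itself, here is a minimal sketch in plain Node (no Next.js involved) of how an HTTP client ends up with ECONNRESET when the peer closes the socket before sending any response. The function name `demoSocketHangUp` is hypothetical, and the throwaway server only stands in for a worker process that dies mid-request; it is not how Next.js is implemented.

```typescript
import * as http from "node:http";
import type { AddressInfo } from "node:net";

// Starts a throwaway server that drops every connection without replying,
// sends one request to it, and resolves with the client-side error code.
export function demoSocketHangUp(): Promise<string | undefined> {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req) => {
      // Simulate a peer dying mid-request: close the TCP socket
      // without writing any HTTP response.
      req.socket.destroy();
    });

    // Port 0 lets the OS pick a free port.
    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address() as AddressInfo;

      const clientReq = http.get({ host: "127.0.0.1", port, path: "/" }, (res) => {
        // Not expected: the server never sends a response.
        res.resume();
        server.close();
        reject(new Error(`unexpected response: ${res.statusCode}`));
      });

      clientReq.on("error", (err: NodeJS.ErrnoException) => {
        server.close();
        resolve(err.code);
      });
    });
  });
}
```

Calling `demoSocketHangUp().then(console.log)` should print the error code. The accompanying message is typically `socket hang up`, the same one seen in the stack traces in this thread.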


@dbrxnds Regarding the first point, do subsequent requests after the error fail immediately, or is there a timeout before they fail?

@0xadada I need to confirm: are you deploying this on your own machine or on shared hosting (Vercel, etc.)? Do you use appDir, pageDir, or a combination of both?

@dbrxnds

dbrxnds commented Jul 28, 2023

Appreciate the well written response, @NadhifRadityo.

I am fairly certain subsequent requests just hang, at least for a good while. We end up getting an error response saying "the upstream server returned an invalid response", but I assume that is just the load balancer or some other part doing its thing. Requests just remain pending in the network tab until that point.

@NadhifRadityo

NadhifRadityo commented Jul 28, 2023

This seems unlikely, but is there a chance that your Next.js project does I/O synchronously or runs heavy synchronous tasks?
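
A heavy synchronous task shows up as event-loop lag: a timer scheduled for 0 ms fires late because the loop was busy. The sketch below demonstrates the effect. `measureLoopLag` is a hypothetical helper, and the busy-wait merely stands in for real synchronous work.

```typescript
// Schedules a 0 ms timer, then blocks the event loop for `blockMs`
// milliseconds. Resolves with how late the timer actually fired.
export function measureLoopLag(blockMs: number): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);

    // Stand-in for a heavy synchronous task: spin without yielding,
    // so the timer callback cannot run until the loop is free again.
    const until = Date.now() + blockMs;
    while (Date.now() < until) {
      // busy wait
    }
  });
}
```

While the loop is blocked like this, every other request on the same process waits too, which is how one slow synchronous call can turn into timeouts for unrelated requests. For continuous measurement in a real server, Node also provides `monitorEventLoopDelay` in `node:perf_hooks`.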

Also, I need to confirm: the request hangs for any route, right? (dynamic page, static page, static resources)

And to be sure, can you take a process list with process arguments before and after the error? Search for something like node processChild.js ... and write down the PID. The next time the error happens, take a process list again, check whether node processChild.js ... is still there, and compare whether the PID changed.
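
The before/after comparison described above can be sketched as two small functions that work on the text output of `ps -eo pid,args`. The names `findPids`, `compareWorkers`, and `WorkerState` are hypothetical, and matching on the substring `processChild.js` is an assumption taken from the comment above.

```typescript
// Extract the PIDs of processes whose command line contains `needle`
// from the text output of `ps -eo pid,args`. The header line is skipped
// because it does not start with a number.
export function findPids(psOutput: string, needle: string): number[] {
  const pids: number[] = [];
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(.*)$/.exec(line);
    const pid = match?.[1];
    const args = match?.[2];
    if (pid !== undefined && args !== undefined && args.includes(needle)) {
      pids.push(Number(pid));
    }
  }
  return pids;
}

export type WorkerState = "unchanged" | "restarted" | "missing";

// Compare worker PIDs captured before and after the error.
export function compareWorkers(before: number[], after: number[]): WorkerState {
  if (after.length === 0) return "missing"; // the worker is gone entirely
  const known = new Set(before);
  // Any PID that was not present before means a worker was respawned.
  return after.every((pid) => known.has(pid)) ? "unchanged" : "restarted";
}
```

Saving the output of `ps -eo pid,args` to a file while the server is healthy, and again after the error, gives the two inputs. A result of `"missing"` would match the first hypothesis (a dead worker that was never respawned), while `"restarted"` would point to a worker being killed and replaced.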

And for the record, do you use appDir only, pageDir, or a combination of both?

I will try to eliminate IPC communication first as it makes the most sense in my opinion. I'll try manually killing the worker process, and see if I can reproduce the problem.

@0xadada
Contributor

0xadada commented Jul 29, 2023

@0xadada I need to confirm, are you deploying this on a machine or shared hosting (vercel, etc)? Do you use appDir or pageDir or combination of both?

@NadhifRadityo yes, I've got a Next.js process running in a shared Docker container with a Ruby web server. A Ruby web client makes HTTP requests on localhost to our Next.js process, which runs next start.

@NadhifRadityo

NadhifRadityo commented Jul 29, 2023

@0xadada
Do subsequent requests get handled? If not, do subsequent requests after the error fail immediately, or is there a timeout before they fail?

I'd like to cover everything because you get a different stack trace. Or do you perhaps get socket hang-ups too?

@0xadada
Contributor

0xadada commented Jul 29, 2023

@NadhifRadityo

Do you use appDir or pageDir or combination of both?

I do not use appDir, we use pages/api/* and pages/.

@0xadada Do subsequent requests get handled? If not, do subsequent requests after the error failed immediately or are there timeout before the subsequent requests failed?

Subsequent requests are always handled.

Starting in next 13.4.6, we saw a regression where we'd occasionally see one of two possible failures, on roughly 10% of all requests:

  • failure type A: the request would return a completely failed page rendering with a 500 error, and a log message on production (below)
  • failure type B: the request would return successfully with a 200 response, but a log message on production (below)

Any time either failure A or B occurred (~10% of all requests), this would appear in the production logs:

Error: aborted
    at connResetException (node:internal/errors:717:14)
    at Socket.socketCloseListener (node:_http_client:462:19)
    at Socket.emit (node:events:525:35)
    at TCP.<anonymous> (node:net:322:12) {
      code: 'ECONNRESET'
    }

@NadhifRadityo

@0xadada Interesting... Your issue seems a bit different, but I think it is related. This could be just the client aborting the connection while Next.js is still actively waiting for incoming TCP packets.

I'll try to reproduce it by making a long request and aborting it, and see if I can get a reproduction.

@mthmcalixto

This error sometimes happens in development mode on version 13.4.12: after a lot of refreshes it stops working, and the dev server has to be restarted from the terminal.

@dbrxnds

dbrxnds commented Jul 29, 2023

@NadhifRadityo

This seems unlikely, but are there a chance of your next.js project does I/O operations synchronously or heavy synchronous tasks?

It does not.

Also I need to confirm, the request hangs for any routes right? (dynamic page, static page, static resources)

Correct, any route, API routes. Everything.

And for the record, do you use appDir only or pageDir or combination of both?

Pages directory with API routes

And to make things sure, can you do a process list with process arguments, before and after the error? Search something like node processChild.js ... and write down the PID. The next time the error happens do a process list again and check if node processChild.js ... still there and compare if the PID changes.

I will attempt to do this once it occurs again!

@NadhifRadityo

NadhifRadityo commented Jul 30, 2023

@0xadada I managed to reproduce your problem and have created a new issue for it. It seems your project has a memory issue that repeatedly restarts your render worker. This is also consistent with only 10% of your requests hitting these errors, since every request grows memory a fair bit until it reaches a threshold. See #53353 for my detailed explanation. (You can confirm this by checking whether the render worker PID changes.)

I should also mention that Next.js currently has a memory problem (#46756, #49929, #48748, #49929). Perhaps you could try reverting to an earlier version and see if it helps.

Edit: It might also be something else in your project that kills your render worker. High memory usage is just my assumption, because other people are having these issues as well.

@NadhifRadityo

This error sometimes happens in developer mode in version 13.4.12, when there is a lot of refresh it stops working and needs to start the terminal again.

@mthmcalixto I think this is caused by the high memory usage problem in current Next.js versions. #46756 describes how, after editing files and refreshing, memory usage grows rapidly because of recompilation. Because of the high memory usage, the worker restarts and ongoing requests are aborted.

@0xadada
Contributor

0xadada commented Jul 31, 2023

And I need to mention that next.js is having a memory problem currently (#46756, #49929, #48748, #49929). Perhaps, you could try reverting to the earlier versions, and see if it helps.

Our team has reverted to 13.4.5 and no longer sees the problem. It doesn't seem to be memory-related: the server outputs process.memory.rss, which was consistently around 80-90 MB (below our mem-max), and the problem would still occur randomly. Reverting has solved it.
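
For anyone who wants to collect the same kind of evidence, here is a minimal sketch of logging resident set size periodically with Node's standard `process.memoryUsage().rss`. Whether this matches how the server above reports its figure is unknown; `formatRss` and `startRssLogger` are hypothetical helper names.

```typescript
// Format a byte count as megabytes for a log line.
export function formatRss(bytes: number): string {
  return `rss=${(bytes / (1024 * 1024)).toFixed(1)}MB`;
}

// Log this process's resident set size every `intervalMs` milliseconds.
// Returns a function that stops the logger.
export function startRssLogger(
  intervalMs: number,
  log: (line: string) => void = console.log,
): () => void {
  const timer = setInterval(() => {
    const rss = process.memoryUsage().rss;
    log(`${new Date().toISOString()} pid=${process.pid} ${formatRss(rss)}`);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Including the PID in each line also reveals restarts. Note that this only measures the process it runs in, so a separate render worker process would need its own logging.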

@ielaajezdev

ielaajezdev commented Aug 2, 2023

Glad to see this thread, since I am also experiencing frequent (seemingly random) "socket hang up" errors and it was quite hard to debug the root cause. I think my use case is different, so I will add my error scenario in case it makes exploring the problem easier.

Setup

  • NextJS version: 13.4.11
  • Using the pages/* and pages/api/* directories, not the new App Router

I am developing my next app in a Docker compose composition. The next app runs in the node 18 alpine image as per the Docker examples provided. Other containers in the composition are a postgres db, Prisma (studio), cerbos and strapi. I am developing on MacOS (Mac with M2). I have not yet deployed my app to production and only use development mode currently.

Answers to your questions

  • The socket hang up occurs seemingly at random, and when it happens, I get about 3 socket hang up/ECONNRESET errors in a row.
  • After a socket hang up, no other requests are handled anymore: I need to restart the container again to restore functionality.
  • I thought at first that the problem was connected to my file API endpoint (an endpoint that downloads a file from a remote server and then sends it to the client after logging the request) but that does not seem to be the case: even when not loading any images, this error occurs.

Things I have tried

  • I have tried to use the specific linux/amd64 platform for my Dockerfile and that did seem to resolve issues but was too slow in the development environment on my M2 mac
  • I have tried to decrease the MTU for the docker bridge network as suggested here but that also did not resolve issues

Hope this helps! Following this issue and reverting to next 13.4.5 for now...

@NadhifRadityo

Currently I have this issue very often (almost every day). The situation is always the same: socket hang up, next-render-worker-app is missing from the process list, and the server doesn't respond - it just hangs.

BTW prisma maintainers said they fixed related problem: prisma/prisma#19419 – maybe this helps. I haven't tried it yet since the fix is in unstable branch for now.

Indeed, I noticed next-render-worker-app is missing too, which explains the socket hang-ups. The underlying cause is still unknown to me. Could you perhaps inspect the memory usage just before the process goes missing?

Also, this issue seems to correlate with people using Prisma in their project. I will try to reproduce it when I have the time (I have been busy with college).

@IonelLupu

I can confirm we get the same error pretty much every day. Here is a Sentry log:
image

But I don't see anything related to Nextjs in the stack trace

@volodymyr-strilets-mindcurv

I would like to add my experience with this issue

  • we don't use Prisma in our project
  • we don't use Next Auth
  • but still, the error appeared multiple times on prod
  • had to downgrade to 13.3.4 and haven't seen the error since then (at least, in our case)

So, I don't think the issue is related to Prisma or NextAuth

@ielaajezdev

After updating to Prisma 5.3.0, the issue seems to be resolved (in development) so it might be that the issues are unrelated, but I will continue to monitor.

@taylor-lindores-reeves

taylor-lindores-reeves commented Sep 19, 2023

I have a NextJS frontend repo and a separate ExpressJS backend with Prisma, but the backend is irrelevant in this case because my production deployment is failing for the frontend only.

Wasn't able to upgrade to NextJS 13.4+ for the longest time because we had an issue where POST requests would not proxy correctly.

Upgraded to NextJS v13.4.19 version today and development environment works fine, but the production deployment is failing with the following AWS CloudWatch logs:

Screenshot 2023-09-19 at 12 22 50

EDIT: this may be due to discrepancy between NodeJS versions in my local env vs production. My local is 18.17.1 and my production is v16

@jyotman

jyotman commented Oct 3, 2023

Using Next.js 13.4.9 and have been facing this issue consistently on my deployment on Vercel. We don't use any of the separate packages mentioned above.

It happens quite often for us. For certain APIs on the pages router, it happens on almost every third API call. It's as if the request doesn't even reach the handler, as no internal logs are printed alongside.
Never happens locally.

Error log from Vercel -

2023-10-03T08:08:30.392Z	1343d692-fe6a-46df-a18f-e82bd20b3afd	ERROR	Uncaught Exception 	{"errorType":"Error","errorMessage":"socket hang up","code":"ECONNRESET","stack":["Error: socket hang up","    at connResetException (node:internal/errors:720:14)","    at TLSSocket.socketOnEnd (node:_http_client:525:23)","    at TLSSocket.emit (node:events:526:35)","    at endReadableNT (node:internal/streams/readable:1359:12)","    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"]}
[ERROR] [1696320510450] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
Error: Runtime exited with error: exit status 129
Runtime.ExitError

@navidevongit2

We could "solve" or at least work around the socket hang up errors by replacing node with bun.

@unbelievableflavour

Issue happens for us as well.

  • we don't use Prisma in our project.
  • we don't use Next Auth.
  • downgrade to 13.3.4 did not fix the issue for us.

@rohanvachheta

It reproduces for us as well, sometimes in production, with "next": "^13.4.1".

  • We don't use prisma
  • We do have the next-auth

@nnmax
Contributor

nnmax commented Oct 29, 2023

I have an application with NextJS version 13.4.12 which is deployed on AWS EC2. After I tried to upgrade NextJS to version 14.0.0, the AWS health check task failed to execute.

The AWS CloudWatch log shows Error: socket hang up.

I then went into the ECS container and executed the following command:

/app $ wget -qO- localhost:3000
wget: can't connect to remote host (127.0.0.1): Connection refused

I found that localhost:3000 is unreachable. Then I changed my health check command parameters to:

diff --git a/.aws/ecs/dev/task-definition.json b/.aws/ecs/dev/task-definition.json
index abf723d..cf5103d 100644
--- a/.aws/ecs/dev/task-definition.json
+++ b/.aws/ecs/dev/task-definition.json
@@ -45,7 +45,10 @@
       "interactive": null,
       "healthCheck": {
         "retries": 3,
-        "command": ["CMD-SHELL", "wget --spider -q localhost:3000/ || exit 1"],
+        "command": [
+          "CMD-SHELL",
+          "wget --spider -q ${HOSTNAME}:3000/ || exit 1"
+        ],
         "timeout": 10,
         "interval": 5,
         "startPeriod": 180

Now the health check runs successfully and no longer gives the Error: socket hang up error.

@okngnr
Copy link

okngnr commented Oct 30, 2023

In our case it is not random. There is a feed.xml route in our app that sends a request to S3 to get the actual feed. The file is about 100 MB, and the request fails with a "socket hang up" error that then breaks the entire app. There is no way to handle the error with try-catch.

I've tried almost everything to get rid of this error, including using node https or other third-party HTTP clients instead of the fetch API. It seems the problem is not related to Next; it's a Node.js or maybe undici-related problem. I also tried some other node images, but I think a stable node image could fix this issue.
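
One possible reason a try/catch misses this kind of failure: with Node's callback-style http/https clients, socket errors are emitted as 'error' events on the request or response instead of being thrown, so they only become catchable once wrapped in a promise. Whether that applies to the feed.xml route above is an assumption. A minimal sketch, with `fetchBody` as a hypothetical helper (plain http here; the https module exposes the same `get` shape):

```typescript
import * as http from "node:http";

// Wrap http.get in a promise so socket-level failures such as
// "socket hang up" (ECONNRESET) reject the promise instead of
// surfacing as an unhandled 'error' event.
function fetchBody(url: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const req = http.get(url, (res) => {
      const chunks: Buffer[] = [];
      res.on("data", (chunk: Buffer) => chunks.push(chunk));
      res.on("end", () => resolve(Buffer.concat(chunks)));
      // Connection dropped while the body was still streaming.
      res.on("error", reject);
    });
    // Connection dropped before any response arrived.
    req.on("error", reject);
  });
}
```

With this wrapper, `try { await fetchBody(url) } catch (err) { ... }` receives the ECONNRESET error. For a file around 100 MB, piping the response to its destination would be preferable to buffering it in memory as this sketch does.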

@gruckion

This is also happening for us. We get it in both local and production. I was hoping it was just a local dev issue, but I saw it in a user's browser a week ago.

image image
<!DOCTYPE html>
<html>
    <head>
        <style data-next-hide-fouc="true">
            body {
                display: none
            }
        </style>
        <noscript data-next-hide-fouc="true">
            <style>
                body {
                    display: block
                }
            </style>
        </noscript>
        <meta charSet="utf-8"/>
        <meta name="viewport" content="width=device-width"/>
        <meta name="next-head-count" content="2"/>
        <noscript data-n-css=""></noscript>
        <script defer="" nomodule="" src="/_next/static/chunks/polyfills.js?ts=1698758234124"></script>
        <script src="/_next/static/chunks/webpack.js?ts=1698758234124" defer=""></script>
        <script src="/_next/static/chunks/main.js?ts=1698758234124" defer=""></script>
        <script src="/_next/static/chunks/pages/_app.js?ts=1698758234124" defer=""></script>
        <script src="/_next/static/chunks/pages/_error.js?ts=1698758234124" defer=""></script>
        <script src="/_next/static/development/_buildManifest.js?ts=1698758234124" defer=""></script>
        <script src="/_next/static/development/_ssgManifest.js?ts=1698758234124" defer=""></script>
        <noscript id="__next_css__DO_NOT_USE__"></noscript>
    </head>
    <body>
        <div id="__next"></div>
        <script src="/_next/static/chunks/react-refresh.js?ts=1698758234124"></script>
        <script id="__NEXT_DATA__" type="application/json">
            {"props":{"pageProps":{"statusCode":500}},"page":"/_error","query":{"__NEXT_PAGE":"/api/trpc/activity.logAgentActivity"},"buildId":"development","isFallback":false,"err":{"name":"Error","source":"server","message":"socket hang up","stack":"Error: socket hang up\n    at connResetException (node:internal/errors:717:14)\n    at Socket.socketOnEnd (node:_http_client:526:23)\n    at Socket.emit (node:events:525:35)\n    at endReadableNT (node:internal/streams/readable:1359:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"},"gip":true,"scriptLoader":[]}
        </script>
    </body>
</html>
Error: socket hang up
    at connResetException (node:internal/errors:717:14)
    at Socket.socketOnEnd (node:_http_client:526:23)
    at Socket.emit (node:events:525:35)
    at endReadableNT (node:internal/streams/readable:1359:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

My application is expecting JSON from two tRPC endpoints. The issue is difficult to reproduce; it does not always happen. Here I had logged in and it had failed to get back the expected response from two endpoints.

As you can see from the picture, the two endpoints before this went through fine.

After simply refreshing the page, it works again.

image

@kaykdm
Contributor

kaykdm commented Nov 29, 2023

I faced the same issue when I tried to upgrade Node.js from version 16 to 20, and it only occurs in the production environment.
My environment:

    Operating System:
      Platform: darwin
      Arch: arm64
      Version: Darwin Kernel Version 22.5.0: Thu Jun  8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
    Binaries:
      Node: 20.9.0
      npm: 10.1.0
      Yarn: 4.0.1
      pnpm: N/A
    Relevant packages:
      next: 13.4.2
      eslint-config-next: 13.4.2
      react: 18.2.0
      react-dom: 18.2.0
      typescript: 5.2.2

However, after upgrading Next.js to the latest version (14.0.3), the issue seems to be gone.

@netergart

Same issue with "next": "^14.0.3" when running a custom server.

@EnriqCG
Copy link

EnriqCG commented Dec 7, 2023

I have just tried v14.0.4-canary.47 and the issue persists. I also tried Node.js v18 and v20.

We are only using the App Router. We do not use Prisma or NextAuth. This is affecting builds hosted on Vercel.com (including production).

It takes a little while for the issue to pop up after deploying, but after a few RSC renders, it happens quite often (~15% of the time).

@timcouchoud

Same issue here, happening very often, impossible to find where it comes from.

I am using "next": "^14.0.4", with nextAuth, and nextJS middleware (my app uses also Wundergraph/sdk)

Any update on this issue?

Thx a lot

@tehseenc

In case anyone is using Sentry, our issue turned out to be related to a bug with the @sentry/nextjs package. Bumping it up a version has fixed the issue on our end.

https://github.com/orgs/vercel/discussions/3248#discussioncomment-7851868

@timcouchoud

In case anyone is using Sentry, our issue turned out to be related to a bug with the @sentry/nextjs package. Bumping it up a version has fixed the issue on our end.

https://github.com/orgs/vercel/discussions/3248#discussioncomment-7851868

I am not using Sentry, but I would be very interested to understand which type of error from Sentry was solved. I get the "socket hang up" error quite often but have found no way to track down the issue so far... Thx

I am using nextjs 14.0.4, nextauth with middleware, and wundergraph as backend.

@stefvw93

Also have this exact problem, using NextJS 14.0.3. The error is not caught by our NextJS error boundary, and the user sees the default Vercel 500 error page (black background, white text).

The site works after simply refreshing. My initial thought (before finding this thread) was: could this have something to do with cookies from Vercel preview deployments?

@marpstar

marpstar commented Dec 22, 2023

Also experiencing this issue. Node 18/20, Next 14.0.4, next-auth 4.24.5. Specifically, my /api/auth/[...nextauth] route hangs. Initially it produces an "Outgoing request timed out after 3500ms" from next-auth, but if I extend the timeout past 60 seconds, I get the full error from Node:

https://next-auth.js.org/errors#signin_oauth_error socket hang up

error: {
  message: 'socket hang up',
  stack: 'Error: socket hang up\n' +
    '    at connResetException (node:internal/errors:720:14)\n' +
    '    at TLSSocket.socketOnEnd (node:_http_client:525:23)\n' +
    '    at TLSSocket.emit (node:events:526:35)\n' +
    '    at endReadableNT (node:internal/streams/readable:1359:12)\n' +
    '    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)',
  name: 'Error'
}

Not using Vercel; this is on a Windows Server 2022 VM and locally on my M1 Mac. It works fine when running npm run dev. It doesn't seem to be related to DNS or IPv4 vs IPv6: with logging enabled via NODE_DEBUG=http,https,net,tls, I can see that it is making a call to my OAuth provider, getting the correct IP addresses via DNS, and attempting to open a connection.

HTTPS 9140: reuse session for "sso.example.com:443:::::::::::::::"
NET 9140: pipe false null
NET 9140: connect: find host sso.example.com
NET 9140: connect: dns options { family: undefined, hints: 0 }
NET 9140: connect: autodetecting
NET 9140: _read - n 16384 isConnecting? true hasHandle? true
NET 9140: read wait for connection
NET 9140: connect/multiple: only one address found, switching back to single connection
NET 9140: connect: attempting to connect to X.XX.XXX.XXX:443 (addressType: 4)
NET 9140: afterConnect
NET 9140: _read - n 16384 isConnecting? false hasHandle? true
NET 9140: Socket._handle.readStart                   <---- it hangs at this step
NET 9140: destroy
NET 9140: close
NET 9140: close handle

I was seeing this intermittently, but over the last week something has changed and I am seeing it 100% of the time.

Again, this all works perfectly when running npm run dev, so it seems like the compilation/runtime is at least part of the problem.

Other issues I've found along the way that may be related?

@raphi

raphi commented Dec 22, 2023

Exact same issue on Node 18, Next 14.0.4, next-auth 4.24.5.

In pages/_error.js I've replaced export const getStaticProps with Error.getInitialProps, as the documentation states:

Error does not currently support Next.js Data Fetching methods like getStaticProps or getServerSideProps.

Not sure how it's related, but the random freezes we experienced in production every few days are now completely gone (2 weeks in a row without this issue)!

I'd guess a user from time to time gets an unexpected error (from authenticating in our case), which triggered this 500 error page and since getStaticProps isn't supported it freezes the entire app 🤷🏻‍♀️ Hope that helps
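
For reference, the data-fetching part of such a pages/_error.js could be sketched as below. It is modelled on the custom error page pattern in the Next.js docs, not on the actual code from the comment above. `ErrorContext` and `getErrorInitialProps` are hypothetical names, the context type is a trimmed stand-in for `NextPageContext`, and the fallback to 500 when the error carries no status code is an addition of this sketch.

```typescript
// Trimmed stand-in for the parts of NextPageContext used here.
interface ErrorContext {
  res?: { statusCode: number };
  err?: { statusCode?: number } | null;
}

// Derive the status code from the server response if there is one,
// otherwise from the error object, otherwise assume a client-side 404.
function getErrorInitialProps({ res, err }: ErrorContext): { statusCode: number } {
  const statusCode = res ? res.statusCode : err ? (err.statusCode ?? 500) : 404;
  return { statusCode };
}

// In pages/_error.js this would be attached to the component as:
//   Error.getInitialProps = getErrorInitialProps;
```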

@belmerbelandres

Hello guys, if you're using Sentry and keep getting 500 errors, refer to this thread: Error 500. This fixed my random error 500.

@AdolfVonKleist

Same thing here with "next": "13.4.12". Issue only appears when using next.config.js rewrites. I get:

- error Error: socket hang up
    at connResetException (node:internal/errors:720:14)
    at Socket.socketCloseListener (node:_http_client:474:25)
    at Socket.emit (node:events:526:35)
    at TCP.<anonymous> (node:net:323:12) {
  code: 'ECONNRESET'
}

for any request that takes more than 30s to complete. Seems related to the previous issue:

I can see that the remote server continues to process and eventually completes and returns a correct response - not the 500 that nextjs is claiming. Is there any way to eliminate the timeout or customize it to a longer period?
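
On the timeout question: Next.js has an `experimental.proxyTimeout` setting (in milliseconds) which, as far as I know, controls how long rewrites to external URLs wait before failing. Because it is experimental, its availability and behaviour may differ between versions, so treat this as a sketch to verify against your release. The 120-second value is an arbitrary illustration.

```typescript
// Sketch of a Next.js config object that raises the proxy timeout
// applied to rewritten (proxied) requests.
const nextConfig = {
  experimental: {
    // Timeout in milliseconds; 120_000 is an arbitrary example value.
    proxyTimeout: 120_000,
  },
};

// In next.config.js this object would be exported with:
//   module.exports = nextConfig;
```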

@samcx
Member

samcx commented Jan 3, 2024

Hi everyone,

I will be moving this issue to our Discussions, as there are multiple different issues shared in this thread that are causing a socket hang up (e.g., Sentry issue, improperly handled async in a Serverless Environment, etc.).

We encourage folks to file a new issue with a consistently reproducible repro if they are still coming across this issue.

Happy 2024!

@vercel vercel locked and limited conversation to collaborators Jan 3, 2024
@samcx samcx converted this issue into discussion #60148 Jan 3, 2024
