
Recently started getting "Too many open files..." spinning up cluster on Apple M1 #6072

Closed
cheslijones opened this issue Jun 23, 2021 · 34 comments
Assignees: briandealwis
Labels: kind/bug, kind/question, meta/release, priority/p2

Comments

@cheslijones

cheslijones commented Jun 23, 2021

Expected behavior

That my project would be deployed and spin-up without any issues.

Actual behavior

It is safe to say, I think, that this is isolated to Apple Silicon.

I get a flood of these "Too many open files..." warnings, the deployment slows to a snail's crawl, there are other messages along the lines of "failed to port forward, ... is taken, retrying...", and then eventually it acts like it starts trying to rebuild the images again. Basically it is a mess...

Interestingly you can access the services, but none of the communication between them is working.

Without -vdebug:

Screen.Recording.2021-06-23.at.12.29.11.PM.mov

With -vdebug:

Screen.Recording.2021-06-23.at.12.16.25.PM.mov

I've verified it is still working on AMD and Intel Linux and Windows (WSL2) machines. There it works perfectly fine; the application spins up quickly and without issues.

I don't have an Intel Mac around at the moment to test, but I'd imagine it works fine there as well; at least, that is what I was working on for two years up until about a month ago.

Pretty sure this was working perfectly about a month ago on an M1 Mac that has since been reformatted... but I could be mistaken. I reinstalled everything following the exact same steps, however.

Information

  • Skaffold version: 1.26.1
  • Operating system: macOS 11.4
  • Installed via: skaffold.dev (the darwin-arm64 version), but also tried brew with the same results
  • Contents of skaffold.yaml:

You can comment out the individual artifacts, or the manifests, to just deploy one service at a time. The results are the same... a ton of errors about "Too many open files..." and eventually it becomes unresponsive.

apiVersion: skaffold/v2beta12
kind: Config
build:
  artifacts:
  - image: admin
    context: admin
    sync:
      manual:
      - src: "src/**/*.php"
        dest: .
      - src: "conf/**/*.conf"
        dest: .
      - src: "src/Templates/**/*.tbs"
        dest: .
      - src: "src/css/**/*.css"
        dest: .
      - src: "src/js/**/*.js"
        dest: .
    docker:
      dockerfile: Dockerfile
  - image: admin-v2
    context: admin-v2
    sync:
      manual:
      - src: 'src/**/*.ts'
        dest: .
      - src: 'src/**/*.tsx'
        dest: .
      - src: '**/*.json'
        dest: .
      - src: 'public/**/*.html'
        dest: .
      - src: 'src/assets/sass/**/*.scss'
        dest: .
      - src: 'src/build/**/*.js'
        dest: .
    docker:
      dockerfile: Dockerfile.dev
  - image: api
    context: api
    sync:
      manual:
      - src: "**/*.py"
        dest: .
    docker:
      dockerfile: Dockerfile.dev
  - image: api-v2
    context: api-v2
    sync:
      manual:
      - src: "**/*.py"
        dest: .
    docker:
      dockerfile: Dockerfile.dev
  - image: client
    context: client
    sync:
      manual:
      - src: 'src/**/*.js'
        dest: .
      - src: 'src/**/*.jsx'
        dest: .
      - src: '**/*.json'
        dest: .
      - src: 'public/**/*.html'
        dest: .
      - src: 'src/assets/sass/**/*.scss'
        dest: .
      - src: 'src/build/**/*.js'
        dest: .
    docker:
      dockerfile: Dockerfile.dev
  - image: postgres
    context: postgres
    sync:
      manual:
      - src: "**/*.sql"
        dest: .
    docker:
      dockerfile: Dockerfile.dev
  local:
    push: false
deploy:
  kubectl:
    manifests:
      - k8s/dev/ingress.yaml
      - k8s/dev/postgres.yaml
      - k8s/dev/client.yaml
      - k8s/dev/admin.yaml
      - k8s/dev/admin-v2.yaml
      - k8s/dev/api.yaml
      - k8s/dev/api-v2.yaml
    defaultNamespace: dev

Steps to reproduce the behavior

  1. Unfortunately I can't post the project because it is proprietary. I suspect anyone with even a moderately sized or complex project will also run into the issue.
  2. skaffold dev --port-forward -n dev

These are other dependencies:

  • Docker Desktop for Apple Silicon 3.4.0 (65384)
  • Kubernetes v1.21.1 (enabled via Docker Desktop)
  • Minikube v1.21.0 (arm64)
  • Skaffold v1.26.1 (arm64)
  • Homebrew 3.2.0

Pretty sure only arm64 builds can be installed for these; in any case, I haven't tried non-arm versions.

This is what I've tried:

  • Several reformats of the system.
  • Using virtualization.framework and hypervisor.framework.
  • Installing everything in Terminal running as "Universal" or "Apple Silicon".
  • Running with VS Code as "Universal" or "Apple Silicon".
  • Running with VS Code as "Intel".
  • Installing everything in VS Code as "Intel" and running VS Code as "Intel".
  • 8GB and 16GB M1 models.
  • 4 CPU cores and 6GB RAM in Docker Desktop.
  • Minikube has 2 CPU cores and 4GB RAM.
  • Using Docker driver in Minikube.
@cheslijones
Author

cheslijones commented Jun 24, 2021

I actually did manage to get all of those messages to go away, but now I'm getting these errors:

First time:

[postgres] COPY 4844
WARN[0567] got unexpected event of type ERROR           
[admin] error: http2: client connection lost
WARN[0567] exit status 1                                
[admin-v2] error: http2: client connection lost
WARN[0569] exit status 1                                
[postgres] error: http2: client connection lost
WARN[0582] exit status 1                                
[api] error: http2: client connection lost
WARN[0583] exit status 1                                
[api-v2] error: http2: client connection lost
WARN[0583] exit status 1                                
[client] error: http2: client connection lost
WARN[0584] exit status 1                                
^CCleaning up...

Second time:

[postgres] COPY 5733
[client] error: unexpected EOF
[postgres] error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
[admin-v2] error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
[api-v2] error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
WARN[0664] exit status 1                                
WARN[0664] exit status 1                                
WARN[0664] exit status 1                                
WARN[0664] exit status 1                                
[admin] error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
WARN[0664] exit status 1                                
[api] error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
WARN[0664] exit status 1                                
^CCleaning up...
WARN[0714] could not map pods to service dev/postgres-cluster-ip-service-dev/5432: getting service dev/postgres-cluster-ip-service-dev: context canceled 
WARN[0714] could not map pods to service dev/api-v2-cluster-ip-service-dev/5001: getting service dev/api-v2-cluster-ip-service-dev: context canceled 
WARN[0714] could not map pods to service dev/client-cluster-ip-service-dev/3000: getting service dev/client-cluster-ip-service-dev: context canceled 
WARN[0714] could not map pods to service dev/api-cluster-ip-service-dev/5000: getting service dev/api-cluster-ip-service-dev: context canceled 
WARN[0714] could not map pods to service dev/admin-cluster-ip-service-dev/4000: getting service dev/admin-cluster-ip-service-dev: context canceled 
WARN[0714] could not map pods to service dev/admin-v2-cluster-ip-service-dev/4001: getting service dev/admin-v2-cluster-ip-service-dev: context canceled 

Looks like I'm actually getting this on Linux and WSL2 as well, so this part may be a separate issue.

I did the following and the "Too many open files" messages went away on the M1:

  • Docker Desktop arm64 (required; can't use amd64)
  • Enabled kubectl in Docker desktop
  • VS Code - Intel
  • Installed brew while in VS Code - Intel
  • Installed minikube arm64 via the website instructions (required; can't use amd64) in VS Code - Intel
  • skaffold installed via brew in VS Code - Intel... pretty sure this is the amd64 version
  • Spun up the cluster in VS Code - Intel

Didn't get the "Too many open files..." warning by doing this, but now getting the http2 errors followed by:

 - stdout: ""
 - stderr: "Unable to connect to the server: net/http: TLS handshake timeout\n"
 - cause: exit status 1 

@briandealwis
Member

I don't think this is a Skaffold problem: I run Skaffold on an M1, admittedly with fewer containers, and I don't see this issue.

It's worth poking around with lsof and trying to find out which processes have an inordinate number of open files.

$ lsof -n | awk '{ counts[$1]++} END { for(c in counts) { print counts[c], c}}' | sort -n | tail
104 Terminal
124 AppleSpell
170 Google\x20Chrome\x20Helper
243 UserEventAgent
243 app_mode_loader
253 corespotlightd
381 eclipse
472 Google\x20Chrome
884 zsh
2684 Google\x20Chrome\x20Helper\x20(Renderer)

And check your sysctl values too:

$ sysctl -a | grep files
kern.maxfiles: 49152
kern.maxfilesperproc: 24576
kern.filesetuuid: FB10CC0A-B8BA-C020-BC47-A50D64476F11
kern.num_files: 5191

Ah, now that I'm writing this, I do remember having this problem with my older MBP. For some reason, my default ulimit for the maximum number of open files (ulimit -n) was set really low. This doc describes a way to bump the settings:

https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c
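
For example, a quick way to check the current soft limit and raise it just for the current shell session before launching Skaffold (65536 here is an arbitrary value, and the change only lasts as long as the shell):

$ ulimit -n
$ ulimit -n 65536
$ skaffold dev   # or whatever dev invocation you normally use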

briandealwis added the kind/question label on Jun 24, 2021
@cheslijones
Author

What version of skaffold are you using on your M1? arm64 or amd64?

I found that installing the skaffold-darwin-amd64 version inside either the i386 or arm64 terminal does get it working, and I'm actually able to work. It is really slow to spin up, though. Installing skaffold-darwin-arm64 in either an i386 or arm64 terminal just isn't working for me for some reason.

It looks like my ulimit -n is at 8192. Maybe I'll increase that and give the arm64 version a try again.

@cheslijones
Author

OK, gave the link you provided a shot.

At 20000, I got the same errors, though less frequently, and ultimately the cluster failed to load due to:

unable to stat file "admin/Dockerfile": stat admin/Dockerfile: too many open files in system

So I increased it to 524288, which caused my computer to crash the first time; the second time, Terminal crashed. Maybe there is a value somewhere in the middle that clears up the issue without causing crashes.

But given that the skaffold-darwin-amd64 version works OK, albeit slowly, I might have to stick with that for now.

@tejal29
Member

tejal29 commented Jul 1, 2021

Thanks @cheslijones for raising this issue. I am going to keep this open for now.

tejal29 added the priority/awaiting-more-evidence label on Jul 1, 2021
@MaxVale46

MaxVale46 commented Jul 18, 2021

Hi all.
Hoping it will help: I have the same problem on a MacBook Air (M1, 2020).
It's a very small project that I had just started when the error appeared. Currently it includes only one '.ts' file and a few files in node_modules.

Information

Skaffold version: 1.28.0 arm64
Operating system: macOS Big Sur (Version 11.4)
Installed via: skaffold.dev

Contents of skaffold.yaml:

    apiVersion: skaffold/v2beta16
    kind: Config
    deploy:
      kubectl:
        manifests:
          - ./infra/k8s/*
    build:
      local:
        push: false
      artifacts:
        - image: ADMIN/auth
          context: auth
          docker:
            dockerfile: Dockerfile
          sync:
            manual:
              - src: 'src/**/*.ts'
                dest: .

Some logs with -vdebug

    DEBU[0000] 2 manifests to deploy. 2 are updated or new  
    DEBU[0000] Running command: [kubectl --context docker-desktop apply -f -] 
     - deployment.apps/auth-depl configured
     - service/auth-srv configured
    INFO[0000] Deploy completed in 417.480333ms             
    Waiting for deployments to stabilize...
    DEBU[0000] getting client config for kubeContext: ``    
    DEBU[0000] checking status deployment/auth-depl         
    DEBU[0001] Running command: [kubectl --context docker-desktop rollout status deployment auth-depl --namespace default --watch=false] 
    DEBU[0002] Command output: [deployment "auth-depl" successfully rolled out
    ] 
    DEBU[0002] Fetching events for pod "auth-depl-55968bd7fb-tqmk4" 
     - deployment/auth-depl is ready.
    Deployments stabilized in 1.093 second
    DEBU[0002] getting client config for kubeContext: ``    
    Press Ctrl+C to exit
    INFO[0002] Streaming logs from pod: auth-depl-55968bd7fb-tqmk4 container: auth 
    INFO[0002] Streaming logs from pod: auth-depl-6f9c955cbf-hlv46 container: auth 
    DEBU[0002] Running command: [kubectl --context docker-desktop logs --since=2s -f auth-depl-55968bd7fb-tqmk4 -c auth --namespace default] 
    DEBU[0002] Running command: [kubectl --context docker-desktop logs --since=2s -f auth-depl-6f9c955cbf-hlv46 -c auth --namespace default] 
    DEBU[0002] Couldn't start notify trigger. Falling back to a polling trigger 
    WARN[0002] ./infra/k8s/* did not match any file         
    INFO[0002] files deleted: [/ticketing/infra/k8s/auth-depl.yaml] 
    Watching for changes...
    [auth] 
    [auth] > [email protected] start
    [auth] > ts-node-dev src/index.ts
    [auth] 
    [auth] [INFO] 12:17:36 ts-node-dev ver. 1.1.8 (using ts-node ver. 9.1.1, typescript ver. 4.3.5)
    DEBU[0003] Found dependencies for dockerfile: [{package.json /app true} {. /app true}] 
    DEBU[0003] Skipping excluded path: node_modules         
    INFO[0003] files deleted: [auth/src/index.ts]           
    INFO[0003] files added: [/ticketing/infra/k8s/auth-depl.yaml] 
    Syncing 1 files for ADMIN/auth:1c8c6c62510e86130d2c73973ebee34f6998c3ac7b845eefbcc533a13d8abe7d
    INFO[0003] Deleting files: map[auth/src/index.ts:[/app/src/index.ts]] from ADMIN/auth:1c8c6c62510e86130d2c73973ebee34f6998c3ac7b845eefbcc533a13d8abe7d 
    DEBU[0003] getting client config for kubeContext: ``    
    DEBU[0003] Running command: [kubectl --context docker-desktop exec auth-depl-6f9c955cbf-hlv46 --namespace default -c auth -- rm -rf -- /app/src/index.ts] 
    DEBU[0003] Running command: [kubectl --context docker-desktop exec auth-depl-55968bd7fb-tqmk4 --namespace default -c auth -- rm -rf -- /app/src/index.ts] 
    WARN[0003] Skipping deploy due to sync error: deleting files: starting command /usr/local/bin/kubectl --context docker-desktop exec auth-depl-6f9c955cbf-hlv46 --namespace default -c auth -- rm -rf -- /app/src/index.ts: pipe: too many open files 
    Watching for changes...
    DEBU[0004] Found dependencies for dockerfile: [{package.json /app true} {. /app true}] 
    DEBU[0004] Skipping excluded path: node_modules         
    DEBU[0004] stopping accessor                            
    DEBU[0004] stopping debugger                            
    Tags used in deployment:
     - ADMIN/auth -> ADMIN/auth:1c8c6c62510e86130d2c73973ebee34f6998c3ac7b845eefbcc533a13d8abe7d
    DEBU[0004] Local images can't be referenced by digest.
    They are tagged and referenced by a unique, local only, tag instead.
    See https://skaffold.dev/docs/pipeline-stages/taggers/#how-tagging-works 
    Starting deploy...
    DEBU[0004] getting client config for kubeContext: ``    
    DEBU[0004] Running command: [kubectl --context docker-desktop create --dry-run=client -oyaml -f /ticketing/infra/k8s/auth-depl.yaml] 
    WARN[0004] Skipping deploy due to error: kubectl create: starting command /usr/local/bin/kubectl --context docker-desktop create --dry-run=client -oyaml -f /ticketing/infra/k8s/auth-depl.yaml: pipe: too many open files 
    Watching for changes...
    DEBU[0005] Found dependencies for dockerfile: [{package.json /app true} {. /app true}] 
    DEBU[0005] Skipping excluded path: node_modules         
    DEBU[0006] Found dependencies for dockerfile: [{package.json /app true} {. /app true}] 
    DEBU[0006] Skipping excluded path: node_modules         
    DEBU[0007] Found dependencies for dockerfile: [{package.json /app true} {. /app true}] 
    DEBU[0007] Skipping excluded path: node_modules

I tried to increase the file limit but the error persists.

(If I delete the node_modules folder everything works well and I have no errors, but that is not a solution.)

UPDATE
With the amd64 binary I have no errors either.

@demisx
Contributor

demisx commented Jul 25, 2021

We've also run into this issue: my colleague on an Apple M1 laptop sees this error, while I, on a MacBook Pro (15-inch, 2018) / 2.2 GHz 6-Core Intel Core i7, do not. This is what he sees after starting skaffold on the Apple M1:

WARN[3018] Ignoring changes: listing files: unable to evaluate build args: 
reading dockerfile: open /ismedia-nx-scaffold/choi/apps/api/Dockerfile: too many open files

# And predictably the re-deploy fails:

WARN[3091] Skipping deploy due to sync error: copying files: starting 
command /usr/local/bin/kubectl --context docker-desktop exec api-74bc656856-gxb48 
--namespace default -c api -i -- tar xmf - -C / --no-same-owner: fork/exec /usr/local/bin/kubectl: too many open files
$ lsof -n | awk '{ counts[$1]++} END { for(c in counts) { print counts[c], c}}' | sort -n | tail
...
2921 Google
24570 skaffold # <-- I don't even have skaffold reported on my Intel chip MacBook Pro.

The issue seems to go away if my colleague deletes the root node_modules/ directory. It's as if skaffold is trying to watch the entire node_modules/ on the M1, but ignores it on the older Intel chip version.

I've created a sample repo to recreate this issue. Please follow the instructions in choi/README.md to start the project.
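
A rough way to check whether the skaffold process itself is the one holding node_modules files open (this assumes a single skaffold process is running, so adjust the pgrep as needed):

# total open file descriptors for the skaffold process, and how many point into node_modules
lsof -n -p "$(pgrep -x skaffold)" | wc -l
lsof -n -p "$(pgrep -x skaffold)" | grep -c node_modules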

@demisx
Contributor

demisx commented Aug 13, 2021

This is what we have to do ☹️ on the M1 laptops to be able to use skaffold:

npm install
npm run skaffold # generate images
# ^C to exit skaffold
rm -rf node_modules
npm run skaffold

On non-M1 laptops, we can run skaffold with one command.

@sampullman

I tried to look into the sync/dockerignore code involved here, but didn't find anything obvious that would cause things to behave differently on my M1.

While poking around, I did notice that the issue resolves itself by using a local skaffold build instead of the official release.

briandealwis added the kind/bug, meta/release, and priority/p1 labels and removed the priority/awaiting-more-evidence label on Aug 16, 2021
@briandealwis
Member

@sampullman: While poking around, I did notice that the issue resolves itself by using a local skaffold build instead of the official release.

Ah! 💡 We're currently cross-compiling Skaffold for darwin/arm64 but have to disable cgo (#5286) as we don't have the required headers and libraries available. I've been meaning to retool our release process and this provides the impetus.

briandealwis self-assigned this on Aug 16, 2021
@resumerise

skaffold-darwin-amd64

With skaffold-darwin-amd64 it works like a charm; the error is gone. It would be great if the arm64 version could be fixed.

briandealwis added the priority/p2 label and removed the priority/p1 label on Oct 4, 2021
@ryan6416

ryan6416 commented Nov 8, 2021

@resumerise @cheslijones I initially installed skaffold via homebrew. So I did brew uninstall skaffold and installed skaffold-darwin-amd64 from https://skaffold.dev/docs/install/#managed-ide

# For macOS on x86_64 (amd64)
curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-darwin-amd64 && \
sudo install skaffold /usr/local/bin/

But I'm still getting the same error.

Running: skaffold dev -v info --port-forward --rpc-http-port 64180 --filename /Users/ryanefendy/Documents/gopuff/payments-service/skaffold.yaml --wait-for-deletions-max 2m0s --wait-for-connection
starting gRPC server on port 64186
starting gRPC HTTP server on port 64180 (proxying to 64186)
Skaffold &{Version:v1.34.0 ConfigVersion:skaffold/v2beta25 GitVersion: GitCommit:22cfab75ffb305e7af220910af2f48d0a5c0e6af BuildDate:2021-10-27T00:34:25Z GoVersion:go1.16beta1 Compiler:gc Platform:darwin/arm64 User:}
Loaded Skaffold defaults from \"/Users/ryanefendy/.skaffold/config\"
Using kubectl context: docker-desktop
build concurrency first set to 1 parsed from *local.Builder[0]
final build concurrency value is 1
Listing files to watch...
 - gopuffd.azurecr.io/images/payments-service
List generated in 115.554125ms
Generating tags...
 - gopuffd.azurecr.io/images/payments-service -> gopuffd.azurecr.io/images/payments-service:0.0.1-1-ga68ce4e-dirty
Checking cache...
Tags generated in 50.266209ms
 - gopuffd.azurecr.io/images/payments-service: Found Locally
Cache check completed in 856.951042ms
Tags used in deployment:
 - gopuffd.azurecr.io/images/payments-service -> gopuffd.azurecr.io/images/payments-service:e6c5eca61b2cc3825b9f26ff905770ea3637e2d8aaaaf6881cdb51e191656d7b
Starting deploy...
Deploying with helm v3.6.3 ...
Helm release payments-service not installed. Installing...
Building helm dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading redis from repo https://charts.bitnami.com/bitnami
Downloading postgresql from repo https://charts.bitnami.com/bitnami
Deleting outdated charts
W1109 08:37:43.407518   34408 warnings.go:70] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W1109 08:37:44.205835   34408 warnings.go:70] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
NAME: payments-service
LAST DEPLOYED: Tue Nov  9 08:37:43 2021
NAMESPACE: local-payments
STATUS: deployed
REVISION: 1
W1109 08:37:46.888837   34324 warnings.go:70] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
Deploy completed in 10.499 seconds
Waiting for deployments to stabilize...
 - local-payments:deployment/payments-service: waiting for rollout to finish: 0 of 1 updated replicas are available...
 - local-payments:deployment/payments-service is ready.
Deployments stabilized in 11.426 seconds
Port forwarding service/redis-replicas in namespace local-payments, remote port 6379 -> http://127.0.0.1:6379
Port forwarding service/redis-master in namespace local-payments, remote port 6379 -> http://127.0.0.1:6380
Port forwarding service/postgresql in namespace local-payments, remote port 5432 -> http://127.0.0.1:5432
Port forwarding service/payments-service in namespace local-payments, remote port 80 -> http://127.0.0.1:4503
Port forwarding service/postgresql-headless in namespace local-payments, remote port 5432 -> http://127.0.0.1:5433
Port forwarding service/redis-headless in namespace local-payments, remote port 6379 -> http://127.0.0.1:6381
Press Ctrl+C to exit
Streaming logs from pod: payments-service-7fccbdc75b-nvqv6 container: payments-service
Streaming logs from pod: payments-service-7fccbdc75b-nvqv6 container: wait-for-postgres
[wait-for-postgres]waiting for postgres
[wait-for-postgres]waiting for postgres
[wait-for-postgres]waiting for postgres
[wait-for-postgres]waiting for postgres
files modified: [charts/payments-service/charts/postgresql-10.13.4.tgz charts/payments-service/charts/redis-15.5.4.tgz]
Cleaning up...
deployer cleanup:pipe: too many open files
[payments-service]
[payments-service]> [email protected] migrate:up
[payments-service]> ts-node ./node_modules/typeorm/cli.js migration:run
[payments-service]
listing files: issue walking releases: too many open files
Skaffold exited with code 1.
Cleaning up...
W1109 08:38:04.450505   34592 warnings.go:70] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
release "payments-service" uninstalled

Information

  • Skaffold version: v1.34.0 (tried both skaffold-darwin-amd64 and skaffold-darwin-arm64)
  • Apple M1 & Operating system: macOS Big Sur Version 11.5.2

  • How do I check the skaffold binary architecture? I'm trying to figure out whether the skaffold I have installed on my machine is arm64 or amd64.
  • Any suggestions/recommendations here?

@erulabs

erulabs commented Nov 14, 2021

@ryan-efendy you can find out which build of skaffold you have with: file $(which skaffold) && skaffold version

Also, @briandealwis - I can confirm the issue goes away if I use a locally compiled version! (v1.34.0-54-gbfc43b09a works great while v1.34.0 from the releases page does not).

For now I'll start recommending folks build their own copy of skaffold to get around this issue.

@ryan6416

@erulabs what do you mean by

use a locally compiled version

I installed skaffold via brew

> file $(which skaffold) && skaffold version                                                                                              
/opt/homebrew/bin/skaffold: Mach-O 64-bit executable arm64
v1.34.0

Can you elaborate? As in, download the standalone binary vs. using a package manager?

For now I'll start recommending folks build their own copy of skaffold to get around this issue.

@erulabs

erulabs commented Nov 15, 2021

@ryan-efendy - by "locally compiled version" I mean literally cloning the https://github.com/GoogleContainerTools/skaffold repository on an M1 Mac and running make && make install and using that skaffold binary directly, rather than using a downloaded one. The bug seems to only exist in the cross-compiled version that is available for download from brew and/or the releases page here on github.
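
Roughly, assuming Go and make are already installed (make install normally drops the binary into your GOPATH bin, so the exact path may differ on your setup):

git clone https://github.com/GoogleContainerTools/skaffold.git
cd skaffold
make && make install
# confirm which binary your shell now resolves to
file $(which skaffold) && skaffold version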

@iyosayi

iyosayi commented Dec 2, 2021

@ryan-efendy - by "locally compiled version" I mean literally cloning the https://github.com/GoogleContainerTools/skaffold repository on an M1 Mac and running make && make install and using that skaffold binary directly, rather than using a downloaded one. The bug seems to only exist in the cross-compiled version that is available for download from brew and/or the releases page here on github.

This did the trick! After compiling the binary myself, I just copied it from my GOPATH bin location over to /usr/local/bin/skaffold, and now I can use it from anywhere on my system just by typing skaffold. Thank you very much.

@Mrhoho

Mrhoho commented Jan 6, 2022

This is my workaround

node-setup-daemon-set.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-setup
  namespace: kube-system
  labels:
    k8s-app: node-setup
spec:
  selector:
    matchLabels:
      name: node-setup
  template:
    metadata:
      labels:
        name: node-setup
    spec:
      containers:
        - name: node-setup
          image: ubuntu
          command: ["/bin/sh", "-c"]
          args:
            [
              "/script/node-setup.sh; while true; do echo Sleeping && sleep 3600; done",
            ]
          volumeMounts:
            - name: node-setup-script
              mountPath: /script
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
      volumes:
        - name: node-setup-script
          configMap:
            name: node-setup-script
            defaultMode: 0755
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-setup-script
  namespace: kube-system
data:
  node-setup.sh: |
    #!/bin/bash
    # change the file-watcher max-count on each node to 524288

    # insert the new value into the system config
    sysctl -w fs.inotify.max_user_watches=524288
    sysctl -w fs.inotify.max_user_instances=512

    # check that the new value was applied
    cat /proc/sys/fs/inotify/max_user_watches
    cat /proc/sys/fs/inotify/max_user_instances
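
For anyone copying this: it can be applied and checked with the usual kubectl commands, e.g.:

kubectl apply -f node-setup-daemon-set.yaml
kubectl -n kube-system get pods -l name=node-setup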

@MarlonGamez
Contributor

@briandealwis following up on your comment here, do you think that the recent changes to our build process should have resolved this issue?

@briandealwis
Member

Oh yes! To all following this issue, this bug should be fixed in Skaffold v1.35.1.

@tejal29
Member

tejal29 commented Jan 26, 2022

@cheslijones and @Mrhoho did v1.35.1 fix your issue?

@Mrhoho

Mrhoho commented Jan 27, 2022

I use the locally compiled version v1.35.2 and the problem persists, @tejal29

@briandealwis
Member

@Mrhoho are you using a version that you compiled yourself? Or are you using the version from:

https://github.com/GoogleContainerTools/skaffold/releases/tag/v1.35.2

@Mrhoho

Mrhoho commented Jan 28, 2022

@ryan-efendy - by "locally compiled version" I mean literally cloning the https://github.com/GoogleContainerTools/skaffold repository on an M1 Mac and running make && make install and using that skaffold binary directly, rather than using a downloaded one. The bug seems to only exist in the cross-compiled version that is available for download from brew and/or the releases page here on github.

@briandealwis

@briandealwis
Member

@Mrhoho interesting! If you're willing, I'd like you to try a few things to help us narrow down the issue:

  1. Could you try using the binary from the v1.35.2 release just to rule out any local configuration issue?
  2. Could you try using the darwin-amd64 binary from the v1.35.2 release using Rosetta? Others found that they didn't have the problems there.
  3. Could you check your file limits using the commands listed above?

@Mrhoho

Mrhoho commented Jan 28, 2022

Macbook pro m1 max
k3d version v5.2.2
k3s version v1.21.7-k3s1 (default)

1. macOS arm64

curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/v1.35.2/skaffold-darwin-arm64 && chmod +x skaffold && sudo mv skaffold /usr/local/bin

which skaffold
/usr/local/bin/skaffold

skaffold dev --port-forward

[envoy-sidecar] [2022-01-28 14:33:32.472][1][info][upstream] [source/server/lds_api.cc:77] lds: add/update listener 'public_listener:10.42.0.133:20000'
[envoy-sidecar] [2022-01-28 14:33:32.472][1][info][upstream] [source/server/lds_api.cc:77] lds: add/update listener 'addsvc:127.0.0.1:7081'
[envoy-sidecar] [2022-01-28 14:33:32.473][1][info][upstream] [source/server/lds_api.cc:77] lds: add/update listener 'jaeger:127.0.0.1:6831'
[envoy-sidecar] [2022-01-28 14:33:32.473][1][info][config] [source/server/listener_manager_impl.cc:784] all dependencies initialized. starting workers
[envoy-sidecar] failed to create fsnotify watcher: too many open files

2.macOS amd64

curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/v1.35.2/skaffold-darwin-amd64 && chmod +x skaffold && sudo mv skaffold /usr/local/bin

file $(which skaffold) && skaffold version
/usr/local/bin/skaffold: Mach-O 64-bit executable x86_64
v1.35.2

skaffold dev --port-forward

[docsvc] 2022/01/28 14:46:29 debug logging disabled
[docsvc] 2022/01/28 14:46:29 Initializing logging reporter
[addsvc] failed to create fsnotify watcher: too many open files[docsvc] 2022/01/28 14:46:29 debug logging disabled
[docsvc] level=info ts=2022-01-28T14:46:29.027129222Z caller=main.go:148 service=docsvc protocol=HTTP exposed=7030
[docsvc] level=info ts=2022-01-28T14:46:29.027694847Z caller=main.go:161 service=docsvc protocol=GRPC protocol=GRPC exposed=7031
[docsvc] failed to create fsnotify watcher: too many open files[copy-consul-bin] failed to create fsnotify watcher: too many open files

3.File limits

lsof -n | awk '{ counts[$1]++} END { for(c in counts) { print counts[c], c}}' | sort -n | tail

228 cloud-dri
273 MTLCompil
278 UserEvent
303 cloudd
335 WeChat
597 com.apple
631 QQ
941 Code\x20H
1887 Electron
2145 Google

sysctl -a | grep files

kern.maxfiles: 524288
kern.maxfilesperproc: 524288
kern.filesetuuid: 6FE5ED52-0EAA-BAFE-2188-14C92CA517B4
kern.num_files: 8063

@briandealwis

@Mrhoho

Mrhoho commented Jan 28, 2022

I have another problem: after a few minutes of skaffold dev --port-forward I lose all output, but the cluster app is running fine. Any ideas what's going on? 🤔

[docsvc] 2022/01/28 15:30:16 Reporting span ebdd360774a673ea:6825845e15599835:649556d57eecec23:1
[docsvc] level=info ts=2022-01-28T15:30:16.410448758Z caller=middleware.go:19 service=docsvc method=create transport_error=null took=397.10525ms
[router-http] 2022/01/28 15:30:16 Reporting span ebdd360774a673ea:649556d57eecec23:3bf5716890ea1e2b:1
[envoy-sidecar] error: unexpected EOF
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
[envoy-sidecar] error: unexpected EOF
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
[prometheus-statsd] error: unexpected EOF
[envoy-sidecar] error: unexpected EOF
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
[envoy-sidecar] error: unexpected EOF
[router-grpc] error: unexpected EOF
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
[envoy-sidecar] error: unexpected EOF
WARN[0620] exit status 1                                 subtask=-1 task=DevLoop
[envoy-sidecar] error: unexpected EOF
WARN[0621] exit status 1                                 subtask=-1 task=DevLoop
[foosvc] error: unexpected EOF
[prometheus-statsd] error: unexpected EOF
[prometheus-statsd] error: unexpected EOF
WARN[0621] exit status 1                                 subtask=-1 task=DevLoop
WARN[0621] exit status 1                                 subtask=-1 task=DevLoop
WARN[0621] exit status 1                                 subtask=-1 task=DevLoop
[website] error: unexpected EOF
WARN[0653] exit status 1                                 subtask=-1 task=DevLoop
[addsvc] error: unexpected EOF
WARN[0741] exit status 1                                 subtask=-1 task=DevLoop
[router-http] error: unexpected EOF
[docsvc] error: unexpected EOF
WARN[0744] exit status 1                                 subtask=-1 task=DevLoop
WARN[0744] exit status 1                                 subtask=-1 task=DevLoop

@briandealwis
Member

Odd! Your file limits are an order of magnitude higher than my machine's settings.

I should have asked two other things:

  1. what does ulimit -a show from your shell?
  2. what does launchctl limit show?

There are a number of reports similar to this with the M1, saying the default values for maxfiles are too low. You may get further by running ulimit -n 65536 (or higher!) before launching skaffold.

@Mrhoho

Mrhoho commented Jan 29, 2022

1.ulimit -a

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       10666
-n: file descriptors                524288

2.launchctl limit
cpu unlimited unlimited
filesize unlimited unlimited
data unlimited unlimited
stack 8372224 67092480
core 0 unlimited
rss unlimited unlimited
memlock unlimited unlimited
maxproc 10666 16000
maxfiles 524288 524288

I modified the default values:

  • vi /Library/LaunchDaemons/limit.maxfiles.plist
<?xml version="1.0" encoding="UTF-8"?>  
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"  
        "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">  
  <dict>
    <key>Label</key>
    <string>limit.maxfiles</string>
    <key>ProgramArguments</key>
    <array>
      <string>launchctl</string>
      <string>limit</string>
      <string>maxfiles</string>
      <string>524288</string>
      <string>524288</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>ServiceIPC</key>
    <false/>
  </dict>
</plist>
plutil /Library/LaunchDaemons/limit.maxfiles.plist
/Library/LaunchDaemons/limit.maxfiles.plist: OK

sudo chown root:wheel /Library/LaunchDaemons/limit.maxfiles.plist
sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
  • Docker Desktop → Preferences → Docker Engine
{
  "experimental": false,
  "features": {
    "buildkit": true
  },
  "default-ulimits": {
    "nofile": {
      "Soft": 524288,
      "Hard": 524288,
      "Name": "nofile"
    }
  },
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "20GB"
    }
  }
}
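
A quick way to confirm that the default-ulimits from the daemon config actually reach containers (assuming the setting took effect, it should print 524288):

docker run --rm ubuntu sh -c 'ulimit -n'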

@briandealwis
Member

I'm wondering what process has these open files, and what these files may be. Can you try running that lsof command periodically during your skaffold dev? I'm curious whether the problem is Skaffold or Docker. You could run lsof before starting and again once things go wrong, then diff the results to see what new files are opened.
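
Something along these lines would do it (the file names are arbitrary):

$ lsof -n > /tmp/lsof-before.txt
# ... wait until the errors start appearing, then:
$ lsof -n > /tmp/lsof-after.txt
# count the newly opened file entries per process name
$ diff /tmp/lsof-before.txt /tmp/lsof-after.txt | grep '^>' | awk '{print $2}' | sort | uniq -c | sort -n | tail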

@Mrhoho

Mrhoho commented Jan 29, 2022

diff

@briandealwis
Member

Thanks for that lsof diff @Mrhoho. Based on the diff @@ markers, you look to have fewer than 15000 open files, which should be well within your limits. It looks like there are 14 instances of kubectl running, presumably for kubectl logs and kubectl port-forward.

Two observations:

  1. You seem to be using kubectl from homebrew. We've had reports of very odd resource use from kubectl from homebrew (MacOS Big Sur killing skaffold process -- violates CPU wakes limit #5161), for reasons unknown. But perhaps you could try downloading kubectl directly from dl.k8s.io (see the sketch just after this list)?
  2. There seem to be thousands of files open for Google, which I suspect is Google Chrome. Could you try quitting Chrome?
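
For point 1, grabbing kubectl straight from dl.k8s.io would look roughly like this (this mirrors the standard install steps from the Kubernetes docs for darwin/arm64):

$ curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl"
$ chmod +x kubectl && sudo mv kubectl /usr/local/bin/kubectl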

And just to confirm, you see this when you are not using your fs.inotify.* patch in #6072 (comment), correct? I don't understand how that patch helps, to be honest, unless you have a file-watcher running within your container images?

@Mrhoho

Mrhoho commented Jan 30, 2022

Thank you for your help!

1. Doesn't work

2. Still doesn't work

Yes, I did not apply this patch.

I don't know the exact reason, but this patch worked for me.

Thanks again

@briandealwis
Member

@Mrhoho: I think you're hitting a different issue than what's described in this thread. The symptoms described by others arise from Skaffold executing commands within macOS [*], whereas your symptoms seem to originate within the Docker containers. Given that you're having to tweak the fs.inotify flags from within each node, there must be some component in one of your container images that's hitting resource limits within the Docker VM. Skaffold doesn't run a file-watcher inside the containers or inside the cluster, and I'm struggling to think how Skaffold could be at fault here.

[*] @cheslijones did mention seeing odd http2 errors, though those happened on Linux and WSL2 too. I'm fairly certain they are in-container errors, as Skaffold errors would have been reported with a leading WARN[xxx].

@briandealwis
Member

briandealwis commented Mar 5, 2022

Closing as the principal issue seems solved with the new cgo-enabled build process.
