
0.13: [Bug]: in-cluster build sync error when root path is very long #4527

Closed
chrispsplash opened this issue Jun 2, 2023 · 12 comments · Fixed by #4867

Comments

@chrispsplash

Garden Bonsai (0.13) Bug

Current Behavior

When running garden build from an especially long project root path (over 70 characters), I've noticed an odd repeating error (log below).

For example, the full path being used below is /tmp/Extremely/Long/path-with/some-hyphens/and/whatnot/example-bonsai-sync-issue

command output
❯ KUBE_CONTEXT=amer-dev REGISTRY_HOSTNAME=reg-chrisprzybycien.tools.splashdevelop.com garden build
Build 🔨

Garden v0.13 (Bonsai) is a major release with significant changes. Please help us improve it by reporting any issues/bugs here:
https://go.garden.io/report-bonsai
→ Run garden util hide-warning 0.13-bonsai to disable this warning.
ℹ garden               → Running in Garden environment example.chrisprzybycien
ℹ providers            → Getting status...
✔ providers            → Cached (took 0.6 sec)
ℹ providers            → Run with --force-refresh to force a refresh of provider statuses.
ℹ graph                → Resolving actions and modules...
✔ graph                → Done (took 0.1 sec)
ℹ build.base-image     → -> Deploying garden-buildkit daemon in example-problem-chrisprzybycien namespace (was outdated)
ℹ build.base-image     → Waiting for resources to be ready...
ℹ build.base-image     → Resources ready
ℹ build.base-image     → Done!
ℹ build.base-image     → -> Deploying garden-buildkit daemon in example-problem-chrisprzybycien namespace (was outdated)
ℹ build.base-image     → Waiting for resources to be ready...
ℹ build.base-image     → Resources ready
ℹ build.base-image     → Done!
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 1/10)...
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 2/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 3/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 4/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 5/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 6/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 7/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 8/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 9/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 10/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 1/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 2/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 3/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 4/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 5/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 6/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 7/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 8/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 9/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Synchronization monitor exited with code 1.
⚠ build.base-image     → Could not connect to sync daemon, retrying (attempt 10/10)...
⚠ build.base-image     → Synchronization monitor exited with code 1.
✖ build.base-image     → Build failed (took 248.2 sec)
✖ build.base-image     → Failed processing Build type=container name=base-image (took 248.22 sec). Here is the output:

────────────────────────────────────────────────────────────────────────────────
Command "/Users/chrisprzybycien/.garden/tools/mutagen/5f82909105ed5d86/mutagen sync terminate k8s--build-sync--example--example-problem-chrisprzybycien--base-image--bubrjv4c" failed with code 1:

Attempting to start Mutagen daemon...
Error: unable to connect to daemon: connection timed out (is the daemon running?)
────────────────────────────────────────────────────────────────────────────────
1 build action(s) failed!

See .garden/error.log for detailed error message
❯ pwd
/tmp/Extremely/Long/path-with/some-hyphens/and/whatnot/example-bonsai-sync-issue

 /tmp/Extremely/Long/path-with/some-hyphens/and/whatnot/example-bonsai-sync-issue  main ?1 
❯

Expected behavior

Garden builds the container without issue.

Reproducible example

I created a super simple example in a repo called example-bonsai-sync-issue

Workaround

The problem disappears when the project is moved to a much shorter path (e.g. /tmp/example-bonsai-sync-issue).

Suggested solution(s)

🤷

Additional context

N/A

Your environment

  • OS: macOS Ventura 13.2.1
  • How I'm running Kubernetes: EKS

garden version
0.13.0

@chrispsplash
Author

Verified it still happens in 0.13.1 ✅

@Orzelius
Contributor

Orzelius commented Jun 8, 2023

@chrispsplash Thank you for reporting this and figuring out the underlying cause. It shouldn't be a hard fix, so we'll try to get to it in the coming weeks.

@Orzelius Orzelius added the bug label Jun 8, 2023
@Orzelius Orzelius moved this to Candidate in Core Weekly Jun 8, 2023
@vvagaytsev vvagaytsev moved this from Candidate to In Progress in Core Weekly Jun 9, 2023
@vvagaytsev vvagaytsev self-assigned this Jun 9, 2023
@vvagaytsev
Collaborator

I reproduced this locally with the gke example. In my case, the mutagen daemon exits with code 1 in the Mutagen.ensureSyncs function when it calls the getActiveSyncSession function.

So it fails in a different place: not when running the mutagen sync terminate command, but when calling mutagen sync list. In both cases, we pass the MUTAGEN_DATA_DIRECTORY env variable to the mutagen CLI.

I found an existing mutagen issue that might be relevant: mutagen-io/mutagen#433.
It contains a comment about Unix domain socket path length restrictions. Although MUTAGEN_DATA_DIRECTORY itself does not exceed that limit (on macOS it's 104 characters, see https://unix.stackexchange.com/questions/367008/why-is-socket-path-length-limited-to-a-hundred-chars/367012#367012), mutagen uses it as a prefix when constructing new paths, and the resulting values can exceed the socket path length limit.

I found the limit experimentally on my local machine: 85 characters for MUTAGEN_DATA_DIRECTORY. The value was /Users/vladimirvagaytsev/Repositories/garden/examples/some-name1/gke/.garden/mutagen, so the project root path was /Users/vladimirvagaytsev/Repositories/garden/examples/some-name1/gke, i.e. 70 characters.
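To illustrate the restriction being discussed, here is a small sketch (not Garden or mutagen code) showing that binding a Unix domain socket fails once the path exceeds the platform's sun_path limit (about 104 bytes on macOS, 108 on Linux):

```python
import os
import socket
import tempfile


def can_bind_unix_socket(path: str) -> bool:
    """Try to bind a Unix domain socket at `path`; return True on success."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        return True
    except OSError:
        # Raised (e.g. "AF_UNIX path too long") when the path exceeds the
        # platform's sun_path limit: ~104 bytes on macOS, ~108 on Linux.
        return False
    finally:
        sock.close()
        if os.path.exists(path):
            os.unlink(path)


tmp = tempfile.mkdtemp()
short_path = os.path.join(tmp, "d.sock")  # well under the limit: binds fine
long_path = os.path.join(tmp, "x" * 150)  # total path far past ~108 bytes: fails
```

This matches the observed behavior: the daemon's socket path is derived from MUTAGEN_DATA_DIRECTORY, so a long project root pushes the derived path over the limit even though the data directory itself is within it.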

Adding one more character to the path caused the mutagen daemon to fail with exit code 1, with no useful information in the error messages.

Since this is a socket path limitation, we can't implement a proper fix here. But we can mitigate it by emitting a warning when the path exceeds the 70-character threshold, and by warning about possible socket path length limitations on repeated mutagen failures. I'll file a PR for that soon.

@vvagaytsev
Collaborator

@chrispsplash it doesn't seem that the problem can be fixed in the Garden codebase. The issue comes from the implementation details of the underlying sync tool. See the comment above for the details.

In #4582 we introduced a warning to clarify the cause of the failure. The fix is available in the edge-bonsai release and will be included in 0.13.3.

Please let us know if it's ok to close this issue.

@vvagaytsev vvagaytsev moved this from In Progress to Done in Core Weekly Jun 9, 2023
@chrispsplash
Author

@vvagaytsev thanks that works for me!

@salotz

salotz commented Jul 6, 2023

What about setting up a shorter directory path under $HOME for MUTAGEN_DATA_DIRECTORY? I used mutagen for years and never ran into this limitation with long project paths; I think that's because mutagen itself uses the ~/.mutagen directory rather than a directory inside the project.

Even just supporting users manually setting this would be a helpful workaround. I tried exporting MUTAGEN_DATA_DIRECTORY myself, but it didn't seem to change anything.
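For reference, the suggestion amounts to something like the following sketch (illustration only; as noted in this thread, Garden set MUTAGEN_DATA_DIRECTORY itself at the time, so a user-level override had no effect):

```python
import os

# Point the mutagen daemon's data directory at a short, stable path under
# $HOME instead of one derived from the (possibly very long) project root.
# The directory name ~/.mutagen-garden is an arbitrary example.
env = dict(os.environ)
env["MUTAGEN_DATA_DIRECTORY"] = os.path.expanduser("~/.mutagen-garden")

# A mutagen CLI invocation run with this environment (e.g. via
# subprocess.run([...], env=env)) would then use the short data directory,
# keeping the daemon's socket path well under the sun_path limit.
```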


Thank you very much for adding the warning message, by the way; without it I would have been hopelessly confused by the error.

@stefreak
Member

stefreak commented Jul 14, 2023

After discussing with @vvagaytsev, we believe that there is a way to avoid this failure mode altogether (some ideas are here: https://discord.com/channels/817392104711651328/1088795679159222292/1127950349374869534).

There are also some tests in our codebase like "builds a Docker image and emits a namespace status event" that could be re-enabled once this issue has been resolved.

@stefreak stefreak reopened this Jul 14, 2023
@stefreak stefreak moved this from Done to Todo in Core Weekly Jul 14, 2023
@Walther Walther assigned shumailxyz and unassigned vvagaytsev Jul 17, 2023
@shumailxyz shumailxyz moved this from Todo to In Progress in Core Weekly Jul 19, 2023
@shumailxyz shumailxyz moved this from In Progress to Done in Core Weekly Jul 20, 2023
@vvagaytsev
Collaborator

vvagaytsev commented Jul 20, 2023

@chrispsplash @salotz the fix has been released in 0.13.9, please try it out.

@chrispsplash
Author

Confirmed!! Thanks!

@salotz

salotz commented Jul 27, 2023

@vvagaytsev Looks to be working for me as well. Thanks for the fix!


I will note that I still receive lots of warnings and errors when it tries to connect. It's highly variable, and if it doesn't automatically work itself out, as in the following example, a retry usually does it. I don't think it's related to this issue, but I thought I would mention it.

ℹ deploy.tests-deploy  → Deploying version v-71e8082725...
ℹ deploy.tests-deploy  → Waiting for resources to be ready...
ℹ deploy.tests-deploy  → Resources ready
✔ deploy.tests-deploy  → Done (took 2.1 sec)
ℹ deploy.tests-deploy  → Starting sync
ℹ deploy.tests-deploy  → Syncing ./src to /app/src in Deployment/tests-deployment (one-way)
ℹ deploy.tests-deploy  → Syncing ./tests/integration to /app/tests/integration in Deployment/tests-deployment (one-way)
⚠ deploy.tests-deploy  → Failed to start sync from ./src to /app/src in Deployment/tests-deployment. 5 attempts left.
⚠ deploy.tests-deploy  → Failed to start sync from ./src to /app/src in Deployment/tests-deployment. 4 attempts left.
⚠ deploy.tests-deploy  → Failed to start sync from ./src to /app/src in Deployment/tests-deployment. 3 attempts left.
ℹ deploy.db-migrate-debug → [sync]: Sync connected to source
ℹ deploy.db-migrate-debug → [sync]: Sync connected to target
ℹ deploy.db-migrate-debug → [sync]: Sync connected to target
⚠ deploy.tests-deploy  → Failed to start sync from ./src to /app/src in Deployment/tests-deployment. 2 attempts left.
⚠ deploy.tests-deploy  → Failed to start sync from ./src to /app/src in Deployment/tests-deployment. 1 attempts left.
⚠ deploy.tests-deploy  → Failed to start sync from ./src to /app/src in Deployment/tests-deployment. 0 attempts left.
✖ deploy.tests-deploy  → Failed processing Deploy type=kubernetes name=tests-deploy (took 60.28 sec). Here is the output:

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Command "/home/salotz/.garden/tools/mutagen/c8f927f6b7e9b6c5/mutagen sync create /home/salotz/tree/ibeks/devel/ungulata/src exec:'/home/salotz/.garden/tools/kubectl/49eb930aa565a80f/kubectl exec -i --context=minikube --namespace=ungulata-default --container tests Deployment/tests-deployment -- /.garden/mutagen-agent synchronizer':/app/src --name k-8-s-local-default-tests-deploy-deployment-tests-deployment-src-app-src --sync-mode one-way-safe -i /**/*.git -i **/*.garden -i __pycache__/" failed with code 1:

Error: unable to connect to beta: unable to connect to endpoint: unable to dial agent endpoint: unable to handshake with agent process: server magic number incorrect (error output: command terminated with exit code 126)

Here's the full output:

Connecting to agent (POSIX)...                                                  
Error: unable to connect to beta: unable to connect to endpoint: unable to dial agent endpoint: unable to handshake with agent process: server magic number incorrect (error output: command terminated with exit code 126)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


2 deploy action(s) failed!

See .garden/error.log for detailed error message
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
⚠ deploy.tests-deploy  → Failed to start sync from ./tests/integration to /app/tests/integration in Deployment/tests-deployment. 5 attempts left.
ℹ deploy.tests-deploy  → [sync]: Scanning files to sync
✔ deploy.tests-deploy  → Connected to sync target /app/tests/integration in Deployment/tests-deployment
ℹ deploy.tests-deploy  → [sync]: Scanning files to sync
ℹ deploy.tests-deploy  → [sync]: Saving sync archive
ℹ deploy.tests-deploy  → [sync]: Saving sync archive
ℹ deploy.tests-deploy  → [sync]: Watching for changes
✔ deploy.tests-deploy  → [sync]: Completed initial sync from ./tests/integration to /app/tests/integration in Deployment/tests-deployment
ℹ deploy.tests-deploy  → [sync]: Watching for changes

@vvagaytsev
Collaborator

@salotz, thank you for sharing extra details. You are right, that looks like another issue.

We'll take a look to see if we can reproduce it consistently. Could you please share a reproducible example?

@salotz

salotz commented Jul 28, 2023

We'll take a look if we can consistently reproduce it. Could you please share a reproducible example?


If I find something reproducible I will share in another issue. I haven't been able to reproduce it reliably myself.
