Can't run cloudflared on OT-2 #189

Open
arogozhnikov opened this issue Nov 30, 2022 · 16 comments
Comments

@arogozhnikov

cloudflared is a communication utility that connects from/to an internal network of the company.
It provides zero-trust connection from the internet.

It supports pretty much any OS of any distribution.

However, buildroot locks systemd for editing, and when cloudflared tries to install its service, I run into this problem:

2022-11-30T16:46:41Z INF Using Systemd
2022-11-30T16:46:41Z ERR error generating service template error="error writing /etc/systemd/system/cloudflared.service: open /etc/systemd/system/cloudflared.service: read-only file system"
error writing /etc/systemd/system/cloudflared.service: open /etc/systemd/system/cloudflared.service: read-only file system

Are there any tools to override this? I saw the discussion in #106 about allowing customers to use services.

@sfoster1
Member

sfoster1 commented Nov 30, 2022

The root filesystem (which includes /usr, /lib, /bin, and /etc which is where default systemd units are stored) in the ot-2 is mounted read-only to prevent modification. This is because that root filesystem is completely overwritten during update - so if you put a systemd service in /etc, whenever you update your robot it will be gone.

/home, /var, and /data are on a separate partition that does not get overwritten during update. That means that the user unit search path /home/.config/systemd/user/ is a good place to put custom systemd units - try there?
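A quick way to see which locations are actually writable before placing files there is to probe them. This is a minimal sketch, not an Opentrons tool: the function name `can_write_dir` is hypothetical, and it simply tries to create and delete a temporary file instead of trusting permission bits, since root can still be refused on a read-only mount.

```shell
# Hypothetical helper: return 0 if a file can be created in DIR, 1 otherwise.
# The redirection runs in a subshell so a failure cannot abort the caller.
can_write_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    probe="$dir/.write-probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Example use (paths taken from this thread):
#   for d in /etc/systemd/system /var/data /root; do
#       if can_write_dir "$d"; then echo "$d: writable"; else echo "$d: not writable"; fi
#   done
```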

@arogozhnikov
Author

@sfoster1

Hi Seth,

That means that the user unit search path /home/.config/systemd/user/ is a good place to put custom systemd units - try there?

That's what I get:

mkdir: can't create directory '/home/.config/': Read-only file system

so I assume that's not going to work.

Now, I've tried setting up the way you described in #106

  • A systemd service unit
  • A directory /var/home/.config/systemd/user/opentrons.target.wants that includes a symlink to your service

and couldn't make systemctl see the service:

~ # cat /root/tunnel.service
[Unit]
Description=Static tunnel to ssh into machine
After=basic.target

[Service]
Type=exec
ExecStart=/root/cloudflared tunnel --config /root/tunnel_conf.yaml --protocol=quic  run

[Install]
WantedBy=opentrons.target
~ # ls /var/home/.config/systemd/user/opentrons.target.wants -lah
total 2
drwxr-xr-x    2 root     root        1.0K Dec  1 23:19 .
drwxr-xr-x    3 root     root        1.0K Dec  1 23:18 ..
lrwxrwxrwx    1 root     root          20 Dec  1 23:19 tunnel.service -> /root/tunnel.service
~ # systemctl daemon-reload
# this command returns nothing, and I see nothing relevant when listing either
~ # systemctl list-units --type=service --all | grep tunnel

@arogozhnikov
Author

remark: open to using any option (i.e. not necessarily systemd) that can launch a process on boot as a daemon

@sfoster1
Member

sfoster1 commented Dec 2, 2022

remark: open to using any option (i.e. not necessarily systemd) that can launch a process on boot as a daemon

Oh! In that case, we support boot scripts run with run-parts. Drop an executable shell script named NN-some-ascii-text where NN is a number (this is mostly a convention - the only rules are that it has to be ascii letters, numbers, or - and _) and it'll get run at boot:

# cat /var/data/boot.d/00-demo 
echo "my service ran"
touch /var/data/my-service-ran
# ls -l /var/data/boot.d/00-demo 
-rwxr-xr-x    1 root     root            53 Dec  2 14:13 /var/data/boot.d/00-demo
# reboot
# ls -l /var/data/
total 310
(other results removed for clarity)
-rw-r--r--    1 root     root             0 Dec  2 14:14 my-service-ran

# journalctl -u opentrons-run-boot-scripts --no-pager
-- Logs begin at Fri 2018-06-22 11:11:49 UTC, end at Fri 2022-12-02 14:16:05 UTC. --
-- Reboot --
Dec 02 14:14:21 opentrons run-parts[162]: my service ran
Dec 02 14:14:21 opentrons systemd[1]: Starting Opentrons: Run user-supplied boot scripts...
Dec 02 14:14:21 opentrons systemd[1]: Started Opentrons: Run user-supplied boot scripts.
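The naming rule and the executable bit shown above can be checked before rebooting. The following is a rough sketch under the rules stated in this comment (ASCII letters, digits, `-` and `_` only); the helper names `valid_boot_script_name` and `install_boot_script` are made up for illustration and are not part of the OT-2 system.

```shell
# Hypothetical helper: return 0 if NAME uses only ASCII letters, digits, '-' and '_'.
valid_boot_script_name() {
    case $1 in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: copy SRC into the boot directory as NAME and mark it executable.
# The third argument is optional and defaults to the directory used in this thread.
install_boot_script() {
    src=$1
    name=$2
    boot_dir=${3:-/var/data/boot.d}
    valid_boot_script_name "$name" || { echo "bad name: $name" >&2; return 1; }
    mkdir -p "$boot_dir" &&
    cp "$src" "$boot_dir/$name" &&
    chmod u+x "$boot_dir/$name"
}

# Example use:
#   install_boot_script ./demo.sh 00-demo
```

Note that a name such as `00-demo.sh` is rejected here because of the dot, which follows from the rule as described above.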

@sfoster1
Member

sfoster1 commented Dec 2, 2022

One thing to keep in mind is that run-parts scripts are unfortunately a lot less configurable than systemd services. You might know how to do this stuff better than me, but those scripts all want to execute like a systemd oneshot service: the script runs once and then exits. That means you really need cloudflared to daemonize (fork and abandon its parent) when called on the command line. There might be a -d,--daemonize command line flag, or maybe just the absence of a --foreground flag or something; I'm not familiar with cloudflared and can't find a good reference for its command line params.

@arogozhnikov
Author

@sfoster1 nice, does run-parts just assume these files are shell scripts?

@arogozhnikov
Author

asking because there is no shebang in your example

@sfoster1
Member

sfoster1 commented Dec 2, 2022

Ah, yes it does. It runs them through the shell.

@arogozhnikov
Author

arogozhnikov commented Dec 15, 2022

@sfoster1 likely I'm doing something wrong, but the service isn't started during reboot:

Location:

~ # ls /var/data/boot.d/00-cftunnel -lah
-rw-r--r--    1 root     root         375 Dec  2 17:27 /var/data/boot.d/00-cftunnel

In log, nothing shows it was found or called:

Dec 15 18:31:03 opentrons ot-commit-machine-id[164]: machine-id "05a2d52f19ca460a9f87f944c6532461" already committed. Exiting without doing anything.
Dec 15 18:31:02 opentrons systemd[1]: Starting Jupyter notebook server...
Dec 15 18:31:02 opentrons systemd[1]: Starting Opentrons: Run user-supplied boot scripts...
Dec 15 18:31:02 opentrons systemd[1]: Starting Network Connectivity...
Dec 15 18:31:02 opentrons systemd[1]: Starting Opentrons: Ensure system wired connections...
Dec 15 18:31:02 opentrons systemd[1]: Starting Rerun udev for block devices...
Dec 15 18:31:02 opentrons systemd[1]: Started D-Bus System Message Bus.

Contents of file:

# cat /var/data/boot.d/00-cftunnel

echo "starting cloudflared tunnel"
echo -n $(date -u) >> /data/tunnel.log
echo "starting cloudflared tunnel" >> /root/tunnel.log
tmux kill-session -t ot-tunnel-session || (echo 'no tmux session to stop' >> /root/tunnel.log)
<actual cloudflared command goes here>

Update: output of the journalctl command that you suggested:

-- Reboot --
Dec 15 18:31:02 opentrons systemd[1]: Starting Opentrons: Run user-supplied boot scripts...
Dec 15 18:31:02 opentrons systemd[1]: Started Opentrons: Run user-supplied boot scripts.

@sfoster1
Member

@arogozhnikov Mark it executable: chmod u+x /var/data/boot.d/00-cftunnel
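A missing execute bit produces no error in the journal, as the log above shows, so it can help to list any boot scripts that would be silently skipped. A small sketch, assuming the boot directory from this thread; the function name `list_nonexecutable` is hypothetical.

```shell
# Hypothetical helper: print every regular file in DIR that lacks the execute bit.
# DIR defaults to the boot script directory mentioned in this thread.
list_nonexecutable() {
    dir=${1:-/var/data/boot.d}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        [ -x "$f" ] || printf '%s\n' "$f"
    done
}

# Example use:
#   list_nonexecutable              # checks /var/data/boot.d
#   list_nonexecutable /some/dir    # checks another directory
```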

@arogozhnikov
Author

arogozhnikov commented Jun 12, 2023

@sfoster1 I think I've tried everything and cloudflared just can't run at this point in the boot process. I am not 100% sure, but here is what I have:

  1. /var/data/boot.d/00-cftunnel runs at startup
  2. I place several commands inside, and they run
  3. I additionally place a simple echo wrapped in tmux to verify that tmux server can be started from boot.d
  4. if I just source /var/data/boot.d/00-cftunnel, tunnel is started normally.

I do not see any logs or errors from cloudflared. Adding sleep 60 before running cloudflared did not help either.

Any other ideas?

@sfoster1
Member

Well huh. A lot of my ideas are broken by cloudflared working fine if you source /var/data/boot.d/00-cftunnel. I assume you're doing something like their setup docs with a config file somewhere on the OT-2 filesystem that you're passing the path to in /var/data/boot.d/00-cftunnel, right?

Where is that config file on the OT-2 filesystem? I wonder if there's some problem like that part of the filesystem not being mounted at the time you run 00-cftunnel. And putting sleep 60 in there wouldn't necessarily fix it because run-parts and the systemd unit it's in would just see that as the script taking a long time and delay starting whatever depends on it.

Where on the OT-2 filesystem did you put the cftunnel binary+supporting solibs and config file?

@arogozhnikov
Author

I place everything (binary, config, logs) right under /root

/root/cloudflared tunnel --config /root/tunnel_conf.yaml --protocol=quic --logfile /root/tunnel.log run > /root/tunnel_last_start.log 2>&1

@sfoster1
Member

And then there's nothing in /root/tunnel.log or /root/tunnel_last_start.log when you ssh in after boot, right?

I'm really not sure what in the world is going wrong but one thing we could try is your idea to wait some time before starting the service, but do it in a fork'd child of the run-parts script. What you'd want to do is the following:

  1. Create a new script somewhere, let's say /var/data/boot.d/cftunnel-worker, with chmod +x and a bash shebang, and put basically everything that's currently in 00-cftunnel in there, including an initial 60-second sleep
  2. Make 00-cftunnel only do the following:
    nohup /var/data/boot.d/cftunnel-worker 0<&- &>/dev/null &
    This should do something similar to daemonize(1), which is not available on the OT-2. Ignore this next part if you already know what it means, but it basically creates a child process and then severs the child process's connection to the parent process so the child can run forever in the background.

So if the problem we're facing is (1) system resources aren't ready enough at the time run-parts runs so cftunnel can't start and (2) doing a sleep 60 in the run-parts script just means that systemd delays bringing up those parts of the system until the script is done, this should solve it by avoiding (2).
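The delay-then-detach idea can also be sketched as a single function, which keeps everything in one boot script. This is an illustration only: `launch_detached_after` is a hypothetical name, and it uses plain POSIX redirections (`</dev/null >/dev/null 2>&1`) rather than the `0<&- &>/dev/null` form above, on the assumption that the boot script may run under a shell without bash extensions.

```shell
# Hypothetical helper: run CMD after DELAY seconds in a detached background
# process and return immediately, so the calling boot script can exit.
# Output is discarded, so CMD needs its own logging (e.g. a --logfile option).
launch_detached_after() {
    delay=$1
    shift
    # The inner sh sleeps, then replaces itself with the requested command.
    nohup sh -c 'sleep "$1"; shift; exec "$@"' sh "$delay" "$@" \
        </dev/null >/dev/null 2>&1 &
}

# Example use inside 00-cftunnel (command line taken from this thread):
#   launch_detached_after 60 /root/cloudflared tunnel \
#       --config /root/tunnel_conf.yaml --protocol=quic \
#       --logfile /root/tunnel.log run
```

Because the sleep happens in the detached child, the boot script itself finishes right away and systemd is not held up waiting for it.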

@arogozhnikov
Author

arogozhnikov commented Jun 14, 2023

The problem is not that /root isn't mounted; I assume it's something with the network. There is also probably some issue with tmux and cloudflared used together.

Your solution (nohup + delay) seems to work. I need more tests to be sure about that, but at least it restarted successfully twice.

Delay is critical, otherwise I get this in logs:

{"level":"warn","error":"Group ID 0 is not between ping group 1 to 0","time":"2023-06-14T21:12:47Z","message":"The user running cloudflared process has a GID (group ID) that is not within ping_group_range. You might need to add that user to a group within that range, or instead update the range to encompass a group the user is already in by modifying /proc/sys/net/ipv4/ping_group_range. Otherwise cloudflared will not be able to ping this network"}
{"level":"warn","error":"cannot create ICMPv4 proxy: Group ID 0 is not between ping group 1 to 0 nor ICMPv6 proxy: socket: permission denied","time":"2023-06-14T21:12:47Z","message":"ICMP proxy feature is disabled"}
{"level":"error","event":0,"error":"lookup _v2-origintunneld._tcp.argotunnel.com on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: cannot assign requested address","time":"2023-06-14T21:12:47Z","message":"edge discovery: error looking up Cloudflare edge IPs: the DNS query failed"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"Please try the following things to diagnose this issue:"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"  1. ensure that argotunnel.com is returning \"origintunneld\" service records."}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     Run your system's equivalent of: dig srv _origintunneld._tcp.argotunnel.com"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"  2. ensure that your DNS resolver is not returning compressed SRV records."}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     See GitHub issue https://github.com/golang/go/issues/27546"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     For example, you could use Cloudflare's 1.1.1.1 as your resolver:"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/"}
{"level":"info","time":"2023-06-14T21:12:47Z","message":"ICMP proxy will use 0.0.0.0 as source for IPv4"}
{"level":"info","time":"2023-06-14T21:12:47Z","message":"ICMP proxy will use :: as source for IPv6"}

@sfoster1
Member

Ah, I guess it's not designed to handle "I'm not currently network-connected" or something. Well, I'm glad the nohup plus delay works! Let me know if something fails in those further tests - I'll leave this open for another couple days.
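If the fixed delay turns out to be fragile, one alternative is to poll until a readiness check passes instead of sleeping a fixed time. This is only a sketch: `wait_until` is a hypothetical helper, and the choice of check is an assumption. The example uses `nslookup`, which may or may not be present on the OT-2, because the log above shows a DNS lookup failing.

```shell
# Hypothetical helper: retry CHECK once per second until it succeeds
# or roughly TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example use in the detached worker (assumes nslookup exists on the robot):
#   wait_until 120 nslookup argotunnel.com && <actual cloudflared command goes here>
```

This should still run in the detached worker rather than directly in the boot script, for the same reason as the sleep: a long-running boot script delays whatever depends on it.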
