Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP monitors all have a status of "canceled" after update to 1.23.4 #4027

Closed
2 tasks done
baubie opened this issue Nov 13, 2023 · 38 comments
Closed
2 tasks done

HTTP monitors all have a status of "canceled" after update to 1.23.4 #4027

baubie opened this issue Nov 13, 2023 · 38 comments
Labels
bug Something isn't working

Comments

@baubie
Copy link

baubie commented Nov 13, 2023

⚠️ Please verify that this bug has NOT been raised before.

  • I checked and didn't find similar issue

🛡️ Security Policy

Description

HTTP monitors all showing "canceled" after update to 1.23.4 and it must be related to #3961 (comment)

When I checked, the timeout for these monitors were all set to 2880 seconds (I don't think I set this). I set them down to something more reasonable (a couple of seconds) and all monitors work again.

I didn't see any other posts about this and it may just affect me as I've been running Uptime Kuma for a long time now.

FIX: Update the timeout of your HTTP monitors to a new value.

👟 Reproduction steps

Have old monitors and upgrade to 1.23.4 (from 1.23.2 in my case).

👀 Expected behavior

Monitors will report being up.

😓 Actual Behavior

Monitors reported down with the status "canceled".

🐻 Uptime-Kuma Version

1.23.4

💻 Operating System and Arch

UBuntu 22.04

🌐 Browser

N/A

🐋 Docker Version

No response

🟩 NodeJS Version

No response

📝 Relevant log output

No response

@baubie baubie added the bug Something isn't working label Nov 13, 2023
@RoiArthurB
Copy link

Same here.

As a workaround, I noticed that enabling Ignore TLS/SSL error for HTTPS websites does fix this issue and makes the monitor works as before.

@louislam
Copy link
Owner

louislam commented Nov 14, 2023

Ops, that is strange. I thought I handled it. Need to investigate.

I thought #4011 was fixed issue, but clearly not.

Do you guys know how to use SQLite Expert to view data? I want to see what is inside your timeout field.

image

@kWAYTV
Copy link

kWAYTV commented Nov 14, 2023

Ops, that is strange. I thought I handled it. Need to investigate.

I thought #4011 was fixed issue, but clearly not.

Do you guys know how to use SQLite Expert to view data? I want to see what is inside your timeout field.

image

Same issue here, clean first time install. Screenshot:
image

@chakflying
Copy link
Collaborator

chakflying commented Nov 14, 2023

I tried reproducing this by creating a new database in branch 1.21.X, adding HTTP monitors with intervals 60 and 360, then upgrading to 1.23.X, but it still works fine as expected.

Maybe you can post the config of the monitor here (masking out anything sensitive), and the node.js version used (or docker).

@megosugit
Copy link

Same here.

As a workaround, I noticed that enabling Ignore TLS/SSL error for HTTPS websites does fix this issue and makes the monitor works as before.

not for me :(

@CommanderStorm

This comment was marked as outdated.

@chakflying

This comment was marked as outdated.

@kokofixcomputers
Copy link
Contributor

Ops, that is strange. I thought I handled it. Need to investigate.

I thought #4011 was fixed issue, but clearly not.

Do you guys know how to use SQLite Expert to view data? I want to see what is inside your timeout field.

image

Thats a weird issue! So this only happened after upgrading to 1.23.4? and should i upgrade to 1.23.4? or should i wait?

@JMyrng
Copy link

JMyrng commented Nov 15, 2023

After downgrading to 1.23.3 the monitors are up and running again.

@megosugit
Copy link

megosugit commented Nov 15, 2023

After downgrading to 1.23.3 the monitors are up and running again.

is it safe to downgrade while keeping all data ? Is it just a matter of pulling 1.23.3 and recreating the container ?
@JMyrng :)

@JMyrng
Copy link

JMyrng commented Nov 16, 2023

After downgrading to 1.23.3 the monitors are up and running again.

is it safe to downgrade while keeping all data ? Is it just a matter of pulling 1.23.3 and recreating the container ?
@JMyrng :)

Yes, I've done exactly that.

@megosugit
Copy link

After downgrading to 1.23.3 the monitors are up and running again.

is it safe to downgrade while keeping all data ? Is it just a matter of pulling 1.23.3 and recreating the container ?
@JMyrng :)

Yes, I've done exactly that.

thank you, I'm going to do the same as I keep receiving Cancelled notifications for no reason and my paused monitors get unpaused magically :/

@chakflying
Copy link
Collaborator

Monitors unpausing is another issue that should be fixed with #4032.

Again, for those who have experienced this exact issue (HTTP monitor getting "Cancelled" when it's actually up), please post your node.js version and OS (or docker) so we can help troubleshoot. 🙇🏻‍♂️

@megosugit
Copy link

I saw about #4032, good to know a fix is coming, thank you.

My versions:
Docker 20.10.23 (API: 1.41) (running on Synology NAS)
node.js 16.20.2

@megosugit
Copy link

I'm not sure if I had exactly the same issue regarding the Cancelled message.
In my case, I started a new install with 1.23.3, then upgraded to 1.23.4. I then created a new HTTPS monitor to check a website that was currently down, and I kept receiving notifications endlessly until the website was finally up.
In the GUI, it was looping between pending (orange) and down (red).

It shows repetitive Down/cancelled events:

image

@louislam
Copy link
Owner

Just release a beta version for fixing this issue:

louislam/uptime-kuma:1.23.5-beta.0

Would be appreciated if you guys could help to test it.

@kWAYTV
Copy link

kWAYTV commented Nov 17, 2023

Just release a beta version for fixing this issue:

louislam/uptime-kuma:1.23.5-beta.0

Would be appreciated if you guys could help to test it.

image
image

@louislam
Copy link
Owner

Just release a beta version for fixing this issue:
louislam/uptime-kuma:1.23.5-beta.0
Would be appreciated if you guys could help to test it.

image image

You did not use 1.23.5-beta.0

@kWAYTV
Copy link

kWAYTV commented Nov 18, 2023

Just release a beta version for fixing this issue:
louislam/uptime-kuma:1.23.5-beta.0
Would be appreciated if you guys could help to test it.

image image

You did not use 1.23.5-beta.0

my bad, now i am using it:
image

@louislam
Copy link
Owner

louislam commented Nov 18, 2023

That's strange. I just released 1.23.6. Free feel to try.

If it is still not working. Please change the tag to 1.23.3.

@JMyrng
Copy link

JMyrng commented Nov 18, 2023

That's strange. I just released 1.23.6. Free feel to try.

If it is still not working. Please change the tag to 1.23.3.

Installed 1.23.6 but issue is still there. So reverting to 1.23.3.

Could it be related to IPv6? Pushover fails to send notifications:

Error: Error: Error: connect ETIMEDOUT 2606:4700:10::6814:2bec:443 
    at Pushover.throwGeneralAxiosError (/app/server/notification-providers/notification-provider.js:38:15)
    at Pushover.send (/app/server/notification-providers/pushover.js:41:18)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Monitor.sendNotification (/app/server/model/monitor.js:1433:21)
    at async beat (/app/server/model/monitor.js:964:21)
    at async safeBeat (/app/server/model/monitor.js:1034:17)

Also there are services that are not affected by this strange "cancelled" bug. Both of them have only A Records.

Hit me up if there is anything else I could help you with!

@louislam
Copy link
Owner

louislam commented Nov 18, 2023

@JMyrng Thank you so much.

I just created a new image for printing more info and +10 seconds for AbortSignal (#4053)

louislam/uptime-kuma:nightly

Now the log will output something like this:

2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: monitor id: 58
2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: timeout: 16
2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: interval: 20
2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: Using AbortSignal.timeout: 26000

The monitor id could be found in the browser address bar.

Would be appropriated if you could show the log here for the monitors that are having issues.

@JMyrng
Copy link

JMyrng commented Nov 18, 2023

Would be appropriated if you could show the log here for the monitors that are having issues.

2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: monitor id: 11
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: timeout: 5
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: interval: 60
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: Using AbortSignal.timeout: 15000
2023-11-18T16:34:10+01:00 [MONITOR] WARN: Monitor #11 'RAE Tunnel': Pending: timeout of 5000ms exceeded | Max retries: 1 | Retry: 1 | Retry Interval: 20 seconds | Type: json-query

@louislam
Copy link
Owner

louislam commented Nov 18, 2023

Would be appropriated if you could show the log here for the monitors that are having issues.

2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: monitor id: 11
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: timeout: 5
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: interval: 60
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: Using AbortSignal.timeout: 15000
2023-11-18T16:34:10+01:00 [MONITOR] WARN: Monitor #11 'RAE Tunnel': Pending: timeout of 5000ms exceeded | Max retries: 1 | Retry: 1 | Retry Interval: 20 seconds | Type: json-query

Wait, it seems that it is actually a real timeout that is longer than 5 seconds?

Could you show this graph here?
image

@JMyrng
Copy link

JMyrng commented Nov 18, 2023

Screenshot_2023-11-18-17-18-20-70_40deb401b9ffe8e1df2f1cc5ba480b12.jpg

I'm on mobile right now, but the red/orange is the timespan while on nightly. Green is on old version.

@louislam
Copy link
Owner

louislam commented Nov 18, 2023

Thanks for your help, which make me wondering, we might be looking into a wrong direction. It may be a Node.js 18 issue.

I will build two more docker images later.

@louislam
Copy link
Owner

  • Node 18 no Signal Patch: louislam/uptime-kuma:nightly-node18-no-signal
  • Node 16 with Signal Patch: louislam/uptime-kuma:nightly-node16-with-signal
  • Node 16 no Signal Patch (This should be similar to 1.23.3): louislam/uptime-kuma:nightly-node16-no-signal

Looking forward to the results. 👀

@chakflying
Copy link
Collaborator

chakflying commented Nov 18, 2023

This is another layered cake situation as I managed to reproduce the "Retries getting reset" part of the problem.

image

  • Since this is 1.23.X, it does not have the accurate interval fix (Fix: More accurate heartbeat Interval #3072). This means that any blocking timeout would count towards the beat interval.
  • When the beat interval is set to 60, and timeout is set to 48, part of that 48 seconds gets counted towards the beat interval.
  • In practice, this means that continuously timing-out would pass the threshold where checkMonitors would consider it as stuck.
      [MONITOR_CHECKER] ERROR: Monitor Interval: 60 Monitor 1 lastStartBeatTime diff: 95
      [MONITOR_CHECKER] ERROR: Unexpected error: Monitor 1 is struck for unknown reason
    
  • Hence the monitor is restarted, retries gets reset, and notifications get sent again.

I think we should either backport #3072, or increase the checkMonitors tolerance to prevent this.

@louislam
Copy link
Owner

louislam commented Nov 18, 2023

You are right, the pattern is similar to @JMyrng's graph, assume that he did not restart the monitor manually.

@kWAYTV
Copy link

kWAYTV commented Nov 19, 2023

So i just have to restart the monitor? When i updated i completely removed the old & pulled out the new ver, i assumed that was enough, i'm using portainer tho

@louislam
Copy link
Owner

So i just have to restart the monitor? When i updated i completely removed the old & pulled out the new ver, i assumed that was enough, i'm using portainer tho

No, that comment was replied to @chakflying's comment.

You could try these images I mentioned in #4027 (comment)

@kWAYTV
Copy link

kWAYTV commented Nov 19, 2023

So i just have to restart the monitor? When i updated i completely removed the old & pulled out the new ver, i assumed that was enough, i'm using portainer tho

No, that comment was replied to @chakflying's comment.

You could try these images I mentioned in #4027 (comment)

I'm starting to think it's my container's issue cause i went into portainer container, click on console and then pinged google.com and these were the results
image

@JMyrng
Copy link

JMyrng commented Nov 19, 2023

  • Node 18 no Signal Patch: louislam/uptime-kuma:nightly-node18-no-signal
  • Node 16 with Signal Patch: louislam/uptime-kuma:nightly-node16-with-signal
  • Node 16 no Signal Patch (This should be similar to 1.23.3): louislam/uptime-kuma:nightly-node16-no-signal

Looking forward to the results. 👀

  • nightly-node18-no-signal: same as here but it's not cancelled but timeout
  • nightly-node16-with-signal: is similar to 1.23.3, everything is up again and Pushover is working again
  • nightly-node16-no-signal: same as nightly-node16-with-signal

Has node 18 changed anything about the use of IPv6? My IPv4 only monitors are still not affected by this bug. I believe that IPv6 is not set up correctly in the container and the preferred variant has changed. I'll try to disable IPv6 later.

@chakflying
Copy link
Collaborator

chakflying commented Nov 19, 2023

Starting from node v17, the DNS lookup of node would preserve the IPv4/IPv6 priority returned by the resolver (nodejs/node#39987), whereas before it would always prefer IPv4. This means that compared to before, some connections will switch to using IPv6 when available.

@JMyrng
Copy link

JMyrng commented Nov 19, 2023

Starting from node v17, the DNS lookup of node would preserve the IPv4/IPv6 priority returned by the resolver (nodejs/node#39987), whereas before it would always prefer IPv4. This means that compared to before, some connections will switch to using v6 when available.

How can I set dns.lookup verbatim option to false to test this?

@JMyrng
Copy link

JMyrng commented Nov 19, 2023

I'll try to disable IPv6 later.

Got it disabled in the docker network, everything is working now with 'latest' tag. So in my case it was a incorrectly configured ipv6 setup.

@louislam
Copy link
Owner

louislam commented Nov 19, 2023

@JMyrng Thanks for testing and finally found the problem. I will mention this in the changelog, as I unlikely downgrade back to Node.js 16.

@chakflying For #4027 (comment), feel free to continue in a new issue.

@ignatiusmadee
Copy link

I have same issue, 1.23.6, sometimes got canceled status update from 1.23.3 to 1.23.6 node version 20.9.0
and then I check some of my web cert exp. not display

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

10 participants