Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Unstable clusterer module (repeated messages Node [X] is UP) #3542

Open
kertor opened this issue Dec 13, 2024 · 2 comments
Open

[BUG] Unstable clusterer module (repeated messages Node [X] is UP) #3542

kertor opened this issue Dec 13, 2024 · 2 comments

Comments

@kertor
Copy link

kertor commented Dec 13, 2024

OpenSIPS version you are running

version: opensips 3.5.2 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: b16e49c98
main.c compiled on  with gcc 11

Describe the bug

I observe an unstable cluster module.
Have two nodes (usrloc replication):

# On main node:
{
    "Clusters": [
        {
            "cluster_id": 1,
            "Nodes": [
                {
                    "node_id": 4,
                    "db_id": 0,
                    "url": "bin:10.0.0.184:5501",
                    "link_state": "Up",
                    "state": "enabled",
                    "next_hop": "4",
                    "description": "node-2-reserve"
                }
            ]
        }
    ]
}

# On secondary node
{
    "Clusters": [
        {
            "cluster_id": 1,
            "Nodes": [
                {
                    "node_id": 3,
                    "db_id": 0,
                    "url": "bin:10.0.0.183:5501",
                    "link_state": "Up",
                    "state": "enabled",
                    "next_hop": "3",
                    "description": "node-2-main"
                }
            ]
        }
    ]
}

Sometime it works, then i see this repeated messages on "node-2-reserve":

Dec 13 10:53:46 node-2-reserve /usr/sbin/opensips[875256]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [3] is possibly down, retrying
Dec 13 10:53:46 node-2-reserve /usr/sbin/opensips[875253]: INFO:clusterer:handle_pong: Node [3] is UP

I understand that this already fixed in 4c38ceb but i have last build "b16e49c98" and still observe the problem.

To Reproduce

It is floating problem, at this moment do not know how to reproduce it.
But command clusterer_reload is fixing this problem.

Expected behavior

Stable clusterer, no logs like:

Dec 13 10:53:08 node-2-main /usr/sbin/opensips[418257]: INFO:usrloc:replicate_ucontact_update: All destinations in cluster: 1 are down or probing
Dec 13 10:53:08 node-2-main /usr/sbin/opensips[418257]: ERROR:usrloc:replicate_ucontact_update: replicate ucontact update failed

Relevant System Logs

Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: Using the global ( per process ) buff 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_parse_headers: received a COMPLETE message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: completely received a message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: Nothing more to read on TCP conn 0x7f1b6c902258, currently in state 0 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:call_callbacks: binary Packet CMD: 1. Capability: clusterer-internal
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:bin_rcv_cl_packets: received clusterer message from: 10.0.0.183:53379 with source id: 3 and cluster id: 1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:handle_internal_msg: Received ping reply from node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:set_link_w_neigh: setting link with neighbour [3] to state <0>
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: con found in state 0
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: tcp connection found (0x7f1b6c903448), acquiring fd
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: c= 0x7f1b6c903448, n=16, Usock=289
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875265]: DBG:core:handle_worker: read response= 7f1b6c903448, 1, fd -1 from 14 (875253)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7f1b6c903448 n=8 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: sending via fd 10...
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tsend_stream_async: Async successful write from first try on 0x7f1b6c903448
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: after write: c= 0x7f1b6c903448 n/len=86/86 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:send_full_top_update: Sent topology update to node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: con found in state 0
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: tcp connection found (0x7f1b6c903448), acquiring fd
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: c= 0x7f1b6c903448, n=16, Usock=289
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875265]: DBG:core:handle_worker: read response= 7f1b6c903448, 1, fd -1 from 14 (875253)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7f1b6c903448 n=8 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: sending via fd 10...
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tsend_stream_async: Async successful write from first try on 0x7f1b6c903448
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: after write: c= 0x7f1b6c903448 n/len=99/99 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:send_cap_update: Sent capability update to node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: INFO:clusterer:handle_pong: Node [3] is UP
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:evi_param_set: set int cluster_id=1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:evi_param_set: set int node_id=3
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:evi_param_set: set int new_state=1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7f1bef51b070
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7f1bef51b070
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:destroy_avp_list: destroying list (nil)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:sr_add_report: adding report to identifier [node_states] group [clusterer] on idx 18 [19,18]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: tcp_read_req end
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: Using the global ( per process ) buff 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_parse_headers: received a COMPLETE message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: completely received a message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: Nothing more to read on TCP conn 0x7f1b6c902258, currently in state 0 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:call_callbacks: binary Packet CMD: 9. Capability: clusterer-internal
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:bin_rcv_cl_packets: received clusterer message from: 10.0.0.183:53379 with source id: 3 and cluster id: 1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:handle_internal_msg: Received capability update with source [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:bin_pop_str: Popped: 'usrloc-contact-repl' [19]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: tcp_read_req end







ec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7f1b6c903448 n=8 fd=5
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:proto_bin:proto_bin_send: sending via fd 5...
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:tsend_stream_async: Async successful write from first try on 0x7f1b6c903448
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:proto_bin:proto_bin_send: after write: c= 0x7f1b6c903448 n/len=309/309 fd=5
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:clusterer:msg_send_retry: sent bin packet to node [3]
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:registrar:build_contact: created Contact HF: Contact: <sip:[email protected]>;expires=60

Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:MD5StringArray: MD5 calculated: 9fb93ca357c3189b8747a9751f9dd969
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:parse_headers_aux: flags=ffffffffffffffff
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:cleanup_uac_timers: RETR/FR timers reset
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:insert_timer_unsafe: [2]: 0x7f1b6c8ef798 (162816)
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:_reply_light: reply sent out. buf=0x7f1bef51e4d8: SIP/2.0 2..., shmem=0x7f1b6c8fd648: SIP/2.0 2
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:_reply_light: finished
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:comp_scriptvar: int 20: 60 / 0
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: INFO:[REGISTER] Registered successfully ([email protected] ci=0bb129639bd2e2bd942247bbe69)
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:parse_headers_aux: flags=ffffffffffffffff
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:comp_scriptvar: str 20: INSPECTOR-VI
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:t_unref: UNREF_UNSAFE: [0x7f1b6c8ef718] after is 0
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:destroy_avp_list: destroying list (nil)
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:receive_msg: cleaning up
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [3] is possibly down, retrying
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875265]: DBG:core:handle_worker: read response= 7f1b6c903448, 1, fd -1 from 12 (875251)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:clusterer:do_action_trans_2: Sent ping retry to node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:clusterer:set_link_w_neigh: setting link with neighbour [3] to state <5>
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:evi_param_set: set int cluster_id=1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:evi_param_set: set int node_id=3
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:evi_param_set: set int new_state=0
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7f1bef51afd8
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7f1bef51afd8
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:destroy_avp_list: destroying list (nil)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:sr_add_report: adding report to identifier [node_states] group [clusterer] on idx 17 [18,17]

OS/environment information

  • Operating System: Ubuntu 22
  • OpenSIPS installation: source
  • other relevant information:

Additional context

Could you please check it?

Copy link

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Dec 29, 2024
@ewadii
Copy link

ewadii commented Dec 30, 2024

same "issue" after upgrading from 3.4.8 to 3.4.10, no routing issues just a bothersome log message on the secondary box during low traffic conditions
basic usage of clusterer module without usrloc. "INFO:clusterer:do_action_trans_2: Ping reply not received, node [1] is possibly down, retrying"

@stale stale bot removed the stale label Dec 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants