All storages is offline after restart nebula services #5398

mxsavchenko · 2023-03-13T19:43:13Z

Installation: Docker
OS: AlmaLinux 8.5
CPU: Intel Xeon 4116
Commit id (db3c1b3)
Database size: ~400 GB
Settings: default

Hi, I have a Nebula cluster on 3 nodes (graph/meta/storage), which was originally installed at v3.2.1.
A few days ago I wanted to upgrade to version 3.4.0. I stopped all services (graph/meta/storage) on all nodes, updated the Docker image version to 3.4.0, and started the services again. After loading parts, the storage services do not reach the ONLINE state, and switching back to version 3.2.1 gives the same problem. In the storage logs, the leader is constantly being re-elected, and it seems that each node randomly takes the leader role all the time. The console keeps showing storage switching between OFFLINE and ONLINE, and once all 3 storages have loaded their parts, they go OFFLINE.

show hosts graph;
+-----------+------+----------+---------+--------------+---------+
| Host | Port | Status | Role | Git Info Sha | Version |
+-----------+------+----------+---------+--------------+---------+
| "graphd0" | 9669 | "ONLINE" | "GRAPH" | "db3c1b3" | "3.4.0" |
| "graphd1" | 9669 | "ONLINE" | "GRAPH" | "db3c1b3" | "3.4.0" |
| "graphd2" | 9669 | "ONLINE" | "GRAPH" | "db3c1b3" | "3.4.0" |
+-----------+------+----------+---------+--------------+---------+

#####################

show hosts meta;
+----------+------+----------+--------+--------------+---------+
| Host | Port | Status | Role | Git Info Sha | Version |
+----------+------+----------+--------+--------------+---------+
| "metad2" | 9559 | "ONLINE" | "META" | "db3c1b3" | "3.4.0" |
| "metad0" | 9559 | "ONLINE" | "META" | "db3c1b3" | "3.4.0" |
| "metad1" | 9559 | "ONLINE" | "META" | "db3c1b3" | "3.4.0" |
+----------+------+----------+--------+--------------+---------+

#####################

show hosts storage;
+-------------+------+-----------+-----------+--------------+---------+
| Host | Port | Status | Role | Git Info Sha | Version |
+-------------+------+-----------+-----------+--------------+---------+
| "storaged0" | 9779 | "OFFLINE" | "STORAGE" | "db3c1b3" | "3.4.0" |
| "storaged1" | 9779 | "OFFLINE" | "STORAGE" | "db3c1b3" | "3.4.0" |
| "storaged2" | 9779 | "OFFLINE" | "STORAGE" | "db3c1b3" | "3.4.0" |
+-------------+------+-----------+-----------+--------------+---------+

logs from storaged0/storaged1 in zip archive:
logs.zip

mxsavchenko · 2023-03-14T09:53:06Z

Errors in the storage logs after increasing the logging level:

I20230314 09:48:25.417989 43 RaftPart.cpp:1256] [Port: 9780, Space: 69, Part: 10] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20230314 09:48:25.418040 43 RaftPart.cpp:1283] [Port: 9780, Space: 69, Part: 10] Did not get enough votes from election of term 11, isPreVote = 1
I20230314 09:48:26.581475 73 RaftPart.cpp:1289] [Port: 9780, Space: 64, Part: 13] Start leader election...
I20230314 09:48:26.582363 73 RaftPart.cpp:1317] [Port: 9780, Space: 64, Part: 13] Sending out an election request (space = 64, part = 13, term = 14, lastLogId = 304921783, lastLogTerm = 13, candidateIP = storaged0, candidatePort = 9780), isPreVote = 1
I20230314 09:48:26.582995 43 RaftPart.cpp:1256] [Port: 9780, Space: 64, Part: 13] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20230314 09:48:26.583040 43 RaftPart.cpp:1283] [Port: 9780, Space: 64, Part: 13] Did not get enough votes from election of term 14, isPreVote = 1
I20230314 09:48:28.321467 73 RaftPart.cpp:1289] [Port: 9780, Space: 64, Part: 13] Start leader election...
I20230314 09:48:28.321529 73 RaftPart.cpp:1317] [Port: 9780, Space: 64, Part: 13] Sending out an election request (space = 64, part = 13, term = 14, lastLogId = 304921783, lastLogTerm = 13, candidateIP = storaged0, candidatePort = 9780), isPreVote = 1
I20230314 09:48:28.322351 43 RaftPart.cpp:1256] [Port: 9780, Space: 64, Part: 13] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20230314 09:48:28.322397 43 RaftPart.cpp:1283] [Port: 9780, Space: 64, Part: 13] Did not get enough votes from election of term 14, isPreVote = 1
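
To see which parts and peers are affected, the askForVote responses in an excerpt like the one above can be tallied. The following is a minimal Python sketch, assuming the log line layout shown here; the function name `count_vote_responses` and the regular expression are illustrative and not part of NebulaGraph.

```python
import re
from collections import Counter

# Matches the "Receive response about askForVote" lines from the excerpt above.
# The layout is taken from those lines only; other versions may log differently.
VOTE_RESPONSE = re.compile(
    r'\[Port: (?P<port>\d+), Space: (?P<space>\d+), Part: (?P<part>\d+)\] '
    r'Receive response about askForVote from "(?P<peer>[^"]+)":\d+, '
    r'error code is (?P<code>\w+)'
)


def count_vote_responses(lines):
    """Count askForVote responses per (space, part, peer, code)."""
    counts = Counter()
    for line in lines:
        m = VOTE_RESPONSE.search(line)
        if m:
            key = (int(m["space"]), int(m["part"]), m["peer"], m["code"])
            counts[key] += 1
    return counts


# Sample input copied from the log excerpt in this comment.
sample = [
    'I20230314 09:48:25.417989 43 RaftPart.cpp:1256] [Port: 9780, Space: 69, Part: 10] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1',
    'I20230314 09:48:26.581475 73 RaftPart.cpp:1289] [Port: 9780, Space: 64, Part: 13] Start leader election...',
    'I20230314 09:48:26.582995 43 RaftPart.cpp:1256] [Port: 9780, Space: 64, Part: 13] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1',
    'I20230314 09:48:28.322351 43 RaftPart.cpp:1256] [Port: 9780, Space: 64, Part: 13] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1',
]

for key, n in sorted(count_vote_responses(sample).items()):
    print(key, n)
```

Feeding a whole storaged log file through the same function (for example `count_vote_responses(open(path))`, with `path` being a hypothetical log file location) would show whether the responses come from a single peer or from several.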

mxsavchenko · 2023-03-14T10:00:54Z

Another question: is there any way to speed up the loading of parts after restarting Nebula storage? Maybe some configuration parameter controls this. Currently it takes about 3 hours to load the parts ((

wenhaocs · 2023-03-14T21:46:22Z

E_RAFT_UNKNOWN_PART typically indicates the part is not found in your storaged2. Let me check why it happened. BTW, how many parts do you have?

pengweisong · 2023-03-15T03:22:25Z

> Another question: is there any way to speed up the loading of parts after restarting Nebula storage? Maybe some configuration parameter controls this. Currently it takes about 3 hours to load the parts ((

How many replicas did you set for each part? From the log, it looks like 2 instead of 3?
Which disk type do you use, HDD or SSD? An HDD may be slow to start, especially when RocksDB compaction is triggered.

wey-gu · 2023-03-15T05:23:55Z

> How many replicas did you set for each part? From the log, it looks like 2 instead of 3?

We should consider rejecting even numbers as the replication factor.

#5380

mxsavchenko · 2023-03-15T07:10:17Z

> Another question: is there any way to speed up the loading of parts after restarting Nebula storage? Maybe some configuration parameter controls this. Currently it takes about 3 hours to load the parts ((
>
> How many replicas did you set for each part? From the log, it looks like 2 instead of 3? Which disk type do you use, HDD or SSD? An HDD may be slow to start, especially when RocksDB compaction is triggered.

Every space has 16 partitions and a replication factor of 2. The disks are SSDs.

wey-gu · 2023-03-15T11:45:31Z

The replication factor should not be an even number. Maybe we should have banned such a configuration when creating spaces.

Could you wipe the cluster and recreate the space with replication factor 1 (non-HA) or 3 (HA)?
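
The reasoning behind avoiding an even replication factor can be sketched with simple quorum arithmetic. This assumes each part elects its leader by strict majority, as in standard Raft, which the "Did not get enough votes" log lines earlier in the thread suggest. The helper names `majority` and `tolerated_failures` are illustrative only.

```python
def majority(replicas: int) -> int:
    """Votes needed to win a strict-majority election among `replicas` copies."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Copies that can be unavailable while a majority is still reachable."""
    return replicas - majority(replicas)


for n in (1, 2, 3):
    print(f"replicas={n} majority={majority(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

With 2 replicas the majority is 2, so a part cannot elect a leader unless both copies respond, and no failure is tolerated. That is no better than a single replica for availability, while 3 replicas tolerate one unavailable copy.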

mxsavchenko · 2023-03-15T11:51:58Z

> The replication factor should not be an even number. Maybe we should have banned such a configuration when creating spaces.
>
> Could you wipe the cluster and recreate the space with replication factor 1 (non-HA) or 3 (HA)?

I can wipe the cluster, but I have no backups :( Is there any other way to recover my data?

wey-gu · 2023-03-15T12:34:57Z

@wenhaocs @pengweisong I think copying data from some of the storaged instances to the others will do the job, right?

pengweisong · 2023-03-17T08:22:03Z

Have you executed the BALANCE DATA command?

kikimo · 2023-03-17T09:14:52Z

Is the network stable? And what about the I/O and CPU load on the storage servers?

mxsavchenko · 2023-03-17T10:30:50Z

> Have you executed the BALANCE DATA command?

No, but all storages are OFFLINE. Would that help?
Should I try BALANCE LEADER?

mxsavchenko · 2023-03-17T10:31:46Z

> Is the network stable? And what about the I/O and CPU load on the storage servers?

Yes, the network is stable, and the other resources are fine too (

pengweisong · 2023-03-17T10:32:43Z

> No, but all storages are OFFLINE. Would that help? Should I try BALANCE LEADER?

No, do not execute any balance data command. It would be a disaster when you only have 2 copies.

QingZ11 · 2023-05-05T03:10:58Z

@mxsavchenko Hi, I have noticed that the issue you created hasn’t been updated for nearly a month, so I have to close it for now. If you have any new updates, you are welcome to reopen this issue anytime.

Thanks a lot for your contribution anyway 😊

mxsavchenko added the type/bug Type: something is unexpected label Mar 13, 2023

github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Mar 13, 2023

wey-gu mentioned this issue Mar 18, 2023

Weekly Report 2023-03-17 vesoft-inc/nebula-community#392

Closed

QingZ11 closed this as completed May 5, 2023

github-actions bot added the process/fixed Process of bug label May 5, 2023

wey-gu mentioned this issue May 6, 2023

Weekly Report 2023-05-05 vesoft-inc/nebula-community#400

Closed

johnny-smitherson mentioned this issue Mar 12, 2024

[Windows10 + WSL2 Docker Desktop] booting storaged with 300MB of data takes more than 20min #5836

Open
