[leo_mq][leo_redundant_manager] Spamming error logs can happen #710

Open

mocchira opened this issue Apr 18, 2017 · 24 comments

@mocchira
Member

mocchira commented Apr 18, 2017

What versions

  • LeoFS: 1.3.3-rc3

Where the problem happened

[email protected]

When

During [email protected] was down.

What you did

package/leo_storage_0/bin/leo_storage stop.
package/leo_storage_0/bin/leo_storage start.

What happened

[E]     storage_2@192.168.3.55  2017-04-17 22:24:02.132330 +0300        1492457042      leo_mq_consumer:consume/4       516     [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E]     storage_2@192.168.3.55  2017-04-17 22:24:02.132926 +0300        1492457042      leo_mq_consumer:consume/4       516     [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E]     storage_2@192.168.3.55  2017-04-17 22:24:02.133397 +0300        1492457042      leo_mq_consumer:consume/4       516     [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
...

These errors repeated endlessly while [email protected] was down,
so they can cause DISK FULL if the downtime is long enough.

What should happen instead

Ideally, far fewer errors should be dumped instead of the same error being spammed many times.

Why

Given how the leo_mq consumer is implemented in https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L502-L525,
the same error can be spammed many times when an exception is raised in the handle_call callback (https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L509), because the item is not consumed (this line is skipped: https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L510) and leo_mq_consumer keeps trying to consume the same item until it succeeds (sketched below).

In this case,
https://github.com/leo-project/leo_redundant_manager/blob/1.9.35/src/leo_membership_mq_client.erl#L157-L192 keeps failing while [email protected] is down.
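
For illustration, here is a minimal sketch of that loop in plain Erlang. It is not the actual leo_mq_consumer code, and the Mod:handle_call({consume, Msg}) contract is an assumption; it only shows why a persistently failing item is peeked, failed, and logged over and over.

%% A minimal sketch (not the actual leo_mq_consumer code) of the
%% behaviour described above: the item is removed only after a
%% successful callback, so a persistent exception makes the consumer
%% spin on the same message and log the same error repeatedly.
-module(consume_loop_sketch).
-export([consume/2]).

consume(Mod, Queue) ->
    case queue:peek(Queue) of
        empty ->
            ok;
        {value, Msg} ->
            try Mod:handle_call({consume, Msg}) of
                ok ->
                    %% success: drop the message and move on
                    consume(Mod, queue:drop(Queue))
            catch
                _:Cause ->
                    %% failure: the message is NOT dropped, so the next
                    %% iteration peeks the very same message and emits
                    %% the very same error log again
                    error_logger:error_msg("cause: ~p~n", [Cause]),
                    consume(Mod, Queue)
            end
    end.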

Solution

A temporary error is acceptable; however, since we have no idea how long an error will keep occurring, we have to avoid consuming the same item within the same batch.
(Give it another try on the next batch, so the same error appears at most once per batch.)

Related Links

@yosukehara
Member

@mocchira

A temporary error is acceptable; however, since we have no idea how long an error will keep occurring, we have to avoid consuming the same item within the same batch.

I totally agree with your idea.
I have one question regarding "in the same batch". What does it depend on?

@mocchira
Member Author

@yosukehara

I have one question regarding "in the same batch". What does it depend on?

As you know,
leo_mq consumes the items in its queue in chunks of a limited size, at a limited rate, in order to avoid bursting disk I/O. We call this processing unit a batch.
Its consumption rate is determined by

  • the interval
  • the number of batch messages

implemented in https://github.com/leo-project/leo_mq/blob/1.4.15/src/leo_mq_api.erl#L98-L103.
These values can be set by callers through leo_mq_api:new (see the sketch below the search output), so they depend on

  • leo_storage: apps/leo_storage/src/leo_storage_mq.erl
  • leo_manager: apps/leo_manager/src/leo_manager_mq_client.erl
  • leo_redundant_manager: deps/leo_redundant_manager/src/leo_membership_mq_client.erl

according to this script.

$ find deps/ apps/ -type f|xargs grep leo_mq_api:new|grep -v eunit|grep -v test
deps/leo_redundant_manager/src/leo_membership_mq_client.erl:    leo_mq_api:new(leo_mq_sup, InstanceId, [{?MQ_PROP_MOD, ?MODULE},
apps/leo_storage/src/leo_storage_mq.erl:    leo_mq_api:new(Sup, Id, [{?MQ_PROP_MOD, ?MODULE},
apps/leo_manager/src/leo_manager_mq_client.erl:    leo_mq_api:new(RefMqSup, ?QUEUE_ID_FAIL_REBALANCE
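
For reference, a hedged sketch of how such a registration might look on the caller side. ?MQ_PROP_MOD appears in the search output above, but the interval and batch-size property keys below (num_of_batch_processes, regular_interval) and the header path are illustrative assumptions, not necessarily the exact names leo_mq defines.

%% A hedged sketch of registering a queue and its consumption rate; in a
%% real client this lives inside the MQ callback module itself, and the
%% two rate-related keys are placeholders, not confirmed macro names.
-module(mq_registration_sketch).
-export([start_mq/2]).
-include_lib("leo_mq/include/leo_mq.hrl").

start_mq(Sup, InstanceId) ->
    leo_mq_api:new(Sup, InstanceId,
                   [{?MQ_PROP_MOD, ?MODULE},
                    %% how many messages one batch may process
                    {num_of_batch_processes, 1600},
                    %% pause (msec) between batches to throttle disk I/O
                    {regular_interval, 500}]).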

Let me know if the above answer is different from what you intended to ask.

@mocchira
Member Author

mocchira commented Apr 19, 2017

@yosukehara
@windkit

A naive approach would be to remember failed items in memory and skip them.

Given that we would then have N failed items in memory as a set (red-black tree) and M items in leo_mq,
the maximum complexity will be O(log N * M).
That said, it doesn't perform well once M exceeds a certain threshold.
This might not be a problem in some use cases, but it can be in write-intensive use cases where failed items remain for a long time.
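
For illustration, a sketch of that bookkeeping with hypothetical names; gb_sets serves as the balanced-tree set here, and the {MsgId, Body} message layout is an assumption.

%% Remember failed message ids and skip them when consuming:
%% O(log N) per lookup, hence O(log N * M) over M queued messages.
-module(failed_set_sketch).
-export([remember_failure/2, filter_failed/2]).

remember_failure(MsgId, FailedSet) ->
    gb_sets:add_element(MsgId, FailedSet).

filter_failed(Messages, FailedSet) ->
    %% Messages is assumed to be a list of {MsgId, Body} tuples
    [Msg || {MsgId, _Body} = Msg <- Messages,
            not gb_sets:is_element(MsgId, FailedSet)].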

Now I'm considering other approaches.
Please let me know if you have any ideas.

@windkit
Contributor

windkit commented Apr 19, 2017

To me, failed messages could be enqueued to a "retry-later" MQ.
A separate process would then scan through that queue and enqueue the messages back to their corresponding MQ.

It just seems better to log those failed messages on disk and process them later.
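
A rough sketch of that idea, using OTP's queue module as a stand-in for a dedicated on-disk MQ; PublishFun is a caller-supplied fun that puts a message back onto its original queue (a real implementation would go through leo_mq itself).

%% Park failed messages and re-enqueue them later.
-module(retry_later_sketch).
-export([park_failed/3, drain/2]).

%% on failure, park the message together with the id of its origin queue
park_failed(QueueId, MsgBin, RetryQ) ->
    queue:in({QueueId, MsgBin}, RetryQ).

%% a separate process periodically drains the retry queue and
%% re-enqueues every parked message to the queue it came from
drain(RetryQ, PublishFun) when is_function(PublishFun, 2) ->
    case queue:out(RetryQ) of
        {empty, Q1} ->
            Q1;
        {{value, {QueueId, MsgBin}}, Q1} ->
            ok = PublishFun(QueueId, MsgBin),
            drain(Q1, PublishFun)
    end.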

@mocchira
Member Author

One solution I just came up with is bulk loading.

This solution doesn't need to track failed items, so the maximum complexity will be O(M); also, since the M items are bulk-loaded at once, the actual cost of interacting with disk I/O is close to O(1).
Sounds perfect to me.
So I will go forward in this direction.
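
A sketch of that idea with hypothetical names: the batch is assumed to have been loaded from disk in a single pass, each message is tried exactly once, and a failure is logged once and left for the next batch rather than retried within the current one.

%% Try every message of a bulk-loaded batch exactly once.
-module(bulk_consume_sketch).
-export([consume_batch/2]).

consume_batch(Mod, Messages) ->
    lists:foreach(
      fun(Msg) ->
              try Mod:handle_call({consume, Msg}) of
                  ok ->
                      ok
              catch
                  _:Cause ->
                      %% log once; the failed message stays in the queue
                      %% and is only revisited on the next batch
                      error_logger:error_msg("cause: ~p~n", [Cause])
              end
      end, Messages).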

@mocchira
Member Author

@windkit thanks for sparing your time.

I also considered an approach similar to the one you mentioned; however, a more promising one came to my mind, as posted above. I will try that one first.
Anyway thanks again.

@windkit
Contributor

windkit commented Apr 19, 2017

@mocchira Could you elaborate on that?
The original problem is that we always peek the first item; if the handling fails, the item is not removed, and the next time we keep hitting the same failed item.

Wouldn't loading a batch of M messages fall into the same loop if all of them fail?

Not removing the entries from the queue would decrease the effective batch size, and thus performance, and would eventually jam up the queue.

@mocchira
Member Author

The original problem is that we always peek the first item; if the handling fails, the item is not removed, and the next time we keep hitting the same failed item.

The original problem was that we kept peeking the same first item within the same batch.

Wouldn't loading a batch of M messages fall into the same loop if all of them fail?

Right.

In my tests with

  • 5 storage nodes
  • consistency.N = 3
  • one or two nodes going down randomly for relatively short periods (a few minutes)
  • a read-intensive workload (several thousand read ops vs. several dozen write ops)

the default batch size did not jam up the queue with the bulk-load solution.

However, if multiple nodes are down for a long time under a write-intensive workload,
it would jam up the queue,
so another solution like "retry-later" would be needed.

Anyway,
bulk loading can be combined with other solutions and at least solves part of the use cases,
so as a first step I'm going to implement it.

Thoughts? > @windkit @yosukehara

@windkit
Contributor

windkit commented Apr 20, 2017

@mocchira having the bulk load as a quick fix looks good to me for the moment

@yosukehara
Member

@mocchira @windkit The bulk-loading method can cover most situations, so we should fix this issue with that approach in v1.3.3.

However, there are certainly a few irregular cases; we need to make a list of those situations and fix them in the next version, v1.4.0 or later.

@mocchira
Member Author

@yosukehara @windkit thank you for reviewing. I will go forward with bulk load for 1.3.3.

@vstax
Contributor

vstax commented Apr 26, 2017

I don't know if this is the same problem or not, but on the latest development version this happens pretty often when some node that was stopped is starting up (I believe this happened with the older version this bug was created about as well; I've noticed it at least once):

[W] [email protected]  2017-04-26 18:28:17.757349 +0300    1493220497  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'[email protected]',nodedown}
[E] [email protected]  2017-04-26 18:28:22.342160 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.353530 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.364200 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.375630 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.387463 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[...lots of lines skipped...]
[E] [email protected]  2017-04-26 18:28:24.729524 +0300    1493220504  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]

18:28:22 is around the time "leo_storage start" was executed on storage_2, and 18:28:25 is around the time it finished starting up.
This reproduces reliably. E.g. here is another case, logs from storage_1 this time:

[W] [email protected]  2017-04-26 18:37:31.276017 +0300    1493221051  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'[email protected]',nodedown}
[E] [email protected]  2017-04-26 18:37:39.207441 +0300    1493221059  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:37:39.219958 +0300    1493221059  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:37:39.232078 +0300    1493221059  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]

This is probably not as critical as the original problem (i.e. it can't fill the disk); still, getting tons of errors on various nodes just because some other node is starting up normally is a bit alarming.

@mocchira
Member Author

@vstax

Thanks for reporting.
As I commented at #702 (comment),

Commits for #624 and other commits during the 1.3.3 dev cycle not only revealed #710, but also showed that retries that were supposed to be triggered by leo_mq actually weren't in some cases with <= 1.3.2.1.

This is one of the logs caused by those cases.

This is probably not as critical as the original problem (i.e. it can't fill the disk); still, getting tons of errors on various nodes just because some other node is starting up normally is a bit alarming.

Yes, this kind of log should be reduced.
As you said, since this is not as critical as the original one, we will solve it in the next release, 1.4 (the milestone will be changed from 1.3.3 to 1.4 after 1.3.3 has landed).

@mocchira
Member Author

mocchira commented Apr 28, 2017

@vstax

[E] [email protected]  2017-04-26 18:28:22.342160 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]

Now we are considering making the above error happen only once rather than many times, as LeoFS <= 1.3.2.1 does.
That fix might be included in 1.3.3 if we can reach a consensus on how to fix it before 1.3.3 lands.

@mocchira
Member Author

@vstax The above fix will be included in 1.3.3 :)

@mocchira
Member Author

mocchira commented May 2, 2017

@yosukehara @windkit

The quick fix by bulk loading has landed in 1.3.3, so I will change the milestone of this issue from 1.3.3 to 1.4 for the permanent fix.

@mocchira mocchira modified the milestones: 1.4.0, 1.3.3 May 2, 2017
@mocchira mocchira added the v1.4 label May 2, 2017
@mocchira
Member Author

Permanent Fix

Problem

As mentioned above, the quick fix included in 1.3.3 solves part of the use cases; however, some cases remain to be fixed:

  • Nodes down for a long time
  • Write-intensive workload

When a LeoFS cluster suffers such conditions,
the queue gets jammed up at a certain point, and no items in the queue can be consumed until the downed nodes come back.

Solution

  • Filtering out any item that depends on downed nodes when fetching items from a queue

    • Pros
      • Easy to implement
      • No need to fix leo_redundant_manager
      • No additional queues needed
    • Cons
      • Filtering out (skipping) takes a certain amount of system resources (CPU, disk I/O)
      • The longer the existing queues, the more system resources LeoFS consumes.
  • Making an additional queue (the "retry-later" queue from the previous comment) for failed items, to be retried later once the downed nodes they depend on come back

    • Pros
      • Keeps the length of the existing queues stable while nodes are down
    • Cons
      • Implementation cost is relatively high compared to the filtering-out solution
      • Needs to cooperate with leo_redundant_manager (some notification I/F needs to be added to leo_redundant_manager)
      • Additional queues are needed (which means additional disk I/O could happen)

Comparing the above pros and cons, we are now considering adopting the former (filtering out); a rough sketch follows.
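
For illustration, a hedged sketch of the filtering-out option; the {Node, MsgBin} message layout and node_is_up/1 are assumptions, and in LeoFS the node state would come from leo_redundant_manager.

%% Drop any fetched message whose target node is currently down;
%% the dropped messages stay in the queue and are re-checked later.
-module(filter_down_sketch).
-export([fetch_batch/1]).

fetch_batch(Messages) ->
    [Msg || {Node, _MsgBin} = Msg <- Messages, node_is_up(Node)].

node_is_up(Node) ->
    %% stand-in for a real membership check
    Node =:= node() orelse lists:member(Node, nodes()).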

@yosukehara
Member

@mocchira I agree with "Filtering out any item that depends on downed nodes when fetching items from a queue" because its implementation cost is low and we can quickly see the effect of the implementation through tests.

If the result does not meet expectations, we will need to consider the latter idea.

@mocchira
Member Author

mocchira commented Feb 2, 2018

@yosukehara Thanks for the review. I will get to work on it now.

@mocchira
Member Author

mocchira commented Feb 2, 2018

@yosukehara A prototype on my dev box uses less CPU than the current develop branch. It looks like an improvement, though not a permanent fix to me; still, IMHO it's worth implementing until we polish the latter idea.
Just in case, can you review the thought below on pushing the former idea to production quality?

In order to push the application logic (filtering out items) into leo_storage, I'd like to propose adding a new field - filter_func - to mq_properties, which can be set through leo_mq_api:new and passed to leo_backend_db_server:first_n instead of the default fun (which always returns true) defined at https://github.com/leo-project/leo_mq/blob/develop/src/leo_mq_server.erl#L267-L269. With this change, leo_(manager|storage|gateway) become able to use leo_mq with user-defined filter funs.
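
To make the proposal concrete, a hedged caller-side sketch follows; the filter_func property key and the fun's (Key, MsgBin) arguments are part of the proposal (i.e. assumptions), not an existing leo_mq API, and the header path is assumed.

%% A caller-side sketch of the proposed user-defined filter function;
%% ?MQ_PROP_MOD comes from leo_mq's header, filter_func is proposed.
-module(filter_func_sketch).
-export([new_queue_with_filter/2]).
-include_lib("leo_mq/include/leo_mq.hrl").

new_queue_with_filter(Sup, InstanceId) ->
    FilterFun =
        fun(_Key, MsgBin) ->
                {Node, _Body} = binary_to_term(MsgBin),
                %% keep only messages whose target node is reachable;
                %% leo_backend_db_server:first_n would apply this fun
                %% instead of its default "always true" fun
                Node =:= node() orelse lists:member(Node, nodes())
        end,
    leo_mq_api:new(Sup, InstanceId, [{?MQ_PROP_MOD, ?MODULE},
                                     {filter_func, FilterFun}]).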

@yosukehara
Member

@mocchira

I'd like to propose adding a new field - filter_func - to mq_properties, which can be set through leo_mq_api:new and passed to leo_backend_db_server:first_n instead of the default fun (which always returns true) defined at https://github.com/leo-project/leo_mq/blob/develop/src/leo_mq_server.erl#L267-L269. With this change, leo_(manager|storage|gateway) become able to use leo_mq with user-defined filter funs.

I've just reviewed your proposal. I realized that a user-defined filter function for leo_mq is a good idea; it has finally become clear how to proceed. Let's adopt this approach.

@mocchira
Member Author

mocchira commented Feb 2, 2018

@yosukehara Thanks. I will send PR next week.

@mocchira
Member Author

mocchira commented Mar 7, 2018

Unfortunately, neither the UDFF nor #994 could essentially solve this issue, so we have to seek other solutions like the retry-later queue described above. Postponing it to 1.4.1.

@mocchira
Member Author

We've come to the conclusion to adopt the retry-later queue approach. Since this forces us to use leo_redundant_manager v2 in order to detect status changes on the leo_storage node(s), we are putting it off to 2.0.0.

@mocchira mocchira modified the milestones: 1.4.2, 2.0.0 May 15, 2018
@yosukehara yosukehara removed the v1.4 label Feb 25, 2019