[leo_mq][leo_redundant_manager] Spamming error logs can happen #710

Open

mocchira opened this issue Apr 18, 2017 · 24 comments

@mocchira
Member

mocchira commented Apr 18, 2017

What versions

  • LeoFS: 1.3.3-rc3

Where the problem happened

[email protected]

When

During [email protected] was down.

What you did

package/leo_storage_0/bin/leo_storage stop.
package/leo_storage_0/bin/leo_storage start.

What happened

[E]     storage_2@192.168.3.55  2017-04-17 22:24:02.132330 +0300        1492457042      leo_mq_consumer:consume/4       516     [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E]     storage_2@192.168.3.55  2017-04-17 22:24:02.132926 +0300        1492457042      leo_mq_consumer:consume/4       516     [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E]     storage_2@192.168.3.55  2017-04-17 22:24:02.133397 +0300        1492457042      leo_mq_consumer:consume/4       516     [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
...

These errors repeated endlessly while [email protected] was down,
so they can cause DISK FULL if the downtime is long enough.

What should happen instead

Ideally, far fewer errors should be dumped instead of the same error being spammed many times.

Why

Given how the leo_mq consumer is implemented in https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L502-L525,
the same error can be spammed many times when an exception is raised in the handle_call callback (https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L509), because the item is not consumed (this line is skipped: https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L510) and leo_mq_consumer keeps trying to consume the same item until it succeeds (sketched below).

In this case,
https://github.com/leo-project/leo_redundant_manager/blob/1.9.35/src/leo_membership_mq_client.erl#L157-L192 keeps failing while [email protected] is down.
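
For illustration, here is a minimal sketch of that loop in plain Erlang. It is not the actual leo_mq_consumer code, and the Mod:handle_call({consume, Msg}) contract is an assumption; it only shows why a persistently failing item is peeked, failed, and logged over and over.

%% A minimal sketch (not the actual leo_mq_consumer code) of the
%% behaviour described above: the item is removed only after a
%% successful callback, so a persistent exception makes the consumer
%% spin on the same message and log the same error repeatedly.
-module(consume_loop_sketch).
-export([consume/2]).

consume(Mod, Queue) ->
    case queue:peek(Queue) of
        empty ->
            ok;
        {value, Msg} ->
            try Mod:handle_call({consume, Msg}) of
                ok ->
                    %% success: drop the message and move on
                    consume(Mod, queue:drop(Queue))
            catch
                _:Cause ->
                    %% failure: the message is NOT dropped, so the next
                    %% iteration peeks the very same message and emits
                    %% the very same error log again
                    error_logger:error_msg("cause: ~p~n", [Cause]),
                    consume(Mod, Queue)
            end
    end.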

Solution

A temporary error is acceptable; however, since we have no idea how long an error will keep occurring, we have to avoid consuming the same item within the same batch.
(Give it another try on the next batch, so the same error appears at most once per batch.)

Related Links

@yosukehara
Member

@mocchira

A temporary error is acceptable; however, since we have no idea how long an error will keep occurring, we have to avoid consuming the same item within the same batch.

I totally agree with your idea.
I have one question regarding "in the same batch". What does it depend on?

@mocchira
Member Author

@yosukehara

I have one question regarding "in the same batch". What does it depend on?

As you know,
leo_mq consumes the items in its queue in chunks of a limited size, at a limited rate, in order to avoid bursting disk I/O. We call this processing unit a batch.
Its consumption rate is determined by

  • the interval
  • the number of batch messages

implemented in https://github.com/leo-project/leo_mq/blob/1.4.15/src/leo_mq_api.erl#L98-L103.
These values can be set by callers through leo_mq_api:new (see the sketch below the search output), so they depend on

  • leo_storage: apps/leo_storage/src/leo_storage_mq.erl
  • leo_manager: apps/leo_manager/src/leo_manager_mq_client.erl
  • leo_redundant_manager: deps/leo_redundant_manager/src/leo_membership_mq_client.erl

according to this script.

$ find deps/ apps/ -type f|xargs grep leo_mq_api:new|grep -v eunit|grep -v test
deps/leo_redundant_manager/src/leo_membership_mq_client.erl:    leo_mq_api:new(leo_mq_sup, InstanceId, [{?MQ_PROP_MOD, ?MODULE},
apps/leo_storage/src/leo_storage_mq.erl:    leo_mq_api:new(Sup, Id, [{?MQ_PROP_MOD, ?MODULE},
apps/leo_manager/src/leo_manager_mq_client.erl:    leo_mq_api:new(RefMqSup, ?QUEUE_ID_FAIL_REBALANCE
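
For reference, a hedged sketch of how such a registration might look on the caller side. ?MQ_PROP_MOD appears in the search output above, but the interval and batch-size property keys below (num_of_batch_processes, regular_interval) and the header path are illustrative assumptions, not necessarily the exact names leo_mq defines.

%% A hedged sketch of registering a queue and its consumption rate; in a
%% real client this lives inside the MQ callback module itself, and the
%% two rate-related keys are placeholders, not confirmed macro names.
-module(mq_registration_sketch).
-export([start_mq/2]).
-include_lib("leo_mq/include/leo_mq.hrl").

start_mq(Sup, InstanceId) ->
    leo_mq_api:new(Sup, InstanceId,
                   [{?MQ_PROP_MOD, ?MODULE},
                    %% how many messages one batch may process
                    {num_of_batch_processes, 1600},
                    %% pause (msec) between batches to throttle disk I/O
                    {regular_interval, 500}]).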

Let me know if the above answer is different from what you intended to ask.

@mocchira
Member Author

mocchira commented Apr 19, 2017

@yosukehara
@windkit

A naive approach would be to remember failed items in memory and skip them.

Given that we would then have N failed items in memory as a set (red-black tree) and M items in leo_mq,
the maximum complexity will be O(log N * M).
That said, it doesn't perform well once M exceeds a certain threshold.
This might not be a problem in some use cases, but it can be in write-intensive use cases where failed items remain for a long time.
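
For illustration, a sketch of that bookkeeping with hypothetical names; gb_sets serves as the balanced-tree set here, and the {MsgId, Body} message layout is an assumption.

%% Remember failed message ids and skip them when consuming:
%% O(log N) per lookup, hence O(log N * M) over M queued messages.
-module(failed_set_sketch).
-export([remember_failure/2, filter_failed/2]).

remember_failure(MsgId, FailedSet) ->
    gb_sets:add_element(MsgId, FailedSet).

filter_failed(Messages, FailedSet) ->
    %% Messages is assumed to be a list of {MsgId, Body} tuples
    [Msg || {MsgId, _Body} = Msg <- Messages,
            not gb_sets:is_element(MsgId, FailedSet)].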

Now I'm considering other approaches.
Please let me know if you have any ideas.

@windkit
Contributor

windkit commented Apr 19, 2017

To me, failed messages could be enqueued to a "retry-later" MQ.
A separate process would then scan through that queue and enqueue the messages back to their corresponding MQ.

It just seems better to log those failed messages on disk and process them later.
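
A rough sketch of that idea, using OTP's queue module as a stand-in for a dedicated on-disk MQ; PublishFun is a caller-supplied fun that puts a message back onto its original queue (a real implementation would go through leo_mq itself).

%% Park failed messages and re-enqueue them later.
-module(retry_later_sketch).
-export([park_failed/3, drain/2]).

%% on failure, park the message together with the id of its origin queue
park_failed(QueueId, MsgBin, RetryQ) ->
    queue:in({QueueId, MsgBin}, RetryQ).

%% a separate process periodically drains the retry queue and
%% re-enqueues every parked message to the queue it came from
drain(RetryQ, PublishFun) when is_function(PublishFun, 2) ->
    case queue:out(RetryQ) of
        {empty, Q1} ->
            Q1;
        {{value, {QueueId, MsgBin}}, Q1} ->
            ok = PublishFun(QueueId, MsgBin),
            drain(Q1, PublishFun)
    end.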

@mocchira
Member Author

One solution I just came up with is bulk loading.

This solution doesn't need to track failed items, so the maximum complexity will be O(M); also, since the M items are bulk-loaded at once, the actual cost of interacting with disk I/O is close to O(1).
Sounds perfect to me.
So I will go forward in this direction.
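
A sketch of that idea with hypothetical names: the batch is assumed to have been loaded from disk in a single pass, each message is tried exactly once, and a failure is logged once and left for the next batch rather than retried within the current one.

%% Try every message of a bulk-loaded batch exactly once.
-module(bulk_consume_sketch).
-export([consume_batch/2]).

consume_batch(Mod, Messages) ->
    lists:foreach(
      fun(Msg) ->
              try Mod:handle_call({consume, Msg}) of
                  ok ->
                      ok
              catch
                  _:Cause ->
                      %% log once; the failed message stays in the queue
                      %% and is only revisited on the next batch
                      error_logger:error_msg("cause: ~p~n", [Cause])
              end
      end, Messages).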

@mocchira
Member Author

@windkit thanks for sparing your time.

I also considered an approach similar to the one you mentioned; however, a more promising one came to my mind, as posted above. I will try that one first.
Anyway thanks again.

@windkit
Contributor

windkit commented Apr 19, 2017

@mocchira Could you elaborate on that?
The original problem is that we always peek the first item; if the handling fails, the item is not removed, and the next time we keep hitting the same failed item.

Wouldn't loading a batch of M messages fall into the same loop if all of them fail?

Not removing the entries from the queue would decrease the effective batch size, and thus performance, and would eventually jam up the queue.

@mocchira
Member Author

The original problem is that we always peek the first item; if the handling fails, the item is not removed, and the next time we keep hitting the same failed item.

The original problem was that we kept peeking the same first item within the same batch.

Wouldn't loading a batch of M messages fall into the same loop if all of them fail?

Right.

In my tests with

  • 5 storage nodes
  • consistency.N = 3
  • one or two nodes going down randomly for relatively short periods (a few minutes)
  • a read-intensive workload (several thousand read ops vs. several dozen write ops)

the default batch size did not jam up the queue with the bulk-load solution.

However, if multiple nodes are down for a long time under a write-intensive workload,
it would jam up the queue,
so another solution like "retry-later" would be needed.

Anyway,
bulk loading can be combined with other solutions and at least solves part of the use cases,
so as a first step I'm going to implement it.

Thoughts? > @windkit @yosukehara

@windkit
Contributor

windkit commented Apr 20, 2017

@mocchira having the bulk load as a quick fix looks good to me for the moment

@yosukehara
Member

@mocchira @windkit The bulk-loading method can cover most situations, so we should fix this issue with that approach in v1.3.3.

However, there are certainly a few irregular cases; we need to make a list of those situations and fix them in the next version, v1.4.0 or later.

@mocchira
Member Author

@yosukehara @windkit thank you for reviewing. I will go forward with bulk load for 1.3.3.

@vstax
Contributor

vstax commented Apr 26, 2017

I don't know if this is the same problem or not, but on the latest development version this happens pretty often when some node that was stopped is starting up (I believe this happened with the older version this bug was created about as well; I've noticed it at least once):

[W] [email protected]  2017-04-26 18:28:17.757349 +0300    1493220497  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'[email protected]',nodedown}
[E] [email protected]  2017-04-26 18:28:22.342160 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.353530 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.364200 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.375630 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:28:22.387463 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[...lots of lines skipped...]
[E] [email protected]  2017-04-26 18:28:24.729524 +0300    1493220504  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]

18:28:22 is around the time "leo_storage start" was executed on storage_2, and 18:28:25 is around the time it finished starting up.
This reproduces reliably. E.g. here is another case, logs from storage_1 this time:

[W] [email protected]  2017-04-26 18:37:31.276017 +0300    1493221051  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'[email protected]',nodedown}
[E] [email protected]  2017-04-26 18:37:39.207441 +0300    1493221059  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:37:39.219958 +0300    1493221059  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]
[E] [email protected]  2017-04-26 18:37:39.232078 +0300    1493221059  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]

This is probably not as critical as the original problem (i.e. it can't fill the disk); still, getting tons of errors on various nodes just because some other node is starting up normally is a bit alarming.

@mocchira
Member Author

@vstax

Thanks for reporting.
As I commented at #702 (comment),

Commits for #624 and other commits during the 1.3.3 dev cycle not only revealed #710, but also showed that retries that were supposed to be triggered by leo_mq actually weren't in some cases with <= 1.3.2.1.

This is one of the logs caused by those cases.

This is probably not as critical as the original problem (i.e. it can't fill the disk); still, getting tons of errors on various nodes just because some other node is starting up normally is a bit alarming.

Yes, this kind of log should be reduced.
As you said, since this is not as critical as the original one, we will solve it in the next release, 1.4 (the milestone will be changed from 1.3.3 to 1.4 after 1.3.3 has landed).

@mocchira
Member Author

mocchira commented Apr 28, 2017

@vstax

[E] [email protected]  2017-04-26 18:28:22.342160 +0300    1493220502  leo_mq_consumer:consume/4   526 [{module,leo_membership_mq_client},{id,mq_persistent_node},{cause,{noproc,{gen_server,call,[{rex,'[email protected]'},{call,erlang,node,[],<0.47.0>},10000]}}}]

Now we are considering making the above error happen only once rather than many times, as LeoFS <= 1.3.2.1 does.
That fix might be included in 1.3.3 if we can reach a consensus on how to fix it before 1.3.3 lands.

@mocchira
Member Author

@vstax The above fix will be included in 1.3.3 :)

@mocchira
Member Author

mocchira commented May 2, 2017

@yosukehara @windkit

The quick fix by bulk loading has landed in 1.3.3, so I will change the milestone of this issue from 1.3.3 to 1.4 for the permanent fix.

@mocchira mocchira modified the milestones: 1.4.0, 1.3.3 May 2, 2017
@mocchira mocchira added the v1.4 label May 2, 2017
@mocchira
Member Author

Permanent Fix

Problem

As mentioned above, the quick fix included in 1.3.3 solves part of the use cases; however, some cases remain to be fixed:

  • Nodes down for a long time
  • Write-intensive workload

When a LeoFS cluster suffers such conditions,
the queue gets jammed up at a certain point, and no items in the queue can be consumed until the downed nodes come back.

Solution

  • Filtering out any item that depends on downed nodes when fetching items from a queue

    • Pros
      • Easy to implement
      • No need to fix leo_redundant_manager
      • No additional queues needed
    • Cons
      • Filtering out (skipping) takes a certain amount of system resources (CPU, disk I/O)
      • The longer the existing queues, the more system resources LeoFS consumes.
  • Making an additional queue (the "retry-later" queue from the previous comment) for failed items, to be retried later once the downed nodes they depend on come back

    • Pros
      • Keeps the length of the existing queues stable while nodes are down
    • Cons
      • Implementation cost is relatively high compared to the filtering-out solution
      • Needs to cooperate with leo_redundant_manager (some notification I/F needs to be added to leo_redundant_manager)
      • Additional queues are needed (which means additional disk I/O could happen)

Comparing the above pros and cons, we are now considering adopting the former (filtering out); a rough sketch follows.
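
For illustration, a hedged sketch of the filtering-out option; the {Node, MsgBin} message layout and node_is_up/1 are assumptions, and in LeoFS the node state would come from leo_redundant_manager.

%% Drop any fetched message whose target node is currently down;
%% the dropped messages stay in the queue and are re-checked later.
-module(filter_down_sketch).
-export([fetch_batch/1]).

fetch_batch(Messages) ->
    [Msg || {Node, _MsgBin} = Msg <- Messages, node_is_up(Node)].

node_is_up(Node) ->
    %% stand-in for a real membership check
    Node =:= node() orelse lists:member(Node, nodes()).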

@yosukehara
Member

@mocchira I agree with "Filtering out any item that depends on downed nodes when fetching items from a queue" because its implementation cost is low and we can quickly see the effect of the implementation through tests.

If the result does not meet expectations, we will need to consider the latter idea.

@mocchira
Member Author

mocchira commented Feb 2, 2018

@yosukehara Thanks for the review. I will get to work on it now.

@mocchira
Member Author

mocchira commented Feb 2, 2018

@yosukehara A prototype on my dev box uses less CPU than the current develop branch. It looks like an improvement, though not a permanent fix to me; still, IMHO it's worth implementing until we polish the latter idea.
Just in case, can you review the thought below on pushing the former idea to production quality?

In order to push the application logic (filtering out items) into leo_storage, I'd like to propose adding a new field - filter_func - to mq_properties, which can be set through leo_mq_api:new and passed to leo_backend_db_server:first_n instead of the default fun (which always returns true) defined at https://github.com/leo-project/leo_mq/blob/develop/src/leo_mq_server.erl#L267-L269. With this change, leo_(manager|storage|gateway) become able to use leo_mq with user-defined filter funs.
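
To make the proposal concrete, a hedged caller-side sketch follows; the filter_func property key and the fun's (Key, MsgBin) arguments are part of the proposal (i.e. assumptions), not an existing leo_mq API, and the header path is assumed.

%% A caller-side sketch of the proposed user-defined filter function;
%% ?MQ_PROP_MOD comes from leo_mq's header, filter_func is proposed.
-module(filter_func_sketch).
-export([new_queue_with_filter/2]).
-include_lib("leo_mq/include/leo_mq.hrl").

new_queue_with_filter(Sup, InstanceId) ->
    FilterFun =
        fun(_Key, MsgBin) ->
                {Node, _Body} = binary_to_term(MsgBin),
                %% keep only messages whose target node is reachable;
                %% leo_backend_db_server:first_n would apply this fun
                %% instead of its default "always true" fun
                Node =:= node() orelse lists:member(Node, nodes())
        end,
    leo_mq_api:new(Sup, InstanceId, [{?MQ_PROP_MOD, ?MODULE},
                                     {filter_func, FilterFun}]).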

@yosukehara
Member

@mocchira

I'd like to propose adding a new field - filter_func - to mq_properties, which can be set through leo_mq_api:new and passed to leo_backend_db_server:first_n instead of the default fun (which always returns true) defined at https://github.com/leo-project/leo_mq/blob/develop/src/leo_mq_server.erl#L267-L269. With this change, leo_(manager|storage|gateway) become able to use leo_mq with user-defined filter funs.

I've just reviewed your proposal. I realized that a user-defined filter function for leo_mq is a good idea; it has finally become clear how to proceed. Let's adopt this approach.

@mocchira
Member Author

mocchira commented Feb 2, 2018

@yosukehara Thanks. I will send PR next week.

@mocchira
Member Author

mocchira commented Mar 7, 2018

Unfortunately, neither the UDFF nor #994 could essentially solve this issue, so we have to seek other solutions like the retry-later queue described above. Postponing it to 1.4.1.

@mocchira
Member Author

We've come to the conclusion to adopt the retry-later queue approach. Since this forces us to use leo_redundant_manager v2 in order to detect status changes on the leo_storage node(s), we are putting it off to 2.0.0.

@mocchira mocchira modified the milestones: 1.4.2, 2.0.0 May 15, 2018
@yosukehara yosukehara removed the v1.4 label Feb 25, 2019