[leo_mq][leo_redundant_manager] Spamming error logs can happen #710
I totally agree with your idea.
As you know, it is implemented in https://github.com/leo-project/leo_mq/blob/1.4.15/src/leo_mq_api.erl#L98-L103.
The callers of leo_mq_api:new can be found with this script:

$ find deps/ apps/ -type f|xargs grep leo_mq_api:new|grep -v eunit|grep -v test
deps/leo_redundant_manager/src/leo_membership_mq_client.erl: leo_mq_api:new(leo_mq_sup, InstanceId, [{?MQ_PROP_MOD, ?MODULE},
apps/leo_storage/src/leo_storage_mq.erl: leo_mq_api:new(Sup, Id, [{?MQ_PROP_MOD, ?MODULE},
apps/leo_manager/src/leo_manager_mq_client.erl: leo_mq_api:new(RefMqSup, ?QUEUE_ID_FAIL_REBALANCE

Let me know if the above answer is different from what you intended to ask.
A naive approach would be to track the failed items in memory.
Given that we would then have N failed items in memory as a set (a red-black tree) and M items on leo_mq, I'm now considering other approaches.
To me, a failed message could be enqueued to a "retry-later" MQ. It just seems better to log those failed messages on disk and process them later.
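A rough sketch of that retry-later idea, assuming hypothetical helper and queue names (this is not the actual LeoFS code):

```erlang
-module(retry_later_sketch).
-export([handle_item/2]).

%% Try to consume an item from the primary queue; on failure, move it
%% to a separate "retry-later" queue persisted on disk, so the primary
%% queue keeps draining and the failed item is processed later.
handle_item(Item, ConsumeFun) ->
    case ConsumeFun(Item) of
        ok ->
            ok;
        {error, _Cause} ->
            publish(retry_later_queue, Item)
    end.

%% Placeholder: a real implementation would enqueue into leo_mq here.
publish(_QueueId, _Item) ->
    ok.
```

The point is that a failure moves the item off the hot path instead of leaving it at the head of the primary queue.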
One solution I just came up with is bulk loading.
This solution doesn't need to track failed items, so the maximum complexity will be O(M); also, since the M items are bulk-loaded at once, the actual cost of interacting with disk I/O is close to O(1).
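A minimal sketch of the bulk-load idea; first_n/2, consume/1 and dequeue/2 are illustrative stubs, not the actual leo_mq internals:

```erlang
-module(bulk_load_sketch).
-export([consume_batch/2]).

%% Fetch up to BatchSize items in one disk read, then walk them in
%% memory; a failing item no longer blocks the head of the queue
%% within the batch and is simply retried on a later batch.
consume_batch(Queue, BatchSize) ->
    Items = first_n(Queue, BatchSize),
    lists:foreach(
      fun(Item) ->
              case consume(Item) of
                  ok              -> dequeue(Queue, Item);
                  {error, _Cause} -> ok   %% leave it for the next batch
              end
      end, Items).

%% Stubs standing in for the backend-db read and the consumer callback.
first_n(_Queue, _N)    -> [].
consume(_Item)         -> ok.
dequeue(_Queue, _Item) -> ok.
```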
@windkit thanks for sparing your time. I also considered an approach similar to the one you mentioned; however, the more promising one that came to my mind is posted above. I will try that one first.
@mocchira Could you elaborate more on that? Wouldn't loading a batch of M messages fall into the same loop if all of them fail? Not removing the entries from the queue would decrease the effective batch size, and thus performance, and eventually would jam up the queue.
The original problem was that the consumer kept peeking at the first item within the same batch.
Right. In my tests with
In such cases, however, in case multiple nodes are down for a long time with a write-intensive workload... Anyway, thoughts? @windkit @yosukehara
@mocchira having the bulk load as a quick fix looks good to me for the moment.
@yosukehara @windkit thank you for reviewing. I will go forward with the bulk load for 1.3.3.
I don't know if this is the same problem or not, but on the latest development version this happens pretty often when some node that was stopped is starting up (I believe this happened with the older version that this bug was created about as well; I've noticed it at least once):
18:28:22 is around the time "leo_storage start" was executed on storage_2, and 18:28:25 is around the time it finished starting up.
This is probably not as critical as the original problem (i.e. it can't fill the disk); still, getting tons of errors on various nodes just because some other node is starting up fine is a bit alarming.
Thanks for reporting.
This is one of the logs caused by the above cases.
Yes, this kind of log should be reduced.
Now we are considering making the above error happen only once, as LeoFS <= 1.3.2.1 does, rather than many times.
@vstax The above fix will be included in 1.3.3 :)
The quick fix by bulk loading has landed in 1.3.3, so I will change the milestone of this issue from 1.3.3 to 1.4 for the permanent fix.
Permanent Fix
Problem
As mentioned above, the quick fix included in 1.3.3 solves part of the use cases; however, some cases remain to be fixed when a LeoFS cluster suffers such conditions.
Solution
Given the comparison of the above pros/cons, we are now considering adopting the former one (filter out).
@mocchira I agree with "Filtering out any item that depends on downed nodes when fetching items from a queue" because its implementation cost is low and we can quickly recognize the effect of the implementation through tests. If the result is not as expected, we need to consider the latter idea.
@yosukehara Thanks for the review. I will get to work on it now.
@yosukehara A prototype on my dev box makes CPU usage lower than the current develop branch. It looks like an improvement, though not a permanent fix, to me; still, IMHO it's worth implementing until we polish the latter idea. In order to push the application logic (filtering out items) into leo_storage, I'd like to propose adding a new field, filter_func, to mq_properties, which can be set when calling leo_mq_api:new and is passed to leo_backend_db_server:first_n instead of the default function (which always returns true) defined at https://github.com/leo-project/leo_mq/blob/develop/src/leo_mq_server.erl#L267-L269. With this change, leo_(manager|storage|gateway) would be able to use leo_mq with user-defined filter functions.
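A hedged sketch of how the proposed filter_func might look from the caller's side; the property key names, the filter arity, and item_to_node/1 are assumptions based on the proposal above, not the shipped leo_mq API:

```erlang
-module(filter_func_sketch).
-export([start_queue/2]).

%% Pass a user-defined filter to the queue so that items depending on
%% unreachable nodes are skipped while fetching a batch and picked up
%% again later, instead of failing over and over.
start_queue(Sup, Id) ->
    FilterFun = fun(Item) ->
                        net_adm:ping(item_to_node(Item)) =:= pong
                end,
    leo_mq_api:new(Sup, Id, [{mq_prop_mod, ?MODULE},      %% assumed key names
                             {filter_func, FilterFun}]).

%% Placeholder: real code would decode the enqueued message to find
%% the node it depends on.
item_to_node(_Item) ->
    'storage_0@127.0.0.1'.
```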
I've just reviewed your proposal. User Defined Filter Function for
@yosukehara Thanks. I will send a PR next week.
Unfortunately, neither UDFF nor #994 could fundamentally solve this issue, so we have to seek other solutions such as the retry-later queue described above. Postponed to 1.4.1.
We've come to the conclusion to adopt the retry-later queue approach. Since this forces us to use leo_redundant_manager v2 in order to detect the status changes that happen on leo_storage node(s), we have put this off to 2.0.0.
What versions
Where the problem happened
[email protected]
When
While [email protected] was down.
What you did
package/leo_storage_0/bin/leo_storage stop.
package/leo_storage_0/bin/leo_storage start.
What happened
The same error log was repeated endlessly while [email protected] was down.
So it can cause DISK FULL if the downtime is too long.
What should happen instead
Ideally, far fewer errors should be dumped, rather than the same error being spammed too many times.
Why
Given how the leo_mq consumer is implemented in https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L502-L525, spamming the same error many times can happen when an exception is raised in the handle_call callback (https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L509), because the item is not consumed (this line is skipped: https://github.com/leo-project/leo_mq/blob/1.4.14/src/leo_mq_consumer.erl#L510) and leo_mq_consumer tries to consume the same item until it succeeds.
In this case, https://github.com/leo-project/leo_redundant_manager/blob/1.9.35/src/leo_membership_mq_client.erl#L157-L192 keeps failing while [email protected] is down.
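A simplified sketch of that failure mode, using stub functions rather than the actual leo_mq_consumer code: when handling the head item raises, the item is never dequeued, so every iteration picks the same item and logs the same error.

```erlang
-module(consume_loop_sketch).
-export([loop/1]).

loop(Queue) ->
    case peek(Queue) of
        empty ->
            ok;
        {ok, Item} ->
            try handle(Item) of
                ok -> dequeue(Queue, Item)          %% only reached on success
            catch
                _:Cause ->
                    %% the item stays at the head of the queue, so the
                    %% next iteration hits the very same error again
                    error_logger:error_msg("~p~n", [Cause])
            end,
            loop(Queue)
    end.

%% Stubs standing in for the real queue backend and callback.
peek(_Queue)           -> empty.
handle(_Item)          -> ok.
dequeue(_Queue, _Item) -> ok.
```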
Solution
It's OK if an error is temporary; however, since we have no idea how long an error will keep occurring, we have to avoid consuming the same item within the same batch (give it another try on the next batch, so the same error should appear only once per batch).
Related Links