After recover-node, tons of "not found" errors are generated in logs #859
Maybe this will be helpful: grep for (all parts of) these three objects
in the diagnose logs on each node that I took a few days ago, before this problem started (the objects are older than that anyway):
stor02:
stor03:
stor04:
stor05:
stor06:
|
WIP |
@vstax Just in case, can you share the result of |
@vstax Still not reproduced on our env yet.
My assumption above will be verified by the result of dump-ring. |
@mocchira Thank you for looking into this. The files ring_cur.dump and ring_prv.dump are identical on the master, on stor01 and on the gateway used to load these objects; I had to start the other nodes, and a new RING is pushed to them on start, I think? Either way, after I started them and made the dumps, they all came out identical as well. The files ring_cur and ring_prv are identical too (as are the corresponding log files - identical for cur/prv and between all nodes). The full list of timestamps in the RING is:
The first one is 'Fri Sep 15 21:00:58 2017' and the rest range from 'Mon Sep 18 17:41:21 2017' to 'Mon Sep 18 17:56:41 2017'. The first date is when stor01 was first started (the cluster wasn't initialized yet); the other dates are when stor02 to stor06 were first started. The "start" command that started the cluster happened at TS 1505746641881341 (2017-09-18 17:57:21.881193). Since then no rebalance command has happened - the cluster has lived with these 6 nodes and that's it. Apparently the RING never changed after its initial creation. Either way, each date is well before the upload of these objects happened.

On Sep 19 and 20, one more node (during an experiment with multiple nodes) was connected to the manager, but I never attached it - it was connected for some time, then disconnected and gone, and the RING wasn't recalculated. Its name was [email protected] (it was running on the same stor01, with different queue/log/avs directories and rpc.server.listen_port, of course). There are no traces of that in the manager logs.

Split brain - you mean during the upload of these objects? Well, there were no errors on the gateway or the client, and W=2. There were no known network problems, and there was nothing in the logs apart from the errors listed in #845.

Is it possible to identify the node that was the source of the trouble (if any), given how it all happened? I.e. recover-node was running for stor01, each node seems to have reached 0 messages in the queue before the problem started, and then the numbers in that queue became non-zero again, accompanied by high load. Would repeating the experiment be useful? (I.e. now that we are sure the RING is identical: stop stor02-06, delete the contents of that queue, start them, run recover-node again and wait.) |
Contents of members_cur.dump with all nodes running:
RING dump attached as well. Assuming that leo_object_storage_api:head or some other method actually returns wrong data on one of the nodes, is it possible to check it directly with remote_console? (That is, what other methods should be called first to get the AddrId and Key arguments, knowing the object and node name?) |
@vstax Thanks for sharing. It seems no inconsistencies happened at all, or they had already been recovered at some point; however, in the latter scenario you would see something in the log files notifying you that leofs detected inconsistencies and force-updated the RING, so now I think no inconsistencies happened. We are now vetting further in this direction (how it happened without RING inconsistencies). |
Yes. Assuming we have leofs-adm/CHANGELOG.md like below:
leofs@cat2neat:leofs.1.3.5$ ./leofs-adm whereis leofs-adm/CHANGELOG.md
-------+--------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
del? | node | ring address | size | checksum | has children | total chunks | clock | when
-------+--------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
| [email protected] | 6ae5ce88432a88e6ea2610059db0de9 | 204K | 16c159389a | false | 0 | 55ad92a35b816 | 2017-10-06 13:20:46 +0900
| [email protected] | 6ae5ce88432a88e6ea2610059db0de9 | 204K | 16c159389a | false | 0 | 55ad92a35b816 | 2017-10-06 13:20:46 +0900 then %% get an AddrId from a Key through leo_redundant_manager_api:get_redundancies_by_key
(storage_3@127.0.0.1)2> leo_redundant_manager_api:get_redundancies_by_key(<<"leofs-adm/CHANGELOG.md">>).
{ok,{redundancies,8880712031625058080169575614374415849,
8713827541981604648542427598869967970,
8909797327562015671913302618427450100,[],[],
[{redundant_node,'[email protected]',true,true,'L'},
{redundant_node,'[email protected]',true,true,'FL'}],
2,1,1,1,0,0,958648327}}
%% the first element of the record redundancies stores AddrId
(storage_3@127.0.0.1)4> {ok, Ret} = leo_object_storage_api:head({8880712031625058080169575614374415849, <<"leofs-adm/CHANGELOG.md">>}).
{ok,<<131,104,22,100,0,10,109,101,116,97,100,97,116,97,
95,51,109,0,0,0,22,108,101,111,102,115,45,...>>}
%% As leo_object_storage_api:head returns a raw binary, it needs to be deserialized to be readable for us.
(storage_3@127.0.0.1)5> binary_to_term(Ret).
%% Now we can see the metadata for leofs-adm/CHANGELOG.md
{metadata_3,<<"leofs-adm/CHANGELOG.md">>,
8880712031625058080169575614374415849,22,209043,
<<131,108,0,0,0,1,104,2,109,0,0,0,22,120,45,97,109,122,45,
109,101,116,97,...>>,
181,0,0,0,5243396,1507263646119958,63674482846,
30246938820081418282392782782160660966,958648327,undefined,
0,0,0,0,0,0}
%% then let's try a get through leo_storage_handler_object:get with the make_ref builtin function
(storage_3@127.0.0.1)7> {ok, _, Meta, Body} = leo_storage_handler_object:get({make_ref(), <<"leofs-adm/CHANGELOG.md">>}).
%% Now we can see the metadata along with its Body
{ok,#Ref<0.2250545746.352059393.107826>,
{metadata_3,<<"leofs-adm/CHANGELOG.md">>,
8880712031625058080169575614374415849,22,209043,
<<131,108,0,0,0,1,104,2,109,0,0,0,22,120,45,97,109,122,45,
109,...>>,
181,0,0,0,5243396,1507263646119958,63674482846,
30246938820081418282392782782160660966,958648327,undefined,
0,0,0,0,0,0},
<<"# CHANGELOG\n## 1.3.7 (Sep 12, 2017)\n\n### Fixed Bugs\n\n* [#592](https://github.com/leo-project/leofs/i"...>>} let me know if you have any question. |
@vstax Let me ask you this:
Could you try leo_object_storage_api:head and leo_storage_handler_object:get on remote_console against the above objects that caused leo_storage(s) to dump logs massively, and share the result? Now we can only say
so if you get an inconsistent result (one is deleted, the other is present) between leo_object_storage_api:head and leo_storage_handler_object:get, then I would not recommend repeating the same experiment, as it will probably reach the same end. If not, then things might go well. |
@mocchira Well here is some output but it doesn't work as well as it did for you... From stor01:
From stor02:
Stor05:
Is leo_storage_handler_object:get() supposed to work with these remnants of multipart uploads anyway? It works for me for real objects, but not for these parts with "\n". |
Variables in Erlang are immutable, so you can't reuse the same variable to store a different value; you have to call f() to forget the binding first. For instance:
Eshell V9.0 (abort with ^G)
2> B = 1.
1
3> B = 2.
** exception error: no match of right hand side value 2
4> f().
ok
5> B = 2.
2
6> |
@mocchira Yeah, I've tried that as well (I've removed a few lines from the output as the result was the same)
|
Yes, and the result you pasted looks fine now.
{metadata_3,<<"bod1/00/ac/66/00ac66deef727a4538c57e1415d1f24a006c614c8f66e185561cc0d0580f27a61af8c36d5dc9d192876dbfa1192ee5"...>>,
333535048481842093793027197745423965434,162,0,<<>>,0,0,0,0,
522207252,1506115109738840,63673334309,
281949768489412648962353822266799178366,2263375396,
undefined,0,0,0,0,0,1}
The last field is the delete flag, and now it's set to 1 (meaning true), and
leo_storage_handler_object:get returns not_found, so the results are now consistent, and I believe the same error never happens for objects in such a consistent state.
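For reference, a minimal remote_console sketch of that head-vs-get check (a sketch only, under the assumptions shown in this thread: the AddrId is the first field of the redundancies record, the delete flag is the last field of the metadata tuple, and the key below is just an example to replace with one of the problem objects):
%% Sketch: compare leo_object_storage_api:head with leo_storage_handler_object:get for one key.
Key = <<"leofs-adm/CHANGELOG.md">>,                        %% replace with a problem object
{ok, Red} = leo_redundant_manager_api:get_redundancies_by_key(Key),
AddrId = element(2, Red),                                  %% first field of the redundancies record
{ok, Bin} = leo_object_storage_api:head({AddrId, Key}),    %% may also return not_found if no metadata exists
Meta = binary_to_term(Bin),
DelFlag = element(tuple_size(Meta), Meta),                 %% 1 = deleted, 0 = present
GetRes = (catch leo_storage_handler_object:get({make_ref(), Key})),
%% Consistent when DelFlag =:= 1 and GetRes is not_found,
%% or DelFlag =:= 0 and GetRes matches {ok, _, _, _}.
{DelFlag, GetRes}.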
Now repeating the experiment would be useful to me. |
@mocchira
Errors start to spam the log just like that:
but the queue was still shrinking at that point. Also, these errors are exactly the same as last time, i.e. the names and order of the objects in the log match the previous attempt. A few minutes later, the queue gets to
and stops shrinking. The numbers in the other queues have settled at exactly the same values as last time (e.g. on stor05 it's 13982). Let me share some thoughts about the problem - maybe you'll notice something that will give you a hint about what can be wrong here.

Also, there are exactly 1600 different objects in the error logs (each is an MU upload temporary object); each object name was repeated 16-18K times during the few hours the node was working like this. The exact objects are somewhat different on each node, though. 1600 here is obviously the MQ batch size. The fact that all objects in the logs had either a "bod1" or "bod2" prefix is just because the queue never moved past the first batch; looking at these queues directly, I can see objects from different buckets like "body", "body9" or "body12". I was uploading objects to various buckets in several sessions - e.g. I was uploading to bucket "body" on Sep 22 and to bucket "body12" on Sep 28, with others in between. All are affected by this problem.

In fact, looking at the raw queues directly, I can see mentions of ~500-800K strings with object names, but considering that there are 3-10 duplicate mentions per object, the real amount is probably closer to 70K per node, which roughly matches the total amount of these multipart headers: ~11K per bucket, 13 buckets (~140K total, or 70K per node). I have no idea why each node only shows 12-18K messages in the queue, though (I've tried crashing the node to make it count the messages again, but it was the same number). Anyhow, when looking at the raw queue datafiles, it seems like every multipart object that I've uploaded is in there and no other objects at all. I also manually checked around 10 large and normal objects: every large object is mentioned in these queues, while no normal object is present there. (Unfortunately, I still don't know how to open this DB properly, as the LevelDB tools that I've tried don't like the format.)

Objects in at least half of these buckets were uploaded through a single gateway, so LB + multiple gateways aren't a factor either. Still, for every upload, no matter how it was done, the header of the multipart object seems to cause problems during recover-node. I think if I upload more big objects they will show this problem as well, but it's hard to prove with the queues being stuck on the first batch.

By the way, how come the queue wasn't throttling / reducing the batch processing size? I mean, when diagnosing different problems before I've seen that happen, but this time the same batch of 1600 messages is trying to process again and again without pause. Unrelated to the problem itself, though.

I checked the head() output for the first object from this log and it's deleted on all 3 nodes:
So, what else can be done to get more information? Add some logging? Try recover-node on a different node like stor02? |
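Regarding opening the raw queue files mentioned above: a minimal sketch of one possible approach, assuming the per-object queue is a plain eleveldb (LevelDB) database as LeoFS uses internally; the path below is a placeholder, and the node must be stopped (or the directory copied first), since LevelDB does not allow concurrent opens:
%% Sketch only: count and list the raw message keys in a copied queue directory.
{ok, Ref} = eleveldb:open("/path/to/copy/of/leo_per_object_queue", [{create_if_missing, false}]),
Keys = eleveldb:fold_keys(Ref, fun(K, Acc) -> [K | Acc] end, [], []),
io:format("~p messages in queue~n", [length(Keys)]),
eleveldb:close(Ref).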
@mocchira
|
@vstax Thanks for further digging. I may have found the root cause; however, it's difficult for me to find time today for some reason, so I will work on it tomorrow and ask you to try the patch once I send the PR. |
@mocchira Thank you, it (almost) works. The queue gets processed; however, the "not_found" errors still appear in the logs as the queue processes. I think that since everything is all right (there is nothing wrong with having temporary multipart headers during recover-node) there should be no errors. Recover-file still doesn't work - nothing happens. I tried enabling debug logs; there is nothing in them. Queues are empty. Also, after I launched a node - stor02 here - to process the old queue (that was stuck before) with this patch, this happened:
The amount of "not_found" lines match amount of messages in queue. What's bothering me is "badarg" errors that happened before that queue started to process and errors about "hcheck/empty" object. This is the (zero-byte) object that I was using for health checking from LB. It was never deleted, whereis output for it:
Why would the logs mention it at all? All the other nodes were running and weren't doing anything when I launched stor02. That said, these problems seem minor and don't prevent me from finishing recover-node testing. |
@vstax Thanks for checking.
Good to hear. I will keep working on your blocking issues here and in #846.
It seems that calling https://github.com/mocchira/leofs/blob/fix/issue859/apps/leo_storage/src/leo_storage_mq.erl#L731 with "hcheck/empty" returns not_found, which means there might be inconsistent RING records on stor02. Can you check the result of leo_redundant_manager_api:get_redundancies_by_key with "hcheck/empty" on stor01 or stor04? The result would look like below:
(storage_3@127.0.0.1)3> leo_redundant_manager_api:get_redundancies_by_key(<<"hcheck/empty">>).
{ok,{redundancies,204712737994773042953409181065537888892,
204330855779912657295634281112907108201,
205479344241174860687721462239176707334,[],[],
[{redundant_node,'[email protected]',true,true,'L'},
{redundant_node,'[email protected]',true,true,'FL'}],
2,1,1,1,0,0,958648327}}
then do
As it turned out, the sort of not_found error pasted above can be reduced, so I will update the PR later; please check it out when you have time.
Can you share the status of the target of recover-file with whereis?
This badarg error is an unexpected one (in theory it should never happen), so I'd like to dig deeper with debug logs. If possible, once I update the PR to include additional debug logging, can you run the same test that will probably reproduce those badarg errors again with debug enabled, and share the debug logs afterwards? |
Updated #876 for
and
|
@mocchira Thank you for the support.
The output of get_redundancies_by_key is identical on all 3 nodes; the RING dumps and logs are identical as well, and identical to the RING dumps that I made 7 days ago for #859 (comment). The second and third numbers are present in the log file (but not the first one).
Log from RING dump on stor02:
Unfortunately, I lost (moved instead of copied) the copy of the queues on stor02, so I can't try to process them again; I will repeat the original experiment ("recover-node stor01") which created these queues in the first place, but running with the latest patches. Maybe I will see this error again.
Recover-file fails to work on any objects that are in an inconsistent state because of #845, for example:
It's only for these deleted temporary multipart header objects that recover-file doesn't work for me (it doesn't set the "delete" flag on stor05 here). It works correctly for existing normal objects (including large objects). I can't say whether it works for deleted normal objects, as I don't have any that are in an inconsistent state.
Unfortunately, it doesn't quite work. I do have badargs without this change, but with it I have neither not_found errors, nor badargs, nor extra debug output. I've tried the older version again with the copied queue directory that I'd saved, and both the thousands of badargs and the errors were back. Actually, I did get a single badarg once on one of the nodes even with the latest patch version:
but it's just one line and only on this node; unfortunately, precisely on this node and at that launch I didn't have debug logs enabled. I've restarted this node multiple times after that (replacing the empty queues with the copy before each start) but wasn't able to reproduce this badarg with debug logs enabled. "send_object_to_remote_node" never appears in the debug logs. Does this make any sense?

EDIT: I have an object with the name "body//s/to/ra//storage/nfsshares/storage/00/ff/00ff10f0536f07ccc38af567bf2fb3edd8cefc7d5efd0fc66c7910c4da2ccd69b8d457bd44ee6b1be8b9230c130fbcb7c200000000000000.xz" on stor03 (that is, stor03 is the primary node for it); it was created due to an incorrect launch of the upload script. The double slashes in the name don't seem to create any problems during PUT/GET, but I wonder if it is really OK to have such an object (given that slashes are used as the separator when creating the index for prefix search and such). The reason I remembered this object (which exists and was never deleted) is that it always appears in the logs when launching stor03 with these (formerly problematic) queues:
I think it's the only object in these queues that doesn't fall into the "deleted MU header" category, and it's small as well (184 bytes). Not sure why it's there (maybe it just happened to be the last object in the queues), but I doubt it's the cause of any problems with the queues, as it only exists on 3 nodes and the problems are on each of them. I'm mentioning it just in case, because I've never tried working with objects with double slashes in the name before. |
@vstax Thanks for trying the new patch.
Looks fine now, so the error that happened at https://github.com/mocchira/leofs/blob/fix/issue859/apps/leo_storage/src/leo_storage_mq.erl#L761-L762 will no longer happen with the current stable RING.
OK. Please let me know if you see it again.
Maybe. Other than debug logging, the previous patch and the current one are logically doing the same thing; however, physically there is a difference.
In order to avoid getting the error log filled with not_found, I changed to the current logic, and this may affect the odds of badarg happening for some reason (reduced I/O load on the partition where the log files live, a reduced GET rate, or something else may decrease the odds of some unknown racy problem?). Given that the badarg rate has decreased drastically with the current patch, I'd like to change the log level for tracking these badargs to ERROR, so please let me know if you face a badarg again.
Thanks for sharing.
### there is test//readme
leofs@cat2neat:leofs.1.3.5$ s3cmd ls s3://test/
DIR s3://test//
### expect readme to be returned but the result was empty
leofs@cat2neat:leofs.1.3.5$ s3cmd ls s3://test//
So I would not recommend using a key containing double slashes. |
WIP |
Updated #876 for
|
Updated #876 for
|
Hmm, I don't think the RING was broken at any point, though, so it's strange. Well, it doesn't matter that much now, I guess. It didn't happen in the latest recover-node experiment either.
Let's see... The queue processed very fast and probably was IO bound. The logs (if you mean the text logs) - I don't think they should affect anything, as there is no fsync(), there aren't enough writes to fill the dirty buffers, and the OS flushes buffers once every 5 seconds; but badargs (with the old patch version) happen all the time and on every node, so it's not about the text logs.

The queue files themselves: LevelDB should issue fsync() or at least fdatasync() under some conditions during write, I think? The queue is stored in /var, which is located on a soft RAID10 made from small partitions at the beginning of each drive used for AVS files. So intense seeking around the AVS files or their indexes will hurt IO for the queues if they try to sync. That said, the thousands of badargs always happen within 4-5 seconds; that's fast enough not to have serious IO expectations, I think? I mean, there might be IO load, but it's fast enough that it shouldn't hit any internal timeout or anything. At the very least, it seems unlikely to me that when something happens that fast, IO load should cause any errors directly.

The interesting part is that the amount of badarg messages on each node was some round number like 11200, 8400, 9200. The exact amount on each node isn't stable; it can vary from launch to launch. Plus, there is that case with a single badarg message - which I wasn't able to reproduce afterwards - another argument for the "some race" theory, I'd say. Unfortunately :( It should be better now with the message being "error", as I can't really run recover-node with debug enabled (due to logger problems). Then again, I can't seem to get badargs anymore anyway.
Oh, I did not plan to - it happened by mistake. Yes, it seems to be hidden from "ls", but once you get past the double slashes it works again:
I can't test it now, but googling shows that real AWS would've shown "storage" here when trying to do "ls s3://body//s/to/ra//". Well, it's an obscure and minor problem. The latest #876 allows me to do recover-file on an MU header correctly, thanks. |
I don't have this issue anymore. Even though the reason for the badargs wasn't found, they don't happen right now, so maybe it's OK to close this. |
@vstax OK, please close this one. Just in case, I will file another issue to track how the badarg could happen. |
I was trying to perform some data loss / recovery experiments on a (soon-to-become) production cluster: 6 nodes, N=3, currently holding around 10M objects (~5M on each node). For the first experiment I simply executed "leofs-adm recover-node [email protected]" - every node was in perfect condition before that command, and I didn't do anything else other than wait for the recovery to finish. There was no other load either.
The recovery performance was bound by the 1Gb network - i.e. I saw >950 Mb/s of incoming traffic on stor01 all the time. There was no write IO on stor01 and low-to-moderate read IO on every other node. The queues on stor01 were empty; the "leo_per_object_queue" queue on all the other nodes was slowly growing, then slowly dropping. I expected the recovery to finish when all queues dropped to 0, and eventually that queue reached zero on all nodes but the last one (stor02). It took around 11 hours, which is expected (~3300 GB of data to recover).
However, around the point when the last one, stor02's "leo_per_object_queue", was getting close to 0, something seriously broke. All nodes except stor01 started to get high CPU load again and to generate an insane amount of errors in the log (8-9M lines per hour). All the errors look like this:
on another node:
I.e. nodes stor02-stor06 are generating logs related to parts of large objects. There aren't even that many large objects on the nodes - each object name is repeated 4-5K times per hour in these logs. The queues on stor01 are empty; all other nodes have a bunch of messages in "leo_per_object_queue". The queue state is "running" but the numbers aren't changing:
The problem didn't start at the same moment on every node: on stor06 it started at 20:40 and on stor02 at 22:08, for example. This could very well be around the time the queue on each node reached 0 for the first time.
There are no other errors on storage nodes and no errors on managers.
"whereis" output for some of these objects doesn't show anything strange. E.g. these 3 objects that are in bodies02's log:
I had to stop all the storage nodes except stor01 for now, as each of them was generating 3GB of logs per hour (which is OK since the cluster isn't under real load yet).