Fixing uneven distribution of data between nodes #846
WIP
@vstax After discussing this within the team, we've come to the conclusion below.
P.S. With the erasure coding (EC) that will be shipped in the 1.4 release, you can expect the uneven distribution problem to be mitigated if you set the fragment size used by EC to something smaller than 5MB (the current default for large objects).
@mocchira Thank you for looking at this. Actually, I'm always getting somewhat uneven distributions. E.g. the cluster for experiments (filled with older "dev" data twice, 2M objects):
(11% difference) Dev cluster (just like production, but for the dev environment; its object distribution is a bit different from production and leans towards bigger objects. 1.1M objects):
Around 7% difference. Both these clusters are N=2 with 3 storage nodes. It doesn't matter how many objects are uploaded - for each cluster these distributions are fixed by the time only 10% of the objects are there and don't change from that point on. Production nodes have N=3 and 6 storage nodes. The number of objects was 5.5M at the point when I gave you these numbers (it's about twice that now), but the difference in disk usage between nodes was the same whether it was 1M objects, 5M or 10M. With this number of objects, the average object size on each node is nearly the same in all cases (i.e. it doesn't matter how you compare the nodes - by number of objects or by their size). Configuration for the experimental cluster (in conf.d format, only changes from the default config; I removed the changed directory paths as well), manager:
Storage (3 nodes):
Configuration for the dev cluster: manager nodes are the same as in the experimental cluster. Storage (3 nodes):
Histogram for the dev cluster (the experimental cluster is the same, just with scaled numbers). Out of 1095117 objects, 12% are less than 10K in size, 28% between 10K and 50K, 33% between 50K and 100K, 18% between 100K and 1M, 6.5% between 1M and 5M, 2% above 5M. Production cluster, manager nodes:
Storage (6 nodes):
As for the histogram: right now (for this upload) a certain subset of production data was picked, one which heavily favors large files. So right now the histogram is: 2% of objects less than 100K in size, 83% from 100K to 1M, 13% from 1M to 5M, less than 2% above 5M. For real production data the histogram will be: 47% less than 10K, 11% between 10K and 50K, 7% between 50K and 100K, 32% between 100K and 1M, 2% between 1M and 5M, 0.5% above 5M. That will change somewhat after we get LeoFS fully running, as we plan to raise the limits on the object sizes we store (right now we don't store objects >20M except under special circumstances).

About num_of_vnodes tweaking, there is one thing that bothers me. Suppose the default RING for 6 nodes comes out so that node A has 15% more objects than node B for my data. Knowing that, I tweak num_of_vnodes to compensate. I upload the data and everything is even. At some point more nodes are needed, say node G and node H are added. Now we have 8 nodes; however, if I had started from 8 nodes I'd definitely have had a different distribution between nodes A and B (it could be that B had 15% more objects than A, for example). So how will it be with 8 nodes when num_of_vnodes was tweaked for A and B in the 6-node case? Will it still work as before, or do more harm than good? Also, if I tweaked these values for all 6 nodes, I'll have absolutely no idea how to set them for nodes G and H except for the default values, since I'll be able to see the new distribution only after rebalance - at which point it'll be too late, since I obviously can't safely change num_of_vnodes on a working cluster with data. (A toy simulation of this scenario is sketched after this comment.)

Regarding the EC feature, it sounds really nice, but we are definitely launching production as soon as possible (i.e. after this and some more minor issues are resolved). Also, it will probably change a lot, so it'll require lots of testing and such first anyway.
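A rough way to explore the num_of_vnodes question above offline is to simulate a weighted consistent-hash ring and watch how per-node shares shift when nodes are added. The sketch below is only an approximation under assumed mechanics (plain MD5 of "node_vnode-index" for vnode placement, not LeoFS's actual RING code), and the node names and vnode counts are made up; it only illustrates that weights tuned for one membership can land differently after the membership changes.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

def md5_int(s):
    # hash a string onto a simulated 2^128 ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(vnodes_per_node):
    # vnodes_per_node: {"node_a": 150, ...} -- per-node vnode counts (weights)
    return sorted((md5_int(f"{node}_{i}"), node)
                  for node, n in vnodes_per_node.items()
                  for i in range(n))

def lookup(ring, key):
    # the first vnode clockwise from the key's hash owns the key
    idx = bisect_left(ring, (md5_int(key),))
    return ring[idx % len(ring)][1]

def shares(ring, num_keys=100_000):
    counts = Counter(lookup(ring, f"object/{i}") for i in range(num_keys))
    return {node: round(100.0 * c / num_keys, 1) for node, c in sorted(counts.items())}

# 6 nodes with vnode counts tweaked to compensate for an observed imbalance
six = {"node_a": 150, "node_b": 180, "node_c": 168,
       "node_d": 160, "node_e": 145, "node_f": 175}
print("6 nodes:", shares(build_ring(six)))

# add two nodes with the default count and see how the tuned shares shift
eight = dict(six, node_g=168, node_h=168)
print("8 nodes:", shares(build_ring(eight)))
```

Running this with a few different weight sets gives a feel for how much of a 6-node tuning survives the jump to 8 nodes.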
@vstax Thanks for sharing the detailed info. We've started to vet this further.
Got it.
@mocchira Personally, since I'm building and using my own packages, I don't care much about the release right now, more about there being no bugs (I understand that releases get more testing than the develop branch, but since I've checked the changes in the repo since the last release, it's the same for me). Currently the only blocking issues are this one and #859 - which was totally unexpected and made me stop the testing process. I know I could clear these queues, but some tests I wanted to do right now (while LeoFS is working in production as a secondary source of data, not the primary one) involve how the system behaves during a plain recover-node operation, during recover-node under load after losing a drive, and during a double failure (a drive is lost, recover-node is launched, and during recovery another node is gone completely, installed anew, and a second recovery is launched as well). So if this is caused by some bug, it's pointless to do the second and third tests until it's fixed...
@vstax Thanks for the reply. We will re-prioritize those issues based on your feedback.
WIP
@mocchira
Yes, it's extremely inefficient and takes a long time to execute (maybe you can rewrite it directly around leo_manager_api:whereis?), but it shows the problem precisely. The distribution will always be like that when executed again (with different random strings) or when increasing the number of names. If you could integrate this code into a function (input parameters: the length of each random name and the number of random names; result: the number of times each storage node appears in the output of the whereis function), it could be used to quickly evaluate how well balanced the current RING is (a rough external version is sketched after this comment). I've tried with 20-byte random strings as well (exactly the same distribution) and more realistic names (checking whether the problem only exists for longer random strings) like
but the distribution is always the same. The fact that I used only alphanumeric characters for these tests shouldn't matter for a correct hash function - and since there seem to be no differences in distribution between short real words, longer random strings and even very long random strings, that part definitely works correctly.
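For reference, a crude external version of that balance check might look like the sketch below. It is an assumption-laden sketch, not a tested tool: it shells out to `leofs-adm whereis` (slow, as noted above), guesses that storage node names appear in the output as something@host, and uses a hypothetical `bucket/` key prefix, so the regex and key format would need adjusting to the real output.

```python
#!/usr/bin/env python3
# Tally which storage nodes `whereis` reports for a batch of random keys.
# Assumptions: leofs-adm is on PATH, its output contains Erlang-style node
# names (something@host), and "bucket/" is a stand-in key prefix.
import random
import re
import string
import subprocess
from collections import Counter

NODE_RE = re.compile(r"\S+@\S+")
counts = Counter()

def random_key(length=20):
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

for _ in range(2000):
    out = subprocess.run(["leofs-adm", "whereis", f"bucket/{random_key()}"],
                         capture_output=True, text=True).stdout
    counts.update(set(NODE_RE.findall(out)))   # count each node once per key

total = sum(counts.values()) or 1
for node, n in counts.most_common():
    print(f"{node:40s} {n:6d}  ({100.0 * n / total:.1f}%)")
```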
@vstax Thanks for sharing your experiment. Actually, I have been doing the same kind of tests and now suspect there may be something wrong in the RING distribution. Still not sure, but I've found a few things that may affect the distribution range, so now I'm writing a script to check the actual range (32bit space) assigned to each node and detect which part may be wrong.
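For what it's worth, the core of such a range check can be quite small once the vnode-to-node pairs have been extracted. The sketch below assumes the dump has already been parsed into plain (vnode_id, node) tuples - nothing is assumed about the actual on-disk format of ring_cur_worker.log or the vnodeid_nodes records - and credits each vnode with the gap back to its predecessor on the ring.

```python
# Compute how much of the hash space each node owns, given (vnode_id, node)
# pairs extracted from a RING dump. The gap between a vnode and its
# predecessor on the ring is credited to that vnode's node.
from collections import defaultdict

def coverage(vnode_pairs, ring_space=2 ** 128):
    pairs = sorted(vnode_pairs)                  # [(vnode_id, node), ...]
    owned = defaultdict(int)
    prev = pairs[-1][0] - ring_space             # wrap: the last vnode precedes the first
    for vnode_id, node in pairs:
        owned[node] += vnode_id - prev
        prev = vnode_id
    return {node: 100.0 * span / ring_space for node, span in owned.items()}

# tiny made-up example on a ring of size 100:
example = [(10, "storage_0"), (25, "storage_1"), (60, "storage_2"), (90, "storage_0")]
print(coverage(example, ring_space=100))
# -> {'storage_0': 50.0, 'storage_1': 15.0, 'storage_2': 35.0}
```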
@vstax Can you share the ring_cur_worker.log.xxxx from the cluster where you did the experiments? If my guess is correct, then the first record of vnodeid_nodes stores the 3 nodes below
the reason why this causes uneven distribution is
@mocchira
Full version attached: ring_cur_worker.log.63674417934.gz
LeoFS' redundant-manager randomly assigns virtual-nodes on the RING (2 ^ 128). Therefore, some bias is inevitable. I have been considering designing the virtual-node rebalance feature to fix this issue in the near future, in v1.4 or later.
As the virtual-node rebalance feature will depend heavily on leo_konsul (the leo_redundant_manager replacement), and leo_konsul will involve lots of code changes and may impact LeoFS quality, we will postpone this issue to 2.0.0.
This kind of turned into a discussion about how to solve uneven distribution after it has happened - but how about just making sure it doesn't happen in the first place? For example, when creating a new cluster, the first RING generated had a really uneven distribution (I checked by supplying ~2000 strings to the "whereis" command). So I generated it again and again, and the third one was much more even than the first. Maybe - if changing the generation algorithm to fix the problem is too complicated - it's possible to just generate multiple times? Unfortunately, the external "whereis" command is very slow and what I did took minutes, but maybe it will be OK performance-wise if done internally. Like, provide a number of tries to RING generation, so it's generated multiple times and the most even one is picked? :) Yes, it seems pretty hackish, but this is a one-time problem anyway (or a "few times" problem, if counting rebalance), so maybe this would be the simplest solution. Some solution is needed for fixing generation anyway; even if virtual-node rebalance is implemented, it would be no good if the distribution is still uneven after the first generation and before rebalance.
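As a toy illustration of the "generate several candidate RINGs and keep the most even one" idea - not LeoFS code: vnode positions here are just drawn uniformly at random, the node names and vnode count are made up, and evenness is scored as the max/min ratio of owned hash space:

```python
# Toy version of "generate several candidate RINGs, keep the most even one".
import random

NODES = ["s0", "s1", "s2", "s3", "s4", "s5"]
VNODES_PER_NODE = 168
SPACE = 2 ** 128

def random_ring(seed):
    rnd = random.Random(seed)
    return sorted((rnd.randrange(SPACE), node)
                  for node in NODES for _ in range(VNODES_PER_NODE))

def imbalance(ring):
    # max/min ratio of the hash space owned by each node (1.0 = perfectly even)
    owned = {n: 0 for n in NODES}
    prev = ring[-1][0] - SPACE
    for vnode_id, node in ring:
        owned[node] += vnode_id - prev
        prev = vnode_id
    return max(owned.values()) / min(owned.values())

# try 50 candidate rings and keep the seed of the most balanced one
best = min(range(50), key=lambda seed: imbalance(random_ring(seed)))
print("best seed:", best, "max/min ratio:", round(imbalance(random_ring(best)), 3))
```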
@vstax Thanks for raising this issue again. WIP; however, we are going to provide some kind of solution for 1.4.x as well, and we will also conduct some experiments to validate whether or not the degree of the current uneven distribution is expected, with various factors that can affect the distribution (file name, num_of_vnodes, number of storage nodes, replication factor).
When loading nodes with data (N=3, 6 nodes) the distribution of data is uneven between nodes:
The difference is stable; it doesn't change no matter how many objects are stored. The difference in the number of objects (or size - which is the same here) between the 5th and 6th node is 16% and always remains the same.
Taking the number of objects (or size) on the 6th node as 100% usage, the 6 nodes have 113% / 111% / 103% / 112% / 116% / 100% usage.
Nodes are exactly the same. There are no problems in distribution between different AVS directories:
Is it possible to fix this difference manually? (I understand that this will recalculate the RING and will require rebalance / compaction to be performed; that's not a problem.) I'd like to set these settings, wipe the data and load it again. Because the difference in distribution is always stable no matter how many objects are loaded, I believe that such a change will keep working in the future, no matter how many objects are on the nodes. While a 16% difference isn't that big, it's still up to ~3.6 TB of difference in data size between nodes; I'd like to make it more even, if possible - reduce it to a few % or so.
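If per-node num_of_vnodes did scale shares proportionally - which is exactly what this issue calls into question - the compensation would be simple arithmetic on the usage figures above. The sketch below assumes a default of 168 vnodes per node and uses placeholder node names; it is only meant to show the calculation, not a verified tuning procedure.

```python
# Back-of-the-envelope arithmetic only: IF the share of data scaled linearly
# with num_of_vnodes, scaling each node's vnode count by the inverse of its
# observed usage would even things out. 168 is assumed as the default
# num_of_vnodes; node names are placeholders for the six storage nodes.
DEFAULT_VNODES = 168
usage = {"node1": 113, "node2": 111, "node3": 103,
         "node4": 112, "node5": 116, "node6": 100}   # relative usage from above

mean = sum(usage.values()) / len(usage)
suggested = {node: round(DEFAULT_VNODES * mean / u) for node, u in usage.items()}
print(suggested)
# -> the fullest node (116%) ends up with ~158 vnodes, the emptiest (100%) with ~183
```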