[docs] Add shard/block size recs #1890
Conversation
docs/troubleshooting/index.md
Outdated
1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly not recommended in production.
2. Check to make sure you are running adequate block sizes based on the retention of your namespace. See above.
3. Ensure that you have 30-40% memory overhead in the normal running state. You want to ensure enough overhead to handle bursts of metrics, especially ones with new IDs as those will take more memory initially.
4. Ensure that you do not have high cardinality metrics - a unique metric is defined by a unique combination of tags and values. If some metrics have UUIDs, timestamps or durations as the tag value, then the cardinality of your metrics will be extremely high and this will generally lead to OOMs. If you have high cardinality tag values, you should consider moving values into the
i might consider rewriting #4 to be something along the lines of "don't use more than you've provisioned". cardinality is relative, so it's hard to offer blanket guidance that applies to all use cases - instead, it'd probably be better stated as a ceiling relative to capacity. thoughts?
Cool - updated it. Let me know what you think.
docs/how_to/cluster_hard_way.md
Outdated
| 24h | 1h | 4h | N/A | N/A |
| 168h | 6h | 12h | 24h | 48h |
| 720h | 24h | 24h | 48h | 96h |
| 8760h | 24h | 24h | 48h | 96h |
Are these 1-1 the same as what we have created here?
https://github.com/m3db/m3/blob/master/src/query/api/v1/handler/database/create.go#L95-L116
It would be good to make sure they are 1-1, or else we're kind of telling people different advice depending on whether they're creating themselves or using the database/create API.
| 24h | 1h | 4h | N/A | N/A |
| 168h | 6h | 12h | 24h | 48h |
| 720h | 24h | 24h | 48h | 96h |
| 8760h | 24h | 24h | 48h | 96h |
We should use a snippet instead of repeating this content.
See the following example (search for where this is used to see how to pull in from the common directory):
https://github.com/m3db/m3/blob/master/docs/common/headers_optional_read_write.md
more nodes you have, the more shards you want because you want the shards to be evenly distributed amongst your nodes. However, because each shard requires more files to be created, you also don’t want to have too many shards per node. Below are some guidelines depending on how many nodes you will have in your cluster eventually - you will need to decide the number of shards up front; you cannot change this once the cluster is created.
Maybe add why you cannot change it: each bit of data would need to be repartitioned and moved around the cluster (i.e. every bit of data would need to be moved all at once).
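To illustrate why the shard count is fixed at creation time, here is a minimal sketch of hash-based shard assignment (FNV is used only to keep the sketch dependency-free; M3DB itself hashes series IDs with murmur3, if I recall correctly). Changing the shard count remaps nearly every series, so all data would have to move at once:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor assigns a series ID to a shard by hashing it and taking the
// remainder modulo the shard count.
func shardFor(seriesID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(seriesID))
	return h.Sum32() % numShards
}

func main() {
	ids := []string{"cpu.user{host=a}", "cpu.user{host=b}", "mem.free{host=a}"}
	for _, id := range ids {
		// The same series lands on different shards at 64 vs 128 shards,
		// which is why resizing requires repartitioning everything.
		fmt.Printf("%-20s shards=64 -> %2d, shards=128 -> %3d\n",
			id, shardFor(id, 64), shardFor(id, 128))
	}
}
```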
docs/troubleshooting/index.md
Outdated
## What to do if my M3DB node is OOM’ing?

1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly discouraged in production.
nit: m3db -> M3DB
docs/troubleshooting/index.md
Outdated
1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly discouraged in production.
2. Check to make sure you are running adequate block sizes based on the retention of your namespace. See [namespace configuration](../operational_guide/namespace_configuration.md) for more information.
3. Ensure that you have 30-40% memory overhead in the normal running state. You want to ensure enough overhead to handle bursts of metrics, especially ones with new IDs as those will take more memory initially.
I would say, "use at most 50-60% memory utilization" would be a better recommendation perhaps.
docs/troubleshooting/index.md
Outdated
1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly discouraged in production.
2. Check to make sure you are running adequate block sizes based on the retention of your namespace. See [namespace configuration](../operational_guide/namespace_configuration.md) for more information.
3. Ensure that you have 30-40% memory overhead in the normal running state. You want to ensure enough overhead to handle bursts of metrics, especially ones with new IDs as those will take more memory initially.
4. High cardinality metrics can also lead to OOMs especially if you are not adequatly provisioned. If you have high cardinality metrics such as ones containing UUIDs or timestamps as tag values, you should consider eliminating or lessening these.
sp: "adequately" (missing an e)
> you should consider eliminating or lessening these
maybe "If you have many unique timeseries such as ones containing [...], you should consider mitigating their cardinality"?
| 24h | 1h |
| 168h | 2h |
| 720h | 12h |
| 8760h | 24h |
Hmm, the larger ones don't seem right? I know we wanted to match these with the database create call, but 24 hours seems small for 1 year, and we also don't take the resolution into account at all here. Can we make a TODO to update the create call to take scrape interval or aggregation resolution into account?
Also, thoughts on index block size vs TS block size? @robskillington I've seen some folks have good success using larger index block sizes than TS block sizes, esp. if there's not a huge number of new IDs. Maybe we can configure those into create and update them here too.
Yeah, I think we want to be consistent with the database create call, which uses this for resolution:

```go
idealDatapointsPerBlock           = 720
blockSizeFromExpectedSeriesScalar = idealDatapointsPerBlock * int64(time.Hour)

value := r.BlockSize.ExpectedSeriesDatapointsPerHour
blockSize = time.Duration(blockSizeFromExpectedSeriesScalar / value)
```
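To make that formula concrete, here is a small runnable sketch (the 30s scrape interval is just an example input, not a default from create.go):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const idealDatapointsPerBlock = 720
	blockSizeFromExpectedSeriesScalar := idealDatapointsPerBlock * int64(time.Hour)

	// A 30s scrape interval yields 3600/30 = 120 datapoints per series per hour.
	var expectedSeriesDatapointsPerHour int64 = 120

	blockSize := time.Duration(blockSizeFromExpectedSeriesScalar / expectedSeriesDatapointsPerHour)
	fmt.Println(blockSize) // 6h0m0s, i.e. 720 datapoints per series per block
}
```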
But to be honest, if we push users towards the database create api, no one should really need to know about this.
LGTM
Force-pushed from e539c52 to 5ab414a
What this PR does / why we need it:
Adds documentation regarding recommended config settings for namespace and placement.
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?: