[docs] Add shard/block size recs #1890

Merged 5 commits into master from braskin/add_shard_recs on Aug 21, 2019
Conversation

benraskin92 (Collaborator) commented Aug 19, 2019

What this PR does / why we need it:

Adds documentation regarding recommended config settings for namespace and placement.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:


Does this PR require updating code package or user-facing documentation?:


Resolved review threads (outdated) on docs/troubleshooting/index.md and docs/operational_guide/placement_configuration.md
1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly not recommended in production.
2. Check to make sure you are running adequate block sizes based on the retention of your namespace. See above.
3. Ensure that you have 30-40% memory overhead in the normal running state. You want to ensure enough overhead to handle bursts of metrics, especially ones with new IDs as those will take more memory initially.
4. Ensure that you do not have high cardinality metrics - a unique metric is defined by a unique combination of tags and values. If some metrics have UUIDs, timestamps or durations as the tag value, then the cardinality of your metrics will be extremely high and this will generally lead to OOMs. If you have high cardinality tag values, you should consider moving values into the
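To make the cardinality point in item 4 concrete, here is a minimal, purely illustrative sketch; the tag names and counts are invented for illustration, not taken from the docs:

	// Every unique tag combination is a distinct timeseries, so a
	// per-request value such as a UUID turns each request into its
	// own series. All numbers below are hypothetical.
	package main

	import "fmt"

	func main() {
		hosts := 10
		endpoints := 20
		requestsPerDay := 1_000_000

		// Tags drawn from small, fixed value sets keep cardinality bounded.
		fmt.Println("series with {host, endpoint} tags:", hosts*endpoints) // 200

		// A request_id (UUID) tag makes every request a brand-new series.
		fmt.Println("series with an added request_id tag:", requestsPerDay) // 1000000
	}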
Collaborator:

I might consider rewriting #4 to say something along the lines of "don't use more than you've provisioned". Cardinality is relative, so it's hard to offer blanket guidance that applies to all use cases; it'd probably be better stated as a ceiling relative to capacity. Thoughts?

benraskin92 (Author):

Cool - updated it. Let me know what you think.

| 24h | 1h | 4h | N/A | N/A |
| 168h | 6h | 12h | 24h | 48h |
| 720h | 24h | 24h | 48h | 96h |
| 8760h | 24h | 24h | 48h | 96h |
Collaborator:

Are these 1-1 the same as what we have created here?
https://github.com/m3db/m3/blob/master/src/query/api/v1/handler/database/create.go#L95-L116

It would be good to make sure they are 1-1, or else we're telling people different advice depending on whether they're creating namespaces themselves or using the database/create API.

| 24h | 1h | 4h | N/A | N/A |
| 168h | 6h | 12h | 24h | 48h |
| 720h | 24h | 24h | 48h | 96h |
| 8760h | 24h | 24h | 48h | 96h |
Collaborator:

We should use a snippet instead of repeating this content.

See the following example (search for where this is used to see how to pull in from the common directory):
https://github.com/m3db/m3/blob/master/docs/common/headers_optional_read_write.md

The more nodes you have, the more shards you want, since the shards should be distributed evenly amongst your nodes. However, because each shard requires more files to be created, you also don't want too many shards per node. Below are some guidelines depending on how many nodes you will eventually have in your cluster. Note that you must decide the number of shards up front; it cannot be changed once the cluster is created.
Collaborator:

Maybe add why you cannot change it: every bit of data would need to be repartitioned and moved around the cluster (i.e. all data would need to be moved at once).
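To illustrate the even-distribution trade-off in the quoted paragraph, a hypothetical sketch; the candidate shard counts and node counts below are illustrative only, not official M3DB recommendations:

	// More shards spread load more evenly across nodes, but each shard
	// carries per-node file overhead, so shards/node should stay modest.
	package main

	import "fmt"

	func main() {
		shardCandidates := []int{64, 128, 256, 512, 1024}
		nodeCounts := []int{3, 6, 12, 24}

		for _, nodes := range nodeCounts {
			for _, shards := range shardCandidates {
				// Fewer shards per node means fewer files per node; more
				// shards make the distribution across nodes more even.
				fmt.Printf("nodes=%2d shards=%4d -> ~%3d shards/node\n",
					nodes, shards, shards/nodes)
			}
		}
	}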


## What to do if my M3DB node is OOM’ing?

1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly discouraged in production.
Collaborator:

nit: m3db -> M3DB


1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly discouraged in production.
2. Check to make sure you are running adequate block sizes based on the retention of your namespace. See [namespace configuration](../operational_guide/namespace_configuration.md) for more information.
3. Ensure that you have 30-40% memory overhead in the normal running state. You want to ensure enough overhead to handle bursts of metrics, especially ones with new IDs as those will take more memory initially.
Collaborator:

I would say "use at most 50-60% memory utilization" might be a better recommendation.

1. Ensure that you are not co-locating coordinator, etcd or query nodes with your m3db nodes. Colocation or embedded mode is fine for a development environment, but highly discouraged in production.
2. Check to make sure you are running adequate block sizes based on the retention of your namespace. See [namespace configuration](../operational_guide/namespace_configuration.md) for more information.
3. Ensure that you have 30-40% memory overhead in the normal running state. You want to ensure enough overhead to handle bursts of metrics, especially ones with new IDs as those will take more memory initially.
4. High cardinality metrics can also lead to OOMs especially if you are not adequatly provisioned. If you have high cardinality metrics such as ones containing UUIDs or timestamps as tag values, you should consider eliminating or lessening these.
Collaborator:

sp: "adequately" (missing an e)

Collaborator:

> you should consider eliminating or lessening these

maybe "If you have many unique timeseries such as ones containing [...], you should consider mitigating their cardinality"?

| 24h | 1h |
| 168h | 2h |
| 720h | 12h |
| 8760h | 24h |
Collaborator:

Hmm, the larger ones don't seem right? I know we wanted to match these with the database create call, but 24 hours seems small for 1 year, and we also don't take the resolution into account at all here. Can we make a TODO to update create to take scrape interval or aggregation resolution into account?

Also, thoughts on index block size vs TS block size? @robskillington I've seen some folks have good success using larger index block sizes than TS block sizes, especially if there's not a huge amount of new IDs. Maybe we can configure those into create and update them here too.

benraskin92 (Author):

Yeah, I think we want to be consistent with the database create call, which uses this for resolution:

	idealDatapointsPerBlock           = 720
	blockSizeFromExpectedSeriesScalar = idealDatapointsPerBlock * int64(time.Hour)

	value := r.BlockSize.ExpectedSeriesDatapointsPerHour
	blockSize = time.Duration(blockSizeFromExpectedSeriesScalar / value)

But to be honest, if we push users towards the database create API, no one should really need to know about this.
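For concreteness, a runnable restatement of that logic; this is a sketch, where only the 720 datapoints-per-block constant and the division come from the quoted create.go snippet:

	package main

	import (
		"fmt"
		"time"
	)

	// Taken from the snippet above: aim for ~720 datapoints per block.
	const idealDatapointsPerBlock = 720

	var blockSizeFromExpectedSeriesScalar = idealDatapointsPerBlock * int64(time.Hour)

	// suggestedBlockSize mirrors the quoted computation: block size is
	// 720 "datapoint-hours" divided by datapoints per series per hour.
	func suggestedBlockSize(expectedSeriesDatapointsPerHour int64) time.Duration {
		return time.Duration(blockSizeFromExpectedSeriesScalar / expectedSeriesDatapointsPerHour)
	}

	func main() {
		// A 10s scrape interval -> 360 datapoints/hour -> 720h/360 = 2h blocks.
		fmt.Println(suggestedBlockSize(360)) // 2h0m0s
		// A 60s scrape interval -> 60 datapoints/hour -> 12h blocks.
		fmt.Println(suggestedBlockSize(60)) // 12h0m0s
	}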

martin-mao (Collaborator) left a comment:

LGTM

benraskin92 force-pushed the braskin/add_shard_recs branch from e539c52 to 5ab414a on August 21, 2019 at 20:03
benraskin92 merged commit f9361d6 into master on Aug 21, 2019
benraskin92 deleted the braskin/add_shard_recs branch on August 21, 2019 at 20:27