v2: Investigate which limits/configs are useful for intake v2 #1299

Closed
roncohen opened this issue Aug 20, 2018 · 13 comments

@roncohen
Contributor

We should investigate the behavior and applicability of configuration settings like MaxUnzippedSize, Read/WriteTimeout and RateLimit for v2 and see if there are new limits that make sense to apply instead.

@roncohen
Contributor Author

roncohen commented Aug 21, 2018

Due to a technical limitation in the Go http.Server, read_timeout can only be set server-wide. That means it will be effective for both v1 and v2 endpoints at the same time. Agents have decided to default to a request time of 10s, which is well below our default read_timeout of 30s, so we're in the clear regarding read_timeout. When we drop the v1 endpoints, we can discuss increasing the default read_timeout in the server.
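
To illustrate the limitation: Go's http.Server only exposes ReadTimeout at the server level, so a single value necessarily covers every registered handler. A minimal sketch (the endpoint paths are placeholders):

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/transactions", func(w http.ResponseWriter, r *http.Request) {})  // v1-style endpoint
	mux.HandleFunc("/intake/v2/events", func(w http.ResponseWriter, r *http.Request) {}) // v2-style endpoint

	srv := &http.Server{
		Addr:    ":8200",
		Handler: mux,
		// ReadTimeout is one server-wide value; it cannot be set differently
		// for the v1 and v2 handlers registered above.
		ReadTimeout: 30 * time.Second,
	}
	_ = srv.ListenAndServe()
}
```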

@graphaelli proposed having max_unzipped_size per line for v2. I like the idea of a per line limit, but unfortunately we can't repurpose max_unzipped_size to mean that for v2, because people might be using v1 and v2 at the same time.

We could introduce a new setting ndjson_max_event_size or similar, to protect against clients sending very large events. This could be especially important in the RUM scenario, where anyone can access the API.
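
As a rough illustration of such a limit (not the actual server code), a per-event cap on an ndjson stream could be enforced while scanning lines, e.g. by capping the scanner's maximum token size:

```go
import (
	"bufio"
	"fmt"
	"io"
)

// scanEvents reads newline-delimited events and rejects any single event
// larger than maxEventSize. Illustrative sketch of the proposed
// ndjson_max_event_size limit, not an actual APM Server implementation.
func scanEvents(r io.Reader, maxEventSize int, handle func(line []byte) error) error {
	scanner := bufio.NewScanner(r)
	scanner.Buffer(make([]byte, 0, 64*1024), maxEventSize) // cap the size of a single line
	for scanner.Scan() {
		if err := handle(scanner.Bytes()); err != nil {
			return err
		}
	}
	if scanner.Err() == bufio.ErrTooLong {
		return fmt.Errorf("event exceeds the configured limit of %d bytes", maxEventSize)
	}
	return scanner.Err()
}
```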

concurrent_requests was something we introduced primarily to put a bound on the memory usage of the server. With v1, each request body is read into memory before being processed, so it made sense to cap the number of requests we do that for. For v2, we only read up to batchSize events per request into memory at any one time before they are handed over to the libbeat queue, which makes the problem much less serious. Also, with v2, some agents will keep an http request open all the time. If we blindly applied the same concurrent_requests limit to v2, it would limit the number of agents that APM Server could service.

As mentioned, with v2, we will read up to batchSize events from each http request. I suggest we introduce concurrent_requests_per_ip to limit the number of concurrent requests from a single IP. We could set it to two, for example.
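
A rough sketch of what a concurrent_requests_per_ip guard could look like as a hypothetical middleware (the names and the 429 response are illustrative, not the actual server code):

```go
import (
	"net"
	"net/http"
	"sync"
)

// perIPLimit rejects a request when the client IP already has `limit`
// requests in flight. Hypothetical sketch, not APM Server code.
func perIPLimit(limit int, next http.Handler) http.Handler {
	var mu sync.Mutex
	inFlight := map[string]int{}

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}

		mu.Lock()
		if inFlight[ip] >= limit {
			mu.Unlock()
			http.Error(w, "too many concurrent requests", http.StatusTooManyRequests)
			return
		}
		inFlight[ip]++
		mu.Unlock()

		defer func() {
			mu.Lock()
			inFlight[ip]--
			mu.Unlock()
		}()
		next.ServeHTTP(w, r)
	})
}
```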

rate_limit today specifies the number of http requests per second per IP for RUM. That doesn't seem right for v2. What we would probably want is a rate limit on the number of events sent per IP + concurrent_requests_per_ip + ndjson_max_event_size mentioned above.

to summarize:

  • v1 with current defaults stays the same

  • we introduce these limits for v2:

Backend:

  • ndjson_max_event_size
  • concurrent_requests_per_ip

RUM:

  • ndjson_max_event_size
  • concurrent_requests_per_ip
  • event_rate_limit_per_ip

event_rate_limit_per_ip could perhaps be useful on the backend, with a different limit than for RUM.

@simitt
Contributor

simitt commented Aug 22, 2018

Thanks for the great writeup @roncohen !

I agree that we cannot change default values for v1. But I also think we should not change the semantics of config values between v1 and v2 (e.g. read_timeout). Thus, I like the idea of introducing new config options wherever necessary and suggest deprecating the v1 config options in apm-server.yml.

read_timeout: I was struggling with using the http.ReadTimeout for some time, see official description. How about we investigate the behaviour of http.ReadHeaderTimeout for v2?
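
For reference, both timeouts are fields on http.Server; a minimal sketch of the difference (the values are illustrative):

```go
import (
	"net/http"
	"time"
)

func newServer(handler http.Handler) *http.Server {
	return &http.Server{
		Addr:    ":8200",
		Handler: handler,
		// ReadTimeout bounds reading the entire request, body included, so it
		// also bounds how long a long-running v2 stream can stay open.
		ReadTimeout: 30 * time.Second,
		// ReadHeaderTimeout only bounds reading the request headers; once they
		// are read, the handler decides how long it keeps reading the body.
		ReadHeaderTimeout: 5 * time.Second,
	}
}
```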

+1 on ndjson_max_event_size

concurrent_requests: I see limiting per IP as problematic for all setups where multiple backend services are monitored and share an IP address pool. Some time ago we were discussing adaptive request allowance based on a max_memory_usage setting compared against real-time memory usage. That'd probably be even more tricky with v2, as we don't know in advance how many events will be sent over an accepted http connection, but it could make sense in combination with ndjson_max_event_size.

rate_limit: This one is tricky. Rate limiting the number of requests sent would mean the server might interrupt a http request after processing events for some time. This seems like unusual behaviour for a rate limiter.

@graphaelli
Member

Since the goal of max_unzipped_size is to protect apm-server from reading too much of a payload into memory, I don't follow why it shouldn't be applied directly to each event read in v2. I think it would be reasonable to have those be the same value in v1 and v2. That said, I'm neutral on ndjson_max_event_size, except I'm not a fan of the ndjson_ prefix, as it will be confusing if we decide to support other encodings later.

Rate limiting the number of requests sent would mean the server might interrupt a http request after processing events for some time

I think this is ok and a natural complement to read_timeout, where we stop listening mid-request to clients that are sending data too slowly.

@roncohen
Contributor Author

Since the goal of max_unzipped_size is to protect apm-server from reading too much of a payload into memory, I don't follow why it shouldn't be applied directly to each event read in v2. I think it would be reasonable to have those be the same value in v1 and v2.

Maybe I overlooked something, but if a user sets:

apm-server.max_unzipped_size: 30mb

it would mean the max size per event is 30mb when data comes in through v2, while the full payload is limited to 30mb if the data comes in through v1. If we change the default max_unzipped_size to something reasonable for an event size, say 100kb, it would suddenly limit payloads sent to v1 to 100kb.

+1 on calling it max_event_size

@graphaelli
Member

Clarified IRL, recapping here: I intended to propose removing the max http payload size limit for v2 (that is, the limit on the size of the entire stream of data), since v2 is most efficient when using a long-running persistent connection, and instead applying that limit per event (currently per ndjson line). I agree it would be clearer to introduce a new setting, max_event_size, for that.

@roncohen
Contributor Author

roncohen commented Aug 22, 2018

summary of an offline conversation we just had:

These are the types of limits we'll apply for v2:

Backend:

  • max_event_size

RUM:

  • max_event_size
  • concurrent_requests_per_ip
  • event_rate_limit_per_ip - implemented as a throttle (e.g. the read speed from the http request is capped), or events are dropped if the limit is surpassed for a sustained amount of time (i.e. bursts are allowed); a sketch follows below.

I should add that the ReadTimeout already in place stays where it is. It's effective for both v1 and v2 and defaults to 30s
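
A minimal sketch of the throttle variant, assuming a token bucket such as golang.org/x/time/rate (the limiter, chunk size and error handling are illustrative, not the final implementation): the reader waits for tokens before decoding the next chunk of events, so short bursts pass through via the burst size while sustained overload slows the read down instead of failing the request outright.

```go
import (
	"context"

	"golang.org/x/time/rate"
)

// throttledRead sketches capping per-IP read speed: before decoding the next
// chunk of events from the stream, wait until the limiter grants that many
// tokens. Bursts up to the limiter's burst size pass through without waiting.
func throttledRead(ctx context.Context, lim *rate.Limiter, readChunk func(max int) (int, error)) error {
	const chunk = 10 // illustrative number of events reserved per limiter call
	for {
		if err := lim.WaitN(ctx, chunk); err != nil {
			return err // context cancelled or deadline exceeded
		}
		n, err := readChunk(chunk)
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // stream drained
		}
		// decoded events would be handed to the processing pipeline here
	}
}
```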

@roncohen
Contributor Author

roncohen commented Sep 3, 2018

after working on #1352 and having thought about this some more:

One of the things we discussed above was protecting the APM Server by making it difficult to drive it out of memory with specially crafted payloads, for example very big events. At the moment, we'll read batchSize events into memory for each request.

We decided to limit the size of each event and to limit the number of concurrent requests per IP. So the memory that any IP address can cause the APM Server to allocate is bounded by max_event_size * batchSize * concurrent_requests_per_ip.

Coming up with a max_event_size isn't that simple, and we end up making it very large, for example 100kb, to ensure that legitimate usage never hits the limit. If you then multiply it by the batchSize: 10 * 100kb = 1mb. This means every client can now use 1mb per request, which seems unnecessarily high.

If we're aiming to set a memory bound per IP, we could also consider simply doing that. If we had memory_limit_per_ip set at, say, 200kb, we would read events from the stream until we hit the limit and submit the full events we've read so far, before continuing to read from the stream. In a sense, this would be like setting a batchSize that dynamically changes according to the size of each event we read, and the current arbitrary constant batchSize goes away. With concurrent requests, things get a little more complicated, but it should be doable.
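
A sketch of that dynamic-batch idea under a hypothetical memory_limit_per_ip (the setting name, the flush callback and the numbers are illustrative): keep reading ndjson lines and accumulating their sizes until the limit would be exceeded, flush what has been read so far, then continue from the same stream.

```go
import (
	"bufio"
	"io"
)

// readSizeBoundedBatches reads newline-delimited events from r and flushes a
// batch whenever the accumulated bytes would exceed memLimit, instead of
// using a fixed batchSize. Hypothetical sketch, not APM Server code.
func readSizeBoundedBatches(r io.Reader, memLimit int, flush func(batch [][]byte) error) error {
	scanner := bufio.NewScanner(r)
	var batch [][]byte
	var size int
	for scanner.Scan() {
		line := append([]byte(nil), scanner.Bytes()...) // copy: the Scanner reuses its buffer
		if size+len(line) > memLimit && len(batch) > 0 {
			if err := flush(batch); err != nil {
				return err
			}
			batch, size = nil, 0
		}
		batch = append(batch, line)
		size += len(line)
	}
	if len(batch) > 0 {
		if err := flush(batch); err != nil {
			return err
		}
	}
	return scanner.Err()
}
```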

@simitt
Contributor

simitt commented Sep 4, 2018

The idea of making batchSize dynamic is very appealing, especially as none of our customers will be aware of the currently hardcoded batchSize of 10, and thus would not be aware of the concrete memory implications of setting max_event_size.

Referring to your earlier comment where you summarize our offline discussions, I assume that with memory_limit_per_ip you are referring to limits for RUM.
For consistency between backend and RUM I suggest coming up with the same settings for both and only adding an additional setting for RUM, e.g.:

  • max_batch_size that does what you suggest: read data up to this size and process it, before resetting the counter and reading the next batch from the same stream.
  • use something like max_connections for limiting how many stream requests are concurrently accepted (maybe we can reuse the existing max_connections setting?)
  • RUM only: IP rate limiting for how many stream requests per second per IP can be sent, so basically max_connections_per_ip_per_sec. This would be roughly the same logic as we use for v1, but we'll need to introduce a new setting, as the value will need to be lower than it is now.

@roncohen
Contributor Author

roncohen commented Sep 4, 2018

@simitt we discussed offline how to progress here. It seems to me that an event_rate_limit_per_ip for RUM makes sense regardless of how we end up doing batching, so I think we should continue working on that. We can make it an option for backend as well, but have it be disabled by default. I could see it being useful for backend in situations where there's a team that offers APM Server as a service to the rest of the organization and they don't want a single external team to overload the system.

simitt added a commit to simitt/apm-server that referenced this issue Sep 7, 2018
* Enabled for RUM endpoints with default value 5000.
* Deny request if rate limit is hit when http request is established.
* Throttle read if http request is successfully established, but rate
  limit is hit while reading the request.
* Keep rate limiting for v1 unchanged.

part of elastic#1299
simitt added a commit to simitt/apm-server that referenced this issue Sep 13, 2018
* Enabled for RUM endpoints with default value 5000.
* Throttle read if rate limit is hit while reading the request.
* Check rate limit in batches of 10.
* Keep rate limiting for v1 unchanged.

part of elastic#1299
@simitt simitt self-assigned this Sep 19, 2018
@roncohen
Contributor Author

opening up this discussion again:
max_event_size has been merged and there's work ongoing for an event_rate_limit_per_ip which also rejects new connections for an IP address if the rate limit has been reached for that IP.

I think we should be fine to start with those. We can re-open this if it turns out more is needed at a later stage. WDYT @simitt @graphaelli?

@graphaelli
Member

Agreed that those are sufficient to start with and we should revisit after more testing and real world usage.

@simitt
Contributor

simitt commented Sep 24, 2018

SGTM

simitt added a commit that referenced this issue Sep 25, 2018
* Enabled for RUM endpoints with default value 300, burst multiplier 3
* Throttle read if rate limit is hit while reading the request, deny request if limit is already hit when the request is started.
* Check rate limit in batches of 10.
* Use eviction handling for LRU cache holding the rate limiters. 
* Keep rate limiting for v1 unchanged.

partly implements #1299
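
The bullets above roughly correspond to keeping one token-bucket limiter per client IP in a bounded LRU cache. A sketch of that shape, assuming golang.org/x/time/rate and hashicorp/golang-lru (the cache size is illustrative; this is not the exact implementation):

```go
import (
	lru "github.com/hashicorp/golang-lru"
	"golang.org/x/time/rate"
)

const (
	eventLimitPerSec = 300  // default rate from the change above
	burstMultiplier  = 3    // burst = 3 * rate
	cacheSize        = 1000 // illustrative bound on the number of tracked IPs
)

func newLimiterCache() (*lru.Cache, error) {
	// Bounded cache: when it is full, the least recently seen IP is evicted
	// and simply gets a fresh limiter the next time it shows up.
	return lru.New(cacheSize)
}

// limiterFor returns the rate limiter for an IP, creating it on first use.
func limiterFor(cache *lru.Cache, ip string) *rate.Limiter {
	if l, ok := cache.Get(ip); ok {
		return l.(*rate.Limiter)
	}
	l := rate.NewLimiter(rate.Limit(eventLimitPerSec), eventLimitPerSec*burstMultiplier)
	cache.Add(ip, l)
	return l
}
```

While decoding a request, the limit would then be checked in chunks, e.g. limiterFor(cache, ip).AllowN(time.Now(), 10), throttling the read or denying the request when that returns false.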
@roncohen
Contributor Author

Closing now that #1352 and #1367 are in.

@alvarolobato alvarolobato added this to the 6.5 milestone Oct 26, 2018
@jalvz jalvz removed the [zube]: Done label Nov 5, 2018