v2: Investigate which limits/configs are useful for intake v2 #1299

Closed
roncohen opened this issue Aug 20, 2018 · 13 comments

@roncohen
Contributor

We should investigate the behavior and applicability of configuration settings like MaxUnzippedSize, Read/WriteTimeout and RateLimit for v2 and see if there are new limits that make sense to apply instead.

@roncohen
Contributor Author

roncohen commented Aug 21, 2018

Due to a technical limitation in the Go http.Server, read_timeout can only be set server-wide. That means it will be effective for both v1 and v2 endpoints at the same time. Agents have decided to default to a request time of 10s, which is well below our default read_timeout of 30s, so we're in the clear regarding read_timeout. When we drop the v1 endpoints, we can discuss increasing the default read_timeout in the server.
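
To illustrate the limitation: Go's http.Server only exposes ReadTimeout at the server level, so a single value necessarily covers every registered handler. A minimal sketch (the endpoint paths are placeholders):

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/transactions", func(w http.ResponseWriter, r *http.Request) {})  // v1-style endpoint
	mux.HandleFunc("/intake/v2/events", func(w http.ResponseWriter, r *http.Request) {}) // v2-style endpoint

	srv := &http.Server{
		Addr:    ":8200",
		Handler: mux,
		// ReadTimeout is one server-wide value; it cannot be set differently
		// for the v1 and v2 handlers registered above.
		ReadTimeout: 30 * time.Second,
	}
	_ = srv.ListenAndServe()
}
```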

@graphaelli proposed having max_unzipped_size per line for v2. I like the idea of a per line limit, but unfortunately we can't repurpose max_unzipped_size to mean that for v2, because people might be using v1 and v2 at the same time.

We could introduce a new setting ndjson_max_event_size or similar, to protect against clients sending very large events. This could be especially important in the RUM scenario, where anyone can access the API.
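
As a rough illustration of such a limit (not the actual server code), a per-event cap on an ndjson stream could be enforced while scanning lines, e.g. by capping the scanner's maximum token size:

```go
import (
	"bufio"
	"fmt"
	"io"
)

// scanEvents reads newline-delimited events and rejects any single event
// larger than maxEventSize. Illustrative sketch of the proposed
// ndjson_max_event_size limit, not an actual APM Server implementation.
func scanEvents(r io.Reader, maxEventSize int, handle func(line []byte) error) error {
	scanner := bufio.NewScanner(r)
	scanner.Buffer(make([]byte, 0, 64*1024), maxEventSize) // cap the size of a single line
	for scanner.Scan() {
		if err := handle(scanner.Bytes()); err != nil {
			return err
		}
	}
	if scanner.Err() == bufio.ErrTooLong {
		return fmt.Errorf("event exceeds the configured limit of %d bytes", maxEventSize)
	}
	return scanner.Err()
}
```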

concurrent_requests was something we introduced primarily to put a bound on the memory usage of the server. With v1, each request body is read into memory before being processed, so it made sense to cap the number of requests we do that for. For v2, we only read up to batchSize events per request into memory at any one time before they are handed over to the libbeat queue, which makes the problem much less serious. Also, with v2, some agents will keep an http request open all the time. If we blindly applied the same concurrent_requests limit to v2, it would limit the number of agents that APM Server could service.

As mentioned, with v2, we will read up to batchSize events from each http request. I suggest we introduce concurrent_requests_per_ip to limit the number of concurrent requests from a single IP. We could set it to two, for example.
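
A rough sketch of what a concurrent_requests_per_ip guard could look like as a hypothetical middleware (the names and the 429 response are illustrative, not the actual server code):

```go
import (
	"net"
	"net/http"
	"sync"
)

// perIPLimit rejects a request when the client IP already has `limit`
// requests in flight. Hypothetical sketch, not APM Server code.
func perIPLimit(limit int, next http.Handler) http.Handler {
	var mu sync.Mutex
	inFlight := map[string]int{}

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}

		mu.Lock()
		if inFlight[ip] >= limit {
			mu.Unlock()
			http.Error(w, "too many concurrent requests", http.StatusTooManyRequests)
			return
		}
		inFlight[ip]++
		mu.Unlock()

		defer func() {
			mu.Lock()
			inFlight[ip]--
			mu.Unlock()
		}()
		next.ServeHTTP(w, r)
	})
}
```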

rate_limit today specifies the number of http requests per second per IP for RUM. That doesn't seem right for v2. What we would probably want is a rate limit on the number of events sent per IP + concurrent_requests_per_ip + ndjson_max_event_size mentioned above.

to summarize:

  • v1 with current defaults stays the same

  • we introduce these limits for v2:

Backend:

  • ndjson_max_event_size
  • concurrent_requests_per_ip

RUM:

  • ndjson_max_event_size
  • concurrent_requests_per_ip
  • event_rate_limit_per_ip

event_rate_limit_per_ip could perhaps be useful on the backend, with a different limit than for RUM.

@simitt
Contributor

simitt commented Aug 22, 2018

Thanks for the great writeup @roncohen !

I agree that we cannot change default values for v1. But I also think we should not change the semantics of config values between v1 and v2 (e.g. read_timeout). Thus, I like the idea of introducing new config options wherever necessary and suggest deprecating the v1 config options in apm-server.yml.

read_timeout: I was struggling with using the http.ReadTimeout for some time, see official description. How about we investigate the behaviour of http.ReadHeaderTimeout for v2?
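
For reference, both timeouts are fields on http.Server; a minimal sketch of the difference (the values are illustrative):

```go
import (
	"net/http"
	"time"
)

func newServer(handler http.Handler) *http.Server {
	return &http.Server{
		Addr:    ":8200",
		Handler: handler,
		// ReadTimeout bounds reading the entire request, body included, so it
		// also bounds how long a long-running v2 stream can stay open.
		ReadTimeout: 30 * time.Second,
		// ReadHeaderTimeout only bounds reading the request headers; once they
		// are read, the handler decides how long it keeps reading the body.
		ReadHeaderTimeout: 5 * time.Second,
	}
}
```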

+1 on ndjson_max_event_size

concurrent_requests: I see limiting per IP as problematic for all setups where multiple backend services are monitored and share an IP address pool. Some time ago we were discussing adaptive request allowance based on a max_memory_usage setting compared against real-time memory usage. That'd probably be even more tricky with v2, as we don't know in advance how many events will be sent over an accepted http connection, but it could make sense in combination with ndjson_max_event_size.

rate_limit: This one is tricky. Rate limiting the number of requests sent would mean the server might interrupt a http request after processing events for some time. This seems like unusual behaviour for a rate limiter.

@graphaelli
Member

Since the goal of max_unzipped_size is to protect apm-server from reading too much of a payload into memory, I don't follow why it shouldn't be applied directly to each event read in v2. I think it would be reasonable to have those be the same value in v1 and v2. That said, I'm neutral on ndjson_max_event_size, except I'm not a fan of the ndjson_ prefix, as it will be confusing if we decide to support other encodings later.

Rate limiting the number of requests sent would mean the server might interrupt a http request after processing events for some time

I think this is ok and a natural complement to read_timeout, where we stop listening mid-request to clients that are sending data too slowly.

@roncohen
Contributor Author

Since the goal of max_unzipped_size is to protect apm-server from reading too much of a payload into memory, I don't follow why it shouldn't be applied directly to each event read in v2. I think it would be reasonable to have those be the same value in v1 and v2.

Maybe I overlooked something, but if a user sets:

apm-server.max_unzipped_size: 30mb

it would mean the max size per event is 30mb when data comes in through v2, while the full payload is limited to 30mb if the data comes in through v1. If we change the default max_unzipped_size to something reasonable for an event size, say 100kb, it would suddenly limit payloads sent to v1 to 100kb.

+1 on calling it max_event_size

@graphaelli
Member

Clarified IRL, recapping here: I intended to propose removing the max http payload size limit for v2 (that is, the limit on the size of the entire stream of data), since v2 is most efficient when using a long-running persistent connection, and instead applying that limit per event (currently per ndjson line). I agree it would be clearer to introduce a new setting, max_event_size, for that.

@roncohen
Contributor Author

roncohen commented Aug 22, 2018

summary of an offline conversation we just had:

These are the types of limits we'll apply for v2:

Backend:

  • max_event_size

RUM:

  • max_event_size
  • concurrent_requests_per_ip
  • event_rate_limit_per_ip - implemented as a throttle (e.g. the read speed from the http request is capped), or events are dropped if the limit is surpassed for a sustained amount of time (i.e. bursts are allowed); a sketch follows below.

I should add that the ReadTimeout already in place stays where it is. It's effective for both v1 and v2 and defaults to 30s
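
A minimal sketch of the throttle variant, assuming a token bucket such as golang.org/x/time/rate (the limiter, chunk size and error handling are illustrative, not the final implementation): the reader waits for tokens before decoding the next chunk of events, so short bursts pass through via the burst size while sustained overload slows the read down instead of failing the request outright.

```go
import (
	"context"

	"golang.org/x/time/rate"
)

// throttledRead sketches capping per-IP read speed: before decoding the next
// chunk of events from the stream, wait until the limiter grants that many
// tokens. Bursts up to the limiter's burst size pass through without waiting.
func throttledRead(ctx context.Context, lim *rate.Limiter, readChunk func(max int) (int, error)) error {
	const chunk = 10 // illustrative number of events reserved per limiter call
	for {
		if err := lim.WaitN(ctx, chunk); err != nil {
			return err // context cancelled or deadline exceeded
		}
		n, err := readChunk(chunk)
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // stream drained
		}
		// decoded events would be handed to the processing pipeline here
	}
}
```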

@roncohen
Contributor Author

roncohen commented Sep 3, 2018

after working on #1352 and having thought about this some more:

One of the things we discussed above was protecting the APM Server by making it difficult to drive it out of memory with specially crafted payloads, for example very big events. At the moment, we'll read batchSize events into memory for each request.

We decided to limit the size of each event and to limit the number of concurrent requests per IP. So the memory that any IP address can cause the APM Server to allocate is bounded by max_event_size * batchSize * concurrent_requests_per_ip.

Coming up with a max_event_size isn't that simple, and we end up making it very large, for example 100kb, to ensure that legitimate usage never hits the limit. If you then multiply it by the batchSize: 10 * 100kb = 1mb. This means every client can now use 1mb per request, which seems unnecessarily high.

If we're aiming to set a memory bound per IP, we could also consider simply doing that. If we had memory_limit_per_ip set at, say, 200kb, we would read events from the stream until we hit the limit and submit the full events we've read so far, before continuing to read from the stream. In a sense, this would be like setting a batchSize that dynamically changes according to the size of each event we read, and the current arbitrary constant batchSize goes away. With concurrent requests, things get a little more complicated, but it should be doable.
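
A sketch of that dynamic-batch idea under a hypothetical memory_limit_per_ip (the setting name, the flush callback and the numbers are illustrative): keep reading ndjson lines and accumulating their sizes until the limit would be exceeded, flush what has been read so far, then continue from the same stream.

```go
import (
	"bufio"
	"io"
)

// readSizeBoundedBatches reads newline-delimited events from r and flushes a
// batch whenever the accumulated bytes would exceed memLimit, instead of
// using a fixed batchSize. Hypothetical sketch, not APM Server code.
func readSizeBoundedBatches(r io.Reader, memLimit int, flush func(batch [][]byte) error) error {
	scanner := bufio.NewScanner(r)
	var batch [][]byte
	var size int
	for scanner.Scan() {
		line := append([]byte(nil), scanner.Bytes()...) // copy: the Scanner reuses its buffer
		if size+len(line) > memLimit && len(batch) > 0 {
			if err := flush(batch); err != nil {
				return err
			}
			batch, size = nil, 0
		}
		batch = append(batch, line)
		size += len(line)
	}
	if len(batch) > 0 {
		if err := flush(batch); err != nil {
			return err
		}
	}
	return scanner.Err()
}
```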

@simitt
Contributor

simitt commented Sep 4, 2018

The idea of making batchSize dynamic is very appealing, especially as none of our customers will be aware of the currently hardcoded batchSize of 10, and thus would not be aware of the concrete memory implications of setting max_event_size.

Referring to your earlier comment where you summarize our offline discussions, I assume that with memory_limit_per_ip you are referring to limits for RUM.
For consistency between backend and RUM I suggest coming up with the same settings for both and only adding an additional setting for RUM, e.g.:

  • max_batch_size that does what you suggest: read data up to this size and process it, before resetting the counter and reading the next batch from the same stream.
  • use something like max_connections for limiting how many stream requests are concurrently accepted (maybe we can reuse the existing max_connections setting?)
  • RUM only: IP rate limiting for how many stream requests per second per IP can be sent, so basically max_connections_per_ip_per_sec. This would be roughly the same logic as we use for v1, but we'll need to introduce a new setting, as the value will need to be lower than it is now.

@roncohen
Contributor Author

roncohen commented Sep 4, 2018

@simitt we discussed offline how to progress here. It seems to me that an event_rate_limit_per_ip for RUM makes sense regardless of how we end up doing batching, so I think we should continue working on that. We can make it an option for backend as well, but have it be disabled by default. I could see it being useful for backend in situations where there's a team that offers APM Server as a service to the rest of the organization and they don't want a single external team to overload the system.

simitt added a commit to simitt/apm-server that referenced this issue Sep 7, 2018
* Enabled for RUM endpoints with default value 5000.
* Deny request if rate limit is hit when http request is established.
* Throttle read if http request is successfully established, but rate
  limit is hit while reading the request.
* Keep rate limiting for v1 unchanged.

part of elastic#1299
simitt added a commit to simitt/apm-server that referenced this issue Sep 13, 2018
* Enabled for RUM endpoints with default value 5000.
* Throttle read if rate limit is hit while reading the request.
* Check rate limit in batches of 10.
* Keep rate limiting for v1 unchanged.

part of elastic#1299
@simitt simitt self-assigned this Sep 19, 2018
@roncohen
Contributor Author

opening up this discussion again:
max_event_size has been merged and there's work ongoing for an event_rate_limit_per_ip which also rejects new connections for an IP address if the rate limit has been reached for that IP.

I think we should be fine to start with those. We can re-open this if it turns out more is needed at a later stage. WDYT @simitt @graphaelli?

@graphaelli
Member

Agreed that those are sufficient to start with and we should revisit after more testing and real world usage.

@simitt
Contributor

simitt commented Sep 24, 2018

SGTM

simitt added a commit that referenced this issue Sep 25, 2018
* Enabled for RUM endpoints with default value 300, burst multiplier 3
* Throttle read if rate limit is hit while reading the request, deny request if limit is already hit when the request is started.
* Check rate limit in batches of 10.
* Use eviction handling for LRU cache holding the rate limiters. 
* Keep rate limiting for v1 unchanged.

partly implements #1299
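
The bullets above roughly correspond to keeping one token-bucket limiter per client IP in a bounded LRU cache. A sketch of that shape, assuming golang.org/x/time/rate and hashicorp/golang-lru (the cache size is illustrative; this is not the exact implementation):

```go
import (
	lru "github.com/hashicorp/golang-lru"
	"golang.org/x/time/rate"
)

const (
	eventLimitPerSec = 300  // default rate from the change above
	burstMultiplier  = 3    // burst = 3 * rate
	cacheSize        = 1000 // illustrative bound on the number of tracked IPs
)

func newLimiterCache() (*lru.Cache, error) {
	// Bounded cache: when it is full, the least recently seen IP is evicted
	// and simply gets a fresh limiter the next time it shows up.
	return lru.New(cacheSize)
}

// limiterFor returns the rate limiter for an IP, creating it on first use.
func limiterFor(cache *lru.Cache, ip string) *rate.Limiter {
	if l, ok := cache.Get(ip); ok {
		return l.(*rate.Limiter)
	}
	l := rate.NewLimiter(rate.Limit(eventLimitPerSec), eventLimitPerSec*burstMultiplier)
	cache.Add(ip, l)
	return l
}
```

While decoding a request, the limit would then be checked in chunks, e.g. limiterFor(cache, ip).AllowN(time.Now(), 10), throttling the read or denying the request when that returns false.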
@roncohen
Contributor Author

Closing now that #1352 and #1367 are in.

@alvarolobato alvarolobato added this to the 6.5 milestone Oct 26, 2018
@jalvz jalvz removed the [zube]: Done label Nov 5, 2018