I was looking through the `WriteBatchRaw` path and I realized that we don't check for `db.Overloaded` there. So what happens when the commitlog queue is full (and all of our most important pools are empty, like `writeBatchedPoolReqBool` and `writeBatchPool`, which are both extremely expensive to allocate) is that we allocate both of them anyway, write all of the data into our in-memory data structures, and THEN try to write to the commitlog, get an error, and reject the write. If we just did a check for `db.IsOverloaded` at the beginning of the method, we could make the cost of rejecting a write while we're under heavy load go down substantially, which might allow the node to recover instead of getting stuck in a cycle with a perpetually full commitlog queue and all the goroutines / pool allocations eventually OOMing the node.

The only issue with this approach is that we would end up rejecting some writes that we previously would have accepted in scenarios where we're under a lot of load, but in practice I bet this would be a big net win for reliability even if we need to tune our queue size for production or something (especially with all the M3msg buffering and retries we have at our disposal). If we wanted to be aggressive, we could probably even inject a preflight check into the thrift server to prevent it from even allocating the thrift structs if we know we're going to reject a request.
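A minimal sketch of what hoisting that check could look like (the `database` interface, `write` struct, and `handleWriteBatchRaw` name here are illustrative stand-ins, not the actual M3DB types):

```go
package writepath

import "errors"

// errOverloaded is what a hypothetical handler would return to the client
// instead of paying for pool allocations and a doomed commitlog enqueue.
var errOverloaded = errors.New("database is overloaded, rejecting write batch")

// write is a stand-in for a single datapoint in the batch.
type write struct {
	id    string
	value float64
}

// database abstracts the pieces of the write path this sketch cares about.
type database interface {
	IsOverloaded() bool
	// WriteBatch applies the writes to memory and enqueues them to the commitlog.
	WriteBatch(writes []write) error
}

// handleWriteBatchRaw is a hypothetical version of the WriteBatchRaw handler
// with the overload check hoisted to the very top of the method, before any
// pooled request/batch structures would be allocated.
func handleWriteBatchRaw(db database, writes []write) error {
	if db.IsOverloaded() {
		// Fail fast: the commitlog queue is full, so this write would be
		// rejected anyway after we had already paid for the allocations.
		return errOverloaded
	}
	return db.WriteBatch(writes)
}
```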
Nodes are still getting OOM'd / annihilated by expensive index queries. I think we should push the context further down into the query code so that in between expensive operations, like querying a block or querying a segment, we can check whether we should even continue. Right now what happens is that an expensive query comes in and eventually times out, but M3DB keeps chugging along allocating and querying even though the user has already received the timeout error.
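A rough sketch of that idea, assuming hypothetical `block` / `segment` interfaces rather than the real index types: the request context gets threaded down and checked between each per-block and per-segment step, so a query whose caller has already timed out stops doing work instead of running to completion.

```go
package querypath

import "context"

// segment is a stand-in for an index segment that can be queried.
type segment interface {
	Query(query string) ([]string, error)
}

// block is a stand-in for an index block made up of segments.
type block interface {
	Segments() []segment
}

// queryBlocks checks ctx between every expensive step so a cancelled or
// timed-out query bails out early instead of continuing to allocate.
func queryBlocks(ctx context.Context, blocks []block, query string) ([]string, error) {
	var results []string
	for _, b := range blocks {
		// Bail out before starting the next block if the deadline passed
		// or the caller went away.
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		for _, seg := range b.Segments() {
			if err := ctx.Err(); err != nil {
				return nil, err
			}
			res, err := seg.Query(query)
			if err != nil {
				return nil, err
			}
			results = append(results, res...)
		}
	}
	return results, nil
}
```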
@Haijuncao I'd love your thoughts on this since I know you did some similar work to improve Schemaless's ability to load shed
richardartoul changed the title from "Improve M3DBs ability to apply backpressure and prevent catastrophic failure when under too much load" to "Improve M3DBs ability to loadshed / apply backpressure and prevent catastrophic failure when under too much load" on Mar 17, 2019