
Improve M3DB's ability to load shed / apply backpressure and prevent catastrophic failure when under too much load #1470

Closed
1 of 2 tasks
richardartoul opened this issue Mar 17, 2019 · 2 comments
Labels
area:db All issues pertaining to dbnode P: High T: Reliability

Comments

@richardartoul
Contributor

richardartoul commented Mar 17, 2019

  • I was looking through the WriteBatchRaw path and realized that we don't check db.IsOverloaded there. So when the commitlog queue is full (and our most important pools, like writeBatchedPoolReqBool and writeBatchPool, which are both extremely expensive to allocate, are empty), we allocate both of them anyway, write all of the data into our in-memory data structures, and only THEN try to write to the commitlog, get an error, and reject the write. If we checked db.IsOverloaded at the beginning of the method, we could substantially reduce the cost of rejecting a write under heavy load, which may allow the node to recover instead of getting stuck in a cycle where a perpetually full commitlog queue and all the goroutines / pool allocations eventually OOM the node. The only downside of this approach is that we would reject some writes that we previously would probably have accepted under heavy load, but in practice I bet this would be a big net win for reliability even if we need to tune our queue size for production (especially with all the M3msg buffering and retries we have at our disposal). If we wanted to be aggressive, we could even inject a preflight check into the thrift server to prevent it from allocating the thrift structs at all when we know we're going to reject the request. (A rough sketch of the early check follows after this list.)

  • Nodes are still getting OOM'd / annihilated by expensive index queries. I think we should push the context further down into the query code so that, between expensive operations like querying a block or querying a segment, we can check whether we should even continue. Right now an expensive query comes in and eventually times out, but M3DB keeps chugging along, allocating and querying, even though the user has already received the timeout error. (See the second sketch below.)
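A minimal sketch of the early overload check, for the first item. The `service`, `database`, and `write` types here are illustrative stand-ins, not M3DB's actual interfaces (the real handler deals with Thrift request structs and pooled objects); the point is just that the check happens before any expensive allocation:

```go
package main

import (
	"errors"
	"fmt"
)

var errServerIsOverloaded = errors.New("server is overloaded")

// write is a stand-in for a single decoded datapoint in a batch.
type write struct {
	id    string
	ts    int64
	value float64
}

// database is a stand-in for the storage layer; IsOverloaded is assumed
// to report a full commitlog queue (or a similar overload signal).
type database interface {
	IsOverloaded() bool
	WriteBatch(batch []write) error
}

type service struct{ db database }

// WriteBatchRaw sketches the proposed fast path: check for overload
// before any pooled allocations or in-memory writes, so a rejected
// write costs almost nothing instead of paying for the full write path
// only to fail at the commitlog.
func (s *service) WriteBatchRaw(batch []write) error {
	if s.db.IsOverloaded() {
		return errServerIsOverloaded // fail fast, shed load
	}
	// ... existing path: acquire pooled batch, write to in-memory
	// structures, enqueue commitlog writes ...
	return s.db.WriteBatch(batch)
}

// overloadedDB simulates a node whose commitlog queue is full.
type overloadedDB struct{}

func (overloadedDB) IsOverloaded() bool             { return true }
func (overloadedDB) WriteBatch(batch []write) error { return nil }

func main() {
	svc := &service{db: overloadedDB{}}
	err := svc.WriteBatchRaw([]write{{id: "foo", ts: 0, value: 42}})
	fmt.Println(err) // server is overloaded
}
```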
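And a minimal sketch of the context check for the second item, using the standard library context package (M3DB has its own context type, but the shape is the same). `block` and `queryBlocks` are hypothetical stand-ins for the per-block / per-segment query path:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// block is a stand-in for an index block; querying one is assumed to be
// expensive (allocations, segment scans).
type block struct{ name string }

func (b block) query() []string {
	time.Sleep(10 * time.Millisecond) // simulate expensive work
	return []string{b.name + "-result"}
}

// queryBlocks sketches pushing the request context down into the query
// path: between each expensive per-block operation we check whether the
// caller has already gone away, so a timed-out query stops allocating
// instead of chugging along to completion.
func queryBlocks(ctx context.Context, blocks []block) ([]string, error) {
	var results []string
	for _, b := range blocks {
		select {
		case <-ctx.Done():
			// Caller timed out or cancelled; abandon the query early.
			return nil, ctx.Err()
		default:
		}
		results = append(results, b.query()...)
	}
	return results, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Millisecond)
	defer cancel()

	blocks := []block{{"a"}, {"b"}, {"c"}, {"d"}, {"e"}}
	results, err := queryBlocks(ctx, blocks)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("query abandoned early:", err)
		return
	}
	fmt.Println(results)
}
```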

@richardartoul
Contributor Author

@Haijuncao I'd love your thoughts on this since I know you did some similar work to improve Schemaless's ability to load shed

@richardartoul richardartoul changed the title Improve M3DB's ability to apply backpressure and prevent catastrophic failure when under too much load Improve M3DB's ability to load shed / apply backpressure and prevent catastrophic failure when under too much load Mar 17, 2019
@richardartoul richardartoul added P: High T: Reliability area:db All issues pertaining to dbnode labels Mar 17, 2019
@richardartoul
Contributor Author

First portion completed by #1482
