[FEATURE] 🔧 Request for API Health Monitoring and Backpressure Controls in Managed Runtime #2181

Open
johnboxall opened this issue Jan 3, 2025 · 0 comments

Comments

@johnboxall
Collaborator

In apps deployed on Managed Runtime, most of the response time is spent waiting on network I/O. As a result, performance varies with the health and responsiveness of upstream APIs.

To improve observability and resilience, it would be valuable to:

  • Give visibility into the health of upstream APIs (e.g., SCAPI).
  • Add tooling to manage backpressure and mitigate the impact of slow or degraded upstream services.

Currently, diagnosing slow API response times is cumbersome. Without detailed telemetry on API performance, it is difficult to pinpoint the source of delays or to answer questions such as:

  • Is SCAPI generally healthy, or is a specific endpoint struggling?
  • How often are error rates spiking?
  • Are certain endpoints consistently slower than others?

API Health Observability

Introduce real-time monitoring of upstream API health with granular filtering capabilities, enabling deeper analysis of response times and status codes. Example use cases:

  • Overall Health: What’s the p99 response time for SCAPI requests?
  • Filtered by Status Code: What’s the p99 for HTTP 200 responses?
  • Endpoint-Level Analysis: What’s the p99 for Shopper Search requests?
  • Param-Level Insight: What’s the p99 for Shopper Search with no expands?

This visibility would help surface degradation trends and identify where optimization efforts should focus.
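
To make the use cases above concrete, here is a minimal sketch of how a filtered p99 could be computed from recorded upstream calls. The `UpstreamSample` shape, its field names, and the endpoint labels are assumptions for illustration; they are not an existing Managed Runtime or SCAPI API. The percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
// Hypothetical record of one upstream call; all field names are assumptions.
interface UpstreamSample {
  endpoint: string;   // e.g. "shopper-search"
  status: number;     // HTTP status code returned by the upstream API
  durationMs: number; // time spent waiting on the upstream call
  expand: string[];   // values of the `expand` query parameter, if any
}

// Nearest-rank percentile: the smallest value with at least p% of samples <= it.
function percentile(values: number[], p: number): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// p99 over the samples that pass an optional filter.
function p99(
  samples: UpstreamSample[],
  keep: (s: UpstreamSample) => boolean = () => true,
): number | undefined {
  return percentile(samples.filter(keep).map((s) => s.durationMs), 99);
}

const samples: UpstreamSample[] = [
  { endpoint: "shopper-search", status: 200, durationMs: 120, expand: [] },
  { endpoint: "shopper-search", status: 200, durationMs: 480, expand: ["images"] },
  { endpoint: "shopper-search", status: 503, durationMs: 2000, expand: ["prices"] },
  { endpoint: "shopper-products", status: 200, durationMs: 90, expand: [] },
];

const overallP99 = p99(samples);
const okP99 = p99(samples, (s) => s.status === 200);
const searchP99 = p99(samples, (s) => s.endpoint === "shopper-search");
const searchNoExpandP99 = p99(
  samples,
  (s) => s.endpoint === "shopper-search" && s.expand.length === 0,
);
```

Each of the four calls maps to one of the example questions: overall, by status code, by endpoint, and by parameter. A production version would more likely aggregate into histograms than keep raw samples in memory.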

Backpressure and Fail-Safe Mechanisms

To prevent cascading failures, provide mechanisms to control and adjust how the system behaves when upstream APIs are slow or failing.

Timeout Enforcement

  • Implement configurable timeouts for upstream API calls (e.g., 10-second hard limit).
  • Default to SCAPI’s documented timeout of 10 seconds but allow overrides through environment variables.
  • Ensure timeouts are only enforced server-side to avoid client-side variability.
  • Log requests exceeding timeout limits with clear error messaging for easy troubleshooting.
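
As a rough illustration of the timeout behavior described above, the sketch below wraps an upstream call in a hard deadline, reads an override from an environment-style map, and logs when the limit is hit. The variable name `UPSTREAM_TIMEOUT_MS` and the helpers `resolveTimeoutMs` and `withTimeout` are hypothetical; nothing here is an existing Managed Runtime setting.

```typescript
// Default taken from the 10-second limit mentioned above.
const DEFAULT_TIMEOUT_MS = 10_000;

// Read a timeout override from an environment-like map (e.g. process.env).
// "UPSTREAM_TIMEOUT_MS" is a hypothetical variable name.
function resolveTimeoutMs(
  env: Record<string, string | undefined>,
  name: string = "UPSTREAM_TIMEOUT_MS",
): number {
  const parsed = Number(env[name]);
  // Missing, non-numeric, zero, or negative values fall back to the default.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
}

// Run `call` with a hard deadline. The AbortSignal lets the underlying
// request (for example, fetch) be cancelled when the deadline passes.
async function withTimeout<T>(
  label: string,
  timeoutMs: number,
  call: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      const message = `Upstream call "${label}" exceeded ${timeoutMs} ms`;
      console.error(message); // clear error message for troubleshooting
      controller.abort();
      reject(new Error(message));
    }, timeoutMs);
  });
  try {
    return await Promise.race([call(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Example use on the server (not executed here):
//   const timeoutMs = resolveTimeoutMs(process.env);
//   const res = await withTimeout("shopper-search", timeoutMs,
//     (signal) => fetch(url, { signal }));
```

Passing the environment in as an argument keeps the helper easy to test and leaves the decision about where configuration lives to the caller.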

Circuit Breakers

  • Integrate circuit breakers to temporarily halt traffic to slow or failing upstream APIs.
  • Allow per-host/proxy circuit breakers, configurable by environment variables.
  • Maintain circuit breaker state centrally to ensure consistency across executions.
  • Use existing frameworks (e.g., opossum).
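
To illustrate the state machine a circuit breaker implements, here is a minimal hand-rolled sketch with one breaker per host. It is not opossum and does not mirror its API; the class, option names, and thresholds are assumptions. State is held in memory only, so it does not address the requirement to keep breaker state centrally across executions.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the breaker opens
  resetTimeoutMs: number;   // time to stay open before allowing a trial request
  now?: () => number;       // injectable clock, useful for tests
}

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private current: BreakerState = "closed";
  private readonly now: () => number;

  constructor(private readonly options: BreakerOptions) {
    this.now = options.now ?? (() => Date.now());
  }

  // An open breaker becomes half-open once the reset timeout has elapsed.
  get state(): BreakerState {
    const elapsed = this.now() - this.openedAt;
    if (this.current === "open" && elapsed >= this.options.resetTimeoutMs) {
      this.current = "half-open";
    }
    return this.current;
  }

  // Callers check this before sending a request upstream.
  allowRequest(): boolean {
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.current = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the breaker.
    if (this.state === "half-open" || this.failures >= this.options.failureThreshold) {
      this.current = "open";
      this.openedAt = this.now();
    }
  }
}

// One breaker per upstream host, created on first use.
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(host: string, options: BreakerOptions): CircuitBreaker {
  let breaker = breakers.get(host);
  if (!breaker) {
    breaker = new CircuitBreaker(options);
    breakers.set(host, breaker);
  }
  return breaker;
}
```

As a simplification, this sketch lets any number of requests through while half-open until one of them reports a result. A maintained library handles that case, along with rolling error-rate windows and fallbacks, which is a good reason to prefer one.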