[FEATURE] 🔧 Request for API Health Monitoring and Backpressure Controls in Managed Runtime #2181

johnboxall · 2025-01-03T05:40:21Z

In apps deployed on Managed Runtime, most of the response time is spent waiting on network I/O. As a result, performance varies with the health and responsiveness of upstream APIs.

To improve observability and resilience, it would be valuable to:

Give visibility into the health of upstream APIs (e.g., SCAPI).
Add tooling to manage backpressure and mitigate the impact of slow or degraded upstream services.

Currently, diagnosing slow API response times is cumbersome. Without detailed telemetry on API performance, it can be difficult to pinpoint the source of delays. For instance:

Is SCAPI generally healthy, or is a specific endpoint struggling?
How often are error rates spiking?
Are certain endpoints consistently slower than others?

API Health Observability

Introduce real-time monitoring of upstream API health with granular filtering capabilities, enabling deeper analysis of response times and status codes. Example use cases:

Overall Health: What’s the p99 response time for SCAPI requests?
Filtered by Status Code: What’s the p99 for HTTP 200 responses?
Endpoint-Level Analysis: What’s the p99 for Shopper Search requests?
Param-Level Insight: What’s the p99 for Shopper Search with no expands?

This visibility would help surface degradation trends and identify where optimization efforts should focus.
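As a rough illustration of the queries above, here is a minimal sketch of computing a filtered p99 over recorded upstream calls. The `UpstreamSample` shape, its field names, and labels such as `"shopper-search"` are assumptions made for this example, not an existing Managed Runtime API. The percentile uses the nearest-rank method; a real telemetry backend would likely use histograms or sketches instead.

```typescript
// Hypothetical shape of one recorded upstream call (field names are assumptions).
interface UpstreamSample {
  api: string; // illustrative label, e.g. "shopper-search"
  status: number; // HTTP status code returned by the upstream
  durationMs: number; // time spent waiting on the upstream
  params: Record<string, string>; // query parameters that were sent
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// p99 of durations, restricted to the samples accepted by `keep` (all samples by default).
function p99(
  samples: UpstreamSample[],
  keep: (s: UpstreamSample) => boolean = () => true
): number | undefined {
  return percentile(
    samples.filter(keep).map((s) => s.durationMs),
    99
  );
}
```

Under these assumptions, `p99(samples)` covers the overall health question, `p99(samples, (s) => s.status === 200)` the status-code filter, and `p99(samples, (s) => s.api === "shopper-search" && !("expand" in s.params))` the param-level one.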

Backpressure and Fail-Safe Mechanisms

To prevent cascading failures, provide mechanisms to control and adjust how the system behaves when upstream APIs are slow or failing.

Timeout Enforcement

Implement configurable timeouts for upstream API calls (e.g., a 10-second hard limit).
Default to SCAPI’s documented timeout of 10 seconds, but allow overrides through environment variables.
Enforce timeouts only on the server side, so behavior does not vary with client conditions.
Log requests that exceed the timeout with a clear error message for easy troubleshooting.
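The timeout behavior described above could look roughly like the following sketch. The environment variable name `UPSTREAM_TIMEOUT_MS` and both function names are hypothetical, chosen only for illustration. The sketch relies on the standard `fetch` and `AbortController` globals available in recent Node.js versions.

```typescript
// Hypothetical variable name; the proposal does not specify one.
const TIMEOUT_ENV_VAR = "UPSTREAM_TIMEOUT_MS";
const DEFAULT_TIMEOUT_MS = 10_000; // the 10-second default proposed above

// Read the override, falling back to the default for missing or invalid values.
function resolveTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env[TIMEOUT_ENV_VAR];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
}

// Abort the upstream request once the limit is reached and log a clear message.
// Note: the timer is cleared when fetch resolves (headers received), so reading
// the response body afterwards is not bounded by this sketch.
async function fetchWithTimeout(
  url: string,
  timeoutMs: number,
  init: RequestInit = {}
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } catch (err) {
    if (controller.signal.aborted) {
      console.error(`Upstream request to ${url} exceeded the ${timeoutMs} ms timeout`);
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

On the server this might be called as `fetchWithTimeout(url, resolveTimeoutMs(process.env))`. Keeping the parsing separate from the request makes the override easy to validate on its own.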

Circuit Breakers

Integrate circuit breakers to temporarily halt traffic to slow or failing upstream APIs.
Allow per-host/proxy circuit breakers, configurable by environment variables.
Maintain circuit breaker state centrally to ensure consistency across executions.
Use an existing library (e.g., opossum).
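To make the requested behavior concrete, here is a minimal hand-rolled sketch of the closed / open / half-open state machine with one breaker per upstream host. It is not opossum's API. All names (`CircuitBreaker`, `BreakerOptions`, `breakerFor`) and the option semantics are assumptions for illustration. State lives in process memory here, so the centrally maintained state requested above is not addressed.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical options; real values might come from environment variables.
interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the breaker opens
  resetTimeoutMs: number; // how long to stay open before allowing trial requests
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly options: BreakerOptions;
  private readonly now: () => number;

  // `now` is injectable so the timing logic can be exercised without waiting.
  constructor(options: BreakerOptions, now: () => number = Date.now) {
    this.options = options;
    this.now = now;
  }

  // Returns true if a request may be sent to the upstream.
  allowRequest(): boolean {
    const elapsed = this.now() - this.openedAt;
    if (this.state === "open" && elapsed >= this.options.resetTimeoutMs) {
      this.state = "half-open"; // allow trial requests until an outcome is recorded
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the breaker.
    if (this.state === "half-open" || this.failures >= this.options.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// One breaker per upstream host, created lazily on first use.
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(host: string, options: BreakerOptions): CircuitBreaker {
  let breaker = breakers.get(host);
  if (!breaker) {
    breaker = new CircuitBreaker(options);
    breakers.set(host, breaker);
  }
  return breaker;
}
```

A caller would check `allowRequest()` before each upstream call and report the outcome with `recordSuccess()` or `recordFailure()`. A maintained library such as opossum wraps these steps around the call itself.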
