[Dashboard] add RAY_PROMETHEUS_HEADERS env for carrying additional headers to Prometheus #49353
-GRAFANA_DATASOURCE_TEMPLATE = """apiVersion: 1
-
-datasources:
-  - name: {prometheus_name}
-    url: {prometheus_host}
-    type: prometheus
-    isDefault: true
-    access: proxy
-"""
+def GRAFANA_DATASOURCE_TEMPLATE(
+    prometheus_name, prometheus_host, jsonData, secureJsonData
+):
+    return yaml.safe_dump(
+        {
+            "apiVersion": 1,
+            "datasources": [
+                {
+                    "name": prometheus_name,
+                    "url": prometheus_host,
+                    "type": "prometheus",
+                    "isDefault": True,
+                    "access": "proxy",
+                    "jsonData": jsonData,
+                    "secureJsonData": secureJsonData,
+                }
+            ],
+        }
+    )
I think using `yaml.safe_dump` is better because a simple string template is no longer enough for complex values in `jsonData` and `secureJsonData`, and there might also be some values that require escaping to fit into a YAML file.
I am not familiar with our Prometheus setup, but the Ray head node has several metric endpoints, including 8080, Autoscaler metrics, and Dashboard metrics. Are they in separate codepaths?
@@ -13,6 +13,8 @@

 DEFAULT_PROMETHEUS_HOST = "http://localhost:9090"
 PROMETHEUS_HOST_ENV_VAR = "RAY_PROMETHEUS_HOST"
+DEFAULT_PROMETHEUS_HEADERS = "{}"
+PROMETHEUS_HEADERS_ENV_VAR = "RAY_PROMETHEUS_HEADERS"
Can you also add the comments about the format here?
I have added the comment below and next to the `parse_prom_headers` function.
> I am not familiar with our Prometheus setup, but the Ray head node has several metric endpoints, including 8080, Autoscaler metrics, and Dashboard metrics. Are they in separate codepaths?

Yes, they are in different code paths. Prometheus periodically scrapes metrics from these metrics endpoints, while this PR targets the endpoints of the Prometheus server.
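To make the distinction concrete, here is a minimal sketch of the kind of request this PR is about: a query sent to the Prometheus server with extra headers attached. It is not the dashboard's code. It uses `urllib` from the standard library as a stand-in for the aiohttp client, the helper name `build_prometheus_query` and the header `X-Auth-Token` are made up for illustration, and the request is only built, never sent.

```python
import os
import urllib.parse
import urllib.request


def build_prometheus_query(expr, host=None, headers=None):
    """Build (but do not send) an instant-query request to a Prometheus server.

    Hypothetical helper for illustration only; the dashboard itself uses aiohttp.
    """
    host = host or os.environ.get("RAY_PROMETHEUS_HOST", "http://localhost:9090")
    # /api/v1/query is the instant-query endpoint of the Prometheus HTTP API.
    query_string = urllib.parse.urlencode({"query": expr})
    url = f"{host}/api/v1/query?{query_string}"
    # Extra headers are what an authentication proxy in front of Prometheus
    # would inspect before forwarding the request.
    return urllib.request.Request(url, headers=dict(headers or {}))


req = build_prometheus_query(
    "up",
    host="http://localhost:9090",
    headers={"X-Auth-Token": "secret"},  # made-up header name and value
)
print(req.full_url)
print(req.header_items())
```

The scrape direction (Prometheus pulling from port 8080 and the other metrics endpoints) is unaffected by these headers.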
@@ -73,6 +76,10 @@ def __init__(self, dashboard_head):
         self.prometheus_host = os.environ.get(
             PROMETHEUS_HOST_ENV_VAR, DEFAULT_PROMETHEUS_HOST
         )
+        self.prometheus_headers = os.environ.get(
can we do some basic validation of the format before using the env var?
A new `parse_prom_headers` function is added for both validation and parsing. This is what it looks like when the validation fails:
▶ export RAY_PROMETHEUS_HEADERS='[["XAuth"],["XAuth2"]]'
▶ ray start --head --include-dashboard=true --metrics-export-port=8080
Local node IP: 127.0.0.1
2024-12-28 13:32:56,919 ERROR services.py:1353 -- Failed to start the dashboard , return code 1
2024-12-28 13:32:56,920 ERROR services.py:1378 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
Traceback (most recent call last):
...
File "/Users/ruian/Code/python/ray/python/ray/dashboard/modules/metrics/metrics_head.py", line 97, in __init__
self.prometheus_headers = parse_prom_headers(
^^^^^^^^^^^^^^^^^^^
File "/Users/ruian/Code/python/ray/python/ray/dashboard/modules/metrics/metrics_head.py", line 74, in parse_prom_headers
raise ValueError(
ValueError: RAY_PROMETHEUS_HEADERS should be a JSON string in one of the formats:
1) An object with string keys and string values.
2) an array of string arrays with 2 string elements each.
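A validator with the behavior shown in this traceback could look roughly like the sketch below. It is a reconstruction from the error message and the two documented formats, not the code merged in the PR; only the function name, the environment variable name, and the message text come from this thread.

```python
import json

PROMETHEUS_HEADERS_ENV_VAR = "RAY_PROMETHEUS_HEADERS"


def parse_prom_headers(prometheus_headers):
    """Parse the JSON string and return it if it matches a supported format."""
    # Malformed JSON raises json.JSONDecodeError, which subclasses ValueError.
    parsed = json.loads(prometheus_headers)

    # Format 1: an object with string keys and string values.
    if isinstance(parsed, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in parsed.items()
    ):
        return parsed

    # Format 2: an array of string arrays with 2 string elements each.
    if isinstance(parsed, list) and all(
        isinstance(pair, list)
        and len(pair) == 2
        and all(isinstance(item, str) for item in pair)
        for pair in parsed
    ):
        return parsed

    raise ValueError(
        f"{PROMETHEUS_HEADERS_ENV_VAR} should be a JSON string in one of the formats:\n"
        + "1) An object with string keys and string values.\n"
        + "2) an array of string arrays with 2 string elements each."
    )


print(parse_prom_headers('{"H1": "V1", "H2": "V2"}'))
print(parse_prom_headers('[["H1", "V1"], ["H2", "V2"], ["H2", "V3"]]'))
```

Calling it with `'[["XAuth"],["XAuth2"]]'`, the value used in the session above, falls through both checks and raises the `ValueError`.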
…aders to Prometheus Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
Should we add a test?
        return parsed
    raise ValueError(
        f"{PROMETHEUS_HEADERS_ENV_VAR} should be a JSON string in one of the formats:\n"
        + "1) An object with string keys and string values.\n"
can you add the example to the ValueError?
👌
# parse_prom_headers will make sure the input is in one of the following formats:
# 1. {"H1": "V1", "H2": "V2"}
# 2. [["H1", "V1"], ["H2", "V2"], ["H2", "V3"]]
def parse_prom_headers(prometheus_headers):
This file seems to be highly similar to `metrics_head.py`. What are the differences between them? Perhaps @alanwguo knows the difference.
As far as I can tell, this file is used at:
ray/release/ray_release/command_runner/_anyscale_job_wrapper.py
Lines 120 to 130 in b25c72c
    subprocess.run(
        [
            "python",
            "prometheus_metrics.py",
            str(time_taken),
            "--path",
            os.environ["METRICS_OUTPUT_JSON"],
        ],
        timeout=metrics_timeout,
        check=True,
    )
ray/release/ray_release/command_runner/job_runner.py
Lines 81 to 84 in b25c72c
    def save_metrics(self, start_time: float, timeout: float = 900):
        self.run_prepare_command(
            f"python prometheus_metrics.py {start_time}", timeout=timeout
        )
According to their commit messages, these usages are related to CI procedures.
It also looks like the script intentionally does not import Ray, so I copied the `parse_prom_headers` function from `metrics_head.py` into it instead of importing it. But if we don't need this in the CI procedures, I will just remove the change. Do we need this in the CI procedures?
I have removed the change from `_prometheus_metrics.py`. It seems we will never need this feature in the release procedures.
…sage Signed-off-by: Rueian <[email protected]>
Sure! How about adding the test to python/ray/dashboard/tests/test_dashboard.py? I have added one there and used the
Signed-off-by: Rueian <[email protected]>
…trics.py release tool Signed-off-by: Rueian <[email protected]>
    # Test the unsupported case.
    with pytest.raises(ValueError):
        os.environ[PROMETHEUS_HEADERS_ENV_VAR] = '{"H1": "V1", "H2": ["V1", "V2"]}'
maybe use `monkeypatch.setenv` to avoid env var leak
Thanks for the hint. Updated.
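For readers unfamiliar with the suggestion: pytest's `monkeypatch` fixture records the previous state of an environment variable and restores it when the test finishes, so a value set in one test cannot leak into the next. The sketch below shows the pattern; `load_headers_from_env` is a hypothetical stand-in for the code under test, not a function from the PR.

```python
import json
import os

PROMETHEUS_HEADERS_ENV_VAR = "RAY_PROMETHEUS_HEADERS"


def load_headers_from_env():
    """Hypothetical helper: decode the env var, defaulting to an empty object."""
    return json.loads(os.environ.get(PROMETHEUS_HEADERS_ENV_VAR, "{}"))


def test_headers_read_from_env(monkeypatch):
    # setenv is undone automatically at test teardown, unlike a direct
    # assignment to os.environ, which would persist for later tests.
    monkeypatch.setenv(PROMETHEUS_HEADERS_ENV_VAR, '{"H1": "V1", "H2": "V2"}')
    assert load_headers_from_env() == {"H1": "V1", "H2": "V2"}
```

pytest injects the `monkeypatch` argument by name, so the test needs no extra import for it.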
Signed-off-by: Rueian <[email protected]>
Trigger the whole CI tests.
…in minimal tests Signed-off-by: Rueian <[email protected]>
Hi @kevin85421, the previous premerge check failed on minimal installation tests and I pushed a fix for that. Could you help trigger the whole CI test again? Thank you.
@rueian With the go label on the PR, all tests will be executed.
…aders to Prometheus (ray-project#49353) Signed-off-by: Rueian <[email protected]>
…aders to Prometheus (ray-project#49353) Signed-off-by: Rueian <[email protected]> Signed-off-by: Roshan Kathawate <[email protected]>
…aders to Prometheus (ray-project#49353) Signed-off-by: Rueian <[email protected]> Signed-off-by: Puyuan Yao <[email protected]>
Why are these changes needed?
Some users from Slack channels have an authentication proxy in front of their Prometheus server and that proxy requires requests to have specific HTTP headers to pass. Currently, we don't have a way for them to customize headers for those requests.
This PR introduces a new `RAY_PROMETHEUS_HEADERS` environment variable, next to the existing `RAY_PROMETHEUS_HOST`, to allow users to add custom headers for requests to Prometheus from Ray Dashboard and Grafana (ref).

The format of `RAY_PROMETHEUS_HEADERS` should be a JSON string in one of the following formats:
1. {"H1": "V1", "H2": "V2"}
2. [["H1", "V1"], ["H2", "V2"], ["H2", "V3"]]

The second format supports multiple headers with the same name (which is valid per the HTTP spec), while the first format doesn't because of an aiohttp limitation; i.e., a format like `{"H2": ["V2", "V3"]}` is not supported.

A Grafana datasource.yml example of the first format
A Grafana datasource.yml example of the second format
A request dump example from Ray Dashboard to Prometheus
A request dump example from Grafana to Prometheus
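On the Grafana side, datasource provisioning expresses custom headers as numbered `httpHeaderNameN` entries under `jsonData` and matching `httpHeaderValueN` entries under `secureJsonData`. The sketch below shows one way both header formats could be turned into those two mappings before being handed to the datasource template. It assumes headers are numbered in input order; the helper name `grafana_header_fields` is hypothetical and this is not the PR's code.

```python
def grafana_header_fields(headers):
    """Map parsed headers to Grafana's jsonData / secureJsonData header fields.

    `headers` is either a dict (format 1) or a list of [name, value] pairs
    (format 2, which may repeat a header name).
    """
    if isinstance(headers, dict):
        pairs = list(headers.items())
    else:
        pairs = [(name, value) for name, value in headers]

    json_data = {}
    secure_json_data = {}
    # Grafana pairs httpHeaderName<N> with httpHeaderValue<N>, starting at 1.
    for index, (name, value) in enumerate(pairs, start=1):
        json_data[f"httpHeaderName{index}"] = name
        secure_json_data[f"httpHeaderValue{index}"] = value
    return json_data, secure_json_data


# Format 1: unique header names.
print(grafana_header_fields({"H1": "V1", "H2": "V2"}))
# Format 2: "H2" appears twice and both values are kept.
print(grafana_header_fields([["H1", "V1"], ["H2", "V2"], ["H2", "V3"]]))
```

Because each header gets its own numbered slot, a repeated name in the second format does not overwrite the earlier value.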
Related issue number