Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Train V2] Output JSON structured log files for system and application logs #49414

Merged
merged 1 commit into from
Dec 23, 2024

Conversation

justinvyu
Copy link
Contributor

Summary

Structured logging means outputting system logs (ex: internal Ray library logs) and application logs (ex: log outputs from user-defined functions) in JSON format with extra metadata to be ingested by some external log-viewer. This gives users a better experience searching through logs. For example, Ray Train users will be able to filter logs with tags such as worker rank.

import logging
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func_per_worker():
    logger = logging.getLogger(__name__)
    print("UDF print information")
    logger.info("UDF logger information")


def train_dummy(num_workers=1):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    return result


if __name__ == "__main__":
    train_dummy()

In the ray-session-id/logs/ray-train/ folder, there are four log files listed:

ray-train-sys-controller-070d6.log,  ray-train-app-controller-070d6.log
ray-train-sys-worker-56ab6.log,  ray-train-sys-worker-56ab6.log 

where worker_id=070d6 is the the ray train controller process; worker_id=56ab6 is the ray train worker process.
Note that internal logs contains only system logs and are not interesting to the users, it's only for ray engineer teams. The user logs contain both udf and system logs that the user would be interested about.

In ray-train-sys-controller-070d6.log is below:

{"asctime": "2024-12-16 17:14:25,230", "levelname": "INFO", "message": "Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1", "filename": "worker_group.py", "lineno": 264, "job_id": "01000000", "worker_id": "070d69941fe3c67eaea32592a2c400e897201f744ee90b510fdc7253", "node_id": "2355adfd9a831b6f0dc07f98822e14f0c3883ec1fbbfdcbf33e68822", "actor_id": "df24d5772eb9396846fd459501000000", "task_id": "16310a0f0a45af5cdf24d5772eb9396846fd459501000000", "task_name": "TrainController.run", "task_func_name": "ray.train.v2._internal.execution.controller.TrainController.run", "actor_name": "", "run_name": "ray_train_run-2024-12-16_17-14-23", "component": "controller"}

In ray-train-app-controller-070d6 is below:

{"asctime": "2024-12-16 17:14:25,230", "levelname": "INFO", "message": "Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1", "filename": "worker_group.py", "lineno": 264, "job_id": "01000000", "worker_id": "070d69941fe3c67eaea32592a2c400e897201f744ee90b510fdc7253", "node_id": "2355adfd9a831b6f0dc07f98822e14f0c3883ec1fbbfdcbf33e68822", "actor_id": "df24d5772eb9396846fd459501000000", "task_id": "16310a0f0a45af5cdf24d5772eb9396846fd459501000000", "task_name": "TrainController.run", "task_func_name": "ray.train.v2._internal.execution.controller.TrainController.run", "actor_name": "", "run_name": "ray_train_run-2024-12-16_17-14-23", "component": "controller"}

In ray-train-sys-worker-56ab6.log is:

{"asctime": "2024-12-16 17:14:27,121", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=1]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "56ab64b5f4555f2729ab740089dc56ba66e1c19b7851736ae5e2faed", "node_id": "2355adfd9a831b6f0dc07f98822e14f0c3883ec1fbbfdcbf33e68822", "actor_id": "2997593f7bd933cf4632137c01000000", "task_id": "c791fa3761a9f9fa2997593f7bd933cf4632137c01000000", "task_name": "execute._setup_torch_process_group", "task_func_name": "ray.train.v2._internal.execution.worker_group.worker.RayTrainWorker.execute", "actor_name": "", "run_name": "ray_train_run-2024-12-16_17-14-23", "world_rank": 0, "local_rank": 0, "node_rank": 0, "component": "worker"}

In ray-train-app-worker-56ab6.log is below, where logger output and print output are both captured here.:

{"asctime": "2024-12-16 17:14:27,121", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=1]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "56ab64b5f4555f2729ab740089dc56ba66e1c19b7851736ae5e2faed", "node_id": "2355adfd9a831b6f0dc07f98822e14f0c3883ec1fbbfdcbf33e68822", "actor_id": "2997593f7bd933cf4632137c01000000", "task_id": "c791fa3761a9f9fa2997593f7bd933cf4632137c01000000", "task_name": "execute._setup_torch_process_group", "task_func_name": "ray.train.v2._internal.execution.worker_group.worker.RayTrainWorker.execute", "actor_name": "", "run_name": "ray_train_run-2024-12-16_17-14-23", "world_rank": 0, "local_rank": 0, "node_rank": 0, "component": "worker"}
{"asctime": "2024-12-16 17:14:27,129", "levelname": "INFO", "message": "UDF print information\n", "filename": "_logging_test.py", "lineno": 12, "job_id": "01000000", "worker_id": "56ab64b5f4555f2729ab740089dc56ba66e1c19b7851736ae5e2faed", "node_id": "2355adfd9a831b6f0dc07f98822e14f0c3883ec1fbbfdcbf33e68822", "actor_id": "2997593f7bd933cf4632137c01000000", "task_id": "25b9de50193ae1a23ff5b2507e9fcff72107d4a401000000", "task_name": "", "task_func_name": "", "actor_name": "", "run_name": "ray_train_run-2024-12-16_17-14-23", "world_rank": 0, "local_rank": 0, "node_rank": 0, "component": "worker"}
{"asctime": "2024-12-16 17:14:27,129", "levelname": "INFO", "message": "UDF logger information", "filename": "_logging_test.py", "lineno": 13, "job_id": "01000000", "worker_id": "56ab64b5f4555f2729ab740089dc56ba66e1c19b7851736ae5e2faed", "node_id": "2355adfd9a831b6f0dc07f98822e14f0c3883ec1fbbfdcbf33e68822", "actor_id": "2997593f7bd933cf4632137c01000000", "task_id": "25b9de50193ae1a23ff5b2507e9fcff72107d4a401000000", "task_name": "", "task_func_name": "", "actor_name": "", "run_name": "ray_train_run-2024-12-16_17-14-23", "world_rank": 0, "local_rank": 0, "node_rank": 0, "component": "worker"}

In console output, we have below plain text output:

2024-12-16 17:14:23,216 INFO worker.py:1827 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(TrainController pid=98656) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
(RayTrainWorker pid=98665) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=98665) UDF print information
(RayTrainWorker pid=98665) 
(RayTrainWorker pid=98665) UDF logger information

…wer experience. (ray-project#1239)

This PR implements the structured logging proposed in this design doc
[doc](https://docs.google.com/document/d/1GofIo5KddHvpcSw17-KX6yszogzpDNfw93IqkGBBVKM/edit?usp=sharing)

This PR is built based on the train context revamp PR  ray-project#1227
---------

Signed-off-by: Hongpeng Guo <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
@justinvyu justinvyu changed the title [Train V2] Output JSON structured log files for system and application logs (#1239) [Train V2] Output JSON structured log files for system and application logs Dec 23, 2024
@justinvyu justinvyu enabled auto-merge (squash) December 23, 2024 19:59
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Dec 23, 2024
@justinvyu justinvyu merged commit 9f5b57c into ray-project:master Dec 23, 2024
6 checks passed
@justinvyu justinvyu deleted the cp_structured_logging branch December 23, 2024 21:57
srinathk10 pushed a commit that referenced this pull request Jan 3, 2025
…n logs (#49414)

Structured logging means outputting system logs (ex: internal Ray
library logs) and application logs (ex: log outputs from user-defined
functions) in JSON format with extra metadata to be ingested by some
external log-viewer. This gives users a better experience searching
through logs. For example, Ray Train users will be able to filter logs
with tags such as worker rank.

Signed-off-by: Hongpeng Guo <[email protected]>
Co-authored-by: Hongpeng Guo <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
…n logs (ray-project#49414)

Structured logging means outputting system logs (ex: internal Ray
library logs) and application logs (ex: log outputs from user-defined
functions) in JSON format with extra metadata to be ingested by some
external log-viewer. This gives users a better experience searching
through logs. For example, Ray Train users will be able to filter logs
with tags such as worker rank.

Signed-off-by: Hongpeng Guo <[email protected]>
Co-authored-by: Hongpeng Guo <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Puyuan Yao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants