Charlie/dev/v0.7.0 #2236

charlieyl · 2024-12-17T11:07:05Z

[bugfix] 1. Enhance GPU cache management by setting the initial available GPU IDs when resetting the available GPUs. 2. Set shm_size to 8G if not specified.

python/fedml/computing/scheduler/master/cloud_server_manager.py

+                + " [email protected] -n fedml-devops-aggregator-"
+                + self.version
+        )
+        logging.info("Create secret cmd: " + registry_secret_cmd)


To fix the problem, we need to avoid logging sensitive information in clear text. Instead of logging the entire registry_secret_cmd, we can log a sanitized version of it that omits the sensitive parts. This way, we still get useful logging information without exposing sensitive data.

Identify the lines where sensitive information is being logged.

Replace the logging statements with sanitized versions that exclude sensitive data.

Ensure that the functionality of the code remains unchanged.
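
The steps above can be sketched as a small redaction helper. This is a minimal illustration, not FedML code: the name `redact_command` and the list of sensitive flags are assumptions chosen to match the `--docker-username` and `--docker-password` options visible in the flagged command.

```python
import re

# Hypothetical helper (not part of FedML): flags whose values must never be logged.
SENSITIVE_FLAGS = ("--docker-username", "--docker-password")


def redact_command(cmd: str, flags=SENSITIVE_FLAGS, mask: str = "***") -> str:
    """Return a copy of cmd with the values of the given --flag=value options masked."""
    for flag in flags:
        # Match "flag=" followed by a quoted value or a run of non-whitespace characters.
        pattern = re.escape(flag) + r"""=(?:"[^"]*"|'[^']*'|\S+)"""
        cmd = re.sub(pattern, f"{flag}={mask}", cmd)
    return cmd


if __name__ == "__main__":
    example = (
        "kubectl create secret docker-registry secret-demo "
        "--docker-server=registry.example.com "
        "--docker-username=alice --docker-password=hunter2 -n demo"
    )
    # Log the redacted copy; execute the original command unchanged.
    print("Create secret cmd: " + redact_command(example))
```

Deriving the logged string from the executed string, rather than rebuilding it by hand, keeps the two from drifting apart when the command changes.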

python/fedml/computing/scheduler/model_scheduler/device_model_deployment.py

    except Exception:
        logging.error("Failed to connect to the docker daemon, please ensure that you have "
                      "installed Docker Desktop or Docker Engine, and the docker is running")
        return "", "", None, None, None

+    # Pull the inference image
+    logging.info(f"Start pulling the inference image {inference_image_name}... with policy {image_pull_policy}")


To fix the problem, we should ensure that sensitive information such as passwords is not logged. We can either sanitize inference_image_name before logging it or avoid logging it altogether when it contains sensitive data. In this case, we sanitize inference_image_name before logging it.

Identify the logging statement that includes potentially sensitive information.

Sanitize the inference_image_name to ensure it does not contain sensitive data.

Update the logging statement to use the sanitized version of inference_image_name.
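
A minimal sketch of the sanitizing step, assuming the sensitive part of the reference (if any) sits in the registry or namespace prefix. The helper name `image_name_for_log` is hypothetical; it keeps only the final path component, which is the same idea as the `split('/')[-1]` approach in the suggested fix.

```python
def image_name_for_log(image_ref: str) -> str:
    """Return only the final path component of an image reference (name plus tag or digest)."""
    # rsplit with maxsplit=1 drops everything before the last "/" (registry host, namespace).
    return image_ref.rsplit("/", 1)[-1]


if __name__ == "__main__":
    print(image_name_for_log("registry.example.com/team/llm-serving:v1"))  # llm-serving:v1
    print(image_name_for_log("llm-serving:v1"))  # llm-serving:v1
```

The trade-off is that the log no longer shows which registry the image came from, so the full reference should still be passed unchanged to the pull call.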

python/fedml/computing/scheduler/model_scheduler/device_model_inference.py

+    except Exception as e:
+        inference_response = {"error": True, "message": f"{traceback.format_exc()}"}
+
+    return inference_response


To fix the problem, we need to ensure that stack traces are not exposed to end users. Instead, we should log the stack trace on the server and return a generic error message to the user. This can be achieved by modifying the exception handling code to log the stack trace and return a generic error message.

Modify the exception handling code in the predict, predict_openai, predict_with_end_point_id, and custom_inference functions to log the stack trace and return a generic error message.

Ensure that the logging configuration is set up to capture the stack trace information.
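
The pattern described above can be sketched as a wrapper around a request handler. This is an illustration under assumptions: `safe_call` and the logger name are hypothetical, and the response shape mirrors the `{"error": True, "message": ...}` dictionaries in the flagged code.

```python
import logging

logger = logging.getLogger("inference_gateway_sketch")  # hypothetical logger name

GENERIC_ERROR_MESSAGE = "An internal error has occurred!"


def safe_call(handler, *args, **kwargs):
    """Run handler; on failure, log the traceback server-side and return a generic error."""
    try:
        return handler(*args, **kwargs)
    except Exception:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception("Inference request failed")
        return {"error": True, "message": GENERIC_ERROR_MESSAGE}


if __name__ == "__main__":
    def failing_handler():
        raise ValueError("internal detail that must not reach the client")

    print(safe_call(failing_handler))
```

Using `logger.exception` inside the `except` block records the stack trace without calling `traceback.format_exc()` explicitly, and the caller only ever sees the generic message.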

gitguardian · 2024-12-17T11:17:58Z

⚠️ GitGuardian has uncovered 6 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 11782617 | Triggered | Generic High Entropy Secret | `0491bb7` | .github/workflows/registry-runners/Dockerfile |
| 5451874 | Triggered | Generic Password | `87ae30a` | python/fedml/computing/scheduler/model_scheduler/master_job_runner.py |
| 11782618 | Triggered | Generic High Entropy Secret | `a5bbcd2` | .github/workflows/registry-runners/windows.ps1 |
| 5692101 | Triggered | Generic High Entropy Secret | `a932082` | python/fedml/computing/scheduler/model_scheduler/device_model_deployment.py |
| 9453265 | Triggered | Generic High Entropy Secret | `87ae30a` | python/fedml/api/api_test.py |
| 8762943 | Triggered | Generic Password | `87ae30a` | python/fedml/computing/scheduler/scheduler_core/compute_cache_manager.py |

🛠 Guidelines to remediate hardcoded secrets

1. Understand the implications of revoking this secret by investigating where it is used in your code.
2. Replace and store your secrets safely, following best practices.
3. Revoke and rotate these secrets.
4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:

- following best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection as a pre-commit hook to catch secrets before they leave your machine and to ease remediation
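
As one illustration of storing secrets outside the source tree, the sketch below reads a credential from an environment variable at runtime instead of hardcoding it. The helper `require_secret` and the variable name `REGISTRY_PASSWORD` are assumptions for the example, not names used by FedML or GitGuardian.

```python
import os


def require_secret(name: str) -> str:
    """Return the value of environment variable `name`, failing loudly if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # Report only the variable name, never a secret value.
        raise RuntimeError(f"Missing required secret: environment variable {name} is not set")
    return value


if __name__ == "__main__":
    os.environ.setdefault("REGISTRY_PASSWORD", "example-only")  # demo value, assumption
    password = require_secret("REGISTRY_PASSWORD")
    print("Loaded registry password of length", len(password))
```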

fedml-alex and others added 30 commits May 22, 2024 02:19

[CoreEngine] make the server status work.

cd84d82

Merge pull request #2128 from FedML-AI/alexleung/dev_branch_latest

84d6156

[CoreEngine] make the server status work.

Update setup.py

f7ab709

Merge pull request #2126 from FedML-AI/alexleung/dev_v070_for_refactor

3fbcc2c

Alexleung/dev v070 for refactor

Remove Docker Client Timeout

649e42f

Merge pull request #2129 from FedML-AI/alaydshah-patch-docker

215efb0

Fix Docker Issues

[CoreEngine] change the edge status in the status center.

8d9c8ed

[CoreEngine] forward the stopping request to the cloud server.

1162f6c

Merge pull request #2130 from FedML-AI/alexleung/dev_branch_latest

2ad110b

Alexleung/dev branch latest

Merge pull request #2131 from FedML-AI/alexleung/dev_v070_for_refactor

ca03eda

[Deploy] Try to convert the gpu_topology value type to int.

92b7e16

[Deploy] Fix version diff function.

0a6eba9

Merge pull request #2133 from FedML-AI/raphael/fix-rolling-update

de480be

Merge pull request #2132 from FedML-AI/raphael/hot-fix-deploy

a24e962

[Deploy] Try to convert the gpu_topology value type to int.

[Deploy] Fix timezone issue using pandas

e8844d3

Merge pull request #2134 from FedML-AI/raphael/fix-timezone-autoscale

dcac1e2

[Deploy] Fix timezone issue when using pandas

[CoreEngine] In order to make the inference logs work, we save the co…

b58720c

…ntainer inference logs to the single dir.

Merge pull request #2135 from FedML-AI/alexleung/dev_branch_latest

f4c49c9

[CoreEngine] In order to make the inference logs work, we save the co…

Merge pull request #2136 from FedML-AI/alexleung/dev_v070_for_refactor

28e4af4

In order to make the inference logs work, we save the container inference logs to the single dir

Merge pull request #2137 from FedML-AI/dev/v0.7.0

38e4453

Dev/v0.7.0

Merge pull request #2138 from FedML-AI/alexleung/dev_v070_for_refactor

1377a0d

Alexleung/dev v070 for refactor

[Deploy] Avoid re-download the same model serving package.

7625075

Merge pull request #2139 from FedML-AI/raphael/fix-pkg-download

e70837e

[Deploy] Avoid re-download the same model serving package.

Add inference gateway logs

9d8b0df

Merge remote-tracking branch 'origin' into alaydshah/inference_gatewa…

c2fd5bd

…y_logging

Merge branch 'dev/v0.7.0' into alaydshah/inference_gateway_logging

94757eb

Make Inference Gateway Daemon Process

3fb45aa

Adding fail fast and timeout enforcement per request policies.

8595e0f

[Deploy] Fix config reading from redis.

9296884

Add global env file

b1312e1

fedml-dimitris and others added 27 commits October 16, 2024 22:27

Adding more docker client existence checkpoints.

f299a8e

Fixing grpc readme file.

3349667

Remove circular dependency.

d2484fa

Extending grpc support to also consider docker container ips.

a959802

Fixing notation and attribute names in grpc config files.

aa69122

testing with ingress ip.

c302749

Polishing grpc + docker examples.

292bfb3

Merge pull request #2229 from FedML-AI/dimitris/grpc_with_docker

c6d4daf

Enabling grpc to work with docker containers.

Parameterizing deploy host, port.

55ff447

Merge pull request #2230 from FedML-AI/inference_runner_custom_host_port

9fc5b4d

Parameterizing deploy host, port.

[Deploy] Edge Case Handling.

a108a8a

Merge pull request #2232 from FedML-AI/raphael/quick-fix-error-catch

98e084a

[Deploy] Edge Case Handling.

[fixbug]

698e95e

1.The work inference proxy port needs to be read from the configuration file 2.occupy_gpu_ids fail,get gpu ids from cache may return [] instead of None

Merge pull request #2233 from FedML-AI/charlie/dev/v0.7.0

56f6059

[fixbug]1.read inference proxy port from config file 2.occupy_gpu_ids fail,get gpu ids from cache may return [] instead of None

add log in get_available_gpu_ids[hardware_utils.py]

757e5f0

add logs

c5c22c8

add logs

ffa54a3

add logs

589dc47

add logs

40735d9

[bugfix]Enhance GPU management(need compare the readtime availabe gpu…

90c1191

… with init gpu ids because of the system gpu resource may change) and logging in hardware and job utilities

[debug]add logs

30e1f70

[bugfix] Enhance GPU cache management by setting initial available GP…

21a374c

…U IDs when reset available gpus

[bugfix]calculate the difference between realtime_available_gpu_ids a…

02b87f4

…nd initial_available_gpu_ids

[bugfix]set shm_size to 8G if not specified

9ff4a56

Merge pull request #2234 from FedML-AI/dev/v0.7.0

fee49a4

Merge pull request #2233 from FedML-AI/charlie/dev/v0.7.0

Revert "Merge pull request #2233 from FedML-AI/charlie/dev/v0.7.0"

d5831b9

Merge pull request #2235 from FedML-AI/revert-2234-dev/v0.7.0

9fa8499

Revert "Merge pull request #2233 from FedML-AI/charlie/dev/v0.7.0"

github-advanced-security bot found potential problems Dec 17, 2024

charlieyl closed this Dec 17, 2024

@@ -83,3 +83,19 @@
                     )
-                    logging.info("Create secret cmd: " + registry_secret_cmd)
+                    sanitized_registry_secret_cmd = (
+                        "kubectl create namespace fedml-devops-aggregator-"
+                        + self.version
+                        + ";kubectl -n fedml-devops-aggregator-"
+                        + self.version
+                        + " delete secret secret-"
+                        + self.cloud_server_name
+                        + " ;kubectl create secret docker-registry secret-"
+                        + self.cloud_server_name
+                        + " --docker-server="
+                        + self.agent_config["docker_config"]["registry_server"]
+                        + " --docker-username=***"
+                        + " --docker-password=***"
+                        + " [email protected] -n fedml-devops-aggregator-"
+                        + self.version
+                    )
+                    logging.info("Create secret cmd: " + sanitized_registry_secret_cmd)
                     os.system(registry_secret_cmd)

@@ -144,3 +144,4 @@
                 # Pull the inference image
-                logging.info(f"Start pulling the inference image {inference_image_name}... with policy {image_pull_policy}")
+                sanitized_inference_image_name = inference_image_name.split('/')[-1]  # Extract the image name without registry info
+                logging.info(f"Start pulling the inference image {sanitized_inference_image_name}... with policy {image_pull_policy}")
                 ContainerUtils.get_instance().pull_image_with_policy(image_pull_policy, inference_image_name)

@@ -117,3 +117,4 @@
                 except Exception as e:
-                    response = {"error": True, "message": f"{traceback.format_exc()}"}
+                    logging.error(traceback.format_exc())
+                    response = {"error": True, "message": "An internal error has occurred!"}
@@ -141,3 +142,4 @@
                 except Exception as e:
-                    response = {"error": True, "message": f"{traceback.format_exc()}, exception {e}"}
+                    logging.error(traceback.format_exc())
+                    response = {"error": True, "message": "An internal error has occurred!"}
@@ -167,3 +169,4 @@
                 except Exception as e:
-                    inference_response = {"error": True, "message": f"{traceback.format_exc()}"}
+                    logging.error(traceback.format_exc())
+                    inference_response = {"error": True, "message": "An internal error has occurred!"}
@@ -183,3 +186,4 @@
                 except Exception as e:
-                    inference_response = {"error": True, "message": f"{traceback.format_exc()}"}
+                    logging.error(traceback.format_exc())
+                    inference_response = {"error": True, "message": "An internal error has occurred!"}
