Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates #613

dmsuehir · 2024-08-16T16:50:32Z

Description

This PR has a few updates based on issues that I ran into when deploying the CodeGen example on a cluster for Xeon and Gaudi. The following issues are addressed in the PR:

I added a note about potentially using a persistent volume claim instead of having to create the /mnt/opea-models directory on the nodes.
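A minimal sketch of what the persistent volume claim option could look like. The claim name, storage size, and access mode are illustrative assumptions, not values taken from this PR, and the cluster is assumed to have a default StorageClass:

```yaml
# Hypothetical PVC for the model cache; name, size, and access mode are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: opea-models-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce            # ReadWriteMany may be needed if pods on several nodes share the cache
  resources:
    requests:
      storage: 50Gi            # placeholder size, adjust to the models being served
```

The pod spec would then reference the claim instead of a host directory. The volume name below is an assumption about how the manifest names its model volume:

```yaml
# Fragment of a pod spec (not a standalone object).
volumes:
  - name: model-volume         # assumed volume name
    persistentVolumeClaim:
      claimName: opea-models-pvc
```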

Deploying the codegen.yaml files gave an error like:

error: error validating "codegen.yaml": error validating data: [unknown object type "nil" in ConfigMap.data.http_proxy, unknown object type "nil" in ConfigMap.data.https_proxy, unknown object type "nil" in ConfigMap.data.no_proxy]; if you choose to ignore these errors, turn validation off with --validate=false

This error occurs because the ConfigMap in the yaml has a few env vars that are left empty (nil). Changing them to empty quotes "" fixes the issue. [EDIT: this was resolved in PR 630]
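For illustration, a ConfigMap whose proxy keys pass validation might look like the sketch below. A YAML key with no value parses as null, while ConfigMap data values must be strings, so each key gets an explicit empty string. The ConfigMap name is hypothetical; only the three key names come from the error message above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: codegen-proxy-example  # hypothetical name
data:
  # Writing `http_proxy:` with nothing after it yields null and fails validation.
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""
```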

I added a note that the service takes a couple of minutes to start, along with how to check the logs. I ran into an issue where the curl command failed with "curl: (18) transfer closed with outstanding read data remaining", and it was just because the service wasn't ready yet. Knowing how to check the logs is also useful for watching the status and for figuring out whether the curl command is failing because of an actual error.

Running on Gaudi wasn't working for me ("RuntimeError: synStatus=26 [Generic failure] Device acquire failed.") until I added hugepages-2Mi and memory to the resource limits. The Habana documentation for Kubernetes shows hugepages-2Mi and memory in the resources, so that seems to be the recommended config.
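A sketch of the resource limits described above, as a fragment of a container spec. The quantities are placeholders chosen for illustration, not the values used in this PR. Kubernetes requires huge page requests to equal the limits, which is the default when only limits are given:

```yaml
# Fragment of a container spec (not a standalone object); quantities are placeholders.
resources:
  limits:
    habana.ai/gaudi: 1         # one Gaudi device, exposed by the Habana device plugin
    hugepages-2Mi: 500Mi       # placeholder amount of 2Mi huge pages
    memory: 32Gi               # placeholder memory limit
```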

Issues

N/A

Type of change


Bug fix (non-breaking change which fixes an issue)
Others (enhancement, documentation, validation, etc.)

Dependencies

N/A

Tests

Manually tested the changes on a Kubernetes cluster with Xeon and Gaudi 2 nodes.


CodeGen/kubernetes/manifests/gaudi/codegen.yaml

CodeGen/kubernetes/manifests/xeon/codegen.yaml


daisy-ycguo

lgtm


lianhao · 2024-08-23T05:19:53Z

I have no objection to merging this. I'm just curious about how to get outdated conversations resolved on GitHub PR pages.

…c updates (#613) * Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates Signed-off-by: dmsuehir <[email protected]> (cherry picked from commit c25063f)

…c updates (opea-project#613) * Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates Signed-off-by: dmsuehir <[email protected]>

* update upload_training_files format Signed-off-by: Yue, Wenjiao <[email protected]>

Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and do…

a1798b5

…c updates Signed-off-by: dmsuehir <[email protected]>

dmsuehir requested a review from lvliang-intel as a code owner August 16, 2024 16:50

dmsuehir mentioned this pull request Aug 16, 2024

Update CodeGen Xeon and Gaudi Kubernetes codegen.yaml and docs #595

Closed

2 tasks

ashahba approved these changes Aug 16, 2024


chensuyue requested a review from lianhao August 19, 2024 12:37

chensuyue added this to the v0.9 milestone Aug 19, 2024

Merge branch 'main' of github.com:dmsuehir/GenAIExamples into dina/co…

5f13865

…degen_k8s_manifest_fixes

lianhao reviewed Aug 20, 2024


dmsuehir added 2 commits August 20, 2024 08:30

Merge branch 'main' of github.com:dmsuehir/GenAIExamples into dina/co…

8a7a89c

…degen_k8s_manifest_fixes Signed-off-by: dmsuehir <[email protected]>

Reduce hugepages

3833e52

Signed-off-by: dmsuehir <[email protected]>

daisy-ycguo approved these changes Aug 22, 2024


daisy-ycguo and others added 3 commits August 22, 2024 15:06

Merge branch 'main' into dina/codegen_k8s_manifest_fixes

7a9b174

Merge branch 'main' of github.com:dmsuehir/GenAIExamples into dina/co…

36b83ec

…degen_k8s_manifest_fixes

Merge branch 'dina/codegen_k8s_manifest_fixes' of github.com:dmsuehir…

3eea9cb

…/GenAIExamples into dina/codegen_k8s_manifest_fixes

daisy-ycguo merged commit c25063f into opea-project:main Aug 23, 2024
10 checks passed

wangkl2 pushed a commit to wangkl2/GenAIExamples that referenced this pull request Dec 11, 2024

update upload_training_files format (opea-project#613)

3367b76

* update upload_training_files format Signed-off-by: Yue, Wenjiao <[email protected]>
