Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates #613

Merged

dmsuehir (Contributor) commented Aug 16, 2024

Description

This PR contains a few updates based on issues that I ran into when deploying the CodeGen example on a cluster for Xeon and Gaudi. It addresses the following issues:

  • I added a note about the option of using a persistent volume claim instead of creating the /mnt/opea-models directory on each node.
  • Deploying the codegen.yaml files gave an error like:
    error: error validating "codegen.yaml": error validating data: [unknown object type "nil" in ConfigMap.data.http_proxy, unknown object type "nil" in ConfigMap.data.https_proxy, unknown object type "nil" in ConfigMap.data.no_proxy]; if you choose to ignore these errors, turn validation off with --validate=false
    
    This error occurs because the ConfigMap in the YAML has a few env vars that are empty (nil). Setting them to empty quotes ("") fixes the issue. [EDIT: this was resolved in PR 630]
  • I added a note that the service takes a couple of minutes to start, along with how to check the logs. I hit a case where the curl command failed with "curl: (18) transfer closed with outstanding read data remaining" simply because the service wasn't ready yet. Knowing how to check the logs is also useful for watching the status and for telling whether a curl failure is caused by an actual error.
  • Running on Gaudi didn't work for me ("RuntimeError: synStatus=26 [Generic failure] Device acquire failed.") until I added hugepages-2Mi and memory to the resource limits. The Habana documentation for Kubernetes shows hugepages-2Mi and memory in the resources, so that appears to be the recommended config.
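
For readers who want to see the shape of these changes, here is a minimal sketch of the three manifest-level points above: empty-string ConfigMap values, a persistent volume claim in place of a host directory, and hugepages-2Mi plus memory in the Gaudi resource limits. It is not the diff from this PR. All object names, the image, the mount path, and every size are placeholders chosen for illustration; only the field names and the `habana.ai/gaudi` resource name are standard.

```yaml
# Illustrative fragment only. Names, image, mount path, and sizes are
# assumptions for this sketch and are not taken from the PR's codegen.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: codegen-example-config        # hypothetical name
data:
  # A bare "http_proxy:" parses as null and fails kubectl validation.
  # An explicit empty string is a valid ConfigMap value.
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-example           # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                   # placeholder size
---
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-example                 # hypothetical name
spec:
  containers:
    - name: serving
      image: example.invalid/serving:placeholder   # placeholder image
      envFrom:
        - configMapRef:
            name: codegen-example-config
      resources:
        limits:
          habana.ai/gaudi: 1          # resource exposed by the Habana device plugin
          hugepages-2Mi: 500Mi        # placeholder; size for your workload
          memory: 32Gi                # placeholder; size for your workload
      volumeMounts:
        - name: model-volume
          mountPath: /data            # placeholder mount path
  volumes:
    - name: model-volume
      persistentVolumeClaim:          # used instead of a hostPath such as /mnt/opea-models
        claimName: model-cache-example
```

After applying a manifest like this, `kubectl get pods` and `kubectl logs <pod-name>` show whether the service has finished starting before you retry the curl request.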

Issues

N/A

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

N/A

Tests

Manually tested the changes on a Kubernetes cluster with Xeon and Gaudi 2 nodes.

@chensuyue chensuyue requested a review from lianhao August 19, 2024 12:37
@chensuyue chensuyue added this to the v0.9 milestone Aug 19, 2024
Review threads (outdated, resolved):
  • CodeGen/kubernetes/manifests/gaudi/codegen.yaml: 3 threads
  • CodeGen/kubernetes/manifests/xeon/codegen.yaml: 2 threads

daisy-ycguo (Collaborator) left a comment

lgtm

lianhao (Collaborator) commented Aug 23, 2024

I have no objection to merging this. I'm just curious how to get outdated conversations resolved on GitHub PR pages.

@daisy-ycguo daisy-ycguo merged commit c25063f into opea-project:main Aug 23, 2024
10 checks passed
chensuyue pushed a commit that referenced this pull request Aug 23, 2024
…c updates (#613)

* Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates

Signed-off-by: dmsuehir <[email protected]>
(cherry picked from commit c25063f)
dmsuehir added a commit to dmsuehir/GenAIExamples that referenced this pull request Sep 11, 2024
…c updates (opea-project#613)

* Minor fixes for CodeGen Xeon and Gaudi Kubernetes codegen.yaml and doc updates

Signed-off-by: dmsuehir <[email protected]>
wangkl2 pushed a commit to wangkl2/GenAIExamples that referenced this pull request Dec 11, 2024
* update upload_training_files format

Signed-off-by: Yue, Wenjiao <[email protected]>