Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: initialize genai shared_fs permissions to agent group in helm deployment #9065

Merged
merged 2 commits into from
Mar 29, 2024

Conversation

tayritenour
Copy link
Contributor

@tayritenour tayritenour commented Mar 27, 2024

Description

Our helm deployments for GenAI have been not working for environments that use specific agent group IDs.

This change makes sure that we give the user a way to define the agent group ID for the shared drive, and initialize it to both use the group and allow all permissions underneath it to propagate this user with the setgid flag.

Test Plan

Add the following to your values.yaml

## Configure GenAI Deployment
genai:
  ## Version of GenAI to use. If unset, GenAI will not be deployed
  version: "0.1.3"
  
  ## Port for GenAI to backend use
  port: 9011
  
  ## Port for GenAI message queue
  messageQueuePort: 9013
  
  ## Secret to pull the GenAI image
  # imagePullSecretName:

  ## GenAI pod memory request
  memRequest: 1Gi

  ## GenAI pod cpu request
  cpuRequest: 100m

  ## GenAI pod memory limit
  # memLimit: 1Gi

  ## GenAI pod cpu limit
  # cpuLimit: 2

  ## PVC Name for the shared file system for GenAI.
  ## Note: Either `sharedPVCName` or `generatedPVC.storageSize` (to
  ## generate a new PVC) is required for GenAI deployment
  # sharedPVCName:

  ## Spec for the generated PVC for GenAI
  ## Note: In order to generate a shared PVC, you will need access to a
  ## StorageClass that can provide a ReadWriteMany volume
  generatedPVC:
    ## Storage class name for the generated PVC
    storageClassName: standard

    ## Size of the generated PVC
    storageSize: 10Gi

  ## Unix Agent Group ID for the Shared Filesystem.
  ## Note: All users that work with GenAI need to have this assigned as their
  ## Agent Group ID in the User Admin settings
  agentGroupID: 1104

  ## Unix Agent Group Name for the Shared Filesystem.
  agentGroupName: determined

  ## Whether or not we should attempt to run a Helm hook to initialize
  ## the shared filesystem to use the agentGroupID as its group.
  ## This must be turned off on clusters that disable pods that can run as root.
  shouldInitializeSharedFSGroupPermissions: true

  ## Extra Resource Pool Metadata is hardcoded information about the
  ## GPUs available to the resource pools. This information
  ## is not provided in k8s so we provide it directly.
  ## Note: All resource pools defined here need to also be reflected in
  ## the .Values.resourcePools.
  extraResourcePoolMetadata: {}
  • Deploy: helm install --generate-name . --set maxSlotsPerPod=1 --debug --dry-run

  • It should contain a apiVersion: batch/v1 job that starts with the name genai-initialize-shared-fs-permissions

  • If you toggle shouldInitializeSharedFSGroupPermissions to false, it will not be present in the dry-run output

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Mar 27, 2024
Copy link

netlify bot commented Mar 27, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit e94d425
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/660745e77f6bbf000850e454

Copy link

codecov bot commented Mar 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.73%. Comparing base (133d127) to head (e94d425).
Report is 21 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9065   +/-   ##
=======================================
  Coverage   47.72%   47.73%           
=======================================
  Files        1167     1167           
  Lines      143898   143898           
  Branches     2379     2379           
=======================================
+ Hits        68682    68689    +7     
+ Misses      75057    75050    -7     
  Partials      159      159           
Flag Coverage Δ
backend 42.97% <ø> (+<0.01%) ⬆️
harness 63.85% <ø> (+<0.01%) ⬆️
web 40.74% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 6 files with indirect coverage changes

@tayritenour
Copy link
Contributor Author

We have a documentation page for the user to do this themselves if they have restrictions on root pods: https://hpe-ai-solutions-documentation.netlify.app/products/gen-ai/latest/admin/set-up/deployment-guides/kubernetes/enforce-shared-user-agent-ids/

And they can opt in to this feature with the .Values.genai.shouldInitializeSharedFSGroupPermissions flag

@tayritenour tayritenour merged commit b726cf9 into main Mar 29, 2024
73 of 85 checks passed
@tayritenour tayritenour deleted the GAS-690 branch March 29, 2024 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants