Test CKF 1.9 in airgapped environment #915

Closed
NohaIhab opened this issue May 30, 2024 · 4 comments
Labels
enhancement New feature or request Kubeflow 1.9

Comments

@NohaIhab
Contributor

Context

We need to test CKF 1.9 in an airgapped environment to make sure it works correctly and that we can configure all the needed images.

What needs to get done

  1. Set up an airgapped environment
  2. Deploy CKF 1.9 in the airgapped environment using the script created in #914 (Write a script for deploying CKF 1.9 in airgapped)
  3. Test CKF 1.9 using the test plan defined in #898 (Define a test plan for airgapped deployment of CKF)

Definition of Done

CKF 1.9 is tested in an airgapped environment.

@NohaIhab NohaIhab added the enhancement New feature or request label May 30, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5759.

This message was autogenerated

@mvlassis
Contributor

mvlassis commented Aug 26, 2024

[Outdated, see the comment below for the test results after the scripts have been updated]

I tested the following for CKF 1.9:

  • Katib: The tests are successful.
  • Knative: I get the following output:
NAME    URL                                      LATESTCREATED   LATESTREADY   READY     REASON
hello   http://hello.admin.10.64.140.43.nip.io   hello-00001     hello-00001   Unknown   IngressNotConfigured

Running curl -L http://helloworld.admin.10.64.140.43.nip.io returns a Dex "Log in" page instead of the service's response:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <title>dex</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link href="../../static/main.css" rel="stylesheet">
    <link href="../../theme/styles.css" rel="stylesheet">
    <link rel="icon" href="../../theme/favicon.png">
  </head>

  <body class="theme-body">
    <div class="theme-navbar">
      <div class="theme-navbar__logo-wrap">
        <img class="theme-navbar__logo" src="../../theme/logo.png">
      </div>
    </div>

    <div class="dex-container">


<div class="theme-panel">
  <h2 class="theme-heading">Log in to Your Account</h2>
  <form method="post" action="/dex/auth/local/login?back=&amp;state=ttpv7fksgayin3o4vtsodgef5">
    <div class="theme-form-row">
      <div class="theme-form-label">
        <label for="userid">Email Address</label>
      </div>
	 <input tabindex="1" required id="login" name="login" type="text" class="theme-form-input" placeholder="email address"  autofocus />
    </div>
    <div class="theme-form-row">
      <div class="theme-form-label">
        <label for="password">Password</label>
      </div>
	 <input tabindex="2" required id="password" name="password" type="password" class="theme-form-input" placeholder="password" />
    </div>

    

    <button tabindex="3" id="submit-login" type="submit" class="dex-btn theme-btn--primary">Login</button>

  </form>
  
</div>

    </div>
  </body>
</html>
  • Pipelines: The pod tutorial-data-passing-mb7vs-system-container-impl goes into the Error status after a while. These are the logs from the pod:
time="2024-08-26T15:38:35.441Z" level=info msg="capturing logs" argo=true
time="2024-08-26T15:38:35.506Z" level=info msg="capturing logs" argo=true
I0826 15:38:35.732552      29 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "message": {
        "parameterType": "STRING"
      }
    }
  },
  "outputDefinitions": {
    "artifacts": {
      "output_dataset_one": {
        "artifactType": {
          "schemaTitle": "system.Dataset",
          "schemaVersion": "0.0.1"
        }
      },
      "output_dataset_two_path": {
        "artifactType": {
          "schemaTitle": "system.Dataset",
          "schemaVersion": "0.0.1"
        }
      }
    },
    "parameters": {
      "output_bool_parameter_path": {
        "parameterType": "BOOLEAN"
      },
      "output_dict_parameter_path": {
        "parameterType": "STRUCT"
      },
      "output_list_parameter_path": {
        "parameterType": "LIST"
      },
      "output_parameter_path": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-preprocess"
}
I0826 15:38:35.740573      29 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0826 15:38:35.740896      29 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x77141052ae90>: Failed to establish a new connection: [Errno 101] Network unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x771412ae9910>: Failed to establish a new connection: [Errno 101] Network unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x771410533a10>: Failed to establish a new connection: [Errno 101] Network unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x771410538490>: Failed to establish a new connection: [Errno 101] Network unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x771410538f90>: Failed to establish a new connection: [Errno 101] Network unreachable')': /simple/kfp/
ERROR: Could not find a version that satisfies the requirement kfp==2.7.0 (from versions: none)
ERROR: No matching distribution found for kfp==2.7.0
I0826 15:39:57.893041      29 launcher_v2.go:151] publish success.
F0826 15:39:57.894106      29 main.go:49] failed to execute component: exit status 1
time="2024-08-26T15:39:58.551Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
time="2024-08-26T15:39:59.503Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1

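The issue seems to be the version of kfp in the image: the component tries to install kfp==2.7.0 with pip at runtime, and this fails because PyPI is unreachable from the airgapped environment ("Network unreachable").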

  • Training: Creating the TFJob from ./tfjob-simple.yaml fails because the training-operator validation webhook refuses the connection:
Error from server (InternalError): error when creating "./tfjob-simple.yaml": Internal error occurred: failed calling webhook "validator.tfjob.training-operator.kubeflow.org": failed to call webhook: Post "https://training-operator-workload.kubeflow.svc:443/validate-kubeflow-org-v1-tfjob?timeout=10s": dial tcp 10.152.183.115:443: connect: connection refused
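The Pipelines failure above comes from pip trying to download kfp==2.7.0 inside the airgapped cluster. One way to vet a candidate test image beforehand is to check, inside the image, whether that exact kfp version is already installed, since pip normally does not download a requirement that is already satisfied. A minimal Python sketch, assuming Python 3.8+ (for importlib.metadata) in the image; the helper names are hypothetical and not part of KFP or CKF:

```python
"""Check whether a pinned distribution is already installed in the current
Python environment. Hypothetical helper: run it inside the component image."""
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of `dist_name`, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pinned_version_installed(dist_name: str, pinned: str) -> bool:
    """True if `dist_name` is installed with exactly the `pinned` version.

    The version strings are compared literally, which is enough for a simple
    pin such as "2.7.0".
    """
    return installed_version(dist_name) == pinned


if __name__ == "__main__":
    found = installed_version("kfp")
    if pinned_version_installed("kfp", "2.7.0"):
        print("kfp==2.7.0 is preinstalled; pip should not need to download it")
    else:
        # Without the pinned version, pip has to fetch it (from PyPI by default),
        # which is exactly what fails in the airgapped environment.
        print(f"kfp==2.7.0 is not installed (found: {found})")
```

Running this with the image's Python interpreter gives a quick yes/no before uploading the pipeline to the airgapped cluster.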

@NohaIhab
Contributor Author

NohaIhab commented Sep 2, 2024

Hey @mvlassis, we need to update the test images for CKF 1.9 before we can start testing. Can you create a task for that?

@mvlassis
Contributor

mvlassis commented Sep 9, 2024

Running the tests with the updated script and the new images from #1053, I got the following:

  • Katib: The simple-pbt test is successful.
  • Knative: After applying helloworld.yaml, we get the following Knative service:
NAME    URL                                      LATESTCREATED   LATESTREADY   READY   REASON
hello   http://hello.admin.10.64.140.43.nip.io   hello-00001     hello-00001   True 

Running curl -L http://hello.admin.10.64.140.43.nip.io returns "Hello World!", so the Knative test is successful.

  • Pipelines: Uploading and running the tutorial-data-passing pipeline succeeds.
  • Training: Applying tfjob-simple.yaml creates a TFJob that runs and succeeds, as shown by microk8s kubectl get tfjob -n admin:
NAME           STATE       AGE
tfjob-simple   Succeeded   3m16s

The test is successful.

4/4 of the tests are successful.
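The Knative and Training checks above come down to reading a status column from kubectl get output: READY should be True for the Knative service (the first run showed Unknown with IngressNotConfigured), and STATE should be Succeeded for the TFJob. A minimal Python sketch for scripting those checks, assuming the default column-aligned kubectl output; parse_kubectl_table and all_rows_have are hypothetical helpers, not part of CKF or kubectl:

```python
"""Parse the column-aligned tables printed by `kubectl get` (for example
`kubectl get ksvc` or `kubectl get tfjob`) and check a status column.
Hypothetical helpers for automating the checks; not part of CKF's tests."""
import re
from typing import Dict, List, Optional


def parse_kubectl_table(text: str) -> List[Dict[str, str]]:
    """Split each row at the start offsets of the header's column names.

    Slicing by offsets (instead of splitting on whitespace) keeps empty
    cells, such as an empty REASON column, attached to the right header.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Column names are the runs of non-space characters in the header line.
    matches = list(re.finditer(r"\S+", lines[0]))
    names = [m.group() for m in matches]
    starts = [m.start() for m in matches]
    ends: List[Optional[int]] = [*starts[1:], None]  # last column runs to EOL
    return [
        {name: line[start:end].strip() for name, start, end in zip(names, starts, ends)}
        for line in lines[1:]
    ]


def all_rows_have(rows: List[Dict[str, str]], column: str, expected: str) -> bool:
    """True if there is at least one row and every row has `column` == `expected`."""
    return bool(rows) and all(row.get(column) == expected for row in rows)


if __name__ == "__main__":
    tfjobs = parse_kubectl_table(
        "NAME           STATE       AGE\n"
        "tfjob-simple   Succeeded   3m16s\n"
    )
    print(all_rows_have(tfjobs, "STATE", "Succeeded"))  # prints True
```

The same call with column "READY" and expected "True" covers the Knative service table.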

@mvlassis mvlassis closed this as completed Sep 9, 2024