This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

kata: introduce kata container support #3465

Merged
merged 1 commit into from
Jul 13, 2018

Conversation

egernst
Contributor

@egernst egernst commented Jul 11, 2018

Signed-off-by: Eric Ernst [email protected]

What this PR does / why we need it:

PR adds support for Kata Containers.

Which issue this PR fixes:

fixes #3463

Special notes for your reviewer:

Should look very familiar to initial Clear Container support PRs.

If applicable:

  • documentation
  • unit tests

Release note:

@msftclas

msftclas commented Jul 11, 2018

CLA assistant check
All CLA requirements met.

@egernst
Contributor Author

egernst commented Jul 11, 2018

testing still wip

@codecov

codecov bot commented Jul 11, 2018

Codecov Report

Merging #3465 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3465      +/-   ##
==========================================
- Coverage   55.83%   55.83%   -0.01%     
==========================================
  Files         105      105              
  Lines       15885    15884       -1     
==========================================
- Hits         8870     8869       -1     
  Misses       6270     6270              
  Partials      745      745

@egernst
Contributor Author

egernst commented Jul 12, 2018

To test this PR, I built my branch of acs-engine locally and then tried the basic Kata Containers test I introduced. Unfortunately it failed with an error that isn't very clear -- perhaps just a timeout? If anyone has pointers, I'd appreciate it.

Details:

eernst-mac02:acs-engine eernst$ ./bin/acs-engine deploy --subscription-id xxxxx --dns-prefix kata-test --location westus2 --api-model kubernetes-kata-containers.json

WARN[0003] apimodel: missing masterProfile.dnsPrefix will use "kata-test"
WARN[0003] --resource-group was not specified. Using the DNS prefix from the apimodel as the resource group name: kata-test
WARN[0006] apimodel: ServicePrincipalProfile was missing or empty, creating application...
WARN[0008] created application with applicationID (ef542ac5-28c0-45c0-bbf4-f8eefec6a701) and servicePrincipalObjectID (2d64c93b-de51-4801-ba8b-9ab2bc3e5c55).
WARN[0008] apimodel: ServicePrincipalProfile was empty, assigning role to application...
INFO[0045] Starting ARM Deployment (kata-test-1196995142). This will take some time...
INFO[0259] Finished ARM Deployment (kata-test-1196995142). Error: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details."
ERRO[0259] {"status":"Failed","error":{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.","details":[{"code":"Conflict","message":"{
 \"status\": \"Failed\",
 \"error\": {
   \"code\": \"ResourceDeploymentFailure\",
   \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
   \"details\": [
     {
       \"code\": \"VMExtensionProvisioningError\",
       \"message\": \"VM has reported a failure when processing extension 'cse-agent-0'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=2\\n[stdout]\\n\\n[stderr]\\n\\\".\"
     }
   ]
 }
}"},{"code":"Conflict","message":"{
 \"status\": \"Failed\",
 \"error\": {
   \"code\": \"ResourceDeploymentFailure\",
   \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
   \"details\": [
     {
       \"code\": \"VMExtensionProvisioningError\",
       \"message\": \"VM has reported a failure when processing extension 'cse-agent-2'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=2\\n[stdout]\\n\\n[stderr]\\n\\\".\"
     }
   ]
 }
}"},{"code":"Conflict","message":"{
 \"status\": \"Failed\",
 \"error\": {
   \"code\": \"ResourceDeploymentFailure\",
   \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
   \"details\": [
     {
       \"code\": \"VMExtensionProvisioningError\",
       \"message\": \"VM has reported a failure when processing extension 'cse-agent-1'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=2\\n[stdout]\\n\\n[stderr]\\n\\\".\"
     }
   ]
 }
}"},{"code":"Conflict","message":"{
 \"status\": \"Failed\",
 \"error\": {
   \"code\": \"ResourceDeploymentFailure\",
   \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
   \"details\": [
     {
       \"code\": \"VMExtensionProvisioningError\",
       \"message\": \"VM has reported a failure when processing extension 'cse-master-0'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=2\\n[stdout]\\n\\n[stderr]\\n\\\".\"
     }
   ]
 }
}"}]}}
FATA[0259] resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details."

@egernst
Contributor Author

egernst commented Jul 12, 2018

@scooley, @jessfraz - let me know if this failure is familiar, or if you have tips on where to look for more details. Thx!

@adelina-t
Contributor

adelina-t commented Jul 12, 2018

@egernst I ran into the same issue while running Windows k8s CI jobs on this PR. The k8s script that provisions the master node failed. You can see the logs on the master node in /var/log/azure/cluster-provision.log. You forgot a "||" operator at L367 and L373 (left a comment inline as well) :) That's the reason for "VM has reported a failure when processing extension 'cse-master-0'".

@@ -344,13 +364,13 @@ function installContainerd() {
sed -i '/\[Service\]/a ExecStartPost=\/sbin\/iptables -P FORWARD ACCEPT' /etc/systemd/system/containerd.service

echo "Successfully installed cri-containerd..."
-if [[ "$CONTAINER_RUNTIME" == "clear-containers" ]] || [[ "$CONTAINER_RUNTIME" == "containerd" ]]; then
+if [[ "$CONTAINER_RUNTIME" == "clear-containers" ]] [[ "$CONTAINER_RUNTIME" == "kata-containers" ]] || [[ "$CONTAINER_RUNTIME" == "containerd" ]]; then
Contributor

Forgot "||" operator here :)

setupContainerd
fi
}

function ensureContainerd() {
-if [[ "$CONTAINER_RUNTIME" == "clear-containers" ]] || [[ "$CONTAINER_RUNTIME" == "containerd" ]]; then
+if [[ "$CONTAINER_RUNTIME" == "clear-containers" ]] [[ "$CONTAINER_RUNTIME" == "kata-containers" ]] || [[ "$CONTAINER_RUNTIME" == "containerd" ]]; then
Contributor

and here :) that's why you got cse-master-0 error.

Contributor Author

Oh, that's embarrassing -- thanks for the help.
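
For reference, a minimal sketch of the corrected three-way condition pulled into a helper, plus a parse-only check that would likely have flagged the dropped operator before deployment. The function name uses_containerd and the shortened broken sample are illustrative assumptions, not code from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeeds when the given runtime
# is one of the three values the quoted condition is meant to accept.
uses_containerd() {
    local runtime="$1"
    [[ "$runtime" == "clear-containers" ]] ||
        [[ "$runtime" == "kata-containers" ]] ||
        [[ "$runtime" == "containerd" ]]
}

# Shortened stand-in for the broken line: two [[ ]] tests with no operator between them.
broken='if [[ "$r" == "a" ]] [[ "$r" == "b" ]] || [[ "$r" == "c" ]]; then :; fi'

# bash -n parses without executing, so a missing "||" shows up as a parse failure.
if ! bash -n -c "$broken" 2>/dev/null; then
    echo "syntax error detected"
fi
```

Running bash -n against the provisioning script itself before pushing is one cheap way to catch this class of mistake locally.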

@egernst
Contributor Author

egernst commented Jul 12, 2018

@adelina-t - sweet, thanks for teaching me to fish - the /var/log/azure/cluster-provision.log is very helpful; that's what I was missing.
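
A small sketch of how that log could be inspected once logged in to a node. The helper name show_provision_log and the default of 50 lines are assumptions for illustration; only the log path comes from the comment above, and reading it on a real node may require sudo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last N lines of the provisioning log.
# The default path is the one quoted in this thread.
show_provision_log() {
    local log_file="${1:-/var/log/azure/cluster-provision.log}"
    local lines="${2:-50}"
    if [[ ! -r "$log_file" ]]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    tail -n "$lines" "$log_file"
}
```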

@egernst
Contributor Author

egernst commented Jul 12, 2018

@adelina-t -- repushed with these fixed.

I verified I could create a cluster using the added example, kubernetes-kata-containers.json

You mentioned a comment inline? Can you clarify.

@adelina-t
Contributor

@egernst By "comment inline" I was referring to the comments I left on the commit. Probably not the clearest choice of words on my part :)

@@ -0,0 +1,40 @@
{
"apiVersion": "vlabs",
Contributor

I don't think we need the example JSON in both places. We have a //todo item to move the feature examples in e2e-tests/, so you can probably keep just the other one.

Contributor Author

ACK.

echo "Adding Kata Containers repository key..."
KATA_RELEASE_KEY_TMP=/tmp/kata-containers-release.key
KATA_URL=http://download.opensuse.org/repositories/home:/katacontainers:/release/xUbuntu_16.04/Release.key
retrycmd_if_failure_no_stats 20 1 5 curl -fsSL $KATA_URL > $KATA_RELEASE_KEY_TMP || exit $ERR_APT_INSTALL_TIMEOUT
Contributor

let's add a new exit code specific to Kata install here

Contributor

You can see installDocker() for a model

Contributor Author

ACK.


# Install Kata Containers runtime
echo "Installing Kata Containers runtime..."
apt_get_update
Contributor

this should be apt_get_update || exit $ERR_APT_UPDATE_TIMEOUT

Contributor Author

ack

# Install Kata Containers runtime
echo "Installing Kata Containers runtime..."
apt_get_update
apt_get_install 20 30 120 kata-runtime
Contributor

and apt_get_install 20 30 120 kata-runtime || exit $ERR_KATA_INSTALL_TIMEOUT here, ERR_KATA_INSTALL_TIMEOUT needs to be defined at the top of the file

Contributor Author

ack
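
Put together, the review suggestions in this thread amount to the pattern sketched below. This is an illustration under stated assumptions, not the merged code: the numeric exit-code values are placeholders, apt_get_update and apt_get_install are stubbed so the sketch runs anywhere (the real helpers are defined in the provisioning script), and the FAKE_UPDATE_RC / FAKE_INSTALL_RC variables exist only to simulate failures.

```shell
#!/usr/bin/env bash
# Placeholder values; the real codes are defined at the top of the provisioning script.
ERR_APT_UPDATE_TIMEOUT=99
ERR_KATA_INSTALL_TIMEOUT=61

# Stand-ins for the script's helpers. They return whatever the (hypothetical)
# FAKE_*_RC variables say, defaulting to success.
apt_get_update() { return "${FAKE_UPDATE_RC:-0}"; }
apt_get_install() { return "${FAKE_INSTALL_RC:-0}"; }

install_kata_runtime_sketch() {
    echo "Installing Kata Containers runtime..."
    # Each step exits with its own code so a failed deployment points at the step that broke.
    apt_get_update || exit $ERR_APT_UPDATE_TIMEOUT
    apt_get_install 20 30 120 kata-runtime || exit $ERR_KATA_INSTALL_TIMEOUT
}
```

Because the function calls exit, it is safest to try it in a subshell, for example ( FAKE_INSTALL_RC=1; install_kata_runtime_sketch ); echo $?, which reports the install-specific code instead of a generic failure.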

@@ -281,6 +281,24 @@ function configNetworkPlugin() {
fi
}

function installKataContainersRuntime() {
Contributor

I realize that my comments in this function also apply to installClearContainersRuntime(), which is probably why you did it that way but I think installClearContainersRuntime needs to be changed as well to have proper exit codes.

Contributor Author

No worries. Okay to update the example function, installClearContainersRuntime(), in a follow-on PR?

Contributor

Of course!


if [[ "$CONTAINER_RUNTIME" == "kata-containers" ]]; then
# Ensure we can nest virtualization
if grep -q vmx /proc/cpuinfo; then
Contributor

what happens if this is false?

Contributor Author

in that case, the given node would not install kata container artifacts

Contributor

Would the node be functional? Does that mean there would be no container runtime installed?

Contributor Author

@egernst egernst Jul 12, 2018

It'll be fully functional, though the user may expect to be using Kata while in reality using runc.
When Kata is installed, the operator deploying workloads has the option of using either runc or kata-runtime. If VMX isn't supported on the node, any workloads targeting kata-runtime would be handled by the default runtime, runc.

Contributor

Sounds good, thanks for clarifying
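
The fallback described in this thread can be sketched as a small check. The function name runtime_for_kata_workload is hypothetical, and the cpuinfo path is a parameter only so the check can be exercised off-node; the grep for vmx mirrors the condition in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which runtime would end up handling a workload
# that targets Kata, based on whether the CPU flags advertise VMX.
runtime_for_kata_workload() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    if grep -q vmx "$cpuinfo" 2>/dev/null; then
        echo "kata-runtime"
    else
        # No nested virtualization: Kata artifacts are not installed and runc handles the workload.
        echo "runc"
    fi
}
```

Logging the result during provisioning would make the silent fallback visible to whoever deploys the cluster.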

@@ -1083,6 +1083,18 @@ func Test_Properties_ValidateContainerRuntime(t *testing.T) {
)
}

p.OrchestratorProfile.KubernetesConfig.ContainerRuntime = "kata-containers"
Contributor

Is kata-containers supported for all k8s versions?

Contributor Author

Kata is more tightly coupled with the CRI-shim version (in this case, containerd). I think if there's an error, it'll likely be a mismatch between containerd + k8s?

Contributor Author

@egernst egernst left a comment

Thanks for the quick feedback, @CecileRobertMichon. The updated error codes will definitely make debugging much easier - updated per your feedback. PTAL.

"kubernetesConfig": {
"networkPlugin": "flannel",
"containerRuntime": "kata-containers",
"addons": [
Contributor

any reason why tiller and dashboard are disabled explicitly? If not, we can probably let them go to default

Contributor Author

No, this was just me being lazy and copying what was done for Clear Containers in the past.

@egernst
Contributor Author

egernst commented Jul 12, 2018

@CecileRobertMichon I verified the example after making the suggested changes. PTAL

@egernst
Contributor Author

egernst commented Jul 12, 2018

I'm not sure if there is any action required on my end for kicking this CI (waiting for status to be reported), but please let me know if I can do anything to help push this along.

@CecileRobertMichon
Contributor

@egernst I'll kick off ci once the PR is approved, will finish reviewing later today if I have time. Thank you for following this through!

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

lgtm

@CecileRobertMichon CecileRobertMichon merged commit 81cf27c into Azure:master Jul 13, 2018
julienstroheker pushed a commit to julienstroheker/acs-engine that referenced this pull request Jul 16, 2018