DataVolume import is failing to import from GCP and Azure #2838
Is it related to #2836?
I can't say, because this error happens during the first import of the disk from quay into the cluster. #2836 happens during the clone of that imported disk (the one this bug is about) to a new dataVolume.
The images are created by the same method as in the past:
Are you able to reproduce this with the standard containerDisk images (i.e. quay.io/containerdisks/centos-stream:9)? This definitely seems like an issue with this specific containerDisk.
Yes: centos stream 8, 9, ubuntu and rhcos (all four tested from quay.io/containerdisks) as well as windows (our custom image) are not working (same error), while fedora (both our custom image and the one from quay.io/containerdisks) and opensuse are working. The same is happening on Azure clusters.
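For reference, a registry-source DataVolume along these lines triggers the kind of containerDisk import discussed here (a minimal sketch only - the name and storage size are illustrative, not the exact manifests used in this report):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-test   # illustrative name
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9   # standard containerDisk image
  storage:
    resources:
      requests:
        storage: 30Gi   # illustrative size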
I noticed that the download speed inside the importer pod is reduced:
@akalenyu Can you reproduce the numbers, and do you have an idea why the download speed might be degraded?
I think I am hitting this trying to import a Fedora CoreOS kubevirt image from a registry. I tried it in a GCP 4.14 cluster and it fails (let me know if you want the logs). On a Bare Metal 4.13 cluster it succeeds. I know this is apples and oranges, but that last datapoint at least lets me know my containerdisk in the registry is good. Here's the VM definition I'm using:
I'll follow along in this issue to see what the resolution is.
Thanks for jumping into this issue!
[ksimon:12:53:22~/Stažené]$ oc logs -f importer-centos-stream9-datavolume-original
...
I0810 10:53:18.868180 1 transport.go:152] File 'disk/centos-stream-9.qcow2' found in the layer
I0810 10:53:18.868397 1 util.go:191] Writing data...
E0810 11:41:35.344127 1 util.go:193] Unable to write file from dataReader: unexpected EOF
E0810 11:41:35.409811 1 transport.go:161] Error copying file: unable to write to file: unexpected EOF
The PVC yamls corresponding to the import operation would also be helpful:
Can't reproduce these numbers on
$ oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   255M      0  0:00:03  0:00:03 --:--:--  255M
$ oc exec -i -n default importer-prime-9614b3bd-7e71-4306-96b2-c042efc26929 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  98.3M      0  0:00:09  0:00:09 --:--:--  102M
In general, it would be surprising to see a slowdown with mirrors, as we have recently merged and released #2841, which makes us download the entire image and only then convert it. Is it possible the pods you picked are scheduled to different nodes?
The logs are here: importer-fcos-data-volume-importer.txt. Unfortunately, I no longer have the PVC yamls, as the cluster got taken down on Friday.
@akalenyu Don't your curl download speeds show a performance degradation by a factor of 2.5?
@akalenyu The same behaviour Dominik is observing happens when the DV has pullMethod: node. I tried it, and the importer pod is slower than the prometheus pod, even though both pods are on the same node.
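For context, a DataVolume using pullMethod: node would look roughly like this (a minimal sketch - pullMethod under source.registry is the CDI API field being referred to here; the name and storage size are illustrative):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-node-pull   # illustrative name
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9
      pullMethod: node   # pull the image through the node's container runtime instead of the importer pod
  storage:
    resources:
      requests:
        storage: 30Gi   # illustrative size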
Yep, that's the same issue.
Maybe this depends on which mirror centos.org redirects to:
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  63.9M      0  0:00:14  0:00:14 --:--:-- 36.2M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   156M      0  0:00:05  0:00:05 --:--:--  162M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   284M      0  0:00:03  0:00:03 --:--:--  284M
A couple of notes from the SIG-storage discussion of this issue:
So, following our discussion in the meeting, I went digging in containers/image (the library we use to pull from the registry). I created a PR to bump this library in CDI to hopefully get this extra resiliency:
So I just verified this PR on a cluster-bot gcp cluster:
I0828 15:03:32.885930 1 importer.go:103] Starting importer
...
I0828 15:03:33.380544 1 util.go:194] Writing data...
time="2023-08-28T15:57:29Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 766851345 bytes…"
time="2023-08-28T16:34:40Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 157415654 bytes…"
...
I0828 17:06:46.600304 1 data-processor.go:255] New phase: Complete
I0828 17:06:46.600386 1 importer.go:216] Import Complete
You can see the retry mechanism had to kick in, and the pull is still very slow. It took more than two hours.
Maybe there is some issue when pulling from quay.io in GCP?
Thought so too, but the slowness reproduces with other HTTP sources like
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
...
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2
(Both with curl and the cdi importer process)
This is not only a GCP issue; Azure is affected as well.
/cc
@maya-r suggested it may have to do with resource usage of the CDI importer process. We bumped the CDI default requests & limits (2x) and the import completed quickly:
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
...
spec:
  config:
    featureGates:
    - HonorWaitForFirstConsumer
    podResourceRequirements:
      limits:
        cpu: 1500m
        memory: 1200M
      requests:
        cpu: 100m
        memory: 60M
I have no idea why we get throttled for a simple image pull, so we will have to figure that out.
As you can see in this PR: kubevirt/common-templates#542
/cc
Yes, the issue is fixed.
What happened:
During a run of the common-templates e2e tests, the import of a DV fails on a GCP env
What you expected to happen:
The DV is imported without error
How to reproduce it (as minimally and precisely as possible):
Run the common-templates e2e tests - I can help with setting up the env
OR
request a new cluster via cluster-bot with the command:
launch 4.14 gcp,virtualization-support
then deploy KubeVirt and CDI, and create a DataVolume:
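For illustration, a registry-source DataVolume along the lines of the sketches earlier in the thread can be used for this step (not the exact DV definition from this report; name and storage size are illustrative):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-datavolume   # illustrative name
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9
  storage:
    resources:
      requests:
        storage: 30Gi   # illustrative size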
Environment:
CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.56.1
[ksimon:13:03:16~/go/src/kubevirt.io/common-templates]$ oc version
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-08-10-021647
Kubernetes Version: v1.27.4+54fa6e1
DV definition:
Log from importer pod: