DataVolume import is failing to import from GCP and Azure #2838
Is it related to #2836?
I can't say, because this error happens during the first import of the disk from quay into the cluster. #2836 happens during the clone of that imported disk (the one this bug is about) to a new dataVolume.
The images are created by the same method as in the past:
Are you able to reproduce this with the standard containerDisk images (i.e. quay.io/containerdisks/centos-stream:9)? This definitely seems like an issue with this specific containerDisk.
Yes: centos stream 8, 9, ubuntu and rhcos (all four tested from quay.io/containerdisks) as well as windows (our custom image) are not working (same error), while fedora (both our custom image and the one from quay.io/containerdisks) and opensuse are working. The same is happening on Azure clusters.
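For reference, a registry-source DataVolume along these lines triggers the kind of containerDisk import discussed here (a minimal sketch only - the name and storage size are illustrative, not the exact manifests used in this report):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-test   # illustrative name
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9   # standard containerDisk image
  storage:
    resources:
      requests:
        storage: 30Gi   # illustrative size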
I noticed that the download speed inside the importer pod is reduced:
@akalenyu Can you reproduce the numbers, and do you have an idea why the download speed might be degraded?
I think I am hitting this trying to import a Fedora CoreOS kubevirt image from a registry. I tried it in a GCP 4.14 cluster and it fails (let me know if you want the logs). On a Bare Metal 4.13 cluster it succeeds. I know this is apples and oranges, but that last datapoint at least lets me know my containerdisk in the registry is good. Here's the VM definition I'm using:
I'll follow along in this issue to see what the resolution is.
Thanks for jumping into this issue!
[ksimon:12:53:22~/Stažené]$ oc logs -f importer-centos-stream9-datavolume-original
...
I0810 10:53:18.868180 1 transport.go:152] File 'disk/centos-stream-9.qcow2' found in the layer
I0810 10:53:18.868397 1 util.go:191] Writing data...
E0810 11:41:35.344127 1 util.go:193] Unable to write file from dataReader: unexpected EOF
E0810 11:41:35.409811 1 transport.go:161] Error copying file: unable to write to file: unexpected EOF
The PVC yamls corresponding to the import operation would also be helpful:
Can't reproduce these numbers on
$ oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   255M      0  0:00:03  0:00:03 --:--:--  255M
$ oc exec -i -n default importer-prime-9614b3bd-7e71-4306-96b2-c042efc26929 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  98.3M      0  0:00:09  0:00:09 --:--:--  102M
In general, it would be surprising to see a slowdown with mirrors, as we have recently merged and released #2841, which makes us download the entire image and only then convert it. Is it possible the pods you picked are scheduled to different nodes?
The logs are here: importer-fcos-data-volume-importer.txt. Unfortunately, I no longer have the PVC yamls, as the cluster got taken down on Friday.
@akalenyu Don't your curl download speeds show a performance degradation by a factor of 2.5?
@akalenyu The same behaviour Dominik is observing happens when the DV has pullMethod: node. I tried it, and the importer pod is slower than the prometheus pod, even though both pods are on the same node.
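For context, a DataVolume using pullMethod: node would look roughly like this (a minimal sketch - pullMethod under source.registry is the CDI API field being referred to here; the name and storage size are illustrative):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-node-pull   # illustrative name
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9
      pullMethod: node   # pull the image through the node's container runtime instead of the importer pod
  storage:
    resources:
      requests:
        storage: 30Gi   # illustrative size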
Yep, that's the same issue.
Maybe this depends on which mirror centos.org redirects to:
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  63.9M      0  0:00:14  0:00:14 --:--:-- 36.2M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   156M      0  0:00:05  0:00:05 --:--:--  162M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   284M      0  0:00:03  0:00:03 --:--:--  284M
A couple of notes from the SIG-storage discussion of this issue:
So, following our discussion in the meeting, I went digging in containers/image (the library we use to pull from the registry). I created a PR to bump this library in CDI to hopefully get this extra resiliency:
So I just verified this PR on a cluster-bot gcp cluster:
I0828 15:03:32.885930 1 importer.go:103] Starting importer
...
I0828 15:03:33.380544 1 util.go:194] Writing data...
time="2023-08-28T15:57:29Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 766851345 bytes…"
time="2023-08-28T16:34:40Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 157415654 bytes…"
...
I0828 17:06:46.600304 1 data-processor.go:255] New phase: Complete
I0828 17:06:46.600386 1 importer.go:216] Import Complete
You can see the retry mechanism had to kick in, and the pull is still very slow. It took more than two hours.
Maybe there is some issue when pulling from quay.io in GCP?
Thought so too, but the slowness reproduces with other HTTP sources like
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
...
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2
(Both with curl and the cdi importer process)
This is not only a GCP issue; Azure is affected as well.
/cc
@maya-r suggested it may have to do with resource usage of the CDI importer process. We bumped the CDI default requests & limits (2x) and the import completed quickly:
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
...
spec:
  config:
    featureGates:
    - HonorWaitForFirstConsumer
    podResourceRequirements:
      limits:
        cpu: 1500m
        memory: 1200M
      requests:
        cpu: 100m
        memory: 60M
I have no idea why we get throttled for a simple image pull, so we will have to figure that out.
As you can see in this PR: kubevirt/common-templates#542
/cc
Yes, the issue is fixed.
What happened:
During a run of the common-templates e2e tests, the import of a DV fails on a GCP env
What you expected to happen:
The DV is imported without error
How to reproduce it (as minimally and precisely as possible):
Run the common-templates e2e tests - I can help with setting up the env
OR
request a new cluster via cluster-bot with the command:
launch 4.14 gcp,virtualization-support
then deploy KubeVirt and CDI, and create a DataVolume:
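For illustration, a registry-source DataVolume along the lines of the sketches earlier in the thread can be used for this step (not the exact DV definition from this report; name and storage size are illustrative):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-datavolume   # illustrative name
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9
  storage:
    resources:
      requests:
        storage: 30Gi   # illustrative size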
Environment:
CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.56.1
[ksimon:13:03:16~/go/src/kubevirt.io/common-templates]$ oc version
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-08-10-021647
Kubernetes Version: v1.27.4+54fa6e1
DV definition:
Log from importer pod: