Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataVolume import is failing to import from GCP and Azure #2838

Closed
ksimon1 opened this issue Aug 10, 2023 · 28 comments
Closed

DataVolume import is failing to import from GCP and Azure #2838

ksimon1 opened this issue Aug 10, 2023 · 28 comments
Labels

Comments

@ksimon1
Copy link
Member

ksimon1 commented Aug 10, 2023

What happened:
During run of common templates e2e tests, import of DV fails on GCP env

What you expected to happen:
DV is imported without error

How to reproduce it (as minimally and precisely as possible):
Run common templates e2e test - I can help setting the env
OR
request new cluster via cluster bot with command launch 4.14 gcp,virtualization-support
deploy KubeVirt, cdi and create datavolume:

oc apply -n kubevirt -f - <<EOF
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
  name: centos-stream9
spec:
  source:
    registry:
      url: docker://quay.io/kubevirt/common-templates:centos-stream9
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 30Gi
EOF

Environment:

  • CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.56.1

[ksimon:13:03:16~/go/src/kubevirt.io/common-templates]$ oc version
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-08-10-021647
Kubernetes Version: v1.27.4+54fa6e1

DV definition:

[ksimon:13:03:07~/go/src/kubevirt.io/common-templates]$ oc describe datavolume centos-stream9-datavolume-original
Name:         centos-stream9-datavolume-original
Namespace:    kubevirt
Labels:       <none>
Annotations:  cdi.kubevirt.io/storage.bind.immediate.requested: true
API Version:  cdi.kubevirt.io/v1beta1
Kind:         DataVolume
Metadata:
  Creation Timestamp:  2023-08-10T10:52:55Z
  Generation:          1
  Managed Fields:
    API Version:  cdi.kubevirt.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:cdi.kubevirt.io/storage.bind.immediate.requested:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:pvc:
          .:
          f:accessModes:
          f:resources:
            .:
            f:requests:
              .:
              f:storage:
        f:source:
          .:
          f:registry:
            .:
            f:url:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2023-08-10T10:52:55Z
    API Version:  cdi.kubevirt.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:claimName:
        f:conditions:
        f:phase:
        f:progress:
    Manager:         cdi-controller
    Operation:       Update
    Subresource:     status
    Time:            2023-08-10T11:03:14Z
  Resource Version:  126758
  UID:               452699ab-ab72-4a25-988b-0e82d9fd44cf
Spec:
  Pvc:
    Access Modes:
      ReadWriteOnce
    Resources:
      Requests:
        Storage:  30Gi
  Source:
    Registry:
      URL:  docker://quay.io/kubevirt/common-templates:centos-stream9
Status:
  Claim Name:  centos-stream9-datavolume-original
  Conditions:
    Last Heartbeat Time:   2023-08-10T10:53:01Z
    Last Transition Time:  2023-08-10T10:53:01Z
    Message:               PVC centos-stream9-datavolume-original Bound
    Reason:                Bound
    Status:                True
    Type:                  Bound
    Last Heartbeat Time:   2023-08-10T11:03:14Z
    Last Transition Time:  2023-08-10T10:52:55Z
    Reason:                TransferRunning
    Status:                False
    Type:                  Ready
    Last Heartbeat Time:   2023-08-10T10:53:14Z
    Last Transition Time:  2023-08-10T10:53:14Z
    Reason:                Pod is running
    Status:                True
    Type:                  Running
  Phase:                   ImportInProgress
  Progress:                N/A
Events:
  Type    Reason            Age   From                          Message
  ----    ------            ----  ----                          -------
  Normal  Pending           10m   datavolume-import-controller  PVC centos-stream9-datavolume-original Pending
  Normal  Claim Pending     10m   datavolume-import-controller  target PVC centos-stream9-datavolume-original Pending and Claim Pending
  Normal  ImportScheduled   10m   datavolume-import-controller  Import into centos-stream9-datavolume-original scheduled
  Normal  Claim Pending     10m   datavolume-import-controller  Claim Pending
  Normal  Bound             10m   datavolume-import-controller  PVC centos-stream9-datavolume-original Bound
  Normal  ImportInProgress  10m   datavolume-import-controller  Import into centos-stream9-datavolume-original in progress

Log from importer pod:

[ksimon:12:53:22~/Stažené]$ oc logs -f importer-centos-stream9-datavolume-original
I0810 10:53:14.498944       1 importer.go:103] Starting importer
I0810 10:53:14.498982       1 importer.go:168] begin import process
I0810 10:53:14.499040       1 registry-datasource.go:172] Copying proxy certs
I0810 10:53:14.499058       1 registry-datasource.go:57] Error creating allCertDir open /proxycerts/: no such file or directory
I0810 10:53:14.499099       1 data-processor.go:379] Calculating available size
I0810 10:53:14.499118       1 data-processor.go:391] Checking out file system volume size.
I0810 10:53:14.499127       1 data-processor.go:399] Request image size not empty.
I0810 10:53:14.499141       1 data-processor.go:404] Target size 31509590016.
I0810 10:53:14.499226       1 util.go:38] deleting file: /scratch/lost+found
I0810 10:53:14.499829       1 util.go:38] deleting file: /data/lost+found
I0810 10:53:14.500146       1 data-processor.go:282] New phase: TransferScratch
I0810 10:53:14.500227       1 registry-datasource.go:92] Copying registry image to scratch space.
I0810 10:53:14.500233       1 transport.go:176] Downloading image from 'docker://quay.io/kubevirt/common-templates:centos-stream9', copying file from 'disk' to '/scratch'
I0810 10:53:15.108005       1 transport.go:200] Processing layer {Digest:sha256:8f6ac7ed4a91c9630083524efcef2f59f27404320bfee44397f544c252ad4bd4 Size:91270228 URLs:[] Annotations:map[] MediaType:application/vnd.oci.image.layer.v1.tar+gzip CompressionOperation:0 CompressionAlgorithm:<nil> CryptoOperation:0}
I0810 10:53:17.902869       1 transport.go:200] Processing layer {Digest:sha256:219eea7f39e324e3014bed9c8bb5eb893731fe04e76bda10d0f6f1c41d3e428f Size:3606081 URLs:[] Annotations:map[] MediaType:application/vnd.oci.image.layer.v1.tar+gzip CompressionOperation:0 CompressionAlgorithm:<nil> CryptoOperation:0}
I0810 10:53:18.255363       1 transport.go:200] Processing layer {Digest:sha256:a61b3233d0cb45893335986c0c4faa3da5ec5369d20d696d20b376084c02a3e0 Size:1283 URLs:[] Annotations:map[] MediaType:application/vnd.oci.image.layer.v1.tar+gzip CompressionOperation:0 CompressionAlgorithm:<nil> CryptoOperation:0}
I0810 10:53:18.462437       1 transport.go:200] Processing layer {Digest:sha256:6736fe12ce1296f721c6676a09ff55dfa719e9335f48882768b34ea52a29b475 Size:943688263 URLs:[] Annotations:map[] MediaType:application/vnd.oci.image.layer.v1.tar+gzip CompressionOperation:0 CompressionAlgorithm:<nil> CryptoOperation:0}
I0810 10:53:18.868180       1 transport.go:152] File 'disk/centos-stream-9.qcow2' found in the layer
I0810 10:53:18.868397       1 util.go:191] Writing data...
E0810 11:41:35.344127       1 util.go:193] Unable to write file from dataReader: unexpected EOF
E0810 11:41:35.409811       1 transport.go:161] Error copying file: unable to write to file: unexpected EOF
E0810 11:41:35.409873       1 transport.go:214] Failed to find VM disk image file in the container image
E0810 11:41:35.409898       1 data-processor.go:278] Failed to find VM disk image file in the container image
kubevirt.io/containerized-data-importer/pkg/importer.copyRegistryImage
	pkg/importer/transport.go:215
kubevirt.io/containerized-data-importer/pkg/importer.CopyRegistryImage
	pkg/importer/transport.go:262
kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer
	pkg/importer/registry-datasource.go:93
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func2
	pkg/importer/data-processor.go:208
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571
Failed to read registry image
kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer
	pkg/importer/registry-datasource.go:95
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func2
	pkg/importer/data-processor.go:208
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571
Unable to transfer source data to scratch space
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func2
	pkg/importer/data-processor.go:213
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571
I0810 11:41:35.410061       1 util.go:38] deleting file: /scratch/disk
E0810 11:41:35.410113       1 importer.go:177] Failed to find VM disk image file in the container image
kubevirt.io/containerized-data-importer/pkg/importer.copyRegistryImage
	pkg/importer/transport.go:215
kubevirt.io/containerized-data-importer/pkg/importer.CopyRegistryImage
	pkg/importer/transport.go:262
kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer
	pkg/importer/registry-datasource.go:93
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func2
	pkg/importer/data-processor.go:208
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571
Failed to read registry image
kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer
	pkg/importer/registry-datasource.go:95
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func2
	pkg/importer/data-processor.go:208
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571
Unable to transfer source data to scratch space
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func2
	pkg/importer/data-processor.go:213
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571

@codingben
Copy link
Member

Is it related to #2836?

@ksimon1
Copy link
Member Author

ksimon1 commented Aug 10, 2023

I can't say, because this error is happening during first import of disk from quay into cluster. The #2836 is happening during clone of the imported disk (the one about which is this bug) to new dataVolume

@akalenyu
Copy link
Collaborator

Did anything change in the push process of the common template images?
Usually, the disk image file is 107:107 but in this case, I see
image
BTW, why do you need the other files in the container image? (boot/,dev/ etc)

@ksimon1
Copy link
Member Author

ksimon1 commented Aug 11, 2023

The images are created the same method as in the past:

FROM kubevirt/container-disk-v1alpha
ADD centos-stream-9.qcow2 /disk

@aglitke
Copy link
Member

aglitke commented Aug 14, 2023

Are you able to reproduce this with the standard containerDisk images (ie. quay.io/containerdisks/centos-stream:9)? This definitely seems like an issue with this specific containerDisk.

@ksimon1
Copy link
Member Author

ksimon1 commented Aug 14, 2023

yes, centos stream 8, 9, ubuntu, rhcos (all four tested from quay.io/containerdisks), windows (our custom) are not working (the same error), fedora (both our custom and from quay.io/containerdisks) and opensuse are working. The same is happening on azure clusters

ksimon1 added a commit to ksimon1/release that referenced this issue Aug 22, 2023
ksimon1 added a commit to ksimon1/release that referenced this issue Aug 22, 2023
ksimon1 added a commit to ksimon1/release that referenced this issue Aug 22, 2023
ksimon1 added a commit to ksimon1/release that referenced this issue Aug 22, 2023
CDI is not working with CI ocp 4.14 - kubevirt/containerized-data-importer#2838
Signed-off-by: Karel Simon <[email protected]>
@ksimon1
Copy link
Member Author

ksimon1 commented Aug 22, 2023

@aglitke, @akalenyu I just tested this on ocp 4.15 and it is happening there too.

@dominikholler
Copy link

I noticed that the download speed inside the importer pod is reduced:

  1. Check that the cluster is fine:
oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  96.4M      0  0:00:09  0:00:09 --:--:--  101M
  1. Run the same download at the same time inside a importer pod:
oc exec -i -n openshift-virtualization-os-images importer-prime-9d4f16d6-6d2b-48e4-9f9c-2d1f33159edc -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  3920k      0  0:03:56  0:03:56 --:--:-- 94.0M

on

OCP 4.14.0-0.nightly-2023-08-11-055332
CNV 4.14.0-1744 provided by Red Hat

@akalenyu Can you reproduce the numbers and do you have an idea why the download speed might be degraded?

@dustymabe
Copy link

I think I am hitting this trying to import a Fedora CoreOS kubevirt image from a registry. I tried it in a GCP 4.14 cluster and it fails (let me know if you want the logs). On a Bare Metal 4.13 cluster it succeeds. I know this is apples and oranges, but that last datapoint at least lets me know my containerdisk in the registry is good.

Here's the VM definition I'm using:

---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: fcos
spec:
  runStrategy: Always
  dataVolumeTemplates:
  - metadata:
      name: fcos-data-volume
    spec:
      storage:
        volumeMode: Block
        resources:
          requests:
            storage: 10Gi
      source:
        registry:
          url: "docker://quay.io/fedora/fedora-coreos-kubevirt:stable"
  template:
    metadata:
      creationTimestamp: null
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: fcos-data-volume-disk1
          - disk:
              bus: virtio
            name: cloudinitdisk
          rng: {}
        resources:
          requests:
            memory: 2048M
      volumes:
      - dataVolume:
          name: fcos-data-volume
        name: fcos-data-volume-disk1
      - name: cloudinitdisk
        cloudInitConfigDrive:
          userData: |
            {
              "ignition": {
                "version": "3.3.0"
              },
              "passwd": {
                "users": [
                  {
                    "name": "core",
                    "sshAuthorizedKeys": [
                      "ecdsa-sha2-nistp521 AAAA..."
                    ]
                  }
                ]
              }
            }

I'll follow along in this issue to see what the resolution is.

@akalenyu
Copy link
Collaborator

I think I am hitting this trying to import a Fedora CoreOS kubevirt image from a registry. I tried it in a GCP 4.14 cluster and it fails (let me know if you want the logs). On a Bare Metal 4.13 cluster it succeeds. I know this is apples and oranges, but that last datapoint at least lets me know my containerdisk in the registry is good.

Here's the VM definition I'm using:

I'll follow along in this issue to see what the resolution is.

Thanks for jumping into this issue!
It would be great to see if this is indeed the same by following the importer pod logs,
similarly to how the author did:

[ksimon:12:53:22~/Stažené]$ oc logs -f importer-centos-stream9-datavolume-original
...
I0810 10:53:18.868180       1 transport.go:152] File 'disk/centos-stream-9.qcow2' found in the layer
I0810 10:53:18.868397       1 util.go:191] Writing data...
E0810 11:41:35.344127       1 util.go:193] Unable to write file from dataReader: unexpected EOF
E0810 11:41:35.409811       1 transport.go:161] Error copying file: unable to write to file: unexpected EOF

The PVC yamls corresponding to the import operation would also be helpful:

  • fcos-data-volume
  • fcos-data-volume-scratch

@akalenyu
Copy link
Collaborator

I noticed that the download speed inside the importer pod is reduced:

  1. Check that the cluster is fine:
oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  96.4M      0  0:00:09  0:00:09 --:--:--  101M
  1. Run the same download at the same time inside a importer pod:
oc exec -i -n openshift-virtualization-os-images importer-prime-9d4f16d6-6d2b-48e4-9f9c-2d1f33159edc -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  3920k      0  0:03:56  0:03:56 --:--:-- 94.0M

on

OCP 4.14.0-0.nightly-2023-08-11-055332
CNV 4.14.0-1744 provided by Red Hat

@akalenyu Can you reproduce the numbers and do you have an idea why the download speed might be degraded?

Can't reproduce these numbers on 4.14.0-1763:

$ oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   255M      0  0:00:03  0:00:03 --:--:--  255M


$ oc exec -i -n default importer-prime-9614b3bd-7e71-4306-96b2-c042efc26929 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  98.3M      0  0:00:09  0:00:09 --:--:--  102M

In general, it would be surprising to see a slowdown with mirrors, as we have recently merged and released #2841 which makes us download the entire image and only then convert it.

Is it possible the pods you picked are scheduled to different nodes?

@dustymabe
Copy link

Thanks for jumping into this issue!
It would be great to see if this is indeed the same by following the importer pod logs,

The logs are here: importer-fcos-data-volume-importer.txt

Unfortunately I don't still have the PVC yamls as the cluster got taken down on Friday.

@dominikholler
Copy link

dominikholler commented Aug 28, 2023

Can't reproduce these numbers on 4.14.0-1763:

@akalenyu Doesn't your curl download speeds show a performance degradation by the factor 2.5 ?

@ksimon1
Copy link
Member Author

ksimon1 commented Aug 28, 2023

@akalenyu The same behaviour as Dominik is observing is happening when DV has the pullMethod: node.

Is it possible the pods you picked are scheduled to different nodes?

I tried it and importer pod is slower than prometheus pod, eventhough pods are on the same node

[ksimon:10:14:35~]$ oc exec -i -n openshift-monitoring prometheus-k8s-1 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 10  906M   10 91.2M    0     0  70.2M      0  0:00:12  0:00:01  0:00:11 70.1M
 20  906M   20  185M    0     0  80.4M      0  0:00:11  0:00:02  0:00:09 80.4M

vs

[ksimon:10:15:29~]$ oc exec -i  importer-prime-eda481be-064c-4a9f-967d-752cb4b3c2fb -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
Defaulted container "importer" out of: importer, server, init (init)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  906M    0 1366k    0     0   503k      0  0:30:44  0:00:02  0:30:42  503k

@akalenyu
Copy link
Collaborator

Thanks for jumping into this issue!
It would be great to see if this is indeed the same by following the importer pod logs,

The logs are here: importer-fcos-data-volume-importer.txt

Unfortunately I don't still have the PVC yamls as the cluster got taken down on Friday.

Yep that's the same issue

Can't reproduce these numbers on 4.14.0-1763:

@akalenyu Doesn't your curl download speeds show a performance degradation by the factor 2.5 ?

Maybe this depends on which mirror centos.org redirects to:

$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0  63.9M      0  0:00:14  0:00:14 --:--:-- 36.2M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   156M      0  0:00:05  0:00:05 --:--:--  162M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2   -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  906M  100  906M    0     0   284M      0  0:00:03  0:00:03 --:--:--  284M

@akalenyu
Copy link
Collaborator

@akalenyu The same behaviour as Dominik is observing is happening when DV has the pullMethod: node.

pullMethod: node uses the container runtime on the node to pull images, similarly to how it would
do that for regular pod images. Are those super slow too? you can try a simple crictl pull on the node

@aglitke
Copy link
Member

aglitke commented Aug 28, 2023

A couple notes from the SIG-storage discussion of this issue:

  • Could it be a host caching issue? We may want to test adding an fsync call after the container image is extracted to scratch space
  • Need to scrutinize the flow. Could there be an nbdkit issue? How is qemu-img convert involved?
  • Could there be a network connectivity issue to the registry server specific to this environment?
  • From the node, try a skopeo pull to see if the image can be retrieved (you should be able to provide credentials on the command line)
  • Can the same image be pushed to a registry other than quay.io to see if it's related to that registry.

@akalenyu
Copy link
Collaborator

akalenyu commented Aug 28, 2023

  • Could there be a network connectivity issue to the registry server specific to this environment?

So following our discussion in the meeting, I went digging in containers/image (the library we use to pull from registry)
and noticed there was a similar issue - an RFE for retrying on unexpected EOF/reset by peer errors:
containers/image#1145 (comment)
Seems to specifically impact large images like our use case

I created a PR to bump this library in CDI to hopefully get this extra resiliency:
#2874

@akalenyu
Copy link
Collaborator

  • Could there be a network connectivity issue to the registry server specific to this environment?

So following our discussion in the meeting, I went digging in containers/image (the library we use to pull from registry) and noticed there was a similar issue - an RFE for retrying on unexpected EOF/reset by peer errors: containers/image#1145 (comment) Seems to specifically impact large images like our use case

I created a PR to bump this library in CDI to hopefully get this extra resiliency: #2874

So I just verified this PR on a cluster-bot gcp cluster:

I0828 15:03:32.885930       1 importer.go:103] Starting importer
...
I0828 15:03:33.380544       1 util.go:194] Writing data...
time="2023-08-28T15:57:29Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 766851345 bytes…"
time="2023-08-28T16:34:40Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 157415654 bytes…"
...
I0828 17:06:46.600304       1 data-processor.go:255] New phase: Complete
I0828 17:06:46.600386       1 importer.go:216] Import Complete

You can see the retry mechanism had to kick in, and the pull is still very slow. It took more than two hours

@dustymabe
Copy link

maybe there is some issue when pulling from quay.io when in GCP?

@akalenyu
Copy link
Collaborator

akalenyu commented Aug 28, 2023

maybe there is some issue when pulling from quay.io when in GCP?

Thought so too, but the slowness reproduces with other HTTP sources like

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
...
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2

(Both with curl and the cdi importer process)

@ksimon1
Copy link
Member Author

ksimon1 commented Aug 29, 2023

maybe there is some issue when pulling from quay.io when in GCP?

this is not only gcp issue, azure is affected as well

@fabiand fabiand changed the title DataVolume import is failing on GCP env DataVolume import is failing to import from GCP and Azure Aug 29, 2023
@lyarwood
Copy link
Member

/cc

@akalenyu
Copy link
Collaborator

akalenyu commented Aug 29, 2023

@maya-r suggested it may have to do with resource usage of the CDI importer process,
and it seems like that is the case here.

We bumped (2x) the CDI default requests&limits and the import completed quickly
I suggest that you also give this a try, basically, just edit the CDI resource with

apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
...
spec:
  config:
    featureGates:
    - HonorWaitForFirstConsumer
    podResourceRequirements:
      limits:
        cpu: 1500m
        memory: 1200M
      requests:
        cpu: 100m
        memory: 60M

I have no idea why we get throttled for a simple image pull so will have to figure that out..
One difference between 4.13 and 4.14 is cgroupsv2 being the default so throttles will happen instead of OOMs

@ksimon1
Copy link
Member Author

ksimon1 commented Aug 29, 2023

as you can see in this PR kubevirt/common-templates#542
many tests failed on importing DV into cluster even with the change.

@jcanocan
Copy link

/cc

@dominikholler
Copy link

@akalenyu thanks for your impressive investigation!
@ksimon1 can you confirm that the issue is fixed?

@ksimon1
Copy link
Member Author

ksimon1 commented Oct 5, 2023

Yes, issue is fixed

@ksimon1 ksimon1 closed this as completed Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants