
backup to GCS and S3 never completes #547

Open
pbackup12345 opened this issue Apr 8, 2023 · 4 comments
Labels
backup bug Something isn't working


pbackup12345 commented Apr 8, 2023

I have a GKE Autopilot pod with a relatively vanilla setup, with TLS termination at the Ingress and 3 pods. Everything works except the backup feature.

I use:
  • Solr Operator 0.6.0
  • Solr 8.11.0
  • Zookeeper 0.2.14

Following are the relevant parts of my setup:

The backupRepositories section of my main YAML:

  backupRepositories:
    - name: "gcs-backups-1"
      gcs:
        bucket: "backupbx"
        gcsCredentialSecret:
          name: "gcssecret1"
          key: "service-account-key.json"
        baseLocation: "d"
    - name: "s3-backups-1"
      s3:
        region: "us-east-1" # Required
        bucket: "solr-xsearch" # Required
        credentials:
          accessKeyIdSecret: # Optional
            name: aws-secrets
            key: access-key-id
          secretAccessKeySecret: # Optional
            name: aws-secrets
            key: secret-access-key

My backup yaml:

apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: local-backup14
  namespace: sop030
spec:
  repositoryName: "s3-backups-1"
  solrCloud: explore
  collections:
    - dsearch

The backup actually starts and both locations (S3 and GCS) receive files as well, but after a while the backup process stalls.
There are no Solr error messages, but this is the relevant portion of the logs of the pod that performs the backup:

2023-04-08 14:42:00.734 INFO  (parallelCoreAdminExecutor-19-thread-3-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr x:dsearch_shard1_replica_n1 local-backup14-dsearch8519612781285 BACKUPCORE) [c:dsearch s:shard1 r:core_node2 x:dsearch_shard1_replica_n1] o.a.s.h.IncrementalShardBackup Done creating backup snapshot at s3:/local-backup14-dsearch/dsearch/ shardBackupMetadataFile:org.apache.solr.core.backup.ShardBackupId@46768e0a
2023-04-08 14:42:01.396 INFO  (qtp1306246648-19) [   ] o.a.s.h.a.CoreAdminOperation Checking request status for : local-backup14-dsearch8519612781285
2023-04-08 14:42:01.396 INFO  (qtp1306246648-19) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&requestid=local-backup14-dsearch8519612781285&action=REQUESTSTATUS&wt=javabin&version=2} status=0 QTime=0
2023-04-08 14:42:01.398 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.a.c.BackupCmd Starting to backup ZK data for backupName=local-backup14-dsearch
2023-04-08 14:42:04.358 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:04.358 INFO  (qtp1306246648-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:09.357 INFO  (qtp1306246648-20) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:14.358 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:14.361 INFO  (qtp1306246648-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=5
2023-04-08 14:42:16.585 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.a.c.BackupCmd Completed backing up ZK data for backupName=local-backup14-dsearch
2023-04-08 14:42:16.655 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
2023-04-08 14:42:18.400 INFO  (OverseerCollectionConfigSetProcessor-144115198735482887-explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr-n_0000000002) [   ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
2023-04-08 14:42:19.358 INFO  (qtp1306246648-20) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:24.359 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:24.360 INFO  (qtp1306246648-22) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:29.359 INFO  (qtp1306246648-20) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:34.360 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=6
2023-04-08 14:42:34.360 INFO  (qtp1306246648-24) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=6
2023-04-08 14:42:39.358 INFO  (qtp1306246648-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:42.779 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=3
2023-04-08 14:42:42.797 INFO  (qtp1306246648-22) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=2

And this is where it seems to die:

2023-04-08 14:42:18.400 INFO  (OverseerCollectionConfigSetProcessor-144115198735482887-explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr-n_0000000002) [   ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper

From there the log only has endless calls to check the requeststatus with no results:

2023-04-08 14:43:27.884 INFO  (qtp1306246648-23) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=2

Additionally, through normal API calls the backup and restore functions work perfectly. Specifically, the following runs without a hitch:

https://<solr-server>/solr/admin/collections?action=BACKUP&name=local-backup15&collection=dsearch&repository=s3-backups-1&location=/

And my main yaml in full:

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: explore
spec:
  backupRepositories:
    - name: "gcs-backups-1"
      gcs:
        bucket: "backupbx"
        gcsCredentialSecret:
          name: "gcssecret1"
          key: "service-account-key.json"
        baseLocation: "d"
    - name: "s3-backups-1"
      s3:
        region: "us-east-1" # Required
        bucket: "solr-xsearch" # Required
        credentials:
          accessKeyIdSecret: # Optional
            name: aws-secrets
            key: access-key-id
          secretAccessKeySecret: # Optional
            name: aws-secrets
            key: secret-access-key
  customSolrKubeOptions:
    ingressOptions:
      ingressClassName: nginx
    podOptions:
      resources:
        limits:
          memory: 3Gi
        requests:
          cpu: 700m
          memory: 3Gi
  dataStorage:
    persistent:
      pvcTemplate:
        spec:
          resources:
            requests:
              storage: 2Gi
      reclaimPolicy: Delete
  solrSecurity:
    authenticationType: Basic
  replicas: 2
  solrImage:
    repository: solr
    tag: 8.11.0
  solrJavaMem: -Xms500M -Xmx500M
  updateStrategy:
    managed:
      maxPodsUnavailable: 2
      maxShardReplicasUnavailable: 2
    method: Managed
  zookeeperRef:
    provided:
      chroot: /explore
      image:
        pullPolicy: IfNotPresent
        repository: pravega/zookeeper
        tag: 0.2.15
      persistence:
        reclaimPolicy: Delete
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
      replicas: 3
  solrAddressability:
    external:
      domainName: <mydomain redacted>.xyz
      method: Ingress
      hideNodes: true
      useExternalAddress: false
      ingressTLSTermination:
        tlsSecret: explore-solr-tls-letsencrypt
@HoustonPutman
Contributor

Thanks for providing so much information!

Does this happen after a few backups complete, or does it happen from the first backup?

@pbackup12345
Author

It happens right after the first installation, once I created a collection and added just a few records. And it happens repeatedly: I started from scratch several times and ran into the same problem on a default Autopilot GKE cluster.

@HoustonPutman
Contributor

So at this line, it should have finished almost all of the backup logic and the files should exist in S3/GCS. The line after that will delete any old incremental backups if necessary (which it shouldn't be, as it's the first backup...).

2023-04-08 14:42:16.585 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.a.c.BackupCmd Completed backing up ZK data for backupName=local-backup14-dsearch

Also before that error can happen, it should save the state of the completed task. So whenever the requeststatus request is sent, it should have information to populate the result.

From there the log only has endless calls to check the requeststatus with no results:

Can you show me what this means? What does the "no results" json look like?

Have you tried running an async backup yourself?
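For reference, an async backup is triggered and polled through the Collections API with an `async` request id. The following is a minimal sketch that only constructs the two URLs involved (the host placeholder is the redacted `<solr-server>` from above, and the request id `my-backup-req-1` is hypothetical):

```python
from urllib.parse import urlencode

SOLR = "https://<solr-server>/solr"  # placeholder host, redacted above

# Trigger the backup asynchronously: with "async" set, Solr returns
# immediately and tracks progress under the given request id.
backup_params = urlencode({
    "action": "BACKUP",
    "name": "local-backup15",
    "collection": "dsearch",
    "repository": "s3-backups-1",
    "location": "/",
    "async": "my-backup-req-1",  # hypothetical request id
})
backup_url = f"{SOLR}/admin/collections?{backup_params}"

# Poll this URL until the tracked request reports "completed" or "failed".
status_params = urlencode({
    "action": "REQUESTSTATUS",
    "requestid": "my-backup-req-1",
    "wt": "json",
})
status_url = f"{SOLR}/admin/collections?{status_params}"

print(backup_url)
print(status_url)
```

This mirrors the REQUESTSTATUS polling visible in the operator's log lines above.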

@HoustonPutman added the backup and bug (Something isn't working) labels on Apr 17, 2023
@cesarfm

cesarfm commented May 11, 2023

Hello all. I am observing exactly the same symptoms described above.

Moreover, I see something that might be of interest to the investigation.

  • Let's say I have a GCS bucket called mybucket. After trying to run a backup, the files are created in the bucket as described by the OP.
  • my SolrCloud is called mycloud-solr
  • my SolrBackup resource is called mycloud-backup, and I configured it with location: mycloud-solr (the location matches with the SolrCloud name, this is just a preference). So it's defined like:
apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: mycloud-backup
  namespace: search
spec:
  repositoryName: "gcs-backups"
  solrCloud: mycloud-solr
  location: "mycloud-solr"
  collections:
    - articles
  • the collection being backed up is called articles
  • then, I see that the actual backup files have the path mycloud-solr/mycloud-backup-articles/articles (so the path seems to be {location}/{solrBackupName}-{collection}/{collection})
  • Here is the thing: if I then connect to Solr and request:
    http://localhost:8983/solr/admin/collections?action=LISTBACKUP&name=mycloud-backup-articles&location=mycloud-solr&repository=gcs-backups&gcsBucket=mybucket, I do get the backups I created listed there, even if the SolrBackup CRD still reports itself as "in progress".
    • note that the name=mycloud-backup-articles corresponds to the "directory" directly under the location
{
    "responseHeader": {
        "status": 0,
        "QTime": 347
    },
    "collection": "articles",
    "backups": [
        {
            "indexFileCount": 122,
            "indexSizeMB": 0.748,
            "shardBackupIds": {
                "shard1": "md_shard1_0.json"
            },
            "collection.configName": "articles",
            "backupId": 0,
            "collectionAlias": "articles",
            "startTime": "2023-05-11T10:22:25.339652045Z",
            "indexVersion": "9.4.2",
            "endTime": "2023-05-11T10:22:42.946120826Z"
        },
        {
            "indexFileCount": 122,
            "indexSizeMB": 0.748,
            "shardBackupIds": {
                "shard1": "md_shard1_1.json"
            },
            "collection.configName": "articles",
            "backupId": 1,
            "collectionAlias": "articles",
            "startTime": "2023-05-11T10:35:46.024110473Z",
            "indexVersion": "9.4.2",
            "endTime": "2023-05-11T10:35:50.499108324Z"
        }
    ]
}
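The path layout observed above can be summed up in a tiny helper. This is a hypothetical sketch based purely on the files seen in the bucket, not on the operator's actual code:

```python
def backup_path(location: str, backup_name: str, collection: str) -> str:
    """Reproduce the observed layout: {location}/{solrBackupName}-{collection}/{collection}."""
    return f"{location}/{backup_name}-{collection}/{collection}"

print(backup_path("mycloud-solr", "mycloud-backup", "articles"))
# -> mycloud-solr/mycloud-backup-articles/articles
```

Note that the middle segment (`mycloud-backup-articles`) is exactly the `name` that makes the LISTBACKUP call above return results.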
