etcdserver exceeds database space, preventing all further milvus operations #6753

Closed
NotRyan opened this issue Jul 22, 2021 · 7 comments
Labels
kind/enhancement Issues or changes related to enhancement triage/accepted Indicates an issue or PR is ready to be actively worked on.

@NotRyan
Contributor

NotRyan commented Jul 22, 2021

After a large number of collection creations, collection drops, successful inserts, and failed inserts, all Milvus operations begin to fail with the error message

"BaseException: <BaseException: (code=1, message=Drop collection failed: etcdserver: mvcc: database space exceeded)>"

despite the instance hosting the etcd node having plenty of free disk space and free memory.

Steps/Code to reproduce:

Inconsistent, has occurred on two separate occasions while performing collection and insertion operations. Have not found a reliable way to reproduce yet.

Expected result:

etcd host does not appear to be out of memory, so this should not happen.

Actual results:

"BaseException: <BaseException: (code=1, message=Drop collection failed: etcdserver: mvcc: database space exceeded)>"

Memory and Disk space on etcd node:
[Screenshot: memory and disk usage on the etcd node, 2021-07-22]

docker logs from etcd node:
etcd_logs.txt

docker system df:

[ec2-user@ip-10-0-0-102 /]$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          1         1         39.45MB   0B (0%)
Containers      1         1         2.532GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

Environment:

  • Milvus version(e.g. v2.0.0-RC2 or 8b23a93): v2.0.0-RC2 (Docker image v2.0.0-rc2-20210712-a8e5fd2)
  • Deployment mode(standalone or cluster): Cluster
  • SDK version(e.g. pymilvus v2.0.0rc2): PyMilvus v2.0.0rc3dev6
  • OS(Ubuntu or CentOS): Amazon Linux 2
  • CPU/Memory: 16 GB memory
  • GPU:
  • Others: 1 TB disk

Configuration file:

Deployed on AWS using Terraform/Ansible, setup procedure described in https://zilliverse.feishu.cn/docs/doccnuxzvQVhPqzityx2sO6kUfc

Additional context:

@NotRyan NotRyan added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 22, 2021
@filip-halt
Contributor

filip-halt commented Aug 2, 2021

I am running into the same issue; for me it occurred when inserting and indexing. @xiaofan-luan, has anyone else on the team been running into this?

@yhmo yhmo added this to the 2.0-Backlog milestone Aug 12, 2021
@godchen0212
Contributor

Can you provide etcd metrics so we can debug etcd?
The method of exporting metrics varies with the etcd version; refer to the etcd documentation page "Metrics | etcd".

@LoveEachDay
Contributor

The default etcd storage size limit is 2 GB. You can check the etcd DB size with the following command:
etcdctl endpoint status -w table
Check the DB SIZE cell.
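
For scripted checks, the same information is available as JSON. The following is a minimal Python sketch, assuming the JSON shape printed by `etcdctl endpoint status -w json` on recent etcd 3.x releases (a list of objects with a `Status.dbSize` field in bytes). The helper name `db_usage` and the sample payload are illustrative, and the 2 GiB default should be checked against your own `--quota-backend-bytes` setting.

```python
import json

# Assumption: etcd's default backend quota is 2 GiB unless --quota-backend-bytes is set.
DEFAULT_QUOTA_BYTES = 2 * 1024**3


def db_usage(status_json, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Return {endpoint: fraction of quota used} from `etcdctl endpoint status -w json` output."""
    usage = {}
    for entry in json.loads(status_json):
        usage[entry["Endpoint"]] = entry["Status"]["dbSize"] / quota_bytes
    return usage


# Hypothetical sample payload, trimmed to the fields used here.
sample = '[{"Endpoint": "127.0.0.1:2379", "Status": {"header": {"revision": 1516}, "dbSize": 2147483648}}]'
for endpoint, fraction in db_usage(sample).items():
    print(f"{endpoint}: {fraction:.0%} of quota used")
```

A fraction at or near 1.0 means the backend has reached its quota, regardless of how much free disk the host has.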

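You can run etcdctl compact <rev> and then etcdctl defrag to decrease the DB size; compaction alone frees space inside the database file, and defragmentation returns it to the filesystem. Once the DB size is below 2 GB, you must run etcdctl alarm disarm before etcd will accept writes again.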
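
The recovery sequence can be scripted. The sketch below follows the order given in etcd's maintenance guide: compact to the current revision, defragment to release the freed space, then disarm the alarm. It only builds the command lines. The function name `recovery_commands` and the sample payload are hypothetical, and actually running the commands (for example with `subprocess.run` and the right `--endpoints`) is left to the operator.

```python
import json


def recovery_commands(status_json):
    """Build the etcdctl commands to reclaim space, from `etcdctl endpoint status -w json` output."""
    entries = json.loads(status_json)
    # Compact up to the current revision reported by the first endpoint.
    revision = entries[0]["Status"]["header"]["revision"]
    return [
        ["etcdctl", "compact", str(revision)],  # drop history older than `revision`
        ["etcdctl", "defrag"],                  # return freed pages to the filesystem
        ["etcdctl", "alarm", "disarm"],         # clear the NOSPACE alarm so writes resume
    ]


# Hypothetical sample payload, trimmed to the fields used here.
sample = '[{"Endpoint": "127.0.0.1:2379", "Status": {"header": {"revision": 1516}, "dbSize": 2147483648}}]'
for cmd in recovery_commands(sample):
    print(" ".join(cmd))
```

Keeping the commands as argument lists avoids shell quoting issues if they are later passed to `subprocess.run`.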

@xiaofan-luan xiaofan-luan modified the milestones: 2.0-Backlog, 2.0.0-RC5 Aug 19, 2021
@xiaofan-luan
Collaborator

@godchen0212 @jeffoverflow We may need to change the default etcd settings to retain fewer snapshots; we hit a similar issue when we used ZooKeeper.

@xiaofan-luan xiaofan-luan added triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/enhancement Issues or changes related to enhancement priority/important-longterm and removed kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 19, 2021
@xiaofan-luan
Collaborator

/assign @jeffoverflow

@jeffoverflow
Contributor

/assign @LoveEachDay

@LoveEachDay
Contributor

We've enabled auto compaction for etcd in docker-compose and helm charts.
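
For readers configuring etcd themselves, auto compaction is controlled by server flags. A minimal Python sketch of assembling them follows. The flag names are etcd's own, but the helper name `etcd_compaction_flags` and the values (revision mode, 1000 revisions retained, 4 GiB quota) are illustrative assumptions; this thread does not state which values the Milvus docker-compose file or Helm chart uses.

```python
def etcd_compaction_flags(mode="revision", retention="1000", quota_bytes=4 * 1024**3):
    """Return etcd server flags that enable auto compaction and set the backend quota."""
    if mode not in ("periodic", "revision"):
        raise ValueError("mode must be 'periodic' or 'revision'")
    return [
        f"--auto-compaction-mode={mode}",            # compact by revision count or by time window
        f"--auto-compaction-retention={retention}",  # revisions (or duration) of history to keep
        f"--quota-backend-bytes={quota_bytes}",      # backend size at which the NOSPACE alarm fires
    ]


print(" ".join(["etcd"] + etcd_compaction_flags()))
```

Auto compaction removes old revisions but does not defragment, so a database that has already grown may still need a one-off etcdctl defrag.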

8 participants