
Many cause,not_found errors producing GBs of logs #1001

Closed
MrApe opened this issue Mar 8, 2018 · 8 comments

MrApe commented Mar 8, 2018

I have a 4-node cluster with N=3, D=R=W=2. I deleted a lot of files using s3cmd, and now each storage node is producing GBs of error logs like this:

[E]     [email protected]  2018-03-08 12:26:35.541856 +0100        1520508395      leo_storage_handler_object:get/1        89      [{from,storage},{method,get},{key,<<"fifo-backups/056054ce-5722-4c7a-ae96-835403748465/ff6b25c1-830a-42c5-82f0-5386108cc64c\n76">>},{cause,not_found}]

(as you can see, it's a Project FiFo installation doing backups to LeoFS)

The objects all belong to the deleted files; the files I did not delete are fine. I ran leofs-adm recover-cluster to make the nodes detect orphan objects and rebuild the ring. leofs-adm mq-stats has been showing this as running for about a week(!) now:

              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------+----------------+----------------+---------------------------------------------
[...]
 leo_per_object_queue           |   running   | 38105          | 1000           | 300            | recover inconsistent objs
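
For scale, a rough sketch of how the error volume and the queue drain can be tracked (the log path and node name below are illustrative assumptions, not taken from this cluster):

# Count not_found errors per error-log file on a storage node
# (adjust the path to wherever leo_storage writes its error logs)
grep -c 'cause,not_found' /path/to/leo_storage/log/app/error.*

# Re-run periodically to see whether the recover queue is draining
# (node name is a placeholder for one of your storage nodes)
leofs-adm mq-stats [email protected]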

What is producing these errors and how do I solve this problem?

Thanks in advance.
Best, Jonas

yosukehara (Member) commented:

@MrApe Thank you for your report. We would like to know your LeoFS environment and its current state, as below (see the collection sketch after this list):

  • Environment
    • LeoFS version
    • Erlang version
    • What kind of virtualization (VMware/Docker/Xen...) are you using, or is it bare metal?
    • What operating system (uname -a), processor architecture (cat /proc/cpuinfo), and memory (cat /proc/meminfo) are you using?
  • Command histories (grep leofs-adm)
  • Latest LeoFS state from leofs-adm status
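
A minimal sketch of collecting that information on a storage node (the commands simply mirror the list above; run them on each node and attach the output):

# OS, CPU, memory, and Erlang release
uname -a
cat /proc/cpuinfo
cat /proc/meminfo
erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'

# leofs-adm command history and current cluster state
# (the LeoFS version appears in the status output)
history | grep leofs-adm
leofs-adm status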

mocchira (Member) commented Mar 9, 2018

@MrApe Let me ask a few additional questions.

  • The exact s3cmd command, with its parameters, that you used when deleting files
  • Error logs if possible.

If your LeoFS version isn't the latest one (1.3.8) and you used s3cmd rb or s3cmd del/rm -r to delete files, then you may have hit some of the known issues described in #725. If that's the case, the fundamental fix would be to upgrade LeoFS to the latest stable release, 1.3.8.
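
For reference, a hedged illustration of the distinction above (bucket and key names are made up):

# Bucket-level / recursive deletes that could trigger the known issues from #725 on older releases
s3cmd rb --recursive s3://example-bucket
s3cmd del -r s3://example-bucket/some-prefix/

# A plain single-object delete, for comparison
s3cmd del s3://example-bucket/some-prefix/one-object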

MrApe (Author) commented Mar 9, 2018

Thanks for having a look. This is the environment:

  • LeoFS: 1.3.4 (from Project FiFo)
  • Erlang: 19.1
  • The nodes are SmartOS zones
  • System: SunOS 5.11 joyent_20170622T212149Z i86pc i386 i86pc
  • The hypervisors are 4 different machines, but all with an x86 Xeon E5-263x
  • zonememstat (on 4 distinct hypervisors but copied together for better readability):
                                 ZONE  RSS(MB)  CAP(MB)  NOVER  POUT(MB) SWAP%
 fce69b6c-27bf-eac4-e01c-a4f143307a24     1128     4096      0         0 18,07
 76a9a6e6-99da-4c94-a2fd-9dae3c817df2      377     4096      0         0  4,56
 0e4f8731-4dc1-e734-8d10-852396989f60     1192     4096      0         0 20,43
 e221d1fa-4614-ecc3-af09-cb0cf8d95ecd      224     4096      0         0  3,46
  • Leofs command history is attached: history.txt
  • Leofs status: status.txt
  • The s3cmd command used (same structure, but different files):
s3cmd -c ~/.s3cfg-fifo del s3://fifo-backups/65f52ed3-f909-c99b-c388-e4be1a34cc5a/ff5058cf-0c5f-4190-ab27-312d9a63ff32

The only content of the leo_storage logs is error messages like the one above. There are no errors in the manager or gateway logs.

I did not use -r or rb. Thanks again for helping!
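
One hedged way to double-check what that delete left behind is to list the parent prefix of the key from the command above; any keys still present under it will show up:

s3cmd -c ~/.s3cfg-fifo ls s3://fifo-backups/65f52ed3-f909-c99b-c388-e4be1a34cc5a/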

yosukehara (Member) commented Mar 14, 2018

@MrApe Thank you for sharing the detailed report. We're going to start investigating this issue today.

mocchira (Member) commented:

WIP

mocchira (Member) commented Apr 4, 2018

@MrApe I've tried to reproduce this case, but still no luck. However, 1.3.4 has many bugs related to handling large objects, so I'd recommend you upgrade to >= 1.3.8 (1.4.0 would be best at the moment).

Regarding the remaining queue items, would you like to try this procedure: https://gist.github.com/mocchira/1c4852c57c7b328aef46eb234b74093b ? That should free up the queue on leo_storage. I hope you will find it helpful.
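
Not a substitute for the linked gist, but for orientation, here is a sketch of the leofs-adm MQ commands such a cleanup typically revolves around (the node name is a placeholder, and the availability of mq-suspend/mq-resume depends on your LeoFS version):

# Check how many messages remain in each queue (node name is a placeholder)
leofs-adm mq-stats [email protected]

# Pause consumption of the affected queue before any manual intervention...
leofs-adm mq-suspend [email protected] leo_per_object_queue

# ...and resume it once done
leofs-adm mq-resume [email protected] leo_per_object_queue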

MrApe (Author) commented Apr 4, 2018 via email

mocchira (Member) commented Apr 4, 2018

@MrApe Great to hear that. I will close this issue; however, if you find anything wrong after upgrading, feel free to reopen it or file another issue :)

mocchira closed this as completed Apr 4, 2018
mocchira self-assigned this Apr 20, 2018