
Many cause,not_found errors producing GBs of logs #1001

Closed
MrApe opened this issue Mar 8, 2018 · 8 comments

MrApe commented Mar 8, 2018

I have a 4-node cluster with N=3, D=R=W=2. I deleted a lot of files using s3cmd, and now each storage node is producing GBs of error logs like this:

[E]     [email protected]  2018-03-08 12:26:35.541856 +0100        1520508395      leo_storage_handler_object:get/1        89      [{from,storage},{method,get},{key,<<"fifo-backups/056054ce-5722-4c7a-ae96-835403748465/ff6b25c1-830a-42c5-82f0-5386108cc64c\n76">>},{cause,not_found}]

(as you can see, it's a Project FiFo installation doing backups to LeoFS)

The objects all belong to the deleted files; the files I did not delete are fine. I ran leofs-adm recover-cluster to make the nodes detect orphan objects and rebuild the ring. leofs-adm mq-stats has been showing this as running for about a week(!) now:

              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------+----------------+----------------+---------------------------------------------
[...]
 leo_per_object_queue           |   running   | 38105          | 1000           | 300            | recover inconsistent objs
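
For scale, a rough sketch of how the error volume and the queue drain can be tracked (the log path and node name below are illustrative assumptions, not taken from this cluster):

# Count not_found errors per error-log file on a storage node
# (adjust the path to wherever leo_storage writes its error logs)
grep -c 'cause,not_found' /path/to/leo_storage/log/app/error.*

# Re-run periodically to see whether the recover queue is draining
# (node name is a placeholder for one of your storage nodes)
leofs-adm mq-stats [email protected]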

What is producing these errors and how do I solve this problem?

Thanks in advance.
Best, Jonas

yosukehara (Member) commented:

@MrApe Thank you for your report. We would like to know your LeoFS environment and its current state, as below (see the collection sketch after this list):

  • Environment
    • LeoFS version
    • Erlang version
    • What kind of virtualization (VMware/Docker/Xen...) are you using, or is it bare metal?
    • What operating system (uname -a), processor architecture (cat /proc/cpuinfo), and memory (cat /proc/meminfo) are you using?
  • Command histories (grep leofs-adm)
  • Latest LeoFS state from leofs-adm status
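
A minimal sketch of collecting that information on a storage node (the commands simply mirror the list above; run them on each node and attach the output):

# OS, CPU, memory, and Erlang release
uname -a
cat /proc/cpuinfo
cat /proc/meminfo
erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'

# leofs-adm command history and current cluster state
# (the LeoFS version appears in the status output)
history | grep leofs-adm
leofs-adm status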

mocchira (Member) commented Mar 9, 2018

@MrApe Let me ask a few additional questions.

  • The exact s3cmd command, with its parameters, that you used when deleting files
  • Error logs if possible.

If your LeoFS version isn't the latest one (1.3.8) and you used s3cmd rb or s3cmd del/rm -r to delete files, then you may have hit some of the known issues described in #725. If that's the case, the fundamental fix would be to upgrade LeoFS to the latest stable release, 1.3.8.
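
For reference, a hedged illustration of the distinction above (bucket and key names are made up):

# Bucket-level / recursive deletes that could trigger the known issues from #725 on older releases
s3cmd rb --recursive s3://example-bucket
s3cmd del -r s3://example-bucket/some-prefix/

# A plain single-object delete, for comparison
s3cmd del s3://example-bucket/some-prefix/one-object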

MrApe (Author) commented Mar 9, 2018

Thanks for having a look. This is the environment:

  • LeoFS: 1.3.4 (from Project FiFo)
  • Erlang: 19.1
  • The nodes are SmartOS zones
  • System: SunOS 5.11 joyent_20170622T212149Z i86pc i386 i86pc
  • The hypervisors are 4 different machines, but all with an x86 Xeon E5-263x
  • zonememstat (on 4 distinct hypervisors but copied together for better readability):
                                 ZONE  RSS(MB)  CAP(MB)  NOVER  POUT(MB) SWAP%
 fce69b6c-27bf-eac4-e01c-a4f143307a24     1128     4096      0         0 18,07
 76a9a6e6-99da-4c94-a2fd-9dae3c817df2      377     4096      0         0  4,56
 0e4f8731-4dc1-e734-8d10-852396989f60     1192     4096      0         0 20,43
 e221d1fa-4614-ecc3-af09-cb0cf8d95ecd      224     4096      0         0  3,46
  • Leofs command history is attached: history.txt
  • Leofs status: status.txt
  • The s3cmd command used (same structure, but different files):
s3cmd -c ~/.s3cfg-fifo del s3://fifo-backups/65f52ed3-f909-c99b-c388-e4be1a34cc5a/ff5058cf-0c5f-4190-ab27-312d9a63ff32

The only content of the leo_storage logs is error messages like the one above. There are no errors in the manager or gateway logs.

I did not use -r or rb. Thanks again for helping!
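
One hedged way to double-check what that delete left behind is to list the parent prefix of the key from the command above; any keys still present under it will show up:

s3cmd -c ~/.s3cfg-fifo ls s3://fifo-backups/65f52ed3-f909-c99b-c388-e4be1a34cc5a/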

yosukehara (Member) commented Mar 14, 2018

@MrApe Thank you for sharing the detailed report. We're going to start investigating this issue today.

mocchira (Member) commented:

WIP

mocchira (Member) commented Apr 4, 2018

@MrApe I've tried to reproduce this case, but still no luck. However, 1.3.4 has many bugs related to handling large objects, so I'd recommend you upgrade to >= 1.3.8 (1.4.0 would be best at the moment).

Regarding the remaining queue items, would you like to try this procedure: https://gist.github.com/mocchira/1c4852c57c7b328aef46eb234b74093b ? That should free up the queue on leo_storage. I hope you will find it helpful.
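
Not a substitute for the linked gist, but for orientation, here is a sketch of the leofs-adm MQ commands such a cleanup typically revolves around (the node name is a placeholder, and the availability of mq-suspend/mq-resume depends on your LeoFS version):

# Check how many messages remain in each queue (node name is a placeholder)
leofs-adm mq-stats [email protected]

# Pause consumption of the affected queue before any manual intervention...
leofs-adm mq-suspend [email protected] leo_per_object_queue

# ...and resume it once done
leofs-adm mq-resume [email protected] leo_per_object_queue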

MrApe (Author) commented Apr 4, 2018 via email

mocchira (Member) commented Apr 4, 2018

@MrApe Great to hear that. I will close this issue; however, if you find anything wrong after upgrading, feel free to reopen it or file another issue :)

mocchira closed this as completed Apr 4, 2018
mocchira self-assigned this Apr 20, 2018