Errors about multipart object parts on storages during upload #845
WIP
@vstax Thanks for reporting this problem. The sequence of a multipart upload (MU) on leo_gateway starts with a PUT of a temporary object suffixed by the UploadID (step 1) and ends with a DELETE of that temporary object once the upload completes (step 3).
The point is that step 3 is guaranteed to happen after step 1 on leo_gateway; however, the order can be inverted chronologically on leo_storage (especially on the secondary/third replica) because those requests are processed asynchronously. This behavior first causes the "not found" error, when step 3 arrives before step 1 has reached leo_storage, and then leaves the temporary object suffixed by the UploadID behind, because step 1 arrives after step 3 has already been processed. A permanent fix is difficult (ensuring the causality between steps 1 and 3 would require implementing some consensus algorithm), so for now we are considering a fix that decreases the odds by removing the temporary object only after confirming the checksum.
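To make the reordering concrete, here is a toy sketch (not LeoFS code; the in-memory store is invented for illustration, and the "<key>\n<UploadID>" key layout is taken from the recover-file example later in this thread) of a replica applying step 3 before step 1: the DELETE logs "not found", and the late PUT then leaves the temporary object behind.

```python
def apply_op(store, errors, op, key):
    """Apply one replicated operation to a replica's local store."""
    if op == "PUT":
        store[key] = "temporary multipart marker"
    elif op == "DELETE":
        if key in store:
            del store[key]
        else:
            errors.append(f"not found: {key!r}")   # the error seen on the replicas

gateway_order = [("PUT", "bucket/object\nUploadID"),     # step 1 (initiate MU)
                 ("DELETE", "bucket/object\nUploadID")]  # step 3 (complete MU)
replica_order = list(reversed(gateway_order))            # async delivery inverts the order

store, errors = {}, []
for op, key in replica_order:
    apply_op(store, errors, op, key)

print(errors)  # ['not found: ...']  -> the logged error
print(store)   # the temporary object is left behind
```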
Yes, 5 extra ones are
@mocchira Thank you for the analysis. Glad to know this won't affect the real data. A few questions:
Also, running recover-file against the temporary object doesn't do anything.
Maybe (though IMO the cost outweighs the benefit).
Once the above fix lands, the odds of seeing inconsistencies should decrease dramatically, because confirming the checksum takes N round trips (N = the number of chunks of a large object) between leo_gateway and the leo_storage node(s). So I'd recommend waiting for the fix rather than adopting PTP.
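A minimal sketch of that mitigation, under stated assumptions (head_chunk and delete_temp_object are hypothetical helpers, and the "<key>\n<index>" chunk naming is an assumption, not LeoFS internals), just to show why it costs one round trip per chunk before the temporary object is removed:

```python
def confirm_then_cleanup(key, upload_id, chunk_count, head_chunk, delete_temp_object):
    """Confirm all N chunks (e.g. their checksums) before deleting the temp object."""
    for index in range(1, chunk_count + 1):          # one round trip per chunk
        if head_chunk(f"{key}\n{index}") is None:    # chunk missing or unreadable
            return False                             # keep the temporary object
    delete_temp_object(f"{key}\n{upload_id}")        # only now remove it
    return True

# Example with stand-in callables:
ok = confirm_then_cleanup("bucket/object", "UploadID", 2,
                          head_chunk=lambda k: {"checksum": "..."},
                          delete_temp_object=lambda k: None)
```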
Hmm, recover-file should work even if the target is a temporary object, so I will investigate further.
@vstax As you may know, this has been fixed (more precisely, the odds of seeing inconsistencies have been decreased), so give it a try if you have time.
@mocchira Thank you, I will (I need to finish the recover-node experiments before wiping the data, so this will have to wait a bit). We also have PTP now, so it won't be exactly the same experiment, but eventually I will be uploading much more data, so it should get plenty of testing. I've got a question about recover-file, though: is it supposed to work, or am I doing it the wrong way? If needed, I can provide the results of get/head API calls directly, like in the other ticket.
Got it.
recover-file should work against temporary objects. Since the temporary object's key is the original key suffixed with "\n" and the UploadID, it needs to be passed quoted:
leofs-adm recover-file "body/12/bf/0d/12bf0db7d8bcdf91a3a42ba97867bf6b785b0113754238dce5a4952524cb13182a7a2e64cfe7b3614a4c4551f0db4adf0002250100000000.xz\na7c6516a2e15f44a0e037043c88736d8"
@mocchira Nope, it doesn't work; nothing happens. EDIT: it works just fine now with the latest patches.
I'm uploading data to a production cluster (6 storage servers, N=3, W=2, R=1). It's the latest develop version (with the latest leo_object_storage as well). The code and logic are exactly the same as in #722: Python code walks through the filesystem and, for each object it finds, executes a HEAD to see whether it's already on storage; if it isn't, it executes a PUT and uploads the object. It uses boto3. In this experiment there were no existing objects, so it's always PUT after HEAD. There is no load on the cluster other than the uploading. The upload is performed in parallel with 6 processes, but boto3's threads for multipart uploads are disabled, so each upload runs in a single thread.
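For context, a minimal sketch of that walk-HEAD-PUT loop (not the actual script; the endpoint URL, bucket name, and local path are assumptions), with boto3 retries disabled and multipart threads turned off:

```python
import os

import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config
from botocore.exceptions import ClientError

ENDPOINT = "http://leofs-gateway:8080"  # assumption: gateway address
BUCKET = "body"                         # assumption: bucket name
ROOT = "/data/to/upload"                # assumption: local tree being walked

# No retries (legacy retry mode); multipart uploads still happen for large
# files, but in a single thread.
s3 = boto3.client("s3", endpoint_url=ENDPOINT,
                  config=Config(retries={"max_attempts": 0}))
transfer_config = TransferConfig(use_threads=False)

for dirpath, _dirs, files in os.walk(ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        key = os.path.relpath(path, ROOT)
        try:
            s3.head_object(Bucket=BUCKET, Key=key)   # already on storage?
            continue
        except ClientError as err:
            if err.response["ResponseMetadata"]["HTTPStatusCode"] != 404:
                raise
        s3.upload_file(path, BUCKET, key, Config=transfer_config)
```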
(The uploaded data can be scrapped and I can upload it again; it's not a problem at this point. I can repeat the experiment after changing some settings, if needed.)
I'm getting errors on the storage nodes and the object state isn't consistent; however, the alarming part is that there are no errors on the client side. I'm getting a 200 result for everything, so the client assumes these objects were safely uploaded. Retries are disabled in boto3, so apparently it really doesn't get any error at all.
Here are the errors on the storage nodes. Log on bodies03:
Object status:
Log on bodies05:
Object status:
The main part of these (multipart) objects is always fine:
(same for any other)
There are a lot more objects like that on all nodes, actually. For some reason, all the examples I've checked have bodies02 as the primary node. There are lots of similar errors in the bodies02 log (it looks like every error on the other nodes has a corresponding error in the bodies02 log), so I grepped just the ones related to these objects. The remaining errors (about other objects) look exactly the same. All errors are about parts of multipart objects, by the way.
There are no errors of any other kind on any node. There are occasional
messages on all nodes (the beam.smp never uses more than ~1.2-1.3 GB, though).
There are no hardware problems (disk/CPU/memory) on bodies02 or any other node; all servers are identical as well. I can't vouch for the network hardware, though: problems there aren't too likely, but they are possible in theory. That is, I can't rule out the possibility that bodies02 is connected to a different switch or differs from the other nodes in some other way, network-wise. At least there are no errors from the kernel and no errors of any kind in "ethtool -S" on any node.
Access goes through a single gateway right now. There are no errors or info messages on the gateway at all; however, the CPU watchdog triggers since it's running on a server with some CPU load:
I suppose I'll just disable it. Not sure if it affects this or not.
Status of all nodes is fine, all queues are empty.
EDIT: In case this is useful, I executed diagnose-start on each node; here are the mentions of the first object referenced above (89369fc...) in the logs of all nodes:
On bodies01:
On bodies02:
On bodies03:
On bodies04:
On bodies05:
On bodies06:
This object's size is 5255972 bytes:
So I'm supposed to get two parts, one of 5 MB and another of around 13 KB (plus an object for the multipart header?), but there seem to be 5 extra ones, two of which are deleted?
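(A quick check of that expectation, assuming the default 5 MiB chunk size: 5,255,972 − 5,242,880 = 13,092 bytes, i.e. one full 5 MiB part plus a ~13 KB remainder, so two data parts would indeed be expected.)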