s3cmd sync and put/get with -r option is not working #367

Closed
kentarosasaki opened this issue May 21, 2015 · 16 comments

@kentarosasaki

  • Abstract:
    I found that s3cmd sync and put/get with the -r option are not working.
  • Environment
    What version of LeoFS are you using?
    1.2.8
  • What version of Erlang are you using?
    erts-6.3
  • What kind of virtualization(VMWare/Docker/Xen...) are you using or using Baremetal?
    VMware
  • What operating system(uname -a) and processor architecture(cat /proc/cpuinfo) and memory(cat /proc/meminfo) are you using?
    OS
    Linux XXXXXX 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
    CPU
    4core (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz)
    Memory
    16GB
  • What happened and How to reproduce if possible
  • What did you do?
    I ran the s3cmd sync command, using the latest s3cmd version (a fresh git pull of the s3cmd master branch).
  • What did you expect to see?
    All objects synced recursively.
  • What did you see instead?

This example uses a bucket storing Docker registry objects.
Listing of the Docker registry bucket:

$ s3cmd ls s3://image/
                       DIR   s3://image/docker1/
                       DIR   s3://image/registry/
$ s3cmd ls s3://image/docker1/
                       DIR   s3://image/docker1/images/
                       DIR   s3://image/docker1/repositories/
$ s3cmd ls s3://image/docker1/images/
                       DIR   s3://image/docker1/images/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/
$ s3cmd ls s3://image/docker1/images/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158
2015-05-11 04:56         4   s3://image/docker1/images/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/_inprogress
2015-05-11 04:56        68   s3://image/docker1/images/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/ancestry
2015-05-11 04:56       483   s3://image/docker1/images/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/json

Run sync

$ mkdir image
$ s3cmd sync s3://image/ image/
$ ls -l image/
total 0
$

Run get with -r option

$ s3cmd get -r s3://image/ image/
$ ls -l image/
total 0
$
  • Attach an erl_crash.dump file if one exists.
  • Attach syslog (dmesg) output if anything related to Erlang is logged.
@yosukehara
Member

Thank you for your report.
We do support s3cmd sync, as shown below. We'll look into this issue.

$ s3cmd sync s3://test/leo_commons/ work/
s3://test/leo_commons/.gitignore -> work/.gitignore  [1 of 8]
 96 of 96   100% in    0s    16.74 kB/s  done
s3://test/leo_commons/.travis.yml -> work/.travis.yml  [2 of 8]
 130 of 130   100% in    0s    33.88 kB/s  done
s3://test/leo_commons/LICENSE -> work/LICENSE  [3 of 8]
 10175 of 10175   100% in    0s     3.22 MB/s  done
s3://test/leo_commons/Makefile -> work/Makefile  [4 of 8]
 995 of 995   100% in    0s   351.80 kB/s  done
s3://test/leo_commons/README.md -> work/README.md  [5 of 8]
 452 of 452   100% in    0s   167.01 kB/s  done
s3://test/leo_commons/make_rst_doc.sh -> work/make_rst_doc.sh  [6 of 8]
 599 of 599   100% in    0s   225.33 kB/s  done
s3://test/leo_commons/rebar -> work/rebar  [7 of 8]
 181045 of 181045   100% in    0s    30.04 MB/s  done
s3://test/leo_commons/rebar.config -> work/rebar.config  [8 of 8]
 1203 of 1203   100% in    0s   432.23 kB/s  done
Done. Downloaded 194695 bytes in 0.0 seconds, 5.49 MB/s

@kentarosasaki
Author

If I sync a directory hierarchy similar to the one you mentioned, as below, it succeeds, because the directory structure is flat.

s3cmd sync s3://test/leo_commons/ work/

With a deep directory hierarchy like the Docker registry's, sync does not work, as shown in the comment above.

@windkit
Contributor

windkit commented May 22, 2015

This is probably related to how leo_gateway handles the delimiter.

s3cmd first fetches the object list of the bucket without a delimiter, so a complete list of every object (including those in logical "sub-dirs") should be returned (flat namespace).

In LeoFS, however, the delimiter is always assumed to be "/", so the gateway always returns only the objects and "DIR"s of the current directory level (hierarchical structure).

Example of Prefix & Delimiter
http://www.bucketexplorer.com/documentation/amazon-s3--listing-keys-hierarchically-using-prefix-and-delimiter.html
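
For illustration, here is a minimal Erlang sketch of those ListObjects grouping semantics (module and function names are made up for the example; this is not LeoFS code). With no delimiter, every matching key comes back flat; with a delimiter, keys below the first "/" after the prefix are rolled up into a single "DIR" (CommonPrefixes) entry:

-module(s3_list_sketch).
-export([group/3]).

%% group(Keys, Prefix, Delimiter) -> {Contents, CommonPrefixes}
%% No delimiter: every key under the prefix, flat (what s3cmd expects).
group(Keys, Prefix, undefined) ->
    {[K || K <- Keys, lists:prefix(Prefix, K)], []};
%% With a delimiter: keys containing it after the prefix are grouped.
group(Keys, Prefix, Delim) ->
    lists:foldr(
      fun(K, {Cs, Ps} = Acc) ->
              case lists:prefix(Prefix, K) of
                  false -> Acc;
                  true ->
                      Rest = lists:nthtail(length(Prefix), K),
                      case string:str(Rest, Delim) of
                          0 -> {[K | Cs], Ps};          %% plain object
                          N ->                          %% roll up to a "DIR"
                              P = Prefix ++ string:substr(Rest, 1, N),
                              {Cs, lists:usort([P | Ps])}
                      end
              end
      end, {[], []}, Keys).

For example, group(["docker1/images/a/json"], "docker1/", "/") returns no Contents and one CommonPrefix "docker1/images/", while passing undefined as the delimiter returns the full key list.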

@mocchira
Member

@windkit Thank you for surveying this issue.
You're right.
We need to handle the delimiter URL parameter properly.

@yosukehara
Member

@windkit Thanks 👍

@windkit
Contributor

windkit commented Nov 2, 2015

@yosukehara @mocchira are you working on this? If not I will check it.

@yosukehara
Member

@windkit Thanks, please go ahead 👍

@windkit
Contributor

windkit commented Nov 5, 2015

Description

To summarize the situation with prefix and delimiter:
Reference (S3 Doc): http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

S3 Spec

Retrieve keys starting with the prefix, grouping keys beyond the first occurrence of the delimiter (after the prefix) into common prefixes.

LeoFS

The prefix is treated as a directory path, and the delimiter is assumed to be /.
Therefore, it always returns the list of "files" and "sub-directories" under that directory.

This Issue

s3cmd does not specify a delimiter character, so it expects to retrieve the list of all files (recursively) under the directory.

Idea

As a quick fix, we can have a separate code path to handle the case when no delimiter is specified. Effectively, leo_gateway would keep traversing down the directory tree and consolidate the results, roughly as sketched below.
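
A minimal sketch of that traversal, assuming a hypothetical list_dir/1 that stands in for LeoFS's per-directory lookup (the record shape and names are illustrative, not the actual change):

-module(recursive_find_sketch).
-export([recursive_find/1]).

-record(entry, {path, type}).

%% Stand-in for the per-directory listing; stubbed out for the sketch.
list_dir(_Dir) ->
    [].

%% No delimiter given: descend into each "DIR" entry and consolidate
%% everything into one flat metadata list.
recursive_find(Dir) ->
    lists:flatmap(
      fun(#entry{type = dir, path = Sub}) -> recursive_find(Sub);
         (#entry{type = file} = Meta)     -> [Meta]
      end,
      list_dir(Dir)).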

@mocchira
Member

mocchira commented Nov 6, 2015

@windkit +1.
Let's also file a separate issue for LeoFS not complying with the S3 spec for GET Bucket (RESTBucketGET).

@windkit
Contributor

windkit commented Nov 6, 2015

I have made a PR for the quick fix at
leo-project/leo_gateway#31

Test Result

$ ./s3cmd ls -r s3://test
2015-11-06 05:13      1602   s3://test/Makefile
2015-11-06 05:13      1602   s3://test/dir1/Makefile
2015-11-06 05:13      1602   s3://test/dir1/dir2/Makefile
$ ./s3cmd sync s3://test temp/
s3://test/Makefile -> <fdopen>  [1 of 1]
 1602 of 1602   100% in    0s   151.61 kB/s  done
Done. Downloaded 1602 bytes in 1.0 seconds, 1602.00 B/s

$ ls -lR temp
temp:
total 8
drwxrwxr-x 3 user user 4096 Nov  6 14:05 dir1
-rwxrwxr-x 1 user user 1602 Nov  6 05:13 Makefile

temp/dir1:
total 8
drwxrwxr-x 2 user user 4096 Nov  6 14:05 dir2
-rwxrwxr-x 1 user user 1602 Nov  6 05:13 Makefile

temp/dir1/dir2:
total 4
-rwxrwxr-x 1 user user 1602 Nov  6 05:13 Makefile
$ ./s3cmd get -r s3://test temp2
s3://test/Makefile -> temp2/Makefile  [1 of 3]
 1602 of 1602   100% in    0s   250.35 kB/s  done
s3://test/dir1/Makefile -> temp2/dir1/Makefile  [2 of 3]
 1602 of 1602   100% in    0s   184.44 kB/s  done
s3://test/dir1/dir2/Makefile -> temp2/dir1/dir2/Makefile  [3 of 3]
 1602 of 1602   100% in    0s   358.00 kB/s  done

$ ls -lR temp2
temp2:
total 8
drwxrwxr-x 3 user user 4096 Nov  6 14:18 dir1
-rw-rw-r-- 1 user user 1602 Nov  6 05:13 Makefile

temp2/dir1:
total 8
drwxrwxr-x 2 user user 4096 Nov  6 14:18 dir2
-rw-rw-r-- 1 user user 1602 Nov  6 05:13 Makefile

temp2/dir1/dir2:
total 4
-rw-rw-r-- 1 user user 1602 Nov  6 05:13 Makefile

@mocchira
Member

mocchira commented Nov 6, 2015

@windkit This fix can cause out-of-memory errors if there are lots of files in LeoFS.
So we need to respond in a streaming manner.

@windkit
Contributor

windkit commented Nov 7, 2015

@mocchira If that is a concern, we may need a big rework. To start with, leo_storage_handler_directory calls such as find_by_parent_dir return the whole metadata list of a directory at once, which would also be a problem when there are lots of files under one directory in LeoFS.

As the S3 spec caps the number of results returned at 1000, I would first add a check on that (see the sketch below). This should limit memory usage to a reasonable level.

If you do feel it is necessary to take a streaming approach, I could move on to using body_fun for sending out the response, so the output list MetadataList would not be kept in recursive_find.
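
For illustration, a sketch of such a cap, assuming the consolidated MetadataList has already been built (module and function names are made up; in a real reply the cut-off would map to the IsTruncated field):

-module(max_keys_sketch).
-export([cap_results/1]).

%% S3 default for max-keys.
-define(DEFAULT_MAX_KEYS, 1000).

%% Truncate the consolidated listing to the default max-keys and report
%% whether anything was cut off.
cap_results(MetadataList) ->
    case length(MetadataList) > ?DEFAULT_MAX_KEYS of
        true  -> {lists:sublist(MetadataList, ?DEFAULT_MAX_KEYS), true};
        false -> {MetadataList, false}
    end.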

@windkit
Contributor

windkit commented Nov 8, 2015

See #432 for constructing the XML entries.

@windkit
Contributor

windkit commented Nov 9, 2015

I have updated the PR to stream the results to the client.
Alongside, I have created a few helper functions (generate_list_head_xml, generate_list_foot_xml, generate_list_file_xml) to construct the reply with binary construction; a rough sketch of their shape is below.
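
Roughly, the helpers have this shape (a sketch only, not the actual PR code; the real signatures and XML fields may differ). Building the reply piecewise as binaries means the head, each <Contents> entry, and the foot can be sent out as they are produced, instead of accumulating one huge string:

-module(list_xml_sketch).
-export([generate_list_head_xml/1,
         generate_list_file_xml/2,
         generate_list_foot_xml/0]).

generate_list_head_xml(Bucket) when is_binary(Bucket) ->
    <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
      "<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">",
      "<Name>", Bucket/binary, "</Name>">>.

generate_list_file_xml(Key, Size) when is_binary(Key), is_integer(Size) ->
    SizeBin = integer_to_binary(Size),
    <<"<Contents><Key>", Key/binary, "</Key>",
      "<Size>", SizeBin/binary, "</Size></Contents>">>.

generate_list_foot_xml() ->
    <<"</ListBucketResult>">>.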

@yosukehara
Member

DONE
