"Filter newer" #19

Open
tailsu opened this issue Sep 10, 2019 · 15 comments
Labels
enhancement New feature or request

Comments

@tailsu commented Sep 10, 2019

It would be great if there were an option to not sync files that exist at the destination and are newer than the source file. I think this is even the default behavior of aws s3 sync. This would really help with incremental syncing. --filter-modified seems to actually do double work, instead of saving work in the case of incremental syncing.

@larrabee (Owner)

Hello.
You can use the filter --filter-after-mtime <timestamp of last sync>. With this option, only files modified after the given timestamp will be synced.
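For example, if the previous sync finished at Unix time 1568073600, the invocation would look roughly like this (a sketch only: the fs:// and s3:// path forms shown here are assumptions, so check s3sync --help for the exact syntax):

```
s3sync --filter-after-mtime 1568073600 fs:///data/photos s3://my-bucket/photos
```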

@tailsu commented Sep 10, 2019

Thanks for the quick reply. Does --filter-after-mtime apply to the source file timestamp or to the destination one?

@larrabee (Owner)

It applies to the source file's timestamp.

@tailsu commented Sep 10, 2019

OK, I see what you mean. My use case is slightly different: if I Ctrl-C in the middle of a sync, I'd like s3sync to pick up where it left off when I restart it, and not re-upload files it already uploaded on the previous run. In this case there are no "new" files in the source, so I can't use --filter-after-mtime. aws s3 sync already does this "don't upload files that are newer in the destination" check by default, so having it in s3sync would both improve performance and align the semantics with the AWS CLI.

@tailsu commented Sep 10, 2019

Besides, mtime, that is, "last modified", is not the same as the creation time. If I copy a file, its last-modified time gets copied with it, so I can have a file that is older than the timestamp of the last sync but was created (copied into the source folder) after the sync. Please correct me if I'm wrong on this.

@larrabee (Owner)

> OK, I see what you mean. My use case is slightly different: if I Ctrl-C in the middle of a sync, I'd like s3sync to pick up where it left off when I restart it, and not re-upload files it already uploaded on the previous run. In this case there are no "new" files in the source, so I can't use --filter-after-mtime. aws s3 sync already does this "don't upload files that are newer in the destination" check by default, so having it in s3sync would both improve performance and align the semantics with the AWS CLI.

I understand. In this case only --filter-modified would help. Yes, it does a metadata request for each object, but your proposed filter would do the same.

Your filter would look something like this:

```go
// Assumed imports from the s3sync repo itself:
import (
	"github.com/larrabee/s3sync/pipeline"
	"github.com/larrabee/s3sync/storage"
)

// FilterObjectsNewer passes a source object through only when the
// destination copy is missing or older than the source.
var FilterObjectsNewer pipeline.StepFn = func(group *pipeline.Group, stepNum int, input <-chan *storage.Object, output chan<- *storage.Object, errChan chan<- error) {
	for obj := range input {
		destObj := &storage.Object{
			Key:       obj.Key,
			VersionId: obj.VersionId,
		}
		// One metadata request per object against the target storage.
		err := group.Target.GetObjectMeta(destObj)
		// Pass the object on if the destination copy is absent
		// (the metadata request failed) or strictly older.
		if (err != nil) || (obj.Mtime.Unix() > destObj.Mtime.Unix()) {
			output <- obj
		}
	}
}
```
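Note that this treats any metadata error as "object missing at the destination", so a transient error would cause a re-upload rather than a skip; for a resumable sync that is the safe direction to fail.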

> Besides, mtime, that is, "last modified", is not the same as the creation time.

S3 does not have a creation time (or I could not find one). And no: for a copied object, last-modified is set to the current time, not to the original file's mtime. You also cannot set a custom mtime; it is set by the S3 server automatically.
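You can verify this with a HeadObject call; here is a minimal sketch using the AWS SDK for Go (bucket, key, and region are placeholders). The response carries a server-set LastModified and no creation timestamp:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := s3.New(sess)

	// HeadObject returns the object's metadata without the body.
	out, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String("my-bucket"),   // placeholder bucket
		Key:    aws.String("path/to/key"), // placeholder key
	})
	if err != nil {
		log.Fatal(err)
	}
	// LastModified is set by S3 on every PUT/COPY; there is no creation-time field.
	fmt.Println("LastModified:", out.LastModified)
}
```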

@tailsu commented Sep 11, 2019

I may have been unclear. My use case is syncing a local folder to S3. I can sync a local folder to S3 and then add files to it whose mtime is earlier than the time of the sync, because mtime gets copied along when you copy files. In that situation I won't be able to use --filter-after-mtime. That's why it's important to have a filter that checks both "is older at the destination" and "does not exist at the destination".

When I used --filter-modified, the sync rate dropped from 350 files/s to about 150 files/s, so I reverted to a regular sync.

Looking at the new filter, it seems to be exactly what I need. When can I test it? :)

@tailsu commented Sep 11, 2019

Looking at the filter code again, it still does an extra operation, group.Target.GetObjectMeta(destObj), to check whether the file exists at the destination. If the destination is an S3 path, it would be more efficient to do a single ListObjectsV2 on the destination instead of multiple GetObjectMeta calls. ListObjectsV2 also returns the mtime of each object.

Ideally, if I sync a local folder to an S3 destination that already has exactly the same files, the only S3 operation s3sync should perform is a ListObjectsV2 to check existence and freshness. That would give maximum throughput for incremental syncs.
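A minimal sketch of that single-listing approach with the AWS SDK for Go (bucket and prefix are placeholders), using the paginated form of ListObjectsV2, which returns up to 1,000 keys per request together with their mtimes:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := s3.New(sess)

	// One paginated listing instead of a HeadObject per key.
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String("my-bucket"), // placeholder bucket
		Prefix: aws.String("prefix/"),   // placeholder prefix
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			// Each entry already carries the key and its LastModified time.
			fmt.Println(*obj.Key, obj.LastModified.Unix())
		}
		return true // keep fetching pages
	})
	if err != nil {
		log.Fatal(err)
	}
}
```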

@larrabee (Owner) commented Sep 11, 2019

ListObjectsV2 is not a universal solution. The destination can have millions of files in one directory, so we would have to list them all and keep them in memory or a local cache just to get the mtime of the files we need (in the worst case, of a single file).
Caching synced files' metadata locally is not safe, because a file can change on S3 while the local cache still holds the old metadata. Additionally, the destination bucket can have billions of files, and building a local cache would take too much time and use too much memory (RAM or disk).
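As a rough back-of-the-envelope estimate (assuming an average key of ~60-70 bytes plus an 8-byte mtime and some container overhead, call it ~100 bytes per entry): caching metadata for 10^9 objects would take on the order of 100 GB.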

Please keep in mind that the tool is designed to sync very large buckets with millions of files; this is a key requirement.

@tailsu commented Sep 12, 2019

> ListObjectsV2 is not a universal solution.

I agree, but I think it covers the most important case: using sync to keep two folders in sync. In that case you'd have a billion objects at both the source and the destination, and ListObjectsV2 is faster than checking each object separately.

I haven't thought about the cache size needed to store all the metadata.

@tailsu commented Sep 12, 2019

https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html

> List results are always returned in UTF-8 binary order.

Which means that the object list returned by ListObjectsV2 is always sorted. You don't need to cache the folder list: you can list both the source and the destination in parallel and compare the entries in order, as in the sketch below. This is how the AWS CLI makes sync efficient.
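A minimal sketch of that merge-style comparison (the entry type and sample data are illustrative, not s3sync's actual structures): walk two key-sorted listings in lockstep and emit the source entries that are missing at the destination or newer than the destination copy.

```go
package main

import "fmt"

// entry pairs an object key with its mtime in Unix seconds.
type entry struct {
	key   string
	mtime int64
}

// diffSorted walks two listings that are already sorted by key
// (the order ListObjectsV2 guarantees) and returns the source
// entries that need uploading.
func diffSorted(src, dst []entry) []entry {
	var out []entry
	i, j := 0, 0
	for i < len(src) {
		switch {
		case j >= len(dst) || src[i].key < dst[j].key:
			out = append(out, src[i]) // key missing at destination: upload
			i++
		case src[i].key > dst[j].key:
			j++ // destination-only key: irrelevant for upload
		default:
			if src[i].mtime > dst[j].mtime {
				out = append(out, src[i]) // source copy is newer: upload
			}
			i++
			j++
		}
	}
	return out
}

func main() {
	src := []entry{{"a", 10}, {"b", 20}, {"c", 30}}
	dst := []entry{{"a", 10}, {"b", 15}}
	fmt.Println(diffSorted(src, dst)) // prints [{b 20} {c 30}]
}
```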

@larrabee (Owner)

I mean this case:

```
>> ls source/dir/a/
x1_file
x2_file
<hundreds of files...>
x1238884_file

>> ls dest/dir/a
a1_file
a2_file
a3_file
<millions of files...>
a239885949_file
```

In this case we would have to list millions of files at the destination just to upload a small number of files. This is why aws s3 sync has such issues.

Additionally, this synchronization algorithm fits very poorly with the architecture of the current application.

@tailsu commented Sep 13, 2019

> Additionally, this synchronization algorithm fits very poorly with the architecture of the current application.

Nothing to do about this one now, is there? :)

@larrabee (Owner)

> Nothing to do about this one now, is there? :)

With the ListObjectsV2 algorithm, yes.
But I can add the FilterObjectsNewer filter described above.

@larrabee added the enhancement (New feature or request) label on Jan 15, 2020
@yarikoptic

Any progress on this? We need to sync a large bucket (currently ~300 million keys) for local backup, with incremental (e.g. daily) invocations to make sure we have an up-to-date backup locally.

  • IIRC I failed to find a reliable way to get a listing of only recent changes from a bucket, which is why we are building a solution based on S3 Inventory (https://github.com/dandi/s3invsync), so the analysis of what is newer since yesterday is performed locally on inventory dumps (more efficient to obtain, but still huge at the moment)
  • How do you deal with multiple versionIds for a key (possibly with a "trailing" DeleteMarker)?
