Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IncludePatterns and ExcludePatterns options for Copy #2082

Merged
merged 15 commits into from
Jun 1, 2021

Conversation

aaronlehmann
Copy link
Collaborator

Allow include and exclude patterns to be specified for the "copy" op,
similarly to "local".

Depends on tonistiigi/fsutil#101

cc @hinshun

@tonistiigi
Copy link
Member

What about cache? Looks to me that atm cache would still be invalidated even if files that match the pattern didn't change.

@aaronlehmann
Copy link
Collaborator Author

What about cache? Looks to me that atm cache would still be invalidated even if files that match the pattern didn't change.

Good point. I see two issues with cache right now.

  1. I'm not sure this change in its current state is invalidating cache if the include pattern or exclude pattern changes. Do I need to add something to (*fileOp).CacheMap to ensure that happens?

  2. It sounds like you're saying we should have an optimization to avoid invalidating cache if there was a change to the source that doesn't actually afffect the end result after considering the include/exclude filters. I'm not sure how to approach this. Is there any existing infrastructure that can decide whether to invalidate the cache by applying these filters and seeing if anything changed? If not, where should it be added?

@tonistiigi
Copy link
Member

Cache checksum is covered by the result of this func https://github.com/moby/buildkit/blob/master/solver/llbsolver/ops/file.go#L150 . I don't think the current contenthash.Checksum and contenthash.ChecksumWildcard are enough to cover the pattern cases so likely another function like this needs to be added to the contenthash pkg.

@tonistiigi
Copy link
Member

@aaronlehmann
Copy link
Collaborator Author

Cache checksum is covered by the result of this func https://github.com/moby/buildkit/blob/master/solver/llbsolver/ops/file.go#L150 . I don't think the current contenthash.Checksum and contenthash.ChecksumWildcard are enough to cover the pattern cases so likely another function like this needs to be added to the contenthash pkg.

Here's my rough understanding so far of what needs to be done. I wanted to run it by you before I start doing the work:

  1. Add IncludePatterns and ExcludePatterns to llbsolver.Selector so that NewContentHashFunc can pass them into contenthash.Checksum and contenthash.ChecksumWithWildcard.

  2. Rewrite addSelector and dedupeSelectors not to use llbsolver.Selector as a map key, because that won't be possible once llbsolver.Selector has slice fields.

  3. The tree walks in cache/contenthash/checksum.go need to match the include/exclude pattern when considering which files/dirs to include in the checksum. I think this should be done in (*cacheContext).wildcards and (*cacheContext).checksum (maybe it isn't necessary in wildcards, since it can be handled one level deeper in checksum?). It's better to do this at checksum time than at scan time (scanPath), so we can avoid extra FS walks, right?

@tonistiigi
Copy link
Member

Rewrite addSelector and dedupeSelectors not to use llbsolver.Selector as a map key, because that won't be possible once llbsolver.Selector has slice fields.

dedupeSelectors can also detect if one path uses filter and another doesn't then filtered one does not matter

It's better to do this at checksum time than at scan time (scanPath), so we can avoid extra FS walks, right?

I think wildcards() is similar to what you want to achieve. So either new variant of that function or modification. Btw, are wildcards and filters allowed to be used together? For the scanning, I don't think we need to be super optimized in here initially. Iiuc then currently for wildcards it also just scans everything

scan, err := cc.needsScan(root, "")
if err != nil {
return nil, err
}
if scan {
if err := cc.scanPath(ctx, m, ""); err != nil {
return nil, err
}
}

@aaronlehmann
Copy link
Collaborator Author

Btw, are wildcards and filters allowed to be used together?

Yes, they can be used together. I think modifying wildcards and using it when there are include or exclude filters might be a good option. Basically I think we'd replace the path.Match check with a function that checks path.Match if wildcards are allowed, and also checks include/exclude filters. For simplicity, we'd probably skip the

		if cr.Type == CacheRecordTypeDir {
			iter = root.Seek(append(k, 0, 0xff))
		}

if there are any exclude filters.

For the scanning, I don't think we need to be super optimized in here initially. Iiuc then currently for wildcards it also just scans everything

What are the conditions where needsScan returns true?

Are you saying it's reasonable to always scan the FS when there are include or exclude patterns? That would potentially be simpler, because it could reuse fsutils.Walker instead of having yet another implementation of include/exclude logic.

@aaronlehmann aaronlehmann force-pushed the copy-include-exclude branch from f293670 to 5ea2a92 Compare April 22, 2021 23:42
@aaronlehmann
Copy link
Collaborator Author

@tonistiigi: I've turned wildcards into something more generic that handles the include/exclude patterns as well. PTAL.

@tonistiigi
Copy link
Member

What are the conditions where needsScan returns true?

When files are first checked from a snapshot that was not created by transferring local source. After that, specific path is scanned but other paths around it on same snapshot may be not.

Are you saying it's reasonable to always scan the FS when there are include or exclude patterns?

Yes

CI does not seem to be running because of vendoring issues.

Did you cover the case where parent paths of the included patters need to contain the checksum for the metadata of the dir itself but not its contents? Is pattern digest the checksum of the top path that matched pattern or all files (eg subdirs inside it) matched individually and merged later with some logic?

@aaronlehmann
Copy link
Collaborator Author

Did you cover the case where parent paths of the included patters need to contain the checksum for the metadata of the dir itself but not its contents?

Are you talking about the situation where we copy / with IncludePatterns = /foo/bar, and the metadata on /foo changes, so we need to invalidate the cache because we may need to copy the directory entry for /foo and want to preserve the metadata in its current state?

If so, I don't think I handled this. It should be an easy fix to include that parent directory metadata in the cache key.

Is pattern digest the checksum of the top path that matched pattern or all files (eg subdirs inside it) matched individually and merged later with some logic?

I think it's closer to the latter. Take a look at the logic in includedPaths. It iterates over the files/dirs in the prefix tree, and collects digests for files and dirs which match the include pattern. If a dir matches the include pattern and there are no exclusions, it uses the dir's content checksum directly and skips files within it. Checksum combines the digests returned by includedPaths to form the final checksum.

@aaronlehmann aaronlehmann force-pushed the copy-include-exclude branch from 5ea2a92 to cf80609 Compare April 26, 2021 23:49
@aaronlehmann
Copy link
Collaborator Author

CI does not seem to be running because of vendoring issues.

Looks like a missing git add on the vendor dir - fixed.

@aaronlehmann aaronlehmann force-pushed the copy-include-exclude branch 2 times, most recently from 8947fc4 to bb9c8eb Compare April 26, 2021 23:57
@aaronlehmann
Copy link
Collaborator Author

Did you cover the case where parent paths of the included patters need to contain the checksum for the metadata of the dir itself but not its contents?

Are you talking about the situation where we copy / with IncludePatterns = /foo/bar, and the metadata on /foo changes, so we need to invalidate the cache because we may need to copy the directory entry for /foo and want to preserve the metadata in its current state?

If so, I don't think I handled this. It should be an easy fix to include that parent directory metadata in the cache key.

Added this (see latest commit)


for _, name := range []string{"myfile", "sub/bar"} {
_, err = os.Stat(filepath.Join(destDir, name))
require.Equal(t, true, errors.Is(err, os.ErrNotExist))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronlehmann aaronlehmann marked this pull request as ready for review May 3, 2021 15:47
@aaronlehmann
Copy link
Collaborator Author

@tonistiigi: No rush, but this is ready for another pass.

@thaJeztah
Copy link
Member

Could this be used to implement moby/moby#37333 / moby/moby#15771 ?

@aaronlehmann
Copy link
Collaborator Author

Could this be used to implement moby/moby#37333 / moby/moby#15771 ?

Yes, I believe it could.

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that with this I can do copy with excluding all go files with **/*.go exclude pattern but there is no way I can do copy only all go files?


txn := cc.tree.Txn()
root = txn.Root()
var updated bool

lastIncludedDir := ""
iter := root.Seek([]byte{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!wildcard case should seek to p?

}
}
if !matched {
return false, partial, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false, false, nil

dirSlash := fn + "/"
for _, pat := range excludePatternMatcher.Patterns() {
patStr := pat.String() + "/"
if strings.HasPrefix(patStr, dirSlash) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this pattern regexp. How can prefix be checked. I think at least missing Exclusion check but can't find docs for that as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As curently implemented this is best-effort, and if we have a pattern here rather than a prefix we might end up including some paths in the checksum that would be excluded by the filters. This would never cause false cache hits but might lead to false negatives.

The alternative would be to never use a dir's content hash when there is at least one exclude pattern, and instead walk all included dirs and only incorporate non-excluded files in the final checksum. Basically this code would be replaced with:

if cr.Type == CacheRecordTypeDir {
	lastIncludedDir = fn

	if excludePatternMatcher != nil {
		// don't use the optimization where we add a whole dir's contents
		// to the checksum when there is an exclude pattern, because some
		// files in the dir may be excluded
		continue treeWalk
	}
}

for {
k, _, ok := iter.Next()
if !ok {
break
}
fn := string(convertKeyToPath(k))
dirHeader := false
if len(k) > 0 && k[len(k)-1] == byte(0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand where the non-null-byte dir is excluded from checksum for the partial case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The continue a few lines below does this. It causes us to skip the call to checksum when partialMatch is true, but the path does not end in a null byte.

cc, err := newCacheContext(ref.Metadata(), nil)
require.NoError(t, err)

dgst, err := cc.Checksum(context.TODO(), ref, "foo", ChecksumOpts{IncludePatterns: []string{"foo"}}, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a valid case? I thought only dirs can have subpatterns.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think the behavior should be in this case? Should we exclude /foo because the specified path is not a dir? Return an error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as copying a directory where IncludePatterns are set but none match.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you think it should copy no files in this scenario?

I don't think that matches the behavior implemented in fsutil. (*copier).copy only checks IncludePatterns on recursive calls, so calling Copy with /foo would copy /foo regardless of IncludePatterns. We would need a PR to fsutils to keep the behavior in sync.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So dir with no matching subdirs will make an empty dir? Thanks seems ok.

Not sure about the file case. What do you think is the best? It is always wrong to use it like this and a sign that the user did something wrong. Do you see cases where it might seem wrong for the user if they get an error for this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to me to make it an error to try and do a Copy op with IncludePatterns or ExcludePatterns when the source is a file. If we do that, we don't need any special handling at the checksum level, since it wouldn't be a valid scenario.

I could add a Stat in docopy if you think that makes sense. Or we could add this check at the fsutil level.

require.Equal(t, dgstFileData0, dgst)

dgstFoo, err := cc.Checksum(context.TODO(), ref, "", ChecksumOpts{IncludePatterns: []string{"foo"}}, nil)
require.NoError(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no dgstFileData0 check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will match dgstFileData0 because the record for / is included in the checksum (in case other files are added in the future which match the pattern).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If source ends with / (same for /) but then the dir checksum should not apply, correct? As the dir itself is not copied.


dgstD0Star2, err := cc.Checksum(context.TODO(), ref, "", ChecksumOpts{IncludePatterns: []string{"d0/*"}}, nil)
require.NoError(t, err)
require.NotEqual(t, dgstD0Star, dgstD0Star2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do/abc should be equal to dgstD0

not sure if this covers the non-nil-byte case described above or a case with intermediate dir is needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we have IncludePatterns = []string{"d0"}, we include the content hash of the dir d0 in the final checksum.

When we have IncludePatterns = []string{"d0/*"}, we only include the metadata has for dir d0 in the final checksum (plus the files or dirs inside d0 which match the pattern).

So I believe these currently end up different, but we could make them match by removing the optimization that includes the "non-null byte" digest in the final checksum when a dir matches an include pattern exactly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d0 vs d0/* differences should be ok. But we should check the case where path and includepatterns are the same, but one adds an extra file under path that doesn't match includepatters. That should still always give the same checksum.

fstest.CreateFile("sub/foo", []byte("foo0"), 0600),
fstest.CreateFile("sub/bar", []byte("bar0"), 0600),
)
require.NoError(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should add some checks for cache stability. Showing that adding files not matching copy does not invalidate the cache.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but how can this client test check whether the cache was invalidated or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done with Exec that adds a file with random contents and then check that after a repeated run the extra file's contents either changes or does not change.

CapFileRmWildcard apicaps.CapID = "file.rm.wildcard"
CapFileBase apicaps.CapID = "file.base"
CapFileRmWildcard apicaps.CapID = "file.rm.wildcard"
CapFileCopyIncludePatterns apicaps.CapID = "file.copy.includepatterns"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very important but could be one cap as added together.

@tonistiigi tonistiigi requested a review from hinshun May 13, 2021 02:04
@aaronlehmann
Copy link
Collaborator Author

Is it true that with this I can do copy with excluding all go files with **/*.go exclude pattern but there is no way I can do copy only all go files?

I don't think it supports doublestar - for that we need tonistiigi/fsutil#79

@tonistiigi
Copy link
Member

I don't think it supports doublestar - for that we need

ExcludePatterns does support ** though afaics https://github.com/moby/moby/blob/master/pkg/fileutils/fileutils.go#L170 . Hence the weird inconsitency.

@aaronlehmann
Copy link
Collaborator Author

I don't think it supports doublestar - for that we need

ExcludePatterns does support ** though afaics https://github.com/moby/moby/blob/master/pkg/fileutils/fileutils.go#L170 . Hence the weird inconsitency.

I think you're right then. Long term, we should add ** support for include patterns as well, which would resolve the inconsistency.

@aaronlehmann
Copy link
Collaborator Author

@tonistiigi: Do you know why the vendor check is failing? Also, is there anything you think still needs to be addressed here?

require.NoError(t, err)
require.Equal(t, digest.FromBytes(append([]byte("d0"), []byte(dgstDirD0)...)), dgst)
require.Equal(t, dgstDirD0FileByFile, dgst)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checksum changed because Checksum used to return the content hash for the dir that matched the pattern, but we now go file-by-file in case any files are matched by exclude patterns (or exclusions from an include pattern).

We could reintroduce the old behavior, conditional on not having any include/exclude patterns, if you think that makes sense.

@aaronlehmann aaronlehmann force-pushed the copy-include-exclude branch 2 times, most recently from 5ccb4c7 to 1d4a783 Compare May 25, 2021 15:02
Allow include and exclude patterns to be specified for the "copy" op,
similarly to "local".

Depends on tonistiigi/fsutil#101

Signed-off-by: Aaron Lehmann <[email protected]>
Consider IncludePatterns and ExcludePattern when calculating content
hashes.

Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>
@aaronlehmann aaronlehmann force-pushed the copy-include-exclude branch from 1d4a783 to 6f5ea71 Compare May 26, 2021 20:48
@aaronlehmann
Copy link
Collaborator Author

@tonistiigi @hinshun: Any feedback on the latest version of this PR?

Copy link
Collaborator

@hinshun hinshun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see logic converging for IncludePaths and ExcludePaths. This PR looks good to me. cc @tonistiigi

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up, would be nice if we could have a page of docs that describes the algorithm for checksums based on file content, with examples for the different cases (file, dir, wildcard, includepath).

@tonistiigi tonistiigi merged commit 8df5671 into moby:master Jun 1, 2021
@jasondamour
Copy link

@aaronlehmann I just found this MR, it is the exact functionality I needed. Thank you!

Is this available in a stable version of docker? I'm not too familiar with how moby and docker-ce relate and release.

@aaronlehmann aaronlehmann deleted the copy-include-exclude branch September 24, 2021 00:01
@aaronlehmann
Copy link
Collaborator Author

What I added here is low-level buildkit functionality, that I'm using in a buildkit-based project. I don't think it's available through Docker, unfortunately.

@SInnget
Copy link

SInnget commented Sep 27, 2021

@aaronlehmann Just found this MR too. Thank you!

I check the release note that claimed this pr is included, also the 1.3-labs came after the 0.9.0. I am curious whether whether this functionality already available in current buildkit, say, 1.3-labs?

Since I failed to find document on how to copy with exclude in dockerfile. Thank you in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants