Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1437 Add .gitignore and .dockerignore behavior to packaging #899

Closed
wants to merge 24 commits into from

Conversation

bimtauer
Copy link
Contributor

@bimtauer bimtauer commented Mar 24, 2022

TL;DR

Implements flyteorg/flyte#1437 by refactoring the tarfile filter logic to use a group of filter classes. Previous filtering logic was preserved in StandardIgnore.

Decided to not go with a separate .flyteignore. Covering .gitignore and .dockerignore if present should work well with a users intentions and avoids having to manage yet another file. If its in .gitignore, you probably dont want it in source control (from where you probably build at least your production images) and if its in .dockerignore you probably dont want it in the task-running container anyways. So following these two should be good :)
And well if you like it messy, then as before, the in-built filtering saves you from packaging cache and pyc stuff.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

The GitIgnore checks for the presence of git cli and then uses that to check whether a path is ignored. Given the complexity of gitignores with the possibility of nesting and various wildcards, this was the safest way to reliable reproduce actual gitignore behavior.
The DockerIgnore uses the PatternMatcher from docker-py to match file paths against patterns read from a .dockerignore if present. Docker only takes into accout .dockerignore files at the root of the build context, not in subdirectories, so we only check for that. The behavior should thus also be identical to what docker build would include.

The archiving code was refactored into create_archive for less repetition and separate unit testing.
The previous filter function and test was removed since the new feature covers that functionality.

Tracking Issue

flyteorg/flyte#1437

Follow-up issue

NA

@wild-endeavor
Copy link
Contributor

Thank you for the PR! I will review this weekend.

@wild-endeavor
Copy link
Contributor

Are all the subprocess calls okay? Some of these repos might get pretty large - subprocess calling git on every single file, is that UX okay?

@kumare3
Copy link
Contributor

kumare3 commented Apr 3, 2022

Are all the subprocess calls okay? Some of these repos might get pretty large - subprocess calling git on every single file, is that UX okay?

@bimtauer I like this code. But @wild-endeavor is right, I see you have implemented pattern matcher, why not use the same for git?

@kumare3
Copy link
Contributor

kumare3 commented Apr 3, 2022

Are all the subprocess calls okay? Some of these repos might get pretty large - subprocess calling git on every single file, is that UX okay?

@bimtauer I like this code. But @wild-endeavor is right, I see you have implemented pattern matcher, why not use the same for git?

Also the patterns are listed here. Only weird one is !

@@ -125,8 +126,7 @@ def package(ctx, image_config, source, output, force, fast, in_container_source_
archive_fname = os.path.join(output_tmpdir, f"{digest}.tar.gz")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not your code, but i think compute_digest is wrong!. It is computing on the entire source and not the filtered objects only

@bimtauer
Copy link
Contributor Author

bimtauer commented Apr 4, 2022

Are all the subprocess calls okay? Some of these repos might get pretty large - subprocess calling git on every single file, is that UX okay?

@bimtauer I like this code. But @wild-endeavor is right, I see you have implemented pattern matcher, why not use the same for git?

Cannot use that for git as gitignore pattern matching is slightly different from dockerignore (some examples here) and also git allows for multiple nested gitignores.
But I followed @wild-endeavor's advice and switched from many subprocess calls to a single one with git ls-files -io. That accumulates all ignored files once. Only small issue: this does not list entirely ignored directories, only files within them. As a result those directories turn up in the tarball but empty. Should be an okay compromise?

I'll look at the digest asap, good catch!

@codecov
Copy link

codecov bot commented Apr 4, 2022

Codecov Report

Merging #899 (5aa1a15) into master (3cfbb71) will increase coverage by 0.14%.
The diff coverage is 95.43%.

@@            Coverage Diff             @@
##           master     #899      +/-   ##
==========================================
+ Coverage   86.43%   86.57%   +0.14%     
==========================================
  Files         243      245       +2     
  Lines       23243    23483     +240     
  Branches     2618     2645      +27     
==========================================
+ Hits        20089    20330     +241     
+ Misses       2713     2699      -14     
- Partials      441      454      +13     
Impacted Files Coverage Δ
flytekit/tools/repo.py 46.03% <0.00%> (+1.25%) ⬆️
flytekit/clis/sdk_in_container/serialize.py 56.52% <33.33%> (+1.85%) ⬆️
flytekit/tools/ignore.py 90.69% <90.69%> (ø)
flytekit/tools/fast_registration.py 84.90% <92.68%> (+41.26%) ⬆️
...ests/flytekit/unit/tools/test_fast_registration.py 100.00% <100.00%> (ø)
tests/flytekit/unit/tools/test_ignore.py 100.00% <100.00%> (ø)
tests/flytekit/unit/extras/tasks/test_shell.py 82.17% <0.00%> (-17.83%) ⬇️
flytekit/extras/persistence/http.py 81.57% <0.00%> (-2.64%) ⬇️
flytekit/extras/tasks/shell.py 84.37% <0.00%> (-1.57%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3cfbb71...5aa1a15. Read the comment docs.

@bimtauer
Copy link
Contributor Author

bimtauer commented Apr 4, 2022

Developed update to include only not ignored files in checksum. But unfortunately the exclude_files of checksumdironly checks for file basenames, not for relative paths 🤦🏼 I'm hesitant to strip relative path from the paths to check, since foo/bar.pymight be ignored but bar.py might not be. I'm tempted to write my own quick checksum func and eliminate this dependency. Wdyt?

Also having a look at the failing windows tests. I dont have access to a windows machine 🤢 but I'll try my best 🐧

@bimtauer bimtauer force-pushed the fast-reg-gitignore branch 2 times, most recently from 3a37670 to 493129b Compare April 5, 2022 16:55
@bimtauer
Copy link
Contributor Author

bimtauer commented Apr 5, 2022

@wild-endeavor @kumare3 Bigger refactoring:

  • eliminated subprocesses (only one left) but keep ignoring empty dirs 💪🏼
  • performance improvement through dict key lookup in gitignore - its still marginally slower than old filter but overall usually faster because fewer files to package
  • Integrated with latest master changes
  • Began fixing requirements (pinning websocket-client) but grpcio io yank is still making waves through dep tree. Cannot lock everything, currently builds are failing.

PS might need some help on windows debugging.

@bimtauer bimtauer force-pushed the fast-reg-gitignore branch 2 times, most recently from e2721b4 to ecc5633 Compare April 5, 2022 19:08
self.has_git = which("git") is not None
self.ignored = self._list_ignored()

def _list_ignored(self) -> Dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great

kumare3
kumare3 previously approved these changes Apr 6, 2022
@kumare3
Copy link
Contributor

kumare3 commented Apr 6, 2022

@bimtauer some tests are failing and merge conflicts. But this otherwise looks awesome 🙏🏽

bimtauer added 8 commits April 6, 2022 15:41
Signed-off-by: Tim Bauer <[email protected]>
Signed-off-by: Tim Bauer <[email protected]>
Signed-off-by: Tim Bauer <[email protected]>
Signed-off-by: Tim Bauer <[email protected]>
Signed-off-by: Tim Bauer <[email protected]>
@bimtauer
Copy link
Contributor Author

bimtauer commented Apr 7, 2022

@kumare3 I need help debugging the windows tests. I don't have access to a windows dev environment and cannot figure out why the tests fail there. Otherwise this should be ready.

@eapolinario eapolinario mentioned this pull request Apr 21, 2022
8 tasks
@eapolinario
Copy link
Collaborator

@bimtauer , #967 was supposed to be a stacked PR on top of this PR, but I botched it. In any case, feel free to cherry-pick the specific commits that fix the unit failures on windows. I also took the opportunity to resolve the conflict and merge master.

@eapolinario
Copy link
Collaborator

This change ended up being merged in #967.

@bimtauer , thanks for your contribution!

@bimtauer
Copy link
Contributor Author

@eapolinario Thank you for pushing it over the finish line!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants