-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Airflow .airflowignore not handling soft link properly. #23532
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
The background is that i use kubernetes/git-sync to sync git into the dag folder which creates the soft link. |
Here is a sample test case of the report. I guess this is more with the fact Line 68 in 6cc41ab
Test case passes before changes in #22051 . cc : @ianbuss # tests/plugins/test_plugin_ignore.py
def test_symlink_not_ignored(self):
shutil.rmtree(self.test_dir)
self.test_dir = tempfile.mkdtemp(prefix="onotole")
source = os.path.join(self.test_dir, "folder")
target = os.path.join(self.test_dir, "symlink")
py_file = os.path.join(source, "hello_world.py")
ignore_file = os.path.join(self.test_dir, ".airflowignore")
os.mkdir(source)
os.symlink(source, target)
with open(ignore_file, 'w') as f:
f.write("folder")
with open(py_file, 'w') as f:
f.write("print('hello world')")
ignore_list_file = ".airflowignore"
found = []
for path in find_path_from_directory(self.test_dir, ignore_list_file):
found.append(path)
assert os.path.join(self.test_dir, "symlink", "hello_world.py") in found |
Correct. |
Yep, you're right @tirkarthi this looks like something we need to handle better and expand the unit tests to cover expected cases. @NickYadance can you give a sample directory outline (with file, dirs and symlinks) when git-sync is in the picture? |
Hopefully this can be relatively easily fixed with a change to |
@ianbuss sure , check this: Can't find a simple way to work around this. Maybe try to delete soft link manually after git-sync is done. |
OK, and you have a I guess there's a wider question here about whether Airflow should be following symlinks at all as it walks the directory tree. I think in some circumstances it makes sense to do that, where it refers to targets out of the dag directory tree, like for example:
Longer term, I'm less sure of the utility of following "local" symlinks to targets within the same directory structure as it will result in the same files being found more than once and you have to craft the ignore files to ignore either the symlinked path or the real path. We could update the logic to ignore symlinks that are "in tree". Curious what you think @ashb , @kaxil , @potiuk ? But for now we should try to restore the behaviour to pre-2.3.0 as it is a regression. |
Agree soft-links in tree are bad idea that will lead to duplication which is never good in this case. But soft-links are also dangerous for another reason - bacause of potential recursion problem which are difficult to diagnose (unless we protect against them). But also they are useful to "cut the cake" differently so to speak (i.e. have different hierarchy in DAGs and in your original repo where you keep them. I'd say we should restore the symlink behaviour, allow the symlinks but protect against in-tree and recursion. |
Have prepped a simple initial PR which should hopefully restore the original behaviour (and includes the test case provided by @tirkarthi - thanks!) but would be good to get some additional eyes on it. If we want to make larger changes to the symlink handling that should perhaps be a future PR with further thought? Depends on the timeline of 2.3.1 I think. |
Apache Airflow version
2.3.0 (latest released)
What happened
Soft link and folder under same root folder will be handled as the same relative path. Say i have dags folder which looks like this:
and .airflowignore:
both folder and soft-links-to-folder will be ignored.
What you think should happen instead
Only the folder should be ignored. This is the expected behavior in airflow 2.2.4, before i upgraded.
The root cause is that both _RegexpIgnoreRule and _GlobIgnoreRule is callingrelative_to
method to get search path.How to reproduce
check @tirkarthi comment for the test case.
Operating System
ubuntu
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: