Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib.fileset should have a way to filter directories #271307

Open
infinisil opened this issue Nov 30, 2023 · 16 comments
Open

lib.fileset should have a way to filter directories #271307

infinisil opened this issue Nov 30, 2023 · 16 comments
Labels
5. scope: tracked Issue (or PR) is linked back to a `5. scope: tracking` issue 6.topic: lib The Nixpkgs function library

Comments

@infinisil
Copy link
Member

infinisil commented Nov 30, 2023

Problem

Sometimes one needs to filter out directories with a specific name throughout the source directory.

For example, you might have a structure like

- foo
  - target
- bar
  - baz
    - target
  - qux
    - target

And you want to remove all of the target directories.

Currently, the lib.fileset library does not have any function to make this work. To do that currently you need to work around it with either:

  • lib.fileset.fromSource (lib.sources.cleanSourceWith {
      src = ./.
      filter = pathString: type:
        # untested..
        type != "directory" || baseNameOf pathString != "target";
    }
  • Some recursive readDir and lib.fileset.unions I don't want to write out

Requirements

  1. It must not recurse into filtered-out directories for performance reasons
  2. It must be intuitive to use

Potential solutions

Directory name predicate

directoryFilter (dir: dir.name == "target") ./.

This creates a file set containing all files within directories named "target". E.g. ./foo/target/some/file.txt would be included because one of foo, target and some satisfies the predicate. Does it satisfy the requirements?

  1. ✔️ It doesn't need to recurse into the filtered out directories, but the selection needs to be inverted to actually filter out the ones with target:

    difference ./. (directoryFilter (dir:
      dir.name == "target"
    ) ./.)
  2. ❌ This is not intuitive, because one might expect an inverted predicate to return the complement set:

    directoryFilter (dir:
      dir.name != "target"
    ) ./.

    But this doesn't work: This would include ./foo/target/some/file.txt, since foo satisfies the predicate.

Give the fileFilter predicate access to the file's directories

Like already described in #269504. This would allow filtering certain directories:

fileFilter (file:
  # Only include files whose directory components don't have `target`
  ! elem "target" file.dirComponents
) ./.

This way ./foo/target/some/file.nix would not be included, because target is in ./foo/target/some. Does it satisfy the requirements?

  1. ❌ No, this does have to recurse for every file, because the predicate needs files. So this would be very slow.
  2. ✔️ Yes this is intuitive. One can also invert the predicate and get the expected result.

Does a solution satisfying both requirements exist?

I don't know yet

This issue is sponsored by Antithesis

@JRMurr
Copy link
Contributor

JRMurr commented Dec 1, 2023

I would love to have a way to do this!

I'm slightly partial to the directoryFilter (dir: dir.name == "target") ./., I do see how it can be confusing though.

Maybe we could change the name slightly to be something like directoryKeepIf (dir: dir.name == "target") ./. to better indicate what its doing?

@infinisil
Copy link
Member Author

That might work.

Here's another alternative though:

lib.fileset.withinDirectory "target" ./.

This would returns all files within subdirectories named "target". Does it satisfy the requirements?

  1. ✔️ Yes, it can stop the recursion early
  2. ✔️ Yes, this function should be fairly intuitive

The problem with this one is that it's not as powerful as a predicate anymore. If you wanted to only get files within directories like foo..., that wouldn't work.

But, I think that might be fine, all use cases I know of only need a constant directory name (like target or tests). If a more powerful function is needed later, we can still introduce that.

Would this work for you @JRMurr?

@roberth
Copy link
Member

roberth commented Dec 4, 2023

I don't like to bunch up many operations into a single call, because composition tends to be more flexible and self-explanatory. Nonetheless, maybe we have a case here where combined functionality is not to the detriment of flexibility, and does not result in ambiguity? (Note: check again)
What about

fileset.filterNodes {
  directory = { name, hasExt, ... }: ... /*bool*/;
  file = { name, hasExt, ... }: ... /*bool*/;
}

Still not getting good vibes tbh.

2. an inverted predicate

filter's assignment of meaning to true and false has always struck me as arbitrary, and the only way I can remember it is the design rule of having the least negations. Matching false to inclusion into the result would be "wrong" according to such a rule, even if it is "positively" excluded. Inclusion is what matters. Did I confuse you yet? Anyway...

Perhaps by joining these filter functions, the still somewhat arbitrary meaning appears more consistent, and is easier to get right?

@roberth
Copy link
Member

roberth commented Dec 4, 2023

lib.fileset.withinDirectory "target" ./.

This requires quite a long stretch of creativity to match it to the original problem. I would not be opposed to a helper function that takes care of this elaborate thinking, ie something like removeDirectories = dirs: fileset: difference fileset (withinDirectories dirs). Its documentation can be short and sweet because of the simple definition, and the documentation of the primitive operation withinDirectories.

withinDirectories

subdirectoriesWithName?


Hmm, I just noticed that I got the plural dirs "wrong" earlier. Maybe withinDirectories should accept a list, just to avoid confusion? We then have two plurals: a plural in the directory argument, and a plural in the possibility to match multiple directories in the tree.

@infinisil
Copy link
Member Author

infinisil commented Dec 5, 2023

I don't like to bunch up many operations into a single call, because composition tends to be more flexible and self-explanatory. Nonetheless, maybe we have a case here where combined functionality is not to the detriment of flexibility, and does not result in ambiguity? (Note: check again) What about

fileset.filterNodes {
  directory = { name, hasExt, ... }: ... /*bool*/;
  file = { name, hasExt, ... }: ... /*bool*/;
}

Still not getting good vibes tbh.

I've also thought about this. Here's how I think it could look:

fileset.directoryFilter {
  # Whether to recurse into a directory
  recurseInto = { name, components, subpath, ... }: ... /*bool*/;
  # Whether to include a file
  includeFile = { name, dirComponents, dirSubpath, components, subpath, hasExt, type, ... }: ... /*bool*/;
} ./.

Then we could have

withoutDirectories = dirs: directoryFilter {
  directory = path;
  recurseInto = { name, ... }: ! elem name dirs;
  includeFile = _: true;
})

withoutDirectories [ "target" ] ./.

Furthermore we can easily implement fileFilter using directoryFilter:

fileFilter = pred: directoryFilter {
  recurseInto = { ... }: true;
  includeFile = pred;
}

And this "fixes" #269517 because it can be used to implement a replacement for lib.cleanSource.

It's actually not even that bad, because three major problems with builtins.path are addressed with this function:

  • We don't pass absolute paths in strings to the function, so there's no way for the filter functions to be impure.
  • Because file sets can't represent empty directories, users don't need to worry about recursing into directories, because empty ones are filtered out anyways in the end.
  • The logic regarding recursion into directories and filtering of files is completely separate and obvious.

I'm liking this!

@samueldr samueldr added the 5. scope: tracked Issue (or PR) is linked back to a `5. scope: tracking` issue label Apr 23, 2024
@andrewhamon
Copy link
Contributor

andrewhamon commented Apr 23, 2024

Hi @infinisil, I've been taking a look at this proposal based on your comments in #306371

This basically seems like an even more powerful fileFilter, but without as much misuse potential as my proposed change (since it will be obvious to the user that they are recursing through everything if they used recurseInto = { ... }: true;). This does indeed strike me as a better solution :)

@infinisil
Copy link
Member Author

@andrewhamon Nice! Considering that there's already 3 👍's, I think we should definitely go for it. Would you be up to PRing that? Would be very appreciated, I'll review it :D

@andrewhamon
Copy link
Contributor

andrewhamon commented Apr 24, 2024

Yea, I'd be happy to!

I could use a little clarification on the intention behind the predicate args you listed. Also... do we need all of those? Or could we scope it down to a minimal set?

I think a minimal set would be:

recurseInto = { name, components, ... }: ... /*bool*/;
includeFile = { name, components, type,  ... }: ... /*bool*/;

I'm assuming that components is a list of all the path segments, and all those other predicate args could be implemented in terms of components if needed/desired.

Also, I do wonder if "components" is the best name. It makes sense to me, but maybe "pathComponents", "pathSegments" or "pathParts" could be more clear and less generic.

@Qyriad Qyriad added the 6.topic: lib The Nixpkgs function library label Apr 24, 2024
@infinisil
Copy link
Member Author

Thinking about this a bit more, I think it should be just like this for now:

  recurseInto = { components, ... }: <...>;
  includeFile = { name, hasExt, type, ... }: <...>;

We shouldn't pass a components attribute to includeFile for the same reasons as explored in this issue and #269504. The recurseInto's components should allow the same use cases without the problems.

I think it should be components for consistency, because there's already lib.path.subpath.components in the lib.path library.

@andrewhamon
Copy link
Contributor

andrewhamon commented Apr 24, 2024

If includeFile doesn't receive components (or some other way to get the parent dir name of a file) then this is no more useful to me than the existing fileFilter.

To reiterate, my use case is "select any *.rs file iff it is inside a folder named bin".

I thought the main reservation is the performance footgun for use cases that could be much better solved in a different way - with the proposed directoryFilter it will still be possible to do something stupid, but the performance characteristics would be blindingly obvious with this API in a way that it isn't with the fileFilter

@andrewhamon
Copy link
Contributor

andrewhamon commented Apr 24, 2024

i.e. this is extremely stupid, but its performance should not be surprising to anyone:

# include all files in some/dir
fileset.directoryFilter {
  recurseInto = { ... }: true;
  includeFile = { components, ... }: components == ["some" "dir"];
} ./.

I am sure that no matter the API, people can invent a stupid way to use it. Please don't let that hold back knowledgeable users.

@infinisil
Copy link
Member Author

To reiterate, my use case is "select any *.rs file iff it is inside a folder named bin".

Ah sorry, I didn't fully think through your use case! It turns out it's trickier than I anticipated, but I believe it would still be doable with the proposed interface:

rec {
  # All files under a path that are not within a directory of a specific name
  # Doesn't have to recurse into the directories not included
  withoutDirectories = dirs:
    directoryFilter {
      recurseInto = { components, ... }: ! elem (last components) dirs;
      includeFile = { ... }: true;
    };

  # By inverting using `difference`, we get all files _within_ specific directories!
  # Unfortunately this implementation does need to recurse into all such directories
  withinDirectories = dirs: path:
    difference path (withoutDirectories dirs path);
  
  # All Rust files under any `bin`
  rustUnderBin = intersection
    (withinDirectories [ "bin" ] ./.)
    (fileFilter (file: file.hasExt "rs") ./.)
}

However, we can actually have a more efficient implementation of withinDirectories, because file sets can represent entire directories being included without recursing into them. That would suffice for your use case, and would additionally also allow filtering use cases:

# Filter out all `.git` directories, without recursing into any of them!
difference ./. (withinDirectories [ ".git" ] ./.)

So instead of directoryFilter, what do you think of just having a withinDirectories for now? Along with its dual withoutDirectories, we'd have all currently-known use cases covered.

@andrewhamon
Copy link
Contributor

andrewhamon commented Apr 25, 2024

My initial take is:

  • that feels like a slightly less natural way to express something conceptually quite simple. If i showed that to my less-nix-savvy colleagues they would probably roll their eyes at me. But its growing on me a bit the more i think about it.
  • i worry about how such composition would perform (especially since in my cases I have a few more directories than just bin). Each traversal through the monorepo I work in takes multiple seconds. I really want to avoid multiple traversals

However, we can actually have a more efficient implementation of withinDirectories, because file sets can represent entire directories being included without recursing into them.

Just to clarify, you mean some other implementation than the one you demonstrated? That makes sense, though I can't quite picture what that faster impl would look like. I need to mull this over a bit.

@andrewhamon
Copy link
Contributor

andrewhamon commented Apr 25, 2024

I feel like the potential speedup from an optimalwithinDirectories is very use case dependent. In my particular case, the folders I am selecting tend to be leaf-folders with only a few files, so there wouldn't be much savings from skipping the final recurse.

That said, there is a fair bit about lib.fileset that I don't fully grok. I only first learned about it a few days ago :)

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/filtering-source-trees-with-nix-and-nixpkgs/19148/6

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/easy-source-filtering-with-file-sets/29117/23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5. scope: tracked Issue (or PR) is linked back to a `5. scope: tracking` issue 6.topic: lib The Nixpkgs function library
Projects
None yet
Development

No branches or pull requests

7 participants